blob: 9ff266e2092f449461ebef3ea1089678cd48c9bb [file] [log] [blame]
Tony Tyef16a45e2017-06-06 20:31:59 +00001=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
Tony Tye07d9f102017-11-10 01:00:54 +000026 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
Tony Tyef16a45e2017-06-06 20:31:59 +000028
Tony Tye07d9f102017-11-10 01:00:54 +000029 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000035
Tony Tye07d9f102017-11-10 01:00:54 +000036 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000038
Tony Tye07d9f102017-11-10 01:00:54 +000039 ============ ==============================================================
40 Vendor Description
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000045
Tony Tye07d9f102017-11-10 01:00:54 +000046 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
Tony Tyef16a45e2017-06-06 20:31:59 +000048
Tony Tye07d9f102017-11-10 01:00:54 +000049 ============== ============================================================
50 OS Description
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
56 runtime.
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
58 runtime.
59 ============== ============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000060
Tony Tye07d9f102017-11-10 01:00:54 +000061 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
Tony Tyef16a45e2017-06-06 20:31:59 +000063
Tony Tye07d9f102017-11-10 01:00:54 +000064 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
67 *<empty>* Defaults to ``opencl``.
68 ``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
69 ``amdgizcl`` Same as ``opencl`` except a different address space mapping is
70 used (see :ref:`amdgpu-address-spaces`).
71 ``amdgiz`` Same as ``opencl`` except a different address space mapping is
72 used (see :ref:`amdgpu-address-spaces`).
73 ``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
74 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000075
76.. _amdgpu-processors:
77
78Processors
79----------
80
81Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
82names from both the *Processor* and *Alternative Processor* can be used.
83
84 .. table:: AMDGPU Processors
Tony Tye07d9f102017-11-10 01:00:54 +000085 :name: amdgpu-processor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000086
Tony Tye07d9f102017-11-10 01:00:54 +000087 =========== =============== ============ ===== ======= ==================
88 Processor Alternative Target dGPU/ ROCm Example
89 Processor Triple APU Support Products
90 Architecture
91 =========== =============== ============ ===== ======= ==================
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000092 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Tony Tye07d9f102017-11-10 01:00:54 +000093 -------------------------------------------------------------------------
94 ``r600`` ``r600`` dGPU
95 ``r630`` ``r600`` dGPU
96 ``rs880`` ``r600`` dGPU
97 ``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000098 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Tony Tye07d9f102017-11-10 01:00:54 +000099 -------------------------------------------------------------------------
100 ``rv710`` ``r600`` dGPU
101 ``rv730`` ``r600`` dGPU
102 ``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000103 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Tony Tye07d9f102017-11-10 01:00:54 +0000104 -------------------------------------------------------------------------
105 ``cedar`` ``r600`` dGPU
106 ``redwood`` ``r600`` dGPU
107 ``sumo`` ``r600`` dGPU
108 ``juniper`` ``r600`` dGPU
109 ``cypress`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000110 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Tony Tye07d9f102017-11-10 01:00:54 +0000111 -------------------------------------------------------------------------
112 ``barts`` ``r600`` dGPU
113 ``turks`` ``r600`` dGPU
114 ``caicos`` ``r600`` dGPU
115 ``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000116 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Tony Tye07d9f102017-11-10 01:00:54 +0000117 -------------------------------------------------------------------------
118 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
119 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU
120 - ``verde``
121 - ``oland``
122 - ``hainan``
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000123 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Tony Tye07d9f102017-11-10 01:00:54 +0000124 -------------------------------------------------------------------------
125 ``gfx700`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
126 - Radeon HD 8770
127 - R7 260
128 - R7 260X
129 \ - ``kaveri`` ``amdgcn`` APU - A6-7000
130 - A6 Pro-7050B
131 - A8-7100
132 - A8 Pro-7150B
133 - A10-7300
134 - A10 Pro-7350B
135 - FX-7500
136 - A8-7200P
137 - A10-7400P
138 - FX-7600P
139 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
140 - FirePro W9100
141 - FirePro S9150
142 - FirePro S9170
143 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
144 - Radeon R9 290x
145 - Radeon R390
146 - Radeon R390x
147 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
148 - ``mullins`` - E1-2200
149 - E1-2500
150 - E2-3000
151 - E2-3800
152 - A4-5000
153 - A4-5100
154 - A6-5200
155 - A4 Pro-3340B
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000156 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Tony Tye07d9f102017-11-10 01:00:54 +0000157 -------------------------------------------------------------------------
158 ``gfx800`` - ``iceland`` ``amdgcn`` dGPU - FirePro S7150
159 - FirePro S7100
160 - FirePro W7100
161 - Radeon R285
162 - Radeon R9 380
163 - Radeon R9 385
164 - Mobile FirePro
165 M7170
166 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - A6-8500P
167 - Pro A6-8500B
168 - A8-8600P
169 - Pro A8-8600B
170 - FX-8800P
171 - Pro A12-8800B
172 \ ``amdgcn`` APU ROCm - A10-8700P
173 - Pro A10-8700B
174 - A10-8780P
175 \ ``amdgcn`` APU - A10-9600P
176 - A10-9630P
177 - A12-9700P
178 - A12-9730P
179 - FX-9800P
180 - FX-9830P
181 \ ``amdgcn`` APU - E2-9010
182 - A6-9210
183 - A9-9410
184 ``gfx802`` - ``tonga`` ``amdgcn`` dGPU ROCm Same as gfx800
185 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU ROCm - Radeon R9 Nano
186 - Radeon R9 Fury
187 - Radeon R9 FuryX
188 - Radeon Pro Duo
189 - FirePro S9300x2
190 - Radeon Instinct MI8
191 \ - ``polaris10`` ``amdgcn`` dGPU ROCm - Radeon RX 470
192 - Radeon RX 480
193 - Radeon Instinct MI6
194 \ - ``polaris11`` ``amdgcn`` dGPU ROCm - Radeon RX 460
195 ``gfx810`` - ``stoney`` ``amdgcn`` APU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000196 **GCN GFX9** [AMD-GCN-GFX9]_
Tony Tye07d9f102017-11-10 01:00:54 +0000197 -------------------------------------------------------------------------
198 ``gfx900`` ``amdgcn`` dGPU ROCm - Radeon Vega
199 Frontier Edition
200 - Radeon RX Vega 56
201 - Radeon RX Vega 64
202 - Radeon RX Vega 64
203 Liquid
204 - Radeon Instinct MI25
205 ``gfx902`` ``amdgcn`` APU *TBA*
Tony Tyef16a45e2017-06-06 20:31:59 +0000206
Tony Tye07d9f102017-11-10 01:00:54 +0000207 .. TODO
208 Add product
209 names.
210 =========== =============== ============ ===== ======= ==================
211
212.. _amdgpu-target-features:
213
214Target Features
215---------------
216
217Target features control how code is generated to support certain
218features. Not all target features are supported by all processors. The
219runtime must ensure that the features supported by the device used to
220execute the code match the features enabled when generating the
221code. A mismatch of features may result in incorrect execution, or a
222reduction in performance.
223
224Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
225target features.
226
227For example:
228
229``-mxnack``
230 Enable the *XNACK* feature.
231``-mno-xnack``
232 Disable the *XNACK* feature.
233
234 .. table:: AMDGPU Target Features
235 :name: amdgpu-target-feature-table
236
237 ============== ======== ==================================================
238 Target Feature Default Description
239 ============== ======== ==================================================
240 -m[no-]xnack disabled Enable/disable generating code that has
241 memory clauses that are compatible with
242 having XNACK replay enabled.
243
244 This is used for demand paging and page
245 migration. If XNACK replay is enabled in
246 the device, then if a page fault occurs
247 the code may execute incorrectly if the
248 XNACK feature is not enabled. Executing
249 code that has the feature enabled on a
250 device that does not have XNACK replay
251 enabled will execute correctly, but may
252 be less performant than code with the
253 feature disabled.
254
255 This feature is supported by the
256 ``amdgcn`` architecture for GFX7-GFX9.
257 ============== ======== ==================================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000258
259.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000260
261Address Spaces
262--------------
263
Tony Tyef16a45e2017-06-06 20:31:59 +0000264The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000265
Tony Tyef16a45e2017-06-06 20:31:59 +0000266The memory space names used in the table, aside from the region memory space, is
267from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000268
Tony Tyef16a45e2017-06-06 20:31:59 +0000269LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000270
Tony Tyef16a45e2017-06-06 20:31:59 +0000271 .. table:: Address Space Mapping
272 :name: amdgpu-address-space-mapping-table
273
274 ================== ================= ================= ================= =================
275 LLVM Address Space Memory Space
276 ------------------ -----------------------------------------------------------------------
277 \ Current Default amdgiz/amdgizcl hcc Future Default
278 ================== ================= ================= ================= =================
279 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
280 1 Global Global Global Global
281 2 Constant Constant Constant Region (GDS)
282 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
283 4 Generic (Flat) Region (GDS) Region (GDS) Constant
284 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
285 ================== ================= ================= ================= =================
286
287Current Default
288 This is the current default address space mapping used for all languages
289 except hcc. This will shortly be deprecated.
290
291amdgiz/amdgizcl
292 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
293 is specified as the target triple environment value.
294
295hcc
296 This is the current address space mapping used when ``hcc`` is specified as
297 the target triple environment value.This will shortly be deprecated.
298
299Future Default
300 This will shortly be the only address space mapping for all languages using
301 AMDGPU backend.
302
303.. _amdgpu-memory-scopes:
304
305Memory Scopes
306-------------
307
308This section provides LLVM memory synchronization scopes supported by the AMDGPU
309backend memory model when the target triple OS is ``amdhsa`` (see
310:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
311
312The memory model supported is based on the HSA memory model [HSA]_ which is
313based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
314relation is transitive over the synchonizes-with relation independent of scope,
315and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye07d9f102017-11-10 01:00:54 +0000316table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000317
318This is different to the OpenCL [OpenCL]_ memory model which does not have scope
319inclusion and requires the memory scopes to exactly match. However, this
320is conservatively correct for OpenCL.
321
Tony Tye07d9f102017-11-10 01:00:54 +0000322 .. table:: AMDHSA LLVM Sync Scopes
323 :name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tyef16a45e2017-06-06 20:31:59 +0000324
325 ================ ==========================================================
326 LLVM Sync Scope Description
327 ================ ==========================================================
328 *none* The default: ``system``.
329
330 Synchronizes with, and participates in modification and
331 seq_cst total orderings with, other operations (except
332 image operations) for all address spaces (except private,
333 or generic that accesses private) provided the other
334 operation's sync scope is:
335
336 - ``system``.
337 - ``agent`` and executed by a thread on the same agent.
338 - ``workgroup`` and executed by a thread in the same
339 workgroup.
340 - ``wavefront`` and executed by a thread in the same
341 wavefront.
342
343 ``agent`` Synchronizes with, and participates in modification and
344 seq_cst total orderings with, other operations (except
345 image operations) for all address spaces (except private,
346 or generic that accesses private) provided the other
347 operation's sync scope is:
348
349 - ``system`` or ``agent`` and executed by a thread on the
350 same agent.
351 - ``workgroup`` and executed by a thread in the same
352 workgroup.
353 - ``wavefront`` and executed by a thread in the same
354 wavefront.
355
356 ``workgroup`` Synchronizes with, and participates in modification and
357 seq_cst total orderings with, other operations (except
358 image operations) for all address spaces (except private,
359 or generic that accesses private) provided the other
360 operation's sync scope is:
361
362 - ``system``, ``agent`` or ``workgroup`` and executed by a
363 thread in the same workgroup.
364 - ``wavefront`` and executed by a thread in the same
365 wavefront.
366
367 ``wavefront`` Synchronizes with, and participates in modification and
368 seq_cst total orderings with, other operations (except
369 image operations) for all address spaces (except private,
370 or generic that accesses private) provided the other
371 operation's sync scope is:
372
373 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
374 and executed by a thread in the same wavefront.
375
376 ``singlethread`` Only synchronizes with, and participates in modification
377 and seq_cst total orderings with, other operations (except
378 image operations) running in the same thread for all
379 address spaces (for example, in signal handlers).
380 ================ ==========================================================
381
382AMDGPU Intrinsics
383-----------------
384
385The AMDGPU backend implements the following intrinsics.
386
387*This section is WIP.*
388
389.. TODO
390 List AMDGPU intrinsics
391
392Code Object
393===========
394
395The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
396can be linked by ``lld`` to produce a standard ELF shared code object which can
397be loaded and executed on an AMDGPU target.
398
399Header
400------
401
402The AMDGPU backend uses the following ELF header:
403
404 .. table:: AMDGPU ELF Header
405 :name: amdgpu-elf-header-table
406
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000407 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000408 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000409 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000410 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
411 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye07d9f102017-11-10 01:00:54 +0000412 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
413 - ``ELFOSABI_AMDGPU_HSA``
414 - ``ELFOSABI_AMDGPU_PAL``
415 - ``ELFOSABI_AMDGPU_MESA3D``
416 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
417 - ``ELFABIVERSION_AMDGPU_PAL``
418 - ``ELFABIVERSION_AMDGPU_MESA3D``
419 ``e_type`` - ``ET_REL``
420 - ``ET_DYN``
Tony Tyef16a45e2017-06-06 20:31:59 +0000421 ``e_machine`` ``EM_AMDGPU``
422 ``e_entry`` 0
Tony Tye07d9f102017-11-10 01:00:54 +0000423 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000424 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000425
426..
427
428 .. table:: AMDGPU ELF Header Enumeration Values
429 :name: amdgpu-elf-header-enumeration-values-table
430
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000431 =============================== =====
432 Name Value
433 =============================== =====
434 ``EM_AMDGPU`` 224
Tony Tye07d9f102017-11-10 01:00:54 +0000435 ``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000436 ``ELFOSABI_AMDGPU_HSA`` 64
437 ``ELFOSABI_AMDGPU_PAL`` 65
438 ``ELFOSABI_AMDGPU_MESA3D`` 66
439 ``ELFABIVERSION_AMDGPU_HSA`` 1
440 ``ELFABIVERSION_AMDGPU_PAL`` 0
441 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
442 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000443
444``e_ident[EI_CLASS]``
Tony Tye07d9f102017-11-10 01:00:54 +0000445 The ELF class is:
446
447 * ``ELFCLASS32`` for ``r600`` architecture.
448
449 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
450 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000451
452``e_ident[EI_DATA]``
Tony Tye07d9f102017-11-10 01:00:54 +0000453 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tyef16a45e2017-06-06 20:31:59 +0000454
455``e_ident[EI_OSABI]``
Tony Tye07d9f102017-11-10 01:00:54 +0000456 One of the following AMD GPU architecture specific OS ABIs
457 (see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000458
Tony Tye07d9f102017-11-10 01:00:54 +0000459 * ``ELFOSABI_NONE`` for *unknown* OS.
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000460
Tony Tye07d9f102017-11-10 01:00:54 +0000461 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tyef16a45e2017-06-06 20:31:59 +0000462
Tony Tye07d9f102017-11-10 01:00:54 +0000463 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
464
465 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000466
Tony Tyef16a45e2017-06-06 20:31:59 +0000467``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000468 The ABI version of the AMD GPU architecture specific OS ABI to which the code
469 object conforms:
470
471 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
472 runtime ABI.
473
474 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
475 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000476
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000477 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye07d9f102017-11-10 01:00:54 +0000478 3D runtime ABI.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000479
Tony Tyef16a45e2017-06-06 20:31:59 +0000480``e_type``
481 Can be one of the following values:
482
483
484 ``ET_REL``
485 The type produced by the AMD GPU backend compiler as it is relocatable code
486 object.
487
488 ``ET_DYN``
489 The type produced by the linker as it is a shared code object.
490
491 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
492
493``e_machine``
Tony Tye07d9f102017-11-10 01:00:54 +0000494 The value ``EM_AMDGPU`` is used for the machine for all processors supported
495 by the ``r600`` and ``amdgcn`` architectures (see
496 :ref:`amdgpu-processor-table`). The specific processor is specified in the
497 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
498 :ref:`amdgpu-elf-header-e_flags-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000499
500``e_entry``
501 The entry point is 0 as the entry points for individual kernels must be
502 selected in order to invoke them through AQL packets.
503
504``e_flags``
Tony Tye07d9f102017-11-10 01:00:54 +0000505 The AMDGPU backend uses the following ELF header flags:
506
507 .. table:: AMDGPU ELF Header ``e_flags``
508 :name: amdgpu-elf-header-e_flags-table
509
510 ================================= ========== =============================
511 Name Value Description
512 ================================= ========== =============================
513 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
514 -------------------------------------------- -----------------------------
515 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
516 mask for
517 ``EF_AMDGPU_MACH_xxx`` values
518 defined in
519 :ref:`amdgpu-ef-amdgpu-mach-table`.
520 ================================= ========== =============================
521
522 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
523 :name: amdgpu-ef-amdgpu-mach-table
524
525 ================================= ========== =============================
526 Name Value Description (see
527 :ref:`amdgpu-processor-table`)
528 ================================= ========== =============================
529 ``EF_AMDGPU_MACH_NONE`` 0 *not specified*
530 ``EF_AMDGPU_MACH_R600_R600`` 1 ``r600``
531 ``EF_AMDGPU_MACH_R600_R630`` 2 ``r630``
532 ``EF_AMDGPU_MACH_R600_RS880`` 3 ``rs880``
533 ``EF_AMDGPU_MACH_R600_RV670`` 4 ``rv670``
534 ``EF_AMDGPU_MACH_R600_RV710`` 5 ``rv710``
535 ``EF_AMDGPU_MACH_R600_RV730`` 6 ``rv730``
536 ``EF_AMDGPU_MACH_R600_RV770`` 7 ``rv770``
537 ``EF_AMDGPU_MACH_R600_CEDAR`` 8 ``cedar``
538 ``EF_AMDGPU_MACH_R600_REDWOOD`` 9 ``redwood``
539 ``EF_AMDGPU_MACH_R600_SUMO`` 10 ``sumo``
540 ``EF_AMDGPU_MACH_R600_JUNIPER`` 11 ``juniper``
541 ``EF_AMDGPU_MACH_R600_CYPRESS`` 12 ``cypress``
542 ``EF_AMDGPU_MACH_R600_BARTS`` 13 ``barts``
543 ``EF_AMDGPU_MACH_R600_TURKS`` 14 ``turks``
544 ``EF_AMDGPU_MACH_R600_CAICOS`` 15 ``caicos``
545 ``EF_AMDGPU_MACH_R600_CAYMAN`` 16 ``cayman``
546 *reserved* 17-31 Reserved for ``r600``
547 architecture processors.
548 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 32 ``gfx600``
549 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 33 ``gfx601``
550 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 34 ``gfx700``
551 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 35 ``gfx701``
552 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 36 ``gfx702``
553 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 37 ``gfx703``
554 ``EF_AMDGPU_MACH_AMDGCN_GFX800`` 38 ``gfx800``
555 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 39 ``gfx801``
556 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 40 ``gfx802``
557 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 41 ``gfx803``
558 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 42 ``gfx810``
559 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 43 ``gfx900``
560 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 44 ``gfx902``
561 ================================= ========== =============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000562
563Sections
564--------
565
566An AMDGPU target ELF code object has the standard ELF sections which include:
567
568 .. table:: AMDGPU ELF Sections
569 :name: amdgpu-elf-sections-table
570
571 ================== ================ =================================
572 Name Type Attributes
573 ================== ================ =================================
574 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
575 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
576 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
577 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
578 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
579 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
580 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
581 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
582 ``.note`` ``SHT_NOTE`` *none*
583 ``.rela``\ *name* ``SHT_RELA`` *none*
584 ``.rela.dyn`` ``SHT_RELA`` *none*
585 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
586 ``.shstrtab`` ``SHT_STRTAB`` *none*
587 ``.strtab`` ``SHT_STRTAB`` *none*
588 ``.symtab`` ``SHT_SYMTAB`` *none*
589 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
590 ================== ================ =================================
591
592These sections have their standard meanings (see [ELF]_) and are only generated
593if needed.
594
595``.debug``\ *\**
596 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
597 DWARF produced by the AMDGPU backend.
598
Tony Tye46d35762017-08-15 20:47:41 +0000599``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000600 The standard sections used by a dynamic loader.
601
602``.note``
603 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
604 backend.
605
606``.rela``\ *name*, ``.rela.dyn``
607 For relocatable code objects, *name* is the name of the section that the
608 relocation records apply. For example, ``.rela.text`` is the section name for
609 relocation records associated with the ``.text`` section.
610
611 For linked shared code objects, ``.rela.dyn`` contains all the relocation
612 records from each of the relocatable code object's ``.rela``\ *name* sections.
613
614 See :ref:`amdgpu-relocation-records` for the relocation records supported by
615 the AMDGPU backend.
616
617``.text``
618 The executable machine code for the kernels and functions they call. Generated
619 as position independent code. See :ref:`amdgpu-code-conventions` for
620 information on conventions used in the isa generation.
621
622.. _amdgpu-note-records:
623
624Note Records
625------------
626
Tony Tye07d9f102017-11-10 01:00:54 +0000627As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
628be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
629aligned. In addition, minimal zero byte padding must be generated to ensure the
630``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
631``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tyef16a45e2017-06-06 20:31:59 +0000632
633The AMDGPU backend code object uses the following ELF note records in the
634``.note`` section. The *Description* column specifies the layout of the note
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000635record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tyef16a45e2017-06-06 20:31:59 +0000636variable size strings have a corresponding ``*_size`` field that specifies the
637number of bytes, including the terminating null character, in the string. The
638string(s) come immediately after the preceding fields.
639
640Additional note records can be present.
641
642 .. table:: AMDGPU ELF Note Records
643 :name: amdgpu-elf-note-records-table
644
Tony Tye46d35762017-08-15 20:47:41 +0000645 ===== ============================== ======================================
646 Name Type Description
647 ===== ============================== ======================================
648 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye46d35762017-08-15 20:47:41 +0000649 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000650
651..
652
653 .. table:: AMDGPU ELF Note Record Enumeration Values
654 :name: amdgpu-elf-note-record-enumeration-values-table
655
Tony Tye46d35762017-08-15 20:47:41 +0000656 ============================== =====
657 Name Value
658 ============================== =====
659 *reserved* 0-9
660 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye07d9f102017-11-10 01:00:54 +0000661 *reserved* 11
Tony Tye46d35762017-08-15 20:47:41 +0000662 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000663
Tony Tye46d35762017-08-15 20:47:41 +0000664``NT_AMD_AMDGPU_HSA_METADATA``
665 Specifies extensible metadata associated with the code objects executed on HSA
666 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
667 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
668 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
669 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000670
Tony Tye46d35762017-08-15 20:47:41 +0000671.. _amdgpu-symbols:
672
673Symbols
674-------
675
676Symbols include the following:
677
678 .. table:: AMDGPU ELF Symbols
679 :name: amdgpu-elf-symbols-table
680
681 ===================== ============== ============= ==================
682 Name Type Section Description
683 ===================== ============== ============= ==================
684 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
685 - ``.rodata``
686 - ``.bss``
687 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
688 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
689 ===================== ============== ============= ==================
690
691Global variable
692 Global variables both used and defined by the compilation unit.
693
694 If the symbol is defined in the compilation unit then it is allocated in the
695 appropriate section according to if it has initialized data or is readonly.
696
697 If the symbol is external then its section is ``STN_UNDEF`` and the loader
698 will resolve relocations using the definition provided by another code object
699 or explicitly defined by the runtime.
700
701 All global symbols, whether defined in the compilation unit or external, are
702 accessed by the machine code indirectly through a GOT table entry. This
703 allows them to be preemptable. The GOT table is only supported when the target
704 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000705
706 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000707 Add description of linked shared object symbols. Seems undefined symbols
708 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000709
Tony Tye46d35762017-08-15 20:47:41 +0000710Kernel descriptor
711 Every HSA kernel has an associated kernel descriptor. It is the address of the
712 kernel descriptor that is used in the AQL dispatch packet used to invoke the
713 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
714 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
715
716Kernel entry point
717 Every HSA kernel also has a symbol for its machine code entry point.
718
719.. _amdgpu-relocation-records:
720
721Relocation Records
722------------------
723
724AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
725relocatable fields are:
726
727``word32``
728 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
729 alignment. These values use the same byte order as other word values in the
730 AMD GPU architecture.
731
732``word64``
733 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
734 alignment. These values use the same byte order as other word values in the
735 AMD GPU architecture.
736
737Following notations are used for specifying relocation calculations:
738
739**A**
740 Represents the addend used to compute the value of the relocatable field.
741
742**G**
743 Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000744 entry's symbol will reside during execution.
Tony Tye46d35762017-08-15 20:47:41 +0000745
746**GOT**
747 Represents the address of the global offset table.
748
749**P**
750 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
751 of the storage unit being relocated (computed using ``r_offset``).
752
753**S**
754 Represents the value of the symbol whose index resides in the relocation
Tony Tyed2884302017-10-16 20:44:29 +0000755 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
756
757**B**
758 Represents the base address of a loaded executable or shared object which is
759 the difference between the ELF address and the actual load address. Relocations
760 using this are only valid in executable or shared objects.
Tony Tye46d35762017-08-15 20:47:41 +0000761
762The following relocation types are supported:
763
764 .. table:: AMDGPU ELF Relocation Records
765 :name: amdgpu-elf-relocation-records-table
766
767 ========================== ===== ========== ==============================
768 Relocation Type Value Field Calculation
769 ========================== ===== ========== ==============================
770 ``R_AMDGPU_NONE`` 0 *none* *none*
771 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
772 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
773 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
774 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
775 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
776 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
777 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
778 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
779 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
780 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
781 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
Tony Tyed2884302017-10-16 20:44:29 +0000782 *reserved* 12
783 ``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
Tony Tye46d35762017-08-15 20:47:41 +0000784 ========================== ===== ========== ==============================
785
786.. _amdgpu-dwarf:
787
788DWARF
789-----
790
791Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
792information that maps the code object executable code and data to the source
793language constructs. It can be used by tools such as debuggers and profilers.
794
795Address Space Mapping
796~~~~~~~~~~~~~~~~~~~~~
797
798The following address space mapping is used:
799
800 .. table:: AMDGPU DWARF Address Space Mapping
801 :name: amdgpu-dwarf-address-space-mapping-table
802
803 =================== =================
804 DWARF Address Space Memory Space
805 =================== =================
806 1 Private (Scratch)
807 2 Local (group/LDS)
808 *omitted* Global
809 *omitted* Constant
810 *omitted* Generic (Flat)
811 *not supported* Region (GDS)
812 =================== =================
813
814See :ref:`amdgpu-address-spaces` for information on the memory space terminology
815used in the table.
816
817An ``address_class`` attribute is generated on pointer type DIEs to specify the
818DWARF address space of the value of the pointer when it is in the *private* or
819*local* address space. Otherwise the attribute is omitted.
820
821An ``XDEREF`` operation is generated in location list expressions for variables
822that are allocated in the *private* and *local* address space. Otherwise no
823``XDREF`` is omitted.
824
825Register Mapping
826~~~~~~~~~~~~~~~~
827
828*This section is WIP.*
829
830.. TODO
831 Define DWARF register enumeration.
832
833 If want to present a wavefront state then should expose vector registers as
834 64 wide (rather than per work-item view that LLVM uses). Either as separate
835 registers, or a 64x4 byte single register. In either case use a new LANE op
836 (akin to XDREF) to select the current lane usage in a location
837 expression. This would also allow scalar register spilling to vector register
838 lanes to be expressed (currently no debug information is being generated for
839 spilling). If choose a wide single register approach then use LANE in
840 conjunction with PIECE operation to select the dword part of the register for
841 the current lane. If the separate register approach then use LANE to select
842 the register.
843
844Source Text
845~~~~~~~~~~~
846
847*This section is WIP.*
848
849.. TODO
850 DWARF extension to include runtime generated source text.
851
852.. _amdgpu-code-conventions:
853
854Code Conventions
855================
856
857This section provides code conventions used for each supported target triple OS
858(see :ref:`amdgpu-target-triples`).
859
860AMDHSA
861------
862
863This section provides code conventions used when the target triple OS is
864``amdhsa`` (see :ref:`amdgpu-target-triples`).
865
866.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000867
868Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000869~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000870
Tony Tye46d35762017-08-15 20:47:41 +0000871The code object metadata specifies extensible metadata associated with the code
872objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
873[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
874(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
875``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
876information necessary to support the ROCM kernel queries. For example, the
877segment sizes needed in a dispatch packet. In addition, a high level language
878runtime may require other information to be included. For example, the AMD
879OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000880
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000881The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000882:doc:`YamlIO`).
883
Tony Tye46d35762017-08-15 20:47:41 +0000884.. TODO
885 Is the string null terminated? It probably should not if YAML allows it to
886 contain null characters, otherwise it should be.
887
Tony Tyef16a45e2017-06-06 20:31:59 +0000888The metadata is represented as a single YAML document comprised of the mapping
889defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
890referenced tables.
891
892For boolean values, the string values of ``false`` and ``true`` are used for
893false and true respectively.
894
895Additional information can be added to the mappings. To avoid conflicts, any
896non-AMD key names should be prefixed by "*vendor-name*.".
897
898 .. table:: AMDHSA Code Object Metadata Mapping
899 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
900
901 ========== ============== ========= =======================================
902 String Key Value Type Required? Description
903 ========== ============== ========= =======================================
904 "Version" sequence of Required - The first integer is the major
905 2 integers version. Currently 1.
906 - The second integer is the minor
907 version. Currently 0.
908 "Printf" sequence of Each string is encoded information
909 strings about a printf function call. The
910 encoded information is organized as
911 fields separated by colon (':'):
912
913 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
914
915 where:
916
917 ``ID``
918 A 32 bit integer as a unique id for
919 each printf function call
920
921 ``N``
922 A 32 bit integer equal to the number
923 of arguments of printf function call
924 minus 1
925
926 ``S[i]`` (where i = 0, 1, ... , N-1)
927 32 bit integers for the size in bytes
928 of the i-th FormatString argument of
929 the printf function call
930
931 FormatString
932 The format string passed to the
933 printf function call.
934 "Kernels" sequence of Required Sequence of the mappings for each
935 mapping kernel in the code object. See
936 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
937 for the definition of the mapping.
938 ========== ============== ========= =======================================
939
940..
941
942 .. table:: AMDHSA Code Object Kernel Metadata Mapping
943 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
944
945 ================= ============== ========= ================================
946 String Key Value Type Required? Description
947 ================= ============== ========= ================================
948 "Name" string Required Source name of the kernel.
949 "SymbolName" string Required Name of the kernel
950 descriptor ELF symbol.
951 "Language" string Source language of the kernel.
952 Values include:
953
954 - "OpenCL C"
955 - "OpenCL C++"
956 - "HCC"
957 - "OpenMP"
958
959 "LanguageVersion" sequence of - The first integer is the major
960 2 integers version.
961 - The second integer is the
962 minor version.
963 "Attrs" mapping Mapping of kernel attributes.
964 See
965 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
966 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +0000967 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +0000968 mapping kernel arguments. See
969 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
970 for the definition of the mapping.
971 "CodeProps" mapping Mapping of properties related to
972 the kernel code. See
973 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
974 for the mapping definition.
Tony Tyef16a45e2017-06-06 20:31:59 +0000975 ================= ============== ========= ================================
976
977..
978
979 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
980 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
981
982 =================== ============== ========= ==============================
983 String Key Value Type Required? Description
984 =================== ============== ========= ==============================
985 "ReqdWorkGroupSize" sequence of The dispatch work-group size
986 3 integers X, Y, Z must correspond to the
987 specified values.
988
989 Corresponds to the OpenCL
990 ``reqd_work_group_size``
991 attribute.
992 "WorkGroupSizeHint" sequence of The dispatch work-group size
993 3 integers X, Y, Z is likely to be the
994 specified values.
995
996 Corresponds to the OpenCL
997 ``work_group_size_hint``
998 attribute.
999 "VecTypeHint" string The name of a scalar or vector
1000 type.
1001
1002 Corresponds to the OpenCL
1003 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +00001004
1005 "RuntimeHandle" string The external symbol name
1006 associated with a kernel.
1007 OpenCL runtime allocates a
1008 global buffer for the symbol
1009 and saves the kernel's address
1010 to it, which is used for
1011 device side enqueueing. Only
1012 available for device side
1013 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +00001014 =================== ============== ========= ==============================
1015
1016..
1017
1018 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1019 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1020
1021 ================= ============== ========= ================================
1022 String Key Value Type Required? Description
1023 ================= ============== ========= ================================
1024 "Name" string Kernel argument name.
1025 "TypeName" string Kernel argument type name.
1026 "Size" integer Required Kernel argument size in bytes.
1027 "Align" integer Required Kernel argument alignment in
1028 bytes. Must be a power of two.
1029 "ValueKind" string Required Kernel argument kind that
1030 specifies how to set up the
1031 corresponding argument.
1032 Values include:
1033
1034 "ByValue"
1035 The argument is copied
1036 directly into the kernarg.
1037
1038 "GlobalBuffer"
1039 A global address space pointer
1040 to the buffer data is passed
1041 in the kernarg.
1042
1043 "DynamicSharedPointer"
1044 A group address space pointer
1045 to dynamically allocated LDS
1046 is passed in the kernarg.
1047
1048 "Sampler"
1049 A global address space
1050 pointer to a S# is passed in
1051 the kernarg.
1052
1053 "Image"
1054 A global address space
1055 pointer to a T# is passed in
1056 the kernarg.
1057
1058 "Pipe"
1059 A global address space pointer
1060 to an OpenCL pipe is passed in
1061 the kernarg.
1062
1063 "Queue"
1064 A global address space pointer
1065 to an OpenCL device enqueue
1066 queue is passed in the
1067 kernarg.
1068
1069 "HiddenGlobalOffsetX"
1070 The OpenCL grid dispatch
1071 global offset for the X
1072 dimension is passed in the
1073 kernarg.
1074
1075 "HiddenGlobalOffsetY"
1076 The OpenCL grid dispatch
1077 global offset for the Y
1078 dimension is passed in the
1079 kernarg.
1080
1081 "HiddenGlobalOffsetZ"
1082 The OpenCL grid dispatch
1083 global offset for the Z
1084 dimension is passed in the
1085 kernarg.
1086
1087 "HiddenNone"
1088 An argument that is not used
1089 by the kernel. Space needs to
1090 be left for it, but it does
1091 not need to be set up.
1092
1093 "HiddenPrintfBuffer"
1094 A global address space pointer
1095 to the runtime printf buffer
1096 is passed in kernarg.
1097
1098 "HiddenDefaultQueue"
1099 A global address space pointer
1100 to the OpenCL device enqueue
1101 queue that should be used by
1102 the kernel by default is
1103 passed in the kernarg.
1104
1105 "HiddenCompletionAction"
Yaxun Liuc928f2a2017-10-30 14:30:28 +00001106 A global address space pointer
1107 to help link enqueued kernels into
1108 the ancestor tree for determining
1109 when the parent kernel has finished.
Tony Tyef16a45e2017-06-06 20:31:59 +00001110
1111 "ValueType" string Required Kernel argument value type. Only
1112 present if "ValueKind" is
1113 "ByValue". For vector data
1114 types, the value is for the
1115 element type. Values include:
1116
1117 - "Struct"
1118 - "I8"
1119 - "U8"
1120 - "I16"
1121 - "U16"
1122 - "F16"
1123 - "I32"
1124 - "U32"
1125 - "F32"
1126 - "I64"
1127 - "U64"
1128 - "F64"
1129
1130 .. TODO
1131 How can it be determined if a
1132 vector type, and what size
1133 vector?
1134 "PointeeAlign" integer Alignment in bytes of pointee
1135 type for pointer type kernel
1136 argument. Must be a power
1137 of 2. Only present if
1138 "ValueKind" is
1139 "DynamicSharedPointer".
1140 "AddrSpaceQual" string Kernel argument address space
1141 qualifier. Only present if
1142 "ValueKind" is "GlobalBuffer" or
1143 "DynamicSharedPointer". Values
1144 are:
1145
1146 - "Private"
1147 - "Global"
1148 - "Constant"
1149 - "Local"
1150 - "Generic"
1151 - "Region"
1152
1153 .. TODO
1154 Is GlobalBuffer only Global
1155 or Constant? Is
1156 DynamicSharedPointer always
1157 Local? Can HCC allow Generic?
1158 How can Private or Region
1159 ever happen?
1160 "AccQual" string Kernel argument access
1161 qualifier. Only present if
1162 "ValueKind" is "Image" or
1163 "Pipe". Values
1164 are:
1165
1166 - "ReadOnly"
1167 - "WriteOnly"
1168 - "ReadWrite"
1169
1170 .. TODO
1171 Does this apply to
1172 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001173 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001174 performed by the kernel on the
1175 kernel argument. Only present if
1176 "ValueKind" is "GlobalBuffer",
1177 "Image", or "Pipe". This may be
1178 more restrictive than indicated
1179 by "AccQual" to reflect what the
1180 kernel actual does. If not
1181 present then the runtime must
1182 assume what is implied by
1183 "AccQual" and "IsConst". Values
1184 are:
1185
1186 - "ReadOnly"
1187 - "WriteOnly"
1188 - "ReadWrite"
1189
1190 "IsConst" boolean Indicates if the kernel argument
1191 is const qualified. Only present
1192 if "ValueKind" is
1193 "GlobalBuffer".
1194
1195 "IsRestrict" boolean Indicates if the kernel argument
1196 is restrict qualified. Only
1197 present if "ValueKind" is
1198 "GlobalBuffer".
1199
1200 "IsVolatile" boolean Indicates if the kernel argument
1201 is volatile qualified. Only
1202 present if "ValueKind" is
1203 "GlobalBuffer".
1204
1205 "IsPipe" boolean Indicates if the kernel argument
1206 is pipe qualified. Only present
1207 if "ValueKind" is "Pipe".
1208
1209 .. TODO
1210 Can GlobalBuffer be pipe
1211 qualified?
1212 ================= ============== ========= ================================
1213
1214..
1215
1216 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1217 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1218
1219 ============================ ============== ========= =====================
1220 String Key Value Type Required? Description
1221 ============================ ============== ========= =====================
1222 "KernargSegmentSize" integer Required The size in bytes of
1223 the kernarg segment
1224 that holds the values
1225 of the arguments to
1226 the kernel.
1227 "GroupSegmentFixedSize" integer Required The amount of group
1228 segment memory
1229 required by a
1230 work-group in
1231 bytes. This does not
1232 include any
1233 dynamically allocated
1234 group segment memory
1235 that may be added
1236 when the kernel is
1237 dispatched.
1238 "PrivateSegmentFixedSize" integer Required The amount of fixed
1239 private address space
1240 memory required for a
1241 work-item in
Tony Tye07d9f102017-11-10 01:00:54 +00001242 bytes. If the kernel
1243 uses a dynamic call
1244 stack then additional
Tony Tyef16a45e2017-06-06 20:31:59 +00001245 space must be added
1246 to this value for the
1247 call stack.
1248 "KernargSegmentAlign" integer Required The maximum byte
1249 alignment of
1250 arguments in the
1251 kernarg segment. Must
1252 be a power of 2.
1253 "WavefrontSize" integer Required Wavefront size. Must
1254 be a power of 2.
Tony Tye07d9f102017-11-10 01:00:54 +00001255 "NumSGPRs" integer Required Number of scalar
Tony Tyef16a45e2017-06-06 20:31:59 +00001256 registers used by a
1257 wavefront for
1258 GFX6-GFX9. This
1259 includes the special
1260 SGPRs for VCC, Flat
1261 Scratch (GFX7-GFX9)
1262 and XNACK (for
1263 GFX8-GFX9). It does
1264 not include the 16
1265 SGPR added if a trap
1266 handler is
1267 enabled. It is not
1268 rounded up to the
1269 allocation
1270 granularity.
Tony Tye07d9f102017-11-10 01:00:54 +00001271 "NumVGPRs" integer Required Number of vector
Tony Tyef16a45e2017-06-06 20:31:59 +00001272 registers used by
1273 each work-item for
1274 GFX6-GFX9
Tony Tye07d9f102017-11-10 01:00:54 +00001275 "MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001276 work-group size
1277 supported by the
1278 kernel in work-items.
Tony Tye07d9f102017-11-10 01:00:54 +00001279 Must be >=1 and
1280 consistent with any
1281 non-0 values in
1282 FixedWorkGroupSize.
1283 "FixedWorkGroupSize" sequence of Corresponds to the
1284 3 integers dispatch work-group
1285 size X, Y, Z. If
1286 omitted, defaults to
1287 0, 0, 0. If an
1288 element is non-0 then
1289 the kernel must only
1290 be launched with a
1291 matching corresponding
1292 work-group size.
Tony Tyef16a45e2017-06-06 20:31:59 +00001293 "IsXNACKEnabled" boolean Indicates if the
1294 generated machine
1295 code is capable of
Tony Tye07d9f102017-11-10 01:00:54 +00001296 supporting XNACK. See
1297 :ref:`amdgpu-target-features`.
Tony Tyef16a45e2017-06-06 20:31:59 +00001298 ============================ ============== ========= =====================
1299
1300..
1301
Tony Tyef16a45e2017-06-06 20:31:59 +00001302Kernel Dispatch
1303~~~~~~~~~~~~~~~
1304
1305The HSA architected queuing language (AQL) defines a user space memory interface
1306that can be used to control the dispatch of kernels, in an agent independent
1307way. An agent can have zero or more AQL queues created for it using the ROCm
1308runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1309*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1310mechanics and packet layouts.
1311
1312The packet processor of a kernel agent is responsible for detecting and
1313dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1314packet processor is implemented by the hardware command processor (CP),
1315asynchronous dispatch controller (ADC) and shader processor input controller
1316(SPI).
1317
1318The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1319mode driver to initialize and register the AQL queue with CP.
1320
1321To dispatch a kernel the following actions are performed. This can occur in the
1322CPU host program, or from an HSA kernel executing on a GPU.
1323
13241. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1325 executed is obtained.
13262. A pointer to the kernel descriptor (see
1327 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1328 obtained. It must be for a kernel that is contained in a code object that that
1329 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1330 associated.
13313. Space is allocated for the kernel arguments using the ROCm runtime allocator
1332 for a memory region with the kernarg property for the kernel agent that will
1333 execute the kernel. It must be at least 16 byte aligned.
13344. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00001335 allocation. The layout is defined in the *HSA Programmer's Language Reference*
Tony Tyef16a45e2017-06-06 20:31:59 +00001336 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1337 memory in the same way constant memory is accessed. (Note that the HSA
1338 specification allows an implementation to copy the kernel argument contents to
1339 another location that is accessed by the kernel.)
13405. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1341 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1342 packet. The packet must be set up, and the final write must use an atomic
1343 store release to set the packet kind to ensure the packet contents are
1344 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1345 notify the kernel agent that the AQL queue has been updated. These rules, and
1346 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1347 System Architecture Specification* [HSA]_.
13486. A kernel dispatch packet includes information about the actual dispatch,
1349 such as grid and work-group size, together with information from the code
1350 object about the kernel, such as segment sizes. The ROCm runtime queries on
1351 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001352 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000013537. CP executes micro-code and is responsible for detecting and setting up the
1354 GPU to execute the wavefronts of a kernel dispatch.
13558. CP ensures that when the a wavefront starts executing the kernel machine
1356 code, the scalar general purpose registers (SGPR) and vector general purpose
1357 registers (VGPR) are set up as required by the machine code. The required
1358 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1359 register state is defined in
1360 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13619. The prolog of the kernel machine code (see
1362 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1363 before continuing executing the machine code that corresponds to the kernel.
136410. When the kernel dispatch has completed execution, CP signals the completion
1365 signal specified in the kernel dispatch packet if not 0.
1366
1367.. _amdgpu-amdhsa-memory-spaces:
1368
1369Memory Spaces
1370~~~~~~~~~~~~~
1371
1372The memory space properties are:
1373
1374 .. table:: AMDHSA Memory Spaces
1375 :name: amdgpu-amdhsa-memory-spaces-table
1376
1377 ================= =========== ======== ======= ==================
1378 Memory Space Name HSA Segment Hardware Address NULL Value
1379 Name Name Size
1380 ================= =========== ======== ======= ==================
1381 Private private scratch 32 0x00000000
1382 Local group LDS 32 0xFFFFFFFF
1383 Global global global 64 0x0000000000000000
1384 Constant constant *same as 64 0x0000000000000000
1385 global*
1386 Generic flat flat 64 0x0000000000000000
1387 Region N/A GDS 32 *not implemented
1388 for AMDHSA*
1389 ================= =========== ======== ======= ==================
1390
1391The global and constant memory spaces both use global virtual addresses, which
1392are the same virtual address space used by the CPU. However, some virtual
1393addresses may only be accessible to the CPU, some only accessible by the GPU,
1394and some by both.
1395
1396Using the constant memory space indicates that the data will not change during
1397the execution of the kernel. This allows scalar read instructions to be
1398used. The vector and scalar L1 caches are invalidated of volatile data before
1399each kernel dispatch execution to allow constant memory to change values between
1400kernel dispatches.
1401
1402The local memory space uses the hardware Local Data Store (LDS) which is
1403automatically allocated when the hardware creates work-groups of wavefronts, and
1404freed when all the wavefronts of a work-group have terminated. The data store
1405(DS) instructions can be used to access it.
1406
1407The private memory space uses the hardware scratch memory support. If the kernel
1408uses scratch, then the hardware allocates memory that is accessed using
1409wavefront lane dword (4 byte) interleaving. The mapping used from private
1410address to physical address is:
1411
1412 ``wavefront-scratch-base +
1413 (private-address * wavefront-size * 4) +
1414 (wavefront-lane-id * 4)``
1415
1416There are different ways that the wavefront scratch base address is determined
1417by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1418memory can be accessed in an interleaved manner using buffer instruction with
1419the scratch buffer descriptor and per wave scratch offset, by the scratch
1420instructions, or by flat instructions. If each lane of a wavefront accesses the
1421same private address, the interleaving results in adjacent dwords being accessed
1422and hence requires fewer cache lines to be fetched. Multi-dword access is not
1423supported except by flat and scratch instructions in GFX9.
1424
1425The generic address space uses the hardware flat address support available in
1426GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1427local appertures), that are outside the range of addressible global memory, to
1428map from a flat address to a private or local address.
1429
1430FLAT instructions can take a flat address and access global, private (scratch)
1431and group (LDS) memory depending in if the address is within one of the
1432apperture ranges. Flat access to scratch requires hardware aperture setup and
1433setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1434access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1435(see :ref:`amdgpu-amdhsa-m0`).
1436
1437To convert between a segment address and a flat address the base address of the
1438appertures address can be used. For GFX7-GFX8 these are available in the
1439:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1440Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1441GFX9 the appature base addresses are directly available as inline constant
1442registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1443address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1444which makes it easier to convert from flat to segment or segment to flat.
1445
Tony Tye46d35762017-08-15 20:47:41 +00001446Image and Samplers
1447~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001448
1449Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1450hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1451HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1452enumeration values for the queries that are not trivially deducible from the S#
1453representation.
1454
1455HSA Signals
1456~~~~~~~~~~~
1457
Tony Tye46d35762017-08-15 20:47:41 +00001458HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1459structure allocated in memory accessible from both the CPU and GPU. The
1460structure is defined by the ROCm runtime and subject to change between releases
1461(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001462
1463.. _amdgpu-amdhsa-hsa-aql-queue:
1464
1465HSA AQL Queue
1466~~~~~~~~~~~~~
1467
Tony Tye46d35762017-08-15 20:47:41 +00001468The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001469between releases (see [AMD-ROCm-github]_). For some processors it contains
1470fields needed to implement certain language features such as the flat address
1471aperture bases. It also contains fields used by CP such as managing the
1472allocation of scratch memory.
1473
1474.. _amdgpu-amdhsa-kernel-descriptor:
1475
1476Kernel Descriptor
1477~~~~~~~~~~~~~~~~~
1478
1479A kernel descriptor consists of the information needed by CP to initiate the
1480execution of a kernel, including the entry point address of the machine code
1481that implements the kernel.
1482
1483Kernel Descriptor for GFX6-GFX9
1484+++++++++++++++++++++++++++++++
1485
1486CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1487
1488 .. table:: Kernel Descriptor for GFX6-GFX9
1489 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1490
Tony Tye6baa6d22017-10-18 22:16:55 +00001491 ======= ======= =============================== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001492 Bits Size Field Name Description
Tony Tye6baa6d22017-10-18 22:16:55 +00001493 ======= ======= =============================== ============================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001494 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001495 address space memory
1496 required for a work-group
1497 in bytes. This does not
1498 include any dynamically
1499 allocated local address
1500 space memory that may be
1501 added when the kernel is
1502 dispatched.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001503 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001504 private address space
1505 memory required for a
1506 work-item in bytes. If
1507 is_dynamic_callstack is 1
1508 then additional space must
1509 be added to this value for
1510 the call stack.
Tony Tye07d9f102017-11-10 01:00:54 +00001511 127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001512 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001513 negative) from base
1514 address of kernel
1515 descriptor to kernel's
1516 entry point instruction
1517 which must be 256 byte
1518 aligned.
Tony Tye07d9f102017-11-10 01:00:54 +00001519 223:192 4 bytes MaxFlatWorkGroupSize Maximum flat work-group
1520 size supported by the
1521 kernel in work-items. If
1522 an exact work-group size
1523 is required then must be
1524 omitted or 0 and
1525 ReqdWorkGroupSize* must
1526 be set to non-0.
1527 239:224 2 bytes ReqdWorkGroupSizeX If present and non-0 then
1528 the kernel
1529 must be executed with the
1530 specified work-group size
1531 for X.
1532 255:240 2 bytes ReqdWorkGroupSizeY If present and non-0 then
1533 the kernel
1534 must be executed with the
1535 specified work-group size
1536 for Y.
1537 271:256 2 bytes ReqdWorkGroupSizeZ If present and non-0 then
1538 the kernel
1539 must be executed with the
1540 specified work-group size
1541 for Z.
1542 383:271 14 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001543 bytes
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001544 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001545 program settings used by
1546 CP to set up
1547 ``COMPUTE_PGM_RSRC1``
1548 configuration
1549 register. See
Tony Tye6baa6d22017-10-18 22:16:55 +00001550 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001551 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001552 program settings used by
1553 CP to set up
1554 ``COMPUTE_PGM_RSRC2``
1555 configuration
1556 register. See
1557 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001558 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1559 SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001560 (see
1561 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1562
1563 The total number of SGPR
1564 user data registers
1565 requested must not exceed
1566 16 and match value in
1567 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1568 Any requests beyond 16
1569 will be ignored.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001570 449 1 bit EnableSGPRDispatchPtr *see above*
1571 450 1 bit EnableSGPRQueuePtr *see above*
1572 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1573 452 1 bit EnableSGPRDispatchID *see above*
1574 453 1 bit EnableSGPRFlatScratchInit *see above*
1575 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1576 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1577 should always be 0.
1578 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1579 should always be 0.
1580 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1581 should always be 0.
Tony Tye07d9f102017-11-10 01:00:54 +00001582 462:458 5 bits Reserved, must be 0.
1583 463 1 bit IsXNACKEnabled Indicates if the generated
1584 machine code is capable of
1585 supporting XNACK.
Tony Tye6baa6d22017-10-18 22:16:55 +00001586 511:464 6 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001587 bytes
1588 512 **Total size 64 bytes.**
Tony Tye6baa6d22017-10-18 22:16:55 +00001589 ======= ====================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001590
1591..
1592
1593 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001594 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tyef16a45e2017-06-06 20:31:59 +00001595
Tony Tye3b340612017-06-07 00:46:08 +00001596 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001597 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001598 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001599 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001600 used by each work-item,
1601 granularity is device
1602 specific:
1603
Tony Tye07d9f102017-11-10 01:00:54 +00001604 GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001605 - max_vgpr 1..256
1606 - roundup((max_vgpg + 1)
1607 / 4) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001608
1609 Used by CP to set up
1610 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001611 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001612 used by a wavefront,
1613 granularity is device
1614 specific:
1615
Tony Tye07d9f102017-11-10 01:00:54 +00001616 GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001617 - max_sgpr 1..112
1618 - roundup((max_sgpg + 1)
1619 / 8) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001620 GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001621 - max_sgpr 1..112
1622 - roundup((max_sgpg + 1)
1623 / 16) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001624
1625 Includes the special SGPRs
1626 for VCC, Flat Scratch (for
1627 GFX7 onwards) and XNACK
1628 (for GFX8 onwards). It does
1629 not include the 16 SGPR
1630 added if a trap handler is
1631 enabled.
1632
1633 Used by CP to set up
1634 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001635 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001636
1637 Start executing wavefront
1638 at the specified priority.
1639
1640 CP is responsible for
1641 filling in
1642 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001643 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001644 with specified rounding
1645 mode for single (32
1646 bit) floating point
1647 precision floating point
1648 operations.
1649
1650 Floating point rounding
1651 mode values are defined in
1652 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1653
1654 Used by CP to set up
1655 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001656 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001657 with specified rounding
1658 denorm mode for half/double (16
1659 and 64 bit) floating point
1660 precision floating point
1661 operations.
1662
1663 Floating point rounding
1664 mode values are defined in
1665 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1666
1667 Used by CP to set up
1668 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001669 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001670 with specified denorm mode
1671 for single (32
1672 bit) floating point
1673 precision floating point
1674 operations.
1675
1676 Floating point denorm mode
1677 values are defined in
1678 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1679
1680 Used by CP to set up
1681 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001682 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001683 with specified denorm mode
1684 for half/double (16
1685 and 64 bit) floating point
1686 precision floating point
1687 operations.
1688
1689 Floating point denorm mode
1690 values are defined in
1691 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1692
1693 Used by CP to set up
1694 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001695 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001696
1697 Start executing wavefront
1698 in privilege trap handler
1699 mode.
1700
1701 CP is responsible for
1702 filling in
1703 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001704 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001705 with DX10 clamp mode
1706 enabled. Used by the vector
Tony Tye6baa6d22017-10-18 22:16:55 +00001707 ALU to force DX10 style
Tony Tyef16a45e2017-06-06 20:31:59 +00001708 treatment of NaN's (when
1709 set, clamp NaN to zero,
1710 otherwise pass NaN
1711 through).
1712
1713 Used by CP to set up
1714 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001715 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001716
1717 Start executing wavefront
1718 in single step mode.
1719
1720 CP is responsible for
1721 filling in
1722 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001723 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001724 with IEEE mode
1725 enabled. Floating point
1726 opcodes that support
1727 exception flag gathering
1728 will quiet and propagate
1729 signaling-NaN inputs per
1730 IEEE 754-2008. Min_dx10 and
1731 max_dx10 become IEEE
1732 754-2008 compliant due to
1733 signaling-NaN propagation
1734 and quieting.
1735
1736 Used by CP to set up
1737 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001738 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001739
1740 Only one work-group allowed
1741 to execute on a compute
1742 unit.
1743
1744 CP is responsible for
1745 filling in
1746 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001747 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001748
1749 Flag that can be used to
1750 control debugging code.
1751
1752 CP is responsible for
1753 filling in
1754 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye07d9f102017-11-10 01:00:54 +00001755 26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001756 Reserved, must be 0.
1757 GFX9
1758 Wavefront starts execution
1759 with specified fp16 overflow
1760 mode.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001761
Tony Tye6baa6d22017-10-18 22:16:55 +00001762 - If 0, fp16 overflow generates
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001763 +/-INF values.
Tony Tye6baa6d22017-10-18 22:16:55 +00001764 - If 1, fp16 overflow that is the
1765 result of an +/-INF input value
1766 or divide by 0 produces a +/-INF,
1767 otherwise clamps computed
1768 overflow to +/-MAX_FP16 as
1769 appropriate.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001770
1771 Used by CP to set up
1772 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye6baa6d22017-10-18 22:16:55 +00001773 31:27 5 bits Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001774 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001775 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001776
1777..
1778
1779 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1780 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1781
Tony Tye3b340612017-06-07 00:46:08 +00001782 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001783 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001784 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001785 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1786 _WAVE_OFFSET SGPR wave scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001787 system register (see
1788 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1789
1790 Used by CP to set up
1791 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001792 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001793 user data registers
1794 requested. This number must
1795 match the number of user
1796 data registers enabled.
1797
1798 Used by CP to set up
1799 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001800 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tyef16a45e2017-06-06 20:31:59 +00001801 TRAP instruction which
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00001802 requires a trap handler to
Tony Tyef16a45e2017-06-06 20:31:59 +00001803 be enabled.
1804
1805 CP sets
1806 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1807 if the runtime has
1808 installed a trap handler
1809 regardless of the setting
1810 of this field.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001811 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001812 system SGPR register for
1813 the work-group id in the X
1814 dimension (see
1815 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1816
1817 Used by CP to set up
1818 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001819 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001820 system SGPR register for
1821 the work-group id in the Y
1822 dimension (see
1823 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1824
1825 Used by CP to set up
1826 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001827 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001828 system SGPR register for
1829 the work-group id in the Z
1830 dimension (see
1831 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1832
1833 Used by CP to set up
1834 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001835 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001836 system SGPR register for
1837 work-group information (see
1838 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1839
1840 Used by CP to set up
1841 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001842 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001843 VGPR system registers used
1844 for the work-item ID.
1845 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1846 defines the values.
1847
1848 Used by CP to set up
1849 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001850 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001851
1852 Wavefront starts execution
1853 with address watch
1854 exceptions enabled which
1855 are generated when L1 has
1856 witnessed a thread access
1857 an *address of
1858 interest*.
1859
1860 CP is responsible for
1861 filling in the address
1862 watch bit in
1863 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1864 according to what the
1865 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001866 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001867
1868 Wavefront starts execution
1869 with memory violation
1870 exceptions exceptions
1871 enabled which are generated
1872 when a memory violation has
1873 occurred for this wave from
1874 L1 or LDS
1875 (write-to-read-only-memory,
1876 mis-aligned atomic, LDS
1877 address out of range,
1878 illegal address, etc.).
1879
1880 CP sets the memory
1881 violation bit in
1882 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1883 according to what the
1884 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001885 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001886
1887 CP uses the rounded value
1888 from the dispatch packet,
1889 not this value, as the
1890 dispatch may contain
1891 dynamically allocated group
1892 segment memory. CP writes
1893 directly to
1894 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1895
1896 Amount of group segment
1897 (LDS) to allocate for each
1898 work-group. Granularity is
1899 device specific:
1900
1901 GFX6:
1902 roundup(lds-size / (64 * 4))
1903 GFX7-GFX9:
1904 roundup(lds-size / (128 * 4))
1905
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001906 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1907 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001908 enabled.
1909
1910 Used by CP to set up
1911 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1912 (set from bits 0..6).
1913
1914 IEEE 754 FP Invalid
1915 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001916 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1917 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001918 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001919 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1920 _DIVISION_BY_ZERO Zero
1921 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1922 _OVERFLOW
1923 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1924 _UNDERFLOW
1925 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1926 _INEXACT
1927 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1928 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001929 only)
Tony Tye6baa6d22017-10-18 22:16:55 +00001930 31 1 bit Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001931 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001932 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001933
1934..
1935
1936 .. table:: Floating Point Rounding Mode Enumeration Values
1937 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1938
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001939 ====================================== ===== ==============================
1940 Enumeration Name Value Description
1941 ====================================== ===== ==============================
1942 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1943 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1944 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1945 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1946 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001947
1948..
1949
1950 .. table:: Floating Point Denorm Mode Enumeration Values
1951 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1952
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001953 ====================================== ===== ==============================
1954 Enumeration Name Value Description
1955 ====================================== ===== ==============================
1956 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1957 Denorms
1958 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1959 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1960 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1961 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001962
1963..
1964
1965 .. table:: System VGPR Work-Item ID Enumeration Values
1966 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1967
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001968 ======================================== ===== ============================
1969 Enumeration Name Value Description
1970 ======================================== ===== ============================
1971 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1972 ID.
1973 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1974 dimensions ID.
1975 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1976 dimensions ID.
1977 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1978 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001979
1980.. _amdgpu-amdhsa-initial-kernel-execution-state:
1981
1982Initial Kernel Execution State
1983~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1984
1985This section defines the register state that will be set up by the packet
1986processor prior to the start of execution of every wavefront. This is limited by
1987the constraints of the hardware controllers of CP/ADC/SPI.
1988
1989The order of the SGPR registers is defined, but the compiler can specify which
1990ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1991fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1992for enabled registers are dense starting at SGPR0: the first enabled register is
1993SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1994an SGPR number.
1995
1996The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1997all waves of the grid. It is possible to specify more than 16 User SGPRs using
1998the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1999initialized. These are then immediately followed by the System SGPRs that are
2000set up by ADC/SPI and can have different values for each wave of the grid
2001dispatch.
2002
2003SGPR register initial state is defined in
2004:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2005
2006 .. table:: SGPR Register Set Up Order
2007 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2008
2009 ========== ========================== ====== ==============================
2010 SGPR Order Name Number Description
2011 (kernel descriptor enable of
2012 field) SGPRs
2013 ========== ========================== ====== ==============================
2014 First Private Segment Buffer 4 V# that can be used, together
2015 (enable_sgpr_private with Scratch Wave Offset as an
2016 _segment_buffer) offset, to access the private
2017 memory space using a segment
2018 address.
2019
2020 CP uses the value provided by
2021 the runtime.
2022 then Dispatch Ptr 2 64 bit address of AQL dispatch
2023 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2024 actually executing.
2025 then Queue Ptr 2 64 bit address of amd_queue_t
2026 (enable_sgpr_queue_ptr) object for AQL queue on which
2027 the dispatch packet was
2028 queued.
2029 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2030 (enable_sgpr_kernarg segment. This is directly
2031 _segment_ptr) copied from the
2032 kernarg_address in the kernel
2033 dispatch packet.
2034
2035 Having CP load it once avoids
2036 loading it at the beginning of
2037 every wavefront.
2038 then Dispatch Id 2 64 bit Dispatch ID of the
2039 (enable_sgpr_dispatch_id) dispatch packet being
2040 executed.
2041 then Flat Scratch Init 2 This is 2 SGPRs:
2042 (enable_sgpr_flat_scratch
2043 _init) GFX6
2044 Not supported.
2045 GFX7-GFX8
2046 The first SGPR is a 32 bit
2047 byte offset from
2048 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2049 to per SPI base of memory
2050 for scratch for the queue
2051 executing the kernel
2052 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00002053 from the runtime. (The
2054 Scratch Segment Buffer base
2055 address is
2056 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2057 plus this offset.) The value
2058 of Scratch Wave Offset must
2059 be added to this offset by
2060 the kernel machine code,
2061 right shifted by 8, and
2062 moved to the FLAT_SCRATCH_HI
2063 SGPR register.
2064 FLAT_SCRATCH_HI corresponds
2065 to SGPRn-4 on GFX7, and
2066 SGPRn-6 on GFX8 (where SGPRn
2067 is the highest numbered SGPR
2068 allocated to the wave).
2069 FLAT_SCRATCH_HI is
2070 multiplied by 256 (as it is
2071 in units of 256 bytes) and
2072 added to
2073 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2074 to calculate the per wave
2075 FLAT SCRATCH BASE in flat
2076 memory instructions that
2077 access the scratch
2078 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002079
2080 The second SGPR is 32 bit
2081 byte size of a single
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002082 work-item's scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002083 usage. CP obtains this from
2084 the runtime, and it is
2085 always a multiple of DWORD.
2086 CP checks that the value in
2087 the kernel dispatch packet
2088 Private Segment Byte Size is
2089 not larger, and requests the
2090 runtime to increase the
2091 queue's scratch size if
2092 necessary. The kernel code
2093 must move it to
2094 FLAT_SCRATCH_LO which is
2095 SGPRn-3 on GFX7 and SGPRn-5
2096 on GFX8. FLAT_SCRATCH_LO is
2097 used as the FLAT SCRATCH
2098 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002099 instructions. Having CP load
2100 it once avoids loading it at
2101 the beginning of every
Tony Tye46d35762017-08-15 20:47:41 +00002102 wavefront. GFX9 This is the
2103 64 bit base address of the
2104 per SPI scratch backing
2105 memory managed by SPI for
2106 the queue executing the
2107 kernel dispatch. CP obtains
2108 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002109 divides it if there are
2110 multiple Shader Arrays each
2111 with its own SPI). The value
2112 of Scratch Wave Offset must
2113 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002114 machine code and the result
2115 moved to the FLAT_SCRATCH
2116 SGPR which is SGPRn-6 and
2117 SGPRn-5. It is used as the
2118 FLAT SCRATCH BASE in flat
2119 memory instructions. then
2120 Private Segment Size 1 The
2121 32 bit byte size of a
2122 (enable_sgpr_private single
2123 work-item's
2124 scratch_segment_size) memory
2125 allocation. This is the
2126 value from the kernel
2127 dispatch packet Private
2128 Segment Byte Size rounded up
2129 by CP to a multiple of
2130 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002131
2132 Having CP load it once avoids
2133 loading it at the beginning of
2134 every wavefront.
2135
2136 This is not used for
2137 GFX7-GFX8 since it is the same
2138 value as the second SGPR of
2139 Flat Scratch Init. However, it
2140 may be needed for GFX9 which
2141 changes the meaning of the
2142 Flat Scratch Init value.
2143 then Grid Work-Group Count X 1 32 bit count of the number of
2144 (enable_sgpr_grid work-groups in the X dimension
2145 _workgroup_count_X) for the grid being
2146 executed. Computed from the
2147 fields in the kernel dispatch
2148 packet as ((grid_size.x +
2149 workgroup_size.x - 1) /
2150 workgroup_size.x).
2151 then Grid Work-Group Count Y 1 32 bit count of the number of
2152 (enable_sgpr_grid work-groups in the Y dimension
2153 _workgroup_count_Y && for the grid being
2154 less than 16 previous executed. Computed from the
2155 SGPRs) fields in the kernel dispatch
2156 packet as ((grid_size.y +
2157 workgroup_size.y - 1) /
2158 workgroupSize.y).
2159
2160 Only initialized if <16
2161 previous SGPRs initialized.
2162 then Grid Work-Group Count Z 1 32 bit count of the number of
2163 (enable_sgpr_grid work-groups in the Z dimension
2164 _workgroup_count_Z && for the grid being
2165 less than 16 previous executed. Computed from the
2166 SGPRs) fields in the kernel dispatch
2167 packet as ((grid_size.z +
2168 workgroup_size.z - 1) /
2169 workgroupSize.z).
2170
2171 Only initialized if <16
2172 previous SGPRs initialized.
2173 then Work-Group Id X 1 32 bit work-group id in X
2174 (enable_sgpr_workgroup_id dimension of grid for
2175 _X) wavefront.
2176 then Work-Group Id Y 1 32 bit work-group id in Y
2177 (enable_sgpr_workgroup_id dimension of grid for
2178 _Y) wavefront.
2179 then Work-Group Id Z 1 32 bit work-group id in Z
2180 (enable_sgpr_workgroup_id dimension of grid for
2181 _Z) wavefront.
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002182 then Work-Group Info 1 {first_wave, 14'b0000,
Tony Tyef16a45e2017-06-06 20:31:59 +00002183 (enable_sgpr_workgroup ordered_append_term[10:0],
2184 _info) threadgroup_size_in_waves[5:0]}
2185 then Scratch Wave Offset 1 32 bit byte offset from base
2186 (enable_sgpr_private of scratch base of queue
2187 _segment_wave_offset) executing the kernel
2188 dispatch. Must be used as an
2189 offset with Private
2190 segment address when using
2191 Scratch Segment Buffer. It
2192 must be used to set up FLAT
2193 SCRATCH for flat addressing
2194 (see
2195 :ref:`amdgpu-amdhsa-flat-scratch`).
2196 ========== ========================== ====== ==============================
2197
2198The order of the VGPR registers is defined, but the compiler can specify which
2199ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2200fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2201for enabled registers are dense starting at VGPR0: the first enabled register is
2202VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2203VGPR number.
2204
2205VGPR register initial state is defined in
2206:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2207
2208 .. table:: VGPR Register Set Up Order
2209 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2210
2211 ========== ========================== ====== ==============================
2212 VGPR Order Name Number Description
2213 (kernel descriptor enable of
2214 field) VGPRs
2215 ========== ========================== ====== ==============================
2216 First Work-Item Id X 1 32 bit work item id in X
2217 (Always initialized) dimension of work-group for
2218 wavefront lane.
2219 then Work-Item Id Y 1 32 bit work item id in Y
2220 (enable_vgpr_workitem_id dimension of work-group for
2221 > 0) wavefront lane.
2222 then Work-Item Id Z 1 32 bit work item id in Z
2223 (enable_vgpr_workitem_id dimension of work-group for
2224 > 1) wavefront lane.
2225 ========== ========================== ====== ==============================
2226
2227The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2228
22291. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2230 registers.
22312. Work-group Id registers X, Y, Z are set by ADC which supports any
2232 combination including none.
22333. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2234 cannot included with the flat scratch init value which is per queue.
22354. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2236 or (X, Y, Z).
2237
2238Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2239value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2240
2241The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye07d9f102017-11-10 01:00:54 +00002242has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tyef16a45e2017-06-06 20:31:59 +00002243instructions (GFX9).
2244
2245If buffer operations are used then the compiler can generate a V# with the
2246following properties:
2247
2248* base address of 0
2249* no swizzle
2250* ATC: 1 if IOMMU present (such as APU)
2251* ptr64: 1
2252* MTYPE set to support memory coherence that matches the runtime (such as CC for
2253 APU and NC for dGPU).
2254
2255.. _amdgpu-amdhsa-kernel-prolog:
2256
2257Kernel Prolog
2258~~~~~~~~~~~~~
2259
2260.. _amdgpu-amdhsa-m0:
2261
2262M0
2263++
2264
2265GFX6-GFX8
2266 The M0 register must be initialized with a value at least the total LDS size
2267 if the kernel may access LDS via DS or flat operations. Total LDS size is
2268 available in dispatch packet. For M0, it is also possible to use maximum
2269 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2270 GFX7-GFX8).
2271GFX9
2272 The M0 register is not used for range checking LDS accesses and so does not
2273 need to be initialized in the prolog.
2274
2275.. _amdgpu-amdhsa-flat-scratch:
2276
2277Flat Scratch
2278++++++++++++
2279
2280If the kernel may use flat operations to access scratch memory, the prolog code
2281must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2282are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2283Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2284
2285GFX6
2286 Flat scratch is not supported.
2287
Tony Tye07d9f102017-11-10 01:00:54 +00002288GFX7-GFX8
Tony Tyef16a45e2017-06-06 20:31:59 +00002289 1. The low word of Flat Scratch Init is 32 bit byte offset from
2290 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2291 being managed by SPI for the queue executing the kernel dispatch. This is
2292 the same value used in the Scratch Segment Buffer V# base address. The
2293 prolog must add the value of Scratch Wave Offset to get the wave's byte
2294 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2295 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2296 by 8 before moving into FLAT_SCRATCH_LO.
2297 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2298 work-items scratch memory usage. This is directly loaded from the kernel
2299 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2300 DWORD. Having CP load it once avoids loading it at the beginning of every
2301 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2302 SIZE.
2303GFX9
2304 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2305 memory being managed by SPI for the queue executing the kernel dispatch. The
2306 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2307 pair for use as the flat scratch base in flat memory instructions.
2308
2309.. _amdgpu-amdhsa-memory-model:
2310
2311Memory Model
2312~~~~~~~~~~~~
2313
2314This section describes the mapping of LLVM memory model onto AMDGPU machine code
2315(see :ref:`memmodel`). *The implementation is WIP.*
2316
2317.. TODO
2318 Update when implementation complete.
2319
Tony Tyef16a45e2017-06-06 20:31:59 +00002320The AMDGPU backend supports the memory synchronization scopes specified in
2321:ref:`amdgpu-memory-scopes`.
2322
2323The code sequences used to implement the memory model are defined in table
2324:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2325
2326The sequences specify the order of instructions that a single thread must
2327execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2328to other memory instructions executed by the same thread. This allows them to be
2329moved earlier or later which can allow them to be combined with other instances
2330of the same instruction, or hoisted/sunk out of loops to improve
2331performance. Only the instructions related to the memory model are given;
2332additional ``s_waitcnt`` instructions are required to ensure registers are
2333defined before being used. These may be able to be combined with the memory
2334model ``s_waitcnt`` instructions as described above.
2335
Tony Tye6baa6d22017-10-18 22:16:55 +00002336The AMDGPU backend supports the following memory models:
2337
2338 HSA Memory Model [HSA]_
2339 The HSA memory model uses a single happens-before relation for all address
2340 spaces (see :ref:`amdgpu-address-spaces`).
2341 OpenCL Memory Model [OpenCL]_
2342 The OpenCL memory model which has separate happens-before relations for the
2343 global and local address spaces. Only a fence specifying both global and
2344 local address space, and seq_cst instructions join the relationships. Since
2345 the LLVM ``memfence`` instruction does not allow an address space to be
2346 specified the OpenCL fence has to convervatively assume both local and
2347 global address space was specified. However, optimizations can often be
2348 done to eliminate the additional ``s_waitcnt`` instructions when there are
2349 no intervening memory instructions which access the corresponding address
2350 space. The code sequences in the table indicate what can be omitted for the
2351 OpenCL memory. The target triple environment is used to determine if the
2352 source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tyef16a45e2017-06-06 20:31:59 +00002353
2354``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2355operations.
2356
2357``buffer/global/flat_load/store/atomic`` instructions to global memory are
2358termed vector memory operations.
2359
2360For GFX6-GFX9:
2361
2362* Each agent has multiple compute units (CU).
2363* Each CU has multiple SIMDs that execute wavefronts.
2364* The wavefronts for a single work-group are executed in the same CU but may be
2365 executed by different SIMDs.
2366* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2367 executing on it.
2368* All LDS operations of a CU are performed as wavefront wide operations in a
2369 global order and involve no caching. Completion is reported to a wavefront in
2370 execution order.
2371* The LDS memory has multiple request queues shared by the SIMDs of a
2372 CU. Therefore, the LDS operations performed by different waves of a work-group
2373 can be reordered relative to each other, which can result in reordering the
2374 visibility of vector memory operations with respect to LDS operations of other
2375 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002376 ensure synchronization between LDS operations and vector memory operations
Tony Tyef16a45e2017-06-06 20:31:59 +00002377 between waves of a work-group, but not between operations performed by the
2378 same wavefront.
2379* The vector memory operations are performed as wavefront wide operations and
2380 completion is reported to a wavefront in execution order. The exception is
Tony Tye07d9f102017-11-10 01:00:54 +00002381 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tyef16a45e2017-06-06 20:31:59 +00002382 vector memory order if they access LDS memory, and out of LDS operation order
2383 if they access global memory.
Tony Tye6baa6d22017-10-18 22:16:55 +00002384* The vector memory operations access a single vector L1 cache shared by all
2385 SIMDs a CU. Therefore, no special action is required for coherence between the
2386 lanes of a single wavefront, or for coherence between wavefronts in the same
2387 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2388 executing in different work-groups as they may be executing on different CUs.
Tony Tyef16a45e2017-06-06 20:31:59 +00002389* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2390 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2391 scalar operations are used in a restricted way so do not impact the memory
2392 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2393* The vector and scalar memory operations use an L2 cache shared by all CUs on
2394 the same agent.
2395* The L2 cache has independent channels to service disjoint ranges of virtual
2396 addresses.
2397* Each CU has a separate request queue per channel. Therefore, the vector and
2398 scalar memory operations performed by waves executing in different work-groups
2399 (which may be executing on different CUs) of an agent can be reordered
2400 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002401 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002402 previous vector memory operation has completed before executing a subsequent
2403 vector memory or LDS operation and so can be used to meet the requirements of
2404 acquire and release.
2405* The L2 cache can be kept coherent with other agents on some targets, or ranges
2406 of virtual addresses can be set up to bypass it to ensure system coherence.
2407
Tony Tye07d9f102017-11-10 01:00:54 +00002408Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tyef16a45e2017-06-06 20:31:59 +00002409or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2410memory, atomic memory orderings are not meaningful and all accesses are treated
2411as non-atomic.
2412
2413Constant address space uses ``buffer/global_load`` instructions (or equivalent
2414scalar memory instructions). Since the constant address space contents do not
2415change during the execution of a kernel dispatch it is not legal to perform
2416stores, and atomic memory orderings are not meaningful and all access are
2417treated as non-atomic.
2418
2419A memory synchronization scope wider than work-group is not meaningful for the
2420group (LDS) address space and is treated as work-group.
2421
2422The memory model does not support the region address space which is treated as
2423non-atomic.
2424
2425Acquire memory ordering is not meaningful on store atomic instructions and is
2426treated as non-atomic.
2427
2428Release memory ordering is not meaningful on load atomic instructions and is
2429treated a non-atomic.
2430
2431Acquire-release memory ordering is not meaningful on load or store atomic
2432instructions and is treated as acquire and release respectively.
2433
2434AMDGPU backend only uses scalar memory operations to access memory that is
2435proven to not change during the execution of the kernel dispatch. This includes
2436constant address space and global address space for program scope const
2437variables. Therefore the kernel machine code does not have to maintain the
2438scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2439and vector L1 caches are invalidated between kernel dispatches by CP since
2440constant address space data may change between kernel dispatch executions. See
2441:ref:`amdgpu-amdhsa-memory-spaces`.
2442
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002443The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002444case the AMDGPU backend ensures the memory location used to spill is never
2445accessed by vector memory operations at the same time. If scalar writes are used
2446then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2447return since the locations may be used for vector memory instructions by a
2448future wave that uses the same scratch area, or a function call that creates a
2449frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2450as all scalar writes are write-before-read in the same thread.
2451
Tony Tye6baa6d22017-10-18 22:16:55 +00002452Scratch backing memory (which is used for the private address space)
2453is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2454address space is only accessed by a single thread, and is always
2455write-before-read, there is never a need to invalidate these entries from the L1
2456cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2457volatile cache lines.
Tony Tyef16a45e2017-06-06 20:31:59 +00002458
2459On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye6baa6d22017-10-18 22:16:55 +00002460to invalidate the L2 cache. This also causes it to be treated as
2461non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2462(cache coherent) and so the L2 cache will coherent with the CPU and other
2463agents.
Tony Tyef16a45e2017-06-06 20:31:59 +00002464
2465 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2466 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2467
Tony Tye6baa6d22017-10-18 22:16:55 +00002468 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002469 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2470 Ordering Sync Scope Address
2471 Space
Tony Tye6baa6d22017-10-18 22:16:55 +00002472 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002473 **Non-Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002474 -----------------------------------------------------------------------------------
2475 load *none* *none* - global - !volatile & !nontemporal
2476 - generic
2477 - private 1. buffer/global/flat_load
2478 - constant
2479 - volatile & !nontemporal
2480
Tony Tyef16a45e2017-06-06 20:31:59 +00002481 1. buffer/global/flat_load
2482 glc=1
Tony Tye6baa6d22017-10-18 22:16:55 +00002483
2484 - nontemporal
2485
2486 1. buffer/global/flat_load
2487 glc=1 slc=1
2488
Tony Tyef16a45e2017-06-06 20:31:59 +00002489 load *none* *none* - local 1. ds_load
Tony Tye6baa6d22017-10-18 22:16:55 +00002490 store *none* *none* - global - !nontemporal
Tony Tyef16a45e2017-06-06 20:31:59 +00002491 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002492 - private 1. buffer/global/flat_store
2493 - constant
2494 - nontemporal
2495
2496 1. buffer/global/flat_stote
2497 glc=1 slc=1
2498
Tony Tyef16a45e2017-06-06 20:31:59 +00002499 store *none* *none* - local 1. ds_store
2500 **Unordered Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002501 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002502 load atomic unordered *any* *any* *Same as non-atomic*.
2503 store atomic unordered *any* *any* *Same as non-atomic*.
2504 atomicrmw unordered *any* *any* *Same as monotonic
2505 atomic*.
2506 **Monotonic Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002507 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002508 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2509 - wavefront - generic
2510 - workgroup
2511 load atomic monotonic - singlethread - local 1. ds_load
2512 - wavefront
2513 - workgroup
2514 load atomic monotonic - agent - global 1. buffer/global/flat_load
2515 - system - generic glc=1
2516 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2517 - wavefront - generic
2518 - workgroup
2519 - agent
2520 - system
2521 store atomic monotonic - singlethread - local 1. ds_store
2522 - wavefront
2523 - workgroup
2524 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2525 - wavefront - generic
2526 - workgroup
2527 - agent
2528 - system
2529 atomicrmw monotonic - singlethread - local 1. ds_atomic
2530 - wavefront
2531 - workgroup
2532 **Acquire Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002533 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002534 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2535 - wavefront - local
2536 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002537 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2538 load atomic acquire - workgroup - local 1. ds_load
2539 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002540
Tony Tye6baa6d22017-10-18 22:16:55 +00002541 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002542 - Must happen before
2543 any following
2544 global/generic
2545 load/load
2546 atomic/store/store
2547 atomic/atomicrmw.
2548 - Ensures any
2549 following global
2550 data read is no
2551 older than the load
2552 atomic value being
2553 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00002554 load atomic acquire - workgroup - generic 1. flat_load
2555 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002556
Tony Tye6baa6d22017-10-18 22:16:55 +00002557 - If OpenCL, omit.
2558 - Must happen before
2559 any following
2560 global/generic
2561 load/load
2562 atomic/store/store
2563 atomic/atomicrmw.
2564 - Ensures any
2565 following global
2566 data read is no
2567 older than the load
2568 atomic value being
2569 acquired.
2570 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tyef16a45e2017-06-06 20:31:59 +00002571 - system glc=1
2572 2. s_waitcnt vmcnt(0)
2573
2574 - Must happen before
2575 following
2576 buffer_wbinvl1_vol.
2577 - Ensures the load
2578 has completed
2579 before invalidating
2580 the cache.
2581
2582 3. buffer_wbinvl1_vol
2583
2584 - Must happen before
2585 any following
2586 global/generic
2587 load/load
2588 atomic/atomicrmw.
2589 - Ensures that
2590 following
2591 loads will not see
2592 stale global data.
2593
2594 load atomic acquire - agent - generic 1. flat_load glc=1
2595 - system 2. s_waitcnt vmcnt(0) &
2596 lgkmcnt(0)
2597
2598 - If OpenCL omit
2599 lgkmcnt(0).
2600 - Must happen before
2601 following
2602 buffer_wbinvl1_vol.
2603 - Ensures the flat_load
2604 has completed
2605 before invalidating
2606 the cache.
2607
2608 3. buffer_wbinvl1_vol
2609
2610 - Must happen before
2611 any following
2612 global/generic
2613 load/load
2614 atomic/atomicrmw.
2615 - Ensures that
2616 following loads
2617 will not see stale
2618 global data.
2619
2620 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2621 - wavefront - local
2622 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002623 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2624 atomicrmw acquire - workgroup - local 1. ds_atomic
2625 2. waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002626
Tony Tye6baa6d22017-10-18 22:16:55 +00002627 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002628 - Must happen before
2629 any following
2630 global/generic
2631 load/load
2632 atomic/store/store
2633 atomic/atomicrmw.
2634 - Ensures any
2635 following global
2636 data read is no
2637 older than the
2638 atomicrmw value
2639 being acquired.
2640
Tony Tye6baa6d22017-10-18 22:16:55 +00002641 atomicrmw acquire - workgroup - generic 1. flat_atomic
2642 2. waitcnt lgkmcnt(0)
2643
2644 - If OpenCL, omit.
2645 - Must happen before
2646 any following
2647 global/generic
2648 load/load
2649 atomic/store/store
2650 atomic/atomicrmw.
2651 - Ensures any
2652 following global
2653 data read is no
2654 older than the
2655 atomicrmw value
2656 being acquired.
2657
2658 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002659 - system 2. s_waitcnt vmcnt(0)
2660
2661 - Must happen before
2662 following
2663 buffer_wbinvl1_vol.
2664 - Ensures the
2665 atomicrmw has
2666 completed before
2667 invalidating the
2668 cache.
2669
2670 3. buffer_wbinvl1_vol
2671
2672 - Must happen before
2673 any following
2674 global/generic
2675 load/load
2676 atomic/atomicrmw.
2677 - Ensures that
2678 following loads
2679 will not see stale
2680 global data.
2681
2682 atomicrmw acquire - agent - generic 1. flat_atomic
2683 - system 2. s_waitcnt vmcnt(0) &
2684 lgkmcnt(0)
2685
2686 - If OpenCL, omit
2687 lgkmcnt(0).
2688 - Must happen before
2689 following
2690 buffer_wbinvl1_vol.
2691 - Ensures the
2692 atomicrmw has
2693 completed before
2694 invalidating the
2695 cache.
2696
2697 3. buffer_wbinvl1_vol
2698
2699 - Must happen before
2700 any following
2701 global/generic
2702 load/load
2703 atomic/atomicrmw.
2704 - Ensures that
2705 following loads
2706 will not see stale
2707 global data.
2708
2709 fence acquire - singlethread *none* *none*
2710 - wavefront
2711 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2712
2713 - If OpenCL and
2714 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002715 not generic, omit.
2716 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002717 currently has no
2718 address space on
2719 the fence need to
2720 conservatively
2721 always generate. If
2722 fence had an
2723 address space then
2724 set to address
2725 space of OpenCL
2726 fence flag, or to
2727 generic if both
2728 local and global
2729 flags are
2730 specified.
2731 - Must happen after
2732 any preceding
2733 local/generic load
2734 atomic/atomicrmw
2735 with an equal or
2736 wider sync scope
2737 and memory ordering
2738 stronger than
2739 unordered (this is
2740 termed the
2741 fence-paired-atomic).
2742 - Must happen before
2743 any following
2744 global/generic
2745 load/load
2746 atomic/store/store
2747 atomic/atomicrmw.
2748 - Ensures any
2749 following global
2750 data read is no
2751 older than the
2752 value read by the
2753 fence-paired-atomic.
2754
Tony Tye6baa6d22017-10-18 22:16:55 +00002755 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2756 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002757
2758 - If OpenCL and
2759 address space is
2760 not generic, omit
2761 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002762 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002763 currently has no
2764 address space on
2765 the fence need to
2766 conservatively
2767 always generate
2768 (see comment for
2769 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002770 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002771 separate s_waitcnt
2772 vmcnt(0) and
2773 s_waitcnt
2774 lgkmcnt(0) to allow
2775 them to be
2776 independently moved
2777 according to the
2778 following rules.
2779 - s_waitcnt vmcnt(0)
2780 must happen after
2781 any preceding
2782 global/generic load
2783 atomic/atomicrmw
2784 with an equal or
2785 wider sync scope
2786 and memory ordering
2787 stronger than
2788 unordered (this is
2789 termed the
2790 fence-paired-atomic).
2791 - s_waitcnt lgkmcnt(0)
2792 must happen after
2793 any preceding
Tony Tye6baa6d22017-10-18 22:16:55 +00002794 local/generic load
Tony Tyef16a45e2017-06-06 20:31:59 +00002795 atomic/atomicrmw
2796 with an equal or
2797 wider sync scope
2798 and memory ordering
2799 stronger than
2800 unordered (this is
2801 termed the
2802 fence-paired-atomic).
2803 - Must happen before
2804 the following
2805 buffer_wbinvl1_vol.
2806 - Ensures that the
2807 fence-paired atomic
2808 has completed
2809 before invalidating
2810 the
2811 cache. Therefore
2812 any following
2813 locations read must
2814 be no older than
2815 the value read by
2816 the
2817 fence-paired-atomic.
2818
2819 2. buffer_wbinvl1_vol
2820
Tony Tye6baa6d22017-10-18 22:16:55 +00002821 - Must happen before any
2822 following global/generic
Tony Tyef16a45e2017-06-06 20:31:59 +00002823 load/load
2824 atomic/store/store
2825 atomic/atomicrmw.
2826 - Ensures that
2827 following loads
2828 will not see stale
2829 global data.
2830
2831 **Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002832 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002833 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2834 - wavefront - local
2835 - generic
2836 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002837
2838 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002839 - Must happen after
2840 any preceding
2841 local/generic
2842 load/store/load
2843 atomic/store
2844 atomic/atomicrmw.
2845 - Must happen before
2846 the following
2847 store.
2848 - Ensures that all
2849 memory operations
2850 to local have
2851 completed before
2852 performing the
2853 store that is being
2854 released.
2855
2856 2. buffer/global/flat_store
2857 store atomic release - workgroup - local 1. ds_store
Tony Tye6baa6d22017-10-18 22:16:55 +00002858 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2859
2860 - If OpenCL, omit.
2861 - Must happen after
2862 any preceding
2863 local/generic
2864 load/store/load
2865 atomic/store
2866 atomic/atomicrmw.
2867 - Must happen before
2868 the following
2869 store.
2870 - Ensures that all
2871 memory operations
2872 to local have
2873 completed before
2874 performing the
2875 store that is being
2876 released.
2877
2878 2. flat_store
2879 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2880 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002881
2882 - If OpenCL, omit
2883 lgkmcnt(0).
2884 - Could be split into
2885 separate s_waitcnt
2886 vmcnt(0) and
2887 s_waitcnt
2888 lgkmcnt(0) to allow
2889 them to be
2890 independently moved
2891 according to the
2892 following rules.
2893 - s_waitcnt vmcnt(0)
2894 must happen after
2895 any preceding
2896 global/generic
2897 load/store/load
2898 atomic/store
2899 atomic/atomicrmw.
2900 - s_waitcnt lgkmcnt(0)
2901 must happen after
2902 any preceding
2903 local/generic
2904 load/store/load
2905 atomic/store
2906 atomic/atomicrmw.
2907 - Must happen before
2908 the following
2909 store.
2910 - Ensures that all
2911 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00002912 to memory have
Tony Tyef16a45e2017-06-06 20:31:59 +00002913 completed before
2914 performing the
2915 store that is being
2916 released.
2917
2918 2. buffer/global/ds/flat_store
2919 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2920 - wavefront - local
2921 - generic
2922 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002923
2924 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002925 - Must happen after
2926 any preceding
2927 local/generic
2928 load/store/load
2929 atomic/store
2930 atomic/atomicrmw.
2931 - Must happen before
2932 the following
2933 atomicrmw.
2934 - Ensures that all
2935 memory operations
2936 to local have
2937 completed before
2938 performing the
2939 atomicrmw that is
2940 being released.
2941
2942 2. buffer/global/flat_atomic
2943 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye6baa6d22017-10-18 22:16:55 +00002944 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2945
2946 - If OpenCL, omit.
2947 - Must happen after
2948 any preceding
2949 local/generic
2950 load/store/load
2951 atomic/store
2952 atomic/atomicrmw.
2953 - Must happen before
2954 the following
2955 atomicrmw.
2956 - Ensures that all
2957 memory operations
2958 to local have
2959 completed before
2960 performing the
2961 atomicrmw that is
2962 being released.
2963
2964 2. flat_atomic
2965 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
2966 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002967
2968 - If OpenCL, omit
2969 lgkmcnt(0).
2970 - Could be split into
2971 separate s_waitcnt
2972 vmcnt(0) and
2973 s_waitcnt
2974 lgkmcnt(0) to allow
2975 them to be
2976 independently moved
2977 according to the
2978 following rules.
2979 - s_waitcnt vmcnt(0)
2980 must happen after
2981 any preceding
2982 global/generic
2983 load/store/load
2984 atomic/store
2985 atomic/atomicrmw.
2986 - s_waitcnt lgkmcnt(0)
2987 must happen after
2988 any preceding
2989 local/generic
2990 load/store/load
2991 atomic/store
2992 atomic/atomicrmw.
2993 - Must happen before
2994 the following
2995 atomicrmw.
2996 - Ensures that all
2997 memory operations
2998 to global and local
2999 have completed
3000 before performing
3001 the atomicrmw that
3002 is being released.
3003
Tony Tye6baa6d22017-10-18 22:16:55 +00003004 2. buffer/global/ds/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003005 fence release - singlethread *none* *none*
3006 - wavefront
3007 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3008
3009 - If OpenCL and
3010 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003011 not generic, omit.
3012 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003013 currently has no
3014 address space on
3015 the fence need to
3016 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003017 always generate. If
3018 fence had an
3019 address space then
3020 set to address
3021 space of OpenCL
3022 fence flag, or to
3023 generic if both
3024 local and global
3025 flags are
3026 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003027 - Must happen after
3028 any preceding
3029 local/generic
3030 load/load
3031 atomic/store/store
3032 atomic/atomicrmw.
3033 - Must happen before
3034 any following store
3035 atomic/atomicrmw
3036 with an equal or
3037 wider sync scope
3038 and memory ordering
3039 stronger than
3040 unordered (this is
3041 termed the
3042 fence-paired-atomic).
3043 - Ensures that all
3044 memory operations
3045 to local have
3046 completed before
3047 performing the
3048 following
3049 fence-paired-atomic.
3050
Tony Tye6baa6d22017-10-18 22:16:55 +00003051 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3052 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003053
3054 - If OpenCL and
3055 address space is
3056 not generic, omit
3057 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003058 - If OpenCL and
3059 address space is
3060 local, omit
3061 vmcnt(0).
3062 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003063 currently has no
3064 address space on
3065 the fence need to
3066 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003067 always generate. If
3068 fence had an
3069 address space then
3070 set to address
3071 space of OpenCL
3072 fence flag, or to
3073 generic if both
3074 local and global
3075 flags are
3076 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003077 - Could be split into
3078 separate s_waitcnt
3079 vmcnt(0) and
3080 s_waitcnt
3081 lgkmcnt(0) to allow
3082 them to be
3083 independently moved
3084 according to the
3085 following rules.
3086 - s_waitcnt vmcnt(0)
3087 must happen after
3088 any preceding
3089 global/generic
3090 load/store/load
3091 atomic/store
3092 atomic/atomicrmw.
3093 - s_waitcnt lgkmcnt(0)
3094 must happen after
3095 any preceding
3096 local/generic
3097 load/store/load
3098 atomic/store
3099 atomic/atomicrmw.
3100 - Must happen before
3101 any following store
3102 atomic/atomicrmw
3103 with an equal or
3104 wider sync scope
3105 and memory ordering
3106 stronger than
3107 unordered (this is
3108 termed the
3109 fence-paired-atomic).
3110 - Ensures that all
3111 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00003112 have
Tony Tyef16a45e2017-06-06 20:31:59 +00003113 completed before
3114 performing the
3115 following
3116 fence-paired-atomic.
3117
3118 **Acquire-Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003119 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003120 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3121 - wavefront - local
3122 - generic
3123 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3124
Tony Tye6baa6d22017-10-18 22:16:55 +00003125 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003126 - Must happen after
3127 any preceding
3128 local/generic
3129 load/store/load
3130 atomic/store
3131 atomic/atomicrmw.
3132 - Must happen before
3133 the following
3134 atomicrmw.
3135 - Ensures that all
3136 memory operations
3137 to local have
3138 completed before
3139 performing the
3140 atomicrmw that is
3141 being released.
3142
Tony Tye6baa6d22017-10-18 22:16:55 +00003143 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003144 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3145 2. s_waitcnt lgkmcnt(0)
3146
Tony Tye6baa6d22017-10-18 22:16:55 +00003147 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003148 - Must happen before
3149 any following
3150 global/generic
3151 load/load
3152 atomic/store/store
3153 atomic/atomicrmw.
3154 - Ensures any
3155 following global
3156 data read is no
3157 older than the load
3158 atomic value being
3159 acquired.
3160
3161 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3162
Tony Tye6baa6d22017-10-18 22:16:55 +00003163 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003164 - Must happen after
3165 any preceding
3166 local/generic
3167 load/store/load
3168 atomic/store
3169 atomic/atomicrmw.
3170 - Must happen before
3171 the following
3172 atomicrmw.
3173 - Ensures that all
3174 memory operations
3175 to local have
3176 completed before
3177 performing the
3178 atomicrmw that is
3179 being released.
3180
3181 2. flat_atomic
3182 3. s_waitcnt lgkmcnt(0)
3183
Tony Tye6baa6d22017-10-18 22:16:55 +00003184 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003185 - Must happen before
3186 any following
3187 global/generic
3188 load/load
3189 atomic/store/store
3190 atomic/atomicrmw.
3191 - Ensures any
3192 following global
3193 data read is no
3194 older than the load
3195 atomic value being
3196 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00003197
3198 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3199 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003200
3201 - If OpenCL, omit
3202 lgkmcnt(0).
3203 - Could be split into
3204 separate s_waitcnt
3205 vmcnt(0) and
3206 s_waitcnt
3207 lgkmcnt(0) to allow
3208 them to be
3209 independently moved
3210 according to the
3211 following rules.
3212 - s_waitcnt vmcnt(0)
3213 must happen after
3214 any preceding
3215 global/generic
3216 load/store/load
3217 atomic/store
3218 atomic/atomicrmw.
3219 - s_waitcnt lgkmcnt(0)
3220 must happen after
3221 any preceding
3222 local/generic
3223 load/store/load
3224 atomic/store
3225 atomic/atomicrmw.
3226 - Must happen before
3227 the following
3228 atomicrmw.
3229 - Ensures that all
3230 memory operations
3231 to global have
3232 completed before
3233 performing the
3234 atomicrmw that is
3235 being released.
3236
Tony Tye6baa6d22017-10-18 22:16:55 +00003237 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003238 3. s_waitcnt vmcnt(0)
3239
3240 - Must happen before
3241 following
3242 buffer_wbinvl1_vol.
3243 - Ensures the
3244 atomicrmw has
3245 completed before
3246 invalidating the
3247 cache.
3248
3249 4. buffer_wbinvl1_vol
3250
3251 - Must happen before
3252 any following
3253 global/generic
3254 load/load
3255 atomic/atomicrmw.
3256 - Ensures that
3257 following loads
3258 will not see stale
3259 global data.
3260
Tony Tye6baa6d22017-10-18 22:16:55 +00003261 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3262 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003263
3264 - If OpenCL, omit
3265 lgkmcnt(0).
3266 - Could be split into
3267 separate s_waitcnt
3268 vmcnt(0) and
3269 s_waitcnt
3270 lgkmcnt(0) to allow
3271 them to be
3272 independently moved
3273 according to the
3274 following rules.
3275 - s_waitcnt vmcnt(0)
3276 must happen after
3277 any preceding
3278 global/generic
3279 load/store/load
3280 atomic/store
3281 atomic/atomicrmw.
3282 - s_waitcnt lgkmcnt(0)
3283 must happen after
3284 any preceding
3285 local/generic
3286 load/store/load
3287 atomic/store
3288 atomic/atomicrmw.
3289 - Must happen before
3290 the following
3291 atomicrmw.
3292 - Ensures that all
3293 memory operations
3294 to global have
3295 completed before
3296 performing the
3297 atomicrmw that is
3298 being released.
3299
3300 2. flat_atomic
3301 3. s_waitcnt vmcnt(0) &
3302 lgkmcnt(0)
3303
3304 - If OpenCL, omit
3305 lgkmcnt(0).
3306 - Must happen before
3307 following
3308 buffer_wbinvl1_vol.
3309 - Ensures the
3310 atomicrmw has
3311 completed before
3312 invalidating the
3313 cache.
3314
3315 4. buffer_wbinvl1_vol
3316
3317 - Must happen before
3318 any following
3319 global/generic
3320 load/load
3321 atomic/atomicrmw.
3322 - Ensures that
3323 following loads
3324 will not see stale
3325 global data.
3326
3327 fence acq_rel - singlethread *none* *none*
3328 - wavefront
3329 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3330
3331 - If OpenCL and
3332 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003333 not generic, omit.
3334 - However,
Tony Tyef16a45e2017-06-06 20:31:59 +00003335 since LLVM
3336 currently has no
3337 address space on
3338 the fence need to
3339 conservatively
3340 always generate
3341 (see comment for
3342 previous fence).
3343 - Must happen after
3344 any preceding
3345 local/generic
3346 load/load
3347 atomic/store/store
3348 atomic/atomicrmw.
3349 - Must happen before
3350 any following
3351 global/generic
3352 load/load
3353 atomic/store/store
3354 atomic/atomicrmw.
3355 - Ensures that all
3356 memory operations
3357 to local have
3358 completed before
3359 performing any
3360 following global
3361 memory operations.
3362 - Ensures that the
3363 preceding
3364 local/generic load
3365 atomic/atomicrmw
3366 with an equal or
3367 wider sync scope
3368 and memory ordering
3369 stronger than
3370 unordered (this is
3371 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003372 acquire-fence-paired-atomic
3373 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003374 before following
3375 global memory
3376 operations. This
3377 satisfies the
3378 requirements of
3379 acquire.
3380 - Ensures that all
3381 previous memory
3382 operations have
3383 completed before a
3384 following
3385 local/generic store
3386 atomic/atomicrmw
3387 with an equal or
3388 wider sync scope
3389 and memory ordering
3390 stronger than
3391 unordered (this is
3392 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003393 release-fence-paired-atomic
3394 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003395 requirements of
3396 release.
3397
Tony Tye6baa6d22017-10-18 22:16:55 +00003398 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3399 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003400
3401 - If OpenCL and
3402 address space is
3403 not generic, omit
3404 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003405 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003406 currently has no
3407 address space on
3408 the fence need to
3409 conservatively
3410 always generate
3411 (see comment for
3412 previous fence).
3413 - Could be split into
3414 separate s_waitcnt
3415 vmcnt(0) and
3416 s_waitcnt
3417 lgkmcnt(0) to allow
3418 them to be
3419 independently moved
3420 according to the
3421 following rules.
3422 - s_waitcnt vmcnt(0)
3423 must happen after
3424 any preceding
3425 global/generic
3426 load/store/load
3427 atomic/store
3428 atomic/atomicrmw.
3429 - s_waitcnt lgkmcnt(0)
3430 must happen after
3431 any preceding
3432 local/generic
3433 load/store/load
3434 atomic/store
3435 atomic/atomicrmw.
3436 - Must happen before
3437 the following
3438 buffer_wbinvl1_vol.
3439 - Ensures that the
3440 preceding
3441 global/local/generic
3442 load
3443 atomic/atomicrmw
3444 with an equal or
3445 wider sync scope
3446 and memory ordering
3447 stronger than
3448 unordered (this is
3449 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003450 acquire-fence-paired-atomic
3451 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003452 before invalidating
3453 the cache. This
3454 satisfies the
3455 requirements of
3456 acquire.
3457 - Ensures that all
3458 previous memory
3459 operations have
3460 completed before a
3461 following
3462 global/local/generic
3463 store
3464 atomic/atomicrmw
3465 with an equal or
3466 wider sync scope
3467 and memory ordering
3468 stronger than
3469 unordered (this is
3470 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003471 release-fence-paired-atomic
3472 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003473 requirements of
3474 release.
3475
3476 2. buffer_wbinvl1_vol
3477
3478 - Must happen before
3479 any following
3480 global/generic
3481 load/load
3482 atomic/store/store
3483 atomic/atomicrmw.
3484 - Ensures that
3485 following loads
3486 will not see stale
3487 global data. This
3488 satisfies the
3489 requirements of
3490 acquire.
3491
3492 **Sequential Consistent Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003493 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003494 load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003495 - wavefront - local load atomic acquire,
3496 - generic except must generated
3497 all instructions even
3498 for OpenCL.*
3499 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3500 - generic
3501 - Must
3502 happen after
3503 preceding
3504 global/generic load
3505 atomic/store
3506 atomic/atomicrmw
3507 with memory
3508 ordering of seq_cst
3509 and with equal or
3510 wider sync scope.
3511 (Note that seq_cst
3512 fences have their
3513 own s_waitcnt
3514 lgkmcnt(0) and so do
3515 not need to be
3516 considered.)
3517 - Ensures any
3518 preceding
3519 sequential
3520 consistent local
3521 memory instructions
3522 have completed
3523 before executing
3524 this sequentially
3525 consistent
3526 instruction. This
3527 prevents reordering
3528 a seq_cst store
3529 followed by a
3530 seq_cst load. (Note
3531 that seq_cst is
3532 stronger than
3533 acquire/release as
3534 the reordering of
3535 load acquire
3536 followed by a store
3537 release is
3538 prevented by the
3539 waitcnt of
3540 the release, but
3541 there is nothing
3542 preventing a store
3543 release followed by
3544 load acquire from
3545 competing out of
3546 order.)
3547
3548 2. *Following
3549 instructions same as
3550 corresponding load
3551 atomic acquire,
3552 except must generated
3553 all instructions even
3554 for OpenCL.*
3555 load atomic seq_cst - workgroup - local *Same as corresponding
3556 load atomic acquire,
3557 except must generated
3558 all instructions even
3559 for OpenCL.*
3560 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3561 - system - generic vmcnt(0)
3562
3563 - Could be split into
3564 separate s_waitcnt
3565 vmcnt(0)
3566 and s_waitcnt
3567 lgkmcnt(0) to allow
3568 them to be
3569 independently moved
3570 according to the
3571 following rules.
3572 - waitcnt lgkmcnt(0)
3573 must happen after
3574 preceding
3575 global/generic load
3576 atomic/store
3577 atomic/atomicrmw
3578 with memory
3579 ordering of seq_cst
3580 and with equal or
3581 wider sync scope.
3582 (Note that seq_cst
3583 fences have their
3584 own s_waitcnt
3585 lgkmcnt(0) and so do
3586 not need to be
3587 considered.)
3588 - waitcnt vmcnt(0)
3589 must happen after
Tony Tyef16a45e2017-06-06 20:31:59 +00003590 preceding
3591 global/generic load
3592 atomic/store
3593 atomic/atomicrmw
3594 with memory
3595 ordering of seq_cst
3596 and with equal or
3597 wider sync scope.
3598 (Note that seq_cst
3599 fences have their
3600 own s_waitcnt
3601 vmcnt(0) and so do
3602 not need to be
3603 considered.)
3604 - Ensures any
3605 preceding
3606 sequential
3607 consistent global
3608 memory instructions
3609 have completed
3610 before executing
3611 this sequentially
3612 consistent
3613 instruction. This
3614 prevents reordering
3615 a seq_cst store
3616 followed by a
Tony Tye6baa6d22017-10-18 22:16:55 +00003617 seq_cst load. (Note
Tony Tyef16a45e2017-06-06 20:31:59 +00003618 that seq_cst is
3619 stronger than
3620 acquire/release as
3621 the reordering of
3622 load acquire
3623 followed by a store
3624 release is
3625 prevented by the
Tony Tye6baa6d22017-10-18 22:16:55 +00003626 waitcnt of
Tony Tyef16a45e2017-06-06 20:31:59 +00003627 the release, but
3628 there is nothing
3629 preventing a store
3630 release followed by
3631 load acquire from
3632 competing out of
3633 order.)
3634
3635 2. *Following
3636 instructions same as
3637 corresponding load
Tony Tye6baa6d22017-10-18 22:16:55 +00003638 atomic acquire,
3639 except must generated
3640 all instructions even
3641 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003642 store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003643 - wavefront - local store atomic release,
3644 - workgroup - generic except must generated
3645 all instructions even
3646 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003647 store atomic seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003648 - system - generic store atomic release,
3649 except must generated
3650 all instructions even
3651 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003652 atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003653 - wavefront - local atomicrmw acq_rel,
3654 - workgroup - generic except must generated
3655 all instructions even
3656 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003657 atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003658 - system - generic atomicrmw acq_rel,
3659 except must generated
3660 all instructions even
3661 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003662 fence seq_cst - singlethread *none* *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003663 - wavefront fence acq_rel,
3664 - workgroup except must generated
3665 - agent all instructions even
3666 - system for OpenCL.*
3667 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00003668
3669The memory order also adds the single thread optimization constrains defined in
3670table
3671:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3672
3673 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3674 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3675
3676 ============ ==============================================================
3677 LLVM Memory Optimization Constraints
3678 Ordering
3679 ============ ==============================================================
3680 unordered *none*
3681 monotonic *none*
3682 acquire - If a load atomic/atomicrmw then no following load/load
3683 atomic/store/ store atomic/atomicrmw/fence instruction can
3684 be moved before the acquire.
3685 - If a fence then same as load atomic, plus no preceding
3686 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003687 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003688 atomic/store/ store atomic/atomicrmw/fence instruction can
3689 be moved after the release.
3690 - If a fence then same as store atomic, plus no following
3691 associated fence-paired-atomic can be moved before the
3692 fence.
3693 acq_rel Same constraints as both acquire and release.
3694 seq_cst - If a load atomic then same constraints as acquire, plus no
3695 preceding sequentially consistent load atomic/store
3696 atomic/atomicrmw/fence instruction can be moved after the
3697 seq_cst.
3698 - If a store atomic then the same constraints as release, plus
3699 no following sequentially consistent load atomic/store
3700 atomic/atomicrmw/fence instruction can be moved before the
3701 seq_cst.
3702 - If an atomicrmw/fence then same constraints as acq_rel.
3703 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003704
Wei Ding16289cf2017-02-21 18:48:01 +00003705Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003706~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003707
Tony Tyef16a45e2017-06-06 20:31:59 +00003708For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3709(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3710the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003711
Tony Tyef16a45e2017-06-06 20:31:59 +00003712 .. table:: AMDGPU Trap Handler for AMDHSA OS
3713 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003714
Tony Tyef16a45e2017-06-06 20:31:59 +00003715 =================== =============== =============== =======================
3716 Usage Code Sequence Trap Handler Description
3717 Inputs
3718 =================== =============== =============== =======================
3719 reserved ``s_trap 0x00`` Reserved by hardware.
3720 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3721 ``queue_ptr`` ``debugtrap``
3722 ``VGPR0``: intrinsic (not
3723 ``arg`` implemented).
3724 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3725 ``queue_ptr`` terminated and its
3726 associated queue put
3727 into the error state.
3728 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3729 ``queue_ptr`` installed handled
3730 same as ``llvm.trap``.
3731 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3732 breakpoints.
3733 debugger ``s_trap 0x08`` Reserved for debugger.
3734 debugger ``s_trap 0xfe`` Reserved for debugger.
3735 debugger ``s_trap 0xff`` Reserved for debugger.
3736 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003737
Tony Tye46d35762017-08-15 20:47:41 +00003738Unspecified OS
3739--------------
3740
3741This section provides code conventions used when the target triple OS is
3742empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003743
3744Trap Handler ABI
3745~~~~~~~~~~~~~~~~
3746
3747For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3748not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3749instructions are handled as follows:
3750
3751 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3752 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3753
3754 =============== =============== ===========================================
3755 Usage Code Sequence Description
3756 =============== =============== ===========================================
3757 llvm.trap s_endpgm Causes wavefront to be terminated.
3758 llvm.debugtrap *none* Compiler warning given that there is no
3759 trap handler installed.
3760 =============== =============== ===========================================
3761
3762Source Languages
3763================
3764
3765.. _amdgpu-opencl:
3766
3767OpenCL
3768------
3769
3770When generating code for the OpenCL language the target triple environment
3771should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3772
3773When the language is OpenCL the following differences occur:
3774
37751. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
37762. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye46d35762017-08-15 20:47:41 +000037773. Additional metadata is generated
3778 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003779
3780.. TODO
3781 Specify what affect this has. Hidden arguments added. Additional metadata
3782 generated.
3783
3784.. _amdgpu-hcc:
3785
3786HCC
3787---
3788
3789When generating code for the OpenCL language the target triple environment
3790should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3791
3792When the language is OpenCL the following differences occur:
3793
37941. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3795
3796.. TODO
3797 Specify what affect this has.
Tom Stellard3ec09e62016-04-06 01:29:19 +00003798
Tom Stellard45bb48e2015-06-13 03:28:10 +00003799Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003800---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003801
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003802AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef16a45e2017-06-06 20:31:59 +00003803It supports AMDGCN GFX6-GFX8.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003804
Tony Tyef16a45e2017-06-06 20:31:59 +00003805This section describes general syntax for instructions and operands. For more
3806information about instructions, their semantics and supported combinations of
3807operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00003808[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003809
Tony Tyef16a45e2017-06-06 20:31:59 +00003810An instruction has the following syntax (register operands are normally
3811comma-separated while extra operands are space-separated):
Tom Stellard45bb48e2015-06-13 03:28:10 +00003812
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003813*<opcode> <register_operand0>, ... <extra_operand0> ...*
Tom Stellard45bb48e2015-06-13 03:28:10 +00003814
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003815Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00003816~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00003817
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003818The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003819
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003820* SGPR registers: s0, ... or s[0], ...
3821* VGPR registers: v0, ... or v[0], ...
3822* TTMP registers: ttmp0, ... or ttmp[0], ...
3823* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3824* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3825* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3826* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3827* Register index expressions: v[2*2], s[1-1:2-1]
3828* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00003829
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003830The following extra operands are supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003831
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003832* offset, offset0, offset1
3833* idxen, offen bits
3834* glc, slc, tfe bits
3835* waitcnt: integer or combination of counter values
3836* VOP3 modifiers:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003837
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003838 - abs (\| \|), neg (\-)
Tom Stellard45bb48e2015-06-13 03:28:10 +00003839
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003840* DPP modifiers:
3841
3842 - row_shl, row_shr, row_ror, row_rol
3843 - row_mirror, row_half_mirror, row_bcast
3844 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3845 - row_mask, bank_mask, bound_ctrl
3846
3847* SDWA modifiers:
3848
3849 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3850 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3851 - abs, neg, sext
3852
Tony Tyef16a45e2017-06-06 20:31:59 +00003853Instruction Examples
3854~~~~~~~~~~~~~~~~~~~~
3855
3856DS
3857~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003858
3859.. code-block:: nasm
3860
3861 ds_add_u32 v2, v4 offset:16
3862 ds_write_src2_b64 v2 offset0:4 offset1:8
3863 ds_cmpst_f32 v2, v4, v6
3864 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3865
3866
3867For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3868
Tony Tyef16a45e2017-06-06 20:31:59 +00003869FLAT
3870++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003871
3872.. code-block:: nasm
3873
3874 flat_load_dword v1, v[3:4]
3875 flat_store_dwordx3 v[3:4], v[5:7]
3876 flat_atomic_swap v1, v[3:4], v5 glc
3877 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3878 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3879
3880For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3881
Tony Tyef16a45e2017-06-06 20:31:59 +00003882MUBUF
3883+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003884
3885.. code-block:: nasm
3886
3887 buffer_load_dword v1, off, s[4:7], s1
3888 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3889 buffer_store_format_xy v[1:2], off, s[4:7], s1
3890 buffer_wbinvl1
3891 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3892
3893For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3894
Tony Tyef16a45e2017-06-06 20:31:59 +00003895SMRD/SMEM
3896+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003897
3898.. code-block:: nasm
3899
3900 s_load_dword s1, s[2:3], 0xfc
3901 s_load_dwordx8 s[8:15], s[2:3], s4
3902 s_load_dwordx16 s[88:103], s[2:3], s4
3903 s_dcache_inv_vol
3904 s_memtime s[4:5]
3905
3906For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3907
Tony Tyef16a45e2017-06-06 20:31:59 +00003908SOP1
3909++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003910
3911.. code-block:: nasm
3912
3913 s_mov_b32 s1, s2
3914 s_mov_b64 s[0:1], 0x80000000
3915 s_cmov_b32 s1, 200
3916 s_wqm_b64 s[2:3], s[4:5]
3917 s_bcnt0_i32_b64 s1, s[2:3]
3918 s_swappc_b64 s[2:3], s[4:5]
3919 s_cbranch_join s[4:5]
3920
3921For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3922
Tony Tyef16a45e2017-06-06 20:31:59 +00003923SOP2
3924++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003925
3926.. code-block:: nasm
3927
3928 s_add_u32 s1, s2, s3
3929 s_and_b64 s[2:3], s[4:5], s[6:7]
3930 s_cselect_b32 s1, s2, s3
3931 s_andn2_b32 s2, s4, s6
3932 s_lshr_b64 s[2:3], s[4:5], s6
3933 s_ashr_i32 s2, s4, s6
3934 s_bfm_b64 s[2:3], s4, s6
3935 s_bfe_i64 s[2:3], s[4:5], s6
3936 s_cbranch_g_fork s[4:5], s[6:7]
3937
3938For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3939
Tony Tyef16a45e2017-06-06 20:31:59 +00003940SOPC
3941++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003942
3943.. code-block:: nasm
3944
3945 s_cmp_eq_i32 s1, s2
3946 s_bitcmp1_b32 s1, s2
3947 s_bitcmp0_b64 s[2:3], s4
3948 s_setvskip s3, s5
3949
3950For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3951
Tony Tyef16a45e2017-06-06 20:31:59 +00003952SOPP
3953++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003954
3955.. code-block:: nasm
3956
3957 s_barrier
3958 s_nop 2
3959 s_endpgm
3960 s_waitcnt 0 ; Wait for all counters to be 0
3961 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3962 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3963 s_sethalt 9
3964 s_sleep 10
3965 s_sendmsg 0x1
3966 s_sendmsg sendmsg(MSG_INTERRUPT)
3967 s_trap 1
3968
3969For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3970
3971Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00003972of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00003973range or acceptable values.
3974
Tony Tyef16a45e2017-06-06 20:31:59 +00003975VALU
3976++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00003977
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003978For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3979the assembler will automatically use optimal encoding based on its operands.
3980To force specific encoding, one can add a suffix to the opcode of the instruction:
3981
3982* _e32 for 32-bit VOP1/VOP2/VOPC
3983* _e64 for 64-bit VOP3
3984* _dpp for VOP_DPP
3985* _sdwa for VOP_SDWA
3986
3987VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003988
3989.. code-block:: nasm
3990
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003991 v_mov_b32 v1, v2
3992 v_mov_b32_e32 v1, v2
3993 v_nop
3994 v_cvt_f64_i32_e32 v[1:2], v2
3995 v_floor_f32_e32 v1, v2
3996 v_bfrev_b32_e32 v1, v2
3997 v_add_f32_e32 v1, v2, v3
3998 v_mul_i32_i24_e64 v1, v2, 3
3999 v_mul_i32_i24_e32 v1, -3, v3
4000 v_mul_i32_i24_e32 v1, -100, v3
4001 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4002 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00004003
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004004VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004005
4006.. code-block:: nasm
4007
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004008 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4009 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4010 v_mov_b32 v0, v0 wave_shl:1
4011 v_mov_b32 v0, v0 row_mirror
4012 v_mov_b32 v0, v0 row_bcast:31
4013 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4014 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4015 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00004016
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004017VOP_SDWA examples:
4018
4019.. code-block:: nasm
4020
4021 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4022 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4023 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4024 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4025 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4026
4027For full list of supported instructions, refer to "Vector ALU instructions".
4028
4029HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00004030~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004031
4032AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4033one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00004034
4035.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00004036+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00004037
4038*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004039object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004040
4041.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00004042+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4043
Tom Stellard347ac792015-06-26 21:15:07 +00004044
4045*major*, *minor*, and *stepping* are all integers that describe the instruction
4046set architecture (ISA) version of the assembly program.
4047
4048*vendor* and *arch* are quoted strings. *vendor* should always be equal to
4049"AMD" and *arch* should always be equal to "AMDGPU".
4050
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004051By default, the assembler will derive the ISA version, *vendor*, and *arch*
4052from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004053
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004054.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00004055+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004056
4057This directives specifies that the symbol with given name is a kernel entry point
4058(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00004059
4060.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00004061++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00004062
4063This directive marks the beginning of a list of key / value pairs that are used
4064to specify the amd_kernel_code_t object that will be emitted by the assembler.
4065The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4066any amd_kernel_code_t values that are unspecified a default value will be
4067used. The default value for all keys is 0, with the following exceptions:
4068
4069- *kernel_code_version_major* defaults to 1.
4070- *machine_kind* defaults to 1.
4071- *machine_version_major*, *machine_version_minor*, and
4072 *machine_version_stepping* are derived from the value of the -mcpu option
4073 that is passed to the assembler.
4074- *kernel_code_entry_byte_offset* defaults to 256.
4075- *wavefront_size* defaults to 6.
4076- *kernarg_segment_alignment*, *group_segment_alignment*, and
Tony Tye6baa6d22017-10-18 22:16:55 +00004077 *private_segment_alignment* default to 4. Note that alignments are specified
Tom Stellardff7416b2015-06-26 21:58:31 +00004078 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4079
4080The *.amd_kernel_code_t* directive must be placed immediately after the
4081function label and before any instructions.
4082
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004083For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4084comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00004085
4086Here is an example of a minimal amd_kernel_code_t specification:
4087
Aaron Ballman887ad0e2016-07-19 17:46:55 +00004088.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00004089
4090 .hsa_code_object_version 1,0
4091 .hsa_code_object_isa
4092
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004093 .hsatext
4094 .globl hello_world
4095 .p2align 8
4096 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00004097
4098 hello_world:
4099
4100 .amd_kernel_code_t
4101 enable_sgpr_kernarg_segment_ptr = 1
4102 is_ptr64 = 1
4103 compute_pgm_rsrc1_vgprs = 0
4104 compute_pgm_rsrc1_sgprs = 0
4105 compute_pgm_rsrc2_user_sgpr = 2
4106 kernarg_segment_byte_size = 8
4107 wavefront_sgpr_count = 2
4108 workitem_vgpr_count = 3
4109 .end_amd_kernel_code_t
4110
4111 s_load_dwordx2 s[0:1], s[0:1] 0x0
4112 v_mov_b32 v0, 3.14159
4113 s_waitcnt lgkmcnt(0)
4114 v_mov_b32 v1, s0
4115 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004116 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00004117 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00004118 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004119 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00004120
4121Additional Documentation
4122========================
4123
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004124.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4125.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4126.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4127.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4128.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4129.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4130.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4131.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004132.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
4133.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
4134.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4135.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4136.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4137.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4138.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00004139.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004140.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4141.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
4142.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__