blob: bce77333175fdcbe4335d008a705e8a3cf076449 [file] [log] [blame]
Tony Tyef16a45e2017-06-06 20:31:59 +00001=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
Tony Tye07d9f102017-11-10 01:00:54 +000026 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
Tony Tyef16a45e2017-06-06 20:31:59 +000028
Tony Tye07d9f102017-11-10 01:00:54 +000029 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000035
Tony Tye07d9f102017-11-10 01:00:54 +000036 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000038
Tony Tye07d9f102017-11-10 01:00:54 +000039 ============ ==============================================================
40 Vendor Description
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000045
Tony Tye07d9f102017-11-10 01:00:54 +000046 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
Tony Tyef16a45e2017-06-06 20:31:59 +000048
Tony Tye07d9f102017-11-10 01:00:54 +000049 ============== ============================================================
50 OS Description
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
56 runtime.
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
58 runtime.
59 ============== ============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000060
Tony Tye07d9f102017-11-10 01:00:54 +000061 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
Tony Tyef16a45e2017-06-06 20:31:59 +000063
Tony Tye07d9f102017-11-10 01:00:54 +000064 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
67 *<empty>* Defaults to ``opencl``.
68 ``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
69 ``amdgizcl`` Same as ``opencl`` except a different address space mapping is
70 used (see :ref:`amdgpu-address-spaces`).
71 ``amdgiz`` Same as ``opencl`` except a different address space mapping is
72 used (see :ref:`amdgpu-address-spaces`).
73 ``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
74 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000075
76.. _amdgpu-processors:
77
78Processors
79----------
80
81Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
82names from both the *Processor* and *Alternative Processor* can be used.
83
84 .. table:: AMDGPU Processors
Tony Tye07d9f102017-11-10 01:00:54 +000085 :name: amdgpu-processor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000086
Tony Tye31105cc2017-12-11 15:35:27 +000087 =========== =============== ============ ===== ========= ======= ==================
88 Processor Alternative Target dGPU/ Target ROCm Example
89 Processor Triple APU Features Support Products
90 Architecture Supported
91 [Default]
92 =========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000093 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Tony Tye31105cc2017-12-11 15:35:27 +000094 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +000095 ``r600`` ``r600`` dGPU
96 ``r630`` ``r600`` dGPU
97 ``rs880`` ``r600`` dGPU
98 ``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000099 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000100 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000101 ``rv710`` ``r600`` dGPU
102 ``rv730`` ``r600`` dGPU
103 ``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000104 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000105 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000106 ``cedar`` ``r600`` dGPU
107 ``redwood`` ``r600`` dGPU
108 ``sumo`` ``r600`` dGPU
109 ``juniper`` ``r600`` dGPU
110 ``cypress`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000111 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000112 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000113 ``barts`` ``r600`` dGPU
114 ``turks`` ``r600`` dGPU
115 ``caicos`` ``r600`` dGPU
116 ``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000117 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Tony Tye31105cc2017-12-11 15:35:27 +0000118 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000119 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
120 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU
121 - ``verde``
122 - ``oland``
123 - ``hainan``
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000124 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Tony Tye31105cc2017-12-11 15:35:27 +0000125 -----------------------------------------------------------------------------------
126 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
127 - A6 Pro-7050B
128 - A8-7100
129 - A8 Pro-7150B
130 - A10-7300
131 - A10 Pro-7350B
132 - FX-7500
133 - A8-7200P
134 - A10-7400P
135 - FX-7600P
136 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
137 - FirePro W9100
138 - FirePro S9150
139 - FirePro S9170
140 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
141 - Radeon R9 290x
142 - Radeon R390
143 - Radeon R390x
144 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
145 - ``mullins`` - E1-2200
146 - E1-2500
147 - E2-3000
148 - E2-3800
149 - A4-5000
150 - A4-5100
151 - A6-5200
152 - A4 Pro-3340B
153 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
154 - Radeon HD 8770
155 - R7 260
156 - R7 260X
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000157 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Tony Tye31105cc2017-12-11 15:35:27 +0000158 -----------------------------------------------------------------------------------
Tony Tye31105cc2017-12-11 15:35:27 +0000159 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
160 [on] - Pro A6-8500B
161 - A8-8600P
162 - Pro A8-8600B
163 - FX-8800P
164 - Pro A12-8800B
165 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
166 [on] - Pro A10-8700B
167 - A10-8780P
168 \ ``amdgcn`` APU - xnack - A10-9600P
169 [on] - A10-9630P
170 - A12-9700P
171 - A12-9730P
172 - FX-9800P
173 - FX-9830P
174 \ ``amdgcn`` APU - xnack - E2-9010
175 [on] - A6-9210
176 - A9-9410
177 ``gfx802`` - ``tonga`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
Tony Tyea6978802017-12-12 05:47:00 +0000178 - ``iceland`` [off] - FirePro S7100
Tony Tye31105cc2017-12-11 15:35:27 +0000179 - FirePro W7100
180 - Radeon R285
181 - Radeon R9 380
182 - Radeon R9 385
183 - Mobile FirePro
184 M7170
185 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
186 [off] - Radeon R9 Fury
187 - Radeon R9 FuryX
188 - Radeon Pro Duo
189 - FirePro S9300x2
190 - Radeon Instinct MI8
191 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
192 [off] - Radeon RX 480
193 - Radeon Instinct MI6
194 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
195 [off]
196 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
197 [on]
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000198 **GCN GFX9** [AMD-GCN-GFX9]_
Tony Tye31105cc2017-12-11 15:35:27 +0000199 -----------------------------------------------------------------------------------
200 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
201 [off] Frontier Edition
202 - Radeon RX Vega 56
203 - Radeon RX Vega 64
204 - Radeon RX Vega 64
205 Liquid
206 - Radeon Instinct MI25
207 ``gfx902`` ``amdgcn`` APU - xnack *TBA*
208 [on]
209 .. TODO
210 Add product
211 names.
212 =========== =============== ============ ===== ========= ======= ==================
Tony Tye07d9f102017-11-10 01:00:54 +0000213
214.. _amdgpu-target-features:
215
216Target Features
217---------------
218
219Target features control how code is generated to support certain
Tony Tye31105cc2017-12-11 15:35:27 +0000220processor specific features. Not all target features are supported by
221all processors. The runtime must ensure that the features supported by
222the device used to execute the code match the features enabled when
223generating the code. A mismatch of features may result in incorrect
224execution, or a reduction in performance.
225
226The target features supported by each processor, and the default value
227used if not specified explicitly, is listed in
228:ref:`amdgpu-processor-table`.
Tony Tye07d9f102017-11-10 01:00:54 +0000229
230Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
231target features.
232
233For example:
234
235``-mxnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000236 Enable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000237``-mno-xnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000238 Disable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000239
240 .. table:: AMDGPU Target Features
241 :name: amdgpu-target-feature-table
242
Tony Tye31105cc2017-12-11 15:35:27 +0000243 ============== ==================================================
244 Target Feature Description
245 ============== ==================================================
246 -m[no-]xnack Enable/disable generating code that has
247 memory clauses that are compatible with
248 having XNACK replay enabled.
Tony Tye07d9f102017-11-10 01:00:54 +0000249
Tony Tye31105cc2017-12-11 15:35:27 +0000250 This is used for demand paging and page
251 migration. If XNACK replay is enabled in
252 the device, then if a page fault occurs
253 the code may execute incorrectly if the
254 ``xnack`` feature is not enabled. Executing
255 code that has the feature enabled on a
256 device that does not have XNACK replay
257 enabled will execute correctly, but may
258 be less performant than code with the
259 feature disabled.
260 ============== ==================================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000261
262.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000263
264Address Spaces
265--------------
266
Tony Tyef16a45e2017-06-06 20:31:59 +0000267The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000268
Tony Tyef16a45e2017-06-06 20:31:59 +0000269The memory space names used in the table, aside from the region memory space, is
270from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000271
Tony Tyef16a45e2017-06-06 20:31:59 +0000272LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000273
Tony Tyef16a45e2017-06-06 20:31:59 +0000274 .. table:: Address Space Mapping
275 :name: amdgpu-address-space-mapping-table
276
277 ================== ================= ================= ================= =================
278 LLVM Address Space Memory Space
279 ------------------ -----------------------------------------------------------------------
280 \ Current Default amdgiz/amdgizcl hcc Future Default
281 ================== ================= ================= ================= =================
282 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
283 1 Global Global Global Global
284 2 Constant Constant Constant Region (GDS)
285 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
286 4 Generic (Flat) Region (GDS) Region (GDS) Constant
287 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
288 ================== ================= ================= ================= =================
289
290Current Default
291 This is the current default address space mapping used for all languages
292 except hcc. This will shortly be deprecated.
293
294amdgiz/amdgizcl
295 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
296 is specified as the target triple environment value.
297
298hcc
299 This is the current address space mapping used when ``hcc`` is specified as
300 the target triple environment value.This will shortly be deprecated.
301
302Future Default
303 This will shortly be the only address space mapping for all languages using
304 AMDGPU backend.
305
306.. _amdgpu-memory-scopes:
307
308Memory Scopes
309-------------
310
311This section provides LLVM memory synchronization scopes supported by the AMDGPU
312backend memory model when the target triple OS is ``amdhsa`` (see
313:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
314
315The memory model supported is based on the HSA memory model [HSA]_ which is
316based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
317relation is transitive over the synchonizes-with relation independent of scope,
318and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye07d9f102017-11-10 01:00:54 +0000319table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000320
321This is different to the OpenCL [OpenCL]_ memory model which does not have scope
322inclusion and requires the memory scopes to exactly match. However, this
323is conservatively correct for OpenCL.
324
Tony Tye07d9f102017-11-10 01:00:54 +0000325 .. table:: AMDHSA LLVM Sync Scopes
326 :name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tyef16a45e2017-06-06 20:31:59 +0000327
328 ================ ==========================================================
329 LLVM Sync Scope Description
330 ================ ==========================================================
331 *none* The default: ``system``.
332
333 Synchronizes with, and participates in modification and
334 seq_cst total orderings with, other operations (except
335 image operations) for all address spaces (except private,
336 or generic that accesses private) provided the other
337 operation's sync scope is:
338
339 - ``system``.
340 - ``agent`` and executed by a thread on the same agent.
341 - ``workgroup`` and executed by a thread in the same
342 workgroup.
343 - ``wavefront`` and executed by a thread in the same
344 wavefront.
345
346 ``agent`` Synchronizes with, and participates in modification and
347 seq_cst total orderings with, other operations (except
348 image operations) for all address spaces (except private,
349 or generic that accesses private) provided the other
350 operation's sync scope is:
351
352 - ``system`` or ``agent`` and executed by a thread on the
353 same agent.
354 - ``workgroup`` and executed by a thread in the same
355 workgroup.
356 - ``wavefront`` and executed by a thread in the same
357 wavefront.
358
359 ``workgroup`` Synchronizes with, and participates in modification and
360 seq_cst total orderings with, other operations (except
361 image operations) for all address spaces (except private,
362 or generic that accesses private) provided the other
363 operation's sync scope is:
364
365 - ``system``, ``agent`` or ``workgroup`` and executed by a
366 thread in the same workgroup.
367 - ``wavefront`` and executed by a thread in the same
368 wavefront.
369
370 ``wavefront`` Synchronizes with, and participates in modification and
371 seq_cst total orderings with, other operations (except
372 image operations) for all address spaces (except private,
373 or generic that accesses private) provided the other
374 operation's sync scope is:
375
376 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
377 and executed by a thread in the same wavefront.
378
379 ``singlethread`` Only synchronizes with, and participates in modification
380 and seq_cst total orderings with, other operations (except
381 image operations) running in the same thread for all
382 address spaces (for example, in signal handlers).
383 ================ ==========================================================
384
385AMDGPU Intrinsics
386-----------------
387
388The AMDGPU backend implements the following intrinsics.
389
390*This section is WIP.*
391
392.. TODO
393 List AMDGPU intrinsics
394
395Code Object
396===========
397
398The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
399can be linked by ``lld`` to produce a standard ELF shared code object which can
400be loaded and executed on an AMDGPU target.
401
402Header
403------
404
405The AMDGPU backend uses the following ELF header:
406
407 .. table:: AMDGPU ELF Header
408 :name: amdgpu-elf-header-table
409
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000410 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000411 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000412 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000413 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
414 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye07d9f102017-11-10 01:00:54 +0000415 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
416 - ``ELFOSABI_AMDGPU_HSA``
417 - ``ELFOSABI_AMDGPU_PAL``
418 - ``ELFOSABI_AMDGPU_MESA3D``
419 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
420 - ``ELFABIVERSION_AMDGPU_PAL``
421 - ``ELFABIVERSION_AMDGPU_MESA3D``
422 ``e_type`` - ``ET_REL``
423 - ``ET_DYN``
Tony Tyef16a45e2017-06-06 20:31:59 +0000424 ``e_machine`` ``EM_AMDGPU``
425 ``e_entry`` 0
Tony Tye07d9f102017-11-10 01:00:54 +0000426 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000427 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000428
429..
430
431 .. table:: AMDGPU ELF Header Enumeration Values
432 :name: amdgpu-elf-header-enumeration-values-table
433
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000434 =============================== =====
435 Name Value
436 =============================== =====
437 ``EM_AMDGPU`` 224
Tony Tye07d9f102017-11-10 01:00:54 +0000438 ``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000439 ``ELFOSABI_AMDGPU_HSA`` 64
440 ``ELFOSABI_AMDGPU_PAL`` 65
441 ``ELFOSABI_AMDGPU_MESA3D`` 66
442 ``ELFABIVERSION_AMDGPU_HSA`` 1
443 ``ELFABIVERSION_AMDGPU_PAL`` 0
444 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
445 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000446
447``e_ident[EI_CLASS]``
Tony Tye07d9f102017-11-10 01:00:54 +0000448 The ELF class is:
449
450 * ``ELFCLASS32`` for ``r600`` architecture.
451
452 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
453 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000454
455``e_ident[EI_DATA]``
Tony Tye07d9f102017-11-10 01:00:54 +0000456 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tyef16a45e2017-06-06 20:31:59 +0000457
458``e_ident[EI_OSABI]``
Tony Tye07d9f102017-11-10 01:00:54 +0000459 One of the following AMD GPU architecture specific OS ABIs
460 (see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000461
Tony Tye07d9f102017-11-10 01:00:54 +0000462 * ``ELFOSABI_NONE`` for *unknown* OS.
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000463
Tony Tye07d9f102017-11-10 01:00:54 +0000464 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tyef16a45e2017-06-06 20:31:59 +0000465
Tony Tye07d9f102017-11-10 01:00:54 +0000466 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
467
468 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000469
Tony Tyef16a45e2017-06-06 20:31:59 +0000470``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000471 The ABI version of the AMD GPU architecture specific OS ABI to which the code
472 object conforms:
473
474 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
475 runtime ABI.
476
477 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
478 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000479
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000480 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye07d9f102017-11-10 01:00:54 +0000481 3D runtime ABI.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000482
Tony Tyef16a45e2017-06-06 20:31:59 +0000483``e_type``
484 Can be one of the following values:
485
486
487 ``ET_REL``
488 The type produced by the AMD GPU backend compiler as it is relocatable code
489 object.
490
491 ``ET_DYN``
492 The type produced by the linker as it is a shared code object.
493
494 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
495
496``e_machine``
Tony Tye07d9f102017-11-10 01:00:54 +0000497 The value ``EM_AMDGPU`` is used for the machine for all processors supported
498 by the ``r600`` and ``amdgcn`` architectures (see
499 :ref:`amdgpu-processor-table`). The specific processor is specified in the
500 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
501 :ref:`amdgpu-elf-header-e_flags-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000502
503``e_entry``
504 The entry point is 0 as the entry points for individual kernels must be
505 selected in order to invoke them through AQL packets.
506
507``e_flags``
Tony Tye07d9f102017-11-10 01:00:54 +0000508 The AMDGPU backend uses the following ELF header flags:
509
510 .. table:: AMDGPU ELF Header ``e_flags``
511 :name: amdgpu-elf-header-e_flags-table
512
513 ================================= ========== =============================
514 Name Value Description
515 ================================= ========== =============================
516 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
517 -------------------------------------------- -----------------------------
518 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
519 mask for
520 ``EF_AMDGPU_MACH_xxx`` values
521 defined in
522 :ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye31105cc2017-12-11 15:35:27 +0000523 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
524 target feature is
525 enabled for all code
526 contained in the code object.
527 See
528 :ref:`amdgpu-target-features`.
Tony Tye07d9f102017-11-10 01:00:54 +0000529 ================================= ========== =============================
530
531 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
532 :name: amdgpu-ef-amdgpu-mach-table
533
534 ================================= ========== =============================
535 Name Value Description (see
536 :ref:`amdgpu-processor-table`)
537 ================================= ========== =============================
538 ``EF_AMDGPU_MACH_NONE`` 0 *not specified*
539 ``EF_AMDGPU_MACH_R600_R600`` 1 ``r600``
540 ``EF_AMDGPU_MACH_R600_R630`` 2 ``r630``
541 ``EF_AMDGPU_MACH_R600_RS880`` 3 ``rs880``
542 ``EF_AMDGPU_MACH_R600_RV670`` 4 ``rv670``
543 ``EF_AMDGPU_MACH_R600_RV710`` 5 ``rv710``
544 ``EF_AMDGPU_MACH_R600_RV730`` 6 ``rv730``
545 ``EF_AMDGPU_MACH_R600_RV770`` 7 ``rv770``
546 ``EF_AMDGPU_MACH_R600_CEDAR`` 8 ``cedar``
547 ``EF_AMDGPU_MACH_R600_REDWOOD`` 9 ``redwood``
548 ``EF_AMDGPU_MACH_R600_SUMO`` 10 ``sumo``
549 ``EF_AMDGPU_MACH_R600_JUNIPER`` 11 ``juniper``
550 ``EF_AMDGPU_MACH_R600_CYPRESS`` 12 ``cypress``
551 ``EF_AMDGPU_MACH_R600_BARTS`` 13 ``barts``
552 ``EF_AMDGPU_MACH_R600_TURKS`` 14 ``turks``
553 ``EF_AMDGPU_MACH_R600_CAICOS`` 15 ``caicos``
554 ``EF_AMDGPU_MACH_R600_CAYMAN`` 16 ``cayman``
555 *reserved* 17-31 Reserved for ``r600``
556 architecture processors.
557 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 32 ``gfx600``
558 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 33 ``gfx601``
559 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 34 ``gfx700``
560 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 35 ``gfx701``
561 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 36 ``gfx702``
562 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 37 ``gfx703``
Tony Tye31105cc2017-12-11 15:35:27 +0000563 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 38 ``gfx704``
Tony Tyea6978802017-12-12 05:47:00 +0000564 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 39 ``gfx801``
565 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 40 ``gfx802``
566 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 41 ``gfx803``
567 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 42 ``gfx810``
568 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 43 ``gfx900``
569 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 44 ``gfx902``
Tony Tye07d9f102017-11-10 01:00:54 +0000570 ================================= ========== =============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000571
572Sections
573--------
574
575An AMDGPU target ELF code object has the standard ELF sections which include:
576
577 .. table:: AMDGPU ELF Sections
578 :name: amdgpu-elf-sections-table
579
580 ================== ================ =================================
581 Name Type Attributes
582 ================== ================ =================================
583 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
584 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
585 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
586 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
587 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
588 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
589 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
590 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
591 ``.note`` ``SHT_NOTE`` *none*
592 ``.rela``\ *name* ``SHT_RELA`` *none*
593 ``.rela.dyn`` ``SHT_RELA`` *none*
594 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
595 ``.shstrtab`` ``SHT_STRTAB`` *none*
596 ``.strtab`` ``SHT_STRTAB`` *none*
597 ``.symtab`` ``SHT_SYMTAB`` *none*
598 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
599 ================== ================ =================================
600
601These sections have their standard meanings (see [ELF]_) and are only generated
602if needed.
603
604``.debug``\ *\**
605 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
606 DWARF produced by the AMDGPU backend.
607
Tony Tye46d35762017-08-15 20:47:41 +0000608``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000609 The standard sections used by a dynamic loader.
610
611``.note``
612 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
613 backend.
614
615``.rela``\ *name*, ``.rela.dyn``
616 For relocatable code objects, *name* is the name of the section that the
617 relocation records apply. For example, ``.rela.text`` is the section name for
618 relocation records associated with the ``.text`` section.
619
620 For linked shared code objects, ``.rela.dyn`` contains all the relocation
621 records from each of the relocatable code object's ``.rela``\ *name* sections.
622
623 See :ref:`amdgpu-relocation-records` for the relocation records supported by
624 the AMDGPU backend.
625
626``.text``
627 The executable machine code for the kernels and functions they call. Generated
628 as position independent code. See :ref:`amdgpu-code-conventions` for
629 information on conventions used in the isa generation.
630
631.. _amdgpu-note-records:
632
633Note Records
634------------
635
Tony Tye07d9f102017-11-10 01:00:54 +0000636As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
637be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
638aligned. In addition, minimal zero byte padding must be generated to ensure the
639``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
640``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tyef16a45e2017-06-06 20:31:59 +0000641
642The AMDGPU backend code object uses the following ELF note records in the
643``.note`` section. The *Description* column specifies the layout of the note
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000644record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tyef16a45e2017-06-06 20:31:59 +0000645variable size strings have a corresponding ``*_size`` field that specifies the
646number of bytes, including the terminating null character, in the string. The
647string(s) come immediately after the preceding fields.
648
649Additional note records can be present.
650
651 .. table:: AMDGPU ELF Note Records
652 :name: amdgpu-elf-note-records-table
653
Tony Tye46d35762017-08-15 20:47:41 +0000654 ===== ============================== ======================================
655 Name Type Description
656 ===== ============================== ======================================
657 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye46d35762017-08-15 20:47:41 +0000658 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000659
660..
661
662 .. table:: AMDGPU ELF Note Record Enumeration Values
663 :name: amdgpu-elf-note-record-enumeration-values-table
664
Tony Tye46d35762017-08-15 20:47:41 +0000665 ============================== =====
666 Name Value
667 ============================== =====
668 *reserved* 0-9
669 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye07d9f102017-11-10 01:00:54 +0000670 *reserved* 11
Tony Tye46d35762017-08-15 20:47:41 +0000671 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000672
Tony Tye46d35762017-08-15 20:47:41 +0000673``NT_AMD_AMDGPU_HSA_METADATA``
674 Specifies extensible metadata associated with the code objects executed on HSA
675 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
676 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
677 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
678 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000679
Tony Tye46d35762017-08-15 20:47:41 +0000680.. _amdgpu-symbols:
681
682Symbols
683-------
684
685Symbols include the following:
686
687 .. table:: AMDGPU ELF Symbols
688 :name: amdgpu-elf-symbols-table
689
690 ===================== ============== ============= ==================
691 Name Type Section Description
692 ===================== ============== ============= ==================
693 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
694 - ``.rodata``
695 - ``.bss``
696 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
697 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
698 ===================== ============== ============= ==================
699
700Global variable
701 Global variables both used and defined by the compilation unit.
702
703 If the symbol is defined in the compilation unit then it is allocated in the
704 appropriate section according to if it has initialized data or is readonly.
705
706 If the symbol is external then its section is ``STN_UNDEF`` and the loader
707 will resolve relocations using the definition provided by another code object
708 or explicitly defined by the runtime.
709
710 All global symbols, whether defined in the compilation unit or external, are
711 accessed by the machine code indirectly through a GOT table entry. This
712 allows them to be preemptable. The GOT table is only supported when the target
713 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000714
715 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000716 Add description of linked shared object symbols. Seems undefined symbols
717 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000718
Tony Tye46d35762017-08-15 20:47:41 +0000719Kernel descriptor
720 Every HSA kernel has an associated kernel descriptor. It is the address of the
721 kernel descriptor that is used in the AQL dispatch packet used to invoke the
722 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
723 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
724
725Kernel entry point
726 Every HSA kernel also has a symbol for its machine code entry point.
727
728.. _amdgpu-relocation-records:
729
730Relocation Records
731------------------
732
733AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
734relocatable fields are:
735
736``word32``
737 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
738 alignment. These values use the same byte order as other word values in the
739 AMD GPU architecture.
740
741``word64``
742 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
743 alignment. These values use the same byte order as other word values in the
744 AMD GPU architecture.
745
746Following notations are used for specifying relocation calculations:
747
748**A**
749 Represents the addend used to compute the value of the relocatable field.
750
751**G**
752 Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000753 entry's symbol will reside during execution.
Tony Tye46d35762017-08-15 20:47:41 +0000754
755**GOT**
756 Represents the address of the global offset table.
757
758**P**
759 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
760 of the storage unit being relocated (computed using ``r_offset``).
761
762**S**
763 Represents the value of the symbol whose index resides in the relocation
Tony Tyed2884302017-10-16 20:44:29 +0000764 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
765
766**B**
767 Represents the base address of a loaded executable or shared object which is
768 the difference between the ELF address and the actual load address. Relocations
769 using this are only valid in executable or shared objects.
Tony Tye46d35762017-08-15 20:47:41 +0000770
771The following relocation types are supported:
772
773 .. table:: AMDGPU ELF Relocation Records
774 :name: amdgpu-elf-relocation-records-table
775
776 ========================== ===== ========== ==============================
777 Relocation Type Value Field Calculation
778 ========================== ===== ========== ==============================
779 ``R_AMDGPU_NONE`` 0 *none* *none*
780 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
781 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
782 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
783 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
784 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
785 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
786 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
787 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
788 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
789 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
790 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
Tony Tyed2884302017-10-16 20:44:29 +0000791 *reserved* 12
792 ``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
Tony Tye46d35762017-08-15 20:47:41 +0000793 ========================== ===== ========== ==============================
794
795.. _amdgpu-dwarf:
796
797DWARF
798-----
799
800Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
801information that maps the code object executable code and data to the source
802language constructs. It can be used by tools such as debuggers and profilers.
803
804Address Space Mapping
805~~~~~~~~~~~~~~~~~~~~~
806
807The following address space mapping is used:
808
809 .. table:: AMDGPU DWARF Address Space Mapping
810 :name: amdgpu-dwarf-address-space-mapping-table
811
812 =================== =================
813 DWARF Address Space Memory Space
814 =================== =================
815 1 Private (Scratch)
816 2 Local (group/LDS)
817 *omitted* Global
818 *omitted* Constant
819 *omitted* Generic (Flat)
820 *not supported* Region (GDS)
821 =================== =================
822
823See :ref:`amdgpu-address-spaces` for information on the memory space terminology
824used in the table.
825
826An ``address_class`` attribute is generated on pointer type DIEs to specify the
827DWARF address space of the value of the pointer when it is in the *private* or
828*local* address space. Otherwise the attribute is omitted.
829
830An ``XDEREF`` operation is generated in location list expressions for variables
831that are allocated in the *private* and *local* address space. Otherwise no
832``XDREF`` is omitted.
833
834Register Mapping
835~~~~~~~~~~~~~~~~
836
837*This section is WIP.*
838
839.. TODO
840 Define DWARF register enumeration.
841
842 If want to present a wavefront state then should expose vector registers as
843 64 wide (rather than per work-item view that LLVM uses). Either as separate
844 registers, or a 64x4 byte single register. In either case use a new LANE op
845 (akin to XDREF) to select the current lane usage in a location
846 expression. This would also allow scalar register spilling to vector register
847 lanes to be expressed (currently no debug information is being generated for
848 spilling). If choose a wide single register approach then use LANE in
849 conjunction with PIECE operation to select the dword part of the register for
850 the current lane. If the separate register approach then use LANE to select
851 the register.
852
853Source Text
854~~~~~~~~~~~
855
856*This section is WIP.*
857
858.. TODO
859 DWARF extension to include runtime generated source text.
860
861.. _amdgpu-code-conventions:
862
863Code Conventions
864================
865
866This section provides code conventions used for each supported target triple OS
867(see :ref:`amdgpu-target-triples`).
868
869AMDHSA
870------
871
872This section provides code conventions used when the target triple OS is
873``amdhsa`` (see :ref:`amdgpu-target-triples`).
874
875.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000876
877Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000878~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000879
Tony Tye46d35762017-08-15 20:47:41 +0000880The code object metadata specifies extensible metadata associated with the code
881objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
882[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
883(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
884``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
885information necessary to support the ROCM kernel queries. For example, the
886segment sizes needed in a dispatch packet. In addition, a high level language
887runtime may require other information to be included. For example, the AMD
888OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000889
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000890The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000891:doc:`YamlIO`).
892
Tony Tye46d35762017-08-15 20:47:41 +0000893.. TODO
894 Is the string null terminated? It probably should not if YAML allows it to
895 contain null characters, otherwise it should be.
896
Tony Tyef16a45e2017-06-06 20:31:59 +0000897The metadata is represented as a single YAML document comprised of the mapping
898defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
899referenced tables.
900
901For boolean values, the string values of ``false`` and ``true`` are used for
902false and true respectively.
903
904Additional information can be added to the mappings. To avoid conflicts, any
905non-AMD key names should be prefixed by "*vendor-name*.".
906
907 .. table:: AMDHSA Code Object Metadata Mapping
908 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
909
910 ========== ============== ========= =======================================
911 String Key Value Type Required? Description
912 ========== ============== ========= =======================================
913 "Version" sequence of Required - The first integer is the major
914 2 integers version. Currently 1.
915 - The second integer is the minor
916 version. Currently 0.
917 "Printf" sequence of Each string is encoded information
918 strings about a printf function call. The
919 encoded information is organized as
920 fields separated by colon (':'):
921
922 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
923
924 where:
925
926 ``ID``
927 A 32 bit integer as a unique id for
928 each printf function call
929
930 ``N``
931 A 32 bit integer equal to the number
932 of arguments of printf function call
933 minus 1
934
935 ``S[i]`` (where i = 0, 1, ... , N-1)
936 32 bit integers for the size in bytes
937 of the i-th FormatString argument of
938 the printf function call
939
940 FormatString
941 The format string passed to the
942 printf function call.
943 "Kernels" sequence of Required Sequence of the mappings for each
944 mapping kernel in the code object. See
945 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
946 for the definition of the mapping.
947 ========== ============== ========= =======================================
948
949..
950
951 .. table:: AMDHSA Code Object Kernel Metadata Mapping
952 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
953
954 ================= ============== ========= ================================
955 String Key Value Type Required? Description
956 ================= ============== ========= ================================
957 "Name" string Required Source name of the kernel.
958 "SymbolName" string Required Name of the kernel
959 descriptor ELF symbol.
960 "Language" string Source language of the kernel.
961 Values include:
962
963 - "OpenCL C"
964 - "OpenCL C++"
965 - "HCC"
966 - "OpenMP"
967
968 "LanguageVersion" sequence of - The first integer is the major
969 2 integers version.
970 - The second integer is the
971 minor version.
972 "Attrs" mapping Mapping of kernel attributes.
973 See
974 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
975 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +0000976 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +0000977 mapping kernel arguments. See
978 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
979 for the definition of the mapping.
980 "CodeProps" mapping Mapping of properties related to
981 the kernel code. See
982 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
983 for the mapping definition.
Tony Tyef16a45e2017-06-06 20:31:59 +0000984 ================= ============== ========= ================================
985
986..
987
988 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
989 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
990
991 =================== ============== ========= ==============================
992 String Key Value Type Required? Description
993 =================== ============== ========= ==============================
Tony Tyee039d0e2018-01-30 23:07:10 +0000994 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
995 3 integers must be >=1 and the dispatch
996 work-group size X, Y, Z must
997 correspond to the specified
998 values. Defaults to 0, 0, 0.
Tony Tyef16a45e2017-06-06 20:31:59 +0000999
1000 Corresponds to the OpenCL
1001 ``reqd_work_group_size``
1002 attribute.
1003 "WorkGroupSizeHint" sequence of The dispatch work-group size
1004 3 integers X, Y, Z is likely to be the
1005 specified values.
1006
1007 Corresponds to the OpenCL
1008 ``work_group_size_hint``
1009 attribute.
1010 "VecTypeHint" string The name of a scalar or vector
1011 type.
1012
1013 Corresponds to the OpenCL
1014 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +00001015
1016 "RuntimeHandle" string The external symbol name
1017 associated with a kernel.
1018 OpenCL runtime allocates a
1019 global buffer for the symbol
1020 and saves the kernel's address
1021 to it, which is used for
1022 device side enqueueing. Only
1023 available for device side
1024 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +00001025 =================== ============== ========= ==============================
1026
1027..
1028
1029 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1030 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1031
1032 ================= ============== ========= ================================
1033 String Key Value Type Required? Description
1034 ================= ============== ========= ================================
1035 "Name" string Kernel argument name.
1036 "TypeName" string Kernel argument type name.
1037 "Size" integer Required Kernel argument size in bytes.
1038 "Align" integer Required Kernel argument alignment in
1039 bytes. Must be a power of two.
1040 "ValueKind" string Required Kernel argument kind that
1041 specifies how to set up the
1042 corresponding argument.
1043 Values include:
1044
1045 "ByValue"
1046 The argument is copied
1047 directly into the kernarg.
1048
1049 "GlobalBuffer"
1050 A global address space pointer
1051 to the buffer data is passed
1052 in the kernarg.
1053
1054 "DynamicSharedPointer"
1055 A group address space pointer
1056 to dynamically allocated LDS
1057 is passed in the kernarg.
1058
1059 "Sampler"
1060 A global address space
1061 pointer to a S# is passed in
1062 the kernarg.
1063
1064 "Image"
1065 A global address space
1066 pointer to a T# is passed in
1067 the kernarg.
1068
1069 "Pipe"
1070 A global address space pointer
1071 to an OpenCL pipe is passed in
1072 the kernarg.
1073
1074 "Queue"
1075 A global address space pointer
1076 to an OpenCL device enqueue
1077 queue is passed in the
1078 kernarg.
1079
1080 "HiddenGlobalOffsetX"
1081 The OpenCL grid dispatch
1082 global offset for the X
1083 dimension is passed in the
1084 kernarg.
1085
1086 "HiddenGlobalOffsetY"
1087 The OpenCL grid dispatch
1088 global offset for the Y
1089 dimension is passed in the
1090 kernarg.
1091
1092 "HiddenGlobalOffsetZ"
1093 The OpenCL grid dispatch
1094 global offset for the Z
1095 dimension is passed in the
1096 kernarg.
1097
1098 "HiddenNone"
1099 An argument that is not used
1100 by the kernel. Space needs to
1101 be left for it, but it does
1102 not need to be set up.
1103
1104 "HiddenPrintfBuffer"
1105 A global address space pointer
1106 to the runtime printf buffer
1107 is passed in kernarg.
1108
1109 "HiddenDefaultQueue"
1110 A global address space pointer
1111 to the OpenCL device enqueue
1112 queue that should be used by
1113 the kernel by default is
1114 passed in the kernarg.
1115
1116 "HiddenCompletionAction"
Yaxun Liuc928f2a2017-10-30 14:30:28 +00001117 A global address space pointer
1118 to help link enqueued kernels into
1119 the ancestor tree for determining
1120 when the parent kernel has finished.
Tony Tyef16a45e2017-06-06 20:31:59 +00001121
1122 "ValueType" string Required Kernel argument value type. Only
1123 present if "ValueKind" is
1124 "ByValue". For vector data
1125 types, the value is for the
1126 element type. Values include:
1127
1128 - "Struct"
1129 - "I8"
1130 - "U8"
1131 - "I16"
1132 - "U16"
1133 - "F16"
1134 - "I32"
1135 - "U32"
1136 - "F32"
1137 - "I64"
1138 - "U64"
1139 - "F64"
1140
1141 .. TODO
1142 How can it be determined if a
1143 vector type, and what size
1144 vector?
1145 "PointeeAlign" integer Alignment in bytes of pointee
1146 type for pointer type kernel
1147 argument. Must be a power
1148 of 2. Only present if
1149 "ValueKind" is
1150 "DynamicSharedPointer".
1151 "AddrSpaceQual" string Kernel argument address space
1152 qualifier. Only present if
1153 "ValueKind" is "GlobalBuffer" or
1154 "DynamicSharedPointer". Values
1155 are:
1156
1157 - "Private"
1158 - "Global"
1159 - "Constant"
1160 - "Local"
1161 - "Generic"
1162 - "Region"
1163
1164 .. TODO
1165 Is GlobalBuffer only Global
1166 or Constant? Is
1167 DynamicSharedPointer always
1168 Local? Can HCC allow Generic?
1169 How can Private or Region
1170 ever happen?
1171 "AccQual" string Kernel argument access
1172 qualifier. Only present if
1173 "ValueKind" is "Image" or
1174 "Pipe". Values
1175 are:
1176
1177 - "ReadOnly"
1178 - "WriteOnly"
1179 - "ReadWrite"
1180
1181 .. TODO
1182 Does this apply to
1183 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001184 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001185 performed by the kernel on the
1186 kernel argument. Only present if
1187 "ValueKind" is "GlobalBuffer",
1188 "Image", or "Pipe". This may be
1189 more restrictive than indicated
1190 by "AccQual" to reflect what the
1191 kernel actual does. If not
1192 present then the runtime must
1193 assume what is implied by
1194 "AccQual" and "IsConst". Values
1195 are:
1196
1197 - "ReadOnly"
1198 - "WriteOnly"
1199 - "ReadWrite"
1200
1201 "IsConst" boolean Indicates if the kernel argument
1202 is const qualified. Only present
1203 if "ValueKind" is
1204 "GlobalBuffer".
1205
1206 "IsRestrict" boolean Indicates if the kernel argument
1207 is restrict qualified. Only
1208 present if "ValueKind" is
1209 "GlobalBuffer".
1210
1211 "IsVolatile" boolean Indicates if the kernel argument
1212 is volatile qualified. Only
1213 present if "ValueKind" is
1214 "GlobalBuffer".
1215
1216 "IsPipe" boolean Indicates if the kernel argument
1217 is pipe qualified. Only present
1218 if "ValueKind" is "Pipe".
1219
1220 .. TODO
1221 Can GlobalBuffer be pipe
1222 qualified?
1223 ================= ============== ========= ================================
1224
1225..
1226
1227 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1228 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1229
1230 ============================ ============== ========= =====================
1231 String Key Value Type Required? Description
1232 ============================ ============== ========= =====================
1233 "KernargSegmentSize" integer Required The size in bytes of
1234 the kernarg segment
1235 that holds the values
1236 of the arguments to
1237 the kernel.
1238 "GroupSegmentFixedSize" integer Required The amount of group
1239 segment memory
1240 required by a
1241 work-group in
1242 bytes. This does not
1243 include any
1244 dynamically allocated
1245 group segment memory
1246 that may be added
1247 when the kernel is
1248 dispatched.
1249 "PrivateSegmentFixedSize" integer Required The amount of fixed
1250 private address space
1251 memory required for a
1252 work-item in
Tony Tye07d9f102017-11-10 01:00:54 +00001253 bytes. If the kernel
1254 uses a dynamic call
1255 stack then additional
Tony Tyef16a45e2017-06-06 20:31:59 +00001256 space must be added
1257 to this value for the
1258 call stack.
1259 "KernargSegmentAlign" integer Required The maximum byte
1260 alignment of
1261 arguments in the
1262 kernarg segment. Must
1263 be a power of 2.
1264 "WavefrontSize" integer Required Wavefront size. Must
1265 be a power of 2.
Tony Tye07d9f102017-11-10 01:00:54 +00001266 "NumSGPRs" integer Required Number of scalar
Tony Tyef16a45e2017-06-06 20:31:59 +00001267 registers used by a
1268 wavefront for
1269 GFX6-GFX9. This
1270 includes the special
1271 SGPRs for VCC, Flat
1272 Scratch (GFX7-GFX9)
1273 and XNACK (for
1274 GFX8-GFX9). It does
1275 not include the 16
1276 SGPR added if a trap
1277 handler is
1278 enabled. It is not
1279 rounded up to the
1280 allocation
1281 granularity.
Tony Tye07d9f102017-11-10 01:00:54 +00001282 "NumVGPRs" integer Required Number of vector
Tony Tyef16a45e2017-06-06 20:31:59 +00001283 registers used by
1284 each work-item for
1285 GFX6-GFX9
Tony Tye07d9f102017-11-10 01:00:54 +00001286 "MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001287 work-group size
1288 supported by the
1289 kernel in work-items.
Tony Tye07d9f102017-11-10 01:00:54 +00001290 Must be >=1 and
Tony Tyee039d0e2018-01-30 23:07:10 +00001291 consistent with
1292 ReqdWorkGroupSize if
1293 not 0, 0, 0.
Konstantin Zhuravlyov06ae4ec2017-11-28 17:51:08 +00001294 "NumSpilledSGPRs" integer Number of stores from
1295 a scalar register to
1296 a register allocator
1297 created spill
1298 location.
1299 "NumSpilledVGPRs" integer Number of stores from
1300 a vector register to
1301 a register allocator
1302 created spill
1303 location.
Tony Tyef16a45e2017-06-06 20:31:59 +00001304 ============================ ============== ========= =====================
1305
1306..
1307
Tony Tyef16a45e2017-06-06 20:31:59 +00001308Kernel Dispatch
1309~~~~~~~~~~~~~~~
1310
1311The HSA architected queuing language (AQL) defines a user space memory interface
1312that can be used to control the dispatch of kernels, in an agent independent
1313way. An agent can have zero or more AQL queues created for it using the ROCm
1314runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1315*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1316mechanics and packet layouts.
1317
1318The packet processor of a kernel agent is responsible for detecting and
1319dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1320packet processor is implemented by the hardware command processor (CP),
1321asynchronous dispatch controller (ADC) and shader processor input controller
1322(SPI).
1323
1324The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1325mode driver to initialize and register the AQL queue with CP.
1326
1327To dispatch a kernel the following actions are performed. This can occur in the
1328CPU host program, or from an HSA kernel executing on a GPU.
1329
13301. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1331 executed is obtained.
13322. A pointer to the kernel descriptor (see
1333 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1334 obtained. It must be for a kernel that is contained in a code object that that
1335 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1336 associated.
13373. Space is allocated for the kernel arguments using the ROCm runtime allocator
1338 for a memory region with the kernarg property for the kernel agent that will
1339 execute the kernel. It must be at least 16 byte aligned.
13404. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00001341 allocation. The layout is defined in the *HSA Programmer's Language Reference*
Tony Tyef16a45e2017-06-06 20:31:59 +00001342 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1343 memory in the same way constant memory is accessed. (Note that the HSA
1344 specification allows an implementation to copy the kernel argument contents to
1345 another location that is accessed by the kernel.)
13465. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1347 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1348 packet. The packet must be set up, and the final write must use an atomic
1349 store release to set the packet kind to ensure the packet contents are
1350 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1351 notify the kernel agent that the AQL queue has been updated. These rules, and
1352 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1353 System Architecture Specification* [HSA]_.
13546. A kernel dispatch packet includes information about the actual dispatch,
1355 such as grid and work-group size, together with information from the code
1356 object about the kernel, such as segment sizes. The ROCm runtime queries on
1357 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001358 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000013597. CP executes micro-code and is responsible for detecting and setting up the
1360 GPU to execute the wavefronts of a kernel dispatch.
13618. CP ensures that when the a wavefront starts executing the kernel machine
1362 code, the scalar general purpose registers (SGPR) and vector general purpose
1363 registers (VGPR) are set up as required by the machine code. The required
1364 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1365 register state is defined in
1366 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13679. The prolog of the kernel machine code (see
1368 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1369 before continuing executing the machine code that corresponds to the kernel.
137010. When the kernel dispatch has completed execution, CP signals the completion
1371 signal specified in the kernel dispatch packet if not 0.
1372
1373.. _amdgpu-amdhsa-memory-spaces:
1374
1375Memory Spaces
1376~~~~~~~~~~~~~
1377
1378The memory space properties are:
1379
1380 .. table:: AMDHSA Memory Spaces
1381 :name: amdgpu-amdhsa-memory-spaces-table
1382
1383 ================= =========== ======== ======= ==================
1384 Memory Space Name HSA Segment Hardware Address NULL Value
1385 Name Name Size
1386 ================= =========== ======== ======= ==================
1387 Private private scratch 32 0x00000000
1388 Local group LDS 32 0xFFFFFFFF
1389 Global global global 64 0x0000000000000000
1390 Constant constant *same as 64 0x0000000000000000
1391 global*
1392 Generic flat flat 64 0x0000000000000000
1393 Region N/A GDS 32 *not implemented
1394 for AMDHSA*
1395 ================= =========== ======== ======= ==================
1396
1397The global and constant memory spaces both use global virtual addresses, which
1398are the same virtual address space used by the CPU. However, some virtual
1399addresses may only be accessible to the CPU, some only accessible by the GPU,
1400and some by both.
1401
1402Using the constant memory space indicates that the data will not change during
1403the execution of the kernel. This allows scalar read instructions to be
1404used. The vector and scalar L1 caches are invalidated of volatile data before
1405each kernel dispatch execution to allow constant memory to change values between
1406kernel dispatches.
1407
1408The local memory space uses the hardware Local Data Store (LDS) which is
1409automatically allocated when the hardware creates work-groups of wavefronts, and
1410freed when all the wavefronts of a work-group have terminated. The data store
1411(DS) instructions can be used to access it.
1412
1413The private memory space uses the hardware scratch memory support. If the kernel
1414uses scratch, then the hardware allocates memory that is accessed using
1415wavefront lane dword (4 byte) interleaving. The mapping used from private
1416address to physical address is:
1417
1418 ``wavefront-scratch-base +
1419 (private-address * wavefront-size * 4) +
1420 (wavefront-lane-id * 4)``
1421
1422There are different ways that the wavefront scratch base address is determined
1423by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1424memory can be accessed in an interleaved manner using buffer instruction with
1425the scratch buffer descriptor and per wave scratch offset, by the scratch
1426instructions, or by flat instructions. If each lane of a wavefront accesses the
1427same private address, the interleaving results in adjacent dwords being accessed
1428and hence requires fewer cache lines to be fetched. Multi-dword access is not
1429supported except by flat and scratch instructions in GFX9.
1430
1431The generic address space uses the hardware flat address support available in
1432GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1433local appertures), that are outside the range of addressible global memory, to
1434map from a flat address to a private or local address.
1435
1436FLAT instructions can take a flat address and access global, private (scratch)
1437and group (LDS) memory depending in if the address is within one of the
1438apperture ranges. Flat access to scratch requires hardware aperture setup and
1439setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1440access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1441(see :ref:`amdgpu-amdhsa-m0`).
1442
1443To convert between a segment address and a flat address the base address of the
1444appertures address can be used. For GFX7-GFX8 these are available in the
1445:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1446Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1447GFX9 the appature base addresses are directly available as inline constant
1448registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1449address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1450which makes it easier to convert from flat to segment or segment to flat.
1451
Tony Tye46d35762017-08-15 20:47:41 +00001452Image and Samplers
1453~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001454
1455Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1456hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1457HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1458enumeration values for the queries that are not trivially deducible from the S#
1459representation.
1460
1461HSA Signals
1462~~~~~~~~~~~
1463
Tony Tye46d35762017-08-15 20:47:41 +00001464HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1465structure allocated in memory accessible from both the CPU and GPU. The
1466structure is defined by the ROCm runtime and subject to change between releases
1467(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001468
1469.. _amdgpu-amdhsa-hsa-aql-queue:
1470
1471HSA AQL Queue
1472~~~~~~~~~~~~~
1473
Tony Tye46d35762017-08-15 20:47:41 +00001474The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001475between releases (see [AMD-ROCm-github]_). For some processors it contains
1476fields needed to implement certain language features such as the flat address
1477aperture bases. It also contains fields used by CP such as managing the
1478allocation of scratch memory.
1479
1480.. _amdgpu-amdhsa-kernel-descriptor:
1481
1482Kernel Descriptor
1483~~~~~~~~~~~~~~~~~
1484
1485A kernel descriptor consists of the information needed by CP to initiate the
1486execution of a kernel, including the entry point address of the machine code
1487that implements the kernel.
1488
1489Kernel Descriptor for GFX6-GFX9
1490+++++++++++++++++++++++++++++++
1491
1492CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1493
1494 .. table:: Kernel Descriptor for GFX6-GFX9
1495 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1496
Tony Tye6baa6d22017-10-18 22:16:55 +00001497 ======= ======= =============================== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001498 Bits Size Field Name Description
Tony Tye6baa6d22017-10-18 22:16:55 +00001499 ======= ======= =============================== ============================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001500 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001501 address space memory
1502 required for a work-group
1503 in bytes. This does not
1504 include any dynamically
1505 allocated local address
1506 space memory that may be
1507 added when the kernel is
1508 dispatched.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001509 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001510 private address space
1511 memory required for a
1512 work-item in bytes. If
1513 is_dynamic_callstack is 1
1514 then additional space must
1515 be added to this value for
1516 the call stack.
Tony Tye07d9f102017-11-10 01:00:54 +00001517 127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001518 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001519 negative) from base
1520 address of kernel
1521 descriptor to kernel's
1522 entry point instruction
1523 which must be 256 byte
1524 aligned.
Tony Tyee039d0e2018-01-30 23:07:10 +00001525 383:192 24 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001526 bytes
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001527 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001528 program settings used by
1529 CP to set up
1530 ``COMPUTE_PGM_RSRC1``
1531 configuration
1532 register. See
Tony Tye6baa6d22017-10-18 22:16:55 +00001533 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001534 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001535 program settings used by
1536 CP to set up
1537 ``COMPUTE_PGM_RSRC2``
1538 configuration
1539 register. See
1540 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001541 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1542 SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001543 (see
1544 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1545
1546 The total number of SGPR
1547 user data registers
1548 requested must not exceed
1549 16 and match value in
1550 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1551 Any requests beyond 16
1552 will be ignored.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001553 449 1 bit EnableSGPRDispatchPtr *see above*
1554 450 1 bit EnableSGPRQueuePtr *see above*
1555 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1556 452 1 bit EnableSGPRDispatchID *see above*
1557 453 1 bit EnableSGPRFlatScratchInit *see above*
1558 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1559 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1560 should always be 0.
1561 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1562 should always be 0.
1563 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1564 should always be 0.
Tony Tye31105cc2017-12-11 15:35:27 +00001565 463:458 6 bits Reserved, must be 0.
Tony Tye6baa6d22017-10-18 22:16:55 +00001566 511:464 6 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001567 bytes
1568 512 **Total size 64 bytes.**
Tony Tye6baa6d22017-10-18 22:16:55 +00001569 ======= ====================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001570
1571..
1572
1573 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001574 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tyef16a45e2017-06-06 20:31:59 +00001575
Tony Tye3b340612017-06-07 00:46:08 +00001576 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001577 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001578 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001579 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001580 used by each work-item,
1581 granularity is device
1582 specific:
1583
Tony Tye07d9f102017-11-10 01:00:54 +00001584 GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001585 - max_vgpr 1..256
1586 - roundup((max_vgpg + 1)
1587 / 4) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001588
1589 Used by CP to set up
1590 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001591 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001592 used by a wavefront,
1593 granularity is device
1594 specific:
1595
Tony Tye07d9f102017-11-10 01:00:54 +00001596 GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001597 - max_sgpr 1..112
1598 - roundup((max_sgpg + 1)
1599 / 8) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001600 GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001601 - max_sgpr 1..112
1602 - roundup((max_sgpg + 1)
1603 / 16) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001604
1605 Includes the special SGPRs
1606 for VCC, Flat Scratch (for
1607 GFX7 onwards) and XNACK
1608 (for GFX8 onwards). It does
1609 not include the 16 SGPR
1610 added if a trap handler is
1611 enabled.
1612
1613 Used by CP to set up
1614 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001615 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001616
1617 Start executing wavefront
1618 at the specified priority.
1619
1620 CP is responsible for
1621 filling in
1622 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001623 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001624 with specified rounding
1625 mode for single (32
1626 bit) floating point
1627 precision floating point
1628 operations.
1629
1630 Floating point rounding
1631 mode values are defined in
1632 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1633
1634 Used by CP to set up
1635 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001636 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001637 with specified rounding
1638 denorm mode for half/double (16
1639 and 64 bit) floating point
1640 precision floating point
1641 operations.
1642
1643 Floating point rounding
1644 mode values are defined in
1645 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1646
1647 Used by CP to set up
1648 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001649 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001650 with specified denorm mode
1651 for single (32
1652 bit) floating point
1653 precision floating point
1654 operations.
1655
1656 Floating point denorm mode
1657 values are defined in
1658 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1659
1660 Used by CP to set up
1661 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001662 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001663 with specified denorm mode
1664 for half/double (16
1665 and 64 bit) floating point
1666 precision floating point
1667 operations.
1668
1669 Floating point denorm mode
1670 values are defined in
1671 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1672
1673 Used by CP to set up
1674 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001675 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001676
1677 Start executing wavefront
1678 in privilege trap handler
1679 mode.
1680
1681 CP is responsible for
1682 filling in
1683 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001684 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001685 with DX10 clamp mode
1686 enabled. Used by the vector
Tony Tye6baa6d22017-10-18 22:16:55 +00001687 ALU to force DX10 style
Tony Tyef16a45e2017-06-06 20:31:59 +00001688 treatment of NaN's (when
1689 set, clamp NaN to zero,
1690 otherwise pass NaN
1691 through).
1692
1693 Used by CP to set up
1694 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001695 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001696
1697 Start executing wavefront
1698 in single step mode.
1699
1700 CP is responsible for
1701 filling in
1702 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001703 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001704 with IEEE mode
1705 enabled. Floating point
1706 opcodes that support
1707 exception flag gathering
1708 will quiet and propagate
1709 signaling-NaN inputs per
1710 IEEE 754-2008. Min_dx10 and
1711 max_dx10 become IEEE
1712 754-2008 compliant due to
1713 signaling-NaN propagation
1714 and quieting.
1715
1716 Used by CP to set up
1717 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001718 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001719
1720 Only one work-group allowed
1721 to execute on a compute
1722 unit.
1723
1724 CP is responsible for
1725 filling in
1726 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001727 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001728
1729 Flag that can be used to
1730 control debugging code.
1731
1732 CP is responsible for
1733 filling in
1734 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye07d9f102017-11-10 01:00:54 +00001735 26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001736 Reserved, must be 0.
1737 GFX9
1738 Wavefront starts execution
1739 with specified fp16 overflow
1740 mode.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001741
Tony Tye6baa6d22017-10-18 22:16:55 +00001742 - If 0, fp16 overflow generates
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001743 +/-INF values.
Tony Tye6baa6d22017-10-18 22:16:55 +00001744 - If 1, fp16 overflow that is the
1745 result of an +/-INF input value
1746 or divide by 0 produces a +/-INF,
1747 otherwise clamps computed
1748 overflow to +/-MAX_FP16 as
1749 appropriate.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001750
1751 Used by CP to set up
1752 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye6baa6d22017-10-18 22:16:55 +00001753 31:27 5 bits Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001754 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001755 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001756
1757..
1758
1759 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1760 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1761
Tony Tye3b340612017-06-07 00:46:08 +00001762 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001763 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001764 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001765 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1766 _WAVE_OFFSET SGPR wave scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001767 system register (see
1768 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1769
1770 Used by CP to set up
1771 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001772 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001773 user data registers
1774 requested. This number must
1775 match the number of user
1776 data registers enabled.
1777
1778 Used by CP to set up
1779 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001780 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tyef16a45e2017-06-06 20:31:59 +00001781 TRAP instruction which
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00001782 requires a trap handler to
Tony Tyef16a45e2017-06-06 20:31:59 +00001783 be enabled.
1784
1785 CP sets
1786 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1787 if the runtime has
1788 installed a trap handler
1789 regardless of the setting
1790 of this field.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001791 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001792 system SGPR register for
1793 the work-group id in the X
1794 dimension (see
1795 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1796
1797 Used by CP to set up
1798 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001799 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001800 system SGPR register for
1801 the work-group id in the Y
1802 dimension (see
1803 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1804
1805 Used by CP to set up
1806 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001807 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001808 system SGPR register for
1809 the work-group id in the Z
1810 dimension (see
1811 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1812
1813 Used by CP to set up
1814 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001815 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001816 system SGPR register for
1817 work-group information (see
1818 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1819
1820 Used by CP to set up
1821 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001822 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001823 VGPR system registers used
1824 for the work-item ID.
1825 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1826 defines the values.
1827
1828 Used by CP to set up
1829 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001830 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001831
1832 Wavefront starts execution
1833 with address watch
1834 exceptions enabled which
1835 are generated when L1 has
1836 witnessed a thread access
1837 an *address of
1838 interest*.
1839
1840 CP is responsible for
1841 filling in the address
1842 watch bit in
1843 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1844 according to what the
1845 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001846 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001847
1848 Wavefront starts execution
1849 with memory violation
1850 exceptions exceptions
1851 enabled which are generated
1852 when a memory violation has
1853 occurred for this wave from
1854 L1 or LDS
1855 (write-to-read-only-memory,
1856 mis-aligned atomic, LDS
1857 address out of range,
1858 illegal address, etc.).
1859
1860 CP sets the memory
1861 violation bit in
1862 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1863 according to what the
1864 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001865 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001866
1867 CP uses the rounded value
1868 from the dispatch packet,
1869 not this value, as the
1870 dispatch may contain
1871 dynamically allocated group
1872 segment memory. CP writes
1873 directly to
1874 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1875
1876 Amount of group segment
1877 (LDS) to allocate for each
1878 work-group. Granularity is
1879 device specific:
1880
1881 GFX6:
1882 roundup(lds-size / (64 * 4))
1883 GFX7-GFX9:
1884 roundup(lds-size / (128 * 4))
1885
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001886 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1887 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001888 enabled.
1889
1890 Used by CP to set up
1891 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1892 (set from bits 0..6).
1893
1894 IEEE 754 FP Invalid
1895 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001896 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1897 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001898 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001899 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1900 _DIVISION_BY_ZERO Zero
1901 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1902 _OVERFLOW
1903 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1904 _UNDERFLOW
1905 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1906 _INEXACT
1907 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1908 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001909 only)
Tony Tye6baa6d22017-10-18 22:16:55 +00001910 31 1 bit Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001911 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001912 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001913
1914..
1915
1916 .. table:: Floating Point Rounding Mode Enumeration Values
1917 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1918
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001919 ====================================== ===== ==============================
1920 Enumeration Name Value Description
1921 ====================================== ===== ==============================
1922 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1923 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1924 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1925 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1926 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001927
1928..
1929
1930 .. table:: Floating Point Denorm Mode Enumeration Values
1931 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1932
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001933 ====================================== ===== ==============================
1934 Enumeration Name Value Description
1935 ====================================== ===== ==============================
1936 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1937 Denorms
1938 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1939 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1940 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1941 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001942
1943..
1944
1945 .. table:: System VGPR Work-Item ID Enumeration Values
1946 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1947
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001948 ======================================== ===== ============================
1949 Enumeration Name Value Description
1950 ======================================== ===== ============================
1951 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1952 ID.
1953 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1954 dimensions ID.
1955 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1956 dimensions ID.
1957 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1958 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001959
1960.. _amdgpu-amdhsa-initial-kernel-execution-state:
1961
1962Initial Kernel Execution State
1963~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1964
1965This section defines the register state that will be set up by the packet
1966processor prior to the start of execution of every wavefront. This is limited by
1967the constraints of the hardware controllers of CP/ADC/SPI.
1968
1969The order of the SGPR registers is defined, but the compiler can specify which
1970ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1971fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1972for enabled registers are dense starting at SGPR0: the first enabled register is
1973SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1974an SGPR number.
1975
1976The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1977all waves of the grid. It is possible to specify more than 16 User SGPRs using
1978the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1979initialized. These are then immediately followed by the System SGPRs that are
1980set up by ADC/SPI and can have different values for each wave of the grid
1981dispatch.
1982
1983SGPR register initial state is defined in
1984:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1985
1986 .. table:: SGPR Register Set Up Order
1987 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1988
1989 ========== ========================== ====== ==============================
1990 SGPR Order Name Number Description
1991 (kernel descriptor enable of
1992 field) SGPRs
1993 ========== ========================== ====== ==============================
1994 First Private Segment Buffer 4 V# that can be used, together
1995 (enable_sgpr_private with Scratch Wave Offset as an
1996 _segment_buffer) offset, to access the private
1997 memory space using a segment
1998 address.
1999
2000 CP uses the value provided by
2001 the runtime.
2002 then Dispatch Ptr 2 64 bit address of AQL dispatch
2003 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2004 actually executing.
2005 then Queue Ptr 2 64 bit address of amd_queue_t
2006 (enable_sgpr_queue_ptr) object for AQL queue on which
2007 the dispatch packet was
2008 queued.
2009 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2010 (enable_sgpr_kernarg segment. This is directly
2011 _segment_ptr) copied from the
2012 kernarg_address in the kernel
2013 dispatch packet.
2014
2015 Having CP load it once avoids
2016 loading it at the beginning of
2017 every wavefront.
2018 then Dispatch Id 2 64 bit Dispatch ID of the
2019 (enable_sgpr_dispatch_id) dispatch packet being
2020 executed.
2021 then Flat Scratch Init 2 This is 2 SGPRs:
2022 (enable_sgpr_flat_scratch
2023 _init) GFX6
2024 Not supported.
2025 GFX7-GFX8
2026 The first SGPR is a 32 bit
2027 byte offset from
2028 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2029 to per SPI base of memory
2030 for scratch for the queue
2031 executing the kernel
2032 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00002033 from the runtime. (The
2034 Scratch Segment Buffer base
2035 address is
2036 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2037 plus this offset.) The value
2038 of Scratch Wave Offset must
2039 be added to this offset by
2040 the kernel machine code,
2041 right shifted by 8, and
2042 moved to the FLAT_SCRATCH_HI
2043 SGPR register.
2044 FLAT_SCRATCH_HI corresponds
2045 to SGPRn-4 on GFX7, and
2046 SGPRn-6 on GFX8 (where SGPRn
2047 is the highest numbered SGPR
2048 allocated to the wave).
2049 FLAT_SCRATCH_HI is
2050 multiplied by 256 (as it is
2051 in units of 256 bytes) and
2052 added to
2053 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2054 to calculate the per wave
2055 FLAT SCRATCH BASE in flat
2056 memory instructions that
2057 access the scratch
2058 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002059
2060 The second SGPR is 32 bit
2061 byte size of a single
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002062 work-item's scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002063 usage. CP obtains this from
2064 the runtime, and it is
2065 always a multiple of DWORD.
2066 CP checks that the value in
2067 the kernel dispatch packet
2068 Private Segment Byte Size is
2069 not larger, and requests the
2070 runtime to increase the
2071 queue's scratch size if
2072 necessary. The kernel code
2073 must move it to
2074 FLAT_SCRATCH_LO which is
2075 SGPRn-3 on GFX7 and SGPRn-5
2076 on GFX8. FLAT_SCRATCH_LO is
2077 used as the FLAT SCRATCH
2078 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002079 instructions. Having CP load
2080 it once avoids loading it at
2081 the beginning of every
Tony Tyef59d0712017-11-10 20:51:43 +00002082 wavefront.
2083 GFX9
2084 This is the
Tony Tye46d35762017-08-15 20:47:41 +00002085 64 bit base address of the
2086 per SPI scratch backing
2087 memory managed by SPI for
2088 the queue executing the
2089 kernel dispatch. CP obtains
2090 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002091 divides it if there are
2092 multiple Shader Arrays each
2093 with its own SPI). The value
2094 of Scratch Wave Offset must
2095 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002096 machine code and the result
2097 moved to the FLAT_SCRATCH
2098 SGPR which is SGPRn-6 and
2099 SGPRn-5. It is used as the
2100 FLAT SCRATCH BASE in flat
Tony Tyef59d0712017-11-10 20:51:43 +00002101 memory instructions.
2102 then Private Segment Size 1 The 32 bit byte size of a
2103 (enable_sgpr_private single
2104 work-item's
2105 scratch_segment_size) memory
2106 allocation. This is the
2107 value from the kernel
2108 dispatch packet Private
2109 Segment Byte Size rounded up
2110 by CP to a multiple of
2111 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002112
2113 Having CP load it once avoids
2114 loading it at the beginning of
2115 every wavefront.
2116
2117 This is not used for
2118 GFX7-GFX8 since it is the same
2119 value as the second SGPR of
2120 Flat Scratch Init. However, it
2121 may be needed for GFX9 which
2122 changes the meaning of the
2123 Flat Scratch Init value.
2124 then Grid Work-Group Count X 1 32 bit count of the number of
2125 (enable_sgpr_grid work-groups in the X dimension
2126 _workgroup_count_X) for the grid being
2127 executed. Computed from the
2128 fields in the kernel dispatch
2129 packet as ((grid_size.x +
2130 workgroup_size.x - 1) /
2131 workgroup_size.x).
2132 then Grid Work-Group Count Y 1 32 bit count of the number of
2133 (enable_sgpr_grid work-groups in the Y dimension
2134 _workgroup_count_Y && for the grid being
2135 less than 16 previous executed. Computed from the
2136 SGPRs) fields in the kernel dispatch
2137 packet as ((grid_size.y +
2138 workgroup_size.y - 1) /
2139 workgroupSize.y).
2140
2141 Only initialized if <16
2142 previous SGPRs initialized.
2143 then Grid Work-Group Count Z 1 32 bit count of the number of
2144 (enable_sgpr_grid work-groups in the Z dimension
2145 _workgroup_count_Z && for the grid being
2146 less than 16 previous executed. Computed from the
2147 SGPRs) fields in the kernel dispatch
2148 packet as ((grid_size.z +
2149 workgroup_size.z - 1) /
2150 workgroupSize.z).
2151
2152 Only initialized if <16
2153 previous SGPRs initialized.
2154 then Work-Group Id X 1 32 bit work-group id in X
2155 (enable_sgpr_workgroup_id dimension of grid for
2156 _X) wavefront.
2157 then Work-Group Id Y 1 32 bit work-group id in Y
2158 (enable_sgpr_workgroup_id dimension of grid for
2159 _Y) wavefront.
2160 then Work-Group Id Z 1 32 bit work-group id in Z
2161 (enable_sgpr_workgroup_id dimension of grid for
2162 _Z) wavefront.
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002163 then Work-Group Info 1 {first_wave, 14'b0000,
Tony Tyef16a45e2017-06-06 20:31:59 +00002164 (enable_sgpr_workgroup ordered_append_term[10:0],
2165 _info) threadgroup_size_in_waves[5:0]}
2166 then Scratch Wave Offset 1 32 bit byte offset from base
2167 (enable_sgpr_private of scratch base of queue
2168 _segment_wave_offset) executing the kernel
2169 dispatch. Must be used as an
2170 offset with Private
2171 segment address when using
2172 Scratch Segment Buffer. It
2173 must be used to set up FLAT
2174 SCRATCH for flat addressing
2175 (see
2176 :ref:`amdgpu-amdhsa-flat-scratch`).
2177 ========== ========================== ====== ==============================
2178
2179The order of the VGPR registers is defined, but the compiler can specify which
2180ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2181fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2182for enabled registers are dense starting at VGPR0: the first enabled register is
2183VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2184VGPR number.
2185
2186VGPR register initial state is defined in
2187:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2188
2189 .. table:: VGPR Register Set Up Order
2190 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2191
2192 ========== ========================== ====== ==============================
2193 VGPR Order Name Number Description
2194 (kernel descriptor enable of
2195 field) VGPRs
2196 ========== ========================== ====== ==============================
2197 First Work-Item Id X 1 32 bit work item id in X
2198 (Always initialized) dimension of work-group for
2199 wavefront lane.
2200 then Work-Item Id Y 1 32 bit work item id in Y
2201 (enable_vgpr_workitem_id dimension of work-group for
2202 > 0) wavefront lane.
2203 then Work-Item Id Z 1 32 bit work item id in Z
2204 (enable_vgpr_workitem_id dimension of work-group for
2205 > 1) wavefront lane.
2206 ========== ========================== ====== ==============================
2207
2208The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2209
22101. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2211 registers.
22122. Work-group Id registers X, Y, Z are set by ADC which supports any
2213 combination including none.
22143. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2215 cannot included with the flat scratch init value which is per queue.
22164. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2217 or (X, Y, Z).
2218
2219Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2220value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2221
2222The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye07d9f102017-11-10 01:00:54 +00002223has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tyef16a45e2017-06-06 20:31:59 +00002224instructions (GFX9).
2225
2226If buffer operations are used then the compiler can generate a V# with the
2227following properties:
2228
2229* base address of 0
2230* no swizzle
2231* ATC: 1 if IOMMU present (such as APU)
2232* ptr64: 1
2233* MTYPE set to support memory coherence that matches the runtime (such as CC for
2234 APU and NC for dGPU).
2235
2236.. _amdgpu-amdhsa-kernel-prolog:
2237
2238Kernel Prolog
2239~~~~~~~~~~~~~
2240
2241.. _amdgpu-amdhsa-m0:
2242
2243M0
2244++
2245
2246GFX6-GFX8
2247 The M0 register must be initialized with a value at least the total LDS size
2248 if the kernel may access LDS via DS or flat operations. Total LDS size is
2249 available in dispatch packet. For M0, it is also possible to use maximum
2250 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2251 GFX7-GFX8).
2252GFX9
2253 The M0 register is not used for range checking LDS accesses and so does not
2254 need to be initialized in the prolog.
2255
2256.. _amdgpu-amdhsa-flat-scratch:
2257
2258Flat Scratch
2259++++++++++++
2260
2261If the kernel may use flat operations to access scratch memory, the prolog code
2262must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2263are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2264Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2265
2266GFX6
2267 Flat scratch is not supported.
2268
Tony Tye07d9f102017-11-10 01:00:54 +00002269GFX7-GFX8
Tony Tyef16a45e2017-06-06 20:31:59 +00002270 1. The low word of Flat Scratch Init is 32 bit byte offset from
2271 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2272 being managed by SPI for the queue executing the kernel dispatch. This is
2273 the same value used in the Scratch Segment Buffer V# base address. The
2274 prolog must add the value of Scratch Wave Offset to get the wave's byte
2275 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2276 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2277 by 8 before moving into FLAT_SCRATCH_LO.
2278 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2279 work-items scratch memory usage. This is directly loaded from the kernel
2280 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2281 DWORD. Having CP load it once avoids loading it at the beginning of every
2282 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2283 SIZE.
Tony Tyef59d0712017-11-10 20:51:43 +00002284
Tony Tyef16a45e2017-06-06 20:31:59 +00002285GFX9
2286 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2287 memory being managed by SPI for the queue executing the kernel dispatch. The
2288 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2289 pair for use as the flat scratch base in flat memory instructions.
2290
2291.. _amdgpu-amdhsa-memory-model:
2292
2293Memory Model
2294~~~~~~~~~~~~
2295
2296This section describes the mapping of LLVM memory model onto AMDGPU machine code
2297(see :ref:`memmodel`). *The implementation is WIP.*
2298
2299.. TODO
2300 Update when implementation complete.
2301
Tony Tyef16a45e2017-06-06 20:31:59 +00002302The AMDGPU backend supports the memory synchronization scopes specified in
2303:ref:`amdgpu-memory-scopes`.
2304
2305The code sequences used to implement the memory model are defined in table
2306:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2307
2308The sequences specify the order of instructions that a single thread must
2309execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2310to other memory instructions executed by the same thread. This allows them to be
2311moved earlier or later which can allow them to be combined with other instances
2312of the same instruction, or hoisted/sunk out of loops to improve
2313performance. Only the instructions related to the memory model are given;
2314additional ``s_waitcnt`` instructions are required to ensure registers are
2315defined before being used. These may be able to be combined with the memory
2316model ``s_waitcnt`` instructions as described above.
2317
Tony Tye6baa6d22017-10-18 22:16:55 +00002318The AMDGPU backend supports the following memory models:
2319
2320 HSA Memory Model [HSA]_
2321 The HSA memory model uses a single happens-before relation for all address
2322 spaces (see :ref:`amdgpu-address-spaces`).
2323 OpenCL Memory Model [OpenCL]_
2324 The OpenCL memory model which has separate happens-before relations for the
2325 global and local address spaces. Only a fence specifying both global and
2326 local address space, and seq_cst instructions join the relationships. Since
2327 the LLVM ``memfence`` instruction does not allow an address space to be
2328 specified the OpenCL fence has to convervatively assume both local and
2329 global address space was specified. However, optimizations can often be
2330 done to eliminate the additional ``s_waitcnt`` instructions when there are
2331 no intervening memory instructions which access the corresponding address
2332 space. The code sequences in the table indicate what can be omitted for the
2333 OpenCL memory. The target triple environment is used to determine if the
2334 source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tyef16a45e2017-06-06 20:31:59 +00002335
2336``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2337operations.
2338
2339``buffer/global/flat_load/store/atomic`` instructions to global memory are
2340termed vector memory operations.
2341
2342For GFX6-GFX9:
2343
2344* Each agent has multiple compute units (CU).
2345* Each CU has multiple SIMDs that execute wavefronts.
2346* The wavefronts for a single work-group are executed in the same CU but may be
2347 executed by different SIMDs.
2348* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2349 executing on it.
2350* All LDS operations of a CU are performed as wavefront wide operations in a
2351 global order and involve no caching. Completion is reported to a wavefront in
2352 execution order.
2353* The LDS memory has multiple request queues shared by the SIMDs of a
2354 CU. Therefore, the LDS operations performed by different waves of a work-group
2355 can be reordered relative to each other, which can result in reordering the
2356 visibility of vector memory operations with respect to LDS operations of other
2357 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002358 ensure synchronization between LDS operations and vector memory operations
Tony Tyef16a45e2017-06-06 20:31:59 +00002359 between waves of a work-group, but not between operations performed by the
2360 same wavefront.
2361* The vector memory operations are performed as wavefront wide operations and
2362 completion is reported to a wavefront in execution order. The exception is
Tony Tye07d9f102017-11-10 01:00:54 +00002363 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tyef16a45e2017-06-06 20:31:59 +00002364 vector memory order if they access LDS memory, and out of LDS operation order
2365 if they access global memory.
Tony Tye6baa6d22017-10-18 22:16:55 +00002366* The vector memory operations access a single vector L1 cache shared by all
2367 SIMDs a CU. Therefore, no special action is required for coherence between the
2368 lanes of a single wavefront, or for coherence between wavefronts in the same
2369 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2370 executing in different work-groups as they may be executing on different CUs.
Tony Tyef16a45e2017-06-06 20:31:59 +00002371* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2372 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2373 scalar operations are used in a restricted way so do not impact the memory
2374 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2375* The vector and scalar memory operations use an L2 cache shared by all CUs on
2376 the same agent.
2377* The L2 cache has independent channels to service disjoint ranges of virtual
2378 addresses.
2379* Each CU has a separate request queue per channel. Therefore, the vector and
2380 scalar memory operations performed by waves executing in different work-groups
2381 (which may be executing on different CUs) of an agent can be reordered
2382 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002383 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002384 previous vector memory operation has completed before executing a subsequent
2385 vector memory or LDS operation and so can be used to meet the requirements of
2386 acquire and release.
2387* The L2 cache can be kept coherent with other agents on some targets, or ranges
2388 of virtual addresses can be set up to bypass it to ensure system coherence.
2389
Tony Tye07d9f102017-11-10 01:00:54 +00002390Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tyef16a45e2017-06-06 20:31:59 +00002391or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2392memory, atomic memory orderings are not meaningful and all accesses are treated
2393as non-atomic.
2394
2395Constant address space uses ``buffer/global_load`` instructions (or equivalent
2396scalar memory instructions). Since the constant address space contents do not
2397change during the execution of a kernel dispatch it is not legal to perform
2398stores, and atomic memory orderings are not meaningful and all access are
2399treated as non-atomic.
2400
2401A memory synchronization scope wider than work-group is not meaningful for the
2402group (LDS) address space and is treated as work-group.
2403
2404The memory model does not support the region address space which is treated as
2405non-atomic.
2406
2407Acquire memory ordering is not meaningful on store atomic instructions and is
2408treated as non-atomic.
2409
2410Release memory ordering is not meaningful on load atomic instructions and is
2411treated a non-atomic.
2412
2413Acquire-release memory ordering is not meaningful on load or store atomic
2414instructions and is treated as acquire and release respectively.
2415
2416AMDGPU backend only uses scalar memory operations to access memory that is
2417proven to not change during the execution of the kernel dispatch. This includes
2418constant address space and global address space for program scope const
2419variables. Therefore the kernel machine code does not have to maintain the
2420scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2421and vector L1 caches are invalidated between kernel dispatches by CP since
2422constant address space data may change between kernel dispatch executions. See
2423:ref:`amdgpu-amdhsa-memory-spaces`.
2424
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002425The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002426case the AMDGPU backend ensures the memory location used to spill is never
2427accessed by vector memory operations at the same time. If scalar writes are used
2428then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2429return since the locations may be used for vector memory instructions by a
2430future wave that uses the same scratch area, or a function call that creates a
2431frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2432as all scalar writes are write-before-read in the same thread.
2433
Tony Tye6baa6d22017-10-18 22:16:55 +00002434Scratch backing memory (which is used for the private address space)
2435is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2436address space is only accessed by a single thread, and is always
2437write-before-read, there is never a need to invalidate these entries from the L1
2438cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2439volatile cache lines.
Tony Tyef16a45e2017-06-06 20:31:59 +00002440
2441On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye6baa6d22017-10-18 22:16:55 +00002442to invalidate the L2 cache. This also causes it to be treated as
2443non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2444(cache coherent) and so the L2 cache will coherent with the CPU and other
2445agents.
Tony Tyef16a45e2017-06-06 20:31:59 +00002446
2447 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2448 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2449
Tony Tye6baa6d22017-10-18 22:16:55 +00002450 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002451 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2452 Ordering Sync Scope Address
2453 Space
Tony Tye6baa6d22017-10-18 22:16:55 +00002454 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002455 **Non-Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002456 -----------------------------------------------------------------------------------
2457 load *none* *none* - global - !volatile & !nontemporal
2458 - generic
2459 - private 1. buffer/global/flat_load
2460 - constant
2461 - volatile & !nontemporal
2462
Tony Tyef16a45e2017-06-06 20:31:59 +00002463 1. buffer/global/flat_load
2464 glc=1
Tony Tye6baa6d22017-10-18 22:16:55 +00002465
2466 - nontemporal
2467
2468 1. buffer/global/flat_load
2469 glc=1 slc=1
2470
Tony Tyef16a45e2017-06-06 20:31:59 +00002471 load *none* *none* - local 1. ds_load
Tony Tye6baa6d22017-10-18 22:16:55 +00002472 store *none* *none* - global - !nontemporal
Tony Tyef16a45e2017-06-06 20:31:59 +00002473 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002474 - private 1. buffer/global/flat_store
2475 - constant
2476 - nontemporal
2477
2478 1. buffer/global/flat_stote
2479 glc=1 slc=1
2480
Tony Tyef16a45e2017-06-06 20:31:59 +00002481 store *none* *none* - local 1. ds_store
2482 **Unordered Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002483 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002484 load atomic unordered *any* *any* *Same as non-atomic*.
2485 store atomic unordered *any* *any* *Same as non-atomic*.
2486 atomicrmw unordered *any* *any* *Same as monotonic
2487 atomic*.
2488 **Monotonic Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002489 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002490 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2491 - wavefront - generic
2492 - workgroup
2493 load atomic monotonic - singlethread - local 1. ds_load
2494 - wavefront
2495 - workgroup
2496 load atomic monotonic - agent - global 1. buffer/global/flat_load
2497 - system - generic glc=1
2498 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2499 - wavefront - generic
2500 - workgroup
2501 - agent
2502 - system
2503 store atomic monotonic - singlethread - local 1. ds_store
2504 - wavefront
2505 - workgroup
2506 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2507 - wavefront - generic
2508 - workgroup
2509 - agent
2510 - system
2511 atomicrmw monotonic - singlethread - local 1. ds_atomic
2512 - wavefront
2513 - workgroup
2514 **Acquire Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002515 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002516 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2517 - wavefront - local
2518 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002519 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2520 load atomic acquire - workgroup - local 1. ds_load
2521 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002522
Tony Tye6baa6d22017-10-18 22:16:55 +00002523 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002524 - Must happen before
2525 any following
2526 global/generic
2527 load/load
2528 atomic/store/store
2529 atomic/atomicrmw.
2530 - Ensures any
2531 following global
2532 data read is no
2533 older than the load
2534 atomic value being
2535 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00002536 load atomic acquire - workgroup - generic 1. flat_load
2537 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002538
Tony Tye6baa6d22017-10-18 22:16:55 +00002539 - If OpenCL, omit.
2540 - Must happen before
2541 any following
2542 global/generic
2543 load/load
2544 atomic/store/store
2545 atomic/atomicrmw.
2546 - Ensures any
2547 following global
2548 data read is no
2549 older than the load
2550 atomic value being
2551 acquired.
2552 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tyef16a45e2017-06-06 20:31:59 +00002553 - system glc=1
2554 2. s_waitcnt vmcnt(0)
2555
2556 - Must happen before
2557 following
2558 buffer_wbinvl1_vol.
2559 - Ensures the load
2560 has completed
2561 before invalidating
2562 the cache.
2563
2564 3. buffer_wbinvl1_vol
2565
2566 - Must happen before
2567 any following
2568 global/generic
2569 load/load
2570 atomic/atomicrmw.
2571 - Ensures that
2572 following
2573 loads will not see
2574 stale global data.
2575
2576 load atomic acquire - agent - generic 1. flat_load glc=1
2577 - system 2. s_waitcnt vmcnt(0) &
2578 lgkmcnt(0)
2579
2580 - If OpenCL omit
2581 lgkmcnt(0).
2582 - Must happen before
2583 following
2584 buffer_wbinvl1_vol.
2585 - Ensures the flat_load
2586 has completed
2587 before invalidating
2588 the cache.
2589
2590 3. buffer_wbinvl1_vol
2591
2592 - Must happen before
2593 any following
2594 global/generic
2595 load/load
2596 atomic/atomicrmw.
2597 - Ensures that
2598 following loads
2599 will not see stale
2600 global data.
2601
2602 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2603 - wavefront - local
2604 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002605 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2606 atomicrmw acquire - workgroup - local 1. ds_atomic
2607 2. waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002608
Tony Tye6baa6d22017-10-18 22:16:55 +00002609 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002610 - Must happen before
2611 any following
2612 global/generic
2613 load/load
2614 atomic/store/store
2615 atomic/atomicrmw.
2616 - Ensures any
2617 following global
2618 data read is no
2619 older than the
2620 atomicrmw value
2621 being acquired.
2622
Tony Tye6baa6d22017-10-18 22:16:55 +00002623 atomicrmw acquire - workgroup - generic 1. flat_atomic
2624 2. waitcnt lgkmcnt(0)
2625
2626 - If OpenCL, omit.
2627 - Must happen before
2628 any following
2629 global/generic
2630 load/load
2631 atomic/store/store
2632 atomic/atomicrmw.
2633 - Ensures any
2634 following global
2635 data read is no
2636 older than the
2637 atomicrmw value
2638 being acquired.
2639
2640 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002641 - system 2. s_waitcnt vmcnt(0)
2642
2643 - Must happen before
2644 following
2645 buffer_wbinvl1_vol.
2646 - Ensures the
2647 atomicrmw has
2648 completed before
2649 invalidating the
2650 cache.
2651
2652 3. buffer_wbinvl1_vol
2653
2654 - Must happen before
2655 any following
2656 global/generic
2657 load/load
2658 atomic/atomicrmw.
2659 - Ensures that
2660 following loads
2661 will not see stale
2662 global data.
2663
2664 atomicrmw acquire - agent - generic 1. flat_atomic
2665 - system 2. s_waitcnt vmcnt(0) &
2666 lgkmcnt(0)
2667
2668 - If OpenCL, omit
2669 lgkmcnt(0).
2670 - Must happen before
2671 following
2672 buffer_wbinvl1_vol.
2673 - Ensures the
2674 atomicrmw has
2675 completed before
2676 invalidating the
2677 cache.
2678
2679 3. buffer_wbinvl1_vol
2680
2681 - Must happen before
2682 any following
2683 global/generic
2684 load/load
2685 atomic/atomicrmw.
2686 - Ensures that
2687 following loads
2688 will not see stale
2689 global data.
2690
2691 fence acquire - singlethread *none* *none*
2692 - wavefront
2693 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2694
2695 - If OpenCL and
2696 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002697 not generic, omit.
2698 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002699 currently has no
2700 address space on
2701 the fence need to
2702 conservatively
2703 always generate. If
2704 fence had an
2705 address space then
2706 set to address
2707 space of OpenCL
2708 fence flag, or to
2709 generic if both
2710 local and global
2711 flags are
2712 specified.
2713 - Must happen after
2714 any preceding
2715 local/generic load
2716 atomic/atomicrmw
2717 with an equal or
2718 wider sync scope
2719 and memory ordering
2720 stronger than
2721 unordered (this is
2722 termed the
2723 fence-paired-atomic).
2724 - Must happen before
2725 any following
2726 global/generic
2727 load/load
2728 atomic/store/store
2729 atomic/atomicrmw.
2730 - Ensures any
2731 following global
2732 data read is no
2733 older than the
2734 value read by the
2735 fence-paired-atomic.
2736
Tony Tye6baa6d22017-10-18 22:16:55 +00002737 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2738 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002739
2740 - If OpenCL and
2741 address space is
2742 not generic, omit
2743 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002744 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002745 currently has no
2746 address space on
2747 the fence need to
2748 conservatively
2749 always generate
2750 (see comment for
2751 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002752 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002753 separate s_waitcnt
2754 vmcnt(0) and
2755 s_waitcnt
2756 lgkmcnt(0) to allow
2757 them to be
2758 independently moved
2759 according to the
2760 following rules.
2761 - s_waitcnt vmcnt(0)
2762 must happen after
2763 any preceding
2764 global/generic load
2765 atomic/atomicrmw
2766 with an equal or
2767 wider sync scope
2768 and memory ordering
2769 stronger than
2770 unordered (this is
2771 termed the
2772 fence-paired-atomic).
2773 - s_waitcnt lgkmcnt(0)
2774 must happen after
2775 any preceding
Tony Tye6baa6d22017-10-18 22:16:55 +00002776 local/generic load
Tony Tyef16a45e2017-06-06 20:31:59 +00002777 atomic/atomicrmw
2778 with an equal or
2779 wider sync scope
2780 and memory ordering
2781 stronger than
2782 unordered (this is
2783 termed the
2784 fence-paired-atomic).
2785 - Must happen before
2786 the following
2787 buffer_wbinvl1_vol.
2788 - Ensures that the
2789 fence-paired atomic
2790 has completed
2791 before invalidating
2792 the
2793 cache. Therefore
2794 any following
2795 locations read must
2796 be no older than
2797 the value read by
2798 the
2799 fence-paired-atomic.
2800
2801 2. buffer_wbinvl1_vol
2802
Tony Tye6baa6d22017-10-18 22:16:55 +00002803 - Must happen before any
2804 following global/generic
Tony Tyef16a45e2017-06-06 20:31:59 +00002805 load/load
2806 atomic/store/store
2807 atomic/atomicrmw.
2808 - Ensures that
2809 following loads
2810 will not see stale
2811 global data.
2812
2813 **Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002814 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002815 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2816 - wavefront - local
2817 - generic
2818 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002819
2820 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002821 - Must happen after
2822 any preceding
2823 local/generic
2824 load/store/load
2825 atomic/store
2826 atomic/atomicrmw.
2827 - Must happen before
2828 the following
2829 store.
2830 - Ensures that all
2831 memory operations
2832 to local have
2833 completed before
2834 performing the
2835 store that is being
2836 released.
2837
2838 2. buffer/global/flat_store
2839 store atomic release - workgroup - local 1. ds_store
Tony Tye6baa6d22017-10-18 22:16:55 +00002840 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2841
2842 - If OpenCL, omit.
2843 - Must happen after
2844 any preceding
2845 local/generic
2846 load/store/load
2847 atomic/store
2848 atomic/atomicrmw.
2849 - Must happen before
2850 the following
2851 store.
2852 - Ensures that all
2853 memory operations
2854 to local have
2855 completed before
2856 performing the
2857 store that is being
2858 released.
2859
2860 2. flat_store
2861 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2862 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002863
2864 - If OpenCL, omit
2865 lgkmcnt(0).
2866 - Could be split into
2867 separate s_waitcnt
2868 vmcnt(0) and
2869 s_waitcnt
2870 lgkmcnt(0) to allow
2871 them to be
2872 independently moved
2873 according to the
2874 following rules.
2875 - s_waitcnt vmcnt(0)
2876 must happen after
2877 any preceding
2878 global/generic
2879 load/store/load
2880 atomic/store
2881 atomic/atomicrmw.
2882 - s_waitcnt lgkmcnt(0)
2883 must happen after
2884 any preceding
2885 local/generic
2886 load/store/load
2887 atomic/store
2888 atomic/atomicrmw.
2889 - Must happen before
2890 the following
2891 store.
2892 - Ensures that all
2893 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00002894 to memory have
Tony Tyef16a45e2017-06-06 20:31:59 +00002895 completed before
2896 performing the
2897 store that is being
2898 released.
2899
2900 2. buffer/global/ds/flat_store
2901 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2902 - wavefront - local
2903 - generic
2904 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002905
2906 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002907 - Must happen after
2908 any preceding
2909 local/generic
2910 load/store/load
2911 atomic/store
2912 atomic/atomicrmw.
2913 - Must happen before
2914 the following
2915 atomicrmw.
2916 - Ensures that all
2917 memory operations
2918 to local have
2919 completed before
2920 performing the
2921 atomicrmw that is
2922 being released.
2923
2924 2. buffer/global/flat_atomic
2925 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye6baa6d22017-10-18 22:16:55 +00002926 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2927
2928 - If OpenCL, omit.
2929 - Must happen after
2930 any preceding
2931 local/generic
2932 load/store/load
2933 atomic/store
2934 atomic/atomicrmw.
2935 - Must happen before
2936 the following
2937 atomicrmw.
2938 - Ensures that all
2939 memory operations
2940 to local have
2941 completed before
2942 performing the
2943 atomicrmw that is
2944 being released.
2945
2946 2. flat_atomic
2947 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
2948 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002949
2950 - If OpenCL, omit
2951 lgkmcnt(0).
2952 - Could be split into
2953 separate s_waitcnt
2954 vmcnt(0) and
2955 s_waitcnt
2956 lgkmcnt(0) to allow
2957 them to be
2958 independently moved
2959 according to the
2960 following rules.
2961 - s_waitcnt vmcnt(0)
2962 must happen after
2963 any preceding
2964 global/generic
2965 load/store/load
2966 atomic/store
2967 atomic/atomicrmw.
2968 - s_waitcnt lgkmcnt(0)
2969 must happen after
2970 any preceding
2971 local/generic
2972 load/store/load
2973 atomic/store
2974 atomic/atomicrmw.
2975 - Must happen before
2976 the following
2977 atomicrmw.
2978 - Ensures that all
2979 memory operations
2980 to global and local
2981 have completed
2982 before performing
2983 the atomicrmw that
2984 is being released.
2985
Tony Tye6baa6d22017-10-18 22:16:55 +00002986 2. buffer/global/ds/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002987 fence release - singlethread *none* *none*
2988 - wavefront
2989 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2990
2991 - If OpenCL and
2992 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002993 not generic, omit.
2994 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002995 currently has no
2996 address space on
2997 the fence need to
2998 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00002999 always generate. If
3000 fence had an
3001 address space then
3002 set to address
3003 space of OpenCL
3004 fence flag, or to
3005 generic if both
3006 local and global
3007 flags are
3008 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003009 - Must happen after
3010 any preceding
3011 local/generic
3012 load/load
3013 atomic/store/store
3014 atomic/atomicrmw.
3015 - Must happen before
3016 any following store
3017 atomic/atomicrmw
3018 with an equal or
3019 wider sync scope
3020 and memory ordering
3021 stronger than
3022 unordered (this is
3023 termed the
3024 fence-paired-atomic).
3025 - Ensures that all
3026 memory operations
3027 to local have
3028 completed before
3029 performing the
3030 following
3031 fence-paired-atomic.
3032
Tony Tye6baa6d22017-10-18 22:16:55 +00003033 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3034 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003035
3036 - If OpenCL and
3037 address space is
3038 not generic, omit
3039 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003040 - If OpenCL and
3041 address space is
3042 local, omit
3043 vmcnt(0).
3044 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003045 currently has no
3046 address space on
3047 the fence need to
3048 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003049 always generate. If
3050 fence had an
3051 address space then
3052 set to address
3053 space of OpenCL
3054 fence flag, or to
3055 generic if both
3056 local and global
3057 flags are
3058 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003059 - Could be split into
3060 separate s_waitcnt
3061 vmcnt(0) and
3062 s_waitcnt
3063 lgkmcnt(0) to allow
3064 them to be
3065 independently moved
3066 according to the
3067 following rules.
3068 - s_waitcnt vmcnt(0)
3069 must happen after
3070 any preceding
3071 global/generic
3072 load/store/load
3073 atomic/store
3074 atomic/atomicrmw.
3075 - s_waitcnt lgkmcnt(0)
3076 must happen after
3077 any preceding
3078 local/generic
3079 load/store/load
3080 atomic/store
3081 atomic/atomicrmw.
3082 - Must happen before
3083 any following store
3084 atomic/atomicrmw
3085 with an equal or
3086 wider sync scope
3087 and memory ordering
3088 stronger than
3089 unordered (this is
3090 termed the
3091 fence-paired-atomic).
3092 - Ensures that all
3093 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00003094 have
Tony Tyef16a45e2017-06-06 20:31:59 +00003095 completed before
3096 performing the
3097 following
3098 fence-paired-atomic.
3099
3100 **Acquire-Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003101 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003102 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3103 - wavefront - local
3104 - generic
3105 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3106
Tony Tye6baa6d22017-10-18 22:16:55 +00003107 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003108 - Must happen after
3109 any preceding
3110 local/generic
3111 load/store/load
3112 atomic/store
3113 atomic/atomicrmw.
3114 - Must happen before
3115 the following
3116 atomicrmw.
3117 - Ensures that all
3118 memory operations
3119 to local have
3120 completed before
3121 performing the
3122 atomicrmw that is
3123 being released.
3124
Tony Tye6baa6d22017-10-18 22:16:55 +00003125 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003126 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3127 2. s_waitcnt lgkmcnt(0)
3128
Tony Tye6baa6d22017-10-18 22:16:55 +00003129 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003130 - Must happen before
3131 any following
3132 global/generic
3133 load/load
3134 atomic/store/store
3135 atomic/atomicrmw.
3136 - Ensures any
3137 following global
3138 data read is no
3139 older than the load
3140 atomic value being
3141 acquired.
3142
3143 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3144
Tony Tye6baa6d22017-10-18 22:16:55 +00003145 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003146 - Must happen after
3147 any preceding
3148 local/generic
3149 load/store/load
3150 atomic/store
3151 atomic/atomicrmw.
3152 - Must happen before
3153 the following
3154 atomicrmw.
3155 - Ensures that all
3156 memory operations
3157 to local have
3158 completed before
3159 performing the
3160 atomicrmw that is
3161 being released.
3162
3163 2. flat_atomic
3164 3. s_waitcnt lgkmcnt(0)
3165
Tony Tye6baa6d22017-10-18 22:16:55 +00003166 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003167 - Must happen before
3168 any following
3169 global/generic
3170 load/load
3171 atomic/store/store
3172 atomic/atomicrmw.
3173 - Ensures any
3174 following global
3175 data read is no
3176 older than the load
3177 atomic value being
3178 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00003179
3180 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3181 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003182
3183 - If OpenCL, omit
3184 lgkmcnt(0).
3185 - Could be split into
3186 separate s_waitcnt
3187 vmcnt(0) and
3188 s_waitcnt
3189 lgkmcnt(0) to allow
3190 them to be
3191 independently moved
3192 according to the
3193 following rules.
3194 - s_waitcnt vmcnt(0)
3195 must happen after
3196 any preceding
3197 global/generic
3198 load/store/load
3199 atomic/store
3200 atomic/atomicrmw.
3201 - s_waitcnt lgkmcnt(0)
3202 must happen after
3203 any preceding
3204 local/generic
3205 load/store/load
3206 atomic/store
3207 atomic/atomicrmw.
3208 - Must happen before
3209 the following
3210 atomicrmw.
3211 - Ensures that all
3212 memory operations
3213 to global have
3214 completed before
3215 performing the
3216 atomicrmw that is
3217 being released.
3218
Tony Tye6baa6d22017-10-18 22:16:55 +00003219 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003220 3. s_waitcnt vmcnt(0)
3221
3222 - Must happen before
3223 following
3224 buffer_wbinvl1_vol.
3225 - Ensures the
3226 atomicrmw has
3227 completed before
3228 invalidating the
3229 cache.
3230
3231 4. buffer_wbinvl1_vol
3232
3233 - Must happen before
3234 any following
3235 global/generic
3236 load/load
3237 atomic/atomicrmw.
3238 - Ensures that
3239 following loads
3240 will not see stale
3241 global data.
3242
Tony Tye6baa6d22017-10-18 22:16:55 +00003243 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3244 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003245
3246 - If OpenCL, omit
3247 lgkmcnt(0).
3248 - Could be split into
3249 separate s_waitcnt
3250 vmcnt(0) and
3251 s_waitcnt
3252 lgkmcnt(0) to allow
3253 them to be
3254 independently moved
3255 according to the
3256 following rules.
3257 - s_waitcnt vmcnt(0)
3258 must happen after
3259 any preceding
3260 global/generic
3261 load/store/load
3262 atomic/store
3263 atomic/atomicrmw.
3264 - s_waitcnt lgkmcnt(0)
3265 must happen after
3266 any preceding
3267 local/generic
3268 load/store/load
3269 atomic/store
3270 atomic/atomicrmw.
3271 - Must happen before
3272 the following
3273 atomicrmw.
3274 - Ensures that all
3275 memory operations
3276 to global have
3277 completed before
3278 performing the
3279 atomicrmw that is
3280 being released.
3281
3282 2. flat_atomic
3283 3. s_waitcnt vmcnt(0) &
3284 lgkmcnt(0)
3285
3286 - If OpenCL, omit
3287 lgkmcnt(0).
3288 - Must happen before
3289 following
3290 buffer_wbinvl1_vol.
3291 - Ensures the
3292 atomicrmw has
3293 completed before
3294 invalidating the
3295 cache.
3296
3297 4. buffer_wbinvl1_vol
3298
3299 - Must happen before
3300 any following
3301 global/generic
3302 load/load
3303 atomic/atomicrmw.
3304 - Ensures that
3305 following loads
3306 will not see stale
3307 global data.
3308
3309 fence acq_rel - singlethread *none* *none*
3310 - wavefront
3311 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3312
3313 - If OpenCL and
3314 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003315 not generic, omit.
3316 - However,
Tony Tyef16a45e2017-06-06 20:31:59 +00003317 since LLVM
3318 currently has no
3319 address space on
3320 the fence need to
3321 conservatively
3322 always generate
3323 (see comment for
3324 previous fence).
3325 - Must happen after
3326 any preceding
3327 local/generic
3328 load/load
3329 atomic/store/store
3330 atomic/atomicrmw.
3331 - Must happen before
3332 any following
3333 global/generic
3334 load/load
3335 atomic/store/store
3336 atomic/atomicrmw.
3337 - Ensures that all
3338 memory operations
3339 to local have
3340 completed before
3341 performing any
3342 following global
3343 memory operations.
3344 - Ensures that the
3345 preceding
3346 local/generic load
3347 atomic/atomicrmw
3348 with an equal or
3349 wider sync scope
3350 and memory ordering
3351 stronger than
3352 unordered (this is
3353 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003354 acquire-fence-paired-atomic
3355 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003356 before following
3357 global memory
3358 operations. This
3359 satisfies the
3360 requirements of
3361 acquire.
3362 - Ensures that all
3363 previous memory
3364 operations have
3365 completed before a
3366 following
3367 local/generic store
3368 atomic/atomicrmw
3369 with an equal or
3370 wider sync scope
3371 and memory ordering
3372 stronger than
3373 unordered (this is
3374 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003375 release-fence-paired-atomic
3376 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003377 requirements of
3378 release.
3379
Tony Tye6baa6d22017-10-18 22:16:55 +00003380 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3381 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003382
3383 - If OpenCL and
3384 address space is
3385 not generic, omit
3386 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003387 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003388 currently has no
3389 address space on
3390 the fence need to
3391 conservatively
3392 always generate
3393 (see comment for
3394 previous fence).
3395 - Could be split into
3396 separate s_waitcnt
3397 vmcnt(0) and
3398 s_waitcnt
3399 lgkmcnt(0) to allow
3400 them to be
3401 independently moved
3402 according to the
3403 following rules.
3404 - s_waitcnt vmcnt(0)
3405 must happen after
3406 any preceding
3407 global/generic
3408 load/store/load
3409 atomic/store
3410 atomic/atomicrmw.
3411 - s_waitcnt lgkmcnt(0)
3412 must happen after
3413 any preceding
3414 local/generic
3415 load/store/load
3416 atomic/store
3417 atomic/atomicrmw.
3418 - Must happen before
3419 the following
3420 buffer_wbinvl1_vol.
3421 - Ensures that the
3422 preceding
3423 global/local/generic
3424 load
3425 atomic/atomicrmw
3426 with an equal or
3427 wider sync scope
3428 and memory ordering
3429 stronger than
3430 unordered (this is
3431 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003432 acquire-fence-paired-atomic
3433 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003434 before invalidating
3435 the cache. This
3436 satisfies the
3437 requirements of
3438 acquire.
3439 - Ensures that all
3440 previous memory
3441 operations have
3442 completed before a
3443 following
3444 global/local/generic
3445 store
3446 atomic/atomicrmw
3447 with an equal or
3448 wider sync scope
3449 and memory ordering
3450 stronger than
3451 unordered (this is
3452 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003453 release-fence-paired-atomic
3454 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003455 requirements of
3456 release.
3457
3458 2. buffer_wbinvl1_vol
3459
3460 - Must happen before
3461 any following
3462 global/generic
3463 load/load
3464 atomic/store/store
3465 atomic/atomicrmw.
3466 - Ensures that
3467 following loads
3468 will not see stale
3469 global data. This
3470 satisfies the
3471 requirements of
3472 acquire.
3473
3474 **Sequential Consistent Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003475 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003476 load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003477 - wavefront - local load atomic acquire,
3478 - generic except must generated
3479 all instructions even
3480 for OpenCL.*
3481 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3482 - generic
3483 - Must
3484 happen after
3485 preceding
3486 global/generic load
3487 atomic/store
3488 atomic/atomicrmw
3489 with memory
3490 ordering of seq_cst
3491 and with equal or
3492 wider sync scope.
3493 (Note that seq_cst
3494 fences have their
3495 own s_waitcnt
3496 lgkmcnt(0) and so do
3497 not need to be
3498 considered.)
3499 - Ensures any
3500 preceding
3501 sequential
3502 consistent local
3503 memory instructions
3504 have completed
3505 before executing
3506 this sequentially
3507 consistent
3508 instruction. This
3509 prevents reordering
3510 a seq_cst store
3511 followed by a
3512 seq_cst load. (Note
3513 that seq_cst is
3514 stronger than
3515 acquire/release as
3516 the reordering of
3517 load acquire
3518 followed by a store
3519 release is
3520 prevented by the
3521 waitcnt of
3522 the release, but
3523 there is nothing
3524 preventing a store
3525 release followed by
3526 load acquire from
3527 competing out of
3528 order.)
3529
3530 2. *Following
3531 instructions same as
3532 corresponding load
3533 atomic acquire,
3534 except must generated
3535 all instructions even
3536 for OpenCL.*
3537 load atomic seq_cst - workgroup - local *Same as corresponding
3538 load atomic acquire,
3539 except must generated
3540 all instructions even
3541 for OpenCL.*
3542 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3543 - system - generic vmcnt(0)
3544
3545 - Could be split into
3546 separate s_waitcnt
3547 vmcnt(0)
3548 and s_waitcnt
3549 lgkmcnt(0) to allow
3550 them to be
3551 independently moved
3552 according to the
3553 following rules.
3554 - waitcnt lgkmcnt(0)
3555 must happen after
3556 preceding
3557 global/generic load
3558 atomic/store
3559 atomic/atomicrmw
3560 with memory
3561 ordering of seq_cst
3562 and with equal or
3563 wider sync scope.
3564 (Note that seq_cst
3565 fences have their
3566 own s_waitcnt
3567 lgkmcnt(0) and so do
3568 not need to be
3569 considered.)
3570 - waitcnt vmcnt(0)
3571 must happen after
Tony Tyef16a45e2017-06-06 20:31:59 +00003572 preceding
3573 global/generic load
3574 atomic/store
3575 atomic/atomicrmw
3576 with memory
3577 ordering of seq_cst
3578 and with equal or
3579 wider sync scope.
3580 (Note that seq_cst
3581 fences have their
3582 own s_waitcnt
3583 vmcnt(0) and so do
3584 not need to be
3585 considered.)
3586 - Ensures any
3587 preceding
3588 sequential
3589 consistent global
3590 memory instructions
3591 have completed
3592 before executing
3593 this sequentially
3594 consistent
3595 instruction. This
3596 prevents reordering
3597 a seq_cst store
3598 followed by a
Tony Tye6baa6d22017-10-18 22:16:55 +00003599 seq_cst load. (Note
Tony Tyef16a45e2017-06-06 20:31:59 +00003600 that seq_cst is
3601 stronger than
3602 acquire/release as
3603 the reordering of
3604 load acquire
3605 followed by a store
3606 release is
3607 prevented by the
Tony Tye6baa6d22017-10-18 22:16:55 +00003608 waitcnt of
Tony Tyef16a45e2017-06-06 20:31:59 +00003609 the release, but
3610 there is nothing
3611 preventing a store
3612 release followed by
3613 load acquire from
3614 competing out of
3615 order.)
3616
3617 2. *Following
3618 instructions same as
3619 corresponding load
Tony Tye6baa6d22017-10-18 22:16:55 +00003620 atomic acquire,
3621 except must generated
3622 all instructions even
3623 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003624 store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003625 - wavefront - local store atomic release,
3626 - workgroup - generic except must generated
3627 all instructions even
3628 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003629 store atomic seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003630 - system - generic store atomic release,
3631 except must generated
3632 all instructions even
3633 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003634 atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003635 - wavefront - local atomicrmw acq_rel,
3636 - workgroup - generic except must generated
3637 all instructions even
3638 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003639 atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003640 - system - generic atomicrmw acq_rel,
3641 except must generated
3642 all instructions even
3643 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003644 fence seq_cst - singlethread *none* *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003645 - wavefront fence acq_rel,
3646 - workgroup except must generated
3647 - agent all instructions even
3648 - system for OpenCL.*
3649 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00003650
3651The memory order also adds the single thread optimization constrains defined in
3652table
3653:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3654
3655 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3656 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3657
3658 ============ ==============================================================
3659 LLVM Memory Optimization Constraints
3660 Ordering
3661 ============ ==============================================================
3662 unordered *none*
3663 monotonic *none*
3664 acquire - If a load atomic/atomicrmw then no following load/load
3665 atomic/store/ store atomic/atomicrmw/fence instruction can
3666 be moved before the acquire.
3667 - If a fence then same as load atomic, plus no preceding
3668 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003669 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003670 atomic/store/ store atomic/atomicrmw/fence instruction can
3671 be moved after the release.
3672 - If a fence then same as store atomic, plus no following
3673 associated fence-paired-atomic can be moved before the
3674 fence.
3675 acq_rel Same constraints as both acquire and release.
3676 seq_cst - If a load atomic then same constraints as acquire, plus no
3677 preceding sequentially consistent load atomic/store
3678 atomic/atomicrmw/fence instruction can be moved after the
3679 seq_cst.
3680 - If a store atomic then the same constraints as release, plus
3681 no following sequentially consistent load atomic/store
3682 atomic/atomicrmw/fence instruction can be moved before the
3683 seq_cst.
3684 - If an atomicrmw/fence then same constraints as acq_rel.
3685 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003686
Wei Ding16289cf2017-02-21 18:48:01 +00003687Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003688~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003689
Tony Tyef16a45e2017-06-06 20:31:59 +00003690For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3691(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3692the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003693
Tony Tyef16a45e2017-06-06 20:31:59 +00003694 .. table:: AMDGPU Trap Handler for AMDHSA OS
3695 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003696
Tony Tyef16a45e2017-06-06 20:31:59 +00003697 =================== =============== =============== =======================
3698 Usage Code Sequence Trap Handler Description
3699 Inputs
3700 =================== =============== =============== =======================
3701 reserved ``s_trap 0x00`` Reserved by hardware.
3702 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3703 ``queue_ptr`` ``debugtrap``
3704 ``VGPR0``: intrinsic (not
3705 ``arg`` implemented).
3706 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3707 ``queue_ptr`` terminated and its
3708 associated queue put
3709 into the error state.
3710 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3711 ``queue_ptr`` installed handled
3712 same as ``llvm.trap``.
3713 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3714 breakpoints.
3715 debugger ``s_trap 0x08`` Reserved for debugger.
3716 debugger ``s_trap 0xfe`` Reserved for debugger.
3717 debugger ``s_trap 0xff`` Reserved for debugger.
3718 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003719
Tony Tye46d35762017-08-15 20:47:41 +00003720Unspecified OS
3721--------------
3722
3723This section provides code conventions used when the target triple OS is
3724empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003725
3726Trap Handler ABI
3727~~~~~~~~~~~~~~~~
3728
3729For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3730not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3731instructions are handled as follows:
3732
3733 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3734 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3735
3736 =============== =============== ===========================================
3737 Usage Code Sequence Description
3738 =============== =============== ===========================================
3739 llvm.trap s_endpgm Causes wavefront to be terminated.
3740 llvm.debugtrap *none* Compiler warning given that there is no
3741 trap handler installed.
3742 =============== =============== ===========================================
3743
3744Source Languages
3745================
3746
3747.. _amdgpu-opencl:
3748
3749OpenCL
3750------
3751
3752When generating code for the OpenCL language the target triple environment
3753should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3754
3755When the language is OpenCL the following differences occur:
3756
37571. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
37582. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye46d35762017-08-15 20:47:41 +000037593. Additional metadata is generated
3760 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003761
3762.. TODO
3763 Specify what affect this has. Hidden arguments added. Additional metadata
3764 generated.
3765
3766.. _amdgpu-hcc:
3767
3768HCC
3769---
3770
3771When generating code for the OpenCL language the target triple environment
3772should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3773
3774When the language is OpenCL the following differences occur:
3775
37761. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3777
3778.. TODO
3779 Specify what affect this has.
Tom Stellard3ec09e62016-04-06 01:29:19 +00003780
Tom Stellard45bb48e2015-06-13 03:28:10 +00003781Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003782---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003783
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003784AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef59d0712017-11-10 20:51:43 +00003785It supports AMDGCN GFX6-GFX9.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003786
Tony Tyef16a45e2017-06-06 20:31:59 +00003787This section describes general syntax for instructions and operands. For more
3788information about instructions, their semantics and supported combinations of
3789operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00003790[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003791
Tony Tyef16a45e2017-06-06 20:31:59 +00003792An instruction has the following syntax (register operands are normally
3793comma-separated while extra operands are space-separated):
Tom Stellard45bb48e2015-06-13 03:28:10 +00003794
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003795*<opcode> <register_operand0>, ... <extra_operand0> ...*
Tom Stellard45bb48e2015-06-13 03:28:10 +00003796
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003797Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00003798~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00003799
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003800The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003801
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003802* SGPR registers: s0, ... or s[0], ...
3803* VGPR registers: v0, ... or v[0], ...
3804* TTMP registers: ttmp0, ... or ttmp[0], ...
3805* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3806* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3807* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3808* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3809* Register index expressions: v[2*2], s[1-1:2-1]
3810* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00003811
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003812The following extra operands are supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003813
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003814* offset, offset0, offset1
3815* idxen, offen bits
3816* glc, slc, tfe bits
3817* waitcnt: integer or combination of counter values
3818* VOP3 modifiers:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003819
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003820 - abs (\| \|), neg (\-)
Tom Stellard45bb48e2015-06-13 03:28:10 +00003821
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003822* DPP modifiers:
3823
3824 - row_shl, row_shr, row_ror, row_rol
3825 - row_mirror, row_half_mirror, row_bcast
3826 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3827 - row_mask, bank_mask, bound_ctrl
3828
3829* SDWA modifiers:
3830
3831 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3832 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3833 - abs, neg, sext
3834
Tony Tyef16a45e2017-06-06 20:31:59 +00003835Instruction Examples
3836~~~~~~~~~~~~~~~~~~~~
3837
3838DS
3839~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003840
3841.. code-block:: nasm
3842
3843 ds_add_u32 v2, v4 offset:16
3844 ds_write_src2_b64 v2 offset0:4 offset1:8
3845 ds_cmpst_f32 v2, v4, v6
3846 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3847
3848
3849For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3850
Tony Tyef16a45e2017-06-06 20:31:59 +00003851FLAT
3852++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003853
3854.. code-block:: nasm
3855
3856 flat_load_dword v1, v[3:4]
3857 flat_store_dwordx3 v[3:4], v[5:7]
3858 flat_atomic_swap v1, v[3:4], v5 glc
3859 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3860 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3861
3862For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3863
Tony Tyef16a45e2017-06-06 20:31:59 +00003864MUBUF
3865+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003866
3867.. code-block:: nasm
3868
3869 buffer_load_dword v1, off, s[4:7], s1
3870 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3871 buffer_store_format_xy v[1:2], off, s[4:7], s1
3872 buffer_wbinvl1
3873 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3874
3875For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3876
Tony Tyef16a45e2017-06-06 20:31:59 +00003877SMRD/SMEM
3878+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003879
3880.. code-block:: nasm
3881
3882 s_load_dword s1, s[2:3], 0xfc
3883 s_load_dwordx8 s[8:15], s[2:3], s4
3884 s_load_dwordx16 s[88:103], s[2:3], s4
3885 s_dcache_inv_vol
3886 s_memtime s[4:5]
3887
3888For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3889
Tony Tyef16a45e2017-06-06 20:31:59 +00003890SOP1
3891++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003892
3893.. code-block:: nasm
3894
3895 s_mov_b32 s1, s2
3896 s_mov_b64 s[0:1], 0x80000000
3897 s_cmov_b32 s1, 200
3898 s_wqm_b64 s[2:3], s[4:5]
3899 s_bcnt0_i32_b64 s1, s[2:3]
3900 s_swappc_b64 s[2:3], s[4:5]
3901 s_cbranch_join s[4:5]
3902
3903For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3904
Tony Tyef16a45e2017-06-06 20:31:59 +00003905SOP2
3906++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003907
3908.. code-block:: nasm
3909
3910 s_add_u32 s1, s2, s3
3911 s_and_b64 s[2:3], s[4:5], s[6:7]
3912 s_cselect_b32 s1, s2, s3
3913 s_andn2_b32 s2, s4, s6
3914 s_lshr_b64 s[2:3], s[4:5], s6
3915 s_ashr_i32 s2, s4, s6
3916 s_bfm_b64 s[2:3], s4, s6
3917 s_bfe_i64 s[2:3], s[4:5], s6
3918 s_cbranch_g_fork s[4:5], s[6:7]
3919
3920For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3921
Tony Tyef16a45e2017-06-06 20:31:59 +00003922SOPC
3923++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003924
3925.. code-block:: nasm
3926
3927 s_cmp_eq_i32 s1, s2
3928 s_bitcmp1_b32 s1, s2
3929 s_bitcmp0_b64 s[2:3], s4
3930 s_setvskip s3, s5
3931
3932For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3933
Tony Tyef16a45e2017-06-06 20:31:59 +00003934SOPP
3935++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003936
3937.. code-block:: nasm
3938
3939 s_barrier
3940 s_nop 2
3941 s_endpgm
3942 s_waitcnt 0 ; Wait for all counters to be 0
3943 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3944 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3945 s_sethalt 9
3946 s_sleep 10
3947 s_sendmsg 0x1
3948 s_sendmsg sendmsg(MSG_INTERRUPT)
3949 s_trap 1
3950
3951For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3952
3953Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00003954of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00003955range or acceptable values.
3956
Tony Tyef16a45e2017-06-06 20:31:59 +00003957VALU
3958++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00003959
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003960For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3961the assembler will automatically use optimal encoding based on its operands.
3962To force specific encoding, one can add a suffix to the opcode of the instruction:
3963
3964* _e32 for 32-bit VOP1/VOP2/VOPC
3965* _e64 for 64-bit VOP3
3966* _dpp for VOP_DPP
3967* _sdwa for VOP_SDWA
3968
3969VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003970
3971.. code-block:: nasm
3972
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003973 v_mov_b32 v1, v2
3974 v_mov_b32_e32 v1, v2
3975 v_nop
3976 v_cvt_f64_i32_e32 v[1:2], v2
3977 v_floor_f32_e32 v1, v2
3978 v_bfrev_b32_e32 v1, v2
3979 v_add_f32_e32 v1, v2, v3
3980 v_mul_i32_i24_e64 v1, v2, 3
3981 v_mul_i32_i24_e32 v1, -3, v3
3982 v_mul_i32_i24_e32 v1, -100, v3
3983 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3984 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00003985
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003986VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003987
3988.. code-block:: nasm
3989
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003990 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3991 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3992 v_mov_b32 v0, v0 wave_shl:1
3993 v_mov_b32 v0, v0 row_mirror
3994 v_mov_b32 v0, v0 row_bcast:31
3995 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3996 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3997 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00003998
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003999VOP_SDWA examples:
4000
4001.. code-block:: nasm
4002
4003 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4004 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4005 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4006 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4007 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4008
4009For full list of supported instructions, refer to "Vector ALU instructions".
4010
4011HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00004012~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004013
4014AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4015one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00004016
4017.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00004018+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00004019
4020*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004021object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004022
4023.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00004024+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4025
Tom Stellard347ac792015-06-26 21:15:07 +00004026
4027*major*, *minor*, and *stepping* are all integers that describe the instruction
4028set architecture (ISA) version of the assembly program.
4029
4030*vendor* and *arch* are quoted strings. *vendor* should always be equal to
4031"AMD" and *arch* should always be equal to "AMDGPU".
4032
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004033By default, the assembler will derive the ISA version, *vendor*, and *arch*
4034from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004035
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004036.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00004037+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004038
4039This directives specifies that the symbol with given name is a kernel entry point
4040(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00004041
4042.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00004043++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00004044
4045This directive marks the beginning of a list of key / value pairs that are used
4046to specify the amd_kernel_code_t object that will be emitted by the assembler.
4047The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4048any amd_kernel_code_t values that are unspecified a default value will be
4049used. The default value for all keys is 0, with the following exceptions:
4050
4051- *kernel_code_version_major* defaults to 1.
4052- *machine_kind* defaults to 1.
4053- *machine_version_major*, *machine_version_minor*, and
4054 *machine_version_stepping* are derived from the value of the -mcpu option
4055 that is passed to the assembler.
4056- *kernel_code_entry_byte_offset* defaults to 256.
4057- *wavefront_size* defaults to 6.
4058- *kernarg_segment_alignment*, *group_segment_alignment*, and
Tony Tye6baa6d22017-10-18 22:16:55 +00004059 *private_segment_alignment* default to 4. Note that alignments are specified
Tom Stellardff7416b2015-06-26 21:58:31 +00004060 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4061
4062The *.amd_kernel_code_t* directive must be placed immediately after the
4063function label and before any instructions.
4064
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004065For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4066comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00004067
4068Here is an example of a minimal amd_kernel_code_t specification:
4069
Aaron Ballman887ad0e2016-07-19 17:46:55 +00004070.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00004071
4072 .hsa_code_object_version 1,0
4073 .hsa_code_object_isa
4074
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004075 .hsatext
4076 .globl hello_world
4077 .p2align 8
4078 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00004079
4080 hello_world:
4081
4082 .amd_kernel_code_t
4083 enable_sgpr_kernarg_segment_ptr = 1
4084 is_ptr64 = 1
4085 compute_pgm_rsrc1_vgprs = 0
4086 compute_pgm_rsrc1_sgprs = 0
4087 compute_pgm_rsrc2_user_sgpr = 2
4088 kernarg_segment_byte_size = 8
4089 wavefront_sgpr_count = 2
4090 workitem_vgpr_count = 3
4091 .end_amd_kernel_code_t
4092
4093 s_load_dwordx2 s[0:1], s[0:1] 0x0
4094 v_mov_b32 v0, 3.14159
4095 s_waitcnt lgkmcnt(0)
4096 v_mov_b32 v1, s0
4097 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004098 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00004099 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00004100 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004101 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00004102
4103Additional Documentation
4104========================
4105
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004106.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4107.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4108.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4109.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4110.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4111.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4112.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4113.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004114.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4115.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4116.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4117.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4118.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00004119.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004120.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4121.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__