blob: ff22f2c35977213ce4399378120dd68c3b469d78 [file] [log] [blame]
Tony Tyef16a45e2017-06-06 20:31:59 +00001=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
Tony Tye07d9f102017-11-10 01:00:54 +000026 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
Tony Tyef16a45e2017-06-06 20:31:59 +000028
Tony Tye07d9f102017-11-10 01:00:54 +000029 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000035
Tony Tye07d9f102017-11-10 01:00:54 +000036 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000038
Tony Tye07d9f102017-11-10 01:00:54 +000039 ============ ==============================================================
40 Vendor Description
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000045
Tony Tye07d9f102017-11-10 01:00:54 +000046 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
Tony Tyef16a45e2017-06-06 20:31:59 +000048
Tony Tye07d9f102017-11-10 01:00:54 +000049 ============== ============================================================
50 OS Description
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
56 runtime.
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
58 runtime.
59 ============== ============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000060
Tony Tye07d9f102017-11-10 01:00:54 +000061 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
Tony Tyef16a45e2017-06-06 20:31:59 +000063
Tony Tye07d9f102017-11-10 01:00:54 +000064 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
67 *<empty>* Defaults to ``opencl``.
68 ``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
69 ``amdgizcl`` Same as ``opencl`` except a different address space mapping is
70 used (see :ref:`amdgpu-address-spaces`).
71 ``amdgiz`` Same as ``opencl`` except a different address space mapping is
72 used (see :ref:`amdgpu-address-spaces`).
73 ``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
74 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000075
76.. _amdgpu-processors:
77
78Processors
79----------
80
81Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
82names from both the *Processor* and *Alternative Processor* can be used.
83
84 .. table:: AMDGPU Processors
Tony Tye07d9f102017-11-10 01:00:54 +000085 :name: amdgpu-processor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000086
Tony Tye31105cc2017-12-11 15:35:27 +000087 =========== =============== ============ ===== ========= ======= ==================
88 Processor Alternative Target dGPU/ Target ROCm Example
89 Processor Triple APU Features Support Products
90 Architecture Supported
91 [Default]
92 =========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000093 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Tony Tye31105cc2017-12-11 15:35:27 +000094 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +000095 ``r600`` ``r600`` dGPU
96 ``r630`` ``r600`` dGPU
97 ``rs880`` ``r600`` dGPU
98 ``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000099 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000100 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000101 ``rv710`` ``r600`` dGPU
102 ``rv730`` ``r600`` dGPU
103 ``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000104 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000105 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000106 ``cedar`` ``r600`` dGPU
107 ``redwood`` ``r600`` dGPU
108 ``sumo`` ``r600`` dGPU
109 ``juniper`` ``r600`` dGPU
110 ``cypress`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000111 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000112 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000113 ``barts`` ``r600`` dGPU
114 ``turks`` ``r600`` dGPU
115 ``caicos`` ``r600`` dGPU
116 ``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000117 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Tony Tye31105cc2017-12-11 15:35:27 +0000118 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000119 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
120 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU
121 - ``verde``
122 - ``oland``
123 - ``hainan``
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000124 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Tony Tye31105cc2017-12-11 15:35:27 +0000125 -----------------------------------------------------------------------------------
126 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
127 - A6 Pro-7050B
128 - A8-7100
129 - A8 Pro-7150B
130 - A10-7300
131 - A10 Pro-7350B
132 - FX-7500
133 - A8-7200P
134 - A10-7400P
135 - FX-7600P
136 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
137 - FirePro W9100
138 - FirePro S9150
139 - FirePro S9170
140 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
141 - Radeon R9 290x
142 - Radeon R390
143 - Radeon R390x
144 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
145 - ``mullins`` - E1-2200
146 - E1-2500
147 - E2-3000
148 - E2-3800
149 - A4-5000
150 - A4-5100
151 - A6-5200
152 - A4 Pro-3340B
153 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
154 - Radeon HD 8770
155 - R7 260
156 - R7 260X
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000157 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Tony Tye31105cc2017-12-11 15:35:27 +0000158 -----------------------------------------------------------------------------------
Tony Tye31105cc2017-12-11 15:35:27 +0000159 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
160 [on] - Pro A6-8500B
161 - A8-8600P
162 - Pro A8-8600B
163 - FX-8800P
164 - Pro A12-8800B
165 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
166 [on] - Pro A10-8700B
167 - A10-8780P
168 \ ``amdgcn`` APU - xnack - A10-9600P
169 [on] - A10-9630P
170 - A12-9700P
171 - A12-9730P
172 - FX-9800P
173 - FX-9830P
174 \ ``amdgcn`` APU - xnack - E2-9010
175 [on] - A6-9210
176 - A9-9410
177 ``gfx802`` - ``tonga`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
Tony Tyea6978802017-12-12 05:47:00 +0000178 - ``iceland`` [off] - FirePro S7100
Tony Tye31105cc2017-12-11 15:35:27 +0000179 - FirePro W7100
180 - Radeon R285
181 - Radeon R9 380
182 - Radeon R9 385
183 - Mobile FirePro
184 M7170
185 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
186 [off] - Radeon R9 Fury
187 - Radeon R9 FuryX
188 - Radeon Pro Duo
189 - FirePro S9300x2
190 - Radeon Instinct MI8
191 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
192 [off] - Radeon RX 480
193 - Radeon Instinct MI6
194 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
195 [off]
196 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
197 [on]
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000198 **GCN GFX9** [AMD-GCN-GFX9]_
Tony Tye31105cc2017-12-11 15:35:27 +0000199 -----------------------------------------------------------------------------------
200 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
201 [off] Frontier Edition
202 - Radeon RX Vega 56
203 - Radeon RX Vega 64
204 - Radeon RX Vega 64
205 Liquid
206 - Radeon Instinct MI25
207 ``gfx902`` ``amdgcn`` APU - xnack *TBA*
208 [on]
209 .. TODO
210 Add product
211 names.
212 =========== =============== ============ ===== ========= ======= ==================
Tony Tye07d9f102017-11-10 01:00:54 +0000213
214.. _amdgpu-target-features:
215
216Target Features
217---------------
218
219Target features control how code is generated to support certain
Tony Tye31105cc2017-12-11 15:35:27 +0000220processor specific features. Not all target features are supported by
221all processors. The runtime must ensure that the features supported by
222the device used to execute the code match the features enabled when
223generating the code. A mismatch of features may result in incorrect
224execution, or a reduction in performance.
225
226The target features supported by each processor, and the default value
227used if not specified explicitly, is listed in
228:ref:`amdgpu-processor-table`.
Tony Tye07d9f102017-11-10 01:00:54 +0000229
230Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
231target features.
232
233For example:
234
235``-mxnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000236 Enable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000237``-mno-xnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000238 Disable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000239
240 .. table:: AMDGPU Target Features
241 :name: amdgpu-target-feature-table
242
Tony Tye31105cc2017-12-11 15:35:27 +0000243 ============== ==================================================
244 Target Feature Description
245 ============== ==================================================
246 -m[no-]xnack Enable/disable generating code that has
247 memory clauses that are compatible with
248 having XNACK replay enabled.
Tony Tye07d9f102017-11-10 01:00:54 +0000249
Tony Tye31105cc2017-12-11 15:35:27 +0000250 This is used for demand paging and page
251 migration. If XNACK replay is enabled in
252 the device, then if a page fault occurs
253 the code may execute incorrectly if the
254 ``xnack`` feature is not enabled. Executing
255 code that has the feature enabled on a
256 device that does not have XNACK replay
257 enabled will execute correctly, but may
258 be less performant than code with the
259 feature disabled.
260 ============== ==================================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000261
262.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000263
264Address Spaces
265--------------
266
Tony Tyef16a45e2017-06-06 20:31:59 +0000267The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000268
Tony Tyef16a45e2017-06-06 20:31:59 +0000269The memory space names used in the table, aside from the region memory space, is
270from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000271
Tony Tyef16a45e2017-06-06 20:31:59 +0000272LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000273
Tony Tyef16a45e2017-06-06 20:31:59 +0000274 .. table:: Address Space Mapping
275 :name: amdgpu-address-space-mapping-table
276
277 ================== ================= ================= ================= =================
278 LLVM Address Space Memory Space
279 ------------------ -----------------------------------------------------------------------
280 \ Current Default amdgiz/amdgizcl hcc Future Default
281 ================== ================= ================= ================= =================
282 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
283 1 Global Global Global Global
284 2 Constant Constant Constant Region (GDS)
285 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
286 4 Generic (Flat) Region (GDS) Region (GDS) Constant
287 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
288 ================== ================= ================= ================= =================
289
290Current Default
291 This is the current default address space mapping used for all languages
292 except hcc. This will shortly be deprecated.
293
294amdgiz/amdgizcl
295 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
296 is specified as the target triple environment value.
297
298hcc
299 This is the current address space mapping used when ``hcc`` is specified as
300 the target triple environment value.This will shortly be deprecated.
301
302Future Default
303 This will shortly be the only address space mapping for all languages using
304 AMDGPU backend.
305
306.. _amdgpu-memory-scopes:
307
308Memory Scopes
309-------------
310
311This section provides LLVM memory synchronization scopes supported by the AMDGPU
312backend memory model when the target triple OS is ``amdhsa`` (see
313:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
314
315The memory model supported is based on the HSA memory model [HSA]_ which is
316based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
317relation is transitive over the synchonizes-with relation independent of scope,
318and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye07d9f102017-11-10 01:00:54 +0000319table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000320
321This is different to the OpenCL [OpenCL]_ memory model which does not have scope
322inclusion and requires the memory scopes to exactly match. However, this
323is conservatively correct for OpenCL.
324
Tony Tye07d9f102017-11-10 01:00:54 +0000325 .. table:: AMDHSA LLVM Sync Scopes
326 :name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tyef16a45e2017-06-06 20:31:59 +0000327
328 ================ ==========================================================
329 LLVM Sync Scope Description
330 ================ ==========================================================
331 *none* The default: ``system``.
332
333 Synchronizes with, and participates in modification and
334 seq_cst total orderings with, other operations (except
335 image operations) for all address spaces (except private,
336 or generic that accesses private) provided the other
337 operation's sync scope is:
338
339 - ``system``.
340 - ``agent`` and executed by a thread on the same agent.
341 - ``workgroup`` and executed by a thread in the same
342 workgroup.
343 - ``wavefront`` and executed by a thread in the same
344 wavefront.
345
346 ``agent`` Synchronizes with, and participates in modification and
347 seq_cst total orderings with, other operations (except
348 image operations) for all address spaces (except private,
349 or generic that accesses private) provided the other
350 operation's sync scope is:
351
352 - ``system`` or ``agent`` and executed by a thread on the
353 same agent.
354 - ``workgroup`` and executed by a thread in the same
355 workgroup.
356 - ``wavefront`` and executed by a thread in the same
357 wavefront.
358
359 ``workgroup`` Synchronizes with, and participates in modification and
360 seq_cst total orderings with, other operations (except
361 image operations) for all address spaces (except private,
362 or generic that accesses private) provided the other
363 operation's sync scope is:
364
365 - ``system``, ``agent`` or ``workgroup`` and executed by a
366 thread in the same workgroup.
367 - ``wavefront`` and executed by a thread in the same
368 wavefront.
369
370 ``wavefront`` Synchronizes with, and participates in modification and
371 seq_cst total orderings with, other operations (except
372 image operations) for all address spaces (except private,
373 or generic that accesses private) provided the other
374 operation's sync scope is:
375
376 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
377 and executed by a thread in the same wavefront.
378
379 ``singlethread`` Only synchronizes with, and participates in modification
380 and seq_cst total orderings with, other operations (except
381 image operations) running in the same thread for all
382 address spaces (for example, in signal handlers).
383 ================ ==========================================================
384
385AMDGPU Intrinsics
386-----------------
387
388The AMDGPU backend implements the following intrinsics.
389
390*This section is WIP.*
391
392.. TODO
393 List AMDGPU intrinsics
394
395Code Object
396===========
397
398The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
399can be linked by ``lld`` to produce a standard ELF shared code object which can
400be loaded and executed on an AMDGPU target.
401
402Header
403------
404
405The AMDGPU backend uses the following ELF header:
406
407 .. table:: AMDGPU ELF Header
408 :name: amdgpu-elf-header-table
409
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000410 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000411 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000412 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000413 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
414 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye07d9f102017-11-10 01:00:54 +0000415 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
416 - ``ELFOSABI_AMDGPU_HSA``
417 - ``ELFOSABI_AMDGPU_PAL``
418 - ``ELFOSABI_AMDGPU_MESA3D``
419 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
420 - ``ELFABIVERSION_AMDGPU_PAL``
421 - ``ELFABIVERSION_AMDGPU_MESA3D``
422 ``e_type`` - ``ET_REL``
423 - ``ET_DYN``
Tony Tyef16a45e2017-06-06 20:31:59 +0000424 ``e_machine`` ``EM_AMDGPU``
425 ``e_entry`` 0
Tony Tye07d9f102017-11-10 01:00:54 +0000426 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000427 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000428
429..
430
431 .. table:: AMDGPU ELF Header Enumeration Values
432 :name: amdgpu-elf-header-enumeration-values-table
433
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000434 =============================== =====
435 Name Value
436 =============================== =====
437 ``EM_AMDGPU`` 224
Tony Tye07d9f102017-11-10 01:00:54 +0000438 ``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000439 ``ELFOSABI_AMDGPU_HSA`` 64
440 ``ELFOSABI_AMDGPU_PAL`` 65
441 ``ELFOSABI_AMDGPU_MESA3D`` 66
442 ``ELFABIVERSION_AMDGPU_HSA`` 1
443 ``ELFABIVERSION_AMDGPU_PAL`` 0
444 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
445 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000446
447``e_ident[EI_CLASS]``
Tony Tye07d9f102017-11-10 01:00:54 +0000448 The ELF class is:
449
450 * ``ELFCLASS32`` for ``r600`` architecture.
451
452 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
453 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000454
455``e_ident[EI_DATA]``
Tony Tye07d9f102017-11-10 01:00:54 +0000456 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tyef16a45e2017-06-06 20:31:59 +0000457
458``e_ident[EI_OSABI]``
Tony Tye07d9f102017-11-10 01:00:54 +0000459 One of the following AMD GPU architecture specific OS ABIs
460 (see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000461
Tony Tye07d9f102017-11-10 01:00:54 +0000462 * ``ELFOSABI_NONE`` for *unknown* OS.
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000463
Tony Tye07d9f102017-11-10 01:00:54 +0000464 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tyef16a45e2017-06-06 20:31:59 +0000465
Tony Tye07d9f102017-11-10 01:00:54 +0000466 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
467
468 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000469
Tony Tyef16a45e2017-06-06 20:31:59 +0000470``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000471 The ABI version of the AMD GPU architecture specific OS ABI to which the code
472 object conforms:
473
474 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
475 runtime ABI.
476
477 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
478 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000479
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000480 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye07d9f102017-11-10 01:00:54 +0000481 3D runtime ABI.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000482
Tony Tyef16a45e2017-06-06 20:31:59 +0000483``e_type``
484 Can be one of the following values:
485
486
487 ``ET_REL``
488 The type produced by the AMD GPU backend compiler as it is relocatable code
489 object.
490
491 ``ET_DYN``
492 The type produced by the linker as it is a shared code object.
493
494 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
495
496``e_machine``
Tony Tye07d9f102017-11-10 01:00:54 +0000497 The value ``EM_AMDGPU`` is used for the machine for all processors supported
498 by the ``r600`` and ``amdgcn`` architectures (see
499 :ref:`amdgpu-processor-table`). The specific processor is specified in the
500 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
501 :ref:`amdgpu-elf-header-e_flags-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000502
503``e_entry``
504 The entry point is 0 as the entry points for individual kernels must be
505 selected in order to invoke them through AQL packets.
506
507``e_flags``
Tony Tye07d9f102017-11-10 01:00:54 +0000508 The AMDGPU backend uses the following ELF header flags:
509
510 .. table:: AMDGPU ELF Header ``e_flags``
511 :name: amdgpu-elf-header-e_flags-table
512
513 ================================= ========== =============================
514 Name Value Description
515 ================================= ========== =============================
516 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
517 -------------------------------------------- -----------------------------
518 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
519 mask for
520 ``EF_AMDGPU_MACH_xxx`` values
521 defined in
522 :ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye31105cc2017-12-11 15:35:27 +0000523 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
524 target feature is
525 enabled for all code
526 contained in the code object.
527 See
528 :ref:`amdgpu-target-features`.
Tony Tye07d9f102017-11-10 01:00:54 +0000529 ================================= ========== =============================
530
531 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
532 :name: amdgpu-ef-amdgpu-mach-table
533
534 ================================= ========== =============================
535 Name Value Description (see
536 :ref:`amdgpu-processor-table`)
537 ================================= ========== =============================
538 ``EF_AMDGPU_MACH_NONE`` 0 *not specified*
539 ``EF_AMDGPU_MACH_R600_R600`` 1 ``r600``
540 ``EF_AMDGPU_MACH_R600_R630`` 2 ``r630``
541 ``EF_AMDGPU_MACH_R600_RS880`` 3 ``rs880``
542 ``EF_AMDGPU_MACH_R600_RV670`` 4 ``rv670``
543 ``EF_AMDGPU_MACH_R600_RV710`` 5 ``rv710``
544 ``EF_AMDGPU_MACH_R600_RV730`` 6 ``rv730``
545 ``EF_AMDGPU_MACH_R600_RV770`` 7 ``rv770``
546 ``EF_AMDGPU_MACH_R600_CEDAR`` 8 ``cedar``
547 ``EF_AMDGPU_MACH_R600_REDWOOD`` 9 ``redwood``
548 ``EF_AMDGPU_MACH_R600_SUMO`` 10 ``sumo``
549 ``EF_AMDGPU_MACH_R600_JUNIPER`` 11 ``juniper``
550 ``EF_AMDGPU_MACH_R600_CYPRESS`` 12 ``cypress``
551 ``EF_AMDGPU_MACH_R600_BARTS`` 13 ``barts``
552 ``EF_AMDGPU_MACH_R600_TURKS`` 14 ``turks``
553 ``EF_AMDGPU_MACH_R600_CAICOS`` 15 ``caicos``
554 ``EF_AMDGPU_MACH_R600_CAYMAN`` 16 ``cayman``
555 *reserved* 17-31 Reserved for ``r600``
556 architecture processors.
557 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 32 ``gfx600``
558 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 33 ``gfx601``
559 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 34 ``gfx700``
560 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 35 ``gfx701``
561 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 36 ``gfx702``
562 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 37 ``gfx703``
Tony Tye31105cc2017-12-11 15:35:27 +0000563 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 38 ``gfx704``
Tony Tyedb6c9932018-01-30 23:59:43 +0000564 *reserved* 39 Reserved.
565 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 40 ``gfx801``
566 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 41 ``gfx802``
567 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 42 ``gfx803``
568 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 43 ``gfx810``
569 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 44 ``gfx900``
570 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 45 ``gfx902``
Tony Tye07d9f102017-11-10 01:00:54 +0000571 ================================= ========== =============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000572
573Sections
574--------
575
576An AMDGPU target ELF code object has the standard ELF sections which include:
577
578 .. table:: AMDGPU ELF Sections
579 :name: amdgpu-elf-sections-table
580
581 ================== ================ =================================
582 Name Type Attributes
583 ================== ================ =================================
584 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
585 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
586 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
587 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
588 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
589 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
590 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
591 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
592 ``.note`` ``SHT_NOTE`` *none*
593 ``.rela``\ *name* ``SHT_RELA`` *none*
594 ``.rela.dyn`` ``SHT_RELA`` *none*
595 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
596 ``.shstrtab`` ``SHT_STRTAB`` *none*
597 ``.strtab`` ``SHT_STRTAB`` *none*
598 ``.symtab`` ``SHT_SYMTAB`` *none*
599 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
600 ================== ================ =================================
601
602These sections have their standard meanings (see [ELF]_) and are only generated
603if needed.
604
605``.debug``\ *\**
606 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
607 DWARF produced by the AMDGPU backend.
608
Tony Tye46d35762017-08-15 20:47:41 +0000609``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000610 The standard sections used by a dynamic loader.
611
612``.note``
613 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
614 backend.
615
616``.rela``\ *name*, ``.rela.dyn``
617 For relocatable code objects, *name* is the name of the section that the
618 relocation records apply. For example, ``.rela.text`` is the section name for
619 relocation records associated with the ``.text`` section.
620
621 For linked shared code objects, ``.rela.dyn`` contains all the relocation
622 records from each of the relocatable code object's ``.rela``\ *name* sections.
623
624 See :ref:`amdgpu-relocation-records` for the relocation records supported by
625 the AMDGPU backend.
626
627``.text``
628 The executable machine code for the kernels and functions they call. Generated
629 as position independent code. See :ref:`amdgpu-code-conventions` for
630 information on conventions used in the isa generation.
631
632.. _amdgpu-note-records:
633
634Note Records
635------------
636
Tony Tye07d9f102017-11-10 01:00:54 +0000637As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
638be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
639aligned. In addition, minimal zero byte padding must be generated to ensure the
640``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
641``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tyef16a45e2017-06-06 20:31:59 +0000642
643The AMDGPU backend code object uses the following ELF note records in the
644``.note`` section. The *Description* column specifies the layout of the note
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000645record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tyef16a45e2017-06-06 20:31:59 +0000646variable size strings have a corresponding ``*_size`` field that specifies the
647number of bytes, including the terminating null character, in the string. The
648string(s) come immediately after the preceding fields.
649
650Additional note records can be present.
651
652 .. table:: AMDGPU ELF Note Records
653 :name: amdgpu-elf-note-records-table
654
Tony Tye46d35762017-08-15 20:47:41 +0000655 ===== ============================== ======================================
656 Name Type Description
657 ===== ============================== ======================================
658 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye46d35762017-08-15 20:47:41 +0000659 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000660
661..
662
663 .. table:: AMDGPU ELF Note Record Enumeration Values
664 :name: amdgpu-elf-note-record-enumeration-values-table
665
Tony Tye46d35762017-08-15 20:47:41 +0000666 ============================== =====
667 Name Value
668 ============================== =====
669 *reserved* 0-9
670 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye07d9f102017-11-10 01:00:54 +0000671 *reserved* 11
Tony Tye46d35762017-08-15 20:47:41 +0000672 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000673
Tony Tye46d35762017-08-15 20:47:41 +0000674``NT_AMD_AMDGPU_HSA_METADATA``
675 Specifies extensible metadata associated with the code objects executed on HSA
676 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
677 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
678 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
679 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000680
Tony Tye46d35762017-08-15 20:47:41 +0000681.. _amdgpu-symbols:
682
683Symbols
684-------
685
686Symbols include the following:
687
688 .. table:: AMDGPU ELF Symbols
689 :name: amdgpu-elf-symbols-table
690
691 ===================== ============== ============= ==================
692 Name Type Section Description
693 ===================== ============== ============= ==================
694 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
695 - ``.rodata``
696 - ``.bss``
697 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
698 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
699 ===================== ============== ============= ==================
700
701Global variable
702 Global variables both used and defined by the compilation unit.
703
704 If the symbol is defined in the compilation unit then it is allocated in the
705 appropriate section according to if it has initialized data or is readonly.
706
707 If the symbol is external then its section is ``STN_UNDEF`` and the loader
708 will resolve relocations using the definition provided by another code object
709 or explicitly defined by the runtime.
710
711 All global symbols, whether defined in the compilation unit or external, are
712 accessed by the machine code indirectly through a GOT table entry. This
713 allows them to be preemptable. The GOT table is only supported when the target
714 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000715
716 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000717 Add description of linked shared object symbols. Seems undefined symbols
718 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000719
Tony Tye46d35762017-08-15 20:47:41 +0000720Kernel descriptor
721 Every HSA kernel has an associated kernel descriptor. It is the address of the
722 kernel descriptor that is used in the AQL dispatch packet used to invoke the
723 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
724 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
725
726Kernel entry point
727 Every HSA kernel also has a symbol for its machine code entry point.
728
729.. _amdgpu-relocation-records:
730
731Relocation Records
732------------------
733
734AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
735relocatable fields are:
736
737``word32``
738 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
739 alignment. These values use the same byte order as other word values in the
740 AMD GPU architecture.
741
742``word64``
743 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
744 alignment. These values use the same byte order as other word values in the
745 AMD GPU architecture.
746
747Following notations are used for specifying relocation calculations:
748
749**A**
750 Represents the addend used to compute the value of the relocatable field.
751
752**G**
753 Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000754 entry's symbol will reside during execution.
Tony Tye46d35762017-08-15 20:47:41 +0000755
756**GOT**
757 Represents the address of the global offset table.
758
759**P**
760 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
761 of the storage unit being relocated (computed using ``r_offset``).
762
763**S**
764 Represents the value of the symbol whose index resides in the relocation
Tony Tyed2884302017-10-16 20:44:29 +0000765 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
766
767**B**
768 Represents the base address of a loaded executable or shared object which is
769 the difference between the ELF address and the actual load address. Relocations
770 using this are only valid in executable or shared objects.
Tony Tye46d35762017-08-15 20:47:41 +0000771
772The following relocation types are supported:
773
774 .. table:: AMDGPU ELF Relocation Records
775 :name: amdgpu-elf-relocation-records-table
776
Tony Tyedb6c9932018-01-30 23:59:43 +0000777 ========================== ======= ===== ========== ==============================
778 Relocation Type Kind Value Field Calculation
779 ========================== ======= ===== ========== ==============================
780 ``R_AMDGPU_NONE`` 0 *none* *none*
781 ``R_AMDGPU_ABS32_LO`` Dynamic 1 ``word32`` (S + A) & 0xFFFFFFFF
782 ``R_AMDGPU_ABS32_HI`` Dynamic 2 ``word32`` (S + A) >> 32
783 ``R_AMDGPU_ABS64`` Dynamic 3 ``word64`` S + A
784 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
785 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
786 ``R_AMDGPU_ABS32`` Static 6 ``word32`` S + A
787 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
788 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
789 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
790 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
791 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
792 *reserved* 12
793 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
794 ========================== ======= ===== ========== ==============================
Tony Tye46d35762017-08-15 20:47:41 +0000795
796.. _amdgpu-dwarf:
797
798DWARF
799-----
800
801Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
802information that maps the code object executable code and data to the source
803language constructs. It can be used by tools such as debuggers and profilers.
804
805Address Space Mapping
806~~~~~~~~~~~~~~~~~~~~~
807
808The following address space mapping is used:
809
810 .. table:: AMDGPU DWARF Address Space Mapping
811 :name: amdgpu-dwarf-address-space-mapping-table
812
813 =================== =================
814 DWARF Address Space Memory Space
815 =================== =================
816 1 Private (Scratch)
817 2 Local (group/LDS)
818 *omitted* Global
819 *omitted* Constant
820 *omitted* Generic (Flat)
821 *not supported* Region (GDS)
822 =================== =================
823
824See :ref:`amdgpu-address-spaces` for information on the memory space terminology
825used in the table.
826
827An ``address_class`` attribute is generated on pointer type DIEs to specify the
828DWARF address space of the value of the pointer when it is in the *private* or
829*local* address space. Otherwise the attribute is omitted.
830
831An ``XDEREF`` operation is generated in location list expressions for variables
832that are allocated in the *private* and *local* address space. Otherwise no
833``XDREF`` is omitted.
834
835Register Mapping
836~~~~~~~~~~~~~~~~
837
838*This section is WIP.*
839
840.. TODO
841 Define DWARF register enumeration.
842
843 If want to present a wavefront state then should expose vector registers as
844 64 wide (rather than per work-item view that LLVM uses). Either as separate
845 registers, or a 64x4 byte single register. In either case use a new LANE op
846 (akin to XDREF) to select the current lane usage in a location
847 expression. This would also allow scalar register spilling to vector register
848 lanes to be expressed (currently no debug information is being generated for
849 spilling). If choose a wide single register approach then use LANE in
850 conjunction with PIECE operation to select the dword part of the register for
851 the current lane. If the separate register approach then use LANE to select
852 the register.
853
854Source Text
855~~~~~~~~~~~
856
857*This section is WIP.*
858
859.. TODO
860 DWARF extension to include runtime generated source text.
861
862.. _amdgpu-code-conventions:
863
864Code Conventions
865================
866
867This section provides code conventions used for each supported target triple OS
868(see :ref:`amdgpu-target-triples`).
869
870AMDHSA
871------
872
873This section provides code conventions used when the target triple OS is
874``amdhsa`` (see :ref:`amdgpu-target-triples`).
875
876.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000877
878Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000879~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000880
Tony Tye46d35762017-08-15 20:47:41 +0000881The code object metadata specifies extensible metadata associated with the code
882objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
883[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
884(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
885``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
886information necessary to support the ROCM kernel queries. For example, the
887segment sizes needed in a dispatch packet. In addition, a high level language
888runtime may require other information to be included. For example, the AMD
889OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000890
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000891The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000892:doc:`YamlIO`).
893
Tony Tye46d35762017-08-15 20:47:41 +0000894.. TODO
895 Is the string null terminated? It probably should not if YAML allows it to
896 contain null characters, otherwise it should be.
897
Tony Tyef16a45e2017-06-06 20:31:59 +0000898The metadata is represented as a single YAML document comprised of the mapping
899defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
900referenced tables.
901
902For boolean values, the string values of ``false`` and ``true`` are used for
903false and true respectively.
904
905Additional information can be added to the mappings. To avoid conflicts, any
906non-AMD key names should be prefixed by "*vendor-name*.".
907
908 .. table:: AMDHSA Code Object Metadata Mapping
909 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
910
911 ========== ============== ========= =======================================
912 String Key Value Type Required? Description
913 ========== ============== ========= =======================================
914 "Version" sequence of Required - The first integer is the major
915 2 integers version. Currently 1.
916 - The second integer is the minor
917 version. Currently 0.
918 "Printf" sequence of Each string is encoded information
919 strings about a printf function call. The
920 encoded information is organized as
921 fields separated by colon (':'):
922
923 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
924
925 where:
926
927 ``ID``
928 A 32 bit integer as a unique id for
929 each printf function call
930
931 ``N``
932 A 32 bit integer equal to the number
933 of arguments of printf function call
934 minus 1
935
936 ``S[i]`` (where i = 0, 1, ... , N-1)
937 32 bit integers for the size in bytes
938 of the i-th FormatString argument of
939 the printf function call
940
941 FormatString
942 The format string passed to the
943 printf function call.
944 "Kernels" sequence of Required Sequence of the mappings for each
945 mapping kernel in the code object. See
946 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
947 for the definition of the mapping.
948 ========== ============== ========= =======================================
949
950..
951
952 .. table:: AMDHSA Code Object Kernel Metadata Mapping
953 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
954
955 ================= ============== ========= ================================
956 String Key Value Type Required? Description
957 ================= ============== ========= ================================
958 "Name" string Required Source name of the kernel.
959 "SymbolName" string Required Name of the kernel
960 descriptor ELF symbol.
961 "Language" string Source language of the kernel.
962 Values include:
963
964 - "OpenCL C"
965 - "OpenCL C++"
966 - "HCC"
967 - "OpenMP"
968
969 "LanguageVersion" sequence of - The first integer is the major
970 2 integers version.
971 - The second integer is the
972 minor version.
973 "Attrs" mapping Mapping of kernel attributes.
974 See
975 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
976 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +0000977 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +0000978 mapping kernel arguments. See
979 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
980 for the definition of the mapping.
981 "CodeProps" mapping Mapping of properties related to
982 the kernel code. See
983 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
984 for the mapping definition.
Tony Tyef16a45e2017-06-06 20:31:59 +0000985 ================= ============== ========= ================================
986
987..
988
989 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
990 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
991
992 =================== ============== ========= ==============================
993 String Key Value Type Required? Description
994 =================== ============== ========= ==============================
Tony Tyee039d0e2018-01-30 23:07:10 +0000995 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
996 3 integers must be >=1 and the dispatch
997 work-group size X, Y, Z must
998 correspond to the specified
999 values. Defaults to 0, 0, 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001000
1001 Corresponds to the OpenCL
1002 ``reqd_work_group_size``
1003 attribute.
1004 "WorkGroupSizeHint" sequence of The dispatch work-group size
1005 3 integers X, Y, Z is likely to be the
1006 specified values.
1007
1008 Corresponds to the OpenCL
1009 ``work_group_size_hint``
1010 attribute.
1011 "VecTypeHint" string The name of a scalar or vector
1012 type.
1013
1014 Corresponds to the OpenCL
1015 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +00001016
1017 "RuntimeHandle" string The external symbol name
1018 associated with a kernel.
1019 OpenCL runtime allocates a
1020 global buffer for the symbol
1021 and saves the kernel's address
1022 to it, which is used for
1023 device side enqueueing. Only
1024 available for device side
1025 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +00001026 =================== ============== ========= ==============================
1027
1028..
1029
1030 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1031 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1032
1033 ================= ============== ========= ================================
1034 String Key Value Type Required? Description
1035 ================= ============== ========= ================================
1036 "Name" string Kernel argument name.
1037 "TypeName" string Kernel argument type name.
1038 "Size" integer Required Kernel argument size in bytes.
1039 "Align" integer Required Kernel argument alignment in
1040 bytes. Must be a power of two.
1041 "ValueKind" string Required Kernel argument kind that
1042 specifies how to set up the
1043 corresponding argument.
1044 Values include:
1045
1046 "ByValue"
1047 The argument is copied
1048 directly into the kernarg.
1049
1050 "GlobalBuffer"
1051 A global address space pointer
1052 to the buffer data is passed
1053 in the kernarg.
1054
1055 "DynamicSharedPointer"
1056 A group address space pointer
1057 to dynamically allocated LDS
1058 is passed in the kernarg.
1059
1060 "Sampler"
1061 A global address space
1062 pointer to a S# is passed in
1063 the kernarg.
1064
1065 "Image"
1066 A global address space
1067 pointer to a T# is passed in
1068 the kernarg.
1069
1070 "Pipe"
1071 A global address space pointer
1072 to an OpenCL pipe is passed in
1073 the kernarg.
1074
1075 "Queue"
1076 A global address space pointer
1077 to an OpenCL device enqueue
1078 queue is passed in the
1079 kernarg.
1080
1081 "HiddenGlobalOffsetX"
1082 The OpenCL grid dispatch
1083 global offset for the X
1084 dimension is passed in the
1085 kernarg.
1086
1087 "HiddenGlobalOffsetY"
1088 The OpenCL grid dispatch
1089 global offset for the Y
1090 dimension is passed in the
1091 kernarg.
1092
1093 "HiddenGlobalOffsetZ"
1094 The OpenCL grid dispatch
1095 global offset for the Z
1096 dimension is passed in the
1097 kernarg.
1098
1099 "HiddenNone"
1100 An argument that is not used
1101 by the kernel. Space needs to
1102 be left for it, but it does
1103 not need to be set up.
1104
1105 "HiddenPrintfBuffer"
1106 A global address space pointer
1107 to the runtime printf buffer
1108 is passed in kernarg.
1109
1110 "HiddenDefaultQueue"
1111 A global address space pointer
1112 to the OpenCL device enqueue
1113 queue that should be used by
1114 the kernel by default is
1115 passed in the kernarg.
1116
1117 "HiddenCompletionAction"
Yaxun Liuc928f2a2017-10-30 14:30:28 +00001118 A global address space pointer
1119 to help link enqueued kernels into
1120 the ancestor tree for determining
1121 when the parent kernel has finished.
Tony Tyef16a45e2017-06-06 20:31:59 +00001122
1123 "ValueType" string Required Kernel argument value type. Only
1124 present if "ValueKind" is
1125 "ByValue". For vector data
1126 types, the value is for the
1127 element type. Values include:
1128
1129 - "Struct"
1130 - "I8"
1131 - "U8"
1132 - "I16"
1133 - "U16"
1134 - "F16"
1135 - "I32"
1136 - "U32"
1137 - "F32"
1138 - "I64"
1139 - "U64"
1140 - "F64"
1141
1142 .. TODO
1143 How can it be determined if a
1144 vector type, and what size
1145 vector?
1146 "PointeeAlign" integer Alignment in bytes of pointee
1147 type for pointer type kernel
1148 argument. Must be a power
1149 of 2. Only present if
1150 "ValueKind" is
1151 "DynamicSharedPointer".
1152 "AddrSpaceQual" string Kernel argument address space
1153 qualifier. Only present if
1154 "ValueKind" is "GlobalBuffer" or
1155 "DynamicSharedPointer". Values
1156 are:
1157
1158 - "Private"
1159 - "Global"
1160 - "Constant"
1161 - "Local"
1162 - "Generic"
1163 - "Region"
1164
1165 .. TODO
1166 Is GlobalBuffer only Global
1167 or Constant? Is
1168 DynamicSharedPointer always
1169 Local? Can HCC allow Generic?
1170 How can Private or Region
1171 ever happen?
1172 "AccQual" string Kernel argument access
1173 qualifier. Only present if
1174 "ValueKind" is "Image" or
1175 "Pipe". Values
1176 are:
1177
1178 - "ReadOnly"
1179 - "WriteOnly"
1180 - "ReadWrite"
1181
1182 .. TODO
1183 Does this apply to
1184 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001185 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001186 performed by the kernel on the
1187 kernel argument. Only present if
1188 "ValueKind" is "GlobalBuffer",
1189 "Image", or "Pipe". This may be
1190 more restrictive than indicated
1191 by "AccQual" to reflect what the
1192 kernel actual does. If not
1193 present then the runtime must
1194 assume what is implied by
1195 "AccQual" and "IsConst". Values
1196 are:
1197
1198 - "ReadOnly"
1199 - "WriteOnly"
1200 - "ReadWrite"
1201
1202 "IsConst" boolean Indicates if the kernel argument
1203 is const qualified. Only present
1204 if "ValueKind" is
1205 "GlobalBuffer".
1206
1207 "IsRestrict" boolean Indicates if the kernel argument
1208 is restrict qualified. Only
1209 present if "ValueKind" is
1210 "GlobalBuffer".
1211
1212 "IsVolatile" boolean Indicates if the kernel argument
1213 is volatile qualified. Only
1214 present if "ValueKind" is
1215 "GlobalBuffer".
1216
1217 "IsPipe" boolean Indicates if the kernel argument
1218 is pipe qualified. Only present
1219 if "ValueKind" is "Pipe".
1220
1221 .. TODO
1222 Can GlobalBuffer be pipe
1223 qualified?
1224 ================= ============== ========= ================================
1225
1226..
1227
1228 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1229 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1230
1231 ============================ ============== ========= =====================
1232 String Key Value Type Required? Description
1233 ============================ ============== ========= =====================
1234 "KernargSegmentSize" integer Required The size in bytes of
1235 the kernarg segment
1236 that holds the values
1237 of the arguments to
1238 the kernel.
1239 "GroupSegmentFixedSize" integer Required The amount of group
1240 segment memory
1241 required by a
1242 work-group in
1243 bytes. This does not
1244 include any
1245 dynamically allocated
1246 group segment memory
1247 that may be added
1248 when the kernel is
1249 dispatched.
1250 "PrivateSegmentFixedSize" integer Required The amount of fixed
1251 private address space
1252 memory required for a
1253 work-item in
Tony Tye07d9f102017-11-10 01:00:54 +00001254 bytes. If the kernel
1255 uses a dynamic call
1256 stack then additional
Tony Tyef16a45e2017-06-06 20:31:59 +00001257 space must be added
1258 to this value for the
1259 call stack.
1260 "KernargSegmentAlign" integer Required The maximum byte
1261 alignment of
1262 arguments in the
1263 kernarg segment. Must
1264 be a power of 2.
1265 "WavefrontSize" integer Required Wavefront size. Must
1266 be a power of 2.
Tony Tye07d9f102017-11-10 01:00:54 +00001267 "NumSGPRs" integer Required Number of scalar
Tony Tyef16a45e2017-06-06 20:31:59 +00001268 registers used by a
1269 wavefront for
1270 GFX6-GFX9. This
1271 includes the special
1272 SGPRs for VCC, Flat
1273 Scratch (GFX7-GFX9)
1274 and XNACK (for
1275 GFX8-GFX9). It does
1276 not include the 16
1277 SGPR added if a trap
1278 handler is
1279 enabled. It is not
1280 rounded up to the
1281 allocation
1282 granularity.
Tony Tye07d9f102017-11-10 01:00:54 +00001283 "NumVGPRs" integer Required Number of vector
Tony Tyef16a45e2017-06-06 20:31:59 +00001284 registers used by
1285 each work-item for
1286 GFX6-GFX9
Tony Tye07d9f102017-11-10 01:00:54 +00001287 "MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001288 work-group size
1289 supported by the
1290 kernel in work-items.
Tony Tye07d9f102017-11-10 01:00:54 +00001291 Must be >=1 and
Tony Tyee039d0e2018-01-30 23:07:10 +00001292 consistent with
1293 ReqdWorkGroupSize if
1294 not 0, 0, 0.
Konstantin Zhuravlyov06ae4ec2017-11-28 17:51:08 +00001295 "NumSpilledSGPRs" integer Number of stores from
1296 a scalar register to
1297 a register allocator
1298 created spill
1299 location.
1300 "NumSpilledVGPRs" integer Number of stores from
1301 a vector register to
1302 a register allocator
1303 created spill
1304 location.
Tony Tyef16a45e2017-06-06 20:31:59 +00001305 ============================ ============== ========= =====================
1306
1307..
1308
Tony Tyef16a45e2017-06-06 20:31:59 +00001309Kernel Dispatch
1310~~~~~~~~~~~~~~~
1311
1312The HSA architected queuing language (AQL) defines a user space memory interface
1313that can be used to control the dispatch of kernels, in an agent independent
1314way. An agent can have zero or more AQL queues created for it using the ROCm
1315runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1316*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1317mechanics and packet layouts.
1318
1319The packet processor of a kernel agent is responsible for detecting and
1320dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1321packet processor is implemented by the hardware command processor (CP),
1322asynchronous dispatch controller (ADC) and shader processor input controller
1323(SPI).
1324
1325The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1326mode driver to initialize and register the AQL queue with CP.
1327
1328To dispatch a kernel the following actions are performed. This can occur in the
1329CPU host program, or from an HSA kernel executing on a GPU.
1330
13311. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1332 executed is obtained.
13332. A pointer to the kernel descriptor (see
1334 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1335 obtained. It must be for a kernel that is contained in a code object that that
1336 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1337 associated.
13383. Space is allocated for the kernel arguments using the ROCm runtime allocator
1339 for a memory region with the kernarg property for the kernel agent that will
1340 execute the kernel. It must be at least 16 byte aligned.
13414. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00001342 allocation. The layout is defined in the *HSA Programmer's Language Reference*
Tony Tyef16a45e2017-06-06 20:31:59 +00001343 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1344 memory in the same way constant memory is accessed. (Note that the HSA
1345 specification allows an implementation to copy the kernel argument contents to
1346 another location that is accessed by the kernel.)
13475. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1348 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1349 packet. The packet must be set up, and the final write must use an atomic
1350 store release to set the packet kind to ensure the packet contents are
1351 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1352 notify the kernel agent that the AQL queue has been updated. These rules, and
1353 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1354 System Architecture Specification* [HSA]_.
13556. A kernel dispatch packet includes information about the actual dispatch,
1356 such as grid and work-group size, together with information from the code
1357 object about the kernel, such as segment sizes. The ROCm runtime queries on
1358 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001359 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000013607. CP executes micro-code and is responsible for detecting and setting up the
1361 GPU to execute the wavefronts of a kernel dispatch.
13628. CP ensures that when the a wavefront starts executing the kernel machine
1363 code, the scalar general purpose registers (SGPR) and vector general purpose
1364 registers (VGPR) are set up as required by the machine code. The required
1365 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1366 register state is defined in
1367 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13689. The prolog of the kernel machine code (see
1369 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1370 before continuing executing the machine code that corresponds to the kernel.
137110. When the kernel dispatch has completed execution, CP signals the completion
1372 signal specified in the kernel dispatch packet if not 0.
1373
1374.. _amdgpu-amdhsa-memory-spaces:
1375
1376Memory Spaces
1377~~~~~~~~~~~~~
1378
1379The memory space properties are:
1380
1381 .. table:: AMDHSA Memory Spaces
1382 :name: amdgpu-amdhsa-memory-spaces-table
1383
1384 ================= =========== ======== ======= ==================
1385 Memory Space Name HSA Segment Hardware Address NULL Value
1386 Name Name Size
1387 ================= =========== ======== ======= ==================
1388 Private private scratch 32 0x00000000
1389 Local group LDS 32 0xFFFFFFFF
1390 Global global global 64 0x0000000000000000
1391 Constant constant *same as 64 0x0000000000000000
1392 global*
1393 Generic flat flat 64 0x0000000000000000
1394 Region N/A GDS 32 *not implemented
1395 for AMDHSA*
1396 ================= =========== ======== ======= ==================
1397
1398The global and constant memory spaces both use global virtual addresses, which
1399are the same virtual address space used by the CPU. However, some virtual
1400addresses may only be accessible to the CPU, some only accessible by the GPU,
1401and some by both.
1402
1403Using the constant memory space indicates that the data will not change during
1404the execution of the kernel. This allows scalar read instructions to be
1405used. The vector and scalar L1 caches are invalidated of volatile data before
1406each kernel dispatch execution to allow constant memory to change values between
1407kernel dispatches.
1408
1409The local memory space uses the hardware Local Data Store (LDS) which is
1410automatically allocated when the hardware creates work-groups of wavefronts, and
1411freed when all the wavefronts of a work-group have terminated. The data store
1412(DS) instructions can be used to access it.
1413
1414The private memory space uses the hardware scratch memory support. If the kernel
1415uses scratch, then the hardware allocates memory that is accessed using
1416wavefront lane dword (4 byte) interleaving. The mapping used from private
1417address to physical address is:
1418
1419 ``wavefront-scratch-base +
1420 (private-address * wavefront-size * 4) +
1421 (wavefront-lane-id * 4)``
1422
1423There are different ways that the wavefront scratch base address is determined
1424by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1425memory can be accessed in an interleaved manner using buffer instruction with
1426the scratch buffer descriptor and per wave scratch offset, by the scratch
1427instructions, or by flat instructions. If each lane of a wavefront accesses the
1428same private address, the interleaving results in adjacent dwords being accessed
1429and hence requires fewer cache lines to be fetched. Multi-dword access is not
1430supported except by flat and scratch instructions in GFX9.
1431
1432The generic address space uses the hardware flat address support available in
1433GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1434local appertures), that are outside the range of addressible global memory, to
1435map from a flat address to a private or local address.
1436
1437FLAT instructions can take a flat address and access global, private (scratch)
1438and group (LDS) memory depending in if the address is within one of the
1439apperture ranges. Flat access to scratch requires hardware aperture setup and
1440setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1441access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1442(see :ref:`amdgpu-amdhsa-m0`).
1443
1444To convert between a segment address and a flat address the base address of the
1445appertures address can be used. For GFX7-GFX8 these are available in the
1446:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1447Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1448GFX9 the appature base addresses are directly available as inline constant
1449registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1450address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1451which makes it easier to convert from flat to segment or segment to flat.
1452
Tony Tye46d35762017-08-15 20:47:41 +00001453Image and Samplers
1454~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001455
1456Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1457hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1458HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1459enumeration values for the queries that are not trivially deducible from the S#
1460representation.
1461
1462HSA Signals
1463~~~~~~~~~~~
1464
Tony Tye46d35762017-08-15 20:47:41 +00001465HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1466structure allocated in memory accessible from both the CPU and GPU. The
1467structure is defined by the ROCm runtime and subject to change between releases
1468(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001469
1470.. _amdgpu-amdhsa-hsa-aql-queue:
1471
1472HSA AQL Queue
1473~~~~~~~~~~~~~
1474
Tony Tye46d35762017-08-15 20:47:41 +00001475The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001476between releases (see [AMD-ROCm-github]_). For some processors it contains
1477fields needed to implement certain language features such as the flat address
1478aperture bases. It also contains fields used by CP such as managing the
1479allocation of scratch memory.
1480
1481.. _amdgpu-amdhsa-kernel-descriptor:
1482
1483Kernel Descriptor
1484~~~~~~~~~~~~~~~~~
1485
1486A kernel descriptor consists of the information needed by CP to initiate the
1487execution of a kernel, including the entry point address of the machine code
1488that implements the kernel.
1489
1490Kernel Descriptor for GFX6-GFX9
1491+++++++++++++++++++++++++++++++
1492
1493CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1494
1495 .. table:: Kernel Descriptor for GFX6-GFX9
1496 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1497
Tony Tye6baa6d22017-10-18 22:16:55 +00001498 ======= ======= =============================== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001499 Bits Size Field Name Description
Tony Tye6baa6d22017-10-18 22:16:55 +00001500 ======= ======= =============================== ============================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001501 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001502 address space memory
1503 required for a work-group
1504 in bytes. This does not
1505 include any dynamically
1506 allocated local address
1507 space memory that may be
1508 added when the kernel is
1509 dispatched.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001510 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001511 private address space
1512 memory required for a
1513 work-item in bytes. If
1514 is_dynamic_callstack is 1
1515 then additional space must
1516 be added to this value for
1517 the call stack.
Tony Tye07d9f102017-11-10 01:00:54 +00001518 127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001519 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001520 negative) from base
1521 address of kernel
1522 descriptor to kernel's
1523 entry point instruction
1524 which must be 256 byte
1525 aligned.
Tony Tyee039d0e2018-01-30 23:07:10 +00001526 383:192 24 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001527 bytes
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001528 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001529 program settings used by
1530 CP to set up
1531 ``COMPUTE_PGM_RSRC1``
1532 configuration
1533 register. See
Tony Tye6baa6d22017-10-18 22:16:55 +00001534 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001535 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001536 program settings used by
1537 CP to set up
1538 ``COMPUTE_PGM_RSRC2``
1539 configuration
1540 register. See
1541 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001542 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1543 SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001544 (see
1545 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1546
1547 The total number of SGPR
1548 user data registers
1549 requested must not exceed
1550 16 and match value in
1551 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1552 Any requests beyond 16
1553 will be ignored.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001554 449 1 bit EnableSGPRDispatchPtr *see above*
1555 450 1 bit EnableSGPRQueuePtr *see above*
1556 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1557 452 1 bit EnableSGPRDispatchID *see above*
1558 453 1 bit EnableSGPRFlatScratchInit *see above*
1559 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1560 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1561 should always be 0.
1562 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1563 should always be 0.
1564 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1565 should always be 0.
Tony Tye31105cc2017-12-11 15:35:27 +00001566 463:458 6 bits Reserved, must be 0.
Tony Tye6baa6d22017-10-18 22:16:55 +00001567 511:464 6 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001568 bytes
1569 512 **Total size 64 bytes.**
Tony Tye6baa6d22017-10-18 22:16:55 +00001570 ======= ====================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001571
1572..
1573
1574 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001575 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tyef16a45e2017-06-06 20:31:59 +00001576
Tony Tye3b340612017-06-07 00:46:08 +00001577 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001578 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001579 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001580 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001581 used by each work-item,
1582 granularity is device
1583 specific:
1584
Tony Tye07d9f102017-11-10 01:00:54 +00001585 GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001586 - max_vgpr 1..256
1587 - roundup((max_vgpg + 1)
1588 / 4) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001589
1590 Used by CP to set up
1591 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001592 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001593 used by a wavefront,
1594 granularity is device
1595 specific:
1596
Tony Tye07d9f102017-11-10 01:00:54 +00001597 GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001598 - max_sgpr 1..112
1599 - roundup((max_sgpg + 1)
1600 / 8) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001601 GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001602 - max_sgpr 1..112
1603 - roundup((max_sgpg + 1)
1604 / 16) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001605
1606 Includes the special SGPRs
1607 for VCC, Flat Scratch (for
1608 GFX7 onwards) and XNACK
1609 (for GFX8 onwards). It does
1610 not include the 16 SGPR
1611 added if a trap handler is
1612 enabled.
1613
1614 Used by CP to set up
1615 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001616 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001617
1618 Start executing wavefront
1619 at the specified priority.
1620
1621 CP is responsible for
1622 filling in
1623 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001624 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001625 with specified rounding
1626 mode for single (32
1627 bit) floating point
1628 precision floating point
1629 operations.
1630
1631 Floating point rounding
1632 mode values are defined in
1633 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1634
1635 Used by CP to set up
1636 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001637 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001638 with specified rounding
1639 denorm mode for half/double (16
1640 and 64 bit) floating point
1641 precision floating point
1642 operations.
1643
1644 Floating point rounding
1645 mode values are defined in
1646 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1647
1648 Used by CP to set up
1649 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001650 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001651 with specified denorm mode
1652 for single (32
1653 bit) floating point
1654 precision floating point
1655 operations.
1656
1657 Floating point denorm mode
1658 values are defined in
1659 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1660
1661 Used by CP to set up
1662 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001663 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001664 with specified denorm mode
1665 for half/double (16
1666 and 64 bit) floating point
1667 precision floating point
1668 operations.
1669
1670 Floating point denorm mode
1671 values are defined in
1672 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1673
1674 Used by CP to set up
1675 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001676 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001677
1678 Start executing wavefront
1679 in privilege trap handler
1680 mode.
1681
1682 CP is responsible for
1683 filling in
1684 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001685 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001686 with DX10 clamp mode
1687 enabled. Used by the vector
Tony Tye6baa6d22017-10-18 22:16:55 +00001688 ALU to force DX10 style
Tony Tyef16a45e2017-06-06 20:31:59 +00001689 treatment of NaN's (when
1690 set, clamp NaN to zero,
1691 otherwise pass NaN
1692 through).
1693
1694 Used by CP to set up
1695 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001696 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001697
1698 Start executing wavefront
1699 in single step mode.
1700
1701 CP is responsible for
1702 filling in
1703 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001704 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001705 with IEEE mode
1706 enabled. Floating point
1707 opcodes that support
1708 exception flag gathering
1709 will quiet and propagate
1710 signaling-NaN inputs per
1711 IEEE 754-2008. Min_dx10 and
1712 max_dx10 become IEEE
1713 754-2008 compliant due to
1714 signaling-NaN propagation
1715 and quieting.
1716
1717 Used by CP to set up
1718 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001719 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001720
1721 Only one work-group allowed
1722 to execute on a compute
1723 unit.
1724
1725 CP is responsible for
1726 filling in
1727 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001728 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001729
1730 Flag that can be used to
1731 control debugging code.
1732
1733 CP is responsible for
1734 filling in
1735 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye07d9f102017-11-10 01:00:54 +00001736 26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001737 Reserved, must be 0.
1738 GFX9
1739 Wavefront starts execution
1740 with specified fp16 overflow
1741 mode.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001742
Tony Tye6baa6d22017-10-18 22:16:55 +00001743 - If 0, fp16 overflow generates
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001744 +/-INF values.
Tony Tye6baa6d22017-10-18 22:16:55 +00001745 - If 1, fp16 overflow that is the
1746 result of an +/-INF input value
1747 or divide by 0 produces a +/-INF,
1748 otherwise clamps computed
1749 overflow to +/-MAX_FP16 as
1750 appropriate.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001751
1752 Used by CP to set up
1753 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye6baa6d22017-10-18 22:16:55 +00001754 31:27 5 bits Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001755 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001756 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001757
1758..
1759
1760 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1761 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1762
Tony Tye3b340612017-06-07 00:46:08 +00001763 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001764 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001765 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001766 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1767 _WAVE_OFFSET SGPR wave scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001768 system register (see
1769 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1770
1771 Used by CP to set up
1772 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001773 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001774 user data registers
1775 requested. This number must
1776 match the number of user
1777 data registers enabled.
1778
1779 Used by CP to set up
1780 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001781 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tyef16a45e2017-06-06 20:31:59 +00001782 TRAP instruction which
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00001783 requires a trap handler to
Tony Tyef16a45e2017-06-06 20:31:59 +00001784 be enabled.
1785
1786 CP sets
1787 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1788 if the runtime has
1789 installed a trap handler
1790 regardless of the setting
1791 of this field.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001792 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001793 system SGPR register for
1794 the work-group id in the X
1795 dimension (see
1796 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1797
1798 Used by CP to set up
1799 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001800 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001801 system SGPR register for
1802 the work-group id in the Y
1803 dimension (see
1804 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1805
1806 Used by CP to set up
1807 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001808 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001809 system SGPR register for
1810 the work-group id in the Z
1811 dimension (see
1812 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1813
1814 Used by CP to set up
1815 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001816 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001817 system SGPR register for
1818 work-group information (see
1819 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1820
1821 Used by CP to set up
1822 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001823 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001824 VGPR system registers used
1825 for the work-item ID.
1826 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1827 defines the values.
1828
1829 Used by CP to set up
1830 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001831 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001832
1833 Wavefront starts execution
1834 with address watch
1835 exceptions enabled which
1836 are generated when L1 has
1837 witnessed a thread access
1838 an *address of
1839 interest*.
1840
1841 CP is responsible for
1842 filling in the address
1843 watch bit in
1844 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1845 according to what the
1846 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001847 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001848
1849 Wavefront starts execution
1850 with memory violation
1851 exceptions exceptions
1852 enabled which are generated
1853 when a memory violation has
1854 occurred for this wave from
1855 L1 or LDS
1856 (write-to-read-only-memory,
1857 mis-aligned atomic, LDS
1858 address out of range,
1859 illegal address, etc.).
1860
1861 CP sets the memory
1862 violation bit in
1863 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1864 according to what the
1865 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001866 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001867
1868 CP uses the rounded value
1869 from the dispatch packet,
1870 not this value, as the
1871 dispatch may contain
1872 dynamically allocated group
1873 segment memory. CP writes
1874 directly to
1875 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1876
1877 Amount of group segment
1878 (LDS) to allocate for each
1879 work-group. Granularity is
1880 device specific:
1881
1882 GFX6:
1883 roundup(lds-size / (64 * 4))
1884 GFX7-GFX9:
1885 roundup(lds-size / (128 * 4))
1886
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001887 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1888 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001889 enabled.
1890
1891 Used by CP to set up
1892 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1893 (set from bits 0..6).
1894
1895 IEEE 754 FP Invalid
1896 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001897 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1898 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001899 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001900 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1901 _DIVISION_BY_ZERO Zero
1902 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1903 _OVERFLOW
1904 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1905 _UNDERFLOW
1906 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1907 _INEXACT
1908 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1909 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001910 only)
Tony Tye6baa6d22017-10-18 22:16:55 +00001911 31 1 bit Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001912 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001913 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001914
1915..
1916
1917 .. table:: Floating Point Rounding Mode Enumeration Values
1918 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1919
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001920 ====================================== ===== ==============================
1921 Enumeration Name Value Description
1922 ====================================== ===== ==============================
1923 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1924 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1925 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1926 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1927 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001928
1929..
1930
1931 .. table:: Floating Point Denorm Mode Enumeration Values
1932 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1933
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001934 ====================================== ===== ==============================
1935 Enumeration Name Value Description
1936 ====================================== ===== ==============================
1937 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1938 Denorms
1939 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1940 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1941 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1942 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001943
1944..
1945
1946 .. table:: System VGPR Work-Item ID Enumeration Values
1947 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1948
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001949 ======================================== ===== ============================
1950 Enumeration Name Value Description
1951 ======================================== ===== ============================
1952 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1953 ID.
1954 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1955 dimensions ID.
1956 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1957 dimensions ID.
1958 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1959 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001960
1961.. _amdgpu-amdhsa-initial-kernel-execution-state:
1962
1963Initial Kernel Execution State
1964~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1965
1966This section defines the register state that will be set up by the packet
1967processor prior to the start of execution of every wavefront. This is limited by
1968the constraints of the hardware controllers of CP/ADC/SPI.
1969
1970The order of the SGPR registers is defined, but the compiler can specify which
1971ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
1972fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
1973for enabled registers are dense starting at SGPR0: the first enabled register is
1974SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
1975an SGPR number.
1976
1977The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
1978all waves of the grid. It is possible to specify more than 16 User SGPRs using
1979the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
1980initialized. These are then immediately followed by the System SGPRs that are
1981set up by ADC/SPI and can have different values for each wave of the grid
1982dispatch.
1983
1984SGPR register initial state is defined in
1985:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
1986
1987 .. table:: SGPR Register Set Up Order
1988 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
1989
1990 ========== ========================== ====== ==============================
1991 SGPR Order Name Number Description
1992 (kernel descriptor enable of
1993 field) SGPRs
1994 ========== ========================== ====== ==============================
1995 First Private Segment Buffer 4 V# that can be used, together
1996 (enable_sgpr_private with Scratch Wave Offset as an
1997 _segment_buffer) offset, to access the private
1998 memory space using a segment
1999 address.
2000
2001 CP uses the value provided by
2002 the runtime.
2003 then Dispatch Ptr 2 64 bit address of AQL dispatch
2004 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2005 actually executing.
2006 then Queue Ptr 2 64 bit address of amd_queue_t
2007 (enable_sgpr_queue_ptr) object for AQL queue on which
2008 the dispatch packet was
2009 queued.
2010 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2011 (enable_sgpr_kernarg segment. This is directly
2012 _segment_ptr) copied from the
2013 kernarg_address in the kernel
2014 dispatch packet.
2015
2016 Having CP load it once avoids
2017 loading it at the beginning of
2018 every wavefront.
2019 then Dispatch Id 2 64 bit Dispatch ID of the
2020 (enable_sgpr_dispatch_id) dispatch packet being
2021 executed.
2022 then Flat Scratch Init 2 This is 2 SGPRs:
2023 (enable_sgpr_flat_scratch
2024 _init) GFX6
2025 Not supported.
2026 GFX7-GFX8
2027 The first SGPR is a 32 bit
2028 byte offset from
2029 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2030 to per SPI base of memory
2031 for scratch for the queue
2032 executing the kernel
2033 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00002034 from the runtime. (The
2035 Scratch Segment Buffer base
2036 address is
2037 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2038 plus this offset.) The value
2039 of Scratch Wave Offset must
2040 be added to this offset by
2041 the kernel machine code,
2042 right shifted by 8, and
2043 moved to the FLAT_SCRATCH_HI
2044 SGPR register.
2045 FLAT_SCRATCH_HI corresponds
2046 to SGPRn-4 on GFX7, and
2047 SGPRn-6 on GFX8 (where SGPRn
2048 is the highest numbered SGPR
2049 allocated to the wave).
2050 FLAT_SCRATCH_HI is
2051 multiplied by 256 (as it is
2052 in units of 256 bytes) and
2053 added to
2054 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2055 to calculate the per wave
2056 FLAT SCRATCH BASE in flat
2057 memory instructions that
2058 access the scratch
2059 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002060
2061 The second SGPR is 32 bit
2062 byte size of a single
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002063 work-item's scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002064 usage. CP obtains this from
2065 the runtime, and it is
2066 always a multiple of DWORD.
2067 CP checks that the value in
2068 the kernel dispatch packet
2069 Private Segment Byte Size is
2070 not larger, and requests the
2071 runtime to increase the
2072 queue's scratch size if
2073 necessary. The kernel code
2074 must move it to
2075 FLAT_SCRATCH_LO which is
2076 SGPRn-3 on GFX7 and SGPRn-5
2077 on GFX8. FLAT_SCRATCH_LO is
2078 used as the FLAT SCRATCH
2079 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002080 instructions. Having CP load
2081 it once avoids loading it at
2082 the beginning of every
Tony Tyef59d0712017-11-10 20:51:43 +00002083 wavefront.
2084 GFX9
2085 This is the
Tony Tye46d35762017-08-15 20:47:41 +00002086 64 bit base address of the
2087 per SPI scratch backing
2088 memory managed by SPI for
2089 the queue executing the
2090 kernel dispatch. CP obtains
2091 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002092 divides it if there are
2093 multiple Shader Arrays each
2094 with its own SPI). The value
2095 of Scratch Wave Offset must
2096 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002097 machine code and the result
2098 moved to the FLAT_SCRATCH
2099 SGPR which is SGPRn-6 and
2100 SGPRn-5. It is used as the
2101 FLAT SCRATCH BASE in flat
Tony Tyef59d0712017-11-10 20:51:43 +00002102 memory instructions.
2103 then Private Segment Size 1 The 32 bit byte size of a
2104 (enable_sgpr_private single
2105 work-item's
2106 scratch_segment_size) memory
2107 allocation. This is the
2108 value from the kernel
2109 dispatch packet Private
2110 Segment Byte Size rounded up
2111 by CP to a multiple of
2112 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002113
2114 Having CP load it once avoids
2115 loading it at the beginning of
2116 every wavefront.
2117
2118 This is not used for
2119 GFX7-GFX8 since it is the same
2120 value as the second SGPR of
2121 Flat Scratch Init. However, it
2122 may be needed for GFX9 which
2123 changes the meaning of the
2124 Flat Scratch Init value.
2125 then Grid Work-Group Count X 1 32 bit count of the number of
2126 (enable_sgpr_grid work-groups in the X dimension
2127 _workgroup_count_X) for the grid being
2128 executed. Computed from the
2129 fields in the kernel dispatch
2130 packet as ((grid_size.x +
2131 workgroup_size.x - 1) /
2132 workgroup_size.x).
2133 then Grid Work-Group Count Y 1 32 bit count of the number of
2134 (enable_sgpr_grid work-groups in the Y dimension
2135 _workgroup_count_Y && for the grid being
2136 less than 16 previous executed. Computed from the
2137 SGPRs) fields in the kernel dispatch
2138 packet as ((grid_size.y +
2139 workgroup_size.y - 1) /
2140 workgroupSize.y).
2141
2142 Only initialized if <16
2143 previous SGPRs initialized.
2144 then Grid Work-Group Count Z 1 32 bit count of the number of
2145 (enable_sgpr_grid work-groups in the Z dimension
2146 _workgroup_count_Z && for the grid being
2147 less than 16 previous executed. Computed from the
2148 SGPRs) fields in the kernel dispatch
2149 packet as ((grid_size.z +
2150 workgroup_size.z - 1) /
2151 workgroupSize.z).
2152
2153 Only initialized if <16
2154 previous SGPRs initialized.
2155 then Work-Group Id X 1 32 bit work-group id in X
2156 (enable_sgpr_workgroup_id dimension of grid for
2157 _X) wavefront.
2158 then Work-Group Id Y 1 32 bit work-group id in Y
2159 (enable_sgpr_workgroup_id dimension of grid for
2160 _Y) wavefront.
2161 then Work-Group Id Z 1 32 bit work-group id in Z
2162 (enable_sgpr_workgroup_id dimension of grid for
2163 _Z) wavefront.
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002164 then Work-Group Info 1 {first_wave, 14'b0000,
Tony Tyef16a45e2017-06-06 20:31:59 +00002165 (enable_sgpr_workgroup ordered_append_term[10:0],
2166 _info) threadgroup_size_in_waves[5:0]}
2167 then Scratch Wave Offset 1 32 bit byte offset from base
2168 (enable_sgpr_private of scratch base of queue
2169 _segment_wave_offset) executing the kernel
2170 dispatch. Must be used as an
2171 offset with Private
2172 segment address when using
2173 Scratch Segment Buffer. It
2174 must be used to set up FLAT
2175 SCRATCH for flat addressing
2176 (see
2177 :ref:`amdgpu-amdhsa-flat-scratch`).
2178 ========== ========================== ====== ==============================
2179
2180The order of the VGPR registers is defined, but the compiler can specify which
2181ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2182fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2183for enabled registers are dense starting at VGPR0: the first enabled register is
2184VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2185VGPR number.
2186
2187VGPR register initial state is defined in
2188:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2189
2190 .. table:: VGPR Register Set Up Order
2191 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2192
2193 ========== ========================== ====== ==============================
2194 VGPR Order Name Number Description
2195 (kernel descriptor enable of
2196 field) VGPRs
2197 ========== ========================== ====== ==============================
2198 First Work-Item Id X 1 32 bit work item id in X
2199 (Always initialized) dimension of work-group for
2200 wavefront lane.
2201 then Work-Item Id Y 1 32 bit work item id in Y
2202 (enable_vgpr_workitem_id dimension of work-group for
2203 > 0) wavefront lane.
2204 then Work-Item Id Z 1 32 bit work item id in Z
2205 (enable_vgpr_workitem_id dimension of work-group for
2206 > 1) wavefront lane.
2207 ========== ========================== ====== ==============================
2208
2209The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2210
22111. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2212 registers.
22132. Work-group Id registers X, Y, Z are set by ADC which supports any
2214 combination including none.
22153. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2216 cannot included with the flat scratch init value which is per queue.
22174. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2218 or (X, Y, Z).
2219
2220Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2221value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2222
2223The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye07d9f102017-11-10 01:00:54 +00002224has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tyef16a45e2017-06-06 20:31:59 +00002225instructions (GFX9).
2226
2227If buffer operations are used then the compiler can generate a V# with the
2228following properties:
2229
2230* base address of 0
2231* no swizzle
2232* ATC: 1 if IOMMU present (such as APU)
2233* ptr64: 1
2234* MTYPE set to support memory coherence that matches the runtime (such as CC for
2235 APU and NC for dGPU).
2236
2237.. _amdgpu-amdhsa-kernel-prolog:
2238
2239Kernel Prolog
2240~~~~~~~~~~~~~
2241
2242.. _amdgpu-amdhsa-m0:
2243
2244M0
2245++
2246
2247GFX6-GFX8
2248 The M0 register must be initialized with a value at least the total LDS size
2249 if the kernel may access LDS via DS or flat operations. Total LDS size is
2250 available in dispatch packet. For M0, it is also possible to use maximum
2251 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2252 GFX7-GFX8).
2253GFX9
2254 The M0 register is not used for range checking LDS accesses and so does not
2255 need to be initialized in the prolog.
2256
2257.. _amdgpu-amdhsa-flat-scratch:
2258
2259Flat Scratch
2260++++++++++++
2261
2262If the kernel may use flat operations to access scratch memory, the prolog code
2263must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2264are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2265Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2266
2267GFX6
2268 Flat scratch is not supported.
2269
Tony Tye07d9f102017-11-10 01:00:54 +00002270GFX7-GFX8
Tony Tyef16a45e2017-06-06 20:31:59 +00002271 1. The low word of Flat Scratch Init is 32 bit byte offset from
2272 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2273 being managed by SPI for the queue executing the kernel dispatch. This is
2274 the same value used in the Scratch Segment Buffer V# base address. The
2275 prolog must add the value of Scratch Wave Offset to get the wave's byte
2276 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2277 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2278 by 8 before moving into FLAT_SCRATCH_LO.
2279 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2280 work-items scratch memory usage. This is directly loaded from the kernel
2281 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2282 DWORD. Having CP load it once avoids loading it at the beginning of every
2283 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2284 SIZE.
Tony Tyef59d0712017-11-10 20:51:43 +00002285
Tony Tyef16a45e2017-06-06 20:31:59 +00002286GFX9
2287 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2288 memory being managed by SPI for the queue executing the kernel dispatch. The
2289 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2290 pair for use as the flat scratch base in flat memory instructions.
2291
2292.. _amdgpu-amdhsa-memory-model:
2293
2294Memory Model
2295~~~~~~~~~~~~
2296
2297This section describes the mapping of LLVM memory model onto AMDGPU machine code
2298(see :ref:`memmodel`). *The implementation is WIP.*
2299
2300.. TODO
2301 Update when implementation complete.
2302
Tony Tyef16a45e2017-06-06 20:31:59 +00002303The AMDGPU backend supports the memory synchronization scopes specified in
2304:ref:`amdgpu-memory-scopes`.
2305
2306The code sequences used to implement the memory model are defined in table
2307:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2308
2309The sequences specify the order of instructions that a single thread must
2310execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2311to other memory instructions executed by the same thread. This allows them to be
2312moved earlier or later which can allow them to be combined with other instances
2313of the same instruction, or hoisted/sunk out of loops to improve
2314performance. Only the instructions related to the memory model are given;
2315additional ``s_waitcnt`` instructions are required to ensure registers are
2316defined before being used. These may be able to be combined with the memory
2317model ``s_waitcnt`` instructions as described above.
2318
Tony Tye6baa6d22017-10-18 22:16:55 +00002319The AMDGPU backend supports the following memory models:
2320
2321 HSA Memory Model [HSA]_
2322 The HSA memory model uses a single happens-before relation for all address
2323 spaces (see :ref:`amdgpu-address-spaces`).
2324 OpenCL Memory Model [OpenCL]_
2325 The OpenCL memory model which has separate happens-before relations for the
2326 global and local address spaces. Only a fence specifying both global and
2327 local address space, and seq_cst instructions join the relationships. Since
2328 the LLVM ``memfence`` instruction does not allow an address space to be
2329 specified the OpenCL fence has to convervatively assume both local and
2330 global address space was specified. However, optimizations can often be
2331 done to eliminate the additional ``s_waitcnt`` instructions when there are
2332 no intervening memory instructions which access the corresponding address
2333 space. The code sequences in the table indicate what can be omitted for the
2334 OpenCL memory. The target triple environment is used to determine if the
2335 source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tyef16a45e2017-06-06 20:31:59 +00002336
2337``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2338operations.
2339
2340``buffer/global/flat_load/store/atomic`` instructions to global memory are
2341termed vector memory operations.
2342
2343For GFX6-GFX9:
2344
2345* Each agent has multiple compute units (CU).
2346* Each CU has multiple SIMDs that execute wavefronts.
2347* The wavefronts for a single work-group are executed in the same CU but may be
2348 executed by different SIMDs.
2349* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2350 executing on it.
2351* All LDS operations of a CU are performed as wavefront wide operations in a
2352 global order and involve no caching. Completion is reported to a wavefront in
2353 execution order.
2354* The LDS memory has multiple request queues shared by the SIMDs of a
2355 CU. Therefore, the LDS operations performed by different waves of a work-group
2356 can be reordered relative to each other, which can result in reordering the
2357 visibility of vector memory operations with respect to LDS operations of other
2358 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002359 ensure synchronization between LDS operations and vector memory operations
Tony Tyef16a45e2017-06-06 20:31:59 +00002360 between waves of a work-group, but not between operations performed by the
2361 same wavefront.
2362* The vector memory operations are performed as wavefront wide operations and
2363 completion is reported to a wavefront in execution order. The exception is
Tony Tye07d9f102017-11-10 01:00:54 +00002364 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tyef16a45e2017-06-06 20:31:59 +00002365 vector memory order if they access LDS memory, and out of LDS operation order
2366 if they access global memory.
Tony Tye6baa6d22017-10-18 22:16:55 +00002367* The vector memory operations access a single vector L1 cache shared by all
2368 SIMDs a CU. Therefore, no special action is required for coherence between the
2369 lanes of a single wavefront, or for coherence between wavefronts in the same
2370 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2371 executing in different work-groups as they may be executing on different CUs.
Tony Tyef16a45e2017-06-06 20:31:59 +00002372* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2373 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2374 scalar operations are used in a restricted way so do not impact the memory
2375 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2376* The vector and scalar memory operations use an L2 cache shared by all CUs on
2377 the same agent.
2378* The L2 cache has independent channels to service disjoint ranges of virtual
2379 addresses.
2380* Each CU has a separate request queue per channel. Therefore, the vector and
2381 scalar memory operations performed by waves executing in different work-groups
2382 (which may be executing on different CUs) of an agent can be reordered
2383 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002384 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002385 previous vector memory operation has completed before executing a subsequent
2386 vector memory or LDS operation and so can be used to meet the requirements of
2387 acquire and release.
2388* The L2 cache can be kept coherent with other agents on some targets, or ranges
2389 of virtual addresses can be set up to bypass it to ensure system coherence.
2390
Tony Tye07d9f102017-11-10 01:00:54 +00002391Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tyef16a45e2017-06-06 20:31:59 +00002392or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2393memory, atomic memory orderings are not meaningful and all accesses are treated
2394as non-atomic.
2395
2396Constant address space uses ``buffer/global_load`` instructions (or equivalent
2397scalar memory instructions). Since the constant address space contents do not
2398change during the execution of a kernel dispatch it is not legal to perform
2399stores, and atomic memory orderings are not meaningful and all access are
2400treated as non-atomic.
2401
2402A memory synchronization scope wider than work-group is not meaningful for the
2403group (LDS) address space and is treated as work-group.
2404
2405The memory model does not support the region address space which is treated as
2406non-atomic.
2407
2408Acquire memory ordering is not meaningful on store atomic instructions and is
2409treated as non-atomic.
2410
2411Release memory ordering is not meaningful on load atomic instructions and is
2412treated a non-atomic.
2413
2414Acquire-release memory ordering is not meaningful on load or store atomic
2415instructions and is treated as acquire and release respectively.
2416
2417AMDGPU backend only uses scalar memory operations to access memory that is
2418proven to not change during the execution of the kernel dispatch. This includes
2419constant address space and global address space for program scope const
2420variables. Therefore the kernel machine code does not have to maintain the
2421scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2422and vector L1 caches are invalidated between kernel dispatches by CP since
2423constant address space data may change between kernel dispatch executions. See
2424:ref:`amdgpu-amdhsa-memory-spaces`.
2425
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002426The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002427case the AMDGPU backend ensures the memory location used to spill is never
2428accessed by vector memory operations at the same time. If scalar writes are used
2429then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2430return since the locations may be used for vector memory instructions by a
2431future wave that uses the same scratch area, or a function call that creates a
2432frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2433as all scalar writes are write-before-read in the same thread.
2434
Tony Tye6baa6d22017-10-18 22:16:55 +00002435Scratch backing memory (which is used for the private address space)
2436is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2437address space is only accessed by a single thread, and is always
2438write-before-read, there is never a need to invalidate these entries from the L1
2439cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2440volatile cache lines.
Tony Tyef16a45e2017-06-06 20:31:59 +00002441
2442On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye6baa6d22017-10-18 22:16:55 +00002443to invalidate the L2 cache. This also causes it to be treated as
2444non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2445(cache coherent) and so the L2 cache will coherent with the CPU and other
2446agents.
Tony Tyef16a45e2017-06-06 20:31:59 +00002447
2448 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2449 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2450
Tony Tye6baa6d22017-10-18 22:16:55 +00002451 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002452 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2453 Ordering Sync Scope Address
2454 Space
Tony Tye6baa6d22017-10-18 22:16:55 +00002455 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002456 **Non-Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002457 -----------------------------------------------------------------------------------
2458 load *none* *none* - global - !volatile & !nontemporal
2459 - generic
2460 - private 1. buffer/global/flat_load
2461 - constant
2462 - volatile & !nontemporal
2463
Tony Tyef16a45e2017-06-06 20:31:59 +00002464 1. buffer/global/flat_load
2465 glc=1
Tony Tye6baa6d22017-10-18 22:16:55 +00002466
2467 - nontemporal
2468
2469 1. buffer/global/flat_load
2470 glc=1 slc=1
2471
Tony Tyef16a45e2017-06-06 20:31:59 +00002472 load *none* *none* - local 1. ds_load
Tony Tye6baa6d22017-10-18 22:16:55 +00002473 store *none* *none* - global - !nontemporal
Tony Tyef16a45e2017-06-06 20:31:59 +00002474 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002475 - private 1. buffer/global/flat_store
2476 - constant
2477 - nontemporal
2478
2479 1. buffer/global/flat_stote
2480 glc=1 slc=1
2481
Tony Tyef16a45e2017-06-06 20:31:59 +00002482 store *none* *none* - local 1. ds_store
2483 **Unordered Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002484 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002485 load atomic unordered *any* *any* *Same as non-atomic*.
2486 store atomic unordered *any* *any* *Same as non-atomic*.
2487 atomicrmw unordered *any* *any* *Same as monotonic
2488 atomic*.
2489 **Monotonic Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002490 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002491 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2492 - wavefront - generic
2493 - workgroup
2494 load atomic monotonic - singlethread - local 1. ds_load
2495 - wavefront
2496 - workgroup
2497 load atomic monotonic - agent - global 1. buffer/global/flat_load
2498 - system - generic glc=1
2499 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2500 - wavefront - generic
2501 - workgroup
2502 - agent
2503 - system
2504 store atomic monotonic - singlethread - local 1. ds_store
2505 - wavefront
2506 - workgroup
2507 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2508 - wavefront - generic
2509 - workgroup
2510 - agent
2511 - system
2512 atomicrmw monotonic - singlethread - local 1. ds_atomic
2513 - wavefront
2514 - workgroup
2515 **Acquire Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002516 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002517 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2518 - wavefront - local
2519 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002520 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2521 load atomic acquire - workgroup - local 1. ds_load
2522 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002523
Tony Tye6baa6d22017-10-18 22:16:55 +00002524 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002525 - Must happen before
2526 any following
2527 global/generic
2528 load/load
2529 atomic/store/store
2530 atomic/atomicrmw.
2531 - Ensures any
2532 following global
2533 data read is no
2534 older than the load
2535 atomic value being
2536 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00002537 load atomic acquire - workgroup - generic 1. flat_load
2538 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002539
Tony Tye6baa6d22017-10-18 22:16:55 +00002540 - If OpenCL, omit.
2541 - Must happen before
2542 any following
2543 global/generic
2544 load/load
2545 atomic/store/store
2546 atomic/atomicrmw.
2547 - Ensures any
2548 following global
2549 data read is no
2550 older than the load
2551 atomic value being
2552 acquired.
2553 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tyef16a45e2017-06-06 20:31:59 +00002554 - system glc=1
2555 2. s_waitcnt vmcnt(0)
2556
2557 - Must happen before
2558 following
2559 buffer_wbinvl1_vol.
2560 - Ensures the load
2561 has completed
2562 before invalidating
2563 the cache.
2564
2565 3. buffer_wbinvl1_vol
2566
2567 - Must happen before
2568 any following
2569 global/generic
2570 load/load
2571 atomic/atomicrmw.
2572 - Ensures that
2573 following
2574 loads will not see
2575 stale global data.
2576
2577 load atomic acquire - agent - generic 1. flat_load glc=1
2578 - system 2. s_waitcnt vmcnt(0) &
2579 lgkmcnt(0)
2580
2581 - If OpenCL omit
2582 lgkmcnt(0).
2583 - Must happen before
2584 following
2585 buffer_wbinvl1_vol.
2586 - Ensures the flat_load
2587 has completed
2588 before invalidating
2589 the cache.
2590
2591 3. buffer_wbinvl1_vol
2592
2593 - Must happen before
2594 any following
2595 global/generic
2596 load/load
2597 atomic/atomicrmw.
2598 - Ensures that
2599 following loads
2600 will not see stale
2601 global data.
2602
2603 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2604 - wavefront - local
2605 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002606 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2607 atomicrmw acquire - workgroup - local 1. ds_atomic
2608 2. waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002609
Tony Tye6baa6d22017-10-18 22:16:55 +00002610 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002611 - Must happen before
2612 any following
2613 global/generic
2614 load/load
2615 atomic/store/store
2616 atomic/atomicrmw.
2617 - Ensures any
2618 following global
2619 data read is no
2620 older than the
2621 atomicrmw value
2622 being acquired.
2623
Tony Tye6baa6d22017-10-18 22:16:55 +00002624 atomicrmw acquire - workgroup - generic 1. flat_atomic
2625 2. waitcnt lgkmcnt(0)
2626
2627 - If OpenCL, omit.
2628 - Must happen before
2629 any following
2630 global/generic
2631 load/load
2632 atomic/store/store
2633 atomic/atomicrmw.
2634 - Ensures any
2635 following global
2636 data read is no
2637 older than the
2638 atomicrmw value
2639 being acquired.
2640
2641 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002642 - system 2. s_waitcnt vmcnt(0)
2643
2644 - Must happen before
2645 following
2646 buffer_wbinvl1_vol.
2647 - Ensures the
2648 atomicrmw has
2649 completed before
2650 invalidating the
2651 cache.
2652
2653 3. buffer_wbinvl1_vol
2654
2655 - Must happen before
2656 any following
2657 global/generic
2658 load/load
2659 atomic/atomicrmw.
2660 - Ensures that
2661 following loads
2662 will not see stale
2663 global data.
2664
2665 atomicrmw acquire - agent - generic 1. flat_atomic
2666 - system 2. s_waitcnt vmcnt(0) &
2667 lgkmcnt(0)
2668
2669 - If OpenCL, omit
2670 lgkmcnt(0).
2671 - Must happen before
2672 following
2673 buffer_wbinvl1_vol.
2674 - Ensures the
2675 atomicrmw has
2676 completed before
2677 invalidating the
2678 cache.
2679
2680 3. buffer_wbinvl1_vol
2681
2682 - Must happen before
2683 any following
2684 global/generic
2685 load/load
2686 atomic/atomicrmw.
2687 - Ensures that
2688 following loads
2689 will not see stale
2690 global data.
2691
2692 fence acquire - singlethread *none* *none*
2693 - wavefront
2694 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2695
2696 - If OpenCL and
2697 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002698 not generic, omit.
2699 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002700 currently has no
2701 address space on
2702 the fence need to
2703 conservatively
2704 always generate. If
2705 fence had an
2706 address space then
2707 set to address
2708 space of OpenCL
2709 fence flag, or to
2710 generic if both
2711 local and global
2712 flags are
2713 specified.
2714 - Must happen after
2715 any preceding
2716 local/generic load
2717 atomic/atomicrmw
2718 with an equal or
2719 wider sync scope
2720 and memory ordering
2721 stronger than
2722 unordered (this is
2723 termed the
2724 fence-paired-atomic).
2725 - Must happen before
2726 any following
2727 global/generic
2728 load/load
2729 atomic/store/store
2730 atomic/atomicrmw.
2731 - Ensures any
2732 following global
2733 data read is no
2734 older than the
2735 value read by the
2736 fence-paired-atomic.
2737
Tony Tye6baa6d22017-10-18 22:16:55 +00002738 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2739 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002740
2741 - If OpenCL and
2742 address space is
2743 not generic, omit
2744 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002745 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002746 currently has no
2747 address space on
2748 the fence need to
2749 conservatively
2750 always generate
2751 (see comment for
2752 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002753 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002754 separate s_waitcnt
2755 vmcnt(0) and
2756 s_waitcnt
2757 lgkmcnt(0) to allow
2758 them to be
2759 independently moved
2760 according to the
2761 following rules.
2762 - s_waitcnt vmcnt(0)
2763 must happen after
2764 any preceding
2765 global/generic load
2766 atomic/atomicrmw
2767 with an equal or
2768 wider sync scope
2769 and memory ordering
2770 stronger than
2771 unordered (this is
2772 termed the
2773 fence-paired-atomic).
2774 - s_waitcnt lgkmcnt(0)
2775 must happen after
2776 any preceding
Tony Tye6baa6d22017-10-18 22:16:55 +00002777 local/generic load
Tony Tyef16a45e2017-06-06 20:31:59 +00002778 atomic/atomicrmw
2779 with an equal or
2780 wider sync scope
2781 and memory ordering
2782 stronger than
2783 unordered (this is
2784 termed the
2785 fence-paired-atomic).
2786 - Must happen before
2787 the following
2788 buffer_wbinvl1_vol.
2789 - Ensures that the
2790 fence-paired atomic
2791 has completed
2792 before invalidating
2793 the
2794 cache. Therefore
2795 any following
2796 locations read must
2797 be no older than
2798 the value read by
2799 the
2800 fence-paired-atomic.
2801
2802 2. buffer_wbinvl1_vol
2803
Tony Tye6baa6d22017-10-18 22:16:55 +00002804 - Must happen before any
2805 following global/generic
Tony Tyef16a45e2017-06-06 20:31:59 +00002806 load/load
2807 atomic/store/store
2808 atomic/atomicrmw.
2809 - Ensures that
2810 following loads
2811 will not see stale
2812 global data.
2813
2814 **Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002815 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002816 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2817 - wavefront - local
2818 - generic
2819 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002820
2821 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002822 - Must happen after
2823 any preceding
2824 local/generic
2825 load/store/load
2826 atomic/store
2827 atomic/atomicrmw.
2828 - Must happen before
2829 the following
2830 store.
2831 - Ensures that all
2832 memory operations
2833 to local have
2834 completed before
2835 performing the
2836 store that is being
2837 released.
2838
2839 2. buffer/global/flat_store
2840 store atomic release - workgroup - local 1. ds_store
Tony Tye6baa6d22017-10-18 22:16:55 +00002841 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2842
2843 - If OpenCL, omit.
2844 - Must happen after
2845 any preceding
2846 local/generic
2847 load/store/load
2848 atomic/store
2849 atomic/atomicrmw.
2850 - Must happen before
2851 the following
2852 store.
2853 - Ensures that all
2854 memory operations
2855 to local have
2856 completed before
2857 performing the
2858 store that is being
2859 released.
2860
2861 2. flat_store
2862 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2863 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002864
2865 - If OpenCL, omit
2866 lgkmcnt(0).
2867 - Could be split into
2868 separate s_waitcnt
2869 vmcnt(0) and
2870 s_waitcnt
2871 lgkmcnt(0) to allow
2872 them to be
2873 independently moved
2874 according to the
2875 following rules.
2876 - s_waitcnt vmcnt(0)
2877 must happen after
2878 any preceding
2879 global/generic
2880 load/store/load
2881 atomic/store
2882 atomic/atomicrmw.
2883 - s_waitcnt lgkmcnt(0)
2884 must happen after
2885 any preceding
2886 local/generic
2887 load/store/load
2888 atomic/store
2889 atomic/atomicrmw.
2890 - Must happen before
2891 the following
2892 store.
2893 - Ensures that all
2894 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00002895 to memory have
Tony Tyef16a45e2017-06-06 20:31:59 +00002896 completed before
2897 performing the
2898 store that is being
2899 released.
2900
2901 2. buffer/global/ds/flat_store
2902 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2903 - wavefront - local
2904 - generic
2905 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002906
2907 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002908 - Must happen after
2909 any preceding
2910 local/generic
2911 load/store/load
2912 atomic/store
2913 atomic/atomicrmw.
2914 - Must happen before
2915 the following
2916 atomicrmw.
2917 - Ensures that all
2918 memory operations
2919 to local have
2920 completed before
2921 performing the
2922 atomicrmw that is
2923 being released.
2924
2925 2. buffer/global/flat_atomic
2926 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye6baa6d22017-10-18 22:16:55 +00002927 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2928
2929 - If OpenCL, omit.
2930 - Must happen after
2931 any preceding
2932 local/generic
2933 load/store/load
2934 atomic/store
2935 atomic/atomicrmw.
2936 - Must happen before
2937 the following
2938 atomicrmw.
2939 - Ensures that all
2940 memory operations
2941 to local have
2942 completed before
2943 performing the
2944 atomicrmw that is
2945 being released.
2946
2947 2. flat_atomic
2948 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
2949 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002950
2951 - If OpenCL, omit
2952 lgkmcnt(0).
2953 - Could be split into
2954 separate s_waitcnt
2955 vmcnt(0) and
2956 s_waitcnt
2957 lgkmcnt(0) to allow
2958 them to be
2959 independently moved
2960 according to the
2961 following rules.
2962 - s_waitcnt vmcnt(0)
2963 must happen after
2964 any preceding
2965 global/generic
2966 load/store/load
2967 atomic/store
2968 atomic/atomicrmw.
2969 - s_waitcnt lgkmcnt(0)
2970 must happen after
2971 any preceding
2972 local/generic
2973 load/store/load
2974 atomic/store
2975 atomic/atomicrmw.
2976 - Must happen before
2977 the following
2978 atomicrmw.
2979 - Ensures that all
2980 memory operations
2981 to global and local
2982 have completed
2983 before performing
2984 the atomicrmw that
2985 is being released.
2986
Tony Tye6baa6d22017-10-18 22:16:55 +00002987 2. buffer/global/ds/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002988 fence release - singlethread *none* *none*
2989 - wavefront
2990 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2991
2992 - If OpenCL and
2993 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002994 not generic, omit.
2995 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002996 currently has no
2997 address space on
2998 the fence need to
2999 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003000 always generate. If
3001 fence had an
3002 address space then
3003 set to address
3004 space of OpenCL
3005 fence flag, or to
3006 generic if both
3007 local and global
3008 flags are
3009 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003010 - Must happen after
3011 any preceding
3012 local/generic
3013 load/load
3014 atomic/store/store
3015 atomic/atomicrmw.
3016 - Must happen before
3017 any following store
3018 atomic/atomicrmw
3019 with an equal or
3020 wider sync scope
3021 and memory ordering
3022 stronger than
3023 unordered (this is
3024 termed the
3025 fence-paired-atomic).
3026 - Ensures that all
3027 memory operations
3028 to local have
3029 completed before
3030 performing the
3031 following
3032 fence-paired-atomic.
3033
Tony Tye6baa6d22017-10-18 22:16:55 +00003034 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3035 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003036
3037 - If OpenCL and
3038 address space is
3039 not generic, omit
3040 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003041 - If OpenCL and
3042 address space is
3043 local, omit
3044 vmcnt(0).
3045 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003046 currently has no
3047 address space on
3048 the fence need to
3049 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003050 always generate. If
3051 fence had an
3052 address space then
3053 set to address
3054 space of OpenCL
3055 fence flag, or to
3056 generic if both
3057 local and global
3058 flags are
3059 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003060 - Could be split into
3061 separate s_waitcnt
3062 vmcnt(0) and
3063 s_waitcnt
3064 lgkmcnt(0) to allow
3065 them to be
3066 independently moved
3067 according to the
3068 following rules.
3069 - s_waitcnt vmcnt(0)
3070 must happen after
3071 any preceding
3072 global/generic
3073 load/store/load
3074 atomic/store
3075 atomic/atomicrmw.
3076 - s_waitcnt lgkmcnt(0)
3077 must happen after
3078 any preceding
3079 local/generic
3080 load/store/load
3081 atomic/store
3082 atomic/atomicrmw.
3083 - Must happen before
3084 any following store
3085 atomic/atomicrmw
3086 with an equal or
3087 wider sync scope
3088 and memory ordering
3089 stronger than
3090 unordered (this is
3091 termed the
3092 fence-paired-atomic).
3093 - Ensures that all
3094 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00003095 have
Tony Tyef16a45e2017-06-06 20:31:59 +00003096 completed before
3097 performing the
3098 following
3099 fence-paired-atomic.
3100
3101 **Acquire-Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003102 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003103 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3104 - wavefront - local
3105 - generic
3106 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3107
Tony Tye6baa6d22017-10-18 22:16:55 +00003108 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003109 - Must happen after
3110 any preceding
3111 local/generic
3112 load/store/load
3113 atomic/store
3114 atomic/atomicrmw.
3115 - Must happen before
3116 the following
3117 atomicrmw.
3118 - Ensures that all
3119 memory operations
3120 to local have
3121 completed before
3122 performing the
3123 atomicrmw that is
3124 being released.
3125
Tony Tye6baa6d22017-10-18 22:16:55 +00003126 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003127 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3128 2. s_waitcnt lgkmcnt(0)
3129
Tony Tye6baa6d22017-10-18 22:16:55 +00003130 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003131 - Must happen before
3132 any following
3133 global/generic
3134 load/load
3135 atomic/store/store
3136 atomic/atomicrmw.
3137 - Ensures any
3138 following global
3139 data read is no
3140 older than the load
3141 atomic value being
3142 acquired.
3143
3144 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3145
Tony Tye6baa6d22017-10-18 22:16:55 +00003146 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003147 - Must happen after
3148 any preceding
3149 local/generic
3150 load/store/load
3151 atomic/store
3152 atomic/atomicrmw.
3153 - Must happen before
3154 the following
3155 atomicrmw.
3156 - Ensures that all
3157 memory operations
3158 to local have
3159 completed before
3160 performing the
3161 atomicrmw that is
3162 being released.
3163
3164 2. flat_atomic
3165 3. s_waitcnt lgkmcnt(0)
3166
Tony Tye6baa6d22017-10-18 22:16:55 +00003167 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003168 - Must happen before
3169 any following
3170 global/generic
3171 load/load
3172 atomic/store/store
3173 atomic/atomicrmw.
3174 - Ensures any
3175 following global
3176 data read is no
3177 older than the load
3178 atomic value being
3179 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00003180
3181 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3182 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003183
3184 - If OpenCL, omit
3185 lgkmcnt(0).
3186 - Could be split into
3187 separate s_waitcnt
3188 vmcnt(0) and
3189 s_waitcnt
3190 lgkmcnt(0) to allow
3191 them to be
3192 independently moved
3193 according to the
3194 following rules.
3195 - s_waitcnt vmcnt(0)
3196 must happen after
3197 any preceding
3198 global/generic
3199 load/store/load
3200 atomic/store
3201 atomic/atomicrmw.
3202 - s_waitcnt lgkmcnt(0)
3203 must happen after
3204 any preceding
3205 local/generic
3206 load/store/load
3207 atomic/store
3208 atomic/atomicrmw.
3209 - Must happen before
3210 the following
3211 atomicrmw.
3212 - Ensures that all
3213 memory operations
3214 to global have
3215 completed before
3216 performing the
3217 atomicrmw that is
3218 being released.
3219
Tony Tye6baa6d22017-10-18 22:16:55 +00003220 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003221 3. s_waitcnt vmcnt(0)
3222
3223 - Must happen before
3224 following
3225 buffer_wbinvl1_vol.
3226 - Ensures the
3227 atomicrmw has
3228 completed before
3229 invalidating the
3230 cache.
3231
3232 4. buffer_wbinvl1_vol
3233
3234 - Must happen before
3235 any following
3236 global/generic
3237 load/load
3238 atomic/atomicrmw.
3239 - Ensures that
3240 following loads
3241 will not see stale
3242 global data.
3243
Tony Tye6baa6d22017-10-18 22:16:55 +00003244 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3245 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003246
3247 - If OpenCL, omit
3248 lgkmcnt(0).
3249 - Could be split into
3250 separate s_waitcnt
3251 vmcnt(0) and
3252 s_waitcnt
3253 lgkmcnt(0) to allow
3254 them to be
3255 independently moved
3256 according to the
3257 following rules.
3258 - s_waitcnt vmcnt(0)
3259 must happen after
3260 any preceding
3261 global/generic
3262 load/store/load
3263 atomic/store
3264 atomic/atomicrmw.
3265 - s_waitcnt lgkmcnt(0)
3266 must happen after
3267 any preceding
3268 local/generic
3269 load/store/load
3270 atomic/store
3271 atomic/atomicrmw.
3272 - Must happen before
3273 the following
3274 atomicrmw.
3275 - Ensures that all
3276 memory operations
3277 to global have
3278 completed before
3279 performing the
3280 atomicrmw that is
3281 being released.
3282
3283 2. flat_atomic
3284 3. s_waitcnt vmcnt(0) &
3285 lgkmcnt(0)
3286
3287 - If OpenCL, omit
3288 lgkmcnt(0).
3289 - Must happen before
3290 following
3291 buffer_wbinvl1_vol.
3292 - Ensures the
3293 atomicrmw has
3294 completed before
3295 invalidating the
3296 cache.
3297
3298 4. buffer_wbinvl1_vol
3299
3300 - Must happen before
3301 any following
3302 global/generic
3303 load/load
3304 atomic/atomicrmw.
3305 - Ensures that
3306 following loads
3307 will not see stale
3308 global data.
3309
3310 fence acq_rel - singlethread *none* *none*
3311 - wavefront
3312 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3313
3314 - If OpenCL and
3315 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003316 not generic, omit.
3317 - However,
Tony Tyef16a45e2017-06-06 20:31:59 +00003318 since LLVM
3319 currently has no
3320 address space on
3321 the fence need to
3322 conservatively
3323 always generate
3324 (see comment for
3325 previous fence).
3326 - Must happen after
3327 any preceding
3328 local/generic
3329 load/load
3330 atomic/store/store
3331 atomic/atomicrmw.
3332 - Must happen before
3333 any following
3334 global/generic
3335 load/load
3336 atomic/store/store
3337 atomic/atomicrmw.
3338 - Ensures that all
3339 memory operations
3340 to local have
3341 completed before
3342 performing any
3343 following global
3344 memory operations.
3345 - Ensures that the
3346 preceding
3347 local/generic load
3348 atomic/atomicrmw
3349 with an equal or
3350 wider sync scope
3351 and memory ordering
3352 stronger than
3353 unordered (this is
3354 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003355 acquire-fence-paired-atomic
3356 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003357 before following
3358 global memory
3359 operations. This
3360 satisfies the
3361 requirements of
3362 acquire.
3363 - Ensures that all
3364 previous memory
3365 operations have
3366 completed before a
3367 following
3368 local/generic store
3369 atomic/atomicrmw
3370 with an equal or
3371 wider sync scope
3372 and memory ordering
3373 stronger than
3374 unordered (this is
3375 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003376 release-fence-paired-atomic
3377 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003378 requirements of
3379 release.
3380
Tony Tye6baa6d22017-10-18 22:16:55 +00003381 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3382 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003383
3384 - If OpenCL and
3385 address space is
3386 not generic, omit
3387 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003388 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003389 currently has no
3390 address space on
3391 the fence need to
3392 conservatively
3393 always generate
3394 (see comment for
3395 previous fence).
3396 - Could be split into
3397 separate s_waitcnt
3398 vmcnt(0) and
3399 s_waitcnt
3400 lgkmcnt(0) to allow
3401 them to be
3402 independently moved
3403 according to the
3404 following rules.
3405 - s_waitcnt vmcnt(0)
3406 must happen after
3407 any preceding
3408 global/generic
3409 load/store/load
3410 atomic/store
3411 atomic/atomicrmw.
3412 - s_waitcnt lgkmcnt(0)
3413 must happen after
3414 any preceding
3415 local/generic
3416 load/store/load
3417 atomic/store
3418 atomic/atomicrmw.
3419 - Must happen before
3420 the following
3421 buffer_wbinvl1_vol.
3422 - Ensures that the
3423 preceding
3424 global/local/generic
3425 load
3426 atomic/atomicrmw
3427 with an equal or
3428 wider sync scope
3429 and memory ordering
3430 stronger than
3431 unordered (this is
3432 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003433 acquire-fence-paired-atomic
3434 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003435 before invalidating
3436 the cache. This
3437 satisfies the
3438 requirements of
3439 acquire.
3440 - Ensures that all
3441 previous memory
3442 operations have
3443 completed before a
3444 following
3445 global/local/generic
3446 store
3447 atomic/atomicrmw
3448 with an equal or
3449 wider sync scope
3450 and memory ordering
3451 stronger than
3452 unordered (this is
3453 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003454 release-fence-paired-atomic
3455 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003456 requirements of
3457 release.
3458
3459 2. buffer_wbinvl1_vol
3460
3461 - Must happen before
3462 any following
3463 global/generic
3464 load/load
3465 atomic/store/store
3466 atomic/atomicrmw.
3467 - Ensures that
3468 following loads
3469 will not see stale
3470 global data. This
3471 satisfies the
3472 requirements of
3473 acquire.
3474
3475 **Sequential Consistent Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003476 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003477 load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003478 - wavefront - local load atomic acquire,
3479 - generic except must generated
3480 all instructions even
3481 for OpenCL.*
3482 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3483 - generic
3484 - Must
3485 happen after
3486 preceding
3487 global/generic load
3488 atomic/store
3489 atomic/atomicrmw
3490 with memory
3491 ordering of seq_cst
3492 and with equal or
3493 wider sync scope.
3494 (Note that seq_cst
3495 fences have their
3496 own s_waitcnt
3497 lgkmcnt(0) and so do
3498 not need to be
3499 considered.)
3500 - Ensures any
3501 preceding
3502 sequential
3503 consistent local
3504 memory instructions
3505 have completed
3506 before executing
3507 this sequentially
3508 consistent
3509 instruction. This
3510 prevents reordering
3511 a seq_cst store
3512 followed by a
3513 seq_cst load. (Note
3514 that seq_cst is
3515 stronger than
3516 acquire/release as
3517 the reordering of
3518 load acquire
3519 followed by a store
3520 release is
3521 prevented by the
3522 waitcnt of
3523 the release, but
3524 there is nothing
3525 preventing a store
3526 release followed by
3527 load acquire from
3528 competing out of
3529 order.)
3530
3531 2. *Following
3532 instructions same as
3533 corresponding load
3534 atomic acquire,
3535 except must generated
3536 all instructions even
3537 for OpenCL.*
3538 load atomic seq_cst - workgroup - local *Same as corresponding
3539 load atomic acquire,
3540 except must generated
3541 all instructions even
3542 for OpenCL.*
3543 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3544 - system - generic vmcnt(0)
3545
3546 - Could be split into
3547 separate s_waitcnt
3548 vmcnt(0)
3549 and s_waitcnt
3550 lgkmcnt(0) to allow
3551 them to be
3552 independently moved
3553 according to the
3554 following rules.
3555 - waitcnt lgkmcnt(0)
3556 must happen after
3557 preceding
3558 global/generic load
3559 atomic/store
3560 atomic/atomicrmw
3561 with memory
3562 ordering of seq_cst
3563 and with equal or
3564 wider sync scope.
3565 (Note that seq_cst
3566 fences have their
3567 own s_waitcnt
3568 lgkmcnt(0) and so do
3569 not need to be
3570 considered.)
3571 - waitcnt vmcnt(0)
3572 must happen after
Tony Tyef16a45e2017-06-06 20:31:59 +00003573 preceding
3574 global/generic load
3575 atomic/store
3576 atomic/atomicrmw
3577 with memory
3578 ordering of seq_cst
3579 and with equal or
3580 wider sync scope.
3581 (Note that seq_cst
3582 fences have their
3583 own s_waitcnt
3584 vmcnt(0) and so do
3585 not need to be
3586 considered.)
3587 - Ensures any
3588 preceding
3589 sequential
3590 consistent global
3591 memory instructions
3592 have completed
3593 before executing
3594 this sequentially
3595 consistent
3596 instruction. This
3597 prevents reordering
3598 a seq_cst store
3599 followed by a
Tony Tye6baa6d22017-10-18 22:16:55 +00003600 seq_cst load. (Note
Tony Tyef16a45e2017-06-06 20:31:59 +00003601 that seq_cst is
3602 stronger than
3603 acquire/release as
3604 the reordering of
3605 load acquire
3606 followed by a store
3607 release is
3608 prevented by the
Tony Tye6baa6d22017-10-18 22:16:55 +00003609 waitcnt of
Tony Tyef16a45e2017-06-06 20:31:59 +00003610 the release, but
3611 there is nothing
3612 preventing a store
3613 release followed by
3614 load acquire from
3615 competing out of
3616 order.)
3617
3618 2. *Following
3619 instructions same as
3620 corresponding load
Tony Tye6baa6d22017-10-18 22:16:55 +00003621 atomic acquire,
3622 except must generated
3623 all instructions even
3624 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003625 store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003626 - wavefront - local store atomic release,
3627 - workgroup - generic except must generated
3628 all instructions even
3629 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003630 store atomic seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003631 - system - generic store atomic release,
3632 except must generated
3633 all instructions even
3634 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003635 atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003636 - wavefront - local atomicrmw acq_rel,
3637 - workgroup - generic except must generated
3638 all instructions even
3639 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003640 atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003641 - system - generic atomicrmw acq_rel,
3642 except must generated
3643 all instructions even
3644 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003645 fence seq_cst - singlethread *none* *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003646 - wavefront fence acq_rel,
3647 - workgroup except must generated
3648 - agent all instructions even
3649 - system for OpenCL.*
3650 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00003651
3652The memory order also adds the single thread optimization constrains defined in
3653table
3654:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3655
3656 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3657 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3658
3659 ============ ==============================================================
3660 LLVM Memory Optimization Constraints
3661 Ordering
3662 ============ ==============================================================
3663 unordered *none*
3664 monotonic *none*
3665 acquire - If a load atomic/atomicrmw then no following load/load
3666 atomic/store/ store atomic/atomicrmw/fence instruction can
3667 be moved before the acquire.
3668 - If a fence then same as load atomic, plus no preceding
3669 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003670 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003671 atomic/store/ store atomic/atomicrmw/fence instruction can
3672 be moved after the release.
3673 - If a fence then same as store atomic, plus no following
3674 associated fence-paired-atomic can be moved before the
3675 fence.
3676 acq_rel Same constraints as both acquire and release.
3677 seq_cst - If a load atomic then same constraints as acquire, plus no
3678 preceding sequentially consistent load atomic/store
3679 atomic/atomicrmw/fence instruction can be moved after the
3680 seq_cst.
3681 - If a store atomic then the same constraints as release, plus
3682 no following sequentially consistent load atomic/store
3683 atomic/atomicrmw/fence instruction can be moved before the
3684 seq_cst.
3685 - If an atomicrmw/fence then same constraints as acq_rel.
3686 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003687
Wei Ding16289cf2017-02-21 18:48:01 +00003688Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003689~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003690
Tony Tyef16a45e2017-06-06 20:31:59 +00003691For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3692(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3693the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003694
Tony Tyef16a45e2017-06-06 20:31:59 +00003695 .. table:: AMDGPU Trap Handler for AMDHSA OS
3696 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003697
Tony Tyef16a45e2017-06-06 20:31:59 +00003698 =================== =============== =============== =======================
3699 Usage Code Sequence Trap Handler Description
3700 Inputs
3701 =================== =============== =============== =======================
3702 reserved ``s_trap 0x00`` Reserved by hardware.
3703 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3704 ``queue_ptr`` ``debugtrap``
3705 ``VGPR0``: intrinsic (not
3706 ``arg`` implemented).
3707 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3708 ``queue_ptr`` terminated and its
3709 associated queue put
3710 into the error state.
3711 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3712 ``queue_ptr`` installed handled
3713 same as ``llvm.trap``.
3714 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3715 breakpoints.
3716 debugger ``s_trap 0x08`` Reserved for debugger.
3717 debugger ``s_trap 0xfe`` Reserved for debugger.
3718 debugger ``s_trap 0xff`` Reserved for debugger.
3719 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003720
Tony Tye46d35762017-08-15 20:47:41 +00003721Unspecified OS
3722--------------
3723
3724This section provides code conventions used when the target triple OS is
3725empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003726
3727Trap Handler ABI
3728~~~~~~~~~~~~~~~~
3729
3730For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3731not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3732instructions are handled as follows:
3733
3734 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3735 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3736
3737 =============== =============== ===========================================
3738 Usage Code Sequence Description
3739 =============== =============== ===========================================
3740 llvm.trap s_endpgm Causes wavefront to be terminated.
3741 llvm.debugtrap *none* Compiler warning given that there is no
3742 trap handler installed.
3743 =============== =============== ===========================================
3744
3745Source Languages
3746================
3747
3748.. _amdgpu-opencl:
3749
3750OpenCL
3751------
3752
3753When generating code for the OpenCL language the target triple environment
3754should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3755
3756When the language is OpenCL the following differences occur:
3757
37581. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
37592. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye46d35762017-08-15 20:47:41 +000037603. Additional metadata is generated
3761 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003762
3763.. TODO
3764 Specify what affect this has. Hidden arguments added. Additional metadata
3765 generated.
3766
3767.. _amdgpu-hcc:
3768
3769HCC
3770---
3771
3772When generating code for the OpenCL language the target triple environment
3773should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3774
3775When the language is OpenCL the following differences occur:
3776
37771. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3778
3779.. TODO
3780 Specify what affect this has.
Tom Stellard3ec09e62016-04-06 01:29:19 +00003781
Tom Stellard45bb48e2015-06-13 03:28:10 +00003782Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003783---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003784
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003785AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef59d0712017-11-10 20:51:43 +00003786It supports AMDGCN GFX6-GFX9.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003787
Tony Tyef16a45e2017-06-06 20:31:59 +00003788This section describes general syntax for instructions and operands. For more
3789information about instructions, their semantics and supported combinations of
3790operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00003791[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003792
Tony Tyef16a45e2017-06-06 20:31:59 +00003793An instruction has the following syntax (register operands are normally
3794comma-separated while extra operands are space-separated):
Tom Stellard45bb48e2015-06-13 03:28:10 +00003795
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003796*<opcode> <register_operand0>, ... <extra_operand0> ...*
Tom Stellard45bb48e2015-06-13 03:28:10 +00003797
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003798Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00003799~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00003800
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003801The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003802
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003803* SGPR registers: s0, ... or s[0], ...
3804* VGPR registers: v0, ... or v[0], ...
3805* TTMP registers: ttmp0, ... or ttmp[0], ...
3806* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3807* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3808* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3809* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3810* Register index expressions: v[2*2], s[1-1:2-1]
3811* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00003812
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003813The following extra operands are supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003814
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003815* offset, offset0, offset1
3816* idxen, offen bits
3817* glc, slc, tfe bits
3818* waitcnt: integer or combination of counter values
3819* VOP3 modifiers:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003820
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003821 - abs (\| \|), neg (\-)
Tom Stellard45bb48e2015-06-13 03:28:10 +00003822
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003823* DPP modifiers:
3824
3825 - row_shl, row_shr, row_ror, row_rol
3826 - row_mirror, row_half_mirror, row_bcast
3827 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3828 - row_mask, bank_mask, bound_ctrl
3829
3830* SDWA modifiers:
3831
3832 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3833 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3834 - abs, neg, sext
3835
Tony Tyef16a45e2017-06-06 20:31:59 +00003836Instruction Examples
3837~~~~~~~~~~~~~~~~~~~~
3838
3839DS
3840~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003841
3842.. code-block:: nasm
3843
3844 ds_add_u32 v2, v4 offset:16
3845 ds_write_src2_b64 v2 offset0:4 offset1:8
3846 ds_cmpst_f32 v2, v4, v6
3847 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3848
3849
3850For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3851
Tony Tyef16a45e2017-06-06 20:31:59 +00003852FLAT
3853++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003854
3855.. code-block:: nasm
3856
3857 flat_load_dword v1, v[3:4]
3858 flat_store_dwordx3 v[3:4], v[5:7]
3859 flat_atomic_swap v1, v[3:4], v5 glc
3860 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3861 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3862
3863For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3864
Tony Tyef16a45e2017-06-06 20:31:59 +00003865MUBUF
3866+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003867
3868.. code-block:: nasm
3869
3870 buffer_load_dword v1, off, s[4:7], s1
3871 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3872 buffer_store_format_xy v[1:2], off, s[4:7], s1
3873 buffer_wbinvl1
3874 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3875
3876For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3877
Tony Tyef16a45e2017-06-06 20:31:59 +00003878SMRD/SMEM
3879+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003880
3881.. code-block:: nasm
3882
3883 s_load_dword s1, s[2:3], 0xfc
3884 s_load_dwordx8 s[8:15], s[2:3], s4
3885 s_load_dwordx16 s[88:103], s[2:3], s4
3886 s_dcache_inv_vol
3887 s_memtime s[4:5]
3888
3889For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3890
Tony Tyef16a45e2017-06-06 20:31:59 +00003891SOP1
3892++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003893
3894.. code-block:: nasm
3895
3896 s_mov_b32 s1, s2
3897 s_mov_b64 s[0:1], 0x80000000
3898 s_cmov_b32 s1, 200
3899 s_wqm_b64 s[2:3], s[4:5]
3900 s_bcnt0_i32_b64 s1, s[2:3]
3901 s_swappc_b64 s[2:3], s[4:5]
3902 s_cbranch_join s[4:5]
3903
3904For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3905
Tony Tyef16a45e2017-06-06 20:31:59 +00003906SOP2
3907++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003908
3909.. code-block:: nasm
3910
3911 s_add_u32 s1, s2, s3
3912 s_and_b64 s[2:3], s[4:5], s[6:7]
3913 s_cselect_b32 s1, s2, s3
3914 s_andn2_b32 s2, s4, s6
3915 s_lshr_b64 s[2:3], s[4:5], s6
3916 s_ashr_i32 s2, s4, s6
3917 s_bfm_b64 s[2:3], s4, s6
3918 s_bfe_i64 s[2:3], s[4:5], s6
3919 s_cbranch_g_fork s[4:5], s[6:7]
3920
3921For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3922
Tony Tyef16a45e2017-06-06 20:31:59 +00003923SOPC
3924++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003925
3926.. code-block:: nasm
3927
3928 s_cmp_eq_i32 s1, s2
3929 s_bitcmp1_b32 s1, s2
3930 s_bitcmp0_b64 s[2:3], s4
3931 s_setvskip s3, s5
3932
3933For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3934
Tony Tyef16a45e2017-06-06 20:31:59 +00003935SOPP
3936++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003937
3938.. code-block:: nasm
3939
3940 s_barrier
3941 s_nop 2
3942 s_endpgm
3943 s_waitcnt 0 ; Wait for all counters to be 0
3944 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3945 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3946 s_sethalt 9
3947 s_sleep 10
3948 s_sendmsg 0x1
3949 s_sendmsg sendmsg(MSG_INTERRUPT)
3950 s_trap 1
3951
3952For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3953
3954Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00003955of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00003956range or acceptable values.
3957
Tony Tyef16a45e2017-06-06 20:31:59 +00003958VALU
3959++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00003960
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003961For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3962the assembler will automatically use optimal encoding based on its operands.
3963To force specific encoding, one can add a suffix to the opcode of the instruction:
3964
3965* _e32 for 32-bit VOP1/VOP2/VOPC
3966* _e64 for 64-bit VOP3
3967* _dpp for VOP_DPP
3968* _sdwa for VOP_SDWA
3969
3970VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003971
3972.. code-block:: nasm
3973
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003974 v_mov_b32 v1, v2
3975 v_mov_b32_e32 v1, v2
3976 v_nop
3977 v_cvt_f64_i32_e32 v[1:2], v2
3978 v_floor_f32_e32 v1, v2
3979 v_bfrev_b32_e32 v1, v2
3980 v_add_f32_e32 v1, v2, v3
3981 v_mul_i32_i24_e64 v1, v2, 3
3982 v_mul_i32_i24_e32 v1, -3, v3
3983 v_mul_i32_i24_e32 v1, -100, v3
3984 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
3985 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00003986
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003987VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003988
3989.. code-block:: nasm
3990
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003991 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
3992 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3993 v_mov_b32 v0, v0 wave_shl:1
3994 v_mov_b32 v0, v0 row_mirror
3995 v_mov_b32 v0, v0 row_bcast:31
3996 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
3997 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
3998 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00003999
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004000VOP_SDWA examples:
4001
4002.. code-block:: nasm
4003
4004 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4005 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4006 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4007 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4008 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4009
4010For full list of supported instructions, refer to "Vector ALU instructions".
4011
4012HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00004013~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004014
4015AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4016one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00004017
4018.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00004019+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00004020
4021*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004022object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004023
4024.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00004025+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4026
Tom Stellard347ac792015-06-26 21:15:07 +00004027
4028*major*, *minor*, and *stepping* are all integers that describe the instruction
4029set architecture (ISA) version of the assembly program.
4030
4031*vendor* and *arch* are quoted strings. *vendor* should always be equal to
4032"AMD" and *arch* should always be equal to "AMDGPU".
4033
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004034By default, the assembler will derive the ISA version, *vendor*, and *arch*
4035from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004036
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004037.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00004038+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004039
4040This directives specifies that the symbol with given name is a kernel entry point
4041(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00004042
4043.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00004044++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00004045
4046This directive marks the beginning of a list of key / value pairs that are used
4047to specify the amd_kernel_code_t object that will be emitted by the assembler.
4048The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4049any amd_kernel_code_t values that are unspecified a default value will be
4050used. The default value for all keys is 0, with the following exceptions:
4051
4052- *kernel_code_version_major* defaults to 1.
4053- *machine_kind* defaults to 1.
4054- *machine_version_major*, *machine_version_minor*, and
4055 *machine_version_stepping* are derived from the value of the -mcpu option
4056 that is passed to the assembler.
4057- *kernel_code_entry_byte_offset* defaults to 256.
4058- *wavefront_size* defaults to 6.
4059- *kernarg_segment_alignment*, *group_segment_alignment*, and
Tony Tye6baa6d22017-10-18 22:16:55 +00004060 *private_segment_alignment* default to 4. Note that alignments are specified
Tom Stellardff7416b2015-06-26 21:58:31 +00004061 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4062
4063The *.amd_kernel_code_t* directive must be placed immediately after the
4064function label and before any instructions.
4065
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004066For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4067comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00004068
4069Here is an example of a minimal amd_kernel_code_t specification:
4070
Aaron Ballman887ad0e2016-07-19 17:46:55 +00004071.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00004072
4073 .hsa_code_object_version 1,0
4074 .hsa_code_object_isa
4075
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004076 .hsatext
4077 .globl hello_world
4078 .p2align 8
4079 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00004080
4081 hello_world:
4082
4083 .amd_kernel_code_t
4084 enable_sgpr_kernarg_segment_ptr = 1
4085 is_ptr64 = 1
4086 compute_pgm_rsrc1_vgprs = 0
4087 compute_pgm_rsrc1_sgprs = 0
4088 compute_pgm_rsrc2_user_sgpr = 2
4089 kernarg_segment_byte_size = 8
4090 wavefront_sgpr_count = 2
4091 workitem_vgpr_count = 3
4092 .end_amd_kernel_code_t
4093
4094 s_load_dwordx2 s[0:1], s[0:1] 0x0
4095 v_mov_b32 v0, 3.14159
4096 s_waitcnt lgkmcnt(0)
4097 v_mov_b32 v1, s0
4098 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004099 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00004100 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00004101 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004102 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00004103
4104Additional Documentation
4105========================
4106
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004107.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4108.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4109.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4110.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4111.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4112.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4113.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4114.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004115.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4116.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4117.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4118.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4119.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00004120.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004121.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4122.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__