blob: 439089348fffb8add0314fe259e5608250a49415 [file] [log] [blame]
Tony Tyef16a45e2017-06-06 20:31:59 +00001=============================
2User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
Tony Tye07d9f102017-11-10 01:00:54 +000026 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
Tony Tyef16a45e2017-06-06 20:31:59 +000028
Tony Tye07d9f102017-11-10 01:00:54 +000029 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000035
Tony Tye07d9f102017-11-10 01:00:54 +000036 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000038
Tony Tye07d9f102017-11-10 01:00:54 +000039 ============ ==============================================================
40 Vendor Description
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000045
Tony Tye07d9f102017-11-10 01:00:54 +000046 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
Tony Tyef16a45e2017-06-06 20:31:59 +000048
Tony Tye07d9f102017-11-10 01:00:54 +000049 ============== ============================================================
50 OS Description
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
56 runtime.
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
58 runtime.
59 ============== ============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000060
Tony Tye07d9f102017-11-10 01:00:54 +000061 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
Tony Tyef16a45e2017-06-06 20:31:59 +000063
Tony Tye07d9f102017-11-10 01:00:54 +000064 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
67 *<empty>* Defaults to ``opencl``.
68 ``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
69 ``amdgizcl`` Same as ``opencl`` except a different address space mapping is
70 used (see :ref:`amdgpu-address-spaces`).
71 ``amdgiz`` Same as ``opencl`` except a different address space mapping is
72 used (see :ref:`amdgpu-address-spaces`).
73 ``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
74 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000075
76.. _amdgpu-processors:
77
78Processors
79----------
80
81Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
82names from both the *Processor* and *Alternative Processor* can be used.
83
84 .. table:: AMDGPU Processors
Tony Tye07d9f102017-11-10 01:00:54 +000085 :name: amdgpu-processor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000086
Tony Tye31105cc2017-12-11 15:35:27 +000087 =========== =============== ============ ===== ========= ======= ==================
88 Processor Alternative Target dGPU/ Target ROCm Example
89 Processor Triple APU Features Support Products
90 Architecture Supported
91 [Default]
92 =========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000093 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Tony Tye31105cc2017-12-11 15:35:27 +000094 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +000095 ``r600`` ``r600`` dGPU
96 ``r630`` ``r600`` dGPU
97 ``rs880`` ``r600`` dGPU
98 ``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000099 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000100 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000101 ``rv710`` ``r600`` dGPU
102 ``rv730`` ``r600`` dGPU
103 ``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000104 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000105 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000106 ``cedar`` ``r600`` dGPU
107 ``redwood`` ``r600`` dGPU
108 ``sumo`` ``r600`` dGPU
109 ``juniper`` ``r600`` dGPU
110 ``cypress`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000111 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000112 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000113 ``barts`` ``r600`` dGPU
114 ``turks`` ``r600`` dGPU
115 ``caicos`` ``r600`` dGPU
116 ``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000117 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Tony Tye31105cc2017-12-11 15:35:27 +0000118 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000119 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
120 ``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU
121 - ``verde``
122 - ``oland``
123 - ``hainan``
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000124 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Tony Tye31105cc2017-12-11 15:35:27 +0000125 -----------------------------------------------------------------------------------
126 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
127 - A6 Pro-7050B
128 - A8-7100
129 - A8 Pro-7150B
130 - A10-7300
131 - A10 Pro-7350B
132 - FX-7500
133 - A8-7200P
134 - A10-7400P
135 - FX-7600P
136 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
137 - FirePro W9100
138 - FirePro S9150
139 - FirePro S9170
140 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
141 - Radeon R9 290x
142 - Radeon R390
143 - Radeon R390x
144 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
145 - ``mullins`` - E1-2200
146 - E1-2500
147 - E2-3000
148 - E2-3800
149 - A4-5000
150 - A4-5100
151 - A6-5200
152 - A4 Pro-3340B
153 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
154 - Radeon HD 8770
155 - R7 260
156 - R7 260X
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000157 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Tony Tye31105cc2017-12-11 15:35:27 +0000158 -----------------------------------------------------------------------------------
Tony Tye31105cc2017-12-11 15:35:27 +0000159 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
160 [on] - Pro A6-8500B
161 - A8-8600P
162 - Pro A8-8600B
163 - FX-8800P
164 - Pro A12-8800B
165 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
166 [on] - Pro A10-8700B
167 - A10-8780P
168 \ ``amdgcn`` APU - xnack - A10-9600P
169 [on] - A10-9630P
170 - A12-9700P
171 - A12-9730P
172 - FX-9800P
173 - FX-9830P
174 \ ``amdgcn`` APU - xnack - E2-9010
175 [on] - A6-9210
176 - A9-9410
177 ``gfx802`` - ``tonga`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
Tony Tyea6978802017-12-12 05:47:00 +0000178 - ``iceland`` [off] - FirePro S7100
Tony Tye31105cc2017-12-11 15:35:27 +0000179 - FirePro W7100
180 - Radeon R285
181 - Radeon R9 380
182 - Radeon R9 385
183 - Mobile FirePro
184 M7170
185 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
186 [off] - Radeon R9 Fury
187 - Radeon R9 FuryX
188 - Radeon Pro Duo
189 - FirePro S9300x2
190 - Radeon Instinct MI8
191 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
192 [off] - Radeon RX 480
193 - Radeon Instinct MI6
194 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
195 [off]
196 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
197 [on]
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000198 **GCN GFX9** [AMD-GCN-GFX9]_
Tony Tye31105cc2017-12-11 15:35:27 +0000199 -----------------------------------------------------------------------------------
200 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
201 [off] Frontier Edition
202 - Radeon RX Vega 56
203 - Radeon RX Vega 64
204 - Radeon RX Vega 64
205 Liquid
206 - Radeon Instinct MI25
207 ``gfx902`` ``amdgcn`` APU - xnack *TBA*
208 [on]
209 .. TODO
210 Add product
211 names.
212 =========== =============== ============ ===== ========= ======= ==================
Tony Tye07d9f102017-11-10 01:00:54 +0000213
214.. _amdgpu-target-features:
215
216Target Features
217---------------
218
219Target features control how code is generated to support certain
Tony Tye31105cc2017-12-11 15:35:27 +0000220processor specific features. Not all target features are supported by
221all processors. The runtime must ensure that the features supported by
222the device used to execute the code match the features enabled when
223generating the code. A mismatch of features may result in incorrect
224execution, or a reduction in performance.
225
226The target features supported by each processor, and the default value
227used if not specified explicitly, is listed in
228:ref:`amdgpu-processor-table`.
Tony Tye07d9f102017-11-10 01:00:54 +0000229
230Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
231target features.
232
233For example:
234
235``-mxnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000236 Enable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000237``-mno-xnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000238 Disable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000239
240 .. table:: AMDGPU Target Features
241 :name: amdgpu-target-feature-table
242
Tony Tye31105cc2017-12-11 15:35:27 +0000243 ============== ==================================================
244 Target Feature Description
245 ============== ==================================================
246 -m[no-]xnack Enable/disable generating code that has
247 memory clauses that are compatible with
248 having XNACK replay enabled.
Tony Tye07d9f102017-11-10 01:00:54 +0000249
Tony Tye31105cc2017-12-11 15:35:27 +0000250 This is used for demand paging and page
251 migration. If XNACK replay is enabled in
252 the device, then if a page fault occurs
253 the code may execute incorrectly if the
254 ``xnack`` feature is not enabled. Executing
255 code that has the feature enabled on a
256 device that does not have XNACK replay
257 enabled will execute correctly, but may
258 be less performant than code with the
259 feature disabled.
260 ============== ==================================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000261
262.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000263
264Address Spaces
265--------------
266
Tony Tyef16a45e2017-06-06 20:31:59 +0000267The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000268
Tony Tyef16a45e2017-06-06 20:31:59 +0000269The memory space names used in the table, aside from the region memory space, is
270from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000271
Tony Tyef16a45e2017-06-06 20:31:59 +0000272LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000273
Tony Tyef16a45e2017-06-06 20:31:59 +0000274 .. table:: Address Space Mapping
275 :name: amdgpu-address-space-mapping-table
276
277 ================== ================= ================= ================= =================
278 LLVM Address Space Memory Space
279 ------------------ -----------------------------------------------------------------------
280 \ Current Default amdgiz/amdgizcl hcc Future Default
281 ================== ================= ================= ================= =================
282 0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
283 1 Global Global Global Global
284 2 Constant Constant Constant Region (GDS)
285 3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
286 4 Generic (Flat) Region (GDS) Region (GDS) Constant
287 5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
288 ================== ================= ================= ================= =================
289
290Current Default
291 This is the current default address space mapping used for all languages
292 except hcc. This will shortly be deprecated.
293
294amdgiz/amdgizcl
295 This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
296 is specified as the target triple environment value.
297
298hcc
299 This is the current address space mapping used when ``hcc`` is specified as
300 the target triple environment value.This will shortly be deprecated.
301
302Future Default
303 This will shortly be the only address space mapping for all languages using
304 AMDGPU backend.
305
306.. _amdgpu-memory-scopes:
307
308Memory Scopes
309-------------
310
311This section provides LLVM memory synchronization scopes supported by the AMDGPU
312backend memory model when the target triple OS is ``amdhsa`` (see
313:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
314
315The memory model supported is based on the HSA memory model [HSA]_ which is
316based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
317relation is transitive over the synchonizes-with relation independent of scope,
318and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye07d9f102017-11-10 01:00:54 +0000319table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000320
321This is different to the OpenCL [OpenCL]_ memory model which does not have scope
322inclusion and requires the memory scopes to exactly match. However, this
323is conservatively correct for OpenCL.
324
Tony Tye07d9f102017-11-10 01:00:54 +0000325 .. table:: AMDHSA LLVM Sync Scopes
326 :name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tyef16a45e2017-06-06 20:31:59 +0000327
328 ================ ==========================================================
329 LLVM Sync Scope Description
330 ================ ==========================================================
331 *none* The default: ``system``.
332
333 Synchronizes with, and participates in modification and
334 seq_cst total orderings with, other operations (except
335 image operations) for all address spaces (except private,
336 or generic that accesses private) provided the other
337 operation's sync scope is:
338
339 - ``system``.
340 - ``agent`` and executed by a thread on the same agent.
341 - ``workgroup`` and executed by a thread in the same
342 workgroup.
343 - ``wavefront`` and executed by a thread in the same
344 wavefront.
345
346 ``agent`` Synchronizes with, and participates in modification and
347 seq_cst total orderings with, other operations (except
348 image operations) for all address spaces (except private,
349 or generic that accesses private) provided the other
350 operation's sync scope is:
351
352 - ``system`` or ``agent`` and executed by a thread on the
353 same agent.
354 - ``workgroup`` and executed by a thread in the same
355 workgroup.
356 - ``wavefront`` and executed by a thread in the same
357 wavefront.
358
359 ``workgroup`` Synchronizes with, and participates in modification and
360 seq_cst total orderings with, other operations (except
361 image operations) for all address spaces (except private,
362 or generic that accesses private) provided the other
363 operation's sync scope is:
364
365 - ``system``, ``agent`` or ``workgroup`` and executed by a
366 thread in the same workgroup.
367 - ``wavefront`` and executed by a thread in the same
368 wavefront.
369
370 ``wavefront`` Synchronizes with, and participates in modification and
371 seq_cst total orderings with, other operations (except
372 image operations) for all address spaces (except private,
373 or generic that accesses private) provided the other
374 operation's sync scope is:
375
376 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
377 and executed by a thread in the same wavefront.
378
379 ``singlethread`` Only synchronizes with, and participates in modification
380 and seq_cst total orderings with, other operations (except
381 image operations) running in the same thread for all
382 address spaces (for example, in signal handlers).
383 ================ ==========================================================
384
385AMDGPU Intrinsics
386-----------------
387
388The AMDGPU backend implements the following intrinsics.
389
390*This section is WIP.*
391
392.. TODO
393 List AMDGPU intrinsics
394
395Code Object
396===========
397
398The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
399can be linked by ``lld`` to produce a standard ELF shared code object which can
400be loaded and executed on an AMDGPU target.
401
402Header
403------
404
405The AMDGPU backend uses the following ELF header:
406
407 .. table:: AMDGPU ELF Header
408 :name: amdgpu-elf-header-table
409
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000410 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000411 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000412 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000413 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
414 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye07d9f102017-11-10 01:00:54 +0000415 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
416 - ``ELFOSABI_AMDGPU_HSA``
417 - ``ELFOSABI_AMDGPU_PAL``
418 - ``ELFOSABI_AMDGPU_MESA3D``
419 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
420 - ``ELFABIVERSION_AMDGPU_PAL``
421 - ``ELFABIVERSION_AMDGPU_MESA3D``
422 ``e_type`` - ``ET_REL``
423 - ``ET_DYN``
Tony Tyef16a45e2017-06-06 20:31:59 +0000424 ``e_machine`` ``EM_AMDGPU``
425 ``e_entry`` 0
Tony Tye07d9f102017-11-10 01:00:54 +0000426 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000427 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000428
429..
430
431 .. table:: AMDGPU ELF Header Enumeration Values
432 :name: amdgpu-elf-header-enumeration-values-table
433
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000434 =============================== =====
435 Name Value
436 =============================== =====
437 ``EM_AMDGPU`` 224
Tony Tye07d9f102017-11-10 01:00:54 +0000438 ``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000439 ``ELFOSABI_AMDGPU_HSA`` 64
440 ``ELFOSABI_AMDGPU_PAL`` 65
441 ``ELFOSABI_AMDGPU_MESA3D`` 66
442 ``ELFABIVERSION_AMDGPU_HSA`` 1
443 ``ELFABIVERSION_AMDGPU_PAL`` 0
444 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
445 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000446
447``e_ident[EI_CLASS]``
Tony Tye07d9f102017-11-10 01:00:54 +0000448 The ELF class is:
449
450 * ``ELFCLASS32`` for ``r600`` architecture.
451
452 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
453 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000454
455``e_ident[EI_DATA]``
Tony Tye07d9f102017-11-10 01:00:54 +0000456 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tyef16a45e2017-06-06 20:31:59 +0000457
458``e_ident[EI_OSABI]``
Tony Tye07d9f102017-11-10 01:00:54 +0000459 One of the following AMD GPU architecture specific OS ABIs
460 (see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000461
Tony Tye07d9f102017-11-10 01:00:54 +0000462 * ``ELFOSABI_NONE`` for *unknown* OS.
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000463
Tony Tye07d9f102017-11-10 01:00:54 +0000464 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tyef16a45e2017-06-06 20:31:59 +0000465
Tony Tye07d9f102017-11-10 01:00:54 +0000466 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
467
468 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000469
Tony Tyef16a45e2017-06-06 20:31:59 +0000470``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000471 The ABI version of the AMD GPU architecture specific OS ABI to which the code
472 object conforms:
473
474 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
475 runtime ABI.
476
477 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
478 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000479
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000480 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye07d9f102017-11-10 01:00:54 +0000481 3D runtime ABI.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000482
Tony Tyef16a45e2017-06-06 20:31:59 +0000483``e_type``
484 Can be one of the following values:
485
486
487 ``ET_REL``
488 The type produced by the AMD GPU backend compiler as it is relocatable code
489 object.
490
491 ``ET_DYN``
492 The type produced by the linker as it is a shared code object.
493
494 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
495
496``e_machine``
Tony Tye07d9f102017-11-10 01:00:54 +0000497 The value ``EM_AMDGPU`` is used for the machine for all processors supported
498 by the ``r600`` and ``amdgcn`` architectures (see
499 :ref:`amdgpu-processor-table`). The specific processor is specified in the
500 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
501 :ref:`amdgpu-elf-header-e_flags-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000502
503``e_entry``
504 The entry point is 0 as the entry points for individual kernels must be
505 selected in order to invoke them through AQL packets.
506
507``e_flags``
Tony Tye07d9f102017-11-10 01:00:54 +0000508 The AMDGPU backend uses the following ELF header flags:
509
510 .. table:: AMDGPU ELF Header ``e_flags``
511 :name: amdgpu-elf-header-e_flags-table
512
513 ================================= ========== =============================
514 Name Value Description
515 ================================= ========== =============================
516 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
517 -------------------------------------------- -----------------------------
518 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
519 mask for
520 ``EF_AMDGPU_MACH_xxx`` values
521 defined in
522 :ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye31105cc2017-12-11 15:35:27 +0000523 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
524 target feature is
525 enabled for all code
526 contained in the code object.
527 See
528 :ref:`amdgpu-target-features`.
Tony Tye07d9f102017-11-10 01:00:54 +0000529 ================================= ========== =============================
530
531 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
532 :name: amdgpu-ef-amdgpu-mach-table
533
534 ================================= ========== =============================
535 Name Value Description (see
536 :ref:`amdgpu-processor-table`)
537 ================================= ========== =============================
538 ``EF_AMDGPU_MACH_NONE`` 0 *not specified*
539 ``EF_AMDGPU_MACH_R600_R600`` 1 ``r600``
540 ``EF_AMDGPU_MACH_R600_R630`` 2 ``r630``
541 ``EF_AMDGPU_MACH_R600_RS880`` 3 ``rs880``
542 ``EF_AMDGPU_MACH_R600_RV670`` 4 ``rv670``
543 ``EF_AMDGPU_MACH_R600_RV710`` 5 ``rv710``
544 ``EF_AMDGPU_MACH_R600_RV730`` 6 ``rv730``
545 ``EF_AMDGPU_MACH_R600_RV770`` 7 ``rv770``
546 ``EF_AMDGPU_MACH_R600_CEDAR`` 8 ``cedar``
547 ``EF_AMDGPU_MACH_R600_REDWOOD`` 9 ``redwood``
548 ``EF_AMDGPU_MACH_R600_SUMO`` 10 ``sumo``
549 ``EF_AMDGPU_MACH_R600_JUNIPER`` 11 ``juniper``
550 ``EF_AMDGPU_MACH_R600_CYPRESS`` 12 ``cypress``
551 ``EF_AMDGPU_MACH_R600_BARTS`` 13 ``barts``
552 ``EF_AMDGPU_MACH_R600_TURKS`` 14 ``turks``
553 ``EF_AMDGPU_MACH_R600_CAICOS`` 15 ``caicos``
554 ``EF_AMDGPU_MACH_R600_CAYMAN`` 16 ``cayman``
555 *reserved* 17-31 Reserved for ``r600``
556 architecture processors.
557 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 32 ``gfx600``
558 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 33 ``gfx601``
559 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 34 ``gfx700``
560 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 35 ``gfx701``
561 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 36 ``gfx702``
562 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 37 ``gfx703``
Tony Tye31105cc2017-12-11 15:35:27 +0000563 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 38 ``gfx704``
Tony Tyea6978802017-12-12 05:47:00 +0000564 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 39 ``gfx801``
565 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 40 ``gfx802``
566 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 41 ``gfx803``
567 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 42 ``gfx810``
568 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 43 ``gfx900``
569 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 44 ``gfx902``
Tony Tye07d9f102017-11-10 01:00:54 +0000570 ================================= ========== =============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000571
572Sections
573--------
574
575An AMDGPU target ELF code object has the standard ELF sections which include:
576
577 .. table:: AMDGPU ELF Sections
578 :name: amdgpu-elf-sections-table
579
580 ================== ================ =================================
581 Name Type Attributes
582 ================== ================ =================================
583 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
584 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
585 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
586 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
587 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
588 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
589 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
590 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
591 ``.note`` ``SHT_NOTE`` *none*
592 ``.rela``\ *name* ``SHT_RELA`` *none*
593 ``.rela.dyn`` ``SHT_RELA`` *none*
594 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
595 ``.shstrtab`` ``SHT_STRTAB`` *none*
596 ``.strtab`` ``SHT_STRTAB`` *none*
597 ``.symtab`` ``SHT_SYMTAB`` *none*
598 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
599 ================== ================ =================================
600
601These sections have their standard meanings (see [ELF]_) and are only generated
602if needed.
603
604``.debug``\ *\**
605 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
606 DWARF produced by the AMDGPU backend.
607
Tony Tye46d35762017-08-15 20:47:41 +0000608``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000609 The standard sections used by a dynamic loader.
610
611``.note``
612 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
613 backend.
614
615``.rela``\ *name*, ``.rela.dyn``
616 For relocatable code objects, *name* is the name of the section that the
617 relocation records apply. For example, ``.rela.text`` is the section name for
618 relocation records associated with the ``.text`` section.
619
620 For linked shared code objects, ``.rela.dyn`` contains all the relocation
621 records from each of the relocatable code object's ``.rela``\ *name* sections.
622
623 See :ref:`amdgpu-relocation-records` for the relocation records supported by
624 the AMDGPU backend.
625
626``.text``
627 The executable machine code for the kernels and functions they call. Generated
628 as position independent code. See :ref:`amdgpu-code-conventions` for
629 information on conventions used in the isa generation.
630
631.. _amdgpu-note-records:
632
633Note Records
634------------
635
Tony Tye07d9f102017-11-10 01:00:54 +0000636As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
637be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
638aligned. In addition, minimal zero byte padding must be generated to ensure the
639``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
640``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tyef16a45e2017-06-06 20:31:59 +0000641
642The AMDGPU backend code object uses the following ELF note records in the
643``.note`` section. The *Description* column specifies the layout of the note
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000644record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tyef16a45e2017-06-06 20:31:59 +0000645variable size strings have a corresponding ``*_size`` field that specifies the
646number of bytes, including the terminating null character, in the string. The
647string(s) come immediately after the preceding fields.
648
649Additional note records can be present.
650
651 .. table:: AMDGPU ELF Note Records
652 :name: amdgpu-elf-note-records-table
653
Tony Tye46d35762017-08-15 20:47:41 +0000654 ===== ============================== ======================================
655 Name Type Description
656 ===== ============================== ======================================
657 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye46d35762017-08-15 20:47:41 +0000658 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000659
660..
661
662 .. table:: AMDGPU ELF Note Record Enumeration Values
663 :name: amdgpu-elf-note-record-enumeration-values-table
664
Tony Tye46d35762017-08-15 20:47:41 +0000665 ============================== =====
666 Name Value
667 ============================== =====
668 *reserved* 0-9
669 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye07d9f102017-11-10 01:00:54 +0000670 *reserved* 11
Tony Tye46d35762017-08-15 20:47:41 +0000671 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000672
Tony Tye46d35762017-08-15 20:47:41 +0000673``NT_AMD_AMDGPU_HSA_METADATA``
674 Specifies extensible metadata associated with the code objects executed on HSA
675 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
676 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
677 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
678 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000679
Tony Tye46d35762017-08-15 20:47:41 +0000680.. _amdgpu-symbols:
681
682Symbols
683-------
684
685Symbols include the following:
686
687 .. table:: AMDGPU ELF Symbols
688 :name: amdgpu-elf-symbols-table
689
690 ===================== ============== ============= ==================
691 Name Type Section Description
692 ===================== ============== ============= ==================
693 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
694 - ``.rodata``
695 - ``.bss``
696 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
697 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
698 ===================== ============== ============= ==================
699
700Global variable
701 Global variables both used and defined by the compilation unit.
702
703 If the symbol is defined in the compilation unit then it is allocated in the
704 appropriate section according to if it has initialized data or is readonly.
705
706 If the symbol is external then its section is ``STN_UNDEF`` and the loader
707 will resolve relocations using the definition provided by another code object
708 or explicitly defined by the runtime.
709
710 All global symbols, whether defined in the compilation unit or external, are
711 accessed by the machine code indirectly through a GOT table entry. This
712 allows them to be preemptable. The GOT table is only supported when the target
713 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000714
715 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000716 Add description of linked shared object symbols. Seems undefined symbols
717 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000718
Tony Tye46d35762017-08-15 20:47:41 +0000719Kernel descriptor
720 Every HSA kernel has an associated kernel descriptor. It is the address of the
721 kernel descriptor that is used in the AQL dispatch packet used to invoke the
722 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
723 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
724
725Kernel entry point
726 Every HSA kernel also has a symbol for its machine code entry point.
727
728.. _amdgpu-relocation-records:
729
730Relocation Records
731------------------
732
733AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
734relocatable fields are:
735
736``word32``
737 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
738 alignment. These values use the same byte order as other word values in the
739 AMD GPU architecture.
740
741``word64``
742 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
743 alignment. These values use the same byte order as other word values in the
744 AMD GPU architecture.
745
746Following notations are used for specifying relocation calculations:
747
748**A**
749 Represents the addend used to compute the value of the relocatable field.
750
751**G**
752 Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000753 entry's symbol will reside during execution.
Tony Tye46d35762017-08-15 20:47:41 +0000754
755**GOT**
756 Represents the address of the global offset table.
757
758**P**
759 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
760 of the storage unit being relocated (computed using ``r_offset``).
761
762**S**
763 Represents the value of the symbol whose index resides in the relocation
Tony Tyed2884302017-10-16 20:44:29 +0000764 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
765
766**B**
767 Represents the base address of a loaded executable or shared object which is
768 the difference between the ELF address and the actual load address. Relocations
769 using this are only valid in executable or shared objects.
Tony Tye46d35762017-08-15 20:47:41 +0000770
771The following relocation types are supported:
772
773 .. table:: AMDGPU ELF Relocation Records
774 :name: amdgpu-elf-relocation-records-table
775
776 ========================== ===== ========== ==============================
777 Relocation Type Value Field Calculation
778 ========================== ===== ========== ==============================
779 ``R_AMDGPU_NONE`` 0 *none* *none*
780 ``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
781 ``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
782 ``R_AMDGPU_ABS64`` 3 ``word64`` S + A
783 ``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
784 ``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
785 ``R_AMDGPU_ABS32`` 6 ``word32`` S + A
786 ``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
787 ``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
788 ``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
789 ``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
790 ``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
Tony Tyed2884302017-10-16 20:44:29 +0000791 *reserved* 12
792 ``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
Tony Tye46d35762017-08-15 20:47:41 +0000793 ========================== ===== ========== ==============================
794
795.. _amdgpu-dwarf:
796
797DWARF
798-----
799
800Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
801information that maps the code object executable code and data to the source
802language constructs. It can be used by tools such as debuggers and profilers.
803
804Address Space Mapping
805~~~~~~~~~~~~~~~~~~~~~
806
807The following address space mapping is used:
808
809 .. table:: AMDGPU DWARF Address Space Mapping
810 :name: amdgpu-dwarf-address-space-mapping-table
811
812 =================== =================
813 DWARF Address Space Memory Space
814 =================== =================
815 1 Private (Scratch)
816 2 Local (group/LDS)
817 *omitted* Global
818 *omitted* Constant
819 *omitted* Generic (Flat)
820 *not supported* Region (GDS)
821 =================== =================
822
823See :ref:`amdgpu-address-spaces` for information on the memory space terminology
824used in the table.
825
826An ``address_class`` attribute is generated on pointer type DIEs to specify the
827DWARF address space of the value of the pointer when it is in the *private* or
828*local* address space. Otherwise the attribute is omitted.
829
830An ``XDEREF`` operation is generated in location list expressions for variables
831that are allocated in the *private* and *local* address space. Otherwise no
832``XDREF`` is omitted.
833
834Register Mapping
835~~~~~~~~~~~~~~~~
836
837*This section is WIP.*
838
839.. TODO
840 Define DWARF register enumeration.
841
842 If want to present a wavefront state then should expose vector registers as
843 64 wide (rather than per work-item view that LLVM uses). Either as separate
844 registers, or a 64x4 byte single register. In either case use a new LANE op
845 (akin to XDREF) to select the current lane usage in a location
846 expression. This would also allow scalar register spilling to vector register
847 lanes to be expressed (currently no debug information is being generated for
848 spilling). If choose a wide single register approach then use LANE in
849 conjunction with PIECE operation to select the dword part of the register for
850 the current lane. If the separate register approach then use LANE to select
851 the register.
852
853Source Text
854~~~~~~~~~~~
855
856*This section is WIP.*
857
858.. TODO
859 DWARF extension to include runtime generated source text.
860
861.. _amdgpu-code-conventions:
862
863Code Conventions
864================
865
866This section provides code conventions used for each supported target triple OS
867(see :ref:`amdgpu-target-triples`).
868
869AMDHSA
870------
871
872This section provides code conventions used when the target triple OS is
873``amdhsa`` (see :ref:`amdgpu-target-triples`).
874
875.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000876
877Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000878~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000879
Tony Tye46d35762017-08-15 20:47:41 +0000880The code object metadata specifies extensible metadata associated with the code
881objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
882[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
883(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
884``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
885information necessary to support the ROCM kernel queries. For example, the
886segment sizes needed in a dispatch packet. In addition, a high level language
887runtime may require other information to be included. For example, the AMD
888OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000889
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000890The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000891:doc:`YamlIO`).
892
Tony Tye46d35762017-08-15 20:47:41 +0000893.. TODO
894 Is the string null terminated? It probably should not if YAML allows it to
895 contain null characters, otherwise it should be.
896
Tony Tyef16a45e2017-06-06 20:31:59 +0000897The metadata is represented as a single YAML document comprised of the mapping
898defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
899referenced tables.
900
901For boolean values, the string values of ``false`` and ``true`` are used for
902false and true respectively.
903
904Additional information can be added to the mappings. To avoid conflicts, any
905non-AMD key names should be prefixed by "*vendor-name*.".
906
907 .. table:: AMDHSA Code Object Metadata Mapping
908 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
909
910 ========== ============== ========= =======================================
911 String Key Value Type Required? Description
912 ========== ============== ========= =======================================
913 "Version" sequence of Required - The first integer is the major
914 2 integers version. Currently 1.
915 - The second integer is the minor
916 version. Currently 0.
917 "Printf" sequence of Each string is encoded information
918 strings about a printf function call. The
919 encoded information is organized as
920 fields separated by colon (':'):
921
922 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
923
924 where:
925
926 ``ID``
927 A 32 bit integer as a unique id for
928 each printf function call
929
930 ``N``
931 A 32 bit integer equal to the number
932 of arguments of printf function call
933 minus 1
934
935 ``S[i]`` (where i = 0, 1, ... , N-1)
936 32 bit integers for the size in bytes
937 of the i-th FormatString argument of
938 the printf function call
939
940 FormatString
941 The format string passed to the
942 printf function call.
943 "Kernels" sequence of Required Sequence of the mappings for each
944 mapping kernel in the code object. See
945 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
946 for the definition of the mapping.
947 ========== ============== ========= =======================================
948
949..
950
951 .. table:: AMDHSA Code Object Kernel Metadata Mapping
952 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
953
954 ================= ============== ========= ================================
955 String Key Value Type Required? Description
956 ================= ============== ========= ================================
957 "Name" string Required Source name of the kernel.
958 "SymbolName" string Required Name of the kernel
959 descriptor ELF symbol.
960 "Language" string Source language of the kernel.
961 Values include:
962
963 - "OpenCL C"
964 - "OpenCL C++"
965 - "HCC"
966 - "OpenMP"
967
968 "LanguageVersion" sequence of - The first integer is the major
969 2 integers version.
970 - The second integer is the
971 minor version.
972 "Attrs" mapping Mapping of kernel attributes.
973 See
974 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
975 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +0000976 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +0000977 mapping kernel arguments. See
978 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
979 for the definition of the mapping.
980 "CodeProps" mapping Mapping of properties related to
981 the kernel code. See
982 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
983 for the mapping definition.
Tony Tyef16a45e2017-06-06 20:31:59 +0000984 ================= ============== ========= ================================
985
986..
987
988 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
989 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
990
991 =================== ============== ========= ==============================
992 String Key Value Type Required? Description
993 =================== ============== ========= ==============================
994 "ReqdWorkGroupSize" sequence of The dispatch work-group size
995 3 integers X, Y, Z must correspond to the
996 specified values.
997
998 Corresponds to the OpenCL
999 ``reqd_work_group_size``
1000 attribute.
1001 "WorkGroupSizeHint" sequence of The dispatch work-group size
1002 3 integers X, Y, Z is likely to be the
1003 specified values.
1004
1005 Corresponds to the OpenCL
1006 ``work_group_size_hint``
1007 attribute.
1008 "VecTypeHint" string The name of a scalar or vector
1009 type.
1010
1011 Corresponds to the OpenCL
1012 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +00001013
1014 "RuntimeHandle" string The external symbol name
1015 associated with a kernel.
1016 OpenCL runtime allocates a
1017 global buffer for the symbol
1018 and saves the kernel's address
1019 to it, which is used for
1020 device side enqueueing. Only
1021 available for device side
1022 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +00001023 =================== ============== ========= ==============================
1024
1025..
1026
1027 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1028 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1029
1030 ================= ============== ========= ================================
1031 String Key Value Type Required? Description
1032 ================= ============== ========= ================================
1033 "Name" string Kernel argument name.
1034 "TypeName" string Kernel argument type name.
1035 "Size" integer Required Kernel argument size in bytes.
1036 "Align" integer Required Kernel argument alignment in
1037 bytes. Must be a power of two.
1038 "ValueKind" string Required Kernel argument kind that
1039 specifies how to set up the
1040 corresponding argument.
1041 Values include:
1042
1043 "ByValue"
1044 The argument is copied
1045 directly into the kernarg.
1046
1047 "GlobalBuffer"
1048 A global address space pointer
1049 to the buffer data is passed
1050 in the kernarg.
1051
1052 "DynamicSharedPointer"
1053 A group address space pointer
1054 to dynamically allocated LDS
1055 is passed in the kernarg.
1056
1057 "Sampler"
1058 A global address space
1059 pointer to a S# is passed in
1060 the kernarg.
1061
1062 "Image"
1063 A global address space
1064 pointer to a T# is passed in
1065 the kernarg.
1066
1067 "Pipe"
1068 A global address space pointer
1069 to an OpenCL pipe is passed in
1070 the kernarg.
1071
1072 "Queue"
1073 A global address space pointer
1074 to an OpenCL device enqueue
1075 queue is passed in the
1076 kernarg.
1077
1078 "HiddenGlobalOffsetX"
1079 The OpenCL grid dispatch
1080 global offset for the X
1081 dimension is passed in the
1082 kernarg.
1083
1084 "HiddenGlobalOffsetY"
1085 The OpenCL grid dispatch
1086 global offset for the Y
1087 dimension is passed in the
1088 kernarg.
1089
1090 "HiddenGlobalOffsetZ"
1091 The OpenCL grid dispatch
1092 global offset for the Z
1093 dimension is passed in the
1094 kernarg.
1095
1096 "HiddenNone"
1097 An argument that is not used
1098 by the kernel. Space needs to
1099 be left for it, but it does
1100 not need to be set up.
1101
1102 "HiddenPrintfBuffer"
1103 A global address space pointer
1104 to the runtime printf buffer
1105 is passed in kernarg.
1106
1107 "HiddenDefaultQueue"
1108 A global address space pointer
1109 to the OpenCL device enqueue
1110 queue that should be used by
1111 the kernel by default is
1112 passed in the kernarg.
1113
1114 "HiddenCompletionAction"
Yaxun Liuc928f2a2017-10-30 14:30:28 +00001115 A global address space pointer
1116 to help link enqueued kernels into
1117 the ancestor tree for determining
1118 when the parent kernel has finished.
Tony Tyef16a45e2017-06-06 20:31:59 +00001119
1120 "ValueType" string Required Kernel argument value type. Only
1121 present if "ValueKind" is
1122 "ByValue". For vector data
1123 types, the value is for the
1124 element type. Values include:
1125
1126 - "Struct"
1127 - "I8"
1128 - "U8"
1129 - "I16"
1130 - "U16"
1131 - "F16"
1132 - "I32"
1133 - "U32"
1134 - "F32"
1135 - "I64"
1136 - "U64"
1137 - "F64"
1138
1139 .. TODO
1140 How can it be determined if a
1141 vector type, and what size
1142 vector?
1143 "PointeeAlign" integer Alignment in bytes of pointee
1144 type for pointer type kernel
1145 argument. Must be a power
1146 of 2. Only present if
1147 "ValueKind" is
1148 "DynamicSharedPointer".
1149 "AddrSpaceQual" string Kernel argument address space
1150 qualifier. Only present if
1151 "ValueKind" is "GlobalBuffer" or
1152 "DynamicSharedPointer". Values
1153 are:
1154
1155 - "Private"
1156 - "Global"
1157 - "Constant"
1158 - "Local"
1159 - "Generic"
1160 - "Region"
1161
1162 .. TODO
1163 Is GlobalBuffer only Global
1164 or Constant? Is
1165 DynamicSharedPointer always
1166 Local? Can HCC allow Generic?
1167 How can Private or Region
1168 ever happen?
1169 "AccQual" string Kernel argument access
1170 qualifier. Only present if
1171 "ValueKind" is "Image" or
1172 "Pipe". Values
1173 are:
1174
1175 - "ReadOnly"
1176 - "WriteOnly"
1177 - "ReadWrite"
1178
1179 .. TODO
1180 Does this apply to
1181 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001182 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001183 performed by the kernel on the
1184 kernel argument. Only present if
1185 "ValueKind" is "GlobalBuffer",
1186 "Image", or "Pipe". This may be
1187 more restrictive than indicated
1188 by "AccQual" to reflect what the
1189 kernel actual does. If not
1190 present then the runtime must
1191 assume what is implied by
1192 "AccQual" and "IsConst". Values
1193 are:
1194
1195 - "ReadOnly"
1196 - "WriteOnly"
1197 - "ReadWrite"
1198
1199 "IsConst" boolean Indicates if the kernel argument
1200 is const qualified. Only present
1201 if "ValueKind" is
1202 "GlobalBuffer".
1203
1204 "IsRestrict" boolean Indicates if the kernel argument
1205 is restrict qualified. Only
1206 present if "ValueKind" is
1207 "GlobalBuffer".
1208
1209 "IsVolatile" boolean Indicates if the kernel argument
1210 is volatile qualified. Only
1211 present if "ValueKind" is
1212 "GlobalBuffer".
1213
1214 "IsPipe" boolean Indicates if the kernel argument
1215 is pipe qualified. Only present
1216 if "ValueKind" is "Pipe".
1217
1218 .. TODO
1219 Can GlobalBuffer be pipe
1220 qualified?
1221 ================= ============== ========= ================================
1222
1223..
1224
1225 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1226 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1227
1228 ============================ ============== ========= =====================
1229 String Key Value Type Required? Description
1230 ============================ ============== ========= =====================
1231 "KernargSegmentSize" integer Required The size in bytes of
1232 the kernarg segment
1233 that holds the values
1234 of the arguments to
1235 the kernel.
1236 "GroupSegmentFixedSize" integer Required The amount of group
1237 segment memory
1238 required by a
1239 work-group in
1240 bytes. This does not
1241 include any
1242 dynamically allocated
1243 group segment memory
1244 that may be added
1245 when the kernel is
1246 dispatched.
1247 "PrivateSegmentFixedSize" integer Required The amount of fixed
1248 private address space
1249 memory required for a
1250 work-item in
Tony Tye07d9f102017-11-10 01:00:54 +00001251 bytes. If the kernel
1252 uses a dynamic call
1253 stack then additional
Tony Tyef16a45e2017-06-06 20:31:59 +00001254 space must be added
1255 to this value for the
1256 call stack.
1257 "KernargSegmentAlign" integer Required The maximum byte
1258 alignment of
1259 arguments in the
1260 kernarg segment. Must
1261 be a power of 2.
1262 "WavefrontSize" integer Required Wavefront size. Must
1263 be a power of 2.
Tony Tye07d9f102017-11-10 01:00:54 +00001264 "NumSGPRs" integer Required Number of scalar
Tony Tyef16a45e2017-06-06 20:31:59 +00001265 registers used by a
1266 wavefront for
1267 GFX6-GFX9. This
1268 includes the special
1269 SGPRs for VCC, Flat
1270 Scratch (GFX7-GFX9)
1271 and XNACK (for
1272 GFX8-GFX9). It does
1273 not include the 16
1274 SGPR added if a trap
1275 handler is
1276 enabled. It is not
1277 rounded up to the
1278 allocation
1279 granularity.
Tony Tye07d9f102017-11-10 01:00:54 +00001280 "NumVGPRs" integer Required Number of vector
Tony Tyef16a45e2017-06-06 20:31:59 +00001281 registers used by
1282 each work-item for
1283 GFX6-GFX9
Tony Tye07d9f102017-11-10 01:00:54 +00001284 "MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001285 work-group size
1286 supported by the
1287 kernel in work-items.
Tony Tye07d9f102017-11-10 01:00:54 +00001288 Must be >=1 and
1289 consistent with any
1290 non-0 values in
1291 FixedWorkGroupSize.
1292 "FixedWorkGroupSize" sequence of Corresponds to the
1293 3 integers dispatch work-group
1294 size X, Y, Z. If
1295 omitted, defaults to
1296 0, 0, 0. If an
1297 element is non-0 then
1298 the kernel must only
1299 be launched with a
1300 matching corresponding
1301 work-group size.
Konstantin Zhuravlyov06ae4ec2017-11-28 17:51:08 +00001302 "NumSpilledSGPRs" integer Number of stores from
1303 a scalar register to
1304 a register allocator
1305 created spill
1306 location.
1307 "NumSpilledVGPRs" integer Number of stores from
1308 a vector register to
1309 a register allocator
1310 created spill
1311 location.
Tony Tyef16a45e2017-06-06 20:31:59 +00001312 ============================ ============== ========= =====================
1313
1314..
1315
Tony Tyef16a45e2017-06-06 20:31:59 +00001316Kernel Dispatch
1317~~~~~~~~~~~~~~~
1318
1319The HSA architected queuing language (AQL) defines a user space memory interface
1320that can be used to control the dispatch of kernels, in an agent independent
1321way. An agent can have zero or more AQL queues created for it using the ROCm
1322runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1323*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1324mechanics and packet layouts.
1325
1326The packet processor of a kernel agent is responsible for detecting and
1327dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1328packet processor is implemented by the hardware command processor (CP),
1329asynchronous dispatch controller (ADC) and shader processor input controller
1330(SPI).
1331
1332The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1333mode driver to initialize and register the AQL queue with CP.
1334
1335To dispatch a kernel the following actions are performed. This can occur in the
1336CPU host program, or from an HSA kernel executing on a GPU.
1337
13381. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1339 executed is obtained.
13402. A pointer to the kernel descriptor (see
1341 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1342 obtained. It must be for a kernel that is contained in a code object that that
1343 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1344 associated.
13453. Space is allocated for the kernel arguments using the ROCm runtime allocator
1346 for a memory region with the kernarg property for the kernel agent that will
1347 execute the kernel. It must be at least 16 byte aligned.
13484. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00001349 allocation. The layout is defined in the *HSA Programmer's Language Reference*
Tony Tyef16a45e2017-06-06 20:31:59 +00001350 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1351 memory in the same way constant memory is accessed. (Note that the HSA
1352 specification allows an implementation to copy the kernel argument contents to
1353 another location that is accessed by the kernel.)
13545. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1355 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1356 packet. The packet must be set up, and the final write must use an atomic
1357 store release to set the packet kind to ensure the packet contents are
1358 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1359 notify the kernel agent that the AQL queue has been updated. These rules, and
1360 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1361 System Architecture Specification* [HSA]_.
13626. A kernel dispatch packet includes information about the actual dispatch,
1363 such as grid and work-group size, together with information from the code
1364 object about the kernel, such as segment sizes. The ROCm runtime queries on
1365 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001366 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000013677. CP executes micro-code and is responsible for detecting and setting up the
1368 GPU to execute the wavefronts of a kernel dispatch.
13698. CP ensures that when the a wavefront starts executing the kernel machine
1370 code, the scalar general purpose registers (SGPR) and vector general purpose
1371 registers (VGPR) are set up as required by the machine code. The required
1372 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1373 register state is defined in
1374 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
13759. The prolog of the kernel machine code (see
1376 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1377 before continuing executing the machine code that corresponds to the kernel.
137810. When the kernel dispatch has completed execution, CP signals the completion
1379 signal specified in the kernel dispatch packet if not 0.
1380
1381.. _amdgpu-amdhsa-memory-spaces:
1382
1383Memory Spaces
1384~~~~~~~~~~~~~
1385
1386The memory space properties are:
1387
1388 .. table:: AMDHSA Memory Spaces
1389 :name: amdgpu-amdhsa-memory-spaces-table
1390
1391 ================= =========== ======== ======= ==================
1392 Memory Space Name HSA Segment Hardware Address NULL Value
1393 Name Name Size
1394 ================= =========== ======== ======= ==================
1395 Private private scratch 32 0x00000000
1396 Local group LDS 32 0xFFFFFFFF
1397 Global global global 64 0x0000000000000000
1398 Constant constant *same as 64 0x0000000000000000
1399 global*
1400 Generic flat flat 64 0x0000000000000000
1401 Region N/A GDS 32 *not implemented
1402 for AMDHSA*
1403 ================= =========== ======== ======= ==================
1404
1405The global and constant memory spaces both use global virtual addresses, which
1406are the same virtual address space used by the CPU. However, some virtual
1407addresses may only be accessible to the CPU, some only accessible by the GPU,
1408and some by both.
1409
1410Using the constant memory space indicates that the data will not change during
1411the execution of the kernel. This allows scalar read instructions to be
1412used. The vector and scalar L1 caches are invalidated of volatile data before
1413each kernel dispatch execution to allow constant memory to change values between
1414kernel dispatches.
1415
1416The local memory space uses the hardware Local Data Store (LDS) which is
1417automatically allocated when the hardware creates work-groups of wavefronts, and
1418freed when all the wavefronts of a work-group have terminated. The data store
1419(DS) instructions can be used to access it.
1420
1421The private memory space uses the hardware scratch memory support. If the kernel
1422uses scratch, then the hardware allocates memory that is accessed using
1423wavefront lane dword (4 byte) interleaving. The mapping used from private
1424address to physical address is:
1425
1426 ``wavefront-scratch-base +
1427 (private-address * wavefront-size * 4) +
1428 (wavefront-lane-id * 4)``
1429
1430There are different ways that the wavefront scratch base address is determined
1431by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1432memory can be accessed in an interleaved manner using buffer instruction with
1433the scratch buffer descriptor and per wave scratch offset, by the scratch
1434instructions, or by flat instructions. If each lane of a wavefront accesses the
1435same private address, the interleaving results in adjacent dwords being accessed
1436and hence requires fewer cache lines to be fetched. Multi-dword access is not
1437supported except by flat and scratch instructions in GFX9.
1438
1439The generic address space uses the hardware flat address support available in
1440GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1441local appertures), that are outside the range of addressible global memory, to
1442map from a flat address to a private or local address.
1443
1444FLAT instructions can take a flat address and access global, private (scratch)
1445and group (LDS) memory depending in if the address is within one of the
1446apperture ranges. Flat access to scratch requires hardware aperture setup and
1447setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1448access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1449(see :ref:`amdgpu-amdhsa-m0`).
1450
1451To convert between a segment address and a flat address the base address of the
1452appertures address can be used. For GFX7-GFX8 these are available in the
1453:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1454Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1455GFX9 the appature base addresses are directly available as inline constant
1456registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1457address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1458which makes it easier to convert from flat to segment or segment to flat.
1459
Tony Tye46d35762017-08-15 20:47:41 +00001460Image and Samplers
1461~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001462
1463Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1464hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1465HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1466enumeration values for the queries that are not trivially deducible from the S#
1467representation.
1468
1469HSA Signals
1470~~~~~~~~~~~
1471
Tony Tye46d35762017-08-15 20:47:41 +00001472HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1473structure allocated in memory accessible from both the CPU and GPU. The
1474structure is defined by the ROCm runtime and subject to change between releases
1475(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001476
1477.. _amdgpu-amdhsa-hsa-aql-queue:
1478
1479HSA AQL Queue
1480~~~~~~~~~~~~~
1481
Tony Tye46d35762017-08-15 20:47:41 +00001482The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001483between releases (see [AMD-ROCm-github]_). For some processors it contains
1484fields needed to implement certain language features such as the flat address
1485aperture bases. It also contains fields used by CP such as managing the
1486allocation of scratch memory.
1487
1488.. _amdgpu-amdhsa-kernel-descriptor:
1489
1490Kernel Descriptor
1491~~~~~~~~~~~~~~~~~
1492
1493A kernel descriptor consists of the information needed by CP to initiate the
1494execution of a kernel, including the entry point address of the machine code
1495that implements the kernel.
1496
1497Kernel Descriptor for GFX6-GFX9
1498+++++++++++++++++++++++++++++++
1499
1500CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1501
1502 .. table:: Kernel Descriptor for GFX6-GFX9
1503 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1504
Tony Tye6baa6d22017-10-18 22:16:55 +00001505 ======= ======= =============================== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001506 Bits Size Field Name Description
Tony Tye6baa6d22017-10-18 22:16:55 +00001507 ======= ======= =============================== ============================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001508 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001509 address space memory
1510 required for a work-group
1511 in bytes. This does not
1512 include any dynamically
1513 allocated local address
1514 space memory that may be
1515 added when the kernel is
1516 dispatched.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001517 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001518 private address space
1519 memory required for a
1520 work-item in bytes. If
1521 is_dynamic_callstack is 1
1522 then additional space must
1523 be added to this value for
1524 the call stack.
Tony Tye07d9f102017-11-10 01:00:54 +00001525 127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001526 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001527 negative) from base
1528 address of kernel
1529 descriptor to kernel's
1530 entry point instruction
1531 which must be 256 byte
1532 aligned.
Tony Tye07d9f102017-11-10 01:00:54 +00001533 223:192 4 bytes MaxFlatWorkGroupSize Maximum flat work-group
1534 size supported by the
1535 kernel in work-items. If
1536 an exact work-group size
1537 is required then must be
1538 omitted or 0 and
1539 ReqdWorkGroupSize* must
1540 be set to non-0.
1541 239:224 2 bytes ReqdWorkGroupSizeX If present and non-0 then
1542 the kernel
1543 must be executed with the
1544 specified work-group size
1545 for X.
1546 255:240 2 bytes ReqdWorkGroupSizeY If present and non-0 then
1547 the kernel
1548 must be executed with the
1549 specified work-group size
1550 for Y.
1551 271:256 2 bytes ReqdWorkGroupSizeZ If present and non-0 then
1552 the kernel
1553 must be executed with the
1554 specified work-group size
1555 for Z.
Mark Searles095d4ea2017-12-07 21:24:27 +00001556 383:272 14 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001557 bytes
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001558 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001559 program settings used by
1560 CP to set up
1561 ``COMPUTE_PGM_RSRC1``
1562 configuration
1563 register. See
Tony Tye6baa6d22017-10-18 22:16:55 +00001564 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001565 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001566 program settings used by
1567 CP to set up
1568 ``COMPUTE_PGM_RSRC2``
1569 configuration
1570 register. See
1571 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001572 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1573 SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001574 (see
1575 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1576
1577 The total number of SGPR
1578 user data registers
1579 requested must not exceed
1580 16 and match value in
1581 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1582 Any requests beyond 16
1583 will be ignored.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001584 449 1 bit EnableSGPRDispatchPtr *see above*
1585 450 1 bit EnableSGPRQueuePtr *see above*
1586 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1587 452 1 bit EnableSGPRDispatchID *see above*
1588 453 1 bit EnableSGPRFlatScratchInit *see above*
1589 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1590 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1591 should always be 0.
1592 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1593 should always be 0.
1594 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1595 should always be 0.
Tony Tye31105cc2017-12-11 15:35:27 +00001596 463:458 6 bits Reserved, must be 0.
Tony Tye6baa6d22017-10-18 22:16:55 +00001597 511:464 6 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001598 bytes
1599 512 **Total size 64 bytes.**
Tony Tye6baa6d22017-10-18 22:16:55 +00001600 ======= ====================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001601
1602..
1603
1604 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001605 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tyef16a45e2017-06-06 20:31:59 +00001606
Tony Tye3b340612017-06-07 00:46:08 +00001607 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001608 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001609 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001610 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001611 used by each work-item,
1612 granularity is device
1613 specific:
1614
Tony Tye07d9f102017-11-10 01:00:54 +00001615 GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001616 - max_vgpr 1..256
1617 - roundup((max_vgpg + 1)
1618 / 4) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001619
1620 Used by CP to set up
1621 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001622 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001623 used by a wavefront,
1624 granularity is device
1625 specific:
1626
Tony Tye07d9f102017-11-10 01:00:54 +00001627 GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001628 - max_sgpr 1..112
1629 - roundup((max_sgpg + 1)
1630 / 8) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001631 GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001632 - max_sgpr 1..112
1633 - roundup((max_sgpg + 1)
1634 / 16) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001635
1636 Includes the special SGPRs
1637 for VCC, Flat Scratch (for
1638 GFX7 onwards) and XNACK
1639 (for GFX8 onwards). It does
1640 not include the 16 SGPR
1641 added if a trap handler is
1642 enabled.
1643
1644 Used by CP to set up
1645 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001646 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001647
1648 Start executing wavefront
1649 at the specified priority.
1650
1651 CP is responsible for
1652 filling in
1653 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001654 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001655 with specified rounding
1656 mode for single (32
1657 bit) floating point
1658 precision floating point
1659 operations.
1660
1661 Floating point rounding
1662 mode values are defined in
1663 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1664
1665 Used by CP to set up
1666 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001667 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001668 with specified rounding
1669 denorm mode for half/double (16
1670 and 64 bit) floating point
1671 precision floating point
1672 operations.
1673
1674 Floating point rounding
1675 mode values are defined in
1676 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1677
1678 Used by CP to set up
1679 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001680 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001681 with specified denorm mode
1682 for single (32
1683 bit) floating point
1684 precision floating point
1685 operations.
1686
1687 Floating point denorm mode
1688 values are defined in
1689 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1690
1691 Used by CP to set up
1692 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001693 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001694 with specified denorm mode
1695 for half/double (16
1696 and 64 bit) floating point
1697 precision floating point
1698 operations.
1699
1700 Floating point denorm mode
1701 values are defined in
1702 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1703
1704 Used by CP to set up
1705 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001706 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001707
1708 Start executing wavefront
1709 in privilege trap handler
1710 mode.
1711
1712 CP is responsible for
1713 filling in
1714 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001715 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001716 with DX10 clamp mode
1717 enabled. Used by the vector
Tony Tye6baa6d22017-10-18 22:16:55 +00001718 ALU to force DX10 style
Tony Tyef16a45e2017-06-06 20:31:59 +00001719 treatment of NaN's (when
1720 set, clamp NaN to zero,
1721 otherwise pass NaN
1722 through).
1723
1724 Used by CP to set up
1725 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001726 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001727
1728 Start executing wavefront
1729 in single step mode.
1730
1731 CP is responsible for
1732 filling in
1733 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001734 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001735 with IEEE mode
1736 enabled. Floating point
1737 opcodes that support
1738 exception flag gathering
1739 will quiet and propagate
1740 signaling-NaN inputs per
1741 IEEE 754-2008. Min_dx10 and
1742 max_dx10 become IEEE
1743 754-2008 compliant due to
1744 signaling-NaN propagation
1745 and quieting.
1746
1747 Used by CP to set up
1748 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001749 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001750
1751 Only one work-group allowed
1752 to execute on a compute
1753 unit.
1754
1755 CP is responsible for
1756 filling in
1757 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001758 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001759
1760 Flag that can be used to
1761 control debugging code.
1762
1763 CP is responsible for
1764 filling in
1765 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye07d9f102017-11-10 01:00:54 +00001766 26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001767 Reserved, must be 0.
1768 GFX9
1769 Wavefront starts execution
1770 with specified fp16 overflow
1771 mode.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001772
Tony Tye6baa6d22017-10-18 22:16:55 +00001773 - If 0, fp16 overflow generates
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001774 +/-INF values.
Tony Tye6baa6d22017-10-18 22:16:55 +00001775 - If 1, fp16 overflow that is the
1776 result of an +/-INF input value
1777 or divide by 0 produces a +/-INF,
1778 otherwise clamps computed
1779 overflow to +/-MAX_FP16 as
1780 appropriate.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001781
1782 Used by CP to set up
1783 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye6baa6d22017-10-18 22:16:55 +00001784 31:27 5 bits Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001785 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001786 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001787
1788..
1789
1790 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1791 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1792
Tony Tye3b340612017-06-07 00:46:08 +00001793 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001794 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001795 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001796 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1797 _WAVE_OFFSET SGPR wave scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001798 system register (see
1799 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1800
1801 Used by CP to set up
1802 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001803 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001804 user data registers
1805 requested. This number must
1806 match the number of user
1807 data registers enabled.
1808
1809 Used by CP to set up
1810 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001811 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tyef16a45e2017-06-06 20:31:59 +00001812 TRAP instruction which
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00001813 requires a trap handler to
Tony Tyef16a45e2017-06-06 20:31:59 +00001814 be enabled.
1815
1816 CP sets
1817 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1818 if the runtime has
1819 installed a trap handler
1820 regardless of the setting
1821 of this field.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001822 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001823 system SGPR register for
1824 the work-group id in the X
1825 dimension (see
1826 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1827
1828 Used by CP to set up
1829 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001830 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001831 system SGPR register for
1832 the work-group id in the Y
1833 dimension (see
1834 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1835
1836 Used by CP to set up
1837 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001838 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001839 system SGPR register for
1840 the work-group id in the Z
1841 dimension (see
1842 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1843
1844 Used by CP to set up
1845 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001846 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001847 system SGPR register for
1848 work-group information (see
1849 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1850
1851 Used by CP to set up
1852 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001853 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001854 VGPR system registers used
1855 for the work-item ID.
1856 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1857 defines the values.
1858
1859 Used by CP to set up
1860 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001861 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001862
1863 Wavefront starts execution
1864 with address watch
1865 exceptions enabled which
1866 are generated when L1 has
1867 witnessed a thread access
1868 an *address of
1869 interest*.
1870
1871 CP is responsible for
1872 filling in the address
1873 watch bit in
1874 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1875 according to what the
1876 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001877 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001878
1879 Wavefront starts execution
1880 with memory violation
1881 exceptions exceptions
1882 enabled which are generated
1883 when a memory violation has
1884 occurred for this wave from
1885 L1 or LDS
1886 (write-to-read-only-memory,
1887 mis-aligned atomic, LDS
1888 address out of range,
1889 illegal address, etc.).
1890
1891 CP sets the memory
1892 violation bit in
1893 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1894 according to what the
1895 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001896 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001897
1898 CP uses the rounded value
1899 from the dispatch packet,
1900 not this value, as the
1901 dispatch may contain
1902 dynamically allocated group
1903 segment memory. CP writes
1904 directly to
1905 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1906
1907 Amount of group segment
1908 (LDS) to allocate for each
1909 work-group. Granularity is
1910 device specific:
1911
1912 GFX6:
1913 roundup(lds-size / (64 * 4))
1914 GFX7-GFX9:
1915 roundup(lds-size / (128 * 4))
1916
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001917 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1918 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001919 enabled.
1920
1921 Used by CP to set up
1922 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1923 (set from bits 0..6).
1924
1925 IEEE 754 FP Invalid
1926 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001927 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1928 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001929 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001930 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1931 _DIVISION_BY_ZERO Zero
1932 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1933 _OVERFLOW
1934 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1935 _UNDERFLOW
1936 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1937 _INEXACT
1938 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1939 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001940 only)
Tony Tye6baa6d22017-10-18 22:16:55 +00001941 31 1 bit Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001942 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001943 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001944
1945..
1946
1947 .. table:: Floating Point Rounding Mode Enumeration Values
1948 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1949
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001950 ====================================== ===== ==============================
1951 Enumeration Name Value Description
1952 ====================================== ===== ==============================
1953 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1954 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1955 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1956 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1957 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001958
1959..
1960
1961 .. table:: Floating Point Denorm Mode Enumeration Values
1962 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1963
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001964 ====================================== ===== ==============================
1965 Enumeration Name Value Description
1966 ====================================== ===== ==============================
1967 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
1968 Denorms
1969 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
1970 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
1971 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
1972 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001973
1974..
1975
1976 .. table:: System VGPR Work-Item ID Enumeration Values
1977 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
1978
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001979 ======================================== ===== ============================
1980 Enumeration Name Value Description
1981 ======================================== ===== ============================
1982 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
1983 ID.
1984 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
1985 dimensions ID.
1986 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
1987 dimensions ID.
1988 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
1989 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001990
1991.. _amdgpu-amdhsa-initial-kernel-execution-state:
1992
1993Initial Kernel Execution State
1994~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1995
1996This section defines the register state that will be set up by the packet
1997processor prior to the start of execution of every wavefront. This is limited by
1998the constraints of the hardware controllers of CP/ADC/SPI.
1999
2000The order of the SGPR registers is defined, but the compiler can specify which
2001ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2002fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2003for enabled registers are dense starting at SGPR0: the first enabled register is
2004SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2005an SGPR number.
2006
2007The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
2008all waves of the grid. It is possible to specify more than 16 User SGPRs using
2009the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2010initialized. These are then immediately followed by the System SGPRs that are
2011set up by ADC/SPI and can have different values for each wave of the grid
2012dispatch.
2013
2014SGPR register initial state is defined in
2015:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2016
2017 .. table:: SGPR Register Set Up Order
2018 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2019
2020 ========== ========================== ====== ==============================
2021 SGPR Order Name Number Description
2022 (kernel descriptor enable of
2023 field) SGPRs
2024 ========== ========================== ====== ==============================
2025 First Private Segment Buffer 4 V# that can be used, together
2026 (enable_sgpr_private with Scratch Wave Offset as an
2027 _segment_buffer) offset, to access the private
2028 memory space using a segment
2029 address.
2030
2031 CP uses the value provided by
2032 the runtime.
2033 then Dispatch Ptr 2 64 bit address of AQL dispatch
2034 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2035 actually executing.
2036 then Queue Ptr 2 64 bit address of amd_queue_t
2037 (enable_sgpr_queue_ptr) object for AQL queue on which
2038 the dispatch packet was
2039 queued.
2040 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2041 (enable_sgpr_kernarg segment. This is directly
2042 _segment_ptr) copied from the
2043 kernarg_address in the kernel
2044 dispatch packet.
2045
2046 Having CP load it once avoids
2047 loading it at the beginning of
2048 every wavefront.
2049 then Dispatch Id 2 64 bit Dispatch ID of the
2050 (enable_sgpr_dispatch_id) dispatch packet being
2051 executed.
2052 then Flat Scratch Init 2 This is 2 SGPRs:
2053 (enable_sgpr_flat_scratch
2054 _init) GFX6
2055 Not supported.
2056 GFX7-GFX8
2057 The first SGPR is a 32 bit
2058 byte offset from
2059 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2060 to per SPI base of memory
2061 for scratch for the queue
2062 executing the kernel
2063 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00002064 from the runtime. (The
2065 Scratch Segment Buffer base
2066 address is
2067 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2068 plus this offset.) The value
2069 of Scratch Wave Offset must
2070 be added to this offset by
2071 the kernel machine code,
2072 right shifted by 8, and
2073 moved to the FLAT_SCRATCH_HI
2074 SGPR register.
2075 FLAT_SCRATCH_HI corresponds
2076 to SGPRn-4 on GFX7, and
2077 SGPRn-6 on GFX8 (where SGPRn
2078 is the highest numbered SGPR
2079 allocated to the wave).
2080 FLAT_SCRATCH_HI is
2081 multiplied by 256 (as it is
2082 in units of 256 bytes) and
2083 added to
2084 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2085 to calculate the per wave
2086 FLAT SCRATCH BASE in flat
2087 memory instructions that
2088 access the scratch
2089 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002090
2091 The second SGPR is 32 bit
2092 byte size of a single
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002093 work-item's scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002094 usage. CP obtains this from
2095 the runtime, and it is
2096 always a multiple of DWORD.
2097 CP checks that the value in
2098 the kernel dispatch packet
2099 Private Segment Byte Size is
2100 not larger, and requests the
2101 runtime to increase the
2102 queue's scratch size if
2103 necessary. The kernel code
2104 must move it to
2105 FLAT_SCRATCH_LO which is
2106 SGPRn-3 on GFX7 and SGPRn-5
2107 on GFX8. FLAT_SCRATCH_LO is
2108 used as the FLAT SCRATCH
2109 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002110 instructions. Having CP load
2111 it once avoids loading it at
2112 the beginning of every
Tony Tyef59d0712017-11-10 20:51:43 +00002113 wavefront.
2114 GFX9
2115 This is the
Tony Tye46d35762017-08-15 20:47:41 +00002116 64 bit base address of the
2117 per SPI scratch backing
2118 memory managed by SPI for
2119 the queue executing the
2120 kernel dispatch. CP obtains
2121 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002122 divides it if there are
2123 multiple Shader Arrays each
2124 with its own SPI). The value
2125 of Scratch Wave Offset must
2126 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002127 machine code and the result
2128 moved to the FLAT_SCRATCH
2129 SGPR which is SGPRn-6 and
2130 SGPRn-5. It is used as the
2131 FLAT SCRATCH BASE in flat
Tony Tyef59d0712017-11-10 20:51:43 +00002132 memory instructions.
2133 then Private Segment Size 1 The 32 bit byte size of a
2134 (enable_sgpr_private single
2135 work-item's
2136 scratch_segment_size) memory
2137 allocation. This is the
2138 value from the kernel
2139 dispatch packet Private
2140 Segment Byte Size rounded up
2141 by CP to a multiple of
2142 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002143
2144 Having CP load it once avoids
2145 loading it at the beginning of
2146 every wavefront.
2147
2148 This is not used for
2149 GFX7-GFX8 since it is the same
2150 value as the second SGPR of
2151 Flat Scratch Init. However, it
2152 may be needed for GFX9 which
2153 changes the meaning of the
2154 Flat Scratch Init value.
2155 then Grid Work-Group Count X 1 32 bit count of the number of
2156 (enable_sgpr_grid work-groups in the X dimension
2157 _workgroup_count_X) for the grid being
2158 executed. Computed from the
2159 fields in the kernel dispatch
2160 packet as ((grid_size.x +
2161 workgroup_size.x - 1) /
2162 workgroup_size.x).
2163 then Grid Work-Group Count Y 1 32 bit count of the number of
2164 (enable_sgpr_grid work-groups in the Y dimension
2165 _workgroup_count_Y && for the grid being
2166 less than 16 previous executed. Computed from the
2167 SGPRs) fields in the kernel dispatch
2168 packet as ((grid_size.y +
2169 workgroup_size.y - 1) /
2170 workgroupSize.y).
2171
2172 Only initialized if <16
2173 previous SGPRs initialized.
2174 then Grid Work-Group Count Z 1 32 bit count of the number of
2175 (enable_sgpr_grid work-groups in the Z dimension
2176 _workgroup_count_Z && for the grid being
2177 less than 16 previous executed. Computed from the
2178 SGPRs) fields in the kernel dispatch
2179 packet as ((grid_size.z +
2180 workgroup_size.z - 1) /
2181 workgroupSize.z).
2182
2183 Only initialized if <16
2184 previous SGPRs initialized.
2185 then Work-Group Id X 1 32 bit work-group id in X
2186 (enable_sgpr_workgroup_id dimension of grid for
2187 _X) wavefront.
2188 then Work-Group Id Y 1 32 bit work-group id in Y
2189 (enable_sgpr_workgroup_id dimension of grid for
2190 _Y) wavefront.
2191 then Work-Group Id Z 1 32 bit work-group id in Z
2192 (enable_sgpr_workgroup_id dimension of grid for
2193 _Z) wavefront.
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002194 then Work-Group Info 1 {first_wave, 14'b0000,
Tony Tyef16a45e2017-06-06 20:31:59 +00002195 (enable_sgpr_workgroup ordered_append_term[10:0],
2196 _info) threadgroup_size_in_waves[5:0]}
2197 then Scratch Wave Offset 1 32 bit byte offset from base
2198 (enable_sgpr_private of scratch base of queue
2199 _segment_wave_offset) executing the kernel
2200 dispatch. Must be used as an
2201 offset with Private
2202 segment address when using
2203 Scratch Segment Buffer. It
2204 must be used to set up FLAT
2205 SCRATCH for flat addressing
2206 (see
2207 :ref:`amdgpu-amdhsa-flat-scratch`).
2208 ========== ========================== ====== ==============================
2209
2210The order of the VGPR registers is defined, but the compiler can specify which
2211ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2212fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2213for enabled registers are dense starting at VGPR0: the first enabled register is
2214VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2215VGPR number.
2216
2217VGPR register initial state is defined in
2218:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2219
2220 .. table:: VGPR Register Set Up Order
2221 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2222
2223 ========== ========================== ====== ==============================
2224 VGPR Order Name Number Description
2225 (kernel descriptor enable of
2226 field) VGPRs
2227 ========== ========================== ====== ==============================
2228 First Work-Item Id X 1 32 bit work item id in X
2229 (Always initialized) dimension of work-group for
2230 wavefront lane.
2231 then Work-Item Id Y 1 32 bit work item id in Y
2232 (enable_vgpr_workitem_id dimension of work-group for
2233 > 0) wavefront lane.
2234 then Work-Item Id Z 1 32 bit work item id in Z
2235 (enable_vgpr_workitem_id dimension of work-group for
2236 > 1) wavefront lane.
2237 ========== ========================== ====== ==============================
2238
2239The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
2240
22411. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2242 registers.
22432. Work-group Id registers X, Y, Z are set by ADC which supports any
2244 combination including none.
22453. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
2246 cannot included with the flat scratch init value which is per queue.
22474. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2248 or (X, Y, Z).
2249
2250Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2251value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2252
2253The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye07d9f102017-11-10 01:00:54 +00002254has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tyef16a45e2017-06-06 20:31:59 +00002255instructions (GFX9).
2256
2257If buffer operations are used then the compiler can generate a V# with the
2258following properties:
2259
2260* base address of 0
2261* no swizzle
2262* ATC: 1 if IOMMU present (such as APU)
2263* ptr64: 1
2264* MTYPE set to support memory coherence that matches the runtime (such as CC for
2265 APU and NC for dGPU).
2266
2267.. _amdgpu-amdhsa-kernel-prolog:
2268
2269Kernel Prolog
2270~~~~~~~~~~~~~
2271
2272.. _amdgpu-amdhsa-m0:
2273
2274M0
2275++
2276
2277GFX6-GFX8
2278 The M0 register must be initialized with a value at least the total LDS size
2279 if the kernel may access LDS via DS or flat operations. Total LDS size is
2280 available in dispatch packet. For M0, it is also possible to use maximum
2281 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2282 GFX7-GFX8).
2283GFX9
2284 The M0 register is not used for range checking LDS accesses and so does not
2285 need to be initialized in the prolog.
2286
2287.. _amdgpu-amdhsa-flat-scratch:
2288
2289Flat Scratch
2290++++++++++++
2291
2292If the kernel may use flat operations to access scratch memory, the prolog code
2293must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
2294are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
2295Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2296
2297GFX6
2298 Flat scratch is not supported.
2299
Tony Tye07d9f102017-11-10 01:00:54 +00002300GFX7-GFX8
Tony Tyef16a45e2017-06-06 20:31:59 +00002301 1. The low word of Flat Scratch Init is 32 bit byte offset from
2302 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2303 being managed by SPI for the queue executing the kernel dispatch. This is
2304 the same value used in the Scratch Segment Buffer V# base address. The
2305 prolog must add the value of Scratch Wave Offset to get the wave's byte
2306 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2307 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2308 by 8 before moving into FLAT_SCRATCH_LO.
2309 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2310 work-items scratch memory usage. This is directly loaded from the kernel
2311 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2312 DWORD. Having CP load it once avoids loading it at the beginning of every
2313 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2314 SIZE.
Tony Tyef59d0712017-11-10 20:51:43 +00002315
Tony Tyef16a45e2017-06-06 20:31:59 +00002316GFX9
2317 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2318 memory being managed by SPI for the queue executing the kernel dispatch. The
2319 prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
2320 pair for use as the flat scratch base in flat memory instructions.
2321
2322.. _amdgpu-amdhsa-memory-model:
2323
2324Memory Model
2325~~~~~~~~~~~~
2326
2327This section describes the mapping of LLVM memory model onto AMDGPU machine code
2328(see :ref:`memmodel`). *The implementation is WIP.*
2329
2330.. TODO
2331 Update when implementation complete.
2332
Tony Tyef16a45e2017-06-06 20:31:59 +00002333The AMDGPU backend supports the memory synchronization scopes specified in
2334:ref:`amdgpu-memory-scopes`.
2335
2336The code sequences used to implement the memory model are defined in table
2337:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2338
2339The sequences specify the order of instructions that a single thread must
2340execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2341to other memory instructions executed by the same thread. This allows them to be
2342moved earlier or later which can allow them to be combined with other instances
2343of the same instruction, or hoisted/sunk out of loops to improve
2344performance. Only the instructions related to the memory model are given;
2345additional ``s_waitcnt`` instructions are required to ensure registers are
2346defined before being used. These may be able to be combined with the memory
2347model ``s_waitcnt`` instructions as described above.
2348
Tony Tye6baa6d22017-10-18 22:16:55 +00002349The AMDGPU backend supports the following memory models:
2350
2351 HSA Memory Model [HSA]_
2352 The HSA memory model uses a single happens-before relation for all address
2353 spaces (see :ref:`amdgpu-address-spaces`).
2354 OpenCL Memory Model [OpenCL]_
2355 The OpenCL memory model which has separate happens-before relations for the
2356 global and local address spaces. Only a fence specifying both global and
2357 local address space, and seq_cst instructions join the relationships. Since
2358 the LLVM ``memfence`` instruction does not allow an address space to be
2359 specified the OpenCL fence has to convervatively assume both local and
2360 global address space was specified. However, optimizations can often be
2361 done to eliminate the additional ``s_waitcnt`` instructions when there are
2362 no intervening memory instructions which access the corresponding address
2363 space. The code sequences in the table indicate what can be omitted for the
2364 OpenCL memory. The target triple environment is used to determine if the
2365 source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tyef16a45e2017-06-06 20:31:59 +00002366
2367``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2368operations.
2369
2370``buffer/global/flat_load/store/atomic`` instructions to global memory are
2371termed vector memory operations.
2372
2373For GFX6-GFX9:
2374
2375* Each agent has multiple compute units (CU).
2376* Each CU has multiple SIMDs that execute wavefronts.
2377* The wavefronts for a single work-group are executed in the same CU but may be
2378 executed by different SIMDs.
2379* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2380 executing on it.
2381* All LDS operations of a CU are performed as wavefront wide operations in a
2382 global order and involve no caching. Completion is reported to a wavefront in
2383 execution order.
2384* The LDS memory has multiple request queues shared by the SIMDs of a
2385 CU. Therefore, the LDS operations performed by different waves of a work-group
2386 can be reordered relative to each other, which can result in reordering the
2387 visibility of vector memory operations with respect to LDS operations of other
2388 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002389 ensure synchronization between LDS operations and vector memory operations
Tony Tyef16a45e2017-06-06 20:31:59 +00002390 between waves of a work-group, but not between operations performed by the
2391 same wavefront.
2392* The vector memory operations are performed as wavefront wide operations and
2393 completion is reported to a wavefront in execution order. The exception is
Tony Tye07d9f102017-11-10 01:00:54 +00002394 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tyef16a45e2017-06-06 20:31:59 +00002395 vector memory order if they access LDS memory, and out of LDS operation order
2396 if they access global memory.
Tony Tye6baa6d22017-10-18 22:16:55 +00002397* The vector memory operations access a single vector L1 cache shared by all
2398 SIMDs a CU. Therefore, no special action is required for coherence between the
2399 lanes of a single wavefront, or for coherence between wavefronts in the same
2400 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
2401 executing in different work-groups as they may be executing on different CUs.
Tony Tyef16a45e2017-06-06 20:31:59 +00002402* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2403 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2404 scalar operations are used in a restricted way so do not impact the memory
2405 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2406* The vector and scalar memory operations use an L2 cache shared by all CUs on
2407 the same agent.
2408* The L2 cache has independent channels to service disjoint ranges of virtual
2409 addresses.
2410* Each CU has a separate request queue per channel. Therefore, the vector and
2411 scalar memory operations performed by waves executing in different work-groups
2412 (which may be executing on different CUs) of an agent can be reordered
2413 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002414 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002415 previous vector memory operation has completed before executing a subsequent
2416 vector memory or LDS operation and so can be used to meet the requirements of
2417 acquire and release.
2418* The L2 cache can be kept coherent with other agents on some targets, or ranges
2419 of virtual addresses can be set up to bypass it to ensure system coherence.
2420
Tony Tye07d9f102017-11-10 01:00:54 +00002421Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tyef16a45e2017-06-06 20:31:59 +00002422or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2423memory, atomic memory orderings are not meaningful and all accesses are treated
2424as non-atomic.
2425
2426Constant address space uses ``buffer/global_load`` instructions (or equivalent
2427scalar memory instructions). Since the constant address space contents do not
2428change during the execution of a kernel dispatch it is not legal to perform
2429stores, and atomic memory orderings are not meaningful and all access are
2430treated as non-atomic.
2431
2432A memory synchronization scope wider than work-group is not meaningful for the
2433group (LDS) address space and is treated as work-group.
2434
2435The memory model does not support the region address space which is treated as
2436non-atomic.
2437
2438Acquire memory ordering is not meaningful on store atomic instructions and is
2439treated as non-atomic.
2440
2441Release memory ordering is not meaningful on load atomic instructions and is
2442treated a non-atomic.
2443
2444Acquire-release memory ordering is not meaningful on load or store atomic
2445instructions and is treated as acquire and release respectively.
2446
2447AMDGPU backend only uses scalar memory operations to access memory that is
2448proven to not change during the execution of the kernel dispatch. This includes
2449constant address space and global address space for program scope const
2450variables. Therefore the kernel machine code does not have to maintain the
2451scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2452and vector L1 caches are invalidated between kernel dispatches by CP since
2453constant address space data may change between kernel dispatch executions. See
2454:ref:`amdgpu-amdhsa-memory-spaces`.
2455
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002456The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002457case the AMDGPU backend ensures the memory location used to spill is never
2458accessed by vector memory operations at the same time. If scalar writes are used
2459then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2460return since the locations may be used for vector memory instructions by a
2461future wave that uses the same scratch area, or a function call that creates a
2462frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2463as all scalar writes are write-before-read in the same thread.
2464
Tony Tye6baa6d22017-10-18 22:16:55 +00002465Scratch backing memory (which is used for the private address space)
2466is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2467address space is only accessed by a single thread, and is always
2468write-before-read, there is never a need to invalidate these entries from the L1
2469cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2470volatile cache lines.
Tony Tyef16a45e2017-06-06 20:31:59 +00002471
2472On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye6baa6d22017-10-18 22:16:55 +00002473to invalidate the L2 cache. This also causes it to be treated as
2474non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2475(cache coherent) and so the L2 cache will coherent with the CPU and other
2476agents.
Tony Tyef16a45e2017-06-06 20:31:59 +00002477
2478 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2479 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2480
Tony Tye6baa6d22017-10-18 22:16:55 +00002481 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002482 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2483 Ordering Sync Scope Address
2484 Space
Tony Tye6baa6d22017-10-18 22:16:55 +00002485 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002486 **Non-Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002487 -----------------------------------------------------------------------------------
2488 load *none* *none* - global - !volatile & !nontemporal
2489 - generic
2490 - private 1. buffer/global/flat_load
2491 - constant
2492 - volatile & !nontemporal
2493
Tony Tyef16a45e2017-06-06 20:31:59 +00002494 1. buffer/global/flat_load
2495 glc=1
Tony Tye6baa6d22017-10-18 22:16:55 +00002496
2497 - nontemporal
2498
2499 1. buffer/global/flat_load
2500 glc=1 slc=1
2501
Tony Tyef16a45e2017-06-06 20:31:59 +00002502 load *none* *none* - local 1. ds_load
Tony Tye6baa6d22017-10-18 22:16:55 +00002503 store *none* *none* - global - !nontemporal
Tony Tyef16a45e2017-06-06 20:31:59 +00002504 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002505 - private 1. buffer/global/flat_store
2506 - constant
2507 - nontemporal
2508
2509 1. buffer/global/flat_stote
2510 glc=1 slc=1
2511
Tony Tyef16a45e2017-06-06 20:31:59 +00002512 store *none* *none* - local 1. ds_store
2513 **Unordered Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002514 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002515 load atomic unordered *any* *any* *Same as non-atomic*.
2516 store atomic unordered *any* *any* *Same as non-atomic*.
2517 atomicrmw unordered *any* *any* *Same as monotonic
2518 atomic*.
2519 **Monotonic Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002520 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002521 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2522 - wavefront - generic
2523 - workgroup
2524 load atomic monotonic - singlethread - local 1. ds_load
2525 - wavefront
2526 - workgroup
2527 load atomic monotonic - agent - global 1. buffer/global/flat_load
2528 - system - generic glc=1
2529 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2530 - wavefront - generic
2531 - workgroup
2532 - agent
2533 - system
2534 store atomic monotonic - singlethread - local 1. ds_store
2535 - wavefront
2536 - workgroup
2537 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2538 - wavefront - generic
2539 - workgroup
2540 - agent
2541 - system
2542 atomicrmw monotonic - singlethread - local 1. ds_atomic
2543 - wavefront
2544 - workgroup
2545 **Acquire Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002546 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002547 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2548 - wavefront - local
2549 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002550 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2551 load atomic acquire - workgroup - local 1. ds_load
2552 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002553
Tony Tye6baa6d22017-10-18 22:16:55 +00002554 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002555 - Must happen before
2556 any following
2557 global/generic
2558 load/load
2559 atomic/store/store
2560 atomic/atomicrmw.
2561 - Ensures any
2562 following global
2563 data read is no
2564 older than the load
2565 atomic value being
2566 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00002567 load atomic acquire - workgroup - generic 1. flat_load
2568 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002569
Tony Tye6baa6d22017-10-18 22:16:55 +00002570 - If OpenCL, omit.
2571 - Must happen before
2572 any following
2573 global/generic
2574 load/load
2575 atomic/store/store
2576 atomic/atomicrmw.
2577 - Ensures any
2578 following global
2579 data read is no
2580 older than the load
2581 atomic value being
2582 acquired.
2583 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tyef16a45e2017-06-06 20:31:59 +00002584 - system glc=1
2585 2. s_waitcnt vmcnt(0)
2586
2587 - Must happen before
2588 following
2589 buffer_wbinvl1_vol.
2590 - Ensures the load
2591 has completed
2592 before invalidating
2593 the cache.
2594
2595 3. buffer_wbinvl1_vol
2596
2597 - Must happen before
2598 any following
2599 global/generic
2600 load/load
2601 atomic/atomicrmw.
2602 - Ensures that
2603 following
2604 loads will not see
2605 stale global data.
2606
2607 load atomic acquire - agent - generic 1. flat_load glc=1
2608 - system 2. s_waitcnt vmcnt(0) &
2609 lgkmcnt(0)
2610
2611 - If OpenCL omit
2612 lgkmcnt(0).
2613 - Must happen before
2614 following
2615 buffer_wbinvl1_vol.
2616 - Ensures the flat_load
2617 has completed
2618 before invalidating
2619 the cache.
2620
2621 3. buffer_wbinvl1_vol
2622
2623 - Must happen before
2624 any following
2625 global/generic
2626 load/load
2627 atomic/atomicrmw.
2628 - Ensures that
2629 following loads
2630 will not see stale
2631 global data.
2632
2633 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2634 - wavefront - local
2635 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002636 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2637 atomicrmw acquire - workgroup - local 1. ds_atomic
2638 2. waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002639
Tony Tye6baa6d22017-10-18 22:16:55 +00002640 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002641 - Must happen before
2642 any following
2643 global/generic
2644 load/load
2645 atomic/store/store
2646 atomic/atomicrmw.
2647 - Ensures any
2648 following global
2649 data read is no
2650 older than the
2651 atomicrmw value
2652 being acquired.
2653
Tony Tye6baa6d22017-10-18 22:16:55 +00002654 atomicrmw acquire - workgroup - generic 1. flat_atomic
2655 2. waitcnt lgkmcnt(0)
2656
2657 - If OpenCL, omit.
2658 - Must happen before
2659 any following
2660 global/generic
2661 load/load
2662 atomic/store/store
2663 atomic/atomicrmw.
2664 - Ensures any
2665 following global
2666 data read is no
2667 older than the
2668 atomicrmw value
2669 being acquired.
2670
2671 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002672 - system 2. s_waitcnt vmcnt(0)
2673
2674 - Must happen before
2675 following
2676 buffer_wbinvl1_vol.
2677 - Ensures the
2678 atomicrmw has
2679 completed before
2680 invalidating the
2681 cache.
2682
2683 3. buffer_wbinvl1_vol
2684
2685 - Must happen before
2686 any following
2687 global/generic
2688 load/load
2689 atomic/atomicrmw.
2690 - Ensures that
2691 following loads
2692 will not see stale
2693 global data.
2694
2695 atomicrmw acquire - agent - generic 1. flat_atomic
2696 - system 2. s_waitcnt vmcnt(0) &
2697 lgkmcnt(0)
2698
2699 - If OpenCL, omit
2700 lgkmcnt(0).
2701 - Must happen before
2702 following
2703 buffer_wbinvl1_vol.
2704 - Ensures the
2705 atomicrmw has
2706 completed before
2707 invalidating the
2708 cache.
2709
2710 3. buffer_wbinvl1_vol
2711
2712 - Must happen before
2713 any following
2714 global/generic
2715 load/load
2716 atomic/atomicrmw.
2717 - Ensures that
2718 following loads
2719 will not see stale
2720 global data.
2721
2722 fence acquire - singlethread *none* *none*
2723 - wavefront
2724 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2725
2726 - If OpenCL and
2727 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002728 not generic, omit.
2729 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002730 currently has no
2731 address space on
2732 the fence need to
2733 conservatively
2734 always generate. If
2735 fence had an
2736 address space then
2737 set to address
2738 space of OpenCL
2739 fence flag, or to
2740 generic if both
2741 local and global
2742 flags are
2743 specified.
2744 - Must happen after
2745 any preceding
2746 local/generic load
2747 atomic/atomicrmw
2748 with an equal or
2749 wider sync scope
2750 and memory ordering
2751 stronger than
2752 unordered (this is
2753 termed the
2754 fence-paired-atomic).
2755 - Must happen before
2756 any following
2757 global/generic
2758 load/load
2759 atomic/store/store
2760 atomic/atomicrmw.
2761 - Ensures any
2762 following global
2763 data read is no
2764 older than the
2765 value read by the
2766 fence-paired-atomic.
2767
Tony Tye6baa6d22017-10-18 22:16:55 +00002768 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2769 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002770
2771 - If OpenCL and
2772 address space is
2773 not generic, omit
2774 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002775 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002776 currently has no
2777 address space on
2778 the fence need to
2779 conservatively
2780 always generate
2781 (see comment for
2782 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002783 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002784 separate s_waitcnt
2785 vmcnt(0) and
2786 s_waitcnt
2787 lgkmcnt(0) to allow
2788 them to be
2789 independently moved
2790 according to the
2791 following rules.
2792 - s_waitcnt vmcnt(0)
2793 must happen after
2794 any preceding
2795 global/generic load
2796 atomic/atomicrmw
2797 with an equal or
2798 wider sync scope
2799 and memory ordering
2800 stronger than
2801 unordered (this is
2802 termed the
2803 fence-paired-atomic).
2804 - s_waitcnt lgkmcnt(0)
2805 must happen after
2806 any preceding
Tony Tye6baa6d22017-10-18 22:16:55 +00002807 local/generic load
Tony Tyef16a45e2017-06-06 20:31:59 +00002808 atomic/atomicrmw
2809 with an equal or
2810 wider sync scope
2811 and memory ordering
2812 stronger than
2813 unordered (this is
2814 termed the
2815 fence-paired-atomic).
2816 - Must happen before
2817 the following
2818 buffer_wbinvl1_vol.
2819 - Ensures that the
2820 fence-paired atomic
2821 has completed
2822 before invalidating
2823 the
2824 cache. Therefore
2825 any following
2826 locations read must
2827 be no older than
2828 the value read by
2829 the
2830 fence-paired-atomic.
2831
2832 2. buffer_wbinvl1_vol
2833
Tony Tye6baa6d22017-10-18 22:16:55 +00002834 - Must happen before any
2835 following global/generic
Tony Tyef16a45e2017-06-06 20:31:59 +00002836 load/load
2837 atomic/store/store
2838 atomic/atomicrmw.
2839 - Ensures that
2840 following loads
2841 will not see stale
2842 global data.
2843
2844 **Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002845 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002846 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2847 - wavefront - local
2848 - generic
2849 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002850
2851 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002852 - Must happen after
2853 any preceding
2854 local/generic
2855 load/store/load
2856 atomic/store
2857 atomic/atomicrmw.
2858 - Must happen before
2859 the following
2860 store.
2861 - Ensures that all
2862 memory operations
2863 to local have
2864 completed before
2865 performing the
2866 store that is being
2867 released.
2868
2869 2. buffer/global/flat_store
2870 store atomic release - workgroup - local 1. ds_store
Tony Tye6baa6d22017-10-18 22:16:55 +00002871 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2872
2873 - If OpenCL, omit.
2874 - Must happen after
2875 any preceding
2876 local/generic
2877 load/store/load
2878 atomic/store
2879 atomic/atomicrmw.
2880 - Must happen before
2881 the following
2882 store.
2883 - Ensures that all
2884 memory operations
2885 to local have
2886 completed before
2887 performing the
2888 store that is being
2889 released.
2890
2891 2. flat_store
2892 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2893 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002894
2895 - If OpenCL, omit
2896 lgkmcnt(0).
2897 - Could be split into
2898 separate s_waitcnt
2899 vmcnt(0) and
2900 s_waitcnt
2901 lgkmcnt(0) to allow
2902 them to be
2903 independently moved
2904 according to the
2905 following rules.
2906 - s_waitcnt vmcnt(0)
2907 must happen after
2908 any preceding
2909 global/generic
2910 load/store/load
2911 atomic/store
2912 atomic/atomicrmw.
2913 - s_waitcnt lgkmcnt(0)
2914 must happen after
2915 any preceding
2916 local/generic
2917 load/store/load
2918 atomic/store
2919 atomic/atomicrmw.
2920 - Must happen before
2921 the following
2922 store.
2923 - Ensures that all
2924 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00002925 to memory have
Tony Tyef16a45e2017-06-06 20:31:59 +00002926 completed before
2927 performing the
2928 store that is being
2929 released.
2930
2931 2. buffer/global/ds/flat_store
2932 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2933 - wavefront - local
2934 - generic
2935 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002936
2937 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002938 - Must happen after
2939 any preceding
2940 local/generic
2941 load/store/load
2942 atomic/store
2943 atomic/atomicrmw.
2944 - Must happen before
2945 the following
2946 atomicrmw.
2947 - Ensures that all
2948 memory operations
2949 to local have
2950 completed before
2951 performing the
2952 atomicrmw that is
2953 being released.
2954
2955 2. buffer/global/flat_atomic
2956 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye6baa6d22017-10-18 22:16:55 +00002957 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2958
2959 - If OpenCL, omit.
2960 - Must happen after
2961 any preceding
2962 local/generic
2963 load/store/load
2964 atomic/store
2965 atomic/atomicrmw.
2966 - Must happen before
2967 the following
2968 atomicrmw.
2969 - Ensures that all
2970 memory operations
2971 to local have
2972 completed before
2973 performing the
2974 atomicrmw that is
2975 being released.
2976
2977 2. flat_atomic
2978 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
2979 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002980
2981 - If OpenCL, omit
2982 lgkmcnt(0).
2983 - Could be split into
2984 separate s_waitcnt
2985 vmcnt(0) and
2986 s_waitcnt
2987 lgkmcnt(0) to allow
2988 them to be
2989 independently moved
2990 according to the
2991 following rules.
2992 - s_waitcnt vmcnt(0)
2993 must happen after
2994 any preceding
2995 global/generic
2996 load/store/load
2997 atomic/store
2998 atomic/atomicrmw.
2999 - s_waitcnt lgkmcnt(0)
3000 must happen after
3001 any preceding
3002 local/generic
3003 load/store/load
3004 atomic/store
3005 atomic/atomicrmw.
3006 - Must happen before
3007 the following
3008 atomicrmw.
3009 - Ensures that all
3010 memory operations
3011 to global and local
3012 have completed
3013 before performing
3014 the atomicrmw that
3015 is being released.
3016
Tony Tye6baa6d22017-10-18 22:16:55 +00003017 2. buffer/global/ds/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003018 fence release - singlethread *none* *none*
3019 - wavefront
3020 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3021
3022 - If OpenCL and
3023 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003024 not generic, omit.
3025 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003026 currently has no
3027 address space on
3028 the fence need to
3029 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003030 always generate. If
3031 fence had an
3032 address space then
3033 set to address
3034 space of OpenCL
3035 fence flag, or to
3036 generic if both
3037 local and global
3038 flags are
3039 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003040 - Must happen after
3041 any preceding
3042 local/generic
3043 load/load
3044 atomic/store/store
3045 atomic/atomicrmw.
3046 - Must happen before
3047 any following store
3048 atomic/atomicrmw
3049 with an equal or
3050 wider sync scope
3051 and memory ordering
3052 stronger than
3053 unordered (this is
3054 termed the
3055 fence-paired-atomic).
3056 - Ensures that all
3057 memory operations
3058 to local have
3059 completed before
3060 performing the
3061 following
3062 fence-paired-atomic.
3063
Tony Tye6baa6d22017-10-18 22:16:55 +00003064 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3065 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003066
3067 - If OpenCL and
3068 address space is
3069 not generic, omit
3070 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003071 - If OpenCL and
3072 address space is
3073 local, omit
3074 vmcnt(0).
3075 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003076 currently has no
3077 address space on
3078 the fence need to
3079 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003080 always generate. If
3081 fence had an
3082 address space then
3083 set to address
3084 space of OpenCL
3085 fence flag, or to
3086 generic if both
3087 local and global
3088 flags are
3089 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003090 - Could be split into
3091 separate s_waitcnt
3092 vmcnt(0) and
3093 s_waitcnt
3094 lgkmcnt(0) to allow
3095 them to be
3096 independently moved
3097 according to the
3098 following rules.
3099 - s_waitcnt vmcnt(0)
3100 must happen after
3101 any preceding
3102 global/generic
3103 load/store/load
3104 atomic/store
3105 atomic/atomicrmw.
3106 - s_waitcnt lgkmcnt(0)
3107 must happen after
3108 any preceding
3109 local/generic
3110 load/store/load
3111 atomic/store
3112 atomic/atomicrmw.
3113 - Must happen before
3114 any following store
3115 atomic/atomicrmw
3116 with an equal or
3117 wider sync scope
3118 and memory ordering
3119 stronger than
3120 unordered (this is
3121 termed the
3122 fence-paired-atomic).
3123 - Ensures that all
3124 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00003125 have
Tony Tyef16a45e2017-06-06 20:31:59 +00003126 completed before
3127 performing the
3128 following
3129 fence-paired-atomic.
3130
3131 **Acquire-Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003132 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003133 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3134 - wavefront - local
3135 - generic
3136 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3137
Tony Tye6baa6d22017-10-18 22:16:55 +00003138 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003139 - Must happen after
3140 any preceding
3141 local/generic
3142 load/store/load
3143 atomic/store
3144 atomic/atomicrmw.
3145 - Must happen before
3146 the following
3147 atomicrmw.
3148 - Ensures that all
3149 memory operations
3150 to local have
3151 completed before
3152 performing the
3153 atomicrmw that is
3154 being released.
3155
Tony Tye6baa6d22017-10-18 22:16:55 +00003156 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003157 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3158 2. s_waitcnt lgkmcnt(0)
3159
Tony Tye6baa6d22017-10-18 22:16:55 +00003160 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003161 - Must happen before
3162 any following
3163 global/generic
3164 load/load
3165 atomic/store/store
3166 atomic/atomicrmw.
3167 - Ensures any
3168 following global
3169 data read is no
3170 older than the load
3171 atomic value being
3172 acquired.
3173
3174 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3175
Tony Tye6baa6d22017-10-18 22:16:55 +00003176 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003177 - Must happen after
3178 any preceding
3179 local/generic
3180 load/store/load
3181 atomic/store
3182 atomic/atomicrmw.
3183 - Must happen before
3184 the following
3185 atomicrmw.
3186 - Ensures that all
3187 memory operations
3188 to local have
3189 completed before
3190 performing the
3191 atomicrmw that is
3192 being released.
3193
3194 2. flat_atomic
3195 3. s_waitcnt lgkmcnt(0)
3196
Tony Tye6baa6d22017-10-18 22:16:55 +00003197 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003198 - Must happen before
3199 any following
3200 global/generic
3201 load/load
3202 atomic/store/store
3203 atomic/atomicrmw.
3204 - Ensures any
3205 following global
3206 data read is no
3207 older than the load
3208 atomic value being
3209 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00003210
3211 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3212 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003213
3214 - If OpenCL, omit
3215 lgkmcnt(0).
3216 - Could be split into
3217 separate s_waitcnt
3218 vmcnt(0) and
3219 s_waitcnt
3220 lgkmcnt(0) to allow
3221 them to be
3222 independently moved
3223 according to the
3224 following rules.
3225 - s_waitcnt vmcnt(0)
3226 must happen after
3227 any preceding
3228 global/generic
3229 load/store/load
3230 atomic/store
3231 atomic/atomicrmw.
3232 - s_waitcnt lgkmcnt(0)
3233 must happen after
3234 any preceding
3235 local/generic
3236 load/store/load
3237 atomic/store
3238 atomic/atomicrmw.
3239 - Must happen before
3240 the following
3241 atomicrmw.
3242 - Ensures that all
3243 memory operations
3244 to global have
3245 completed before
3246 performing the
3247 atomicrmw that is
3248 being released.
3249
Tony Tye6baa6d22017-10-18 22:16:55 +00003250 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003251 3. s_waitcnt vmcnt(0)
3252
3253 - Must happen before
3254 following
3255 buffer_wbinvl1_vol.
3256 - Ensures the
3257 atomicrmw has
3258 completed before
3259 invalidating the
3260 cache.
3261
3262 4. buffer_wbinvl1_vol
3263
3264 - Must happen before
3265 any following
3266 global/generic
3267 load/load
3268 atomic/atomicrmw.
3269 - Ensures that
3270 following loads
3271 will not see stale
3272 global data.
3273
Tony Tye6baa6d22017-10-18 22:16:55 +00003274 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3275 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003276
3277 - If OpenCL, omit
3278 lgkmcnt(0).
3279 - Could be split into
3280 separate s_waitcnt
3281 vmcnt(0) and
3282 s_waitcnt
3283 lgkmcnt(0) to allow
3284 them to be
3285 independently moved
3286 according to the
3287 following rules.
3288 - s_waitcnt vmcnt(0)
3289 must happen after
3290 any preceding
3291 global/generic
3292 load/store/load
3293 atomic/store
3294 atomic/atomicrmw.
3295 - s_waitcnt lgkmcnt(0)
3296 must happen after
3297 any preceding
3298 local/generic
3299 load/store/load
3300 atomic/store
3301 atomic/atomicrmw.
3302 - Must happen before
3303 the following
3304 atomicrmw.
3305 - Ensures that all
3306 memory operations
3307 to global have
3308 completed before
3309 performing the
3310 atomicrmw that is
3311 being released.
3312
3313 2. flat_atomic
3314 3. s_waitcnt vmcnt(0) &
3315 lgkmcnt(0)
3316
3317 - If OpenCL, omit
3318 lgkmcnt(0).
3319 - Must happen before
3320 following
3321 buffer_wbinvl1_vol.
3322 - Ensures the
3323 atomicrmw has
3324 completed before
3325 invalidating the
3326 cache.
3327
3328 4. buffer_wbinvl1_vol
3329
3330 - Must happen before
3331 any following
3332 global/generic
3333 load/load
3334 atomic/atomicrmw.
3335 - Ensures that
3336 following loads
3337 will not see stale
3338 global data.
3339
3340 fence acq_rel - singlethread *none* *none*
3341 - wavefront
3342 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3343
3344 - If OpenCL and
3345 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003346 not generic, omit.
3347 - However,
Tony Tyef16a45e2017-06-06 20:31:59 +00003348 since LLVM
3349 currently has no
3350 address space on
3351 the fence need to
3352 conservatively
3353 always generate
3354 (see comment for
3355 previous fence).
3356 - Must happen after
3357 any preceding
3358 local/generic
3359 load/load
3360 atomic/store/store
3361 atomic/atomicrmw.
3362 - Must happen before
3363 any following
3364 global/generic
3365 load/load
3366 atomic/store/store
3367 atomic/atomicrmw.
3368 - Ensures that all
3369 memory operations
3370 to local have
3371 completed before
3372 performing any
3373 following global
3374 memory operations.
3375 - Ensures that the
3376 preceding
3377 local/generic load
3378 atomic/atomicrmw
3379 with an equal or
3380 wider sync scope
3381 and memory ordering
3382 stronger than
3383 unordered (this is
3384 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003385 acquire-fence-paired-atomic
3386 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003387 before following
3388 global memory
3389 operations. This
3390 satisfies the
3391 requirements of
3392 acquire.
3393 - Ensures that all
3394 previous memory
3395 operations have
3396 completed before a
3397 following
3398 local/generic store
3399 atomic/atomicrmw
3400 with an equal or
3401 wider sync scope
3402 and memory ordering
3403 stronger than
3404 unordered (this is
3405 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003406 release-fence-paired-atomic
3407 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003408 requirements of
3409 release.
3410
Tony Tye6baa6d22017-10-18 22:16:55 +00003411 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3412 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003413
3414 - If OpenCL and
3415 address space is
3416 not generic, omit
3417 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003418 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003419 currently has no
3420 address space on
3421 the fence need to
3422 conservatively
3423 always generate
3424 (see comment for
3425 previous fence).
3426 - Could be split into
3427 separate s_waitcnt
3428 vmcnt(0) and
3429 s_waitcnt
3430 lgkmcnt(0) to allow
3431 them to be
3432 independently moved
3433 according to the
3434 following rules.
3435 - s_waitcnt vmcnt(0)
3436 must happen after
3437 any preceding
3438 global/generic
3439 load/store/load
3440 atomic/store
3441 atomic/atomicrmw.
3442 - s_waitcnt lgkmcnt(0)
3443 must happen after
3444 any preceding
3445 local/generic
3446 load/store/load
3447 atomic/store
3448 atomic/atomicrmw.
3449 - Must happen before
3450 the following
3451 buffer_wbinvl1_vol.
3452 - Ensures that the
3453 preceding
3454 global/local/generic
3455 load
3456 atomic/atomicrmw
3457 with an equal or
3458 wider sync scope
3459 and memory ordering
3460 stronger than
3461 unordered (this is
3462 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003463 acquire-fence-paired-atomic
3464 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003465 before invalidating
3466 the cache. This
3467 satisfies the
3468 requirements of
3469 acquire.
3470 - Ensures that all
3471 previous memory
3472 operations have
3473 completed before a
3474 following
3475 global/local/generic
3476 store
3477 atomic/atomicrmw
3478 with an equal or
3479 wider sync scope
3480 and memory ordering
3481 stronger than
3482 unordered (this is
3483 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003484 release-fence-paired-atomic
3485 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003486 requirements of
3487 release.
3488
3489 2. buffer_wbinvl1_vol
3490
3491 - Must happen before
3492 any following
3493 global/generic
3494 load/load
3495 atomic/store/store
3496 atomic/atomicrmw.
3497 - Ensures that
3498 following loads
3499 will not see stale
3500 global data. This
3501 satisfies the
3502 requirements of
3503 acquire.
3504
3505 **Sequential Consistent Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003506 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003507 load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003508 - wavefront - local load atomic acquire,
3509 - generic except must generated
3510 all instructions even
3511 for OpenCL.*
3512 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3513 - generic
3514 - Must
3515 happen after
3516 preceding
3517 global/generic load
3518 atomic/store
3519 atomic/atomicrmw
3520 with memory
3521 ordering of seq_cst
3522 and with equal or
3523 wider sync scope.
3524 (Note that seq_cst
3525 fences have their
3526 own s_waitcnt
3527 lgkmcnt(0) and so do
3528 not need to be
3529 considered.)
3530 - Ensures any
3531 preceding
3532 sequential
3533 consistent local
3534 memory instructions
3535 have completed
3536 before executing
3537 this sequentially
3538 consistent
3539 instruction. This
3540 prevents reordering
3541 a seq_cst store
3542 followed by a
3543 seq_cst load. (Note
3544 that seq_cst is
3545 stronger than
3546 acquire/release as
3547 the reordering of
3548 load acquire
3549 followed by a store
3550 release is
3551 prevented by the
3552 waitcnt of
3553 the release, but
3554 there is nothing
3555 preventing a store
3556 release followed by
3557 load acquire from
3558 competing out of
3559 order.)
3560
3561 2. *Following
3562 instructions same as
3563 corresponding load
3564 atomic acquire,
3565 except must generated
3566 all instructions even
3567 for OpenCL.*
3568 load atomic seq_cst - workgroup - local *Same as corresponding
3569 load atomic acquire,
3570 except must generated
3571 all instructions even
3572 for OpenCL.*
3573 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3574 - system - generic vmcnt(0)
3575
3576 - Could be split into
3577 separate s_waitcnt
3578 vmcnt(0)
3579 and s_waitcnt
3580 lgkmcnt(0) to allow
3581 them to be
3582 independently moved
3583 according to the
3584 following rules.
3585 - waitcnt lgkmcnt(0)
3586 must happen after
3587 preceding
3588 global/generic load
3589 atomic/store
3590 atomic/atomicrmw
3591 with memory
3592 ordering of seq_cst
3593 and with equal or
3594 wider sync scope.
3595 (Note that seq_cst
3596 fences have their
3597 own s_waitcnt
3598 lgkmcnt(0) and so do
3599 not need to be
3600 considered.)
3601 - waitcnt vmcnt(0)
3602 must happen after
Tony Tyef16a45e2017-06-06 20:31:59 +00003603 preceding
3604 global/generic load
3605 atomic/store
3606 atomic/atomicrmw
3607 with memory
3608 ordering of seq_cst
3609 and with equal or
3610 wider sync scope.
3611 (Note that seq_cst
3612 fences have their
3613 own s_waitcnt
3614 vmcnt(0) and so do
3615 not need to be
3616 considered.)
3617 - Ensures any
3618 preceding
3619 sequential
3620 consistent global
3621 memory instructions
3622 have completed
3623 before executing
3624 this sequentially
3625 consistent
3626 instruction. This
3627 prevents reordering
3628 a seq_cst store
3629 followed by a
Tony Tye6baa6d22017-10-18 22:16:55 +00003630 seq_cst load. (Note
Tony Tyef16a45e2017-06-06 20:31:59 +00003631 that seq_cst is
3632 stronger than
3633 acquire/release as
3634 the reordering of
3635 load acquire
3636 followed by a store
3637 release is
3638 prevented by the
Tony Tye6baa6d22017-10-18 22:16:55 +00003639 waitcnt of
Tony Tyef16a45e2017-06-06 20:31:59 +00003640 the release, but
3641 there is nothing
3642 preventing a store
3643 release followed by
3644 load acquire from
3645 competing out of
3646 order.)
3647
3648 2. *Following
3649 instructions same as
3650 corresponding load
Tony Tye6baa6d22017-10-18 22:16:55 +00003651 atomic acquire,
3652 except must generated
3653 all instructions even
3654 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003655 store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003656 - wavefront - local store atomic release,
3657 - workgroup - generic except must generated
3658 all instructions even
3659 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003660 store atomic seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003661 - system - generic store atomic release,
3662 except must generated
3663 all instructions even
3664 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003665 atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003666 - wavefront - local atomicrmw acq_rel,
3667 - workgroup - generic except must generated
3668 all instructions even
3669 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003670 atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003671 - system - generic atomicrmw acq_rel,
3672 except must generated
3673 all instructions even
3674 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003675 fence seq_cst - singlethread *none* *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003676 - wavefront fence acq_rel,
3677 - workgroup except must generated
3678 - agent all instructions even
3679 - system for OpenCL.*
3680 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00003681
3682The memory order also adds the single thread optimization constrains defined in
3683table
3684:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3685
3686 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3687 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3688
3689 ============ ==============================================================
3690 LLVM Memory Optimization Constraints
3691 Ordering
3692 ============ ==============================================================
3693 unordered *none*
3694 monotonic *none*
3695 acquire - If a load atomic/atomicrmw then no following load/load
3696 atomic/store/ store atomic/atomicrmw/fence instruction can
3697 be moved before the acquire.
3698 - If a fence then same as load atomic, plus no preceding
3699 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003700 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003701 atomic/store/ store atomic/atomicrmw/fence instruction can
3702 be moved after the release.
3703 - If a fence then same as store atomic, plus no following
3704 associated fence-paired-atomic can be moved before the
3705 fence.
3706 acq_rel Same constraints as both acquire and release.
3707 seq_cst - If a load atomic then same constraints as acquire, plus no
3708 preceding sequentially consistent load atomic/store
3709 atomic/atomicrmw/fence instruction can be moved after the
3710 seq_cst.
3711 - If a store atomic then the same constraints as release, plus
3712 no following sequentially consistent load atomic/store
3713 atomic/atomicrmw/fence instruction can be moved before the
3714 seq_cst.
3715 - If an atomicrmw/fence then same constraints as acq_rel.
3716 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003717
Wei Ding16289cf2017-02-21 18:48:01 +00003718Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003719~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003720
Tony Tyef16a45e2017-06-06 20:31:59 +00003721For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3722(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3723the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003724
Tony Tyef16a45e2017-06-06 20:31:59 +00003725 .. table:: AMDGPU Trap Handler for AMDHSA OS
3726 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003727
Tony Tyef16a45e2017-06-06 20:31:59 +00003728 =================== =============== =============== =======================
3729 Usage Code Sequence Trap Handler Description
3730 Inputs
3731 =================== =============== =============== =======================
3732 reserved ``s_trap 0x00`` Reserved by hardware.
3733 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3734 ``queue_ptr`` ``debugtrap``
3735 ``VGPR0``: intrinsic (not
3736 ``arg`` implemented).
3737 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3738 ``queue_ptr`` terminated and its
3739 associated queue put
3740 into the error state.
3741 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3742 ``queue_ptr`` installed handled
3743 same as ``llvm.trap``.
3744 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3745 breakpoints.
3746 debugger ``s_trap 0x08`` Reserved for debugger.
3747 debugger ``s_trap 0xfe`` Reserved for debugger.
3748 debugger ``s_trap 0xff`` Reserved for debugger.
3749 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003750
Tony Tye46d35762017-08-15 20:47:41 +00003751Unspecified OS
3752--------------
3753
3754This section provides code conventions used when the target triple OS is
3755empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003756
3757Trap Handler ABI
3758~~~~~~~~~~~~~~~~
3759
3760For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3761not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3762instructions are handled as follows:
3763
3764 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3765 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3766
3767 =============== =============== ===========================================
3768 Usage Code Sequence Description
3769 =============== =============== ===========================================
3770 llvm.trap s_endpgm Causes wavefront to be terminated.
3771 llvm.debugtrap *none* Compiler warning given that there is no
3772 trap handler installed.
3773 =============== =============== ===========================================
3774
3775Source Languages
3776================
3777
3778.. _amdgpu-opencl:
3779
3780OpenCL
3781------
3782
3783When generating code for the OpenCL language the target triple environment
3784should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
3785
3786When the language is OpenCL the following differences occur:
3787
37881. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
37892. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye46d35762017-08-15 20:47:41 +000037903. Additional metadata is generated
3791 (:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003792
3793.. TODO
3794 Specify what affect this has. Hidden arguments added. Additional metadata
3795 generated.
3796
3797.. _amdgpu-hcc:
3798
3799HCC
3800---
3801
3802When generating code for the OpenCL language the target triple environment
3803should be ``hcc`` (see :ref:`amdgpu-target-triples`).
3804
3805When the language is OpenCL the following differences occur:
3806
38071. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3808
3809.. TODO
3810 Specify what affect this has.
Tom Stellard3ec09e62016-04-06 01:29:19 +00003811
Tom Stellard45bb48e2015-06-13 03:28:10 +00003812Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003813---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003814
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003815AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef59d0712017-11-10 20:51:43 +00003816It supports AMDGCN GFX6-GFX9.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003817
Tony Tyef16a45e2017-06-06 20:31:59 +00003818This section describes general syntax for instructions and operands. For more
3819information about instructions, their semantics and supported combinations of
3820operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00003821[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003822
Tony Tyef16a45e2017-06-06 20:31:59 +00003823An instruction has the following syntax (register operands are normally
3824comma-separated while extra operands are space-separated):
Tom Stellard45bb48e2015-06-13 03:28:10 +00003825
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003826*<opcode> <register_operand0>, ... <extra_operand0> ...*
Tom Stellard45bb48e2015-06-13 03:28:10 +00003827
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003828Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00003829~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00003830
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003831The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003832
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003833* SGPR registers: s0, ... or s[0], ...
3834* VGPR registers: v0, ... or v[0], ...
3835* TTMP registers: ttmp0, ... or ttmp[0], ...
3836* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
3837* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
3838* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
3839* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
3840* Register index expressions: v[2*2], s[1-1:2-1]
3841* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00003842
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003843The following extra operands are supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003844
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003845* offset, offset0, offset1
3846* idxen, offen bits
3847* glc, slc, tfe bits
3848* waitcnt: integer or combination of counter values
3849* VOP3 modifiers:
Tom Stellard45bb48e2015-06-13 03:28:10 +00003850
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003851 - abs (\| \|), neg (\-)
Tom Stellard45bb48e2015-06-13 03:28:10 +00003852
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003853* DPP modifiers:
3854
3855 - row_shl, row_shr, row_ror, row_rol
3856 - row_mirror, row_half_mirror, row_bcast
3857 - wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
3858 - row_mask, bank_mask, bound_ctrl
3859
3860* SDWA modifiers:
3861
3862 - dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
3863 - dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
3864 - abs, neg, sext
3865
Tony Tyef16a45e2017-06-06 20:31:59 +00003866Instruction Examples
3867~~~~~~~~~~~~~~~~~~~~
3868
3869DS
3870~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003871
3872.. code-block:: nasm
3873
3874 ds_add_u32 v2, v4 offset:16
3875 ds_write_src2_b64 v2 offset0:4 offset1:8
3876 ds_cmpst_f32 v2, v4, v6
3877 ds_min_rtn_f64 v[8:9], v2, v[4:5]
3878
3879
3880For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
3881
Tony Tyef16a45e2017-06-06 20:31:59 +00003882FLAT
3883++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003884
3885.. code-block:: nasm
3886
3887 flat_load_dword v1, v[3:4]
3888 flat_store_dwordx3 v[3:4], v[5:7]
3889 flat_atomic_swap v1, v[3:4], v5 glc
3890 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
3891 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
3892
3893For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
3894
Tony Tyef16a45e2017-06-06 20:31:59 +00003895MUBUF
3896+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003897
3898.. code-block:: nasm
3899
3900 buffer_load_dword v1, off, s[4:7], s1
3901 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
3902 buffer_store_format_xy v[1:2], off, s[4:7], s1
3903 buffer_wbinvl1
3904 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
3905
3906For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
3907
Tony Tyef16a45e2017-06-06 20:31:59 +00003908SMRD/SMEM
3909+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003910
3911.. code-block:: nasm
3912
3913 s_load_dword s1, s[2:3], 0xfc
3914 s_load_dwordx8 s[8:15], s[2:3], s4
3915 s_load_dwordx16 s[88:103], s[2:3], s4
3916 s_dcache_inv_vol
3917 s_memtime s[4:5]
3918
3919For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
3920
Tony Tyef16a45e2017-06-06 20:31:59 +00003921SOP1
3922++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003923
3924.. code-block:: nasm
3925
3926 s_mov_b32 s1, s2
3927 s_mov_b64 s[0:1], 0x80000000
3928 s_cmov_b32 s1, 200
3929 s_wqm_b64 s[2:3], s[4:5]
3930 s_bcnt0_i32_b64 s1, s[2:3]
3931 s_swappc_b64 s[2:3], s[4:5]
3932 s_cbranch_join s[4:5]
3933
3934For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
3935
Tony Tyef16a45e2017-06-06 20:31:59 +00003936SOP2
3937++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003938
3939.. code-block:: nasm
3940
3941 s_add_u32 s1, s2, s3
3942 s_and_b64 s[2:3], s[4:5], s[6:7]
3943 s_cselect_b32 s1, s2, s3
3944 s_andn2_b32 s2, s4, s6
3945 s_lshr_b64 s[2:3], s[4:5], s6
3946 s_ashr_i32 s2, s4, s6
3947 s_bfm_b64 s[2:3], s4, s6
3948 s_bfe_i64 s[2:3], s[4:5], s6
3949 s_cbranch_g_fork s[4:5], s[6:7]
3950
3951For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
3952
Tony Tyef16a45e2017-06-06 20:31:59 +00003953SOPC
3954++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003955
3956.. code-block:: nasm
3957
3958 s_cmp_eq_i32 s1, s2
3959 s_bitcmp1_b32 s1, s2
3960 s_bitcmp0_b64 s[2:3], s4
3961 s_setvskip s3, s5
3962
3963For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
3964
Tony Tyef16a45e2017-06-06 20:31:59 +00003965SOPP
3966++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003967
3968.. code-block:: nasm
3969
3970 s_barrier
3971 s_nop 2
3972 s_endpgm
3973 s_waitcnt 0 ; Wait for all counters to be 0
3974 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
3975 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
3976 s_sethalt 9
3977 s_sleep 10
3978 s_sendmsg 0x1
3979 s_sendmsg sendmsg(MSG_INTERRUPT)
3980 s_trap 1
3981
3982For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
3983
3984Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00003985of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00003986range or acceptable values.
3987
Tony Tyef16a45e2017-06-06 20:31:59 +00003988VALU
3989++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00003990
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003991For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
3992the assembler will automatically use optimal encoding based on its operands.
3993To force specific encoding, one can add a suffix to the opcode of the instruction:
3994
3995* _e32 for 32-bit VOP1/VOP2/VOPC
3996* _e64 for 64-bit VOP3
3997* _dpp for VOP_DPP
3998* _sdwa for VOP_SDWA
3999
4000VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004001
4002.. code-block:: nasm
4003
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004004 v_mov_b32 v1, v2
4005 v_mov_b32_e32 v1, v2
4006 v_nop
4007 v_cvt_f64_i32_e32 v[1:2], v2
4008 v_floor_f32_e32 v1, v2
4009 v_bfrev_b32_e32 v1, v2
4010 v_add_f32_e32 v1, v2, v3
4011 v_mul_i32_i24_e64 v1, v2, 3
4012 v_mul_i32_i24_e32 v1, -3, v3
4013 v_mul_i32_i24_e32 v1, -100, v3
4014 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4015 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00004016
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004017VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004018
4019.. code-block:: nasm
4020
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004021 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4022 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4023 v_mov_b32 v0, v0 wave_shl:1
4024 v_mov_b32 v0, v0 row_mirror
4025 v_mov_b32 v0, v0 row_bcast:31
4026 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4027 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4028 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00004029
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004030VOP_SDWA examples:
4031
4032.. code-block:: nasm
4033
4034 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4035 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4036 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4037 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4038 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4039
4040For full list of supported instructions, refer to "Vector ALU instructions".
4041
4042HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00004043~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004044
4045AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4046one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00004047
4048.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00004049+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00004050
4051*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004052object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004053
4054.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00004055+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4056
Tom Stellard347ac792015-06-26 21:15:07 +00004057
4058*major*, *minor*, and *stepping* are all integers that describe the instruction
4059set architecture (ISA) version of the assembly program.
4060
4061*vendor* and *arch* are quoted strings. *vendor* should always be equal to
4062"AMD" and *arch* should always be equal to "AMDGPU".
4063
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004064By default, the assembler will derive the ISA version, *vendor*, and *arch*
4065from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004066
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004067.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00004068+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004069
4070This directives specifies that the symbol with given name is a kernel entry point
4071(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00004072
4073.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00004074++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00004075
4076This directive marks the beginning of a list of key / value pairs that are used
4077to specify the amd_kernel_code_t object that will be emitted by the assembler.
4078The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4079any amd_kernel_code_t values that are unspecified a default value will be
4080used. The default value for all keys is 0, with the following exceptions:
4081
4082- *kernel_code_version_major* defaults to 1.
4083- *machine_kind* defaults to 1.
4084- *machine_version_major*, *machine_version_minor*, and
4085 *machine_version_stepping* are derived from the value of the -mcpu option
4086 that is passed to the assembler.
4087- *kernel_code_entry_byte_offset* defaults to 256.
4088- *wavefront_size* defaults to 6.
4089- *kernarg_segment_alignment*, *group_segment_alignment*, and
Tony Tye6baa6d22017-10-18 22:16:55 +00004090 *private_segment_alignment* default to 4. Note that alignments are specified
Tom Stellardff7416b2015-06-26 21:58:31 +00004091 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4092
4093The *.amd_kernel_code_t* directive must be placed immediately after the
4094function label and before any instructions.
4095
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004096For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4097comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00004098
4099Here is an example of a minimal amd_kernel_code_t specification:
4100
Aaron Ballman887ad0e2016-07-19 17:46:55 +00004101.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00004102
4103 .hsa_code_object_version 1,0
4104 .hsa_code_object_isa
4105
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004106 .hsatext
4107 .globl hello_world
4108 .p2align 8
4109 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00004110
4111 hello_world:
4112
4113 .amd_kernel_code_t
4114 enable_sgpr_kernarg_segment_ptr = 1
4115 is_ptr64 = 1
4116 compute_pgm_rsrc1_vgprs = 0
4117 compute_pgm_rsrc1_sgprs = 0
4118 compute_pgm_rsrc2_user_sgpr = 2
4119 kernarg_segment_byte_size = 8
4120 wavefront_sgpr_count = 2
4121 workitem_vgpr_count = 3
4122 .end_amd_kernel_code_t
4123
4124 s_load_dwordx2 s[0:1], s[0:1] 0x0
4125 v_mov_b32 v0, 3.14159
4126 s_waitcnt lgkmcnt(0)
4127 v_mov_b32 v1, s0
4128 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004129 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00004130 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00004131 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004132 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00004133
4134Additional Documentation
4135========================
4136
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004137.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4138.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4139.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4140.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4141.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4142.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4143.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4144.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004145.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
4146.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
4147.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4148.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4149.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4150.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4151.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00004152.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004153.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4154.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
4155.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__