blob: d6255c8a15eaa60d155872d2f2062a2b637bf82e [file] [log] [blame]
Eugene Zelenko3507b042018-03-21 17:09:35 +00001=============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
Tony Tye07d9f102017-11-10 01:00:54 +000026 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
Tony Tyef16a45e2017-06-06 20:31:59 +000028
Tony Tye07d9f102017-11-10 01:00:54 +000029 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000035
Tony Tye07d9f102017-11-10 01:00:54 +000036 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000038
Tony Tye07d9f102017-11-10 01:00:54 +000039 ============ ==============================================================
40 Vendor Description
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000045
Tony Tye07d9f102017-11-10 01:00:54 +000046 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
Tony Tyef16a45e2017-06-06 20:31:59 +000048
Tony Tye07d9f102017-11-10 01:00:54 +000049 ============== ============================================================
50 OS Description
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
56 runtime.
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
58 runtime.
59 ============== ============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000060
Tony Tye07d9f102017-11-10 01:00:54 +000061 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
Tony Tyef16a45e2017-06-06 20:31:59 +000063
Tony Tye07d9f102017-11-10 01:00:54 +000064 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
Tony Tye7a893d42018-03-23 18:45:18 +000067 *<empty>* Default.
Tony Tye07d9f102017-11-10 01:00:54 +000068 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000069
70.. _amdgpu-processors:
71
72Processors
73----------
74
75Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
76names from both the *Processor* and *Alternative Processor* can be used.
77
78 .. table:: AMDGPU Processors
Tony Tye07d9f102017-11-10 01:00:54 +000079 :name: amdgpu-processor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000080
Tony Tye31105cc2017-12-11 15:35:27 +000081 =========== =============== ============ ===== ========= ======= ==================
82 Processor Alternative Target dGPU/ Target ROCm Example
83 Processor Triple APU Features Support Products
84 Architecture Supported
85 [Default]
86 =========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000087 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Tony Tye31105cc2017-12-11 15:35:27 +000088 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +000089 ``r600`` ``r600`` dGPU
90 ``r630`` ``r600`` dGPU
91 ``rs880`` ``r600`` dGPU
92 ``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000093 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Tony Tye31105cc2017-12-11 15:35:27 +000094 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +000095 ``rv710`` ``r600`` dGPU
96 ``rv730`` ``r600`` dGPU
97 ``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000098 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Tony Tye31105cc2017-12-11 15:35:27 +000099 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000100 ``cedar`` ``r600`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000101 ``cypress`` ``r600`` dGPU
102 ``juniper`` ``r600`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000103 ``redwood`` ``r600`` dGPU
104 ``sumo`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000106 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000107 ``barts`` ``r600`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000108 ``caicos`` ``r600`` dGPU
109 ``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000110 ``turks`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Tony Tye31105cc2017-12-11 15:35:27 +0000112 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000113 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000114 ``gfx601`` - ``hainan`` ``amdgcn`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000115 - ``oland``
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000116 - ``pitcairn``
117 - ``verde``
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Tony Tye31105cc2017-12-11 15:35:27 +0000119 -----------------------------------------------------------------------------------
120 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
121 - A6 Pro-7050B
122 - A8-7100
123 - A8 Pro-7150B
124 - A10-7300
125 - A10 Pro-7350B
126 - FX-7500
127 - A8-7200P
128 - A10-7400P
129 - FX-7600P
130 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
131 - FirePro W9100
132 - FirePro S9150
133 - FirePro S9170
134 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
135 - Radeon R9 290x
136 - Radeon R390
137 - Radeon R390x
138 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
139 - ``mullins`` - E1-2200
140 - E1-2500
141 - E2-3000
142 - E2-3800
143 - A4-5000
144 - A4-5100
145 - A6-5200
146 - A4 Pro-3340B
147 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
148 - Radeon HD 8770
149 - R7 260
150 - R7 260X
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Tony Tye31105cc2017-12-11 15:35:27 +0000152 -----------------------------------------------------------------------------------
Tony Tye31105cc2017-12-11 15:35:27 +0000153 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
154 [on] - Pro A6-8500B
155 - A8-8600P
156 - Pro A8-8600B
157 - FX-8800P
158 - Pro A12-8800B
159 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
160 [on] - Pro A10-8700B
161 - A10-8780P
162 \ ``amdgcn`` APU - xnack - A10-9600P
163 [on] - A10-9630P
164 - A12-9700P
165 - A12-9730P
166 - FX-9800P
167 - FX-9830P
168 \ ``amdgcn`` APU - xnack - E2-9010
169 [on] - A6-9210
170 - A9-9410
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000171 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
172 - ``tonga`` [off] - FirePro S7100
Tony Tye31105cc2017-12-11 15:35:27 +0000173 - FirePro W7100
174 - Radeon R285
175 - Radeon R9 380
176 - Radeon R9 385
177 - Mobile FirePro
178 M7170
179 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
180 [off] - Radeon R9 Fury
181 - Radeon R9 FuryX
182 - Radeon Pro Duo
183 - FirePro S9300x2
184 - Radeon Instinct MI8
185 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
186 [off] - Radeon RX 480
187 - Radeon Instinct MI6
188 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
189 [off]
190 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
191 [on]
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000192 **GCN GFX9** [AMD-GCN-GFX9]_
Tony Tye31105cc2017-12-11 15:35:27 +0000193 -----------------------------------------------------------------------------------
194 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
195 [off] Frontier Edition
196 - Radeon RX Vega 56
197 - Radeon RX Vega 64
198 - Radeon RX Vega 64
199 Liquid
200 - Radeon Instinct MI25
201 ``gfx902`` ``amdgcn`` APU - xnack *TBA*
202 [on]
203 .. TODO
204 Add product
205 names.
206 =========== =============== ============ ===== ========= ======= ==================
Tony Tye07d9f102017-11-10 01:00:54 +0000207
208.. _amdgpu-target-features:
209
210Target Features
211---------------
212
213Target features control how code is generated to support certain
Tony Tye31105cc2017-12-11 15:35:27 +0000214processor specific features. Not all target features are supported by
215all processors. The runtime must ensure that the features supported by
216the device used to execute the code match the features enabled when
217generating the code. A mismatch of features may result in incorrect
218execution, or a reduction in performance.
219
220The target features supported by each processor, and the default value
221used if not specified explicitly, is listed in
222:ref:`amdgpu-processor-table`.
Tony Tye07d9f102017-11-10 01:00:54 +0000223
224Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
225target features.
226
227For example:
228
229``-mxnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000230 Enable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000231``-mno-xnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000232 Disable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000233
234 .. table:: AMDGPU Target Features
235 :name: amdgpu-target-feature-table
236
Tony Tye31105cc2017-12-11 15:35:27 +0000237 ============== ==================================================
238 Target Feature Description
239 ============== ==================================================
240 -m[no-]xnack Enable/disable generating code that has
241 memory clauses that are compatible with
242 having XNACK replay enabled.
Tony Tye07d9f102017-11-10 01:00:54 +0000243
Tony Tye31105cc2017-12-11 15:35:27 +0000244 This is used for demand paging and page
245 migration. If XNACK replay is enabled in
246 the device, then if a page fault occurs
247 the code may execute incorrectly if the
248 ``xnack`` feature is not enabled. Executing
249 code that has the feature enabled on a
250 device that does not have XNACK replay
251 enabled will execute correctly, but may
252 be less performant than code with the
253 feature disabled.
254 ============== ==================================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000255
256.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000257
258Address Spaces
259--------------
260
Tony Tyef16a45e2017-06-06 20:31:59 +0000261The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000262
Tony Tyef16a45e2017-06-06 20:31:59 +0000263The memory space names used in the table, aside from the region memory space, is
264from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000265
Tony Tyef16a45e2017-06-06 20:31:59 +0000266LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000267
Tony Tyef16a45e2017-06-06 20:31:59 +0000268 .. table:: Address Space Mapping
269 :name: amdgpu-address-space-mapping-table
270
Yaxun Liu0124b542018-02-13 18:00:25 +0000271 ================== =================
Tony Tyef16a45e2017-06-06 20:31:59 +0000272 LLVM Address Space Memory Space
Yaxun Liu0124b542018-02-13 18:00:25 +0000273 ================== =================
274 0 Generic (Flat)
275 1 Global
276 2 Region (GDS)
277 3 Local (group/LDS)
278 4 Constant
279 5 Private (Scratch)
280 6 Constant 32-bit
281 ================== =================
Tony Tyef16a45e2017-06-06 20:31:59 +0000282
283.. _amdgpu-memory-scopes:
284
285Memory Scopes
286-------------
287
288This section provides LLVM memory synchronization scopes supported by the AMDGPU
289backend memory model when the target triple OS is ``amdhsa`` (see
290:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
291
292The memory model supported is based on the HSA memory model [HSA]_ which is
293based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
294relation is transitive over the synchonizes-with relation independent of scope,
295and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye07d9f102017-11-10 01:00:54 +0000296table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000297
298This is different to the OpenCL [OpenCL]_ memory model which does not have scope
299inclusion and requires the memory scopes to exactly match. However, this
300is conservatively correct for OpenCL.
301
Tony Tye07d9f102017-11-10 01:00:54 +0000302 .. table:: AMDHSA LLVM Sync Scopes
303 :name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tyef16a45e2017-06-06 20:31:59 +0000304
305 ================ ==========================================================
306 LLVM Sync Scope Description
307 ================ ==========================================================
308 *none* The default: ``system``.
309
310 Synchronizes with, and participates in modification and
311 seq_cst total orderings with, other operations (except
312 image operations) for all address spaces (except private,
313 or generic that accesses private) provided the other
314 operation's sync scope is:
315
316 - ``system``.
317 - ``agent`` and executed by a thread on the same agent.
318 - ``workgroup`` and executed by a thread in the same
319 workgroup.
320 - ``wavefront`` and executed by a thread in the same
321 wavefront.
322
323 ``agent`` Synchronizes with, and participates in modification and
324 seq_cst total orderings with, other operations (except
325 image operations) for all address spaces (except private,
326 or generic that accesses private) provided the other
327 operation's sync scope is:
328
329 - ``system`` or ``agent`` and executed by a thread on the
330 same agent.
331 - ``workgroup`` and executed by a thread in the same
332 workgroup.
333 - ``wavefront`` and executed by a thread in the same
334 wavefront.
335
336 ``workgroup`` Synchronizes with, and participates in modification and
337 seq_cst total orderings with, other operations (except
338 image operations) for all address spaces (except private,
339 or generic that accesses private) provided the other
340 operation's sync scope is:
341
342 - ``system``, ``agent`` or ``workgroup`` and executed by a
343 thread in the same workgroup.
344 - ``wavefront`` and executed by a thread in the same
345 wavefront.
346
347 ``wavefront`` Synchronizes with, and participates in modification and
348 seq_cst total orderings with, other operations (except
349 image operations) for all address spaces (except private,
350 or generic that accesses private) provided the other
351 operation's sync scope is:
352
353 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
354 and executed by a thread in the same wavefront.
355
356 ``singlethread`` Only synchronizes with, and participates in modification
357 and seq_cst total orderings with, other operations (except
358 image operations) running in the same thread for all
359 address spaces (for example, in signal handlers).
360 ================ ==========================================================
361
362AMDGPU Intrinsics
363-----------------
364
365The AMDGPU backend implements the following intrinsics.
366
367*This section is WIP.*
368
369.. TODO
370 List AMDGPU intrinsics
371
372Code Object
373===========
374
375The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
376can be linked by ``lld`` to produce a standard ELF shared code object which can
377be loaded and executed on an AMDGPU target.
378
379Header
380------
381
382The AMDGPU backend uses the following ELF header:
383
384 .. table:: AMDGPU ELF Header
385 :name: amdgpu-elf-header-table
386
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000387 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000388 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000389 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000390 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
391 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye07d9f102017-11-10 01:00:54 +0000392 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
393 - ``ELFOSABI_AMDGPU_HSA``
394 - ``ELFOSABI_AMDGPU_PAL``
395 - ``ELFOSABI_AMDGPU_MESA3D``
396 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
397 - ``ELFABIVERSION_AMDGPU_PAL``
398 - ``ELFABIVERSION_AMDGPU_MESA3D``
399 ``e_type`` - ``ET_REL``
400 - ``ET_DYN``
Tony Tyef16a45e2017-06-06 20:31:59 +0000401 ``e_machine`` ``EM_AMDGPU``
402 ``e_entry`` 0
Tony Tye07d9f102017-11-10 01:00:54 +0000403 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000404 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000405
406..
407
408 .. table:: AMDGPU ELF Header Enumeration Values
409 :name: amdgpu-elf-header-enumeration-values-table
410
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000411 =============================== =====
412 Name Value
413 =============================== =====
414 ``EM_AMDGPU`` 224
Tony Tye07d9f102017-11-10 01:00:54 +0000415 ``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000416 ``ELFOSABI_AMDGPU_HSA`` 64
417 ``ELFOSABI_AMDGPU_PAL`` 65
418 ``ELFOSABI_AMDGPU_MESA3D`` 66
419 ``ELFABIVERSION_AMDGPU_HSA`` 1
420 ``ELFABIVERSION_AMDGPU_PAL`` 0
421 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
422 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000423
424``e_ident[EI_CLASS]``
Tony Tye07d9f102017-11-10 01:00:54 +0000425 The ELF class is:
426
427 * ``ELFCLASS32`` for ``r600`` architecture.
428
429 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
430 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000431
432``e_ident[EI_DATA]``
Tony Tye07d9f102017-11-10 01:00:54 +0000433 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tyef16a45e2017-06-06 20:31:59 +0000434
435``e_ident[EI_OSABI]``
Tony Tye07d9f102017-11-10 01:00:54 +0000436 One of the following AMD GPU architecture specific OS ABIs
437 (see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000438
Tony Tye07d9f102017-11-10 01:00:54 +0000439 * ``ELFOSABI_NONE`` for *unknown* OS.
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000440
Tony Tye07d9f102017-11-10 01:00:54 +0000441 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tyef16a45e2017-06-06 20:31:59 +0000442
Tony Tye07d9f102017-11-10 01:00:54 +0000443 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
444
445 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000446
Tony Tyef16a45e2017-06-06 20:31:59 +0000447``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000448 The ABI version of the AMD GPU architecture specific OS ABI to which the code
449 object conforms:
450
451 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
452 runtime ABI.
453
454 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
455 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000456
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000457 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye07d9f102017-11-10 01:00:54 +0000458 3D runtime ABI.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000459
Tony Tyef16a45e2017-06-06 20:31:59 +0000460``e_type``
461 Can be one of the following values:
462
463
464 ``ET_REL``
465 The type produced by the AMD GPU backend compiler as it is relocatable code
466 object.
467
468 ``ET_DYN``
469 The type produced by the linker as it is a shared code object.
470
471 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
472
473``e_machine``
Tony Tye07d9f102017-11-10 01:00:54 +0000474 The value ``EM_AMDGPU`` is used for the machine for all processors supported
475 by the ``r600`` and ``amdgcn`` architectures (see
476 :ref:`amdgpu-processor-table`). The specific processor is specified in the
477 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
478 :ref:`amdgpu-elf-header-e_flags-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000479
480``e_entry``
481 The entry point is 0 as the entry points for individual kernels must be
482 selected in order to invoke them through AQL packets.
483
484``e_flags``
Tony Tye07d9f102017-11-10 01:00:54 +0000485 The AMDGPU backend uses the following ELF header flags:
486
487 .. table:: AMDGPU ELF Header ``e_flags``
488 :name: amdgpu-elf-header-e_flags-table
489
490 ================================= ========== =============================
491 Name Value Description
492 ================================= ========== =============================
493 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
494 -------------------------------------------- -----------------------------
495 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
496 mask for
497 ``EF_AMDGPU_MACH_xxx`` values
498 defined in
499 :ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye31105cc2017-12-11 15:35:27 +0000500 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
501 target feature is
502 enabled for all code
503 contained in the code object.
Tony Tye5bbcca62018-03-08 05:46:01 +0000504 If the processor
505 does not support the
506 ``xnack`` target
507 feature then must
508 be 0.
Tony Tye31105cc2017-12-11 15:35:27 +0000509 See
510 :ref:`amdgpu-target-features`.
Tony Tye07d9f102017-11-10 01:00:54 +0000511 ================================= ========== =============================
512
513 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
514 :name: amdgpu-ef-amdgpu-mach-table
515
516 ================================= ========== =============================
517 Name Value Description (see
518 :ref:`amdgpu-processor-table`)
519 ================================= ========== =============================
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000520 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
521 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
522 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
523 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
524 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
525 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
526 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
527 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
528 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
529 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
530 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
531 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
532 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
533 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
534 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
535 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
536 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
537 *reserved* 0x011 - Reserved for ``r600``
538 0x01f architecture processors.
539 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
540 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
541 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
542 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
543 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
544 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
545 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
546 *reserved* 0x027 Reserved.
547 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
548 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
549 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
550 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
551 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
552 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
553 *reserved* 0x02e Reserved.
554 *reserved* 0x02f Reserved.
555 *reserved* 0x030 Reserved.
Tony Tye07d9f102017-11-10 01:00:54 +0000556 ================================= ========== =============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000557
558Sections
559--------
560
561An AMDGPU target ELF code object has the standard ELF sections which include:
562
563 .. table:: AMDGPU ELF Sections
564 :name: amdgpu-elf-sections-table
565
566 ================== ================ =================================
567 Name Type Attributes
568 ================== ================ =================================
569 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
570 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
571 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
572 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
573 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
574 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
575 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
576 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
577 ``.note`` ``SHT_NOTE`` *none*
578 ``.rela``\ *name* ``SHT_RELA`` *none*
579 ``.rela.dyn`` ``SHT_RELA`` *none*
580 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
581 ``.shstrtab`` ``SHT_STRTAB`` *none*
582 ``.strtab`` ``SHT_STRTAB`` *none*
583 ``.symtab`` ``SHT_SYMTAB`` *none*
584 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
585 ================== ================ =================================
586
587These sections have their standard meanings (see [ELF]_) and are only generated
588if needed.
589
590``.debug``\ *\**
591 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
592 DWARF produced by the AMDGPU backend.
593
Tony Tye46d35762017-08-15 20:47:41 +0000594``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000595 The standard sections used by a dynamic loader.
596
597``.note``
598 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
599 backend.
600
601``.rela``\ *name*, ``.rela.dyn``
602 For relocatable code objects, *name* is the name of the section that the
603 relocation records apply. For example, ``.rela.text`` is the section name for
604 relocation records associated with the ``.text`` section.
605
606 For linked shared code objects, ``.rela.dyn`` contains all the relocation
607 records from each of the relocatable code object's ``.rela``\ *name* sections.
608
609 See :ref:`amdgpu-relocation-records` for the relocation records supported by
610 the AMDGPU backend.
611
612``.text``
613 The executable machine code for the kernels and functions they call. Generated
614 as position independent code. See :ref:`amdgpu-code-conventions` for
615 information on conventions used in the isa generation.
616
617.. _amdgpu-note-records:
618
619Note Records
620------------
621
Tony Tye07d9f102017-11-10 01:00:54 +0000622As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
623be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
624aligned. In addition, minimal zero byte padding must be generated to ensure the
625``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
626``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tyef16a45e2017-06-06 20:31:59 +0000627
628The AMDGPU backend code object uses the following ELF note records in the
629``.note`` section. The *Description* column specifies the layout of the note
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000630record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tyef16a45e2017-06-06 20:31:59 +0000631variable size strings have a corresponding ``*_size`` field that specifies the
632number of bytes, including the terminating null character, in the string. The
633string(s) come immediately after the preceding fields.
634
635Additional note records can be present.
636
637 .. table:: AMDGPU ELF Note Records
638 :name: amdgpu-elf-note-records-table
639
Tony Tye46d35762017-08-15 20:47:41 +0000640 ===== ============================== ======================================
641 Name Type Description
642 ===== ============================== ======================================
643 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye46d35762017-08-15 20:47:41 +0000644 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000645
646..
647
648 .. table:: AMDGPU ELF Note Record Enumeration Values
649 :name: amdgpu-elf-note-record-enumeration-values-table
650
Tony Tye46d35762017-08-15 20:47:41 +0000651 ============================== =====
652 Name Value
653 ============================== =====
654 *reserved* 0-9
655 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye07d9f102017-11-10 01:00:54 +0000656 *reserved* 11
Tony Tye46d35762017-08-15 20:47:41 +0000657 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000658
Tony Tye46d35762017-08-15 20:47:41 +0000659``NT_AMD_AMDGPU_HSA_METADATA``
660 Specifies extensible metadata associated with the code objects executed on HSA
661 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
662 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
663 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
664 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000665
Tony Tye46d35762017-08-15 20:47:41 +0000666.. _amdgpu-symbols:
667
668Symbols
669-------
670
671Symbols include the following:
672
673 .. table:: AMDGPU ELF Symbols
674 :name: amdgpu-elf-symbols-table
675
676 ===================== ============== ============= ==================
677 Name Type Section Description
678 ===================== ============== ============= ==================
679 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
680 - ``.rodata``
681 - ``.bss``
682 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
683 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
684 ===================== ============== ============= ==================
685
686Global variable
687 Global variables both used and defined by the compilation unit.
688
689 If the symbol is defined in the compilation unit then it is allocated in the
690 appropriate section according to if it has initialized data or is readonly.
691
692 If the symbol is external then its section is ``STN_UNDEF`` and the loader
693 will resolve relocations using the definition provided by another code object
694 or explicitly defined by the runtime.
695
696 All global symbols, whether defined in the compilation unit or external, are
697 accessed by the machine code indirectly through a GOT table entry. This
698 allows them to be preemptable. The GOT table is only supported when the target
699 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000700
701 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000702 Add description of linked shared object symbols. Seems undefined symbols
703 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000704
Tony Tye46d35762017-08-15 20:47:41 +0000705Kernel descriptor
706 Every HSA kernel has an associated kernel descriptor. It is the address of the
707 kernel descriptor that is used in the AQL dispatch packet used to invoke the
708 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
709 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
710
711Kernel entry point
712 Every HSA kernel also has a symbol for its machine code entry point.
713
714.. _amdgpu-relocation-records:
715
716Relocation Records
717------------------
718
719AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
720relocatable fields are:
721
722``word32``
723 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
724 alignment. These values use the same byte order as other word values in the
725 AMD GPU architecture.
726
727``word64``
728 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
729 alignment. These values use the same byte order as other word values in the
730 AMD GPU architecture.
731
732Following notations are used for specifying relocation calculations:
733
734**A**
735 Represents the addend used to compute the value of the relocatable field.
736
737**G**
738 Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000739 entry's symbol will reside during execution.
Tony Tye46d35762017-08-15 20:47:41 +0000740
741**GOT**
742 Represents the address of the global offset table.
743
744**P**
745 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
746 of the storage unit being relocated (computed using ``r_offset``).
747
748**S**
749 Represents the value of the symbol whose index resides in the relocation
Tony Tyed2884302017-10-16 20:44:29 +0000750 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
751
752**B**
753 Represents the base address of a loaded executable or shared object which is
754 the difference between the ELF address and the actual load address. Relocations
755 using this are only valid in executable or shared objects.
Tony Tye46d35762017-08-15 20:47:41 +0000756
757The following relocation types are supported:
758
759 .. table:: AMDGPU ELF Relocation Records
760 :name: amdgpu-elf-relocation-records-table
761
Tony Tyedb6c9932018-01-30 23:59:43 +0000762 ========================== ======= ===== ========== ==============================
763 Relocation Type Kind Value Field Calculation
764 ========================== ======= ===== ========== ==============================
765 ``R_AMDGPU_NONE`` 0 *none* *none*
766 ``R_AMDGPU_ABS32_LO`` Dynamic 1 ``word32`` (S + A) & 0xFFFFFFFF
767 ``R_AMDGPU_ABS32_HI`` Dynamic 2 ``word32`` (S + A) >> 32
768 ``R_AMDGPU_ABS64`` Dynamic 3 ``word64`` S + A
769 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
770 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
771 ``R_AMDGPU_ABS32`` Static 6 ``word32`` S + A
772 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
773 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
774 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
775 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
776 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
777 *reserved* 12
778 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
779 ========================== ======= ===== ========== ==============================
Tony Tye46d35762017-08-15 20:47:41 +0000780
781.. _amdgpu-dwarf:
782
783DWARF
784-----
785
Scott Linder16c7bda2018-02-23 23:01:06 +0000786Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
Tony Tye46d35762017-08-15 20:47:41 +0000787information that maps the code object executable code and data to the source
788language constructs. It can be used by tools such as debuggers and profilers.
789
790Address Space Mapping
791~~~~~~~~~~~~~~~~~~~~~
792
793The following address space mapping is used:
794
795 .. table:: AMDGPU DWARF Address Space Mapping
796 :name: amdgpu-dwarf-address-space-mapping-table
797
798 =================== =================
799 DWARF Address Space Memory Space
800 =================== =================
801 1 Private (Scratch)
802 2 Local (group/LDS)
803 *omitted* Global
804 *omitted* Constant
805 *omitted* Generic (Flat)
806 *not supported* Region (GDS)
807 =================== =================
808
809See :ref:`amdgpu-address-spaces` for information on the memory space terminology
810used in the table.
811
812An ``address_class`` attribute is generated on pointer type DIEs to specify the
813DWARF address space of the value of the pointer when it is in the *private* or
814*local* address space. Otherwise the attribute is omitted.
815
816An ``XDEREF`` operation is generated in location list expressions for variables
817that are allocated in the *private* and *local* address space. Otherwise no
818``XDREF`` is omitted.
819
820Register Mapping
821~~~~~~~~~~~~~~~~
822
823*This section is WIP.*
824
825.. TODO
826 Define DWARF register enumeration.
827
828 If want to present a wavefront state then should expose vector registers as
829 64 wide (rather than per work-item view that LLVM uses). Either as separate
830 registers, or a 64x4 byte single register. In either case use a new LANE op
831 (akin to XDREF) to select the current lane usage in a location
832 expression. This would also allow scalar register spilling to vector register
833 lanes to be expressed (currently no debug information is being generated for
834 spilling). If choose a wide single register approach then use LANE in
835 conjunction with PIECE operation to select the dword part of the register for
836 the current lane. If the separate register approach then use LANE to select
837 the register.
838
839Source Text
840~~~~~~~~~~~
841
Scott Linder16c7bda2018-02-23 23:01:06 +0000842Source text for online-compiled programs (e.g. those compiled by the OpenCL
843runtime) may be embedded into the DWARF v5 line table using the ``clang
844-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye46d35762017-08-15 20:47:41 +0000845
Scott Linder16c7bda2018-02-23 23:01:06 +0000846For example:
847
848``-gembed-source``
849 Enable the embedded source DWARF v5 extension.
850``-gno-embed-source``
851 Disable the embedded source DWARF v5 extension.
852
853 .. table:: AMDGPU Debug Options
854 :name: amdgpu-debug-options
855
856 ==================== ==================================================
857 Debug Flag Description
858 ==================== ==================================================
859 -g[no-]embed-source Enable/disable embedding source text in DWARF
860 debug sections. Useful for environments where
861 source cannot be written to disk, such as
862 when performing online compilation.
863 ==================== ==================================================
864
865This option enables one extended content types in the DWARF v5 Line Number
866Program Header, which is used to encode embedded source.
867
868 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
869 :name: amdgpu-dwarf-extended-content-types
870
871 ============================ ======================
872 Content Type Form
873 ============================ ======================
874 ``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
875 ============================ ======================
876
877The source field will contain the UTF-8 encoded, null-terminated source text
878with ``'\n'`` line endings. When the source field is present, consumers can use
879the embedded source instead of attempting to discover the source on disk. When
880the source field is absent, consumers can access the file to get the source
881text.
882
883The above content type appears in the ``file_name_entry_format`` field of the
884line table prologue, and its corresponding value appear in the ``file_names``
885field. The current encoding of the content type is documented in table
886:ref:`amdgpu-dwarf-extended-content-types-encoding`
887
888 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
889 :name: amdgpu-dwarf-extended-content-types-encoding
890
891 ============================ ====================
892 Content Type Value
893 ============================ ====================
894 ``DW_LNCT_LLVM_source`` 0x2001
895 ============================ ====================
Tony Tye46d35762017-08-15 20:47:41 +0000896
897.. _amdgpu-code-conventions:
898
899Code Conventions
900================
901
902This section provides code conventions used for each supported target triple OS
903(see :ref:`amdgpu-target-triples`).
904
905AMDHSA
906------
907
908This section provides code conventions used when the target triple OS is
909``amdhsa`` (see :ref:`amdgpu-target-triples`).
910
911.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000912
Tony Tye01bfd6c2018-03-27 21:20:46 +0000913Code Object Target Identification
914~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
915
916The AMDHSA OS uses the following syntax to specify the code object
917target as a single string:
918
919 ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
920
921Where:
922
923 - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
924 are the same as the *Target Triple* (see
925 :ref:`amdgpu-target-triples`).
926
927 - ``<Processor>`` is the same as the *Processor* (see
928 :ref:`amdgpu-processors`).
929
930 - ``<Target Features>`` is a list of the enabled *Target Features*
931 (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
932 apply to *Processor*. The list must be in the same order as listed
933 in the table :ref:`amdgpu-target-feature-table`. Note that *Target
934 Features* must be included in the list if they are enabled even if
935 that is the default for *Processor*.
936
937For example:
938
939 ``"amdgcn-amd-amdhsa--gfx902+xnack"``
940
Tony Tyef16a45e2017-06-06 20:31:59 +0000941Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000942~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000943
Tony Tye46d35762017-08-15 20:47:41 +0000944The code object metadata specifies extensible metadata associated with the code
945objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
946[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
947(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
948``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
949information necessary to support the ROCM kernel queries. For example, the
950segment sizes needed in a dispatch packet. In addition, a high level language
951runtime may require other information to be included. For example, the AMD
952OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000953
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000954The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000955:doc:`YamlIO`).
956
Tony Tye46d35762017-08-15 20:47:41 +0000957.. TODO
958 Is the string null terminated? It probably should not if YAML allows it to
959 contain null characters, otherwise it should be.
960
Tony Tyef16a45e2017-06-06 20:31:59 +0000961The metadata is represented as a single YAML document comprised of the mapping
962defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
963referenced tables.
964
965For boolean values, the string values of ``false`` and ``true`` are used for
966false and true respectively.
967
968Additional information can be added to the mappings. To avoid conflicts, any
969non-AMD key names should be prefixed by "*vendor-name*.".
970
971 .. table:: AMDHSA Code Object Metadata Mapping
972 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
973
974 ========== ============== ========= =======================================
975 String Key Value Type Required? Description
976 ========== ============== ========= =======================================
977 "Version" sequence of Required - The first integer is the major
978 2 integers version. Currently 1.
979 - The second integer is the minor
980 version. Currently 0.
981 "Printf" sequence of Each string is encoded information
982 strings about a printf function call. The
983 encoded information is organized as
984 fields separated by colon (':'):
985
986 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
987
988 where:
989
990 ``ID``
991 A 32 bit integer as a unique id for
992 each printf function call
993
994 ``N``
995 A 32 bit integer equal to the number
996 of arguments of printf function call
997 minus 1
998
999 ``S[i]`` (where i = 0, 1, ... , N-1)
1000 32 bit integers for the size in bytes
1001 of the i-th FormatString argument of
1002 the printf function call
1003
1004 FormatString
1005 The format string passed to the
1006 printf function call.
1007 "Kernels" sequence of Required Sequence of the mappings for each
1008 mapping kernel in the code object. See
1009 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
1010 for the definition of the mapping.
1011 ========== ============== ========= =======================================
1012
1013..
1014
1015 .. table:: AMDHSA Code Object Kernel Metadata Mapping
1016 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
1017
1018 ================= ============== ========= ================================
1019 String Key Value Type Required? Description
1020 ================= ============== ========= ================================
1021 "Name" string Required Source name of the kernel.
1022 "SymbolName" string Required Name of the kernel
1023 descriptor ELF symbol.
1024 "Language" string Source language of the kernel.
1025 Values include:
1026
1027 - "OpenCL C"
1028 - "OpenCL C++"
1029 - "HCC"
1030 - "OpenMP"
1031
1032 "LanguageVersion" sequence of - The first integer is the major
1033 2 integers version.
1034 - The second integer is the
1035 minor version.
1036 "Attrs" mapping Mapping of kernel attributes.
1037 See
1038 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1039 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001040 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001041 mapping kernel arguments. See
1042 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1043 for the definition of the mapping.
1044 "CodeProps" mapping Mapping of properties related to
1045 the kernel code. See
1046 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1047 for the mapping definition.
Tony Tyef16a45e2017-06-06 20:31:59 +00001048 ================= ============== ========= ================================
1049
1050..
1051
1052 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1053 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1054
1055 =================== ============== ========= ==============================
1056 String Key Value Type Required? Description
1057 =================== ============== ========= ==============================
Tony Tyee039d0e2018-01-30 23:07:10 +00001058 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
1059 3 integers must be >=1 and the dispatch
1060 work-group size X, Y, Z must
1061 correspond to the specified
1062 values. Defaults to 0, 0, 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001063
1064 Corresponds to the OpenCL
1065 ``reqd_work_group_size``
1066 attribute.
1067 "WorkGroupSizeHint" sequence of The dispatch work-group size
1068 3 integers X, Y, Z is likely to be the
1069 specified values.
1070
1071 Corresponds to the OpenCL
1072 ``work_group_size_hint``
1073 attribute.
1074 "VecTypeHint" string The name of a scalar or vector
1075 type.
1076
1077 Corresponds to the OpenCL
1078 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +00001079
1080 "RuntimeHandle" string The external symbol name
1081 associated with a kernel.
1082 OpenCL runtime allocates a
1083 global buffer for the symbol
1084 and saves the kernel's address
1085 to it, which is used for
1086 device side enqueueing. Only
1087 available for device side
1088 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +00001089 =================== ============== ========= ==============================
1090
1091..
1092
1093 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1094 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1095
1096 ================= ============== ========= ================================
1097 String Key Value Type Required? Description
1098 ================= ============== ========= ================================
1099 "Name" string Kernel argument name.
1100 "TypeName" string Kernel argument type name.
1101 "Size" integer Required Kernel argument size in bytes.
1102 "Align" integer Required Kernel argument alignment in
1103 bytes. Must be a power of two.
1104 "ValueKind" string Required Kernel argument kind that
1105 specifies how to set up the
1106 corresponding argument.
1107 Values include:
1108
1109 "ByValue"
1110 The argument is copied
1111 directly into the kernarg.
1112
1113 "GlobalBuffer"
1114 A global address space pointer
1115 to the buffer data is passed
1116 in the kernarg.
1117
1118 "DynamicSharedPointer"
1119 A group address space pointer
1120 to dynamically allocated LDS
1121 is passed in the kernarg.
1122
1123 "Sampler"
1124 A global address space
1125 pointer to a S# is passed in
1126 the kernarg.
1127
1128 "Image"
1129 A global address space
1130 pointer to a T# is passed in
1131 the kernarg.
1132
1133 "Pipe"
1134 A global address space pointer
1135 to an OpenCL pipe is passed in
1136 the kernarg.
1137
1138 "Queue"
1139 A global address space pointer
1140 to an OpenCL device enqueue
1141 queue is passed in the
1142 kernarg.
1143
1144 "HiddenGlobalOffsetX"
1145 The OpenCL grid dispatch
1146 global offset for the X
1147 dimension is passed in the
1148 kernarg.
1149
1150 "HiddenGlobalOffsetY"
1151 The OpenCL grid dispatch
1152 global offset for the Y
1153 dimension is passed in the
1154 kernarg.
1155
1156 "HiddenGlobalOffsetZ"
1157 The OpenCL grid dispatch
1158 global offset for the Z
1159 dimension is passed in the
1160 kernarg.
1161
1162 "HiddenNone"
1163 An argument that is not used
1164 by the kernel. Space needs to
1165 be left for it, but it does
1166 not need to be set up.
1167
1168 "HiddenPrintfBuffer"
1169 A global address space pointer
1170 to the runtime printf buffer
1171 is passed in kernarg.
1172
1173 "HiddenDefaultQueue"
1174 A global address space pointer
1175 to the OpenCL device enqueue
1176 queue that should be used by
1177 the kernel by default is
1178 passed in the kernarg.
1179
1180 "HiddenCompletionAction"
Yaxun Liuc928f2a2017-10-30 14:30:28 +00001181 A global address space pointer
1182 to help link enqueued kernels into
1183 the ancestor tree for determining
1184 when the parent kernel has finished.
Tony Tyef16a45e2017-06-06 20:31:59 +00001185
1186 "ValueType" string Required Kernel argument value type. Only
1187 present if "ValueKind" is
1188 "ByValue". For vector data
1189 types, the value is for the
1190 element type. Values include:
1191
1192 - "Struct"
1193 - "I8"
1194 - "U8"
1195 - "I16"
1196 - "U16"
1197 - "F16"
1198 - "I32"
1199 - "U32"
1200 - "F32"
1201 - "I64"
1202 - "U64"
1203 - "F64"
1204
1205 .. TODO
1206 How can it be determined if a
1207 vector type, and what size
1208 vector?
1209 "PointeeAlign" integer Alignment in bytes of pointee
1210 type for pointer type kernel
1211 argument. Must be a power
1212 of 2. Only present if
1213 "ValueKind" is
1214 "DynamicSharedPointer".
1215 "AddrSpaceQual" string Kernel argument address space
1216 qualifier. Only present if
1217 "ValueKind" is "GlobalBuffer" or
1218 "DynamicSharedPointer". Values
1219 are:
1220
1221 - "Private"
1222 - "Global"
1223 - "Constant"
1224 - "Local"
1225 - "Generic"
1226 - "Region"
1227
1228 .. TODO
1229 Is GlobalBuffer only Global
1230 or Constant? Is
1231 DynamicSharedPointer always
1232 Local? Can HCC allow Generic?
1233 How can Private or Region
1234 ever happen?
1235 "AccQual" string Kernel argument access
1236 qualifier. Only present if
1237 "ValueKind" is "Image" or
1238 "Pipe". Values
1239 are:
1240
1241 - "ReadOnly"
1242 - "WriteOnly"
1243 - "ReadWrite"
1244
1245 .. TODO
1246 Does this apply to
1247 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001248 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001249 performed by the kernel on the
1250 kernel argument. Only present if
1251 "ValueKind" is "GlobalBuffer",
1252 "Image", or "Pipe". This may be
1253 more restrictive than indicated
1254 by "AccQual" to reflect what the
1255 kernel actual does. If not
1256 present then the runtime must
1257 assume what is implied by
1258 "AccQual" and "IsConst". Values
1259 are:
1260
1261 - "ReadOnly"
1262 - "WriteOnly"
1263 - "ReadWrite"
1264
1265 "IsConst" boolean Indicates if the kernel argument
1266 is const qualified. Only present
1267 if "ValueKind" is
1268 "GlobalBuffer".
1269
1270 "IsRestrict" boolean Indicates if the kernel argument
1271 is restrict qualified. Only
1272 present if "ValueKind" is
1273 "GlobalBuffer".
1274
1275 "IsVolatile" boolean Indicates if the kernel argument
1276 is volatile qualified. Only
1277 present if "ValueKind" is
1278 "GlobalBuffer".
1279
1280 "IsPipe" boolean Indicates if the kernel argument
1281 is pipe qualified. Only present
1282 if "ValueKind" is "Pipe".
1283
1284 .. TODO
1285 Can GlobalBuffer be pipe
1286 qualified?
1287 ================= ============== ========= ================================
1288
1289..
1290
1291 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1292 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1293
1294 ============================ ============== ========= =====================
1295 String Key Value Type Required? Description
1296 ============================ ============== ========= =====================
1297 "KernargSegmentSize" integer Required The size in bytes of
1298 the kernarg segment
1299 that holds the values
1300 of the arguments to
1301 the kernel.
1302 "GroupSegmentFixedSize" integer Required The amount of group
1303 segment memory
1304 required by a
1305 work-group in
1306 bytes. This does not
1307 include any
1308 dynamically allocated
1309 group segment memory
1310 that may be added
1311 when the kernel is
1312 dispatched.
1313 "PrivateSegmentFixedSize" integer Required The amount of fixed
1314 private address space
1315 memory required for a
1316 work-item in
Tony Tye07d9f102017-11-10 01:00:54 +00001317 bytes. If the kernel
1318 uses a dynamic call
1319 stack then additional
Tony Tyef16a45e2017-06-06 20:31:59 +00001320 space must be added
1321 to this value for the
1322 call stack.
1323 "KernargSegmentAlign" integer Required The maximum byte
1324 alignment of
1325 arguments in the
1326 kernarg segment. Must
1327 be a power of 2.
1328 "WavefrontSize" integer Required Wavefront size. Must
1329 be a power of 2.
Tony Tye07d9f102017-11-10 01:00:54 +00001330 "NumSGPRs" integer Required Number of scalar
Tony Tyef16a45e2017-06-06 20:31:59 +00001331 registers used by a
1332 wavefront for
1333 GFX6-GFX9. This
1334 includes the special
1335 SGPRs for VCC, Flat
1336 Scratch (GFX7-GFX9)
1337 and XNACK (for
1338 GFX8-GFX9). It does
1339 not include the 16
1340 SGPR added if a trap
1341 handler is
1342 enabled. It is not
1343 rounded up to the
1344 allocation
1345 granularity.
Tony Tye07d9f102017-11-10 01:00:54 +00001346 "NumVGPRs" integer Required Number of vector
Tony Tyef16a45e2017-06-06 20:31:59 +00001347 registers used by
1348 each work-item for
1349 GFX6-GFX9
Tony Tye07d9f102017-11-10 01:00:54 +00001350 "MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001351 work-group size
1352 supported by the
1353 kernel in work-items.
Tony Tye07d9f102017-11-10 01:00:54 +00001354 Must be >=1 and
Tony Tyee039d0e2018-01-30 23:07:10 +00001355 consistent with
1356 ReqdWorkGroupSize if
1357 not 0, 0, 0.
Konstantin Zhuravlyov06ae4ec2017-11-28 17:51:08 +00001358 "NumSpilledSGPRs" integer Number of stores from
1359 a scalar register to
1360 a register allocator
1361 created spill
1362 location.
1363 "NumSpilledVGPRs" integer Number of stores from
1364 a vector register to
1365 a register allocator
1366 created spill
1367 location.
Tony Tyef16a45e2017-06-06 20:31:59 +00001368 ============================ ============== ========= =====================
1369
1370..
1371
Tony Tyef16a45e2017-06-06 20:31:59 +00001372Kernel Dispatch
1373~~~~~~~~~~~~~~~
1374
1375The HSA architected queuing language (AQL) defines a user space memory interface
1376that can be used to control the dispatch of kernels, in an agent independent
1377way. An agent can have zero or more AQL queues created for it using the ROCm
1378runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1379*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1380mechanics and packet layouts.
1381
1382The packet processor of a kernel agent is responsible for detecting and
1383dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1384packet processor is implemented by the hardware command processor (CP),
1385asynchronous dispatch controller (ADC) and shader processor input controller
1386(SPI).
1387
1388The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1389mode driver to initialize and register the AQL queue with CP.
1390
1391To dispatch a kernel the following actions are performed. This can occur in the
1392CPU host program, or from an HSA kernel executing on a GPU.
1393
13941. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1395 executed is obtained.
13962. A pointer to the kernel descriptor (see
1397 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1398 obtained. It must be for a kernel that is contained in a code object that that
1399 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1400 associated.
14013. Space is allocated for the kernel arguments using the ROCm runtime allocator
1402 for a memory region with the kernarg property for the kernel agent that will
1403 execute the kernel. It must be at least 16 byte aligned.
14044. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00001405 allocation. The layout is defined in the *HSA Programmer's Language Reference*
Tony Tyef16a45e2017-06-06 20:31:59 +00001406 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1407 memory in the same way constant memory is accessed. (Note that the HSA
1408 specification allows an implementation to copy the kernel argument contents to
1409 another location that is accessed by the kernel.)
14105. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1411 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1412 packet. The packet must be set up, and the final write must use an atomic
1413 store release to set the packet kind to ensure the packet contents are
1414 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1415 notify the kernel agent that the AQL queue has been updated. These rules, and
1416 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1417 System Architecture Specification* [HSA]_.
14186. A kernel dispatch packet includes information about the actual dispatch,
1419 such as grid and work-group size, together with information from the code
1420 object about the kernel, such as segment sizes. The ROCm runtime queries on
1421 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001422 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000014237. CP executes micro-code and is responsible for detecting and setting up the
1424 GPU to execute the wavefronts of a kernel dispatch.
14258. CP ensures that when the a wavefront starts executing the kernel machine
1426 code, the scalar general purpose registers (SGPR) and vector general purpose
1427 registers (VGPR) are set up as required by the machine code. The required
1428 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1429 register state is defined in
1430 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
14319. The prolog of the kernel machine code (see
1432 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1433 before continuing executing the machine code that corresponds to the kernel.
143410. When the kernel dispatch has completed execution, CP signals the completion
1435 signal specified in the kernel dispatch packet if not 0.
1436
1437.. _amdgpu-amdhsa-memory-spaces:
1438
1439Memory Spaces
1440~~~~~~~~~~~~~
1441
1442The memory space properties are:
1443
1444 .. table:: AMDHSA Memory Spaces
1445 :name: amdgpu-amdhsa-memory-spaces-table
1446
1447 ================= =========== ======== ======= ==================
1448 Memory Space Name HSA Segment Hardware Address NULL Value
1449 Name Name Size
1450 ================= =========== ======== ======= ==================
1451 Private private scratch 32 0x00000000
1452 Local group LDS 32 0xFFFFFFFF
1453 Global global global 64 0x0000000000000000
1454 Constant constant *same as 64 0x0000000000000000
1455 global*
1456 Generic flat flat 64 0x0000000000000000
1457 Region N/A GDS 32 *not implemented
1458 for AMDHSA*
1459 ================= =========== ======== ======= ==================
1460
1461The global and constant memory spaces both use global virtual addresses, which
1462are the same virtual address space used by the CPU. However, some virtual
1463addresses may only be accessible to the CPU, some only accessible by the GPU,
1464and some by both.
1465
1466Using the constant memory space indicates that the data will not change during
1467the execution of the kernel. This allows scalar read instructions to be
1468used. The vector and scalar L1 caches are invalidated of volatile data before
1469each kernel dispatch execution to allow constant memory to change values between
1470kernel dispatches.
1471
1472The local memory space uses the hardware Local Data Store (LDS) which is
1473automatically allocated when the hardware creates work-groups of wavefronts, and
1474freed when all the wavefronts of a work-group have terminated. The data store
1475(DS) instructions can be used to access it.
1476
1477The private memory space uses the hardware scratch memory support. If the kernel
1478uses scratch, then the hardware allocates memory that is accessed using
1479wavefront lane dword (4 byte) interleaving. The mapping used from private
1480address to physical address is:
1481
1482 ``wavefront-scratch-base +
1483 (private-address * wavefront-size * 4) +
1484 (wavefront-lane-id * 4)``
1485
1486There are different ways that the wavefront scratch base address is determined
1487by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1488memory can be accessed in an interleaved manner using buffer instruction with
Tony Tye5bbcca62018-03-08 05:46:01 +00001489the scratch buffer descriptor and per wavefront scratch offset, by the scratch
Tony Tyef16a45e2017-06-06 20:31:59 +00001490instructions, or by flat instructions. If each lane of a wavefront accesses the
1491same private address, the interleaving results in adjacent dwords being accessed
1492and hence requires fewer cache lines to be fetched. Multi-dword access is not
1493supported except by flat and scratch instructions in GFX9.
1494
1495The generic address space uses the hardware flat address support available in
1496GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1497local appertures), that are outside the range of addressible global memory, to
1498map from a flat address to a private or local address.
1499
1500FLAT instructions can take a flat address and access global, private (scratch)
1501and group (LDS) memory depending in if the address is within one of the
1502apperture ranges. Flat access to scratch requires hardware aperture setup and
1503setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1504access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1505(see :ref:`amdgpu-amdhsa-m0`).
1506
1507To convert between a segment address and a flat address the base address of the
1508appertures address can be used. For GFX7-GFX8 these are available in the
1509:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1510Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1511GFX9 the appature base addresses are directly available as inline constant
1512registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1513address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1514which makes it easier to convert from flat to segment or segment to flat.
1515
Tony Tye46d35762017-08-15 20:47:41 +00001516Image and Samplers
1517~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001518
1519Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1520hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1521HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1522enumeration values for the queries that are not trivially deducible from the S#
1523representation.
1524
1525HSA Signals
1526~~~~~~~~~~~
1527
Tony Tye46d35762017-08-15 20:47:41 +00001528HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1529structure allocated in memory accessible from both the CPU and GPU. The
1530structure is defined by the ROCm runtime and subject to change between releases
1531(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001532
1533.. _amdgpu-amdhsa-hsa-aql-queue:
1534
1535HSA AQL Queue
1536~~~~~~~~~~~~~
1537
Tony Tye46d35762017-08-15 20:47:41 +00001538The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001539between releases (see [AMD-ROCm-github]_). For some processors it contains
1540fields needed to implement certain language features such as the flat address
1541aperture bases. It also contains fields used by CP such as managing the
1542allocation of scratch memory.
1543
1544.. _amdgpu-amdhsa-kernel-descriptor:
1545
1546Kernel Descriptor
1547~~~~~~~~~~~~~~~~~
1548
1549A kernel descriptor consists of the information needed by CP to initiate the
1550execution of a kernel, including the entry point address of the machine code
1551that implements the kernel.
1552
1553Kernel Descriptor for GFX6-GFX9
1554+++++++++++++++++++++++++++++++
1555
1556CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1557
1558 .. table:: Kernel Descriptor for GFX6-GFX9
1559 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1560
Tony Tye6baa6d22017-10-18 22:16:55 +00001561 ======= ======= =============================== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001562 Bits Size Field Name Description
Tony Tye6baa6d22017-10-18 22:16:55 +00001563 ======= ======= =============================== ============================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001564 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001565 address space memory
1566 required for a work-group
1567 in bytes. This does not
1568 include any dynamically
1569 allocated local address
1570 space memory that may be
1571 added when the kernel is
1572 dispatched.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001573 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001574 private address space
1575 memory required for a
1576 work-item in bytes. If
1577 is_dynamic_callstack is 1
1578 then additional space must
1579 be added to this value for
1580 the call stack.
Tony Tye07d9f102017-11-10 01:00:54 +00001581 127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001582 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001583 negative) from base
1584 address of kernel
1585 descriptor to kernel's
1586 entry point instruction
1587 which must be 256 byte
1588 aligned.
Tony Tyee039d0e2018-01-30 23:07:10 +00001589 383:192 24 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001590 bytes
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001591 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001592 program settings used by
1593 CP to set up
1594 ``COMPUTE_PGM_RSRC1``
1595 configuration
1596 register. See
Tony Tye6baa6d22017-10-18 22:16:55 +00001597 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001598 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001599 program settings used by
1600 CP to set up
1601 ``COMPUTE_PGM_RSRC2``
1602 configuration
1603 register. See
1604 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001605 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1606 SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001607 (see
1608 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1609
1610 The total number of SGPR
1611 user data registers
1612 requested must not exceed
1613 16 and match value in
1614 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1615 Any requests beyond 16
1616 will be ignored.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001617 449 1 bit EnableSGPRDispatchPtr *see above*
1618 450 1 bit EnableSGPRQueuePtr *see above*
1619 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1620 452 1 bit EnableSGPRDispatchID *see above*
1621 453 1 bit EnableSGPRFlatScratchInit *see above*
1622 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1623 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1624 should always be 0.
1625 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1626 should always be 0.
1627 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1628 should always be 0.
Tony Tye31105cc2017-12-11 15:35:27 +00001629 463:458 6 bits Reserved, must be 0.
Tony Tye6baa6d22017-10-18 22:16:55 +00001630 511:464 6 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001631 bytes
1632 512 **Total size 64 bytes.**
Tony Tye6baa6d22017-10-18 22:16:55 +00001633 ======= ====================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001634
1635..
1636
1637 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001638 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tyef16a45e2017-06-06 20:31:59 +00001639
Tony Tye3b340612017-06-07 00:46:08 +00001640 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001641 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001642 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001643 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001644 used by each work-item,
1645 granularity is device
1646 specific:
1647
Tony Tye07d9f102017-11-10 01:00:54 +00001648 GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001649 - max_vgpr 1..256
1650 - roundup((max_vgpg + 1)
1651 / 4) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001652
1653 Used by CP to set up
1654 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001655 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001656 used by a wavefront,
1657 granularity is device
1658 specific:
1659
Tony Tye07d9f102017-11-10 01:00:54 +00001660 GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001661 - max_sgpr 1..112
1662 - roundup((max_sgpg + 1)
1663 / 8) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001664 GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001665 - max_sgpr 1..112
1666 - roundup((max_sgpg + 1)
1667 / 16) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001668
1669 Includes the special SGPRs
1670 for VCC, Flat Scratch (for
1671 GFX7 onwards) and XNACK
1672 (for GFX8 onwards). It does
1673 not include the 16 SGPR
1674 added if a trap handler is
1675 enabled.
1676
1677 Used by CP to set up
1678 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001679 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001680
1681 Start executing wavefront
1682 at the specified priority.
1683
1684 CP is responsible for
1685 filling in
1686 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001687 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001688 with specified rounding
1689 mode for single (32
1690 bit) floating point
1691 precision floating point
1692 operations.
1693
1694 Floating point rounding
1695 mode values are defined in
1696 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1697
1698 Used by CP to set up
1699 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001700 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001701 with specified rounding
1702 denorm mode for half/double (16
1703 and 64 bit) floating point
1704 precision floating point
1705 operations.
1706
1707 Floating point rounding
1708 mode values are defined in
1709 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1710
1711 Used by CP to set up
1712 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001713 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001714 with specified denorm mode
1715 for single (32
1716 bit) floating point
1717 precision floating point
1718 operations.
1719
1720 Floating point denorm mode
1721 values are defined in
1722 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1723
1724 Used by CP to set up
1725 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001726 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001727 with specified denorm mode
1728 for half/double (16
1729 and 64 bit) floating point
1730 precision floating point
1731 operations.
1732
1733 Floating point denorm mode
1734 values are defined in
1735 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1736
1737 Used by CP to set up
1738 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001739 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001740
1741 Start executing wavefront
1742 in privilege trap handler
1743 mode.
1744
1745 CP is responsible for
1746 filling in
1747 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001748 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001749 with DX10 clamp mode
1750 enabled. Used by the vector
Tony Tye6baa6d22017-10-18 22:16:55 +00001751 ALU to force DX10 style
Tony Tyef16a45e2017-06-06 20:31:59 +00001752 treatment of NaN's (when
1753 set, clamp NaN to zero,
1754 otherwise pass NaN
1755 through).
1756
1757 Used by CP to set up
1758 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001759 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001760
1761 Start executing wavefront
1762 in single step mode.
1763
1764 CP is responsible for
1765 filling in
1766 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001767 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001768 with IEEE mode
1769 enabled. Floating point
1770 opcodes that support
1771 exception flag gathering
1772 will quiet and propagate
1773 signaling-NaN inputs per
1774 IEEE 754-2008. Min_dx10 and
1775 max_dx10 become IEEE
1776 754-2008 compliant due to
1777 signaling-NaN propagation
1778 and quieting.
1779
1780 Used by CP to set up
1781 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001782 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001783
1784 Only one work-group allowed
1785 to execute on a compute
1786 unit.
1787
1788 CP is responsible for
1789 filling in
1790 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001791 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001792
1793 Flag that can be used to
1794 control debugging code.
1795
1796 CP is responsible for
1797 filling in
1798 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye07d9f102017-11-10 01:00:54 +00001799 26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001800 Reserved, must be 0.
1801 GFX9
1802 Wavefront starts execution
1803 with specified fp16 overflow
1804 mode.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001805
Tony Tye6baa6d22017-10-18 22:16:55 +00001806 - If 0, fp16 overflow generates
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001807 +/-INF values.
Tony Tye6baa6d22017-10-18 22:16:55 +00001808 - If 1, fp16 overflow that is the
1809 result of an +/-INF input value
1810 or divide by 0 produces a +/-INF,
1811 otherwise clamps computed
1812 overflow to +/-MAX_FP16 as
1813 appropriate.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001814
1815 Used by CP to set up
1816 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye6baa6d22017-10-18 22:16:55 +00001817 31:27 5 bits Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001818 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001819 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001820
1821..
1822
1823 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1824 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1825
Tony Tye3b340612017-06-07 00:46:08 +00001826 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001827 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001828 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001829 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye5bbcca62018-03-08 05:46:01 +00001830 _WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001831 system register (see
1832 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1833
1834 Used by CP to set up
1835 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001836 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001837 user data registers
1838 requested. This number must
1839 match the number of user
1840 data registers enabled.
1841
1842 Used by CP to set up
1843 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001844 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tyef16a45e2017-06-06 20:31:59 +00001845 TRAP instruction which
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00001846 requires a trap handler to
Tony Tyef16a45e2017-06-06 20:31:59 +00001847 be enabled.
1848
1849 CP sets
1850 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1851 if the runtime has
1852 installed a trap handler
1853 regardless of the setting
1854 of this field.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001855 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001856 system SGPR register for
1857 the work-group id in the X
1858 dimension (see
1859 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1860
1861 Used by CP to set up
1862 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001863 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001864 system SGPR register for
1865 the work-group id in the Y
1866 dimension (see
1867 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1868
1869 Used by CP to set up
1870 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001871 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001872 system SGPR register for
1873 the work-group id in the Z
1874 dimension (see
1875 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1876
1877 Used by CP to set up
1878 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001879 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001880 system SGPR register for
1881 work-group information (see
1882 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1883
1884 Used by CP to set up
1885 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001886 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001887 VGPR system registers used
1888 for the work-item ID.
1889 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1890 defines the values.
1891
1892 Used by CP to set up
1893 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001894 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001895
1896 Wavefront starts execution
1897 with address watch
1898 exceptions enabled which
1899 are generated when L1 has
1900 witnessed a thread access
1901 an *address of
1902 interest*.
1903
1904 CP is responsible for
1905 filling in the address
1906 watch bit in
1907 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1908 according to what the
1909 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001910 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001911
1912 Wavefront starts execution
1913 with memory violation
1914 exceptions exceptions
1915 enabled which are generated
1916 when a memory violation has
Tony Tye5bbcca62018-03-08 05:46:01 +00001917 occurred for this wavefront from
Tony Tyef16a45e2017-06-06 20:31:59 +00001918 L1 or LDS
1919 (write-to-read-only-memory,
1920 mis-aligned atomic, LDS
1921 address out of range,
1922 illegal address, etc.).
1923
1924 CP sets the memory
1925 violation bit in
1926 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1927 according to what the
1928 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001929 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001930
1931 CP uses the rounded value
1932 from the dispatch packet,
1933 not this value, as the
1934 dispatch may contain
1935 dynamically allocated group
1936 segment memory. CP writes
1937 directly to
1938 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1939
1940 Amount of group segment
1941 (LDS) to allocate for each
1942 work-group. Granularity is
1943 device specific:
1944
1945 GFX6:
1946 roundup(lds-size / (64 * 4))
1947 GFX7-GFX9:
1948 roundup(lds-size / (128 * 4))
1949
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001950 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1951 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001952 enabled.
1953
1954 Used by CP to set up
1955 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1956 (set from bits 0..6).
1957
1958 IEEE 754 FP Invalid
1959 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001960 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1961 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001962 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001963 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1964 _DIVISION_BY_ZERO Zero
1965 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1966 _OVERFLOW
1967 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1968 _UNDERFLOW
1969 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1970 _INEXACT
1971 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1972 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001973 only)
Tony Tye6baa6d22017-10-18 22:16:55 +00001974 31 1 bit Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001975 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001976 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001977
1978..
1979
1980 .. table:: Floating Point Rounding Mode Enumeration Values
1981 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1982
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001983 ====================================== ===== ==============================
1984 Enumeration Name Value Description
1985 ====================================== ===== ==============================
1986 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
1987 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
1988 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
1989 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
1990 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001991
1992..
1993
1994 .. table:: Floating Point Denorm Mode Enumeration Values
1995 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
1996
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001997 ====================================== ===== ==============================
1998 Enumeration Name Value Description
1999 ====================================== ===== ==============================
2000 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
2001 Denorms
2002 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
2003 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
2004 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
2005 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002006
2007..
2008
2009 .. table:: System VGPR Work-Item ID Enumeration Values
2010 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
2011
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002012 ======================================== ===== ============================
2013 Enumeration Name Value Description
2014 ======================================== ===== ============================
2015 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
2016 ID.
2017 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
2018 dimensions ID.
2019 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
2020 dimensions ID.
2021 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
2022 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002023
2024.. _amdgpu-amdhsa-initial-kernel-execution-state:
2025
2026Initial Kernel Execution State
2027~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2028
2029This section defines the register state that will be set up by the packet
2030processor prior to the start of execution of every wavefront. This is limited by
2031the constraints of the hardware controllers of CP/ADC/SPI.
2032
2033The order of the SGPR registers is defined, but the compiler can specify which
2034ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2035fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2036for enabled registers are dense starting at SGPR0: the first enabled register is
2037SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2038an SGPR number.
2039
2040The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
Tony Tye5bbcca62018-03-08 05:46:01 +00002041all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tyef16a45e2017-06-06 20:31:59 +00002042the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2043initialized. These are then immediately followed by the System SGPRs that are
Tony Tye5bbcca62018-03-08 05:46:01 +00002044set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tyef16a45e2017-06-06 20:31:59 +00002045dispatch.
2046
2047SGPR register initial state is defined in
2048:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2049
2050 .. table:: SGPR Register Set Up Order
2051 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2052
2053 ========== ========================== ====== ==============================
2054 SGPR Order Name Number Description
2055 (kernel descriptor enable of
2056 field) SGPRs
2057 ========== ========================== ====== ==============================
2058 First Private Segment Buffer 4 V# that can be used, together
Tony Tye5bbcca62018-03-08 05:46:01 +00002059 (enable_sgpr_private with Scratch Wavefront Offset
2060 _segment_buffer) as an offset, to access the
2061 private memory space using a
2062 segment address.
Tony Tyef16a45e2017-06-06 20:31:59 +00002063
2064 CP uses the value provided by
2065 the runtime.
2066 then Dispatch Ptr 2 64 bit address of AQL dispatch
2067 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2068 actually executing.
2069 then Queue Ptr 2 64 bit address of amd_queue_t
2070 (enable_sgpr_queue_ptr) object for AQL queue on which
2071 the dispatch packet was
2072 queued.
2073 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2074 (enable_sgpr_kernarg segment. This is directly
2075 _segment_ptr) copied from the
2076 kernarg_address in the kernel
2077 dispatch packet.
2078
2079 Having CP load it once avoids
2080 loading it at the beginning of
2081 every wavefront.
2082 then Dispatch Id 2 64 bit Dispatch ID of the
2083 (enable_sgpr_dispatch_id) dispatch packet being
2084 executed.
2085 then Flat Scratch Init 2 This is 2 SGPRs:
2086 (enable_sgpr_flat_scratch
2087 _init) GFX6
2088 Not supported.
2089 GFX7-GFX8
2090 The first SGPR is a 32 bit
2091 byte offset from
2092 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2093 to per SPI base of memory
2094 for scratch for the queue
2095 executing the kernel
2096 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00002097 from the runtime. (The
2098 Scratch Segment Buffer base
2099 address is
2100 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2101 plus this offset.) The value
Tony Tye5bbcca62018-03-08 05:46:01 +00002102 of Scratch Wavefront Offset must
Tony Tye46d35762017-08-15 20:47:41 +00002103 be added to this offset by
2104 the kernel machine code,
2105 right shifted by 8, and
2106 moved to the FLAT_SCRATCH_HI
2107 SGPR register.
2108 FLAT_SCRATCH_HI corresponds
2109 to SGPRn-4 on GFX7, and
2110 SGPRn-6 on GFX8 (where SGPRn
2111 is the highest numbered SGPR
Tony Tye5bbcca62018-03-08 05:46:01 +00002112 allocated to the wavefront).
Tony Tye46d35762017-08-15 20:47:41 +00002113 FLAT_SCRATCH_HI is
2114 multiplied by 256 (as it is
2115 in units of 256 bytes) and
2116 added to
2117 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye5bbcca62018-03-08 05:46:01 +00002118 to calculate the per wavefront
Tony Tye46d35762017-08-15 20:47:41 +00002119 FLAT SCRATCH BASE in flat
2120 memory instructions that
2121 access the scratch
2122 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002123
2124 The second SGPR is 32 bit
2125 byte size of a single
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002126 work-item's scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002127 usage. CP obtains this from
2128 the runtime, and it is
2129 always a multiple of DWORD.
2130 CP checks that the value in
2131 the kernel dispatch packet
2132 Private Segment Byte Size is
2133 not larger, and requests the
2134 runtime to increase the
2135 queue's scratch size if
2136 necessary. The kernel code
2137 must move it to
2138 FLAT_SCRATCH_LO which is
2139 SGPRn-3 on GFX7 and SGPRn-5
2140 on GFX8. FLAT_SCRATCH_LO is
2141 used as the FLAT SCRATCH
2142 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002143 instructions. Having CP load
2144 it once avoids loading it at
2145 the beginning of every
Tony Tyef59d0712017-11-10 20:51:43 +00002146 wavefront.
2147 GFX9
2148 This is the
Tony Tye46d35762017-08-15 20:47:41 +00002149 64 bit base address of the
2150 per SPI scratch backing
2151 memory managed by SPI for
2152 the queue executing the
2153 kernel dispatch. CP obtains
2154 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002155 divides it if there are
2156 multiple Shader Arrays each
2157 with its own SPI). The value
Tony Tye5bbcca62018-03-08 05:46:01 +00002158 of Scratch Wavefront Offset must
Tony Tyef16a45e2017-06-06 20:31:59 +00002159 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002160 machine code and the result
2161 moved to the FLAT_SCRATCH
2162 SGPR which is SGPRn-6 and
2163 SGPRn-5. It is used as the
2164 FLAT SCRATCH BASE in flat
Tony Tyef59d0712017-11-10 20:51:43 +00002165 memory instructions.
2166 then Private Segment Size 1 The 32 bit byte size of a
2167 (enable_sgpr_private single
2168 work-item's
2169 scratch_segment_size) memory
2170 allocation. This is the
2171 value from the kernel
2172 dispatch packet Private
2173 Segment Byte Size rounded up
2174 by CP to a multiple of
2175 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002176
2177 Having CP load it once avoids
2178 loading it at the beginning of
2179 every wavefront.
2180
2181 This is not used for
2182 GFX7-GFX8 since it is the same
2183 value as the second SGPR of
2184 Flat Scratch Init. However, it
2185 may be needed for GFX9 which
2186 changes the meaning of the
2187 Flat Scratch Init value.
2188 then Grid Work-Group Count X 1 32 bit count of the number of
2189 (enable_sgpr_grid work-groups in the X dimension
2190 _workgroup_count_X) for the grid being
2191 executed. Computed from the
2192 fields in the kernel dispatch
2193 packet as ((grid_size.x +
2194 workgroup_size.x - 1) /
2195 workgroup_size.x).
2196 then Grid Work-Group Count Y 1 32 bit count of the number of
2197 (enable_sgpr_grid work-groups in the Y dimension
2198 _workgroup_count_Y && for the grid being
2199 less than 16 previous executed. Computed from the
2200 SGPRs) fields in the kernel dispatch
2201 packet as ((grid_size.y +
2202 workgroup_size.y - 1) /
2203 workgroupSize.y).
2204
2205 Only initialized if <16
2206 previous SGPRs initialized.
2207 then Grid Work-Group Count Z 1 32 bit count of the number of
2208 (enable_sgpr_grid work-groups in the Z dimension
2209 _workgroup_count_Z && for the grid being
2210 less than 16 previous executed. Computed from the
2211 SGPRs) fields in the kernel dispatch
2212 packet as ((grid_size.z +
2213 workgroup_size.z - 1) /
2214 workgroupSize.z).
2215
2216 Only initialized if <16
2217 previous SGPRs initialized.
2218 then Work-Group Id X 1 32 bit work-group id in X
2219 (enable_sgpr_workgroup_id dimension of grid for
2220 _X) wavefront.
2221 then Work-Group Id Y 1 32 bit work-group id in Y
2222 (enable_sgpr_workgroup_id dimension of grid for
2223 _Y) wavefront.
2224 then Work-Group Id Z 1 32 bit work-group id in Z
2225 (enable_sgpr_workgroup_id dimension of grid for
2226 _Z) wavefront.
Tony Tye5bbcca62018-03-08 05:46:01 +00002227 then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tyef16a45e2017-06-06 20:31:59 +00002228 (enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye5bbcca62018-03-08 05:46:01 +00002229 _info) threadgroup_size_in_wavefronts[5:0]}
2230 then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tyef16a45e2017-06-06 20:31:59 +00002231 (enable_sgpr_private of scratch base of queue
Tony Tye5bbcca62018-03-08 05:46:01 +00002232 _segment_wavefront_offset) executing the kernel
Tony Tyef16a45e2017-06-06 20:31:59 +00002233 dispatch. Must be used as an
2234 offset with Private
2235 segment address when using
2236 Scratch Segment Buffer. It
2237 must be used to set up FLAT
2238 SCRATCH for flat addressing
2239 (see
2240 :ref:`amdgpu-amdhsa-flat-scratch`).
2241 ========== ========================== ====== ==============================
2242
2243The order of the VGPR registers is defined, but the compiler can specify which
2244ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2245fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2246for enabled registers are dense starting at VGPR0: the first enabled register is
2247VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2248VGPR number.
2249
2250VGPR register initial state is defined in
2251:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2252
2253 .. table:: VGPR Register Set Up Order
2254 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2255
2256 ========== ========================== ====== ==============================
2257 VGPR Order Name Number Description
2258 (kernel descriptor enable of
2259 field) VGPRs
2260 ========== ========================== ====== ==============================
2261 First Work-Item Id X 1 32 bit work item id in X
2262 (Always initialized) dimension of work-group for
2263 wavefront lane.
2264 then Work-Item Id Y 1 32 bit work item id in Y
2265 (enable_vgpr_workitem_id dimension of work-group for
2266 > 0) wavefront lane.
2267 then Work-Item Id Z 1 32 bit work item id in Z
2268 (enable_vgpr_workitem_id dimension of work-group for
2269 > 1) wavefront lane.
2270 ========== ========================== ====== ==============================
2271
Hiroshi Inouebcadfee2018-04-12 05:53:20 +00002272The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
Tony Tyef16a45e2017-06-06 20:31:59 +00002273
22741. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2275 registers.
22762. Work-group Id registers X, Y, Z are set by ADC which supports any
2277 combination including none.
Tony Tye5bbcca62018-03-08 05:46:01 +000022783. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
2279 its value cannot included with the flat scratch init value which is per queue.
Tony Tyef16a45e2017-06-06 20:31:59 +000022804. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2281 or (X, Y, Z).
2282
2283Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2284value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2285
2286The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye07d9f102017-11-10 01:00:54 +00002287has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tyef16a45e2017-06-06 20:31:59 +00002288instructions (GFX9).
2289
2290If buffer operations are used then the compiler can generate a V# with the
2291following properties:
2292
2293* base address of 0
2294* no swizzle
2295* ATC: 1 if IOMMU present (such as APU)
2296* ptr64: 1
2297* MTYPE set to support memory coherence that matches the runtime (such as CC for
2298 APU and NC for dGPU).
2299
2300.. _amdgpu-amdhsa-kernel-prolog:
2301
2302Kernel Prolog
2303~~~~~~~~~~~~~
2304
2305.. _amdgpu-amdhsa-m0:
2306
2307M0
2308++
2309
2310GFX6-GFX8
2311 The M0 register must be initialized with a value at least the total LDS size
2312 if the kernel may access LDS via DS or flat operations. Total LDS size is
2313 available in dispatch packet. For M0, it is also possible to use maximum
2314 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2315 GFX7-GFX8).
2316GFX9
2317 The M0 register is not used for range checking LDS accesses and so does not
2318 need to be initialized in the prolog.
2319
2320.. _amdgpu-amdhsa-flat-scratch:
2321
2322Flat Scratch
2323++++++++++++
2324
2325If the kernel may use flat operations to access scratch memory, the prolog code
2326must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye5bbcca62018-03-08 05:46:01 +00002327are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tyef16a45e2017-06-06 20:31:59 +00002328Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2329
2330GFX6
2331 Flat scratch is not supported.
2332
Tony Tye07d9f102017-11-10 01:00:54 +00002333GFX7-GFX8
Tony Tyef16a45e2017-06-06 20:31:59 +00002334 1. The low word of Flat Scratch Init is 32 bit byte offset from
2335 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2336 being managed by SPI for the queue executing the kernel dispatch. This is
2337 the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye5bbcca62018-03-08 05:46:01 +00002338 prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tyef16a45e2017-06-06 20:31:59 +00002339 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2340 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2341 by 8 before moving into FLAT_SCRATCH_LO.
2342 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2343 work-items scratch memory usage. This is directly loaded from the kernel
2344 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2345 DWORD. Having CP load it once avoids loading it at the beginning of every
2346 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2347 SIZE.
Tony Tyef59d0712017-11-10 20:51:43 +00002348
Tony Tyef16a45e2017-06-06 20:31:59 +00002349GFX9
2350 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2351 memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye5bbcca62018-03-08 05:46:01 +00002352 prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
Tony Tyef16a45e2017-06-06 20:31:59 +00002353 pair for use as the flat scratch base in flat memory instructions.
2354
2355.. _amdgpu-amdhsa-memory-model:
2356
2357Memory Model
2358~~~~~~~~~~~~
2359
2360This section describes the mapping of LLVM memory model onto AMDGPU machine code
2361(see :ref:`memmodel`). *The implementation is WIP.*
2362
2363.. TODO
2364 Update when implementation complete.
2365
Tony Tyef16a45e2017-06-06 20:31:59 +00002366The AMDGPU backend supports the memory synchronization scopes specified in
2367:ref:`amdgpu-memory-scopes`.
2368
2369The code sequences used to implement the memory model are defined in table
2370:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2371
2372The sequences specify the order of instructions that a single thread must
2373execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2374to other memory instructions executed by the same thread. This allows them to be
2375moved earlier or later which can allow them to be combined with other instances
2376of the same instruction, or hoisted/sunk out of loops to improve
2377performance. Only the instructions related to the memory model are given;
2378additional ``s_waitcnt`` instructions are required to ensure registers are
2379defined before being used. These may be able to be combined with the memory
2380model ``s_waitcnt`` instructions as described above.
2381
Tony Tye6baa6d22017-10-18 22:16:55 +00002382The AMDGPU backend supports the following memory models:
2383
2384 HSA Memory Model [HSA]_
2385 The HSA memory model uses a single happens-before relation for all address
2386 spaces (see :ref:`amdgpu-address-spaces`).
2387 OpenCL Memory Model [OpenCL]_
2388 The OpenCL memory model which has separate happens-before relations for the
2389 global and local address spaces. Only a fence specifying both global and
2390 local address space, and seq_cst instructions join the relationships. Since
2391 the LLVM ``memfence`` instruction does not allow an address space to be
2392 specified the OpenCL fence has to convervatively assume both local and
2393 global address space was specified. However, optimizations can often be
2394 done to eliminate the additional ``s_waitcnt`` instructions when there are
2395 no intervening memory instructions which access the corresponding address
2396 space. The code sequences in the table indicate what can be omitted for the
2397 OpenCL memory. The target triple environment is used to determine if the
2398 source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tyef16a45e2017-06-06 20:31:59 +00002399
2400``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2401operations.
2402
2403``buffer/global/flat_load/store/atomic`` instructions to global memory are
2404termed vector memory operations.
2405
2406For GFX6-GFX9:
2407
2408* Each agent has multiple compute units (CU).
2409* Each CU has multiple SIMDs that execute wavefronts.
2410* The wavefronts for a single work-group are executed in the same CU but may be
2411 executed by different SIMDs.
2412* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2413 executing on it.
2414* All LDS operations of a CU are performed as wavefront wide operations in a
2415 global order and involve no caching. Completion is reported to a wavefront in
2416 execution order.
2417* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye5bbcca62018-03-08 05:46:01 +00002418 CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tyef16a45e2017-06-06 20:31:59 +00002419 can be reordered relative to each other, which can result in reordering the
2420 visibility of vector memory operations with respect to LDS operations of other
2421 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002422 ensure synchronization between LDS operations and vector memory operations
Tony Tye5bbcca62018-03-08 05:46:01 +00002423 between wavefronts of a work-group, but not between operations performed by the
Tony Tyef16a45e2017-06-06 20:31:59 +00002424 same wavefront.
2425* The vector memory operations are performed as wavefront wide operations and
2426 completion is reported to a wavefront in execution order. The exception is
Tony Tye07d9f102017-11-10 01:00:54 +00002427 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tyef16a45e2017-06-06 20:31:59 +00002428 vector memory order if they access LDS memory, and out of LDS operation order
2429 if they access global memory.
Tony Tye6baa6d22017-10-18 22:16:55 +00002430* The vector memory operations access a single vector L1 cache shared by all
2431 SIMDs a CU. Therefore, no special action is required for coherence between the
2432 lanes of a single wavefront, or for coherence between wavefronts in the same
Tony Tye5bbcca62018-03-08 05:46:01 +00002433 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
Tony Tye6baa6d22017-10-18 22:16:55 +00002434 executing in different work-groups as they may be executing on different CUs.
Tony Tyef16a45e2017-06-06 20:31:59 +00002435* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2436 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2437 scalar operations are used in a restricted way so do not impact the memory
2438 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2439* The vector and scalar memory operations use an L2 cache shared by all CUs on
2440 the same agent.
2441* The L2 cache has independent channels to service disjoint ranges of virtual
2442 addresses.
2443* Each CU has a separate request queue per channel. Therefore, the vector and
Tony Tye5bbcca62018-03-08 05:46:01 +00002444 scalar memory operations performed by wavefronts executing in different work-groups
Tony Tyef16a45e2017-06-06 20:31:59 +00002445 (which may be executing on different CUs) of an agent can be reordered
2446 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002447 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002448 previous vector memory operation has completed before executing a subsequent
2449 vector memory or LDS operation and so can be used to meet the requirements of
2450 acquire and release.
2451* The L2 cache can be kept coherent with other agents on some targets, or ranges
2452 of virtual addresses can be set up to bypass it to ensure system coherence.
2453
Tony Tye07d9f102017-11-10 01:00:54 +00002454Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tyef16a45e2017-06-06 20:31:59 +00002455or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2456memory, atomic memory orderings are not meaningful and all accesses are treated
2457as non-atomic.
2458
2459Constant address space uses ``buffer/global_load`` instructions (or equivalent
2460scalar memory instructions). Since the constant address space contents do not
2461change during the execution of a kernel dispatch it is not legal to perform
2462stores, and atomic memory orderings are not meaningful and all access are
2463treated as non-atomic.
2464
2465A memory synchronization scope wider than work-group is not meaningful for the
2466group (LDS) address space and is treated as work-group.
2467
2468The memory model does not support the region address space which is treated as
2469non-atomic.
2470
2471Acquire memory ordering is not meaningful on store atomic instructions and is
2472treated as non-atomic.
2473
2474Release memory ordering is not meaningful on load atomic instructions and is
2475treated a non-atomic.
2476
2477Acquire-release memory ordering is not meaningful on load or store atomic
2478instructions and is treated as acquire and release respectively.
2479
2480AMDGPU backend only uses scalar memory operations to access memory that is
2481proven to not change during the execution of the kernel dispatch. This includes
2482constant address space and global address space for program scope const
2483variables. Therefore the kernel machine code does not have to maintain the
2484scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2485and vector L1 caches are invalidated between kernel dispatches by CP since
2486constant address space data may change between kernel dispatch executions. See
2487:ref:`amdgpu-amdhsa-memory-spaces`.
2488
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002489The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002490case the AMDGPU backend ensures the memory location used to spill is never
2491accessed by vector memory operations at the same time. If scalar writes are used
2492then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2493return since the locations may be used for vector memory instructions by a
Tony Tye5bbcca62018-03-08 05:46:01 +00002494future wavefront that uses the same scratch area, or a function call that creates a
Tony Tyef16a45e2017-06-06 20:31:59 +00002495frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2496as all scalar writes are write-before-read in the same thread.
2497
Tony Tye6baa6d22017-10-18 22:16:55 +00002498Scratch backing memory (which is used for the private address space)
2499is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2500address space is only accessed by a single thread, and is always
2501write-before-read, there is never a need to invalidate these entries from the L1
2502cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2503volatile cache lines.
Tony Tyef16a45e2017-06-06 20:31:59 +00002504
2505On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye6baa6d22017-10-18 22:16:55 +00002506to invalidate the L2 cache. This also causes it to be treated as
2507non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2508(cache coherent) and so the L2 cache will coherent with the CPU and other
2509agents.
Tony Tyef16a45e2017-06-06 20:31:59 +00002510
2511 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2512 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2513
Tony Tye6baa6d22017-10-18 22:16:55 +00002514 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002515 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2516 Ordering Sync Scope Address
2517 Space
Tony Tye6baa6d22017-10-18 22:16:55 +00002518 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002519 **Non-Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002520 -----------------------------------------------------------------------------------
2521 load *none* *none* - global - !volatile & !nontemporal
2522 - generic
2523 - private 1. buffer/global/flat_load
2524 - constant
2525 - volatile & !nontemporal
2526
Tony Tyef16a45e2017-06-06 20:31:59 +00002527 1. buffer/global/flat_load
2528 glc=1
Tony Tye6baa6d22017-10-18 22:16:55 +00002529
2530 - nontemporal
2531
2532 1. buffer/global/flat_load
2533 glc=1 slc=1
2534
Tony Tyef16a45e2017-06-06 20:31:59 +00002535 load *none* *none* - local 1. ds_load
Tony Tye6baa6d22017-10-18 22:16:55 +00002536 store *none* *none* - global - !nontemporal
Tony Tyef16a45e2017-06-06 20:31:59 +00002537 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002538 - private 1. buffer/global/flat_store
2539 - constant
2540 - nontemporal
2541
2542 1. buffer/global/flat_stote
2543 glc=1 slc=1
2544
Tony Tyef16a45e2017-06-06 20:31:59 +00002545 store *none* *none* - local 1. ds_store
2546 **Unordered Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002547 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002548 load atomic unordered *any* *any* *Same as non-atomic*.
2549 store atomic unordered *any* *any* *Same as non-atomic*.
2550 atomicrmw unordered *any* *any* *Same as monotonic
2551 atomic*.
2552 **Monotonic Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002553 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002554 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2555 - wavefront - generic
2556 - workgroup
2557 load atomic monotonic - singlethread - local 1. ds_load
2558 - wavefront
2559 - workgroup
2560 load atomic monotonic - agent - global 1. buffer/global/flat_load
2561 - system - generic glc=1
2562 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2563 - wavefront - generic
2564 - workgroup
2565 - agent
2566 - system
2567 store atomic monotonic - singlethread - local 1. ds_store
2568 - wavefront
2569 - workgroup
2570 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2571 - wavefront - generic
2572 - workgroup
2573 - agent
2574 - system
2575 atomicrmw monotonic - singlethread - local 1. ds_atomic
2576 - wavefront
2577 - workgroup
2578 **Acquire Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002579 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002580 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2581 - wavefront - local
2582 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002583 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2584 load atomic acquire - workgroup - local 1. ds_load
2585 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002586
Tony Tye6baa6d22017-10-18 22:16:55 +00002587 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002588 - Must happen before
2589 any following
2590 global/generic
2591 load/load
2592 atomic/store/store
2593 atomic/atomicrmw.
2594 - Ensures any
2595 following global
2596 data read is no
2597 older than the load
2598 atomic value being
2599 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00002600 load atomic acquire - workgroup - generic 1. flat_load
2601 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002602
Tony Tye6baa6d22017-10-18 22:16:55 +00002603 - If OpenCL, omit.
2604 - Must happen before
2605 any following
2606 global/generic
2607 load/load
2608 atomic/store/store
2609 atomic/atomicrmw.
2610 - Ensures any
2611 following global
2612 data read is no
2613 older than the load
2614 atomic value being
2615 acquired.
2616 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tyef16a45e2017-06-06 20:31:59 +00002617 - system glc=1
2618 2. s_waitcnt vmcnt(0)
2619
2620 - Must happen before
2621 following
2622 buffer_wbinvl1_vol.
2623 - Ensures the load
2624 has completed
2625 before invalidating
2626 the cache.
2627
2628 3. buffer_wbinvl1_vol
2629
2630 - Must happen before
2631 any following
2632 global/generic
2633 load/load
2634 atomic/atomicrmw.
2635 - Ensures that
2636 following
2637 loads will not see
2638 stale global data.
2639
2640 load atomic acquire - agent - generic 1. flat_load glc=1
2641 - system 2. s_waitcnt vmcnt(0) &
2642 lgkmcnt(0)
2643
2644 - If OpenCL omit
2645 lgkmcnt(0).
2646 - Must happen before
2647 following
2648 buffer_wbinvl1_vol.
2649 - Ensures the flat_load
2650 has completed
2651 before invalidating
2652 the cache.
2653
2654 3. buffer_wbinvl1_vol
2655
2656 - Must happen before
2657 any following
2658 global/generic
2659 load/load
2660 atomic/atomicrmw.
2661 - Ensures that
2662 following loads
2663 will not see stale
2664 global data.
2665
2666 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2667 - wavefront - local
2668 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002669 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2670 atomicrmw acquire - workgroup - local 1. ds_atomic
2671 2. waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002672
Tony Tye6baa6d22017-10-18 22:16:55 +00002673 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002674 - Must happen before
2675 any following
2676 global/generic
2677 load/load
2678 atomic/store/store
2679 atomic/atomicrmw.
2680 - Ensures any
2681 following global
2682 data read is no
2683 older than the
2684 atomicrmw value
2685 being acquired.
2686
Tony Tye6baa6d22017-10-18 22:16:55 +00002687 atomicrmw acquire - workgroup - generic 1. flat_atomic
2688 2. waitcnt lgkmcnt(0)
2689
2690 - If OpenCL, omit.
2691 - Must happen before
2692 any following
2693 global/generic
2694 load/load
2695 atomic/store/store
2696 atomic/atomicrmw.
2697 - Ensures any
2698 following global
2699 data read is no
2700 older than the
2701 atomicrmw value
2702 being acquired.
2703
2704 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002705 - system 2. s_waitcnt vmcnt(0)
2706
2707 - Must happen before
2708 following
2709 buffer_wbinvl1_vol.
2710 - Ensures the
2711 atomicrmw has
2712 completed before
2713 invalidating the
2714 cache.
2715
2716 3. buffer_wbinvl1_vol
2717
2718 - Must happen before
2719 any following
2720 global/generic
2721 load/load
2722 atomic/atomicrmw.
2723 - Ensures that
2724 following loads
2725 will not see stale
2726 global data.
2727
2728 atomicrmw acquire - agent - generic 1. flat_atomic
2729 - system 2. s_waitcnt vmcnt(0) &
2730 lgkmcnt(0)
2731
2732 - If OpenCL, omit
2733 lgkmcnt(0).
2734 - Must happen before
2735 following
2736 buffer_wbinvl1_vol.
2737 - Ensures the
2738 atomicrmw has
2739 completed before
2740 invalidating the
2741 cache.
2742
2743 3. buffer_wbinvl1_vol
2744
2745 - Must happen before
2746 any following
2747 global/generic
2748 load/load
2749 atomic/atomicrmw.
2750 - Ensures that
2751 following loads
2752 will not see stale
2753 global data.
2754
2755 fence acquire - singlethread *none* *none*
2756 - wavefront
2757 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2758
2759 - If OpenCL and
2760 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002761 not generic, omit.
2762 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002763 currently has no
2764 address space on
2765 the fence need to
2766 conservatively
2767 always generate. If
2768 fence had an
2769 address space then
2770 set to address
2771 space of OpenCL
2772 fence flag, or to
2773 generic if both
2774 local and global
2775 flags are
2776 specified.
2777 - Must happen after
2778 any preceding
2779 local/generic load
2780 atomic/atomicrmw
2781 with an equal or
2782 wider sync scope
2783 and memory ordering
2784 stronger than
2785 unordered (this is
2786 termed the
2787 fence-paired-atomic).
2788 - Must happen before
2789 any following
2790 global/generic
2791 load/load
2792 atomic/store/store
2793 atomic/atomicrmw.
2794 - Ensures any
2795 following global
2796 data read is no
2797 older than the
2798 value read by the
2799 fence-paired-atomic.
2800
Tony Tye6baa6d22017-10-18 22:16:55 +00002801 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2802 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002803
2804 - If OpenCL and
2805 address space is
2806 not generic, omit
2807 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002808 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002809 currently has no
2810 address space on
2811 the fence need to
2812 conservatively
2813 always generate
2814 (see comment for
2815 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002816 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002817 separate s_waitcnt
2818 vmcnt(0) and
2819 s_waitcnt
2820 lgkmcnt(0) to allow
2821 them to be
2822 independently moved
2823 according to the
2824 following rules.
2825 - s_waitcnt vmcnt(0)
2826 must happen after
2827 any preceding
2828 global/generic load
2829 atomic/atomicrmw
2830 with an equal or
2831 wider sync scope
2832 and memory ordering
2833 stronger than
2834 unordered (this is
2835 termed the
2836 fence-paired-atomic).
2837 - s_waitcnt lgkmcnt(0)
2838 must happen after
2839 any preceding
Tony Tye6baa6d22017-10-18 22:16:55 +00002840 local/generic load
Tony Tyef16a45e2017-06-06 20:31:59 +00002841 atomic/atomicrmw
2842 with an equal or
2843 wider sync scope
2844 and memory ordering
2845 stronger than
2846 unordered (this is
2847 termed the
2848 fence-paired-atomic).
2849 - Must happen before
2850 the following
2851 buffer_wbinvl1_vol.
2852 - Ensures that the
2853 fence-paired atomic
2854 has completed
2855 before invalidating
2856 the
2857 cache. Therefore
2858 any following
2859 locations read must
2860 be no older than
2861 the value read by
2862 the
2863 fence-paired-atomic.
2864
2865 2. buffer_wbinvl1_vol
2866
Tony Tye6baa6d22017-10-18 22:16:55 +00002867 - Must happen before any
2868 following global/generic
Tony Tyef16a45e2017-06-06 20:31:59 +00002869 load/load
2870 atomic/store/store
2871 atomic/atomicrmw.
2872 - Ensures that
2873 following loads
2874 will not see stale
2875 global data.
2876
2877 **Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002878 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002879 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2880 - wavefront - local
2881 - generic
2882 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002883
2884 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002885 - Must happen after
2886 any preceding
2887 local/generic
2888 load/store/load
2889 atomic/store
2890 atomic/atomicrmw.
2891 - Must happen before
2892 the following
2893 store.
2894 - Ensures that all
2895 memory operations
2896 to local have
2897 completed before
2898 performing the
2899 store that is being
2900 released.
2901
2902 2. buffer/global/flat_store
2903 store atomic release - workgroup - local 1. ds_store
Tony Tye6baa6d22017-10-18 22:16:55 +00002904 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2905
2906 - If OpenCL, omit.
2907 - Must happen after
2908 any preceding
2909 local/generic
2910 load/store/load
2911 atomic/store
2912 atomic/atomicrmw.
2913 - Must happen before
2914 the following
2915 store.
2916 - Ensures that all
2917 memory operations
2918 to local have
2919 completed before
2920 performing the
2921 store that is being
2922 released.
2923
2924 2. flat_store
2925 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2926 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002927
2928 - If OpenCL, omit
2929 lgkmcnt(0).
2930 - Could be split into
2931 separate s_waitcnt
2932 vmcnt(0) and
2933 s_waitcnt
2934 lgkmcnt(0) to allow
2935 them to be
2936 independently moved
2937 according to the
2938 following rules.
2939 - s_waitcnt vmcnt(0)
2940 must happen after
2941 any preceding
2942 global/generic
2943 load/store/load
2944 atomic/store
2945 atomic/atomicrmw.
2946 - s_waitcnt lgkmcnt(0)
2947 must happen after
2948 any preceding
2949 local/generic
2950 load/store/load
2951 atomic/store
2952 atomic/atomicrmw.
2953 - Must happen before
2954 the following
2955 store.
2956 - Ensures that all
2957 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00002958 to memory have
Tony Tyef16a45e2017-06-06 20:31:59 +00002959 completed before
2960 performing the
2961 store that is being
2962 released.
2963
2964 2. buffer/global/ds/flat_store
2965 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2966 - wavefront - local
2967 - generic
2968 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002969
2970 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002971 - Must happen after
2972 any preceding
2973 local/generic
2974 load/store/load
2975 atomic/store
2976 atomic/atomicrmw.
2977 - Must happen before
2978 the following
2979 atomicrmw.
2980 - Ensures that all
2981 memory operations
2982 to local have
2983 completed before
2984 performing the
2985 atomicrmw that is
2986 being released.
2987
2988 2. buffer/global/flat_atomic
2989 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye6baa6d22017-10-18 22:16:55 +00002990 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2991
2992 - If OpenCL, omit.
2993 - Must happen after
2994 any preceding
2995 local/generic
2996 load/store/load
2997 atomic/store
2998 atomic/atomicrmw.
2999 - Must happen before
3000 the following
3001 atomicrmw.
3002 - Ensures that all
3003 memory operations
3004 to local have
3005 completed before
3006 performing the
3007 atomicrmw that is
3008 being released.
3009
3010 2. flat_atomic
3011 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
3012 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003013
3014 - If OpenCL, omit
3015 lgkmcnt(0).
3016 - Could be split into
3017 separate s_waitcnt
3018 vmcnt(0) and
3019 s_waitcnt
3020 lgkmcnt(0) to allow
3021 them to be
3022 independently moved
3023 according to the
3024 following rules.
3025 - s_waitcnt vmcnt(0)
3026 must happen after
3027 any preceding
3028 global/generic
3029 load/store/load
3030 atomic/store
3031 atomic/atomicrmw.
3032 - s_waitcnt lgkmcnt(0)
3033 must happen after
3034 any preceding
3035 local/generic
3036 load/store/load
3037 atomic/store
3038 atomic/atomicrmw.
3039 - Must happen before
3040 the following
3041 atomicrmw.
3042 - Ensures that all
3043 memory operations
3044 to global and local
3045 have completed
3046 before performing
3047 the atomicrmw that
3048 is being released.
3049
Tony Tye6baa6d22017-10-18 22:16:55 +00003050 2. buffer/global/ds/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003051 fence release - singlethread *none* *none*
3052 - wavefront
3053 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3054
3055 - If OpenCL and
3056 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003057 not generic, omit.
3058 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003059 currently has no
3060 address space on
3061 the fence need to
3062 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003063 always generate. If
3064 fence had an
3065 address space then
3066 set to address
3067 space of OpenCL
3068 fence flag, or to
3069 generic if both
3070 local and global
3071 flags are
3072 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003073 - Must happen after
3074 any preceding
3075 local/generic
3076 load/load
3077 atomic/store/store
3078 atomic/atomicrmw.
3079 - Must happen before
3080 any following store
3081 atomic/atomicrmw
3082 with an equal or
3083 wider sync scope
3084 and memory ordering
3085 stronger than
3086 unordered (this is
3087 termed the
3088 fence-paired-atomic).
3089 - Ensures that all
3090 memory operations
3091 to local have
3092 completed before
3093 performing the
3094 following
3095 fence-paired-atomic.
3096
Tony Tye6baa6d22017-10-18 22:16:55 +00003097 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3098 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003099
3100 - If OpenCL and
3101 address space is
3102 not generic, omit
3103 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003104 - If OpenCL and
3105 address space is
3106 local, omit
3107 vmcnt(0).
3108 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003109 currently has no
3110 address space on
3111 the fence need to
3112 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003113 always generate. If
3114 fence had an
3115 address space then
3116 set to address
3117 space of OpenCL
3118 fence flag, or to
3119 generic if both
3120 local and global
3121 flags are
3122 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003123 - Could be split into
3124 separate s_waitcnt
3125 vmcnt(0) and
3126 s_waitcnt
3127 lgkmcnt(0) to allow
3128 them to be
3129 independently moved
3130 according to the
3131 following rules.
3132 - s_waitcnt vmcnt(0)
3133 must happen after
3134 any preceding
3135 global/generic
3136 load/store/load
3137 atomic/store
3138 atomic/atomicrmw.
3139 - s_waitcnt lgkmcnt(0)
3140 must happen after
3141 any preceding
3142 local/generic
3143 load/store/load
3144 atomic/store
3145 atomic/atomicrmw.
3146 - Must happen before
3147 any following store
3148 atomic/atomicrmw
3149 with an equal or
3150 wider sync scope
3151 and memory ordering
3152 stronger than
3153 unordered (this is
3154 termed the
3155 fence-paired-atomic).
3156 - Ensures that all
3157 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00003158 have
Tony Tyef16a45e2017-06-06 20:31:59 +00003159 completed before
3160 performing the
3161 following
3162 fence-paired-atomic.
3163
3164 **Acquire-Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003165 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003166 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3167 - wavefront - local
3168 - generic
3169 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3170
Tony Tye6baa6d22017-10-18 22:16:55 +00003171 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003172 - Must happen after
3173 any preceding
3174 local/generic
3175 load/store/load
3176 atomic/store
3177 atomic/atomicrmw.
3178 - Must happen before
3179 the following
3180 atomicrmw.
3181 - Ensures that all
3182 memory operations
3183 to local have
3184 completed before
3185 performing the
3186 atomicrmw that is
3187 being released.
3188
Tony Tye6baa6d22017-10-18 22:16:55 +00003189 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003190 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3191 2. s_waitcnt lgkmcnt(0)
3192
Tony Tye6baa6d22017-10-18 22:16:55 +00003193 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003194 - Must happen before
3195 any following
3196 global/generic
3197 load/load
3198 atomic/store/store
3199 atomic/atomicrmw.
3200 - Ensures any
3201 following global
3202 data read is no
3203 older than the load
3204 atomic value being
3205 acquired.
3206
3207 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3208
Tony Tye6baa6d22017-10-18 22:16:55 +00003209 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003210 - Must happen after
3211 any preceding
3212 local/generic
3213 load/store/load
3214 atomic/store
3215 atomic/atomicrmw.
3216 - Must happen before
3217 the following
3218 atomicrmw.
3219 - Ensures that all
3220 memory operations
3221 to local have
3222 completed before
3223 performing the
3224 atomicrmw that is
3225 being released.
3226
3227 2. flat_atomic
3228 3. s_waitcnt lgkmcnt(0)
3229
Tony Tye6baa6d22017-10-18 22:16:55 +00003230 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003231 - Must happen before
3232 any following
3233 global/generic
3234 load/load
3235 atomic/store/store
3236 atomic/atomicrmw.
3237 - Ensures any
3238 following global
3239 data read is no
3240 older than the load
3241 atomic value being
3242 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00003243
3244 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3245 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003246
3247 - If OpenCL, omit
3248 lgkmcnt(0).
3249 - Could be split into
3250 separate s_waitcnt
3251 vmcnt(0) and
3252 s_waitcnt
3253 lgkmcnt(0) to allow
3254 them to be
3255 independently moved
3256 according to the
3257 following rules.
3258 - s_waitcnt vmcnt(0)
3259 must happen after
3260 any preceding
3261 global/generic
3262 load/store/load
3263 atomic/store
3264 atomic/atomicrmw.
3265 - s_waitcnt lgkmcnt(0)
3266 must happen after
3267 any preceding
3268 local/generic
3269 load/store/load
3270 atomic/store
3271 atomic/atomicrmw.
3272 - Must happen before
3273 the following
3274 atomicrmw.
3275 - Ensures that all
3276 memory operations
3277 to global have
3278 completed before
3279 performing the
3280 atomicrmw that is
3281 being released.
3282
Tony Tye6baa6d22017-10-18 22:16:55 +00003283 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003284 3. s_waitcnt vmcnt(0)
3285
3286 - Must happen before
3287 following
3288 buffer_wbinvl1_vol.
3289 - Ensures the
3290 atomicrmw has
3291 completed before
3292 invalidating the
3293 cache.
3294
3295 4. buffer_wbinvl1_vol
3296
3297 - Must happen before
3298 any following
3299 global/generic
3300 load/load
3301 atomic/atomicrmw.
3302 - Ensures that
3303 following loads
3304 will not see stale
3305 global data.
3306
Tony Tye6baa6d22017-10-18 22:16:55 +00003307 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3308 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003309
3310 - If OpenCL, omit
3311 lgkmcnt(0).
3312 - Could be split into
3313 separate s_waitcnt
3314 vmcnt(0) and
3315 s_waitcnt
3316 lgkmcnt(0) to allow
3317 them to be
3318 independently moved
3319 according to the
3320 following rules.
3321 - s_waitcnt vmcnt(0)
3322 must happen after
3323 any preceding
3324 global/generic
3325 load/store/load
3326 atomic/store
3327 atomic/atomicrmw.
3328 - s_waitcnt lgkmcnt(0)
3329 must happen after
3330 any preceding
3331 local/generic
3332 load/store/load
3333 atomic/store
3334 atomic/atomicrmw.
3335 - Must happen before
3336 the following
3337 atomicrmw.
3338 - Ensures that all
3339 memory operations
3340 to global have
3341 completed before
3342 performing the
3343 atomicrmw that is
3344 being released.
3345
3346 2. flat_atomic
3347 3. s_waitcnt vmcnt(0) &
3348 lgkmcnt(0)
3349
3350 - If OpenCL, omit
3351 lgkmcnt(0).
3352 - Must happen before
3353 following
3354 buffer_wbinvl1_vol.
3355 - Ensures the
3356 atomicrmw has
3357 completed before
3358 invalidating the
3359 cache.
3360
3361 4. buffer_wbinvl1_vol
3362
3363 - Must happen before
3364 any following
3365 global/generic
3366 load/load
3367 atomic/atomicrmw.
3368 - Ensures that
3369 following loads
3370 will not see stale
3371 global data.
3372
3373 fence acq_rel - singlethread *none* *none*
3374 - wavefront
3375 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3376
3377 - If OpenCL and
3378 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003379 not generic, omit.
3380 - However,
Tony Tyef16a45e2017-06-06 20:31:59 +00003381 since LLVM
3382 currently has no
3383 address space on
3384 the fence need to
3385 conservatively
3386 always generate
3387 (see comment for
3388 previous fence).
3389 - Must happen after
3390 any preceding
3391 local/generic
3392 load/load
3393 atomic/store/store
3394 atomic/atomicrmw.
3395 - Must happen before
3396 any following
3397 global/generic
3398 load/load
3399 atomic/store/store
3400 atomic/atomicrmw.
3401 - Ensures that all
3402 memory operations
3403 to local have
3404 completed before
3405 performing any
3406 following global
3407 memory operations.
3408 - Ensures that the
3409 preceding
3410 local/generic load
3411 atomic/atomicrmw
3412 with an equal or
3413 wider sync scope
3414 and memory ordering
3415 stronger than
3416 unordered (this is
3417 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003418 acquire-fence-paired-atomic
3419 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003420 before following
3421 global memory
3422 operations. This
3423 satisfies the
3424 requirements of
3425 acquire.
3426 - Ensures that all
3427 previous memory
3428 operations have
3429 completed before a
3430 following
3431 local/generic store
3432 atomic/atomicrmw
3433 with an equal or
3434 wider sync scope
3435 and memory ordering
3436 stronger than
3437 unordered (this is
3438 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003439 release-fence-paired-atomic
3440 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003441 requirements of
3442 release.
3443
Tony Tye6baa6d22017-10-18 22:16:55 +00003444 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3445 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003446
3447 - If OpenCL and
3448 address space is
3449 not generic, omit
3450 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003451 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003452 currently has no
3453 address space on
3454 the fence need to
3455 conservatively
3456 always generate
3457 (see comment for
3458 previous fence).
3459 - Could be split into
3460 separate s_waitcnt
3461 vmcnt(0) and
3462 s_waitcnt
3463 lgkmcnt(0) to allow
3464 them to be
3465 independently moved
3466 according to the
3467 following rules.
3468 - s_waitcnt vmcnt(0)
3469 must happen after
3470 any preceding
3471 global/generic
3472 load/store/load
3473 atomic/store
3474 atomic/atomicrmw.
3475 - s_waitcnt lgkmcnt(0)
3476 must happen after
3477 any preceding
3478 local/generic
3479 load/store/load
3480 atomic/store
3481 atomic/atomicrmw.
3482 - Must happen before
3483 the following
3484 buffer_wbinvl1_vol.
3485 - Ensures that the
3486 preceding
3487 global/local/generic
3488 load
3489 atomic/atomicrmw
3490 with an equal or
3491 wider sync scope
3492 and memory ordering
3493 stronger than
3494 unordered (this is
3495 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003496 acquire-fence-paired-atomic
3497 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003498 before invalidating
3499 the cache. This
3500 satisfies the
3501 requirements of
3502 acquire.
3503 - Ensures that all
3504 previous memory
3505 operations have
3506 completed before a
3507 following
3508 global/local/generic
3509 store
3510 atomic/atomicrmw
3511 with an equal or
3512 wider sync scope
3513 and memory ordering
3514 stronger than
3515 unordered (this is
3516 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003517 release-fence-paired-atomic
3518 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003519 requirements of
3520 release.
3521
3522 2. buffer_wbinvl1_vol
3523
3524 - Must happen before
3525 any following
3526 global/generic
3527 load/load
3528 atomic/store/store
3529 atomic/atomicrmw.
3530 - Ensures that
3531 following loads
3532 will not see stale
3533 global data. This
3534 satisfies the
3535 requirements of
3536 acquire.
3537
3538 **Sequential Consistent Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003539 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003540 load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003541 - wavefront - local load atomic acquire,
3542 - generic except must generated
3543 all instructions even
3544 for OpenCL.*
3545 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3546 - generic
3547 - Must
3548 happen after
3549 preceding
3550 global/generic load
3551 atomic/store
3552 atomic/atomicrmw
3553 with memory
3554 ordering of seq_cst
3555 and with equal or
3556 wider sync scope.
3557 (Note that seq_cst
3558 fences have their
3559 own s_waitcnt
3560 lgkmcnt(0) and so do
3561 not need to be
3562 considered.)
3563 - Ensures any
3564 preceding
3565 sequential
3566 consistent local
3567 memory instructions
3568 have completed
3569 before executing
3570 this sequentially
3571 consistent
3572 instruction. This
3573 prevents reordering
3574 a seq_cst store
3575 followed by a
3576 seq_cst load. (Note
3577 that seq_cst is
3578 stronger than
3579 acquire/release as
3580 the reordering of
3581 load acquire
3582 followed by a store
3583 release is
3584 prevented by the
3585 waitcnt of
3586 the release, but
3587 there is nothing
3588 preventing a store
3589 release followed by
3590 load acquire from
3591 competing out of
3592 order.)
3593
3594 2. *Following
3595 instructions same as
3596 corresponding load
3597 atomic acquire,
3598 except must generated
3599 all instructions even
3600 for OpenCL.*
3601 load atomic seq_cst - workgroup - local *Same as corresponding
3602 load atomic acquire,
3603 except must generated
3604 all instructions even
3605 for OpenCL.*
3606 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3607 - system - generic vmcnt(0)
3608
3609 - Could be split into
3610 separate s_waitcnt
3611 vmcnt(0)
3612 and s_waitcnt
3613 lgkmcnt(0) to allow
3614 them to be
3615 independently moved
3616 according to the
3617 following rules.
3618 - waitcnt lgkmcnt(0)
3619 must happen after
3620 preceding
3621 global/generic load
3622 atomic/store
3623 atomic/atomicrmw
3624 with memory
3625 ordering of seq_cst
3626 and with equal or
3627 wider sync scope.
3628 (Note that seq_cst
3629 fences have their
3630 own s_waitcnt
3631 lgkmcnt(0) and so do
3632 not need to be
3633 considered.)
3634 - waitcnt vmcnt(0)
3635 must happen after
Tony Tyef16a45e2017-06-06 20:31:59 +00003636 preceding
3637 global/generic load
3638 atomic/store
3639 atomic/atomicrmw
3640 with memory
3641 ordering of seq_cst
3642 and with equal or
3643 wider sync scope.
3644 (Note that seq_cst
3645 fences have their
3646 own s_waitcnt
3647 vmcnt(0) and so do
3648 not need to be
3649 considered.)
3650 - Ensures any
3651 preceding
3652 sequential
3653 consistent global
3654 memory instructions
3655 have completed
3656 before executing
3657 this sequentially
3658 consistent
3659 instruction. This
3660 prevents reordering
3661 a seq_cst store
3662 followed by a
Tony Tye6baa6d22017-10-18 22:16:55 +00003663 seq_cst load. (Note
Tony Tyef16a45e2017-06-06 20:31:59 +00003664 that seq_cst is
3665 stronger than
3666 acquire/release as
3667 the reordering of
3668 load acquire
3669 followed by a store
3670 release is
3671 prevented by the
Tony Tye6baa6d22017-10-18 22:16:55 +00003672 waitcnt of
Tony Tyef16a45e2017-06-06 20:31:59 +00003673 the release, but
3674 there is nothing
3675 preventing a store
3676 release followed by
3677 load acquire from
3678 competing out of
3679 order.)
3680
3681 2. *Following
3682 instructions same as
3683 corresponding load
Tony Tye6baa6d22017-10-18 22:16:55 +00003684 atomic acquire,
3685 except must generated
3686 all instructions even
3687 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003688 store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003689 - wavefront - local store atomic release,
3690 - workgroup - generic except must generated
3691 all instructions even
3692 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003693 store atomic seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003694 - system - generic store atomic release,
3695 except must generated
3696 all instructions even
3697 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003698 atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003699 - wavefront - local atomicrmw acq_rel,
3700 - workgroup - generic except must generated
3701 all instructions even
3702 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003703 atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003704 - system - generic atomicrmw acq_rel,
3705 except must generated
3706 all instructions even
3707 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003708 fence seq_cst - singlethread *none* *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003709 - wavefront fence acq_rel,
3710 - workgroup except must generated
3711 - agent all instructions even
3712 - system for OpenCL.*
3713 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00003714
3715The memory order also adds the single thread optimization constrains defined in
3716table
3717:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3718
3719 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3720 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3721
3722 ============ ==============================================================
3723 LLVM Memory Optimization Constraints
3724 Ordering
3725 ============ ==============================================================
3726 unordered *none*
3727 monotonic *none*
3728 acquire - If a load atomic/atomicrmw then no following load/load
3729 atomic/store/ store atomic/atomicrmw/fence instruction can
3730 be moved before the acquire.
3731 - If a fence then same as load atomic, plus no preceding
3732 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003733 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003734 atomic/store/ store atomic/atomicrmw/fence instruction can
3735 be moved after the release.
3736 - If a fence then same as store atomic, plus no following
3737 associated fence-paired-atomic can be moved before the
3738 fence.
3739 acq_rel Same constraints as both acquire and release.
3740 seq_cst - If a load atomic then same constraints as acquire, plus no
3741 preceding sequentially consistent load atomic/store
3742 atomic/atomicrmw/fence instruction can be moved after the
3743 seq_cst.
3744 - If a store atomic then the same constraints as release, plus
3745 no following sequentially consistent load atomic/store
3746 atomic/atomicrmw/fence instruction can be moved before the
3747 seq_cst.
3748 - If an atomicrmw/fence then same constraints as acq_rel.
3749 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003750
Wei Ding16289cf2017-02-21 18:48:01 +00003751Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003752~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003753
Tony Tyef16a45e2017-06-06 20:31:59 +00003754For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3755(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3756the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003757
Tony Tyef16a45e2017-06-06 20:31:59 +00003758 .. table:: AMDGPU Trap Handler for AMDHSA OS
3759 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003760
Tony Tyef16a45e2017-06-06 20:31:59 +00003761 =================== =============== =============== =======================
3762 Usage Code Sequence Trap Handler Description
3763 Inputs
3764 =================== =============== =============== =======================
3765 reserved ``s_trap 0x00`` Reserved by hardware.
3766 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3767 ``queue_ptr`` ``debugtrap``
3768 ``VGPR0``: intrinsic (not
3769 ``arg`` implemented).
3770 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3771 ``queue_ptr`` terminated and its
3772 associated queue put
3773 into the error state.
3774 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3775 ``queue_ptr`` installed handled
3776 same as ``llvm.trap``.
3777 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3778 breakpoints.
3779 debugger ``s_trap 0x08`` Reserved for debugger.
3780 debugger ``s_trap 0xfe`` Reserved for debugger.
3781 debugger ``s_trap 0xff`` Reserved for debugger.
3782 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003783
Tim Corringhamaf2dfc62018-04-04 13:02:09 +00003784AMDPAL
3785------
3786
3787This section provides code conventions used when the target triple OS is
3788``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
3789from the application/runtime to each invocation of a hardware shader. These
3790parameters include both generic, application-controlled parameters called
3791*user data* as well as system-generated parameters that are a product of the
3792draw or dispatch execution.
3793
3794User Data
3795~~~~~~~~~
3796
3797Each hardware stage has a set of 32-bit *user data registers* which can be
3798written from a command buffer and then loaded into SGPRs when waves are launched
3799via a subsequent dispatch or draw operation. This is the way most arguments are
3800passed from the application/runtime to a hardware shader.
3801
3802Compute User Data
3803~~~~~~~~~~~~~~~~~
3804
3805Compute shader user data mappings are simpler than graphics shaders, and have a
3806fixed mapping.
3807
3808Note that there are always 10 available *user data entries* in registers -
3809entries beyond that limit must be fetched from memory (via the spill table
3810pointer) by the shader.
3811
3812 .. table:: PAL Compute Shader User Data Registers
3813 :name: pal-compute-user-data-registers
3814
3815 ============= ================================
3816 User Register Description
3817 ============= ================================
3818 0 Global Internal Table (32-bit pointer)
3819 1 Per-Shader Internal Table (32-bit pointer)
3820 2 - 11 Application-Controlled User Data (10 32-bit values)
3821 12 Spill Table (32-bit pointer)
3822 13 - 14 Thread Group Count (64-bit pointer)
3823 15 GDS Range
3824 ============= ================================
3825
3826Graphics User Data
3827~~~~~~~~~~~~~~~~~~
3828
3829Graphics pipelines support a much more flexible user data mapping:
3830
3831 .. table:: PAL Graphics Shader User Data Registers
3832 :name: pal-graphics-user-data-registers
3833
3834 ============= ================================
3835 User Register Description
3836 ============= ================================
3837 0 Global Internal Table (32-bit pointer)
3838 + Per-Shader Internal Table (32-bit pointer)
3839 + 1-15 Application Controlled User Data
3840 (1-15 Contiguous 32-bit Values in Registers)
3841 + Spill Table (32-bit pointer)
3842 + Draw Index (First Stage Only)
3843 + Vertex Offset (First Stage Only)
3844 + Instance Offset (First Stage Only)
3845 ============= ================================
3846
3847 The placement of the global internal table remains fixed in the first *user
3848 data SGPR register*. Otherwise all parameters are optional, and can be mapped
3849 to any desired *user data SGPR register*, with the following regstrictions:
3850
3851 * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
3852 activehardware stage in a graphics pipeline (i.e. where the API vertex
3853 shader runs).
3854
3855 * Application-controlled user data must be mapped into a contiguous range of
3856 user data registers.
3857
3858 * The application-controlled user data range supports compaction remapping, so
3859 only *entries* that are actually consumed by the shader must be assigned to
3860 corresponding *registers*. Note that in order to support an efficient runtime
3861 implementation, the remapping must pack *registers* in the same order as
3862 *entries*, with unused *entries* removed.
3863
3864.. _pal_global_internal_table:
3865
3866Global Internal Table
3867~~~~~~~~~~~~~~~~~~~~~
3868
3869The global internal table is a table of *shader resource descriptors* (SRDs) that
3870define how certain engine-wide, runtime-managed resources should be accessed
3871from a shader. The majority of these resources have HW-defined formats, and it
3872is up to the compiler to write/read data as required by the target hardware.
3873
3874The following table illustrates the required format:
3875
3876 .. table:: PAL Global Internal Table
3877 :name: pal-git-table
3878
3879 ============= ================================
3880 Offset Description
3881 ============= ================================
3882 0-3 Graphics Scratch SRD
3883 4-7 Compute Scratch SRD
3884 8-11 ES/GS Ring Output SRD
3885 12-15 ES/GS Ring Input SRD
3886 16-19 GS/VS Ring Output #0
3887 20-23 GS/VS Ring Output #1
3888 24-27 GS/VS Ring Output #2
3889 28-31 GS/VS Ring Output #3
3890 32-35 GS/VS Ring Input SRD
3891 36-39 Tessellation Factor Buffer SRD
3892 40-43 Off-Chip LDS Buffer SRD
3893 44-47 Off-Chip Param Cache Buffer SRD
3894 48-51 Sample Position Buffer SRD
3895 52 vaRange::ShadowDescriptorTable High Bits
3896 ============= ================================
3897
3898 The pointer to the global internal table passed to the shader as user data
3899 is a 32-bit pointer. The top 32 bits should be assumed to be the same as
3900 the top 32 bits of the pipeline, so the shader may use the program
3901 counter's top 32 bits.
3902
Tony Tye46d35762017-08-15 20:47:41 +00003903Unspecified OS
3904--------------
3905
3906This section provides code conventions used when the target triple OS is
3907empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003908
3909Trap Handler ABI
3910~~~~~~~~~~~~~~~~
3911
3912For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3913not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3914instructions are handled as follows:
3915
3916 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3917 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3918
3919 =============== =============== ===========================================
3920 Usage Code Sequence Description
3921 =============== =============== ===========================================
3922 llvm.trap s_endpgm Causes wavefront to be terminated.
3923 llvm.debugtrap *none* Compiler warning given that there is no
3924 trap handler installed.
3925 =============== =============== ===========================================
3926
3927Source Languages
3928================
3929
3930.. _amdgpu-opencl:
3931
3932OpenCL
3933------
3934
Tony Tyef16a45e2017-06-06 20:31:59 +00003935When the language is OpenCL the following differences occur:
3936
39371. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
Tony Tye7a893d42018-03-23 18:45:18 +000039382. The AMDGPU backend appends additional arguments to the kernel's explicit
3939 arguments for the AMDHSA OS (see
3940 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
Tony Tye46d35762017-08-15 20:47:41 +000039413. Additional metadata is generated
Tony Tye7a893d42018-03-23 18:45:18 +00003942 (see :ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003943
Tony Tye7a893d42018-03-23 18:45:18 +00003944 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
3945 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
3946
3947 ======== ==== ========= ===========================================
3948 Position Byte Byte Description
3949 Size Alignment
3950 ======== ==== ========= ===========================================
Tony Tye88441a32018-03-23 18:58:47 +00003951 1 8 8 OpenCL Global Offset X
3952 2 8 8 OpenCL Global Offset Y
3953 3 8 8 OpenCL Global Offset Z
3954 4 8 8 OpenCL address of printf buffer
3955 5 8 8 OpenCL address of virtual queue used by
3956 enqueue_kernel.
3957 6 8 8 OpenCL address of AqlWrap struct used by
3958 enqueue_kernel.
Tony Tye7a893d42018-03-23 18:45:18 +00003959 ======== ==== ========= ===========================================
Tony Tyef16a45e2017-06-06 20:31:59 +00003960
3961.. _amdgpu-hcc:
3962
3963HCC
3964---
3965
Tony Tye7a893d42018-03-23 18:45:18 +00003966When the language is HCC the following differences occur:
Tony Tyef16a45e2017-06-06 20:31:59 +00003967
39681. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3969
Tom Stellard45bb48e2015-06-13 03:28:10 +00003970Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003971---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003972
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003973AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef59d0712017-11-10 20:51:43 +00003974It supports AMDGCN GFX6-GFX9.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003975
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00003976This section describes general syntax for instructions and operands.
3977
3978Instructions
3979~~~~~~~~~~~~
3980
3981.. toctree::
3982 :hidden:
3983
3984 AMDGPUAsmGFX7
3985 AMDGPUAsmGFX8
3986 AMDGPUAsmGFX9
3987 AMDGPUOperandSyntax
3988
3989An instruction has the following syntax:
3990
3991 *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*
3992
3993Note that operands are normally comma-separated while modifiers are space-separated.
3994
3995The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
3996
3997See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
3998:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
3999
4000Note that features under development are not included in this description.
4001
4002For more information about instructions, their semantics and supported combinations of
Tony Tyef16a45e2017-06-06 20:31:59 +00004003operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004004[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00004005
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004006Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00004007~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00004008
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004009The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004010
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004011* SGPR registers: s0, ... or s[0], ...
4012* VGPR registers: v0, ... or v[0], ...
4013* TTMP registers: ttmp0, ... or ttmp[0], ...
4014* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
4015* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
4016* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
4017* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
4018* Register index expressions: v[2*2], s[1-1:2-1]
4019* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00004020
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004021Modifiers
4022~~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00004023
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004024Detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004025
Tony Tyef16a45e2017-06-06 20:31:59 +00004026Instruction Examples
4027~~~~~~~~~~~~~~~~~~~~
4028
4029DS
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004030++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004031
4032.. code-block:: nasm
4033
4034 ds_add_u32 v2, v4 offset:16
4035 ds_write_src2_b64 v2 offset0:4 offset1:8
4036 ds_cmpst_f32 v2, v4, v6
4037 ds_min_rtn_f64 v[8:9], v2, v[4:5]
4038
4039
4040For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
4041
Tony Tyef16a45e2017-06-06 20:31:59 +00004042FLAT
4043++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004044
4045.. code-block:: nasm
4046
4047 flat_load_dword v1, v[3:4]
4048 flat_store_dwordx3 v[3:4], v[5:7]
4049 flat_atomic_swap v1, v[3:4], v5 glc
4050 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
4051 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
4052
4053For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
4054
Tony Tyef16a45e2017-06-06 20:31:59 +00004055MUBUF
4056+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004057
4058.. code-block:: nasm
4059
4060 buffer_load_dword v1, off, s[4:7], s1
4061 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
4062 buffer_store_format_xy v[1:2], off, s[4:7], s1
4063 buffer_wbinvl1
4064 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
4065
4066For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
4067
Tony Tyef16a45e2017-06-06 20:31:59 +00004068SMRD/SMEM
4069+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004070
4071.. code-block:: nasm
4072
4073 s_load_dword s1, s[2:3], 0xfc
4074 s_load_dwordx8 s[8:15], s[2:3], s4
4075 s_load_dwordx16 s[88:103], s[2:3], s4
4076 s_dcache_inv_vol
4077 s_memtime s[4:5]
4078
4079For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
4080
Tony Tyef16a45e2017-06-06 20:31:59 +00004081SOP1
4082++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004083
4084.. code-block:: nasm
4085
4086 s_mov_b32 s1, s2
4087 s_mov_b64 s[0:1], 0x80000000
4088 s_cmov_b32 s1, 200
4089 s_wqm_b64 s[2:3], s[4:5]
4090 s_bcnt0_i32_b64 s1, s[2:3]
4091 s_swappc_b64 s[2:3], s[4:5]
4092 s_cbranch_join s[4:5]
4093
4094For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
4095
Tony Tyef16a45e2017-06-06 20:31:59 +00004096SOP2
4097++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004098
4099.. code-block:: nasm
4100
4101 s_add_u32 s1, s2, s3
4102 s_and_b64 s[2:3], s[4:5], s[6:7]
4103 s_cselect_b32 s1, s2, s3
4104 s_andn2_b32 s2, s4, s6
4105 s_lshr_b64 s[2:3], s[4:5], s6
4106 s_ashr_i32 s2, s4, s6
4107 s_bfm_b64 s[2:3], s4, s6
4108 s_bfe_i64 s[2:3], s[4:5], s6
4109 s_cbranch_g_fork s[4:5], s[6:7]
4110
4111For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
4112
Tony Tyef16a45e2017-06-06 20:31:59 +00004113SOPC
4114++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004115
4116.. code-block:: nasm
4117
4118 s_cmp_eq_i32 s1, s2
4119 s_bitcmp1_b32 s1, s2
4120 s_bitcmp0_b64 s[2:3], s4
4121 s_setvskip s3, s5
4122
4123For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
4124
Tony Tyef16a45e2017-06-06 20:31:59 +00004125SOPP
4126++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004127
4128.. code-block:: nasm
4129
4130 s_barrier
4131 s_nop 2
4132 s_endpgm
4133 s_waitcnt 0 ; Wait for all counters to be 0
4134 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
4135 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
4136 s_sethalt 9
4137 s_sleep 10
4138 s_sendmsg 0x1
4139 s_sendmsg sendmsg(MSG_INTERRUPT)
4140 s_trap 1
4141
4142For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
4143
4144Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00004145of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00004146range or acceptable values.
4147
Tony Tyef16a45e2017-06-06 20:31:59 +00004148VALU
4149++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00004150
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004151For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
4152the assembler will automatically use optimal encoding based on its operands.
4153To force specific encoding, one can add a suffix to the opcode of the instruction:
4154
4155* _e32 for 32-bit VOP1/VOP2/VOPC
4156* _e64 for 64-bit VOP3
4157* _dpp for VOP_DPP
4158* _sdwa for VOP_SDWA
4159
4160VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004161
4162.. code-block:: nasm
4163
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004164 v_mov_b32 v1, v2
4165 v_mov_b32_e32 v1, v2
4166 v_nop
4167 v_cvt_f64_i32_e32 v[1:2], v2
4168 v_floor_f32_e32 v1, v2
4169 v_bfrev_b32_e32 v1, v2
4170 v_add_f32_e32 v1, v2, v3
4171 v_mul_i32_i24_e64 v1, v2, 3
4172 v_mul_i32_i24_e32 v1, -3, v3
4173 v_mul_i32_i24_e32 v1, -100, v3
4174 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4175 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00004176
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004177VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004178
4179.. code-block:: nasm
4180
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004181 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4182 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4183 v_mov_b32 v0, v0 wave_shl:1
4184 v_mov_b32 v0, v0 row_mirror
4185 v_mov_b32 v0, v0 row_bcast:31
4186 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4187 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4188 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00004189
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004190VOP_SDWA examples:
4191
4192.. code-block:: nasm
4193
4194 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4195 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4196 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4197 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4198 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4199
4200For full list of supported instructions, refer to "Vector ALU instructions".
4201
4202HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00004203~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004204
4205AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4206one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00004207
4208.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00004209+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00004210
4211*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004212object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004213
4214.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00004215+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4216
Tom Stellard347ac792015-06-26 21:15:07 +00004217
4218*major*, *minor*, and *stepping* are all integers that describe the instruction
4219set architecture (ISA) version of the assembly program.
4220
4221*vendor* and *arch* are quoted strings. *vendor* should always be equal to
4222"AMD" and *arch* should always be equal to "AMDGPU".
4223
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004224By default, the assembler will derive the ISA version, *vendor*, and *arch*
4225from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004226
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004227.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00004228+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004229
4230This directives specifies that the symbol with given name is a kernel entry point
4231(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00004232
4233.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00004234++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00004235
4236This directive marks the beginning of a list of key / value pairs that are used
4237to specify the amd_kernel_code_t object that will be emitted by the assembler.
4238The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4239any amd_kernel_code_t values that are unspecified a default value will be
4240used. The default value for all keys is 0, with the following exceptions:
4241
4242- *kernel_code_version_major* defaults to 1.
4243- *machine_kind* defaults to 1.
4244- *machine_version_major*, *machine_version_minor*, and
4245 *machine_version_stepping* are derived from the value of the -mcpu option
4246 that is passed to the assembler.
4247- *kernel_code_entry_byte_offset* defaults to 256.
4248- *wavefront_size* defaults to 6.
4249- *kernarg_segment_alignment*, *group_segment_alignment*, and
Tony Tye6baa6d22017-10-18 22:16:55 +00004250 *private_segment_alignment* default to 4. Note that alignments are specified
Tom Stellardff7416b2015-06-26 21:58:31 +00004251 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4252
4253The *.amd_kernel_code_t* directive must be placed immediately after the
4254function label and before any instructions.
4255
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004256For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4257comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00004258
4259Here is an example of a minimal amd_kernel_code_t specification:
4260
Aaron Ballman887ad0e2016-07-19 17:46:55 +00004261.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00004262
4263 .hsa_code_object_version 1,0
4264 .hsa_code_object_isa
4265
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004266 .hsatext
4267 .globl hello_world
4268 .p2align 8
4269 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00004270
4271 hello_world:
4272
4273 .amd_kernel_code_t
4274 enable_sgpr_kernarg_segment_ptr = 1
4275 is_ptr64 = 1
4276 compute_pgm_rsrc1_vgprs = 0
4277 compute_pgm_rsrc1_sgprs = 0
4278 compute_pgm_rsrc2_user_sgpr = 2
4279 kernarg_segment_byte_size = 8
4280 wavefront_sgpr_count = 2
4281 workitem_vgpr_count = 3
4282 .end_amd_kernel_code_t
4283
4284 s_load_dwordx2 s[0:1], s[0:1] 0x0
4285 v_mov_b32 v0, 3.14159
4286 s_waitcnt lgkmcnt(0)
4287 v_mov_b32 v1, s0
4288 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004289 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00004290 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00004291 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004292 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00004293
4294Additional Documentation
4295========================
4296
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004297.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4298.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4299.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4300.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4301.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4302.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4303.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4304.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004305.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4306.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4307.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4308.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4309.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00004310.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004311.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4312.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__