blob: 7d358dfdce4e17f90aaeba8abbcf889bc4737036 [file] [log] [blame]
Eugene Zelenko3507b042018-03-21 17:09:35 +00001=============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
Tony Tye07d9f102017-11-10 01:00:54 +000026 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
Tony Tyef16a45e2017-06-06 20:31:59 +000028
Tony Tye07d9f102017-11-10 01:00:54 +000029 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000035
Tony Tye07d9f102017-11-10 01:00:54 +000036 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000038
Tony Tye07d9f102017-11-10 01:00:54 +000039 ============ ==============================================================
40 Vendor Description
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000045
Tony Tye07d9f102017-11-10 01:00:54 +000046 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
Tony Tyef16a45e2017-06-06 20:31:59 +000048
Tony Tye07d9f102017-11-10 01:00:54 +000049 ============== ============================================================
50 OS Description
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
56 runtime.
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
58 runtime.
59 ============== ============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000060
Tony Tye07d9f102017-11-10 01:00:54 +000061 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
Tony Tyef16a45e2017-06-06 20:31:59 +000063
Tony Tye07d9f102017-11-10 01:00:54 +000064 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
Tony Tye7a893d42018-03-23 18:45:18 +000067 *<empty>* Default.
Tony Tye07d9f102017-11-10 01:00:54 +000068 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000069
70.. _amdgpu-processors:
71
72Processors
73----------
74
75Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
76names from both the *Processor* and *Alternative Processor* can be used.
77
78 .. table:: AMDGPU Processors
Tony Tye07d9f102017-11-10 01:00:54 +000079 :name: amdgpu-processor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000080
Tony Tye31105cc2017-12-11 15:35:27 +000081 =========== =============== ============ ===== ========= ======= ==================
82 Processor Alternative Target dGPU/ Target ROCm Example
83 Processor Triple APU Features Support Products
84 Architecture Supported
85 [Default]
86 =========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000087 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Tony Tye31105cc2017-12-11 15:35:27 +000088 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +000089 ``r600`` ``r600`` dGPU
90 ``r630`` ``r600`` dGPU
91 ``rs880`` ``r600`` dGPU
92 ``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000093 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Tony Tye31105cc2017-12-11 15:35:27 +000094 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +000095 ``rv710`` ``r600`` dGPU
96 ``rv730`` ``r600`` dGPU
97 ``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000098 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Tony Tye31105cc2017-12-11 15:35:27 +000099 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000100 ``cedar`` ``r600`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000101 ``cypress`` ``r600`` dGPU
102 ``juniper`` ``r600`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000103 ``redwood`` ``r600`` dGPU
104 ``sumo`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000106 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000107 ``barts`` ``r600`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000108 ``caicos`` ``r600`` dGPU
109 ``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000110 ``turks`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Tony Tye31105cc2017-12-11 15:35:27 +0000112 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000113 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000114 ``gfx601`` - ``hainan`` ``amdgcn`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000115 - ``oland``
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000116 - ``pitcairn``
117 - ``verde``
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Tony Tye31105cc2017-12-11 15:35:27 +0000119 -----------------------------------------------------------------------------------
120 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
121 - A6 Pro-7050B
122 - A8-7100
123 - A8 Pro-7150B
124 - A10-7300
125 - A10 Pro-7350B
126 - FX-7500
127 - A8-7200P
128 - A10-7400P
129 - FX-7600P
130 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
131 - FirePro W9100
132 - FirePro S9150
133 - FirePro S9170
134 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
135 - Radeon R9 290x
136 - Radeon R390
137 - Radeon R390x
138 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
139 - ``mullins`` - E1-2200
140 - E1-2500
141 - E2-3000
142 - E2-3800
143 - A4-5000
144 - A4-5100
145 - A6-5200
146 - A4 Pro-3340B
147 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
148 - Radeon HD 8770
149 - R7 260
150 - R7 260X
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Tony Tye31105cc2017-12-11 15:35:27 +0000152 -----------------------------------------------------------------------------------
Tony Tye31105cc2017-12-11 15:35:27 +0000153 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
154 [on] - Pro A6-8500B
155 - A8-8600P
156 - Pro A8-8600B
157 - FX-8800P
158 - Pro A12-8800B
159 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
160 [on] - Pro A10-8700B
161 - A10-8780P
162 \ ``amdgcn`` APU - xnack - A10-9600P
163 [on] - A10-9630P
164 - A12-9700P
165 - A12-9730P
166 - FX-9800P
167 - FX-9830P
168 \ ``amdgcn`` APU - xnack - E2-9010
169 [on] - A6-9210
170 - A9-9410
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000171 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
172 - ``tonga`` [off] - FirePro S7100
Tony Tye31105cc2017-12-11 15:35:27 +0000173 - FirePro W7100
174 - Radeon R285
175 - Radeon R9 380
176 - Radeon R9 385
177 - Mobile FirePro
178 M7170
179 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
180 [off] - Radeon R9 Fury
181 - Radeon R9 FuryX
182 - Radeon Pro Duo
183 - FirePro S9300x2
184 - Radeon Instinct MI8
185 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
186 [off] - Radeon RX 480
187 - Radeon Instinct MI6
188 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
189 [off]
190 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
191 [on]
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000192 **GCN GFX9** [AMD-GCN-GFX9]_
Tony Tye31105cc2017-12-11 15:35:27 +0000193 -----------------------------------------------------------------------------------
194 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
195 [off] Frontier Edition
196 - Radeon RX Vega 56
197 - Radeon RX Vega 64
198 - Radeon RX Vega 64
199 Liquid
200 - Radeon Instinct MI25
Tony Tyeb6efb902018-04-14 01:58:10 +0000201 ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G
202 [on] - Ryzen 5 2400G
Matt Arsenault0084adc2018-04-30 19:08:16 +0000203 ``gfx904`` ``amdgcn`` dGPU - xnack *TBA*
204 [off]
205 .. TODO
206 Add product
207 names.
208 ``gfx906`` ``amdgcn`` dGPU - xnack *TBA*
209 [off]
210 .. TODO
211 Add product
212 names.
Tony Tye31105cc2017-12-11 15:35:27 +0000213 =========== =============== ============ ===== ========= ======= ==================
Tony Tye07d9f102017-11-10 01:00:54 +0000214
215.. _amdgpu-target-features:
216
217Target Features
218---------------
219
220Target features control how code is generated to support certain
Tony Tye31105cc2017-12-11 15:35:27 +0000221processor specific features. Not all target features are supported by
222all processors. The runtime must ensure that the features supported by
223the device used to execute the code match the features enabled when
224generating the code. A mismatch of features may result in incorrect
225execution, or a reduction in performance.
226
227The target features supported by each processor, and the default value
228used if not specified explicitly, is listed in
229:ref:`amdgpu-processor-table`.
Tony Tye07d9f102017-11-10 01:00:54 +0000230
231Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
232target features.
233
234For example:
235
236``-mxnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000237 Enable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000238``-mno-xnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000239 Disable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000240
241 .. table:: AMDGPU Target Features
242 :name: amdgpu-target-feature-table
243
Tony Tye31105cc2017-12-11 15:35:27 +0000244 ============== ==================================================
245 Target Feature Description
246 ============== ==================================================
247 -m[no-]xnack Enable/disable generating code that has
248 memory clauses that are compatible with
249 having XNACK replay enabled.
Tony Tye07d9f102017-11-10 01:00:54 +0000250
Tony Tye31105cc2017-12-11 15:35:27 +0000251 This is used for demand paging and page
252 migration. If XNACK replay is enabled in
253 the device, then if a page fault occurs
254 the code may execute incorrectly if the
255 ``xnack`` feature is not enabled. Executing
256 code that has the feature enabled on a
257 device that does not have XNACK replay
258 enabled will execute correctly, but may
259 be less performant than code with the
260 feature disabled.
261 ============== ==================================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000262
263.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000264
265Address Spaces
266--------------
267
Tony Tyef16a45e2017-06-06 20:31:59 +0000268The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000269
Tony Tyef16a45e2017-06-06 20:31:59 +0000270The memory space names used in the table, aside from the region memory space, is
271from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000272
Tony Tyef16a45e2017-06-06 20:31:59 +0000273LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000274
Tony Tyef16a45e2017-06-06 20:31:59 +0000275 .. table:: Address Space Mapping
276 :name: amdgpu-address-space-mapping-table
277
Yaxun Liu0124b542018-02-13 18:00:25 +0000278 ================== =================
Tony Tyef16a45e2017-06-06 20:31:59 +0000279 LLVM Address Space Memory Space
Yaxun Liu0124b542018-02-13 18:00:25 +0000280 ================== =================
281 0 Generic (Flat)
282 1 Global
283 2 Region (GDS)
284 3 Local (group/LDS)
285 4 Constant
286 5 Private (Scratch)
287 6 Constant 32-bit
288 ================== =================
Tony Tyef16a45e2017-06-06 20:31:59 +0000289
290.. _amdgpu-memory-scopes:
291
292Memory Scopes
293-------------
294
295This section provides LLVM memory synchronization scopes supported by the AMDGPU
296backend memory model when the target triple OS is ``amdhsa`` (see
297:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
298
299The memory model supported is based on the HSA memory model [HSA]_ which is
300based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
301relation is transitive over the synchonizes-with relation independent of scope,
302and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye07d9f102017-11-10 01:00:54 +0000303table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000304
305This is different to the OpenCL [OpenCL]_ memory model which does not have scope
306inclusion and requires the memory scopes to exactly match. However, this
307is conservatively correct for OpenCL.
308
Tony Tye07d9f102017-11-10 01:00:54 +0000309 .. table:: AMDHSA LLVM Sync Scopes
310 :name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tyef16a45e2017-06-06 20:31:59 +0000311
312 ================ ==========================================================
313 LLVM Sync Scope Description
314 ================ ==========================================================
315 *none* The default: ``system``.
316
317 Synchronizes with, and participates in modification and
318 seq_cst total orderings with, other operations (except
319 image operations) for all address spaces (except private,
320 or generic that accesses private) provided the other
321 operation's sync scope is:
322
323 - ``system``.
324 - ``agent`` and executed by a thread on the same agent.
325 - ``workgroup`` and executed by a thread in the same
326 workgroup.
327 - ``wavefront`` and executed by a thread in the same
328 wavefront.
329
330 ``agent`` Synchronizes with, and participates in modification and
331 seq_cst total orderings with, other operations (except
332 image operations) for all address spaces (except private,
333 or generic that accesses private) provided the other
334 operation's sync scope is:
335
336 - ``system`` or ``agent`` and executed by a thread on the
337 same agent.
338 - ``workgroup`` and executed by a thread in the same
339 workgroup.
340 - ``wavefront`` and executed by a thread in the same
341 wavefront.
342
343 ``workgroup`` Synchronizes with, and participates in modification and
344 seq_cst total orderings with, other operations (except
345 image operations) for all address spaces (except private,
346 or generic that accesses private) provided the other
347 operation's sync scope is:
348
349 - ``system``, ``agent`` or ``workgroup`` and executed by a
350 thread in the same workgroup.
351 - ``wavefront`` and executed by a thread in the same
352 wavefront.
353
354 ``wavefront`` Synchronizes with, and participates in modification and
355 seq_cst total orderings with, other operations (except
356 image operations) for all address spaces (except private,
357 or generic that accesses private) provided the other
358 operation's sync scope is:
359
360 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
361 and executed by a thread in the same wavefront.
362
363 ``singlethread`` Only synchronizes with, and participates in modification
364 and seq_cst total orderings with, other operations (except
365 image operations) running in the same thread for all
366 address spaces (for example, in signal handlers).
367 ================ ==========================================================
368
369AMDGPU Intrinsics
370-----------------
371
372The AMDGPU backend implements the following intrinsics.
373
374*This section is WIP.*
375
376.. TODO
377 List AMDGPU intrinsics
378
379Code Object
380===========
381
382The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
383can be linked by ``lld`` to produce a standard ELF shared code object which can
384be loaded and executed on an AMDGPU target.
385
386Header
387------
388
389The AMDGPU backend uses the following ELF header:
390
391 .. table:: AMDGPU ELF Header
392 :name: amdgpu-elf-header-table
393
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000394 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000395 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000396 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000397 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
398 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye07d9f102017-11-10 01:00:54 +0000399 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
400 - ``ELFOSABI_AMDGPU_HSA``
401 - ``ELFOSABI_AMDGPU_PAL``
402 - ``ELFOSABI_AMDGPU_MESA3D``
403 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
404 - ``ELFABIVERSION_AMDGPU_PAL``
405 - ``ELFABIVERSION_AMDGPU_MESA3D``
406 ``e_type`` - ``ET_REL``
407 - ``ET_DYN``
Tony Tyef16a45e2017-06-06 20:31:59 +0000408 ``e_machine`` ``EM_AMDGPU``
409 ``e_entry`` 0
Tony Tye07d9f102017-11-10 01:00:54 +0000410 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000411 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000412
413..
414
415 .. table:: AMDGPU ELF Header Enumeration Values
416 :name: amdgpu-elf-header-enumeration-values-table
417
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000418 =============================== =====
419 Name Value
420 =============================== =====
421 ``EM_AMDGPU`` 224
Tony Tye07d9f102017-11-10 01:00:54 +0000422 ``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000423 ``ELFOSABI_AMDGPU_HSA`` 64
424 ``ELFOSABI_AMDGPU_PAL`` 65
425 ``ELFOSABI_AMDGPU_MESA3D`` 66
426 ``ELFABIVERSION_AMDGPU_HSA`` 1
427 ``ELFABIVERSION_AMDGPU_PAL`` 0
428 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
429 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000430
431``e_ident[EI_CLASS]``
Tony Tye07d9f102017-11-10 01:00:54 +0000432 The ELF class is:
433
434 * ``ELFCLASS32`` for ``r600`` architecture.
435
436 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
437 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000438
439``e_ident[EI_DATA]``
Tony Tye07d9f102017-11-10 01:00:54 +0000440 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tyef16a45e2017-06-06 20:31:59 +0000441
442``e_ident[EI_OSABI]``
Tony Tye07d9f102017-11-10 01:00:54 +0000443 One of the following AMD GPU architecture specific OS ABIs
444 (see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000445
Tony Tye07d9f102017-11-10 01:00:54 +0000446 * ``ELFOSABI_NONE`` for *unknown* OS.
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000447
Tony Tye07d9f102017-11-10 01:00:54 +0000448 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tyef16a45e2017-06-06 20:31:59 +0000449
Tony Tye07d9f102017-11-10 01:00:54 +0000450 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
451
452 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000453
Tony Tyef16a45e2017-06-06 20:31:59 +0000454``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000455 The ABI version of the AMD GPU architecture specific OS ABI to which the code
456 object conforms:
457
458 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
459 runtime ABI.
460
461 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
462 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000463
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000464 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye07d9f102017-11-10 01:00:54 +0000465 3D runtime ABI.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000466
Tony Tyef16a45e2017-06-06 20:31:59 +0000467``e_type``
468 Can be one of the following values:
469
470
471 ``ET_REL``
472 The type produced by the AMD GPU backend compiler as it is relocatable code
473 object.
474
475 ``ET_DYN``
476 The type produced by the linker as it is a shared code object.
477
478 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
479
480``e_machine``
Tony Tye07d9f102017-11-10 01:00:54 +0000481 The value ``EM_AMDGPU`` is used for the machine for all processors supported
482 by the ``r600`` and ``amdgcn`` architectures (see
483 :ref:`amdgpu-processor-table`). The specific processor is specified in the
484 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
485 :ref:`amdgpu-elf-header-e_flags-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000486
487``e_entry``
488 The entry point is 0 as the entry points for individual kernels must be
489 selected in order to invoke them through AQL packets.
490
491``e_flags``
Tony Tye07d9f102017-11-10 01:00:54 +0000492 The AMDGPU backend uses the following ELF header flags:
493
494 .. table:: AMDGPU ELF Header ``e_flags``
495 :name: amdgpu-elf-header-e_flags-table
496
497 ================================= ========== =============================
498 Name Value Description
499 ================================= ========== =============================
500 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
501 -------------------------------------------- -----------------------------
502 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
503 mask for
504 ``EF_AMDGPU_MACH_xxx`` values
505 defined in
506 :ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye31105cc2017-12-11 15:35:27 +0000507 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
508 target feature is
509 enabled for all code
510 contained in the code object.
Tony Tye5bbcca62018-03-08 05:46:01 +0000511 If the processor
512 does not support the
513 ``xnack`` target
514 feature then must
515 be 0.
Tony Tye31105cc2017-12-11 15:35:27 +0000516 See
517 :ref:`amdgpu-target-features`.
Tony Tye07d9f102017-11-10 01:00:54 +0000518 ================================= ========== =============================
519
520 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
521 :name: amdgpu-ef-amdgpu-mach-table
522
523 ================================= ========== =============================
524 Name Value Description (see
525 :ref:`amdgpu-processor-table`)
526 ================================= ========== =============================
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000527 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
528 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
529 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
530 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
531 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
532 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
533 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
534 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
535 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
536 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
537 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
538 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
539 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
540 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
541 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
542 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
543 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
544 *reserved* 0x011 - Reserved for ``r600``
545 0x01f architecture processors.
546 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
547 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
548 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
549 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
550 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
551 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
552 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
553 *reserved* 0x027 Reserved.
554 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
555 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
556 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
557 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
558 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
559 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
Matt Arsenault0084adc2018-04-30 19:08:16 +0000560 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
561 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000562 *reserved* 0x030 Reserved.
Tony Tye07d9f102017-11-10 01:00:54 +0000563 ================================= ========== =============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000564
565Sections
566--------
567
568An AMDGPU target ELF code object has the standard ELF sections which include:
569
570 .. table:: AMDGPU ELF Sections
571 :name: amdgpu-elf-sections-table
572
573 ================== ================ =================================
574 Name Type Attributes
575 ================== ================ =================================
576 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
577 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
578 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
579 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
580 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
581 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
582 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
583 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
584 ``.note`` ``SHT_NOTE`` *none*
585 ``.rela``\ *name* ``SHT_RELA`` *none*
586 ``.rela.dyn`` ``SHT_RELA`` *none*
587 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
588 ``.shstrtab`` ``SHT_STRTAB`` *none*
589 ``.strtab`` ``SHT_STRTAB`` *none*
590 ``.symtab`` ``SHT_SYMTAB`` *none*
591 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
592 ================== ================ =================================
593
594These sections have their standard meanings (see [ELF]_) and are only generated
595if needed.
596
597``.debug``\ *\**
598 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
599 DWARF produced by the AMDGPU backend.
600
Tony Tye46d35762017-08-15 20:47:41 +0000601``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000602 The standard sections used by a dynamic loader.
603
604``.note``
605 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
606 backend.
607
608``.rela``\ *name*, ``.rela.dyn``
609 For relocatable code objects, *name* is the name of the section that the
610 relocation records apply. For example, ``.rela.text`` is the section name for
611 relocation records associated with the ``.text`` section.
612
613 For linked shared code objects, ``.rela.dyn`` contains all the relocation
614 records from each of the relocatable code object's ``.rela``\ *name* sections.
615
616 See :ref:`amdgpu-relocation-records` for the relocation records supported by
617 the AMDGPU backend.
618
619``.text``
620 The executable machine code for the kernels and functions they call. Generated
621 as position independent code. See :ref:`amdgpu-code-conventions` for
622 information on conventions used in the isa generation.
623
624.. _amdgpu-note-records:
625
626Note Records
627------------
628
Tony Tye07d9f102017-11-10 01:00:54 +0000629As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
630be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
631aligned. In addition, minimal zero byte padding must be generated to ensure the
632``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
633``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tyef16a45e2017-06-06 20:31:59 +0000634
635The AMDGPU backend code object uses the following ELF note records in the
636``.note`` section. The *Description* column specifies the layout of the note
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000637record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tyef16a45e2017-06-06 20:31:59 +0000638variable size strings have a corresponding ``*_size`` field that specifies the
639number of bytes, including the terminating null character, in the string. The
640string(s) come immediately after the preceding fields.
641
642Additional note records can be present.
643
644 .. table:: AMDGPU ELF Note Records
645 :name: amdgpu-elf-note-records-table
646
Tony Tye46d35762017-08-15 20:47:41 +0000647 ===== ============================== ======================================
648 Name Type Description
649 ===== ============================== ======================================
650 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye46d35762017-08-15 20:47:41 +0000651 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000652
653..
654
655 .. table:: AMDGPU ELF Note Record Enumeration Values
656 :name: amdgpu-elf-note-record-enumeration-values-table
657
Tony Tye46d35762017-08-15 20:47:41 +0000658 ============================== =====
659 Name Value
660 ============================== =====
661 *reserved* 0-9
662 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye07d9f102017-11-10 01:00:54 +0000663 *reserved* 11
Tony Tye46d35762017-08-15 20:47:41 +0000664 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000665
Tony Tye46d35762017-08-15 20:47:41 +0000666``NT_AMD_AMDGPU_HSA_METADATA``
667 Specifies extensible metadata associated with the code objects executed on HSA
668 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
669 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
670 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
671 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000672
Tony Tye46d35762017-08-15 20:47:41 +0000673.. _amdgpu-symbols:
674
675Symbols
676-------
677
678Symbols include the following:
679
680 .. table:: AMDGPU ELF Symbols
681 :name: amdgpu-elf-symbols-table
682
683 ===================== ============== ============= ==================
684 Name Type Section Description
685 ===================== ============== ============= ==================
686 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
687 - ``.rodata``
688 - ``.bss``
689 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
690 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
691 ===================== ============== ============= ==================
692
693Global variable
694 Global variables both used and defined by the compilation unit.
695
696 If the symbol is defined in the compilation unit then it is allocated in the
697 appropriate section according to if it has initialized data or is readonly.
698
699 If the symbol is external then its section is ``STN_UNDEF`` and the loader
700 will resolve relocations using the definition provided by another code object
701 or explicitly defined by the runtime.
702
703 All global symbols, whether defined in the compilation unit or external, are
704 accessed by the machine code indirectly through a GOT table entry. This
705 allows them to be preemptable. The GOT table is only supported when the target
706 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000707
708 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000709 Add description of linked shared object symbols. Seems undefined symbols
710 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000711
Tony Tye46d35762017-08-15 20:47:41 +0000712Kernel descriptor
713 Every HSA kernel has an associated kernel descriptor. It is the address of the
714 kernel descriptor that is used in the AQL dispatch packet used to invoke the
715 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
716 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
717
718Kernel entry point
719 Every HSA kernel also has a symbol for its machine code entry point.
720
721.. _amdgpu-relocation-records:
722
723Relocation Records
724------------------
725
726AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
727relocatable fields are:
728
729``word32``
730 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
731 alignment. These values use the same byte order as other word values in the
732 AMD GPU architecture.
733
734``word64``
735 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
736 alignment. These values use the same byte order as other word values in the
737 AMD GPU architecture.
738
739Following notations are used for specifying relocation calculations:
740
741**A**
742 Represents the addend used to compute the value of the relocatable field.
743
744**G**
745 Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000746 entry's symbol will reside during execution.
Tony Tye46d35762017-08-15 20:47:41 +0000747
748**GOT**
749 Represents the address of the global offset table.
750
751**P**
752 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
753 of the storage unit being relocated (computed using ``r_offset``).
754
755**S**
756 Represents the value of the symbol whose index resides in the relocation
Tony Tyed2884302017-10-16 20:44:29 +0000757 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
758
759**B**
760 Represents the base address of a loaded executable or shared object which is
761 the difference between the ELF address and the actual load address. Relocations
762 using this are only valid in executable or shared objects.
Tony Tye46d35762017-08-15 20:47:41 +0000763
764The following relocation types are supported:
765
766 .. table:: AMDGPU ELF Relocation Records
767 :name: amdgpu-elf-relocation-records-table
768
Tony Tyedb6c9932018-01-30 23:59:43 +0000769 ========================== ======= ===== ========== ==============================
770 Relocation Type Kind Value Field Calculation
771 ========================== ======= ===== ========== ==============================
772 ``R_AMDGPU_NONE`` 0 *none* *none*
Tony Tye223f4c72018-04-13 01:01:27 +0000773 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
774 Dynamic
775 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
776 Dynamic
777 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
Matt Arsenault0084adc2018-04-30 19:08:16 +0000778 Dynamic
Tony Tyedb6c9932018-01-30 23:59:43 +0000779 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
780 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
Tony Tye223f4c72018-04-13 01:01:27 +0000781 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
782 Dynamic
Tony Tyedb6c9932018-01-30 23:59:43 +0000783 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
784 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
785 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
786 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
787 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
788 *reserved* 12
789 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
790 ========================== ======= ===== ========== ==============================
Tony Tye46d35762017-08-15 20:47:41 +0000791
Tony Tye223f4c72018-04-13 01:01:27 +0000792``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
793the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
794
795There is no current OS loader support for 32 bit programs and so
796``R_AMDGPU_ABS32`` is not used.
Matt Arsenault0084adc2018-04-30 19:08:16 +0000797
Tony Tye46d35762017-08-15 20:47:41 +0000798.. _amdgpu-dwarf:
799
800DWARF
801-----
802
Scott Linder16c7bda2018-02-23 23:01:06 +0000803Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
Tony Tye46d35762017-08-15 20:47:41 +0000804information that maps the code object executable code and data to the source
805language constructs. It can be used by tools such as debuggers and profilers.
806
807Address Space Mapping
808~~~~~~~~~~~~~~~~~~~~~
809
810The following address space mapping is used:
811
812 .. table:: AMDGPU DWARF Address Space Mapping
813 :name: amdgpu-dwarf-address-space-mapping-table
814
815 =================== =================
816 DWARF Address Space Memory Space
817 =================== =================
818 1 Private (Scratch)
819 2 Local (group/LDS)
820 *omitted* Global
821 *omitted* Constant
822 *omitted* Generic (Flat)
823 *not supported* Region (GDS)
824 =================== =================
825
826See :ref:`amdgpu-address-spaces` for information on the memory space terminology
827used in the table.
828
829An ``address_class`` attribute is generated on pointer type DIEs to specify the
830DWARF address space of the value of the pointer when it is in the *private* or
831*local* address space. Otherwise the attribute is omitted.
832
833An ``XDEREF`` operation is generated in location list expressions for variables
834that are allocated in the *private* and *local* address space. Otherwise no
835``XDREF`` is omitted.
836
837Register Mapping
838~~~~~~~~~~~~~~~~
839
840*This section is WIP.*
841
842.. TODO
843 Define DWARF register enumeration.
844
845 If want to present a wavefront state then should expose vector registers as
846 64 wide (rather than per work-item view that LLVM uses). Either as separate
847 registers, or a 64x4 byte single register. In either case use a new LANE op
848 (akin to XDREF) to select the current lane usage in a location
849 expression. This would also allow scalar register spilling to vector register
850 lanes to be expressed (currently no debug information is being generated for
851 spilling). If choose a wide single register approach then use LANE in
852 conjunction with PIECE operation to select the dword part of the register for
853 the current lane. If the separate register approach then use LANE to select
854 the register.
855
856Source Text
857~~~~~~~~~~~
858
Scott Linder16c7bda2018-02-23 23:01:06 +0000859Source text for online-compiled programs (e.g. those compiled by the OpenCL
860runtime) may be embedded into the DWARF v5 line table using the ``clang
861-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye46d35762017-08-15 20:47:41 +0000862
Scott Linder16c7bda2018-02-23 23:01:06 +0000863For example:
864
865``-gembed-source``
866 Enable the embedded source DWARF v5 extension.
867``-gno-embed-source``
868 Disable the embedded source DWARF v5 extension.
869
870 .. table:: AMDGPU Debug Options
871 :name: amdgpu-debug-options
872
873 ==================== ==================================================
874 Debug Flag Description
875 ==================== ==================================================
876 -g[no-]embed-source Enable/disable embedding source text in DWARF
877 debug sections. Useful for environments where
878 source cannot be written to disk, such as
879 when performing online compilation.
880 ==================== ==================================================
881
882This option enables one extended content types in the DWARF v5 Line Number
883Program Header, which is used to encode embedded source.
884
885 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
886 :name: amdgpu-dwarf-extended-content-types
887
888 ============================ ======================
889 Content Type Form
890 ============================ ======================
891 ``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
892 ============================ ======================
893
894The source field will contain the UTF-8 encoded, null-terminated source text
895with ``'\n'`` line endings. When the source field is present, consumers can use
896the embedded source instead of attempting to discover the source on disk. When
897the source field is absent, consumers can access the file to get the source
898text.
899
900The above content type appears in the ``file_name_entry_format`` field of the
901line table prologue, and its corresponding value appear in the ``file_names``
902field. The current encoding of the content type is documented in table
903:ref:`amdgpu-dwarf-extended-content-types-encoding`
904
905 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
906 :name: amdgpu-dwarf-extended-content-types-encoding
907
908 ============================ ====================
909 Content Type Value
910 ============================ ====================
911 ``DW_LNCT_LLVM_source`` 0x2001
912 ============================ ====================
Tony Tye46d35762017-08-15 20:47:41 +0000913
914.. _amdgpu-code-conventions:
915
916Code Conventions
917================
918
919This section provides code conventions used for each supported target triple OS
920(see :ref:`amdgpu-target-triples`).
921
922AMDHSA
923------
924
925This section provides code conventions used when the target triple OS is
926``amdhsa`` (see :ref:`amdgpu-target-triples`).
927
928.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000929
Tony Tye01bfd6c2018-03-27 21:20:46 +0000930Code Object Target Identification
931~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
932
933The AMDHSA OS uses the following syntax to specify the code object
934target as a single string:
935
936 ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
937
938Where:
939
940 - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
941 are the same as the *Target Triple* (see
942 :ref:`amdgpu-target-triples`).
943
944 - ``<Processor>`` is the same as the *Processor* (see
945 :ref:`amdgpu-processors`).
946
947 - ``<Target Features>`` is a list of the enabled *Target Features*
948 (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
949 apply to *Processor*. The list must be in the same order as listed
950 in the table :ref:`amdgpu-target-feature-table`. Note that *Target
951 Features* must be included in the list if they are enabled even if
952 that is the default for *Processor*.
953
954For example:
955
956 ``"amdgcn-amd-amdhsa--gfx902+xnack"``
957
Tony Tyef16a45e2017-06-06 20:31:59 +0000958Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000959~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000960
Tony Tye46d35762017-08-15 20:47:41 +0000961The code object metadata specifies extensible metadata associated with the code
962objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
963[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
964(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
965``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
966information necessary to support the ROCM kernel queries. For example, the
967segment sizes needed in a dispatch packet. In addition, a high level language
968runtime may require other information to be included. For example, the AMD
969OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000970
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000971The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000972:doc:`YamlIO`).
973
Tony Tye46d35762017-08-15 20:47:41 +0000974.. TODO
975 Is the string null terminated? It probably should not if YAML allows it to
976 contain null characters, otherwise it should be.
977
Tony Tyef16a45e2017-06-06 20:31:59 +0000978The metadata is represented as a single YAML document comprised of the mapping
979defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
980referenced tables.
981
982For boolean values, the string values of ``false`` and ``true`` are used for
983false and true respectively.
984
985Additional information can be added to the mappings. To avoid conflicts, any
986non-AMD key names should be prefixed by "*vendor-name*.".
987
988 .. table:: AMDHSA Code Object Metadata Mapping
989 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
990
991 ========== ============== ========= =======================================
992 String Key Value Type Required? Description
993 ========== ============== ========= =======================================
994 "Version" sequence of Required - The first integer is the major
995 2 integers version. Currently 1.
996 - The second integer is the minor
997 version. Currently 0.
998 "Printf" sequence of Each string is encoded information
999 strings about a printf function call. The
1000 encoded information is organized as
1001 fields separated by colon (':'):
1002
1003 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1004
1005 where:
1006
1007 ``ID``
1008 A 32 bit integer as a unique id for
1009 each printf function call
1010
1011 ``N``
1012 A 32 bit integer equal to the number
1013 of arguments of printf function call
1014 minus 1
1015
1016 ``S[i]`` (where i = 0, 1, ... , N-1)
1017 32 bit integers for the size in bytes
1018 of the i-th FormatString argument of
1019 the printf function call
1020
1021 FormatString
1022 The format string passed to the
1023 printf function call.
1024 "Kernels" sequence of Required Sequence of the mappings for each
1025 mapping kernel in the code object. See
1026 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
1027 for the definition of the mapping.
1028 ========== ============== ========= =======================================
1029
1030..
1031
1032 .. table:: AMDHSA Code Object Kernel Metadata Mapping
1033 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
1034
1035 ================= ============== ========= ================================
1036 String Key Value Type Required? Description
1037 ================= ============== ========= ================================
1038 "Name" string Required Source name of the kernel.
1039 "SymbolName" string Required Name of the kernel
1040 descriptor ELF symbol.
1041 "Language" string Source language of the kernel.
1042 Values include:
1043
1044 - "OpenCL C"
1045 - "OpenCL C++"
1046 - "HCC"
1047 - "OpenMP"
1048
1049 "LanguageVersion" sequence of - The first integer is the major
1050 2 integers version.
1051 - The second integer is the
1052 minor version.
1053 "Attrs" mapping Mapping of kernel attributes.
1054 See
1055 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1056 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001057 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001058 mapping kernel arguments. See
1059 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1060 for the definition of the mapping.
1061 "CodeProps" mapping Mapping of properties related to
1062 the kernel code. See
1063 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1064 for the mapping definition.
Tony Tyef16a45e2017-06-06 20:31:59 +00001065 ================= ============== ========= ================================
1066
1067..
1068
1069 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1070 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1071
1072 =================== ============== ========= ==============================
1073 String Key Value Type Required? Description
1074 =================== ============== ========= ==============================
Tony Tyee039d0e2018-01-30 23:07:10 +00001075 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
1076 3 integers must be >=1 and the dispatch
1077 work-group size X, Y, Z must
1078 correspond to the specified
1079 values. Defaults to 0, 0, 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001080
1081 Corresponds to the OpenCL
1082 ``reqd_work_group_size``
1083 attribute.
1084 "WorkGroupSizeHint" sequence of The dispatch work-group size
1085 3 integers X, Y, Z is likely to be the
1086 specified values.
1087
1088 Corresponds to the OpenCL
1089 ``work_group_size_hint``
1090 attribute.
1091 "VecTypeHint" string The name of a scalar or vector
1092 type.
1093
1094 Corresponds to the OpenCL
1095 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +00001096
1097 "RuntimeHandle" string The external symbol name
1098 associated with a kernel.
1099 OpenCL runtime allocates a
1100 global buffer for the symbol
1101 and saves the kernel's address
1102 to it, which is used for
1103 device side enqueueing. Only
1104 available for device side
1105 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +00001106 =================== ============== ========= ==============================
1107
1108..
1109
1110 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1111 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1112
1113 ================= ============== ========= ================================
1114 String Key Value Type Required? Description
1115 ================= ============== ========= ================================
1116 "Name" string Kernel argument name.
1117 "TypeName" string Kernel argument type name.
1118 "Size" integer Required Kernel argument size in bytes.
1119 "Align" integer Required Kernel argument alignment in
1120 bytes. Must be a power of two.
1121 "ValueKind" string Required Kernel argument kind that
1122 specifies how to set up the
1123 corresponding argument.
1124 Values include:
1125
1126 "ByValue"
1127 The argument is copied
1128 directly into the kernarg.
1129
1130 "GlobalBuffer"
1131 A global address space pointer
1132 to the buffer data is passed
1133 in the kernarg.
1134
1135 "DynamicSharedPointer"
1136 A group address space pointer
1137 to dynamically allocated LDS
1138 is passed in the kernarg.
1139
1140 "Sampler"
1141 A global address space
1142 pointer to a S# is passed in
1143 the kernarg.
1144
1145 "Image"
1146 A global address space
1147 pointer to a T# is passed in
1148 the kernarg.
1149
1150 "Pipe"
1151 A global address space pointer
1152 to an OpenCL pipe is passed in
1153 the kernarg.
1154
1155 "Queue"
1156 A global address space pointer
1157 to an OpenCL device enqueue
1158 queue is passed in the
1159 kernarg.
1160
1161 "HiddenGlobalOffsetX"
1162 The OpenCL grid dispatch
1163 global offset for the X
1164 dimension is passed in the
1165 kernarg.
1166
1167 "HiddenGlobalOffsetY"
1168 The OpenCL grid dispatch
1169 global offset for the Y
1170 dimension is passed in the
1171 kernarg.
1172
1173 "HiddenGlobalOffsetZ"
1174 The OpenCL grid dispatch
1175 global offset for the Z
1176 dimension is passed in the
1177 kernarg.
1178
1179 "HiddenNone"
1180 An argument that is not used
1181 by the kernel. Space needs to
1182 be left for it, but it does
1183 not need to be set up.
1184
1185 "HiddenPrintfBuffer"
1186 A global address space pointer
1187 to the runtime printf buffer
1188 is passed in kernarg.
1189
1190 "HiddenDefaultQueue"
1191 A global address space pointer
1192 to the OpenCL device enqueue
1193 queue that should be used by
1194 the kernel by default is
1195 passed in the kernarg.
1196
1197 "HiddenCompletionAction"
Yaxun Liuc928f2a2017-10-30 14:30:28 +00001198 A global address space pointer
1199 to help link enqueued kernels into
1200 the ancestor tree for determining
1201 when the parent kernel has finished.
Tony Tyef16a45e2017-06-06 20:31:59 +00001202
1203 "ValueType" string Required Kernel argument value type. Only
1204 present if "ValueKind" is
1205 "ByValue". For vector data
1206 types, the value is for the
1207 element type. Values include:
1208
1209 - "Struct"
1210 - "I8"
1211 - "U8"
1212 - "I16"
1213 - "U16"
1214 - "F16"
1215 - "I32"
1216 - "U32"
1217 - "F32"
1218 - "I64"
1219 - "U64"
1220 - "F64"
1221
1222 .. TODO
1223 How can it be determined if a
1224 vector type, and what size
1225 vector?
1226 "PointeeAlign" integer Alignment in bytes of pointee
1227 type for pointer type kernel
1228 argument. Must be a power
1229 of 2. Only present if
1230 "ValueKind" is
1231 "DynamicSharedPointer".
1232 "AddrSpaceQual" string Kernel argument address space
1233 qualifier. Only present if
1234 "ValueKind" is "GlobalBuffer" or
1235 "DynamicSharedPointer". Values
1236 are:
1237
1238 - "Private"
1239 - "Global"
1240 - "Constant"
1241 - "Local"
1242 - "Generic"
1243 - "Region"
1244
1245 .. TODO
1246 Is GlobalBuffer only Global
1247 or Constant? Is
1248 DynamicSharedPointer always
1249 Local? Can HCC allow Generic?
1250 How can Private or Region
1251 ever happen?
1252 "AccQual" string Kernel argument access
1253 qualifier. Only present if
1254 "ValueKind" is "Image" or
1255 "Pipe". Values
1256 are:
1257
1258 - "ReadOnly"
1259 - "WriteOnly"
1260 - "ReadWrite"
1261
1262 .. TODO
1263 Does this apply to
1264 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001265 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001266 performed by the kernel on the
1267 kernel argument. Only present if
1268 "ValueKind" is "GlobalBuffer",
1269 "Image", or "Pipe". This may be
1270 more restrictive than indicated
1271 by "AccQual" to reflect what the
1272 kernel actual does. If not
1273 present then the runtime must
1274 assume what is implied by
1275 "AccQual" and "IsConst". Values
1276 are:
1277
1278 - "ReadOnly"
1279 - "WriteOnly"
1280 - "ReadWrite"
1281
1282 "IsConst" boolean Indicates if the kernel argument
1283 is const qualified. Only present
1284 if "ValueKind" is
1285 "GlobalBuffer".
1286
1287 "IsRestrict" boolean Indicates if the kernel argument
1288 is restrict qualified. Only
1289 present if "ValueKind" is
1290 "GlobalBuffer".
1291
1292 "IsVolatile" boolean Indicates if the kernel argument
1293 is volatile qualified. Only
1294 present if "ValueKind" is
1295 "GlobalBuffer".
1296
1297 "IsPipe" boolean Indicates if the kernel argument
1298 is pipe qualified. Only present
1299 if "ValueKind" is "Pipe".
1300
1301 .. TODO
1302 Can GlobalBuffer be pipe
1303 qualified?
1304 ================= ============== ========= ================================
1305
1306..
1307
1308 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1309 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1310
1311 ============================ ============== ========= =====================
1312 String Key Value Type Required? Description
1313 ============================ ============== ========= =====================
1314 "KernargSegmentSize" integer Required The size in bytes of
1315 the kernarg segment
1316 that holds the values
1317 of the arguments to
1318 the kernel.
1319 "GroupSegmentFixedSize" integer Required The amount of group
1320 segment memory
1321 required by a
1322 work-group in
1323 bytes. This does not
1324 include any
1325 dynamically allocated
1326 group segment memory
1327 that may be added
1328 when the kernel is
1329 dispatched.
1330 "PrivateSegmentFixedSize" integer Required The amount of fixed
1331 private address space
1332 memory required for a
1333 work-item in
Tony Tye07d9f102017-11-10 01:00:54 +00001334 bytes. If the kernel
1335 uses a dynamic call
1336 stack then additional
Tony Tyef16a45e2017-06-06 20:31:59 +00001337 space must be added
1338 to this value for the
1339 call stack.
1340 "KernargSegmentAlign" integer Required The maximum byte
1341 alignment of
1342 arguments in the
1343 kernarg segment. Must
1344 be a power of 2.
1345 "WavefrontSize" integer Required Wavefront size. Must
1346 be a power of 2.
Tony Tye07d9f102017-11-10 01:00:54 +00001347 "NumSGPRs" integer Required Number of scalar
Tony Tyef16a45e2017-06-06 20:31:59 +00001348 registers used by a
1349 wavefront for
1350 GFX6-GFX9. This
1351 includes the special
1352 SGPRs for VCC, Flat
1353 Scratch (GFX7-GFX9)
1354 and XNACK (for
1355 GFX8-GFX9). It does
1356 not include the 16
1357 SGPR added if a trap
1358 handler is
1359 enabled. It is not
1360 rounded up to the
1361 allocation
1362 granularity.
Tony Tye07d9f102017-11-10 01:00:54 +00001363 "NumVGPRs" integer Required Number of vector
Tony Tyef16a45e2017-06-06 20:31:59 +00001364 registers used by
1365 each work-item for
1366 GFX6-GFX9
Tony Tye07d9f102017-11-10 01:00:54 +00001367 "MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001368 work-group size
1369 supported by the
1370 kernel in work-items.
Tony Tye07d9f102017-11-10 01:00:54 +00001371 Must be >=1 and
Tony Tyee039d0e2018-01-30 23:07:10 +00001372 consistent with
1373 ReqdWorkGroupSize if
1374 not 0, 0, 0.
Konstantin Zhuravlyov06ae4ec2017-11-28 17:51:08 +00001375 "NumSpilledSGPRs" integer Number of stores from
1376 a scalar register to
1377 a register allocator
1378 created spill
1379 location.
1380 "NumSpilledVGPRs" integer Number of stores from
1381 a vector register to
1382 a register allocator
1383 created spill
1384 location.
Tony Tyef16a45e2017-06-06 20:31:59 +00001385 ============================ ============== ========= =====================
1386
1387..
1388
Tony Tyef16a45e2017-06-06 20:31:59 +00001389Kernel Dispatch
1390~~~~~~~~~~~~~~~
1391
1392The HSA architected queuing language (AQL) defines a user space memory interface
1393that can be used to control the dispatch of kernels, in an agent independent
1394way. An agent can have zero or more AQL queues created for it using the ROCm
1395runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1396*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1397mechanics and packet layouts.
1398
1399The packet processor of a kernel agent is responsible for detecting and
1400dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1401packet processor is implemented by the hardware command processor (CP),
1402asynchronous dispatch controller (ADC) and shader processor input controller
1403(SPI).
1404
1405The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1406mode driver to initialize and register the AQL queue with CP.
1407
1408To dispatch a kernel the following actions are performed. This can occur in the
1409CPU host program, or from an HSA kernel executing on a GPU.
1410
14111. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1412 executed is obtained.
14132. A pointer to the kernel descriptor (see
1414 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1415 obtained. It must be for a kernel that is contained in a code object that that
1416 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1417 associated.
14183. Space is allocated for the kernel arguments using the ROCm runtime allocator
1419 for a memory region with the kernarg property for the kernel agent that will
1420 execute the kernel. It must be at least 16 byte aligned.
14214. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00001422 allocation. The layout is defined in the *HSA Programmer's Language Reference*
Tony Tyef16a45e2017-06-06 20:31:59 +00001423 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1424 memory in the same way constant memory is accessed. (Note that the HSA
1425 specification allows an implementation to copy the kernel argument contents to
1426 another location that is accessed by the kernel.)
14275. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1428 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1429 packet. The packet must be set up, and the final write must use an atomic
1430 store release to set the packet kind to ensure the packet contents are
1431 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1432 notify the kernel agent that the AQL queue has been updated. These rules, and
1433 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1434 System Architecture Specification* [HSA]_.
14356. A kernel dispatch packet includes information about the actual dispatch,
1436 such as grid and work-group size, together with information from the code
1437 object about the kernel, such as segment sizes. The ROCm runtime queries on
1438 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001439 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000014407. CP executes micro-code and is responsible for detecting and setting up the
1441 GPU to execute the wavefronts of a kernel dispatch.
14428. CP ensures that when the a wavefront starts executing the kernel machine
1443 code, the scalar general purpose registers (SGPR) and vector general purpose
1444 registers (VGPR) are set up as required by the machine code. The required
1445 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1446 register state is defined in
1447 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
14489. The prolog of the kernel machine code (see
1449 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1450 before continuing executing the machine code that corresponds to the kernel.
145110. When the kernel dispatch has completed execution, CP signals the completion
1452 signal specified in the kernel dispatch packet if not 0.
1453
1454.. _amdgpu-amdhsa-memory-spaces:
1455
1456Memory Spaces
1457~~~~~~~~~~~~~
1458
1459The memory space properties are:
1460
1461 .. table:: AMDHSA Memory Spaces
1462 :name: amdgpu-amdhsa-memory-spaces-table
1463
1464 ================= =========== ======== ======= ==================
1465 Memory Space Name HSA Segment Hardware Address NULL Value
1466 Name Name Size
1467 ================= =========== ======== ======= ==================
1468 Private private scratch 32 0x00000000
1469 Local group LDS 32 0xFFFFFFFF
1470 Global global global 64 0x0000000000000000
1471 Constant constant *same as 64 0x0000000000000000
1472 global*
1473 Generic flat flat 64 0x0000000000000000
1474 Region N/A GDS 32 *not implemented
1475 for AMDHSA*
1476 ================= =========== ======== ======= ==================
1477
1478The global and constant memory spaces both use global virtual addresses, which
1479are the same virtual address space used by the CPU. However, some virtual
1480addresses may only be accessible to the CPU, some only accessible by the GPU,
1481and some by both.
1482
1483Using the constant memory space indicates that the data will not change during
1484the execution of the kernel. This allows scalar read instructions to be
1485used. The vector and scalar L1 caches are invalidated of volatile data before
1486each kernel dispatch execution to allow constant memory to change values between
1487kernel dispatches.
1488
1489The local memory space uses the hardware Local Data Store (LDS) which is
1490automatically allocated when the hardware creates work-groups of wavefronts, and
1491freed when all the wavefronts of a work-group have terminated. The data store
1492(DS) instructions can be used to access it.
1493
1494The private memory space uses the hardware scratch memory support. If the kernel
1495uses scratch, then the hardware allocates memory that is accessed using
1496wavefront lane dword (4 byte) interleaving. The mapping used from private
1497address to physical address is:
1498
1499 ``wavefront-scratch-base +
1500 (private-address * wavefront-size * 4) +
1501 (wavefront-lane-id * 4)``
1502
1503There are different ways that the wavefront scratch base address is determined
1504by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1505memory can be accessed in an interleaved manner using buffer instruction with
Tony Tye5bbcca62018-03-08 05:46:01 +00001506the scratch buffer descriptor and per wavefront scratch offset, by the scratch
Tony Tyef16a45e2017-06-06 20:31:59 +00001507instructions, or by flat instructions. If each lane of a wavefront accesses the
1508same private address, the interleaving results in adjacent dwords being accessed
1509and hence requires fewer cache lines to be fetched. Multi-dword access is not
1510supported except by flat and scratch instructions in GFX9.
1511
1512The generic address space uses the hardware flat address support available in
1513GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1514local appertures), that are outside the range of addressible global memory, to
1515map from a flat address to a private or local address.
1516
1517FLAT instructions can take a flat address and access global, private (scratch)
1518and group (LDS) memory depending in if the address is within one of the
1519apperture ranges. Flat access to scratch requires hardware aperture setup and
1520setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1521access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1522(see :ref:`amdgpu-amdhsa-m0`).
1523
1524To convert between a segment address and a flat address the base address of the
1525appertures address can be used. For GFX7-GFX8 these are available in the
1526:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1527Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1528GFX9 the appature base addresses are directly available as inline constant
1529registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1530address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1531which makes it easier to convert from flat to segment or segment to flat.
1532
Tony Tye46d35762017-08-15 20:47:41 +00001533Image and Samplers
1534~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001535
1536Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1537hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1538HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1539enumeration values for the queries that are not trivially deducible from the S#
1540representation.
1541
1542HSA Signals
1543~~~~~~~~~~~
1544
Tony Tye46d35762017-08-15 20:47:41 +00001545HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1546structure allocated in memory accessible from both the CPU and GPU. The
1547structure is defined by the ROCm runtime and subject to change between releases
1548(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001549
1550.. _amdgpu-amdhsa-hsa-aql-queue:
1551
1552HSA AQL Queue
1553~~~~~~~~~~~~~
1554
Tony Tye46d35762017-08-15 20:47:41 +00001555The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001556between releases (see [AMD-ROCm-github]_). For some processors it contains
1557fields needed to implement certain language features such as the flat address
1558aperture bases. It also contains fields used by CP such as managing the
1559allocation of scratch memory.
1560
1561.. _amdgpu-amdhsa-kernel-descriptor:
1562
1563Kernel Descriptor
1564~~~~~~~~~~~~~~~~~
1565
1566A kernel descriptor consists of the information needed by CP to initiate the
1567execution of a kernel, including the entry point address of the machine code
1568that implements the kernel.
1569
1570Kernel Descriptor for GFX6-GFX9
1571+++++++++++++++++++++++++++++++
1572
1573CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1574
1575 .. table:: Kernel Descriptor for GFX6-GFX9
1576 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1577
Tony Tye6baa6d22017-10-18 22:16:55 +00001578 ======= ======= =============================== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001579 Bits Size Field Name Description
Tony Tye6baa6d22017-10-18 22:16:55 +00001580 ======= ======= =============================== ============================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001581 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001582 address space memory
1583 required for a work-group
1584 in bytes. This does not
1585 include any dynamically
1586 allocated local address
1587 space memory that may be
1588 added when the kernel is
1589 dispatched.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001590 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001591 private address space
1592 memory required for a
1593 work-item in bytes. If
1594 is_dynamic_callstack is 1
1595 then additional space must
1596 be added to this value for
1597 the call stack.
Tony Tye07d9f102017-11-10 01:00:54 +00001598 127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001599 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001600 negative) from base
1601 address of kernel
1602 descriptor to kernel's
1603 entry point instruction
1604 which must be 256 byte
1605 aligned.
Tony Tyee039d0e2018-01-30 23:07:10 +00001606 383:192 24 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001607 bytes
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001608 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001609 program settings used by
1610 CP to set up
1611 ``COMPUTE_PGM_RSRC1``
1612 configuration
1613 register. See
Tony Tye6baa6d22017-10-18 22:16:55 +00001614 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001615 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001616 program settings used by
1617 CP to set up
1618 ``COMPUTE_PGM_RSRC2``
1619 configuration
1620 register. See
1621 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001622 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1623 SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001624 (see
1625 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1626
1627 The total number of SGPR
1628 user data registers
1629 requested must not exceed
1630 16 and match value in
1631 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1632 Any requests beyond 16
1633 will be ignored.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001634 449 1 bit EnableSGPRDispatchPtr *see above*
1635 450 1 bit EnableSGPRQueuePtr *see above*
1636 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1637 452 1 bit EnableSGPRDispatchID *see above*
1638 453 1 bit EnableSGPRFlatScratchInit *see above*
1639 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1640 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1641 should always be 0.
1642 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1643 should always be 0.
1644 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1645 should always be 0.
Tony Tye31105cc2017-12-11 15:35:27 +00001646 463:458 6 bits Reserved, must be 0.
Tony Tye6baa6d22017-10-18 22:16:55 +00001647 511:464 6 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001648 bytes
1649 512 **Total size 64 bytes.**
Tony Tye6baa6d22017-10-18 22:16:55 +00001650 ======= ====================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001651
1652..
1653
1654 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001655 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tyef16a45e2017-06-06 20:31:59 +00001656
Tony Tye3b340612017-06-07 00:46:08 +00001657 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001658 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001659 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001660 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001661 used by each work-item,
1662 granularity is device
1663 specific:
1664
Tony Tye07d9f102017-11-10 01:00:54 +00001665 GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001666 - max_vgpr 1..256
1667 - roundup((max_vgpg + 1)
1668 / 4) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001669
1670 Used by CP to set up
1671 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001672 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001673 used by a wavefront,
1674 granularity is device
1675 specific:
1676
Tony Tye07d9f102017-11-10 01:00:54 +00001677 GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001678 - max_sgpr 1..112
1679 - roundup((max_sgpg + 1)
1680 / 8) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001681 GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001682 - max_sgpr 1..112
1683 - roundup((max_sgpg + 1)
1684 / 16) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001685
1686 Includes the special SGPRs
1687 for VCC, Flat Scratch (for
1688 GFX7 onwards) and XNACK
1689 (for GFX8 onwards). It does
1690 not include the 16 SGPR
1691 added if a trap handler is
1692 enabled.
1693
1694 Used by CP to set up
1695 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001696 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001697
1698 Start executing wavefront
1699 at the specified priority.
1700
1701 CP is responsible for
1702 filling in
1703 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001704 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001705 with specified rounding
1706 mode for single (32
1707 bit) floating point
1708 precision floating point
1709 operations.
1710
1711 Floating point rounding
1712 mode values are defined in
1713 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1714
1715 Used by CP to set up
1716 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001717 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001718 with specified rounding
1719 denorm mode for half/double (16
1720 and 64 bit) floating point
1721 precision floating point
1722 operations.
1723
1724 Floating point rounding
1725 mode values are defined in
1726 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1727
1728 Used by CP to set up
1729 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001730 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001731 with specified denorm mode
1732 for single (32
1733 bit) floating point
1734 precision floating point
1735 operations.
1736
1737 Floating point denorm mode
1738 values are defined in
1739 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1740
1741 Used by CP to set up
1742 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001743 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001744 with specified denorm mode
1745 for half/double (16
1746 and 64 bit) floating point
1747 precision floating point
1748 operations.
1749
1750 Floating point denorm mode
1751 values are defined in
1752 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1753
1754 Used by CP to set up
1755 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001756 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001757
1758 Start executing wavefront
1759 in privilege trap handler
1760 mode.
1761
1762 CP is responsible for
1763 filling in
1764 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001765 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001766 with DX10 clamp mode
1767 enabled. Used by the vector
Tony Tye6baa6d22017-10-18 22:16:55 +00001768 ALU to force DX10 style
Tony Tyef16a45e2017-06-06 20:31:59 +00001769 treatment of NaN's (when
1770 set, clamp NaN to zero,
1771 otherwise pass NaN
1772 through).
1773
1774 Used by CP to set up
1775 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001776 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001777
1778 Start executing wavefront
1779 in single step mode.
1780
1781 CP is responsible for
1782 filling in
1783 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001784 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001785 with IEEE mode
1786 enabled. Floating point
1787 opcodes that support
1788 exception flag gathering
1789 will quiet and propagate
1790 signaling-NaN inputs per
1791 IEEE 754-2008. Min_dx10 and
1792 max_dx10 become IEEE
1793 754-2008 compliant due to
1794 signaling-NaN propagation
1795 and quieting.
1796
1797 Used by CP to set up
1798 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001799 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001800
1801 Only one work-group allowed
1802 to execute on a compute
1803 unit.
1804
1805 CP is responsible for
1806 filling in
1807 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001808 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001809
1810 Flag that can be used to
1811 control debugging code.
1812
1813 CP is responsible for
1814 filling in
1815 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye07d9f102017-11-10 01:00:54 +00001816 26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001817 Reserved, must be 0.
1818 GFX9
1819 Wavefront starts execution
1820 with specified fp16 overflow
1821 mode.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001822
Tony Tye6baa6d22017-10-18 22:16:55 +00001823 - If 0, fp16 overflow generates
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001824 +/-INF values.
Tony Tye6baa6d22017-10-18 22:16:55 +00001825 - If 1, fp16 overflow that is the
1826 result of an +/-INF input value
1827 or divide by 0 produces a +/-INF,
1828 otherwise clamps computed
1829 overflow to +/-MAX_FP16 as
1830 appropriate.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001831
1832 Used by CP to set up
1833 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye6baa6d22017-10-18 22:16:55 +00001834 31:27 5 bits Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001835 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001836 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001837
1838..
1839
1840 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1841 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1842
Tony Tye3b340612017-06-07 00:46:08 +00001843 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001844 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001845 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001846 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye5bbcca62018-03-08 05:46:01 +00001847 _WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001848 system register (see
1849 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1850
1851 Used by CP to set up
1852 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001853 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001854 user data registers
1855 requested. This number must
1856 match the number of user
1857 data registers enabled.
1858
1859 Used by CP to set up
1860 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001861 6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tyef16a45e2017-06-06 20:31:59 +00001862 TRAP instruction which
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00001863 requires a trap handler to
Tony Tyef16a45e2017-06-06 20:31:59 +00001864 be enabled.
1865
1866 CP sets
1867 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
1868 if the runtime has
1869 installed a trap handler
1870 regardless of the setting
1871 of this field.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001872 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001873 system SGPR register for
1874 the work-group id in the X
1875 dimension (see
1876 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1877
1878 Used by CP to set up
1879 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001880 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001881 system SGPR register for
1882 the work-group id in the Y
1883 dimension (see
1884 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1885
1886 Used by CP to set up
1887 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001888 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001889 system SGPR register for
1890 the work-group id in the Z
1891 dimension (see
1892 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1893
1894 Used by CP to set up
1895 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001896 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001897 system SGPR register for
1898 work-group information (see
1899 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1900
1901 Used by CP to set up
1902 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001903 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001904 VGPR system registers used
1905 for the work-item ID.
1906 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1907 defines the values.
1908
1909 Used by CP to set up
1910 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001911 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001912
1913 Wavefront starts execution
1914 with address watch
1915 exceptions enabled which
1916 are generated when L1 has
1917 witnessed a thread access
1918 an *address of
1919 interest*.
1920
1921 CP is responsible for
1922 filling in the address
1923 watch bit in
1924 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1925 according to what the
1926 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001927 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001928
1929 Wavefront starts execution
1930 with memory violation
1931 exceptions exceptions
1932 enabled which are generated
1933 when a memory violation has
Tony Tye5bbcca62018-03-08 05:46:01 +00001934 occurred for this wavefront from
Tony Tyef16a45e2017-06-06 20:31:59 +00001935 L1 or LDS
1936 (write-to-read-only-memory,
1937 mis-aligned atomic, LDS
1938 address out of range,
1939 illegal address, etc.).
1940
1941 CP sets the memory
1942 violation bit in
1943 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1944 according to what the
1945 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001946 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001947
1948 CP uses the rounded value
1949 from the dispatch packet,
1950 not this value, as the
1951 dispatch may contain
1952 dynamically allocated group
1953 segment memory. CP writes
1954 directly to
1955 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1956
1957 Amount of group segment
1958 (LDS) to allocate for each
1959 work-group. Granularity is
1960 device specific:
1961
1962 GFX6:
1963 roundup(lds-size / (64 * 4))
1964 GFX7-GFX9:
1965 roundup(lds-size / (128 * 4))
1966
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001967 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1968 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001969 enabled.
1970
1971 Used by CP to set up
1972 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1973 (set from bits 0..6).
1974
1975 IEEE 754 FP Invalid
1976 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001977 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1978 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001979 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001980 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1981 _DIVISION_BY_ZERO Zero
1982 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1983 _OVERFLOW
1984 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1985 _UNDERFLOW
1986 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1987 _INEXACT
1988 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1989 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001990 only)
Tony Tye6baa6d22017-10-18 22:16:55 +00001991 31 1 bit Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001992 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001993 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001994
1995..
1996
1997 .. table:: Floating Point Rounding Mode Enumeration Values
1998 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1999
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002000 ====================================== ===== ==============================
2001 Enumeration Name Value Description
2002 ====================================== ===== ==============================
2003 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
2004 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
2005 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
2006 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
2007 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002008
2009..
2010
2011 .. table:: Floating Point Denorm Mode Enumeration Values
2012 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
2013
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002014 ====================================== ===== ==============================
2015 Enumeration Name Value Description
2016 ====================================== ===== ==============================
2017 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
2018 Denorms
2019 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
2020 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
2021 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
2022 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002023
2024..
2025
2026 .. table:: System VGPR Work-Item ID Enumeration Values
2027 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
2028
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002029 ======================================== ===== ============================
2030 Enumeration Name Value Description
2031 ======================================== ===== ============================
2032 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
2033 ID.
2034 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
2035 dimensions ID.
2036 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
2037 dimensions ID.
2038 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
2039 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002040
2041.. _amdgpu-amdhsa-initial-kernel-execution-state:
2042
2043Initial Kernel Execution State
2044~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2045
2046This section defines the register state that will be set up by the packet
2047processor prior to the start of execution of every wavefront. This is limited by
2048the constraints of the hardware controllers of CP/ADC/SPI.
2049
2050The order of the SGPR registers is defined, but the compiler can specify which
2051ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2052fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2053for enabled registers are dense starting at SGPR0: the first enabled register is
2054SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2055an SGPR number.
2056
2057The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
Tony Tye5bbcca62018-03-08 05:46:01 +00002058all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tyef16a45e2017-06-06 20:31:59 +00002059the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2060initialized. These are then immediately followed by the System SGPRs that are
Tony Tye5bbcca62018-03-08 05:46:01 +00002061set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tyef16a45e2017-06-06 20:31:59 +00002062dispatch.
2063
2064SGPR register initial state is defined in
2065:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2066
2067 .. table:: SGPR Register Set Up Order
2068 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2069
2070 ========== ========================== ====== ==============================
2071 SGPR Order Name Number Description
2072 (kernel descriptor enable of
2073 field) SGPRs
2074 ========== ========================== ====== ==============================
2075 First Private Segment Buffer 4 V# that can be used, together
Tony Tye5bbcca62018-03-08 05:46:01 +00002076 (enable_sgpr_private with Scratch Wavefront Offset
2077 _segment_buffer) as an offset, to access the
2078 private memory space using a
2079 segment address.
Tony Tyef16a45e2017-06-06 20:31:59 +00002080
2081 CP uses the value provided by
2082 the runtime.
2083 then Dispatch Ptr 2 64 bit address of AQL dispatch
2084 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2085 actually executing.
2086 then Queue Ptr 2 64 bit address of amd_queue_t
2087 (enable_sgpr_queue_ptr) object for AQL queue on which
2088 the dispatch packet was
2089 queued.
2090 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2091 (enable_sgpr_kernarg segment. This is directly
2092 _segment_ptr) copied from the
2093 kernarg_address in the kernel
2094 dispatch packet.
2095
2096 Having CP load it once avoids
2097 loading it at the beginning of
2098 every wavefront.
2099 then Dispatch Id 2 64 bit Dispatch ID of the
2100 (enable_sgpr_dispatch_id) dispatch packet being
2101 executed.
2102 then Flat Scratch Init 2 This is 2 SGPRs:
2103 (enable_sgpr_flat_scratch
2104 _init) GFX6
2105 Not supported.
2106 GFX7-GFX8
2107 The first SGPR is a 32 bit
2108 byte offset from
2109 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2110 to per SPI base of memory
2111 for scratch for the queue
2112 executing the kernel
2113 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00002114 from the runtime. (The
2115 Scratch Segment Buffer base
2116 address is
2117 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2118 plus this offset.) The value
Tony Tye5bbcca62018-03-08 05:46:01 +00002119 of Scratch Wavefront Offset must
Tony Tye46d35762017-08-15 20:47:41 +00002120 be added to this offset by
2121 the kernel machine code,
2122 right shifted by 8, and
2123 moved to the FLAT_SCRATCH_HI
2124 SGPR register.
2125 FLAT_SCRATCH_HI corresponds
2126 to SGPRn-4 on GFX7, and
2127 SGPRn-6 on GFX8 (where SGPRn
2128 is the highest numbered SGPR
Tony Tye5bbcca62018-03-08 05:46:01 +00002129 allocated to the wavefront).
Tony Tye46d35762017-08-15 20:47:41 +00002130 FLAT_SCRATCH_HI is
2131 multiplied by 256 (as it is
2132 in units of 256 bytes) and
2133 added to
2134 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye5bbcca62018-03-08 05:46:01 +00002135 to calculate the per wavefront
Tony Tye46d35762017-08-15 20:47:41 +00002136 FLAT SCRATCH BASE in flat
2137 memory instructions that
2138 access the scratch
2139 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002140
2141 The second SGPR is 32 bit
2142 byte size of a single
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002143 work-item's scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002144 usage. CP obtains this from
2145 the runtime, and it is
2146 always a multiple of DWORD.
2147 CP checks that the value in
2148 the kernel dispatch packet
2149 Private Segment Byte Size is
2150 not larger, and requests the
2151 runtime to increase the
2152 queue's scratch size if
2153 necessary. The kernel code
2154 must move it to
2155 FLAT_SCRATCH_LO which is
2156 SGPRn-3 on GFX7 and SGPRn-5
2157 on GFX8. FLAT_SCRATCH_LO is
2158 used as the FLAT SCRATCH
2159 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002160 instructions. Having CP load
2161 it once avoids loading it at
2162 the beginning of every
Tony Tyef59d0712017-11-10 20:51:43 +00002163 wavefront.
2164 GFX9
2165 This is the
Tony Tye46d35762017-08-15 20:47:41 +00002166 64 bit base address of the
2167 per SPI scratch backing
2168 memory managed by SPI for
2169 the queue executing the
2170 kernel dispatch. CP obtains
2171 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002172 divides it if there are
2173 multiple Shader Arrays each
2174 with its own SPI). The value
Tony Tye5bbcca62018-03-08 05:46:01 +00002175 of Scratch Wavefront Offset must
Tony Tyef16a45e2017-06-06 20:31:59 +00002176 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002177 machine code and the result
2178 moved to the FLAT_SCRATCH
2179 SGPR which is SGPRn-6 and
2180 SGPRn-5. It is used as the
2181 FLAT SCRATCH BASE in flat
Tony Tyef59d0712017-11-10 20:51:43 +00002182 memory instructions.
2183 then Private Segment Size 1 The 32 bit byte size of a
2184 (enable_sgpr_private single
2185 work-item's
2186 scratch_segment_size) memory
2187 allocation. This is the
2188 value from the kernel
2189 dispatch packet Private
2190 Segment Byte Size rounded up
2191 by CP to a multiple of
2192 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002193
2194 Having CP load it once avoids
2195 loading it at the beginning of
2196 every wavefront.
2197
2198 This is not used for
2199 GFX7-GFX8 since it is the same
2200 value as the second SGPR of
2201 Flat Scratch Init. However, it
2202 may be needed for GFX9 which
2203 changes the meaning of the
2204 Flat Scratch Init value.
2205 then Grid Work-Group Count X 1 32 bit count of the number of
2206 (enable_sgpr_grid work-groups in the X dimension
2207 _workgroup_count_X) for the grid being
2208 executed. Computed from the
2209 fields in the kernel dispatch
2210 packet as ((grid_size.x +
2211 workgroup_size.x - 1) /
2212 workgroup_size.x).
2213 then Grid Work-Group Count Y 1 32 bit count of the number of
2214 (enable_sgpr_grid work-groups in the Y dimension
2215 _workgroup_count_Y && for the grid being
2216 less than 16 previous executed. Computed from the
2217 SGPRs) fields in the kernel dispatch
2218 packet as ((grid_size.y +
2219 workgroup_size.y - 1) /
2220 workgroupSize.y).
2221
2222 Only initialized if <16
2223 previous SGPRs initialized.
2224 then Grid Work-Group Count Z 1 32 bit count of the number of
2225 (enable_sgpr_grid work-groups in the Z dimension
2226 _workgroup_count_Z && for the grid being
2227 less than 16 previous executed. Computed from the
2228 SGPRs) fields in the kernel dispatch
2229 packet as ((grid_size.z +
2230 workgroup_size.z - 1) /
2231 workgroupSize.z).
2232
2233 Only initialized if <16
2234 previous SGPRs initialized.
2235 then Work-Group Id X 1 32 bit work-group id in X
2236 (enable_sgpr_workgroup_id dimension of grid for
2237 _X) wavefront.
2238 then Work-Group Id Y 1 32 bit work-group id in Y
2239 (enable_sgpr_workgroup_id dimension of grid for
2240 _Y) wavefront.
2241 then Work-Group Id Z 1 32 bit work-group id in Z
2242 (enable_sgpr_workgroup_id dimension of grid for
2243 _Z) wavefront.
Tony Tye5bbcca62018-03-08 05:46:01 +00002244 then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tyef16a45e2017-06-06 20:31:59 +00002245 (enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye5bbcca62018-03-08 05:46:01 +00002246 _info) threadgroup_size_in_wavefronts[5:0]}
2247 then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tyef16a45e2017-06-06 20:31:59 +00002248 (enable_sgpr_private of scratch base of queue
Tony Tye5bbcca62018-03-08 05:46:01 +00002249 _segment_wavefront_offset) executing the kernel
Tony Tyef16a45e2017-06-06 20:31:59 +00002250 dispatch. Must be used as an
2251 offset with Private
2252 segment address when using
2253 Scratch Segment Buffer. It
2254 must be used to set up FLAT
2255 SCRATCH for flat addressing
2256 (see
2257 :ref:`amdgpu-amdhsa-flat-scratch`).
2258 ========== ========================== ====== ==============================
2259
2260The order of the VGPR registers is defined, but the compiler can specify which
2261ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2262fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2263for enabled registers are dense starting at VGPR0: the first enabled register is
2264VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2265VGPR number.
2266
2267VGPR register initial state is defined in
2268:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2269
2270 .. table:: VGPR Register Set Up Order
2271 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2272
2273 ========== ========================== ====== ==============================
2274 VGPR Order Name Number Description
2275 (kernel descriptor enable of
2276 field) VGPRs
2277 ========== ========================== ====== ==============================
2278 First Work-Item Id X 1 32 bit work item id in X
2279 (Always initialized) dimension of work-group for
2280 wavefront lane.
2281 then Work-Item Id Y 1 32 bit work item id in Y
2282 (enable_vgpr_workitem_id dimension of work-group for
2283 > 0) wavefront lane.
2284 then Work-Item Id Z 1 32 bit work item id in Z
2285 (enable_vgpr_workitem_id dimension of work-group for
2286 > 1) wavefront lane.
2287 ========== ========================== ====== ==============================
2288
Hiroshi Inouebcadfee2018-04-12 05:53:20 +00002289The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
Tony Tyef16a45e2017-06-06 20:31:59 +00002290
22911. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2292 registers.
22932. Work-group Id registers X, Y, Z are set by ADC which supports any
2294 combination including none.
Tony Tye5bbcca62018-03-08 05:46:01 +000022953. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
2296 its value cannot included with the flat scratch init value which is per queue.
Tony Tyef16a45e2017-06-06 20:31:59 +000022974. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2298 or (X, Y, Z).
2299
2300Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2301value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2302
2303The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye07d9f102017-11-10 01:00:54 +00002304has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tyef16a45e2017-06-06 20:31:59 +00002305instructions (GFX9).
2306
2307If buffer operations are used then the compiler can generate a V# with the
2308following properties:
2309
2310* base address of 0
2311* no swizzle
2312* ATC: 1 if IOMMU present (such as APU)
2313* ptr64: 1
2314* MTYPE set to support memory coherence that matches the runtime (such as CC for
2315 APU and NC for dGPU).
2316
2317.. _amdgpu-amdhsa-kernel-prolog:
2318
2319Kernel Prolog
2320~~~~~~~~~~~~~
2321
2322.. _amdgpu-amdhsa-m0:
2323
2324M0
2325++
2326
2327GFX6-GFX8
2328 The M0 register must be initialized with a value at least the total LDS size
2329 if the kernel may access LDS via DS or flat operations. Total LDS size is
2330 available in dispatch packet. For M0, it is also possible to use maximum
2331 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2332 GFX7-GFX8).
2333GFX9
2334 The M0 register is not used for range checking LDS accesses and so does not
2335 need to be initialized in the prolog.
2336
2337.. _amdgpu-amdhsa-flat-scratch:
2338
2339Flat Scratch
2340++++++++++++
2341
2342If the kernel may use flat operations to access scratch memory, the prolog code
2343must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye5bbcca62018-03-08 05:46:01 +00002344are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tyef16a45e2017-06-06 20:31:59 +00002345Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2346
2347GFX6
2348 Flat scratch is not supported.
2349
Tony Tye07d9f102017-11-10 01:00:54 +00002350GFX7-GFX8
Tony Tyef16a45e2017-06-06 20:31:59 +00002351 1. The low word of Flat Scratch Init is 32 bit byte offset from
2352 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2353 being managed by SPI for the queue executing the kernel dispatch. This is
2354 the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye5bbcca62018-03-08 05:46:01 +00002355 prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tyef16a45e2017-06-06 20:31:59 +00002356 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2357 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2358 by 8 before moving into FLAT_SCRATCH_LO.
2359 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2360 work-items scratch memory usage. This is directly loaded from the kernel
2361 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2362 DWORD. Having CP load it once avoids loading it at the beginning of every
2363 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2364 SIZE.
Tony Tyef59d0712017-11-10 20:51:43 +00002365
Tony Tyef16a45e2017-06-06 20:31:59 +00002366GFX9
2367 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2368 memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye5bbcca62018-03-08 05:46:01 +00002369 prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
Tony Tyef16a45e2017-06-06 20:31:59 +00002370 pair for use as the flat scratch base in flat memory instructions.
2371
2372.. _amdgpu-amdhsa-memory-model:
2373
2374Memory Model
2375~~~~~~~~~~~~
2376
2377This section describes the mapping of LLVM memory model onto AMDGPU machine code
2378(see :ref:`memmodel`). *The implementation is WIP.*
2379
2380.. TODO
2381 Update when implementation complete.
2382
Tony Tyef16a45e2017-06-06 20:31:59 +00002383The AMDGPU backend supports the memory synchronization scopes specified in
2384:ref:`amdgpu-memory-scopes`.
2385
2386The code sequences used to implement the memory model are defined in table
2387:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2388
2389The sequences specify the order of instructions that a single thread must
2390execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2391to other memory instructions executed by the same thread. This allows them to be
2392moved earlier or later which can allow them to be combined with other instances
2393of the same instruction, or hoisted/sunk out of loops to improve
2394performance. Only the instructions related to the memory model are given;
2395additional ``s_waitcnt`` instructions are required to ensure registers are
2396defined before being used. These may be able to be combined with the memory
2397model ``s_waitcnt`` instructions as described above.
2398
Tony Tye6baa6d22017-10-18 22:16:55 +00002399The AMDGPU backend supports the following memory models:
2400
2401 HSA Memory Model [HSA]_
2402 The HSA memory model uses a single happens-before relation for all address
2403 spaces (see :ref:`amdgpu-address-spaces`).
2404 OpenCL Memory Model [OpenCL]_
2405 The OpenCL memory model which has separate happens-before relations for the
2406 global and local address spaces. Only a fence specifying both global and
2407 local address space, and seq_cst instructions join the relationships. Since
2408 the LLVM ``memfence`` instruction does not allow an address space to be
2409 specified the OpenCL fence has to convervatively assume both local and
2410 global address space was specified. However, optimizations can often be
2411 done to eliminate the additional ``s_waitcnt`` instructions when there are
2412 no intervening memory instructions which access the corresponding address
2413 space. The code sequences in the table indicate what can be omitted for the
2414 OpenCL memory. The target triple environment is used to determine if the
2415 source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tyef16a45e2017-06-06 20:31:59 +00002416
2417``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2418operations.
2419
2420``buffer/global/flat_load/store/atomic`` instructions to global memory are
2421termed vector memory operations.
2422
2423For GFX6-GFX9:
2424
2425* Each agent has multiple compute units (CU).
2426* Each CU has multiple SIMDs that execute wavefronts.
2427* The wavefronts for a single work-group are executed in the same CU but may be
2428 executed by different SIMDs.
2429* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2430 executing on it.
2431* All LDS operations of a CU are performed as wavefront wide operations in a
2432 global order and involve no caching. Completion is reported to a wavefront in
2433 execution order.
2434* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye5bbcca62018-03-08 05:46:01 +00002435 CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tyef16a45e2017-06-06 20:31:59 +00002436 can be reordered relative to each other, which can result in reordering the
2437 visibility of vector memory operations with respect to LDS operations of other
2438 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002439 ensure synchronization between LDS operations and vector memory operations
Tony Tye5bbcca62018-03-08 05:46:01 +00002440 between wavefronts of a work-group, but not between operations performed by the
Tony Tyef16a45e2017-06-06 20:31:59 +00002441 same wavefront.
2442* The vector memory operations are performed as wavefront wide operations and
2443 completion is reported to a wavefront in execution order. The exception is
Tony Tye07d9f102017-11-10 01:00:54 +00002444 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tyef16a45e2017-06-06 20:31:59 +00002445 vector memory order if they access LDS memory, and out of LDS operation order
2446 if they access global memory.
Tony Tye6baa6d22017-10-18 22:16:55 +00002447* The vector memory operations access a single vector L1 cache shared by all
2448 SIMDs a CU. Therefore, no special action is required for coherence between the
2449 lanes of a single wavefront, or for coherence between wavefronts in the same
Tony Tye5bbcca62018-03-08 05:46:01 +00002450 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
Tony Tye6baa6d22017-10-18 22:16:55 +00002451 executing in different work-groups as they may be executing on different CUs.
Tony Tyef16a45e2017-06-06 20:31:59 +00002452* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2453 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2454 scalar operations are used in a restricted way so do not impact the memory
2455 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2456* The vector and scalar memory operations use an L2 cache shared by all CUs on
2457 the same agent.
2458* The L2 cache has independent channels to service disjoint ranges of virtual
2459 addresses.
2460* Each CU has a separate request queue per channel. Therefore, the vector and
Tony Tye5bbcca62018-03-08 05:46:01 +00002461 scalar memory operations performed by wavefronts executing in different work-groups
Tony Tyef16a45e2017-06-06 20:31:59 +00002462 (which may be executing on different CUs) of an agent can be reordered
2463 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002464 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002465 previous vector memory operation has completed before executing a subsequent
2466 vector memory or LDS operation and so can be used to meet the requirements of
2467 acquire and release.
2468* The L2 cache can be kept coherent with other agents on some targets, or ranges
2469 of virtual addresses can be set up to bypass it to ensure system coherence.
2470
Tony Tye07d9f102017-11-10 01:00:54 +00002471Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tyef16a45e2017-06-06 20:31:59 +00002472or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2473memory, atomic memory orderings are not meaningful and all accesses are treated
2474as non-atomic.
2475
2476Constant address space uses ``buffer/global_load`` instructions (or equivalent
2477scalar memory instructions). Since the constant address space contents do not
2478change during the execution of a kernel dispatch it is not legal to perform
2479stores, and atomic memory orderings are not meaningful and all access are
2480treated as non-atomic.
2481
2482A memory synchronization scope wider than work-group is not meaningful for the
2483group (LDS) address space and is treated as work-group.
2484
2485The memory model does not support the region address space which is treated as
2486non-atomic.
2487
2488Acquire memory ordering is not meaningful on store atomic instructions and is
2489treated as non-atomic.
2490
2491Release memory ordering is not meaningful on load atomic instructions and is
2492treated a non-atomic.
2493
2494Acquire-release memory ordering is not meaningful on load or store atomic
2495instructions and is treated as acquire and release respectively.
2496
2497AMDGPU backend only uses scalar memory operations to access memory that is
2498proven to not change during the execution of the kernel dispatch. This includes
2499constant address space and global address space for program scope const
2500variables. Therefore the kernel machine code does not have to maintain the
2501scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2502and vector L1 caches are invalidated between kernel dispatches by CP since
2503constant address space data may change between kernel dispatch executions. See
2504:ref:`amdgpu-amdhsa-memory-spaces`.
2505
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002506The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002507case the AMDGPU backend ensures the memory location used to spill is never
2508accessed by vector memory operations at the same time. If scalar writes are used
2509then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2510return since the locations may be used for vector memory instructions by a
Tony Tye5bbcca62018-03-08 05:46:01 +00002511future wavefront that uses the same scratch area, or a function call that creates a
Tony Tyef16a45e2017-06-06 20:31:59 +00002512frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2513as all scalar writes are write-before-read in the same thread.
2514
Tony Tye6baa6d22017-10-18 22:16:55 +00002515Scratch backing memory (which is used for the private address space)
2516is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2517address space is only accessed by a single thread, and is always
2518write-before-read, there is never a need to invalidate these entries from the L1
2519cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2520volatile cache lines.
Tony Tyef16a45e2017-06-06 20:31:59 +00002521
2522On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye6baa6d22017-10-18 22:16:55 +00002523to invalidate the L2 cache. This also causes it to be treated as
2524non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2525(cache coherent) and so the L2 cache will coherent with the CPU and other
2526agents.
Tony Tyef16a45e2017-06-06 20:31:59 +00002527
2528 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2529 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2530
Tony Tye6baa6d22017-10-18 22:16:55 +00002531 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002532 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2533 Ordering Sync Scope Address
2534 Space
Tony Tye6baa6d22017-10-18 22:16:55 +00002535 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002536 **Non-Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002537 -----------------------------------------------------------------------------------
2538 load *none* *none* - global - !volatile & !nontemporal
2539 - generic
2540 - private 1. buffer/global/flat_load
2541 - constant
2542 - volatile & !nontemporal
2543
Tony Tyef16a45e2017-06-06 20:31:59 +00002544 1. buffer/global/flat_load
2545 glc=1
Tony Tye6baa6d22017-10-18 22:16:55 +00002546
2547 - nontemporal
2548
2549 1. buffer/global/flat_load
2550 glc=1 slc=1
2551
Tony Tyef16a45e2017-06-06 20:31:59 +00002552 load *none* *none* - local 1. ds_load
Tony Tye6baa6d22017-10-18 22:16:55 +00002553 store *none* *none* - global - !nontemporal
Tony Tyef16a45e2017-06-06 20:31:59 +00002554 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002555 - private 1. buffer/global/flat_store
2556 - constant
2557 - nontemporal
2558
2559 1. buffer/global/flat_stote
2560 glc=1 slc=1
2561
Tony Tyef16a45e2017-06-06 20:31:59 +00002562 store *none* *none* - local 1. ds_store
2563 **Unordered Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002564 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002565 load atomic unordered *any* *any* *Same as non-atomic*.
2566 store atomic unordered *any* *any* *Same as non-atomic*.
2567 atomicrmw unordered *any* *any* *Same as monotonic
2568 atomic*.
2569 **Monotonic Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002570 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002571 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2572 - wavefront - generic
2573 - workgroup
2574 load atomic monotonic - singlethread - local 1. ds_load
2575 - wavefront
2576 - workgroup
2577 load atomic monotonic - agent - global 1. buffer/global/flat_load
2578 - system - generic glc=1
2579 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2580 - wavefront - generic
2581 - workgroup
2582 - agent
2583 - system
2584 store atomic monotonic - singlethread - local 1. ds_store
2585 - wavefront
2586 - workgroup
2587 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2588 - wavefront - generic
2589 - workgroup
2590 - agent
2591 - system
2592 atomicrmw monotonic - singlethread - local 1. ds_atomic
2593 - wavefront
2594 - workgroup
2595 **Acquire Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002596 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002597 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2598 - wavefront - local
2599 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002600 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2601 load atomic acquire - workgroup - local 1. ds_load
2602 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002603
Tony Tye6baa6d22017-10-18 22:16:55 +00002604 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002605 - Must happen before
2606 any following
2607 global/generic
2608 load/load
2609 atomic/store/store
2610 atomic/atomicrmw.
2611 - Ensures any
2612 following global
2613 data read is no
2614 older than the load
2615 atomic value being
2616 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00002617 load atomic acquire - workgroup - generic 1. flat_load
2618 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002619
Tony Tye6baa6d22017-10-18 22:16:55 +00002620 - If OpenCL, omit.
2621 - Must happen before
2622 any following
2623 global/generic
2624 load/load
2625 atomic/store/store
2626 atomic/atomicrmw.
2627 - Ensures any
2628 following global
2629 data read is no
2630 older than the load
2631 atomic value being
2632 acquired.
2633 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tyef16a45e2017-06-06 20:31:59 +00002634 - system glc=1
2635 2. s_waitcnt vmcnt(0)
2636
2637 - Must happen before
2638 following
2639 buffer_wbinvl1_vol.
2640 - Ensures the load
2641 has completed
2642 before invalidating
2643 the cache.
2644
2645 3. buffer_wbinvl1_vol
2646
2647 - Must happen before
2648 any following
2649 global/generic
2650 load/load
2651 atomic/atomicrmw.
2652 - Ensures that
2653 following
2654 loads will not see
2655 stale global data.
2656
2657 load atomic acquire - agent - generic 1. flat_load glc=1
2658 - system 2. s_waitcnt vmcnt(0) &
2659 lgkmcnt(0)
2660
2661 - If OpenCL omit
2662 lgkmcnt(0).
2663 - Must happen before
2664 following
2665 buffer_wbinvl1_vol.
2666 - Ensures the flat_load
2667 has completed
2668 before invalidating
2669 the cache.
2670
2671 3. buffer_wbinvl1_vol
2672
2673 - Must happen before
2674 any following
2675 global/generic
2676 load/load
2677 atomic/atomicrmw.
2678 - Ensures that
2679 following loads
2680 will not see stale
2681 global data.
2682
2683 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2684 - wavefront - local
2685 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002686 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2687 atomicrmw acquire - workgroup - local 1. ds_atomic
2688 2. waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002689
Tony Tye6baa6d22017-10-18 22:16:55 +00002690 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002691 - Must happen before
2692 any following
2693 global/generic
2694 load/load
2695 atomic/store/store
2696 atomic/atomicrmw.
2697 - Ensures any
2698 following global
2699 data read is no
2700 older than the
2701 atomicrmw value
2702 being acquired.
2703
Tony Tye6baa6d22017-10-18 22:16:55 +00002704 atomicrmw acquire - workgroup - generic 1. flat_atomic
2705 2. waitcnt lgkmcnt(0)
2706
2707 - If OpenCL, omit.
2708 - Must happen before
2709 any following
2710 global/generic
2711 load/load
2712 atomic/store/store
2713 atomic/atomicrmw.
2714 - Ensures any
2715 following global
2716 data read is no
2717 older than the
2718 atomicrmw value
2719 being acquired.
2720
2721 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002722 - system 2. s_waitcnt vmcnt(0)
2723
2724 - Must happen before
2725 following
2726 buffer_wbinvl1_vol.
2727 - Ensures the
2728 atomicrmw has
2729 completed before
2730 invalidating the
2731 cache.
2732
2733 3. buffer_wbinvl1_vol
2734
2735 - Must happen before
2736 any following
2737 global/generic
2738 load/load
2739 atomic/atomicrmw.
2740 - Ensures that
2741 following loads
2742 will not see stale
2743 global data.
2744
2745 atomicrmw acquire - agent - generic 1. flat_atomic
2746 - system 2. s_waitcnt vmcnt(0) &
2747 lgkmcnt(0)
2748
2749 - If OpenCL, omit
2750 lgkmcnt(0).
2751 - Must happen before
2752 following
2753 buffer_wbinvl1_vol.
2754 - Ensures the
2755 atomicrmw has
2756 completed before
2757 invalidating the
2758 cache.
2759
2760 3. buffer_wbinvl1_vol
2761
2762 - Must happen before
2763 any following
2764 global/generic
2765 load/load
2766 atomic/atomicrmw.
2767 - Ensures that
2768 following loads
2769 will not see stale
2770 global data.
2771
2772 fence acquire - singlethread *none* *none*
2773 - wavefront
2774 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2775
2776 - If OpenCL and
2777 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002778 not generic, omit.
2779 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002780 currently has no
2781 address space on
2782 the fence need to
2783 conservatively
2784 always generate. If
2785 fence had an
2786 address space then
2787 set to address
2788 space of OpenCL
2789 fence flag, or to
2790 generic if both
2791 local and global
2792 flags are
2793 specified.
2794 - Must happen after
2795 any preceding
2796 local/generic load
2797 atomic/atomicrmw
2798 with an equal or
2799 wider sync scope
2800 and memory ordering
2801 stronger than
2802 unordered (this is
2803 termed the
2804 fence-paired-atomic).
2805 - Must happen before
2806 any following
2807 global/generic
2808 load/load
2809 atomic/store/store
2810 atomic/atomicrmw.
2811 - Ensures any
2812 following global
2813 data read is no
2814 older than the
2815 value read by the
2816 fence-paired-atomic.
2817
Tony Tye6baa6d22017-10-18 22:16:55 +00002818 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2819 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002820
2821 - If OpenCL and
2822 address space is
2823 not generic, omit
2824 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002825 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002826 currently has no
2827 address space on
2828 the fence need to
2829 conservatively
2830 always generate
2831 (see comment for
2832 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002833 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002834 separate s_waitcnt
2835 vmcnt(0) and
2836 s_waitcnt
2837 lgkmcnt(0) to allow
2838 them to be
2839 independently moved
2840 according to the
2841 following rules.
2842 - s_waitcnt vmcnt(0)
2843 must happen after
2844 any preceding
2845 global/generic load
2846 atomic/atomicrmw
2847 with an equal or
2848 wider sync scope
2849 and memory ordering
2850 stronger than
2851 unordered (this is
2852 termed the
2853 fence-paired-atomic).
2854 - s_waitcnt lgkmcnt(0)
2855 must happen after
2856 any preceding
Tony Tye6baa6d22017-10-18 22:16:55 +00002857 local/generic load
Tony Tyef16a45e2017-06-06 20:31:59 +00002858 atomic/atomicrmw
2859 with an equal or
2860 wider sync scope
2861 and memory ordering
2862 stronger than
2863 unordered (this is
2864 termed the
2865 fence-paired-atomic).
2866 - Must happen before
2867 the following
2868 buffer_wbinvl1_vol.
2869 - Ensures that the
2870 fence-paired atomic
2871 has completed
2872 before invalidating
2873 the
2874 cache. Therefore
2875 any following
2876 locations read must
2877 be no older than
2878 the value read by
2879 the
2880 fence-paired-atomic.
2881
2882 2. buffer_wbinvl1_vol
2883
Tony Tye6baa6d22017-10-18 22:16:55 +00002884 - Must happen before any
2885 following global/generic
Tony Tyef16a45e2017-06-06 20:31:59 +00002886 load/load
2887 atomic/store/store
2888 atomic/atomicrmw.
2889 - Ensures that
2890 following loads
2891 will not see stale
2892 global data.
2893
2894 **Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002895 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002896 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2897 - wavefront - local
2898 - generic
2899 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002900
2901 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002902 - Must happen after
2903 any preceding
2904 local/generic
2905 load/store/load
2906 atomic/store
2907 atomic/atomicrmw.
2908 - Must happen before
2909 the following
2910 store.
2911 - Ensures that all
2912 memory operations
2913 to local have
2914 completed before
2915 performing the
2916 store that is being
2917 released.
2918
2919 2. buffer/global/flat_store
2920 store atomic release - workgroup - local 1. ds_store
Tony Tye6baa6d22017-10-18 22:16:55 +00002921 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2922
2923 - If OpenCL, omit.
2924 - Must happen after
2925 any preceding
2926 local/generic
2927 load/store/load
2928 atomic/store
2929 atomic/atomicrmw.
2930 - Must happen before
2931 the following
2932 store.
2933 - Ensures that all
2934 memory operations
2935 to local have
2936 completed before
2937 performing the
2938 store that is being
2939 released.
2940
2941 2. flat_store
2942 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2943 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002944
2945 - If OpenCL, omit
2946 lgkmcnt(0).
2947 - Could be split into
2948 separate s_waitcnt
2949 vmcnt(0) and
2950 s_waitcnt
2951 lgkmcnt(0) to allow
2952 them to be
2953 independently moved
2954 according to the
2955 following rules.
2956 - s_waitcnt vmcnt(0)
2957 must happen after
2958 any preceding
2959 global/generic
2960 load/store/load
2961 atomic/store
2962 atomic/atomicrmw.
2963 - s_waitcnt lgkmcnt(0)
2964 must happen after
2965 any preceding
2966 local/generic
2967 load/store/load
2968 atomic/store
2969 atomic/atomicrmw.
2970 - Must happen before
2971 the following
2972 store.
2973 - Ensures that all
2974 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00002975 to memory have
Tony Tyef16a45e2017-06-06 20:31:59 +00002976 completed before
2977 performing the
2978 store that is being
2979 released.
2980
2981 2. buffer/global/ds/flat_store
2982 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2983 - wavefront - local
2984 - generic
2985 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002986
2987 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002988 - Must happen after
2989 any preceding
2990 local/generic
2991 load/store/load
2992 atomic/store
2993 atomic/atomicrmw.
2994 - Must happen before
2995 the following
2996 atomicrmw.
2997 - Ensures that all
2998 memory operations
2999 to local have
3000 completed before
3001 performing the
3002 atomicrmw that is
3003 being released.
3004
3005 2. buffer/global/flat_atomic
3006 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye6baa6d22017-10-18 22:16:55 +00003007 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3008
3009 - If OpenCL, omit.
3010 - Must happen after
3011 any preceding
3012 local/generic
3013 load/store/load
3014 atomic/store
3015 atomic/atomicrmw.
3016 - Must happen before
3017 the following
3018 atomicrmw.
3019 - Ensures that all
3020 memory operations
3021 to local have
3022 completed before
3023 performing the
3024 atomicrmw that is
3025 being released.
3026
3027 2. flat_atomic
3028 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
3029 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003030
3031 - If OpenCL, omit
3032 lgkmcnt(0).
3033 - Could be split into
3034 separate s_waitcnt
3035 vmcnt(0) and
3036 s_waitcnt
3037 lgkmcnt(0) to allow
3038 them to be
3039 independently moved
3040 according to the
3041 following rules.
3042 - s_waitcnt vmcnt(0)
3043 must happen after
3044 any preceding
3045 global/generic
3046 load/store/load
3047 atomic/store
3048 atomic/atomicrmw.
3049 - s_waitcnt lgkmcnt(0)
3050 must happen after
3051 any preceding
3052 local/generic
3053 load/store/load
3054 atomic/store
3055 atomic/atomicrmw.
3056 - Must happen before
3057 the following
3058 atomicrmw.
3059 - Ensures that all
3060 memory operations
3061 to global and local
3062 have completed
3063 before performing
3064 the atomicrmw that
3065 is being released.
3066
Tony Tye6baa6d22017-10-18 22:16:55 +00003067 2. buffer/global/ds/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003068 fence release - singlethread *none* *none*
3069 - wavefront
3070 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3071
3072 - If OpenCL and
3073 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003074 not generic, omit.
3075 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003076 currently has no
3077 address space on
3078 the fence need to
3079 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003080 always generate. If
3081 fence had an
3082 address space then
3083 set to address
3084 space of OpenCL
3085 fence flag, or to
3086 generic if both
3087 local and global
3088 flags are
3089 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003090 - Must happen after
3091 any preceding
3092 local/generic
3093 load/load
3094 atomic/store/store
3095 atomic/atomicrmw.
3096 - Must happen before
3097 any following store
3098 atomic/atomicrmw
3099 with an equal or
3100 wider sync scope
3101 and memory ordering
3102 stronger than
3103 unordered (this is
3104 termed the
3105 fence-paired-atomic).
3106 - Ensures that all
3107 memory operations
3108 to local have
3109 completed before
3110 performing the
3111 following
3112 fence-paired-atomic.
3113
Tony Tye6baa6d22017-10-18 22:16:55 +00003114 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3115 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003116
3117 - If OpenCL and
3118 address space is
3119 not generic, omit
3120 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003121 - If OpenCL and
3122 address space is
3123 local, omit
3124 vmcnt(0).
3125 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003126 currently has no
3127 address space on
3128 the fence need to
3129 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003130 always generate. If
3131 fence had an
3132 address space then
3133 set to address
3134 space of OpenCL
3135 fence flag, or to
3136 generic if both
3137 local and global
3138 flags are
3139 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003140 - Could be split into
3141 separate s_waitcnt
3142 vmcnt(0) and
3143 s_waitcnt
3144 lgkmcnt(0) to allow
3145 them to be
3146 independently moved
3147 according to the
3148 following rules.
3149 - s_waitcnt vmcnt(0)
3150 must happen after
3151 any preceding
3152 global/generic
3153 load/store/load
3154 atomic/store
3155 atomic/atomicrmw.
3156 - s_waitcnt lgkmcnt(0)
3157 must happen after
3158 any preceding
3159 local/generic
3160 load/store/load
3161 atomic/store
3162 atomic/atomicrmw.
3163 - Must happen before
3164 any following store
3165 atomic/atomicrmw
3166 with an equal or
3167 wider sync scope
3168 and memory ordering
3169 stronger than
3170 unordered (this is
3171 termed the
3172 fence-paired-atomic).
3173 - Ensures that all
3174 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00003175 have
Tony Tyef16a45e2017-06-06 20:31:59 +00003176 completed before
3177 performing the
3178 following
3179 fence-paired-atomic.
3180
3181 **Acquire-Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003182 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003183 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3184 - wavefront - local
3185 - generic
3186 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3187
Tony Tye6baa6d22017-10-18 22:16:55 +00003188 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003189 - Must happen after
3190 any preceding
3191 local/generic
3192 load/store/load
3193 atomic/store
3194 atomic/atomicrmw.
3195 - Must happen before
3196 the following
3197 atomicrmw.
3198 - Ensures that all
3199 memory operations
3200 to local have
3201 completed before
3202 performing the
3203 atomicrmw that is
3204 being released.
3205
Tony Tye6baa6d22017-10-18 22:16:55 +00003206 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003207 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3208 2. s_waitcnt lgkmcnt(0)
3209
Tony Tye6baa6d22017-10-18 22:16:55 +00003210 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003211 - Must happen before
3212 any following
3213 global/generic
3214 load/load
3215 atomic/store/store
3216 atomic/atomicrmw.
3217 - Ensures any
3218 following global
3219 data read is no
3220 older than the load
3221 atomic value being
3222 acquired.
3223
3224 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3225
Tony Tye6baa6d22017-10-18 22:16:55 +00003226 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003227 - Must happen after
3228 any preceding
3229 local/generic
3230 load/store/load
3231 atomic/store
3232 atomic/atomicrmw.
3233 - Must happen before
3234 the following
3235 atomicrmw.
3236 - Ensures that all
3237 memory operations
3238 to local have
3239 completed before
3240 performing the
3241 atomicrmw that is
3242 being released.
3243
3244 2. flat_atomic
3245 3. s_waitcnt lgkmcnt(0)
3246
Tony Tye6baa6d22017-10-18 22:16:55 +00003247 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003248 - Must happen before
3249 any following
3250 global/generic
3251 load/load
3252 atomic/store/store
3253 atomic/atomicrmw.
3254 - Ensures any
3255 following global
3256 data read is no
3257 older than the load
3258 atomic value being
3259 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00003260
3261 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3262 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003263
3264 - If OpenCL, omit
3265 lgkmcnt(0).
3266 - Could be split into
3267 separate s_waitcnt
3268 vmcnt(0) and
3269 s_waitcnt
3270 lgkmcnt(0) to allow
3271 them to be
3272 independently moved
3273 according to the
3274 following rules.
3275 - s_waitcnt vmcnt(0)
3276 must happen after
3277 any preceding
3278 global/generic
3279 load/store/load
3280 atomic/store
3281 atomic/atomicrmw.
3282 - s_waitcnt lgkmcnt(0)
3283 must happen after
3284 any preceding
3285 local/generic
3286 load/store/load
3287 atomic/store
3288 atomic/atomicrmw.
3289 - Must happen before
3290 the following
3291 atomicrmw.
3292 - Ensures that all
3293 memory operations
3294 to global have
3295 completed before
3296 performing the
3297 atomicrmw that is
3298 being released.
3299
Tony Tye6baa6d22017-10-18 22:16:55 +00003300 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003301 3. s_waitcnt vmcnt(0)
3302
3303 - Must happen before
3304 following
3305 buffer_wbinvl1_vol.
3306 - Ensures the
3307 atomicrmw has
3308 completed before
3309 invalidating the
3310 cache.
3311
3312 4. buffer_wbinvl1_vol
3313
3314 - Must happen before
3315 any following
3316 global/generic
3317 load/load
3318 atomic/atomicrmw.
3319 - Ensures that
3320 following loads
3321 will not see stale
3322 global data.
3323
Tony Tye6baa6d22017-10-18 22:16:55 +00003324 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3325 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003326
3327 - If OpenCL, omit
3328 lgkmcnt(0).
3329 - Could be split into
3330 separate s_waitcnt
3331 vmcnt(0) and
3332 s_waitcnt
3333 lgkmcnt(0) to allow
3334 them to be
3335 independently moved
3336 according to the
3337 following rules.
3338 - s_waitcnt vmcnt(0)
3339 must happen after
3340 any preceding
3341 global/generic
3342 load/store/load
3343 atomic/store
3344 atomic/atomicrmw.
3345 - s_waitcnt lgkmcnt(0)
3346 must happen after
3347 any preceding
3348 local/generic
3349 load/store/load
3350 atomic/store
3351 atomic/atomicrmw.
3352 - Must happen before
3353 the following
3354 atomicrmw.
3355 - Ensures that all
3356 memory operations
3357 to global have
3358 completed before
3359 performing the
3360 atomicrmw that is
3361 being released.
3362
3363 2. flat_atomic
3364 3. s_waitcnt vmcnt(0) &
3365 lgkmcnt(0)
3366
3367 - If OpenCL, omit
3368 lgkmcnt(0).
3369 - Must happen before
3370 following
3371 buffer_wbinvl1_vol.
3372 - Ensures the
3373 atomicrmw has
3374 completed before
3375 invalidating the
3376 cache.
3377
3378 4. buffer_wbinvl1_vol
3379
3380 - Must happen before
3381 any following
3382 global/generic
3383 load/load
3384 atomic/atomicrmw.
3385 - Ensures that
3386 following loads
3387 will not see stale
3388 global data.
3389
3390 fence acq_rel - singlethread *none* *none*
3391 - wavefront
3392 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3393
3394 - If OpenCL and
3395 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003396 not generic, omit.
3397 - However,
Tony Tyef16a45e2017-06-06 20:31:59 +00003398 since LLVM
3399 currently has no
3400 address space on
3401 the fence need to
3402 conservatively
3403 always generate
3404 (see comment for
3405 previous fence).
3406 - Must happen after
3407 any preceding
3408 local/generic
3409 load/load
3410 atomic/store/store
3411 atomic/atomicrmw.
3412 - Must happen before
3413 any following
3414 global/generic
3415 load/load
3416 atomic/store/store
3417 atomic/atomicrmw.
3418 - Ensures that all
3419 memory operations
3420 to local have
3421 completed before
3422 performing any
3423 following global
3424 memory operations.
3425 - Ensures that the
3426 preceding
3427 local/generic load
3428 atomic/atomicrmw
3429 with an equal or
3430 wider sync scope
3431 and memory ordering
3432 stronger than
3433 unordered (this is
3434 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003435 acquire-fence-paired-atomic
3436 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003437 before following
3438 global memory
3439 operations. This
3440 satisfies the
3441 requirements of
3442 acquire.
3443 - Ensures that all
3444 previous memory
3445 operations have
3446 completed before a
3447 following
3448 local/generic store
3449 atomic/atomicrmw
3450 with an equal or
3451 wider sync scope
3452 and memory ordering
3453 stronger than
3454 unordered (this is
3455 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003456 release-fence-paired-atomic
3457 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003458 requirements of
3459 release.
3460
Tony Tye6baa6d22017-10-18 22:16:55 +00003461 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3462 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003463
3464 - If OpenCL and
3465 address space is
3466 not generic, omit
3467 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003468 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003469 currently has no
3470 address space on
3471 the fence need to
3472 conservatively
3473 always generate
3474 (see comment for
3475 previous fence).
3476 - Could be split into
3477 separate s_waitcnt
3478 vmcnt(0) and
3479 s_waitcnt
3480 lgkmcnt(0) to allow
3481 them to be
3482 independently moved
3483 according to the
3484 following rules.
3485 - s_waitcnt vmcnt(0)
3486 must happen after
3487 any preceding
3488 global/generic
3489 load/store/load
3490 atomic/store
3491 atomic/atomicrmw.
3492 - s_waitcnt lgkmcnt(0)
3493 must happen after
3494 any preceding
3495 local/generic
3496 load/store/load
3497 atomic/store
3498 atomic/atomicrmw.
3499 - Must happen before
3500 the following
3501 buffer_wbinvl1_vol.
3502 - Ensures that the
3503 preceding
3504 global/local/generic
3505 load
3506 atomic/atomicrmw
3507 with an equal or
3508 wider sync scope
3509 and memory ordering
3510 stronger than
3511 unordered (this is
3512 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003513 acquire-fence-paired-atomic
3514 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003515 before invalidating
3516 the cache. This
3517 satisfies the
3518 requirements of
3519 acquire.
3520 - Ensures that all
3521 previous memory
3522 operations have
3523 completed before a
3524 following
3525 global/local/generic
3526 store
3527 atomic/atomicrmw
3528 with an equal or
3529 wider sync scope
3530 and memory ordering
3531 stronger than
3532 unordered (this is
3533 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003534 release-fence-paired-atomic
3535 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003536 requirements of
3537 release.
3538
3539 2. buffer_wbinvl1_vol
3540
3541 - Must happen before
3542 any following
3543 global/generic
3544 load/load
3545 atomic/store/store
3546 atomic/atomicrmw.
3547 - Ensures that
3548 following loads
3549 will not see stale
3550 global data. This
3551 satisfies the
3552 requirements of
3553 acquire.
3554
3555 **Sequential Consistent Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003556 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003557 load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003558 - wavefront - local load atomic acquire,
3559 - generic except must generated
3560 all instructions even
3561 for OpenCL.*
3562 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3563 - generic
3564 - Must
3565 happen after
3566 preceding
3567 global/generic load
3568 atomic/store
3569 atomic/atomicrmw
3570 with memory
3571 ordering of seq_cst
3572 and with equal or
3573 wider sync scope.
3574 (Note that seq_cst
3575 fences have their
3576 own s_waitcnt
3577 lgkmcnt(0) and so do
3578 not need to be
3579 considered.)
3580 - Ensures any
3581 preceding
3582 sequential
3583 consistent local
3584 memory instructions
3585 have completed
3586 before executing
3587 this sequentially
3588 consistent
3589 instruction. This
3590 prevents reordering
3591 a seq_cst store
3592 followed by a
3593 seq_cst load. (Note
3594 that seq_cst is
3595 stronger than
3596 acquire/release as
3597 the reordering of
3598 load acquire
3599 followed by a store
3600 release is
3601 prevented by the
3602 waitcnt of
3603 the release, but
3604 there is nothing
3605 preventing a store
3606 release followed by
3607 load acquire from
3608 competing out of
3609 order.)
3610
3611 2. *Following
3612 instructions same as
3613 corresponding load
3614 atomic acquire,
3615 except must generated
3616 all instructions even
3617 for OpenCL.*
3618 load atomic seq_cst - workgroup - local *Same as corresponding
3619 load atomic acquire,
3620 except must generated
3621 all instructions even
3622 for OpenCL.*
3623 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3624 - system - generic vmcnt(0)
3625
3626 - Could be split into
3627 separate s_waitcnt
3628 vmcnt(0)
3629 and s_waitcnt
3630 lgkmcnt(0) to allow
3631 them to be
3632 independently moved
3633 according to the
3634 following rules.
3635 - waitcnt lgkmcnt(0)
3636 must happen after
3637 preceding
3638 global/generic load
3639 atomic/store
3640 atomic/atomicrmw
3641 with memory
3642 ordering of seq_cst
3643 and with equal or
3644 wider sync scope.
3645 (Note that seq_cst
3646 fences have their
3647 own s_waitcnt
3648 lgkmcnt(0) and so do
3649 not need to be
3650 considered.)
3651 - waitcnt vmcnt(0)
3652 must happen after
Tony Tyef16a45e2017-06-06 20:31:59 +00003653 preceding
3654 global/generic load
3655 atomic/store
3656 atomic/atomicrmw
3657 with memory
3658 ordering of seq_cst
3659 and with equal or
3660 wider sync scope.
3661 (Note that seq_cst
3662 fences have their
3663 own s_waitcnt
3664 vmcnt(0) and so do
3665 not need to be
3666 considered.)
3667 - Ensures any
3668 preceding
3669 sequential
3670 consistent global
3671 memory instructions
3672 have completed
3673 before executing
3674 this sequentially
3675 consistent
3676 instruction. This
3677 prevents reordering
3678 a seq_cst store
3679 followed by a
Tony Tye6baa6d22017-10-18 22:16:55 +00003680 seq_cst load. (Note
Tony Tyef16a45e2017-06-06 20:31:59 +00003681 that seq_cst is
3682 stronger than
3683 acquire/release as
3684 the reordering of
3685 load acquire
3686 followed by a store
3687 release is
3688 prevented by the
Tony Tye6baa6d22017-10-18 22:16:55 +00003689 waitcnt of
Tony Tyef16a45e2017-06-06 20:31:59 +00003690 the release, but
3691 there is nothing
3692 preventing a store
3693 release followed by
3694 load acquire from
3695 competing out of
3696 order.)
3697
3698 2. *Following
3699 instructions same as
3700 corresponding load
Tony Tye6baa6d22017-10-18 22:16:55 +00003701 atomic acquire,
3702 except must generated
3703 all instructions even
3704 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003705 store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003706 - wavefront - local store atomic release,
3707 - workgroup - generic except must generated
3708 all instructions even
3709 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003710 store atomic seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003711 - system - generic store atomic release,
3712 except must generated
3713 all instructions even
3714 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003715 atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003716 - wavefront - local atomicrmw acq_rel,
3717 - workgroup - generic except must generated
3718 all instructions even
3719 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003720 atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003721 - system - generic atomicrmw acq_rel,
3722 except must generated
3723 all instructions even
3724 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003725 fence seq_cst - singlethread *none* *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003726 - wavefront fence acq_rel,
3727 - workgroup except must generated
3728 - agent all instructions even
3729 - system for OpenCL.*
3730 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00003731
3732The memory order also adds the single thread optimization constrains defined in
3733table
3734:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3735
3736 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3737 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3738
3739 ============ ==============================================================
3740 LLVM Memory Optimization Constraints
3741 Ordering
3742 ============ ==============================================================
3743 unordered *none*
3744 monotonic *none*
3745 acquire - If a load atomic/atomicrmw then no following load/load
3746 atomic/store/ store atomic/atomicrmw/fence instruction can
3747 be moved before the acquire.
3748 - If a fence then same as load atomic, plus no preceding
3749 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003750 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003751 atomic/store/ store atomic/atomicrmw/fence instruction can
3752 be moved after the release.
3753 - If a fence then same as store atomic, plus no following
3754 associated fence-paired-atomic can be moved before the
3755 fence.
3756 acq_rel Same constraints as both acquire and release.
3757 seq_cst - If a load atomic then same constraints as acquire, plus no
3758 preceding sequentially consistent load atomic/store
3759 atomic/atomicrmw/fence instruction can be moved after the
3760 seq_cst.
3761 - If a store atomic then the same constraints as release, plus
3762 no following sequentially consistent load atomic/store
3763 atomic/atomicrmw/fence instruction can be moved before the
3764 seq_cst.
3765 - If an atomicrmw/fence then same constraints as acq_rel.
3766 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003767
Wei Ding16289cf2017-02-21 18:48:01 +00003768Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003769~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003770
Tony Tyef16a45e2017-06-06 20:31:59 +00003771For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3772(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3773the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003774
Tony Tyef16a45e2017-06-06 20:31:59 +00003775 .. table:: AMDGPU Trap Handler for AMDHSA OS
3776 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003777
Tony Tyef16a45e2017-06-06 20:31:59 +00003778 =================== =============== =============== =======================
3779 Usage Code Sequence Trap Handler Description
3780 Inputs
3781 =================== =============== =============== =======================
3782 reserved ``s_trap 0x00`` Reserved by hardware.
3783 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3784 ``queue_ptr`` ``debugtrap``
3785 ``VGPR0``: intrinsic (not
3786 ``arg`` implemented).
3787 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3788 ``queue_ptr`` terminated and its
3789 associated queue put
3790 into the error state.
3791 ``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
3792 ``queue_ptr`` installed handled
3793 same as ``llvm.trap``.
3794 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
3795 breakpoints.
3796 debugger ``s_trap 0x08`` Reserved for debugger.
3797 debugger ``s_trap 0xfe`` Reserved for debugger.
3798 debugger ``s_trap 0xff`` Reserved for debugger.
3799 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003800
Tim Corringhamaf2dfc62018-04-04 13:02:09 +00003801AMDPAL
3802------
3803
3804This section provides code conventions used when the target triple OS is
3805``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
3806from the application/runtime to each invocation of a hardware shader. These
3807parameters include both generic, application-controlled parameters called
3808*user data* as well as system-generated parameters that are a product of the
3809draw or dispatch execution.
3810
3811User Data
3812~~~~~~~~~
3813
3814Each hardware stage has a set of 32-bit *user data registers* which can be
3815written from a command buffer and then loaded into SGPRs when waves are launched
3816via a subsequent dispatch or draw operation. This is the way most arguments are
3817passed from the application/runtime to a hardware shader.
3818
3819Compute User Data
3820~~~~~~~~~~~~~~~~~
3821
3822Compute shader user data mappings are simpler than graphics shaders, and have a
3823fixed mapping.
3824
3825Note that there are always 10 available *user data entries* in registers -
3826entries beyond that limit must be fetched from memory (via the spill table
3827pointer) by the shader.
3828
3829 .. table:: PAL Compute Shader User Data Registers
3830 :name: pal-compute-user-data-registers
3831
3832 ============= ================================
3833 User Register Description
3834 ============= ================================
3835 0 Global Internal Table (32-bit pointer)
3836 1 Per-Shader Internal Table (32-bit pointer)
3837 2 - 11 Application-Controlled User Data (10 32-bit values)
3838 12 Spill Table (32-bit pointer)
3839 13 - 14 Thread Group Count (64-bit pointer)
3840 15 GDS Range
3841 ============= ================================
3842
3843Graphics User Data
3844~~~~~~~~~~~~~~~~~~
3845
3846Graphics pipelines support a much more flexible user data mapping:
3847
3848 .. table:: PAL Graphics Shader User Data Registers
3849 :name: pal-graphics-user-data-registers
3850
3851 ============= ================================
3852 User Register Description
3853 ============= ================================
3854 0 Global Internal Table (32-bit pointer)
3855 + Per-Shader Internal Table (32-bit pointer)
3856 + 1-15 Application Controlled User Data
3857 (1-15 Contiguous 32-bit Values in Registers)
3858 + Spill Table (32-bit pointer)
3859 + Draw Index (First Stage Only)
3860 + Vertex Offset (First Stage Only)
3861 + Instance Offset (First Stage Only)
3862 ============= ================================
3863
3864 The placement of the global internal table remains fixed in the first *user
3865 data SGPR register*. Otherwise all parameters are optional, and can be mapped
3866 to any desired *user data SGPR register*, with the following regstrictions:
3867
3868 * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
3869 activehardware stage in a graphics pipeline (i.e. where the API vertex
3870 shader runs).
3871
3872 * Application-controlled user data must be mapped into a contiguous range of
3873 user data registers.
3874
3875 * The application-controlled user data range supports compaction remapping, so
3876 only *entries* that are actually consumed by the shader must be assigned to
3877 corresponding *registers*. Note that in order to support an efficient runtime
3878 implementation, the remapping must pack *registers* in the same order as
3879 *entries*, with unused *entries* removed.
3880
3881.. _pal_global_internal_table:
3882
3883Global Internal Table
3884~~~~~~~~~~~~~~~~~~~~~
3885
3886The global internal table is a table of *shader resource descriptors* (SRDs) that
3887define how certain engine-wide, runtime-managed resources should be accessed
3888from a shader. The majority of these resources have HW-defined formats, and it
3889is up to the compiler to write/read data as required by the target hardware.
3890
3891The following table illustrates the required format:
3892
3893 .. table:: PAL Global Internal Table
3894 :name: pal-git-table
3895
3896 ============= ================================
3897 Offset Description
3898 ============= ================================
3899 0-3 Graphics Scratch SRD
3900 4-7 Compute Scratch SRD
3901 8-11 ES/GS Ring Output SRD
3902 12-15 ES/GS Ring Input SRD
3903 16-19 GS/VS Ring Output #0
3904 20-23 GS/VS Ring Output #1
3905 24-27 GS/VS Ring Output #2
3906 28-31 GS/VS Ring Output #3
3907 32-35 GS/VS Ring Input SRD
3908 36-39 Tessellation Factor Buffer SRD
3909 40-43 Off-Chip LDS Buffer SRD
3910 44-47 Off-Chip Param Cache Buffer SRD
3911 48-51 Sample Position Buffer SRD
3912 52 vaRange::ShadowDescriptorTable High Bits
3913 ============= ================================
3914
3915 The pointer to the global internal table passed to the shader as user data
3916 is a 32-bit pointer. The top 32 bits should be assumed to be the same as
3917 the top 32 bits of the pipeline, so the shader may use the program
3918 counter's top 32 bits.
3919
Tony Tye46d35762017-08-15 20:47:41 +00003920Unspecified OS
3921--------------
3922
3923This section provides code conventions used when the target triple OS is
3924empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003925
3926Trap Handler ABI
3927~~~~~~~~~~~~~~~~
3928
3929For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3930not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3931instructions are handled as follows:
3932
3933 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3934 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3935
3936 =============== =============== ===========================================
3937 Usage Code Sequence Description
3938 =============== =============== ===========================================
3939 llvm.trap s_endpgm Causes wavefront to be terminated.
3940 llvm.debugtrap *none* Compiler warning given that there is no
3941 trap handler installed.
3942 =============== =============== ===========================================
3943
3944Source Languages
3945================
3946
3947.. _amdgpu-opencl:
3948
3949OpenCL
3950------
3951
Tony Tyef16a45e2017-06-06 20:31:59 +00003952When the language is OpenCL the following differences occur:
3953
39541. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
Tony Tye7a893d42018-03-23 18:45:18 +000039552. The AMDGPU backend appends additional arguments to the kernel's explicit
3956 arguments for the AMDHSA OS (see
3957 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
Tony Tye46d35762017-08-15 20:47:41 +000039583. Additional metadata is generated
Tony Tye7a893d42018-03-23 18:45:18 +00003959 (see :ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003960
Tony Tye7a893d42018-03-23 18:45:18 +00003961 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
3962 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
3963
3964 ======== ==== ========= ===========================================
3965 Position Byte Byte Description
3966 Size Alignment
3967 ======== ==== ========= ===========================================
Tony Tye88441a32018-03-23 18:58:47 +00003968 1 8 8 OpenCL Global Offset X
3969 2 8 8 OpenCL Global Offset Y
3970 3 8 8 OpenCL Global Offset Z
3971 4 8 8 OpenCL address of printf buffer
3972 5 8 8 OpenCL address of virtual queue used by
3973 enqueue_kernel.
3974 6 8 8 OpenCL address of AqlWrap struct used by
3975 enqueue_kernel.
Tony Tye7a893d42018-03-23 18:45:18 +00003976 ======== ==== ========= ===========================================
Tony Tyef16a45e2017-06-06 20:31:59 +00003977
3978.. _amdgpu-hcc:
3979
3980HCC
3981---
3982
Tony Tye7a893d42018-03-23 18:45:18 +00003983When the language is HCC the following differences occur:
Tony Tyef16a45e2017-06-06 20:31:59 +00003984
39851. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
3986
Tom Stellard45bb48e2015-06-13 03:28:10 +00003987Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00003988---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00003989
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00003990AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef59d0712017-11-10 20:51:43 +00003991It supports AMDGCN GFX6-GFX9.
Tom Stellard45bb48e2015-06-13 03:28:10 +00003992
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00003993This section describes general syntax for instructions and operands.
3994
3995Instructions
3996~~~~~~~~~~~~
3997
3998.. toctree::
3999 :hidden:
4000
4001 AMDGPUAsmGFX7
4002 AMDGPUAsmGFX8
4003 AMDGPUAsmGFX9
4004 AMDGPUOperandSyntax
4005
4006An instruction has the following syntax:
4007
4008 *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*
4009
4010Note that operands are normally comma-separated while modifiers are space-separated.
4011
4012The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
4013
4014See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
4015:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
4016
4017Note that features under development are not included in this description.
4018
4019For more information about instructions, their semantics and supported combinations of
Tony Tyef16a45e2017-06-06 20:31:59 +00004020operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004021[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00004022
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004023Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00004024~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00004025
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004026The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004027
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004028* SGPR registers: s0, ... or s[0], ...
4029* VGPR registers: v0, ... or v[0], ...
4030* TTMP registers: ttmp0, ... or ttmp[0], ...
4031* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
4032* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
4033* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
4034* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
4035* Register index expressions: v[2*2], s[1-1:2-1]
4036* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00004037
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004038Modifiers
4039~~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00004040
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004041Detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004042
Tony Tyef16a45e2017-06-06 20:31:59 +00004043Instruction Examples
4044~~~~~~~~~~~~~~~~~~~~
4045
4046DS
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004047++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004048
4049.. code-block:: nasm
4050
4051 ds_add_u32 v2, v4 offset:16
4052 ds_write_src2_b64 v2 offset0:4 offset1:8
4053 ds_cmpst_f32 v2, v4, v6
4054 ds_min_rtn_f64 v[8:9], v2, v[4:5]
4055
4056
4057For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
4058
Tony Tyef16a45e2017-06-06 20:31:59 +00004059FLAT
4060++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004061
4062.. code-block:: nasm
4063
4064 flat_load_dword v1, v[3:4]
4065 flat_store_dwordx3 v[3:4], v[5:7]
4066 flat_atomic_swap v1, v[3:4], v5 glc
4067 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
4068 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
4069
4070For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
4071
Tony Tyef16a45e2017-06-06 20:31:59 +00004072MUBUF
4073+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004074
4075.. code-block:: nasm
4076
4077 buffer_load_dword v1, off, s[4:7], s1
4078 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
4079 buffer_store_format_xy v[1:2], off, s[4:7], s1
4080 buffer_wbinvl1
4081 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
4082
4083For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
4084
Tony Tyef16a45e2017-06-06 20:31:59 +00004085SMRD/SMEM
4086+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004087
4088.. code-block:: nasm
4089
4090 s_load_dword s1, s[2:3], 0xfc
4091 s_load_dwordx8 s[8:15], s[2:3], s4
4092 s_load_dwordx16 s[88:103], s[2:3], s4
4093 s_dcache_inv_vol
4094 s_memtime s[4:5]
4095
4096For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
4097
Tony Tyef16a45e2017-06-06 20:31:59 +00004098SOP1
4099++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004100
4101.. code-block:: nasm
4102
4103 s_mov_b32 s1, s2
4104 s_mov_b64 s[0:1], 0x80000000
4105 s_cmov_b32 s1, 200
4106 s_wqm_b64 s[2:3], s[4:5]
4107 s_bcnt0_i32_b64 s1, s[2:3]
4108 s_swappc_b64 s[2:3], s[4:5]
4109 s_cbranch_join s[4:5]
4110
4111For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
4112
Tony Tyef16a45e2017-06-06 20:31:59 +00004113SOP2
4114++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004115
4116.. code-block:: nasm
4117
4118 s_add_u32 s1, s2, s3
4119 s_and_b64 s[2:3], s[4:5], s[6:7]
4120 s_cselect_b32 s1, s2, s3
4121 s_andn2_b32 s2, s4, s6
4122 s_lshr_b64 s[2:3], s[4:5], s6
4123 s_ashr_i32 s2, s4, s6
4124 s_bfm_b64 s[2:3], s4, s6
4125 s_bfe_i64 s[2:3], s[4:5], s6
4126 s_cbranch_g_fork s[4:5], s[6:7]
4127
4128For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
4129
Tony Tyef16a45e2017-06-06 20:31:59 +00004130SOPC
4131++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004132
4133.. code-block:: nasm
4134
4135 s_cmp_eq_i32 s1, s2
4136 s_bitcmp1_b32 s1, s2
4137 s_bitcmp0_b64 s[2:3], s4
4138 s_setvskip s3, s5
4139
4140For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
4141
Tony Tyef16a45e2017-06-06 20:31:59 +00004142SOPP
4143++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004144
4145.. code-block:: nasm
4146
4147 s_barrier
4148 s_nop 2
4149 s_endpgm
4150 s_waitcnt 0 ; Wait for all counters to be 0
4151 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
4152 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
4153 s_sethalt 9
4154 s_sleep 10
4155 s_sendmsg 0x1
4156 s_sendmsg sendmsg(MSG_INTERRUPT)
4157 s_trap 1
4158
4159For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
4160
4161Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00004162of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00004163range or acceptable values.
4164
Tony Tyef16a45e2017-06-06 20:31:59 +00004165VALU
4166++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00004167
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004168For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
4169the assembler will automatically use optimal encoding based on its operands.
4170To force specific encoding, one can add a suffix to the opcode of the instruction:
4171
4172* _e32 for 32-bit VOP1/VOP2/VOPC
4173* _e64 for 64-bit VOP3
4174* _dpp for VOP_DPP
4175* _sdwa for VOP_SDWA
4176
4177VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004178
4179.. code-block:: nasm
4180
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004181 v_mov_b32 v1, v2
4182 v_mov_b32_e32 v1, v2
4183 v_nop
4184 v_cvt_f64_i32_e32 v[1:2], v2
4185 v_floor_f32_e32 v1, v2
4186 v_bfrev_b32_e32 v1, v2
4187 v_add_f32_e32 v1, v2, v3
4188 v_mul_i32_i24_e64 v1, v2, 3
4189 v_mul_i32_i24_e32 v1, -3, v3
4190 v_mul_i32_i24_e32 v1, -100, v3
4191 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4192 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00004193
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004194VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004195
4196.. code-block:: nasm
4197
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004198 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4199 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4200 v_mov_b32 v0, v0 wave_shl:1
4201 v_mov_b32 v0, v0 row_mirror
4202 v_mov_b32 v0, v0 row_bcast:31
4203 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4204 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4205 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00004206
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004207VOP_SDWA examples:
4208
4209.. code-block:: nasm
4210
4211 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4212 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4213 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4214 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4215 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4216
4217For full list of supported instructions, refer to "Vector ALU instructions".
4218
4219HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00004220~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004221
4222AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4223one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00004224
4225.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00004226+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00004227
4228*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004229object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004230
4231.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00004232+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4233
Tom Stellard347ac792015-06-26 21:15:07 +00004234
4235*major*, *minor*, and *stepping* are all integers that describe the instruction
4236set architecture (ISA) version of the assembly program.
4237
4238*vendor* and *arch* are quoted strings. *vendor* should always be equal to
4239"AMD" and *arch* should always be equal to "AMDGPU".
4240
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004241By default, the assembler will derive the ISA version, *vendor*, and *arch*
4242from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004243
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004244.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00004245+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004246
4247This directives specifies that the symbol with given name is a kernel entry point
4248(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00004249
4250.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00004251++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00004252
4253This directive marks the beginning of a list of key / value pairs that are used
4254to specify the amd_kernel_code_t object that will be emitted by the assembler.
4255The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4256any amd_kernel_code_t values that are unspecified a default value will be
4257used. The default value for all keys is 0, with the following exceptions:
4258
4259- *kernel_code_version_major* defaults to 1.
4260- *machine_kind* defaults to 1.
4261- *machine_version_major*, *machine_version_minor*, and
4262 *machine_version_stepping* are derived from the value of the -mcpu option
4263 that is passed to the assembler.
4264- *kernel_code_entry_byte_offset* defaults to 256.
4265- *wavefront_size* defaults to 6.
4266- *kernarg_segment_alignment*, *group_segment_alignment*, and
Tony Tye6baa6d22017-10-18 22:16:55 +00004267 *private_segment_alignment* default to 4. Note that alignments are specified
Tom Stellardff7416b2015-06-26 21:58:31 +00004268 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4269
4270The *.amd_kernel_code_t* directive must be placed immediately after the
4271function label and before any instructions.
4272
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004273For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4274comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00004275
4276Here is an example of a minimal amd_kernel_code_t specification:
4277
Aaron Ballman887ad0e2016-07-19 17:46:55 +00004278.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00004279
4280 .hsa_code_object_version 1,0
4281 .hsa_code_object_isa
4282
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004283 .hsatext
4284 .globl hello_world
4285 .p2align 8
4286 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00004287
4288 hello_world:
4289
4290 .amd_kernel_code_t
4291 enable_sgpr_kernarg_segment_ptr = 1
4292 is_ptr64 = 1
4293 compute_pgm_rsrc1_vgprs = 0
4294 compute_pgm_rsrc1_sgprs = 0
4295 compute_pgm_rsrc2_user_sgpr = 2
4296 kernarg_segment_byte_size = 8
4297 wavefront_sgpr_count = 2
4298 workitem_vgpr_count = 3
4299 .end_amd_kernel_code_t
4300
4301 s_load_dwordx2 s[0:1], s[0:1] 0x0
4302 v_mov_b32 v0, 3.14159
4303 s_waitcnt lgkmcnt(0)
4304 v_mov_b32 v1, s0
4305 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004306 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00004307 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00004308 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004309 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00004310
4311Additional Documentation
4312========================
4313
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004314.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4315.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4316.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4317.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4318.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4319.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4320.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4321.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004322.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4323.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4324.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4325.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4326.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00004327.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004328.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4329.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__