blob: 773e0742fa9378364c77aefb4ac78d906633c728 [file] [log] [blame]
Eugene Zelenko3507b042018-03-21 17:09:35 +00001=============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
Tony Tye07d9f102017-11-10 01:00:54 +000026 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
Tony Tyef16a45e2017-06-06 20:31:59 +000028
Tony Tye07d9f102017-11-10 01:00:54 +000029 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000035
Tony Tye07d9f102017-11-10 01:00:54 +000036 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000038
Tony Tye07d9f102017-11-10 01:00:54 +000039 ============ ==============================================================
40 Vendor Description
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000045
Tony Tye07d9f102017-11-10 01:00:54 +000046 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
Tony Tyef16a45e2017-06-06 20:31:59 +000048
Tony Tye07d9f102017-11-10 01:00:54 +000049 ============== ============================================================
50 OS Description
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
56 runtime.
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
58 runtime.
59 ============== ============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000060
Tony Tye07d9f102017-11-10 01:00:54 +000061 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
Tony Tyef16a45e2017-06-06 20:31:59 +000063
Tony Tye07d9f102017-11-10 01:00:54 +000064 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
Tony Tye7a893d42018-03-23 18:45:18 +000067 *<empty>* Default.
Tony Tye07d9f102017-11-10 01:00:54 +000068 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000069
70.. _amdgpu-processors:
71
72Processors
73----------
74
75Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
76names from both the *Processor* and *Alternative Processor* can be used.
77
78 .. table:: AMDGPU Processors
Tony Tye07d9f102017-11-10 01:00:54 +000079 :name: amdgpu-processor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000080
Tony Tye31105cc2017-12-11 15:35:27 +000081 =========== =============== ============ ===== ========= ======= ==================
82 Processor Alternative Target dGPU/ Target ROCm Example
83 Processor Triple APU Features Support Products
84 Architecture Supported
85 [Default]
86 =========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000087 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Tony Tye31105cc2017-12-11 15:35:27 +000088 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +000089 ``r600`` ``r600`` dGPU
90 ``r630`` ``r600`` dGPU
91 ``rs880`` ``r600`` dGPU
92 ``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000093 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Tony Tye31105cc2017-12-11 15:35:27 +000094 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +000095 ``rv710`` ``r600`` dGPU
96 ``rv730`` ``r600`` dGPU
97 ``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000098 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Tony Tye31105cc2017-12-11 15:35:27 +000099 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000100 ``cedar`` ``r600`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000101 ``cypress`` ``r600`` dGPU
102 ``juniper`` ``r600`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000103 ``redwood`` ``r600`` dGPU
104 ``sumo`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000106 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000107 ``barts`` ``r600`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000108 ``caicos`` ``r600`` dGPU
109 ``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000110 ``turks`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Tony Tye31105cc2017-12-11 15:35:27 +0000112 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000113 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000114 ``gfx601`` - ``hainan`` ``amdgcn`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000115 - ``oland``
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000116 - ``pitcairn``
117 - ``verde``
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Tony Tye31105cc2017-12-11 15:35:27 +0000119 -----------------------------------------------------------------------------------
120 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
121 - A6 Pro-7050B
122 - A8-7100
123 - A8 Pro-7150B
124 - A10-7300
125 - A10 Pro-7350B
126 - FX-7500
127 - A8-7200P
128 - A10-7400P
129 - FX-7600P
130 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
131 - FirePro W9100
132 - FirePro S9150
133 - FirePro S9170
134 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
135 - Radeon R9 290x
136 - Radeon R390
137 - Radeon R390x
138 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
139 - ``mullins`` - E1-2200
140 - E1-2500
141 - E2-3000
142 - E2-3800
143 - A4-5000
144 - A4-5100
145 - A6-5200
146 - A4 Pro-3340B
147 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
148 - Radeon HD 8770
149 - R7 260
150 - R7 260X
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Tony Tye31105cc2017-12-11 15:35:27 +0000152 -----------------------------------------------------------------------------------
Tony Tye31105cc2017-12-11 15:35:27 +0000153 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
154 [on] - Pro A6-8500B
155 - A8-8600P
156 - Pro A8-8600B
157 - FX-8800P
158 - Pro A12-8800B
159 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
160 [on] - Pro A10-8700B
161 - A10-8780P
162 \ ``amdgcn`` APU - xnack - A10-9600P
163 [on] - A10-9630P
164 - A12-9700P
165 - A12-9730P
166 - FX-9800P
167 - FX-9830P
168 \ ``amdgcn`` APU - xnack - E2-9010
169 [on] - A6-9210
170 - A9-9410
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000171 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
172 - ``tonga`` [off] - FirePro S7100
Tony Tye31105cc2017-12-11 15:35:27 +0000173 - FirePro W7100
174 - Radeon R285
175 - Radeon R9 380
176 - Radeon R9 385
177 - Mobile FirePro
178 M7170
179 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
180 [off] - Radeon R9 Fury
181 - Radeon R9 FuryX
182 - Radeon Pro Duo
183 - FirePro S9300x2
184 - Radeon Instinct MI8
185 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
186 [off] - Radeon RX 480
187 - Radeon Instinct MI6
188 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
189 [off]
190 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
191 [on]
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000192 **GCN GFX9** [AMD-GCN-GFX9]_
Tony Tye31105cc2017-12-11 15:35:27 +0000193 -----------------------------------------------------------------------------------
194 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
195 [off] Frontier Edition
196 - Radeon RX Vega 56
197 - Radeon RX Vega 64
198 - Radeon RX Vega 64
199 Liquid
200 - Radeon Instinct MI25
Tony Tyeb6efb902018-04-14 01:58:10 +0000201 ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G
202 [on] - Ryzen 5 2400G
Matt Arsenault0084adc2018-04-30 19:08:16 +0000203 ``gfx904`` ``amdgcn`` dGPU - xnack *TBA*
204 [off]
205 .. TODO
206 Add product
207 names.
208 ``gfx906`` ``amdgcn`` dGPU - xnack *TBA*
209 [off]
210 .. TODO
211 Add product
212 names.
Tony Tye31105cc2017-12-11 15:35:27 +0000213 =========== =============== ============ ===== ========= ======= ==================
Tony Tye07d9f102017-11-10 01:00:54 +0000214
215.. _amdgpu-target-features:
216
217Target Features
218---------------
219
220Target features control how code is generated to support certain
Tony Tye31105cc2017-12-11 15:35:27 +0000221processor specific features. Not all target features are supported by
222all processors. The runtime must ensure that the features supported by
223the device used to execute the code match the features enabled when
224generating the code. A mismatch of features may result in incorrect
225execution, or a reduction in performance.
226
227The target features supported by each processor, and the default value
228used if not specified explicitly, is listed in
229:ref:`amdgpu-processor-table`.
Tony Tye07d9f102017-11-10 01:00:54 +0000230
231Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
232target features.
233
234For example:
235
236``-mxnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000237 Enable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000238``-mno-xnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000239 Disable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000240
241 .. table:: AMDGPU Target Features
242 :name: amdgpu-target-feature-table
243
Tony Tye31105cc2017-12-11 15:35:27 +0000244 ============== ==================================================
245 Target Feature Description
246 ============== ==================================================
247 -m[no-]xnack Enable/disable generating code that has
248 memory clauses that are compatible with
249 having XNACK replay enabled.
Tony Tye07d9f102017-11-10 01:00:54 +0000250
Tony Tye31105cc2017-12-11 15:35:27 +0000251 This is used for demand paging and page
252 migration. If XNACK replay is enabled in
253 the device, then if a page fault occurs
254 the code may execute incorrectly if the
255 ``xnack`` feature is not enabled. Executing
256 code that has the feature enabled on a
257 device that does not have XNACK replay
258 enabled will execute correctly, but may
259 be less performant than code with the
260 feature disabled.
261 ============== ==================================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000262
263.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000264
265Address Spaces
266--------------
267
Tony Tyef16a45e2017-06-06 20:31:59 +0000268The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000269
Tony Tyef16a45e2017-06-06 20:31:59 +0000270The memory space names used in the table, aside from the region memory space, is
271from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000272
Tony Tyef16a45e2017-06-06 20:31:59 +0000273LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000274
Tony Tyef16a45e2017-06-06 20:31:59 +0000275 .. table:: Address Space Mapping
276 :name: amdgpu-address-space-mapping-table
277
Yaxun Liu0124b542018-02-13 18:00:25 +0000278 ================== =================
Tony Tyef16a45e2017-06-06 20:31:59 +0000279 LLVM Address Space Memory Space
Yaxun Liu0124b542018-02-13 18:00:25 +0000280 ================== =================
281 0 Generic (Flat)
282 1 Global
283 2 Region (GDS)
284 3 Local (group/LDS)
285 4 Constant
286 5 Private (Scratch)
287 6 Constant 32-bit
288 ================== =================
Tony Tyef16a45e2017-06-06 20:31:59 +0000289
290.. _amdgpu-memory-scopes:
291
292Memory Scopes
293-------------
294
295This section provides LLVM memory synchronization scopes supported by the AMDGPU
296backend memory model when the target triple OS is ``amdhsa`` (see
297:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
298
299The memory model supported is based on the HSA memory model [HSA]_ which is
300based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
301relation is transitive over the synchonizes-with relation independent of scope,
302and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye07d9f102017-11-10 01:00:54 +0000303table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000304
305This is different to the OpenCL [OpenCL]_ memory model which does not have scope
306inclusion and requires the memory scopes to exactly match. However, this
307is conservatively correct for OpenCL.
308
Tony Tye07d9f102017-11-10 01:00:54 +0000309 .. table:: AMDHSA LLVM Sync Scopes
310 :name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tyef16a45e2017-06-06 20:31:59 +0000311
312 ================ ==========================================================
313 LLVM Sync Scope Description
314 ================ ==========================================================
315 *none* The default: ``system``.
316
317 Synchronizes with, and participates in modification and
318 seq_cst total orderings with, other operations (except
319 image operations) for all address spaces (except private,
320 or generic that accesses private) provided the other
321 operation's sync scope is:
322
323 - ``system``.
324 - ``agent`` and executed by a thread on the same agent.
325 - ``workgroup`` and executed by a thread in the same
326 workgroup.
327 - ``wavefront`` and executed by a thread in the same
328 wavefront.
329
330 ``agent`` Synchronizes with, and participates in modification and
331 seq_cst total orderings with, other operations (except
332 image operations) for all address spaces (except private,
333 or generic that accesses private) provided the other
334 operation's sync scope is:
335
336 - ``system`` or ``agent`` and executed by a thread on the
337 same agent.
338 - ``workgroup`` and executed by a thread in the same
339 workgroup.
340 - ``wavefront`` and executed by a thread in the same
341 wavefront.
342
343 ``workgroup`` Synchronizes with, and participates in modification and
344 seq_cst total orderings with, other operations (except
345 image operations) for all address spaces (except private,
346 or generic that accesses private) provided the other
347 operation's sync scope is:
348
349 - ``system``, ``agent`` or ``workgroup`` and executed by a
350 thread in the same workgroup.
351 - ``wavefront`` and executed by a thread in the same
352 wavefront.
353
354 ``wavefront`` Synchronizes with, and participates in modification and
355 seq_cst total orderings with, other operations (except
356 image operations) for all address spaces (except private,
357 or generic that accesses private) provided the other
358 operation's sync scope is:
359
360 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
361 and executed by a thread in the same wavefront.
362
363 ``singlethread`` Only synchronizes with, and participates in modification
364 and seq_cst total orderings with, other operations (except
365 image operations) running in the same thread for all
366 address spaces (for example, in signal handlers).
367 ================ ==========================================================
368
369AMDGPU Intrinsics
370-----------------
371
372The AMDGPU backend implements the following intrinsics.
373
374*This section is WIP.*
375
376.. TODO
377 List AMDGPU intrinsics
378
379Code Object
380===========
381
382The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
383can be linked by ``lld`` to produce a standard ELF shared code object which can
384be loaded and executed on an AMDGPU target.
385
386Header
387------
388
389The AMDGPU backend uses the following ELF header:
390
391 .. table:: AMDGPU ELF Header
392 :name: amdgpu-elf-header-table
393
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000394 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000395 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000396 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000397 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
398 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye07d9f102017-11-10 01:00:54 +0000399 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
400 - ``ELFOSABI_AMDGPU_HSA``
401 - ``ELFOSABI_AMDGPU_PAL``
402 - ``ELFOSABI_AMDGPU_MESA3D``
403 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
404 - ``ELFABIVERSION_AMDGPU_PAL``
405 - ``ELFABIVERSION_AMDGPU_MESA3D``
406 ``e_type`` - ``ET_REL``
407 - ``ET_DYN``
Tony Tyef16a45e2017-06-06 20:31:59 +0000408 ``e_machine`` ``EM_AMDGPU``
409 ``e_entry`` 0
Tony Tye07d9f102017-11-10 01:00:54 +0000410 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000411 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000412
413..
414
415 .. table:: AMDGPU ELF Header Enumeration Values
416 :name: amdgpu-elf-header-enumeration-values-table
417
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000418 =============================== =====
419 Name Value
420 =============================== =====
421 ``EM_AMDGPU`` 224
Tony Tye07d9f102017-11-10 01:00:54 +0000422 ``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000423 ``ELFOSABI_AMDGPU_HSA`` 64
424 ``ELFOSABI_AMDGPU_PAL`` 65
425 ``ELFOSABI_AMDGPU_MESA3D`` 66
426 ``ELFABIVERSION_AMDGPU_HSA`` 1
427 ``ELFABIVERSION_AMDGPU_PAL`` 0
428 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
429 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000430
431``e_ident[EI_CLASS]``
Tony Tye07d9f102017-11-10 01:00:54 +0000432 The ELF class is:
433
434 * ``ELFCLASS32`` for ``r600`` architecture.
435
436 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
437 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000438
439``e_ident[EI_DATA]``
Tony Tye07d9f102017-11-10 01:00:54 +0000440 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tyef16a45e2017-06-06 20:31:59 +0000441
442``e_ident[EI_OSABI]``
Tony Tye07d9f102017-11-10 01:00:54 +0000443 One of the following AMD GPU architecture specific OS ABIs
444 (see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000445
Tony Tye07d9f102017-11-10 01:00:54 +0000446 * ``ELFOSABI_NONE`` for *unknown* OS.
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000447
Tony Tye07d9f102017-11-10 01:00:54 +0000448 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tyef16a45e2017-06-06 20:31:59 +0000449
Tony Tye07d9f102017-11-10 01:00:54 +0000450 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
451
452 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000453
Tony Tyef16a45e2017-06-06 20:31:59 +0000454``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000455 The ABI version of the AMD GPU architecture specific OS ABI to which the code
456 object conforms:
457
458 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
459 runtime ABI.
460
461 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
462 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000463
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000464 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye07d9f102017-11-10 01:00:54 +0000465 3D runtime ABI.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000466
Tony Tyef16a45e2017-06-06 20:31:59 +0000467``e_type``
468 Can be one of the following values:
469
470
471 ``ET_REL``
472 The type produced by the AMD GPU backend compiler as it is relocatable code
473 object.
474
475 ``ET_DYN``
476 The type produced by the linker as it is a shared code object.
477
478 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
479
480``e_machine``
Tony Tye07d9f102017-11-10 01:00:54 +0000481 The value ``EM_AMDGPU`` is used for the machine for all processors supported
482 by the ``r600`` and ``amdgcn`` architectures (see
483 :ref:`amdgpu-processor-table`). The specific processor is specified in the
484 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
485 :ref:`amdgpu-elf-header-e_flags-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000486
487``e_entry``
488 The entry point is 0 as the entry points for individual kernels must be
489 selected in order to invoke them through AQL packets.
490
491``e_flags``
Tony Tye07d9f102017-11-10 01:00:54 +0000492 The AMDGPU backend uses the following ELF header flags:
493
494 .. table:: AMDGPU ELF Header ``e_flags``
495 :name: amdgpu-elf-header-e_flags-table
496
497 ================================= ========== =============================
498 Name Value Description
499 ================================= ========== =============================
500 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
501 -------------------------------------------- -----------------------------
502 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
503 mask for
504 ``EF_AMDGPU_MACH_xxx`` values
505 defined in
506 :ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye31105cc2017-12-11 15:35:27 +0000507 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
508 target feature is
509 enabled for all code
510 contained in the code object.
Tony Tye5bbcca62018-03-08 05:46:01 +0000511 If the processor
512 does not support the
513 ``xnack`` target
514 feature then must
515 be 0.
Tony Tye31105cc2017-12-11 15:35:27 +0000516 See
517 :ref:`amdgpu-target-features`.
Tony Tye07d9f102017-11-10 01:00:54 +0000518 ================================= ========== =============================
519
520 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
521 :name: amdgpu-ef-amdgpu-mach-table
522
523 ================================= ========== =============================
524 Name Value Description (see
525 :ref:`amdgpu-processor-table`)
526 ================================= ========== =============================
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000527 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
528 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
529 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
530 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
531 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
532 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
533 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
534 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
535 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
536 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
537 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
538 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
539 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
540 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
541 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
542 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
543 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
544 *reserved* 0x011 - Reserved for ``r600``
545 0x01f architecture processors.
546 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
547 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
548 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
549 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
550 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
551 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
552 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
553 *reserved* 0x027 Reserved.
554 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
555 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
556 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
557 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
558 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
559 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
Matt Arsenault0084adc2018-04-30 19:08:16 +0000560 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
561 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000562 *reserved* 0x030 Reserved.
Tony Tye07d9f102017-11-10 01:00:54 +0000563 ================================= ========== =============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000564
565Sections
566--------
567
568An AMDGPU target ELF code object has the standard ELF sections which include:
569
570 .. table:: AMDGPU ELF Sections
571 :name: amdgpu-elf-sections-table
572
573 ================== ================ =================================
574 Name Type Attributes
575 ================== ================ =================================
576 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
577 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
578 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
579 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
580 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
581 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
582 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
583 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
584 ``.note`` ``SHT_NOTE`` *none*
585 ``.rela``\ *name* ``SHT_RELA`` *none*
586 ``.rela.dyn`` ``SHT_RELA`` *none*
587 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
588 ``.shstrtab`` ``SHT_STRTAB`` *none*
589 ``.strtab`` ``SHT_STRTAB`` *none*
590 ``.symtab`` ``SHT_SYMTAB`` *none*
591 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
592 ================== ================ =================================
593
594These sections have their standard meanings (see [ELF]_) and are only generated
595if needed.
596
597``.debug``\ *\**
598 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
599 DWARF produced by the AMDGPU backend.
600
Tony Tye46d35762017-08-15 20:47:41 +0000601``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000602 The standard sections used by a dynamic loader.
603
604``.note``
605 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
606 backend.
607
608``.rela``\ *name*, ``.rela.dyn``
609 For relocatable code objects, *name* is the name of the section that the
610 relocation records apply. For example, ``.rela.text`` is the section name for
611 relocation records associated with the ``.text`` section.
612
613 For linked shared code objects, ``.rela.dyn`` contains all the relocation
614 records from each of the relocatable code object's ``.rela``\ *name* sections.
615
616 See :ref:`amdgpu-relocation-records` for the relocation records supported by
617 the AMDGPU backend.
618
619``.text``
620 The executable machine code for the kernels and functions they call. Generated
621 as position independent code. See :ref:`amdgpu-code-conventions` for
622 information on conventions used in the isa generation.
623
624.. _amdgpu-note-records:
625
626Note Records
627------------
628
Tony Tye07d9f102017-11-10 01:00:54 +0000629As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
630be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
631aligned. In addition, minimal zero byte padding must be generated to ensure the
632``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
633``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tyef16a45e2017-06-06 20:31:59 +0000634
635The AMDGPU backend code object uses the following ELF note records in the
636``.note`` section. The *Description* column specifies the layout of the note
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000637record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tyef16a45e2017-06-06 20:31:59 +0000638variable size strings have a corresponding ``*_size`` field that specifies the
639number of bytes, including the terminating null character, in the string. The
640string(s) come immediately after the preceding fields.
641
642Additional note records can be present.
643
644 .. table:: AMDGPU ELF Note Records
645 :name: amdgpu-elf-note-records-table
646
Tony Tye46d35762017-08-15 20:47:41 +0000647 ===== ============================== ======================================
648 Name Type Description
649 ===== ============================== ======================================
650 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye46d35762017-08-15 20:47:41 +0000651 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000652
653..
654
655 .. table:: AMDGPU ELF Note Record Enumeration Values
656 :name: amdgpu-elf-note-record-enumeration-values-table
657
Tony Tye46d35762017-08-15 20:47:41 +0000658 ============================== =====
659 Name Value
660 ============================== =====
661 *reserved* 0-9
662 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye07d9f102017-11-10 01:00:54 +0000663 *reserved* 11
Tony Tye46d35762017-08-15 20:47:41 +0000664 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000665
Tony Tye46d35762017-08-15 20:47:41 +0000666``NT_AMD_AMDGPU_HSA_METADATA``
667 Specifies extensible metadata associated with the code objects executed on HSA
668 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
669 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
670 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
671 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000672
Tony Tye46d35762017-08-15 20:47:41 +0000673.. _amdgpu-symbols:
674
675Symbols
676-------
677
678Symbols include the following:
679
680 .. table:: AMDGPU ELF Symbols
681 :name: amdgpu-elf-symbols-table
682
683 ===================== ============== ============= ==================
684 Name Type Section Description
685 ===================== ============== ============= ==================
686 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
687 - ``.rodata``
688 - ``.bss``
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +0000689 *link-name*\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
Tony Tye46d35762017-08-15 20:47:41 +0000690 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
691 ===================== ============== ============= ==================
692
693Global variable
694 Global variables both used and defined by the compilation unit.
695
696 If the symbol is defined in the compilation unit then it is allocated in the
697 appropriate section according to if it has initialized data or is readonly.
698
699 If the symbol is external then its section is ``STN_UNDEF`` and the loader
700 will resolve relocations using the definition provided by another code object
701 or explicitly defined by the runtime.
702
703 All global symbols, whether defined in the compilation unit or external, are
704 accessed by the machine code indirectly through a GOT table entry. This
705 allows them to be preemptable. The GOT table is only supported when the target
706 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000707
708 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000709 Add description of linked shared object symbols. Seems undefined symbols
710 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000711
Tony Tye46d35762017-08-15 20:47:41 +0000712Kernel descriptor
713 Every HSA kernel has an associated kernel descriptor. It is the address of the
714 kernel descriptor that is used in the AQL dispatch packet used to invoke the
715 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
716 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
717
718Kernel entry point
719 Every HSA kernel also has a symbol for its machine code entry point.
720
721.. _amdgpu-relocation-records:
722
723Relocation Records
724------------------
725
726AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
727relocatable fields are:
728
729``word32``
730 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
731 alignment. These values use the same byte order as other word values in the
732 AMD GPU architecture.
733
734``word64``
735 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
736 alignment. These values use the same byte order as other word values in the
737 AMD GPU architecture.
738
739Following notations are used for specifying relocation calculations:
740
741**A**
742 Represents the addend used to compute the value of the relocatable field.
743
744**G**
745 Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000746 entry's symbol will reside during execution.
Tony Tye46d35762017-08-15 20:47:41 +0000747
748**GOT**
749 Represents the address of the global offset table.
750
751**P**
752 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
753 of the storage unit being relocated (computed using ``r_offset``).
754
755**S**
756 Represents the value of the symbol whose index resides in the relocation
Tony Tyed2884302017-10-16 20:44:29 +0000757 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
758
759**B**
760 Represents the base address of a loaded executable or shared object which is
761 the difference between the ELF address and the actual load address. Relocations
762 using this are only valid in executable or shared objects.
Tony Tye46d35762017-08-15 20:47:41 +0000763
764The following relocation types are supported:
765
766 .. table:: AMDGPU ELF Relocation Records
767 :name: amdgpu-elf-relocation-records-table
768
Tony Tyedb6c9932018-01-30 23:59:43 +0000769 ========================== ======= ===== ========== ==============================
770 Relocation Type Kind Value Field Calculation
771 ========================== ======= ===== ========== ==============================
772 ``R_AMDGPU_NONE`` 0 *none* *none*
Tony Tye223f4c72018-04-13 01:01:27 +0000773 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
774 Dynamic
775 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
776 Dynamic
777 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
Matt Arsenault0084adc2018-04-30 19:08:16 +0000778 Dynamic
Tony Tyedb6c9932018-01-30 23:59:43 +0000779 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
780 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
Tony Tye223f4c72018-04-13 01:01:27 +0000781 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
782 Dynamic
Tony Tyedb6c9932018-01-30 23:59:43 +0000783 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
784 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
785 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
786 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
787 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
788 *reserved* 12
789 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
790 ========================== ======= ===== ========== ==============================
Tony Tye46d35762017-08-15 20:47:41 +0000791
Tony Tye223f4c72018-04-13 01:01:27 +0000792``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
793the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
794
795There is no current OS loader support for 32 bit programs and so
796``R_AMDGPU_ABS32`` is not used.
Matt Arsenault0084adc2018-04-30 19:08:16 +0000797
Tony Tye46d35762017-08-15 20:47:41 +0000798.. _amdgpu-dwarf:
799
800DWARF
801-----
802
Scott Linder16c7bda2018-02-23 23:01:06 +0000803Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
Tony Tye46d35762017-08-15 20:47:41 +0000804information that maps the code object executable code and data to the source
805language constructs. It can be used by tools such as debuggers and profilers.
806
807Address Space Mapping
808~~~~~~~~~~~~~~~~~~~~~
809
810The following address space mapping is used:
811
812 .. table:: AMDGPU DWARF Address Space Mapping
813 :name: amdgpu-dwarf-address-space-mapping-table
814
815 =================== =================
816 DWARF Address Space Memory Space
817 =================== =================
818 1 Private (Scratch)
819 2 Local (group/LDS)
820 *omitted* Global
821 *omitted* Constant
822 *omitted* Generic (Flat)
823 *not supported* Region (GDS)
824 =================== =================
825
826See :ref:`amdgpu-address-spaces` for information on the memory space terminology
827used in the table.
828
829An ``address_class`` attribute is generated on pointer type DIEs to specify the
830DWARF address space of the value of the pointer when it is in the *private* or
831*local* address space. Otherwise the attribute is omitted.
832
833An ``XDEREF`` operation is generated in location list expressions for variables
834that are allocated in the *private* and *local* address space. Otherwise no
835``XDREF`` is omitted.
836
837Register Mapping
838~~~~~~~~~~~~~~~~
839
840*This section is WIP.*
841
842.. TODO
843 Define DWARF register enumeration.
844
845 If want to present a wavefront state then should expose vector registers as
846 64 wide (rather than per work-item view that LLVM uses). Either as separate
847 registers, or a 64x4 byte single register. In either case use a new LANE op
848 (akin to XDREF) to select the current lane usage in a location
849 expression. This would also allow scalar register spilling to vector register
850 lanes to be expressed (currently no debug information is being generated for
851 spilling). If choose a wide single register approach then use LANE in
852 conjunction with PIECE operation to select the dword part of the register for
853 the current lane. If the separate register approach then use LANE to select
854 the register.
855
856Source Text
857~~~~~~~~~~~
858
Scott Linder16c7bda2018-02-23 23:01:06 +0000859Source text for online-compiled programs (e.g. those compiled by the OpenCL
860runtime) may be embedded into the DWARF v5 line table using the ``clang
861-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye46d35762017-08-15 20:47:41 +0000862
Scott Linder16c7bda2018-02-23 23:01:06 +0000863For example:
864
865``-gembed-source``
866 Enable the embedded source DWARF v5 extension.
867``-gno-embed-source``
868 Disable the embedded source DWARF v5 extension.
869
870 .. table:: AMDGPU Debug Options
871 :name: amdgpu-debug-options
872
873 ==================== ==================================================
874 Debug Flag Description
875 ==================== ==================================================
876 -g[no-]embed-source Enable/disable embedding source text in DWARF
877 debug sections. Useful for environments where
878 source cannot be written to disk, such as
879 when performing online compilation.
880 ==================== ==================================================
881
882This option enables one extended content types in the DWARF v5 Line Number
883Program Header, which is used to encode embedded source.
884
885 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
886 :name: amdgpu-dwarf-extended-content-types
887
888 ============================ ======================
889 Content Type Form
890 ============================ ======================
891 ``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
892 ============================ ======================
893
894The source field will contain the UTF-8 encoded, null-terminated source text
895with ``'\n'`` line endings. When the source field is present, consumers can use
896the embedded source instead of attempting to discover the source on disk. When
897the source field is absent, consumers can access the file to get the source
898text.
899
900The above content type appears in the ``file_name_entry_format`` field of the
901line table prologue, and its corresponding value appear in the ``file_names``
902field. The current encoding of the content type is documented in table
903:ref:`amdgpu-dwarf-extended-content-types-encoding`
904
905 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
906 :name: amdgpu-dwarf-extended-content-types-encoding
907
908 ============================ ====================
909 Content Type Value
910 ============================ ====================
911 ``DW_LNCT_LLVM_source`` 0x2001
912 ============================ ====================
Tony Tye46d35762017-08-15 20:47:41 +0000913
914.. _amdgpu-code-conventions:
915
916Code Conventions
917================
918
919This section provides code conventions used for each supported target triple OS
920(see :ref:`amdgpu-target-triples`).
921
922AMDHSA
923------
924
925This section provides code conventions used when the target triple OS is
926``amdhsa`` (see :ref:`amdgpu-target-triples`).
927
928.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000929
Tony Tye01bfd6c2018-03-27 21:20:46 +0000930Code Object Target Identification
931~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
932
933The AMDHSA OS uses the following syntax to specify the code object
934target as a single string:
935
936 ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
937
938Where:
939
940 - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
941 are the same as the *Target Triple* (see
942 :ref:`amdgpu-target-triples`).
943
944 - ``<Processor>`` is the same as the *Processor* (see
945 :ref:`amdgpu-processors`).
946
947 - ``<Target Features>`` is a list of the enabled *Target Features*
948 (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
949 apply to *Processor*. The list must be in the same order as listed
950 in the table :ref:`amdgpu-target-feature-table`. Note that *Target
951 Features* must be included in the list if they are enabled even if
952 that is the default for *Processor*.
953
954For example:
955
956 ``"amdgcn-amd-amdhsa--gfx902+xnack"``
957
Tony Tyef16a45e2017-06-06 20:31:59 +0000958Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000959~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000960
Tony Tye46d35762017-08-15 20:47:41 +0000961The code object metadata specifies extensible metadata associated with the code
962objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
963[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
964(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
965``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
966information necessary to support the ROCM kernel queries. For example, the
967segment sizes needed in a dispatch packet. In addition, a high level language
968runtime may require other information to be included. For example, the AMD
969OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000970
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000971The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000972:doc:`YamlIO`).
973
Tony Tye46d35762017-08-15 20:47:41 +0000974.. TODO
975 Is the string null terminated? It probably should not if YAML allows it to
976 contain null characters, otherwise it should be.
977
Tony Tyef16a45e2017-06-06 20:31:59 +0000978The metadata is represented as a single YAML document comprised of the mapping
979defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
980referenced tables.
981
982For boolean values, the string values of ``false`` and ``true`` are used for
983false and true respectively.
984
985Additional information can be added to the mappings. To avoid conflicts, any
986non-AMD key names should be prefixed by "*vendor-name*.".
987
988 .. table:: AMDHSA Code Object Metadata Mapping
989 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
990
991 ========== ============== ========= =======================================
992 String Key Value Type Required? Description
993 ========== ============== ========= =======================================
994 "Version" sequence of Required - The first integer is the major
995 2 integers version. Currently 1.
996 - The second integer is the minor
997 version. Currently 0.
998 "Printf" sequence of Each string is encoded information
999 strings about a printf function call. The
1000 encoded information is organized as
1001 fields separated by colon (':'):
1002
1003 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1004
1005 where:
1006
1007 ``ID``
1008 A 32 bit integer as a unique id for
1009 each printf function call
1010
1011 ``N``
1012 A 32 bit integer equal to the number
1013 of arguments of printf function call
1014 minus 1
1015
1016 ``S[i]`` (where i = 0, 1, ... , N-1)
1017 32 bit integers for the size in bytes
1018 of the i-th FormatString argument of
1019 the printf function call
1020
1021 FormatString
1022 The format string passed to the
1023 printf function call.
1024 "Kernels" sequence of Required Sequence of the mappings for each
1025 mapping kernel in the code object. See
1026 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
1027 for the definition of the mapping.
1028 ========== ============== ========= =======================================
1029
1030..
1031
1032 .. table:: AMDHSA Code Object Kernel Metadata Mapping
1033 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
1034
1035 ================= ============== ========= ================================
1036 String Key Value Type Required? Description
1037 ================= ============== ========= ================================
1038 "Name" string Required Source name of the kernel.
1039 "SymbolName" string Required Name of the kernel
1040 descriptor ELF symbol.
1041 "Language" string Source language of the kernel.
1042 Values include:
1043
1044 - "OpenCL C"
1045 - "OpenCL C++"
1046 - "HCC"
1047 - "OpenMP"
1048
1049 "LanguageVersion" sequence of - The first integer is the major
1050 2 integers version.
1051 - The second integer is the
1052 minor version.
1053 "Attrs" mapping Mapping of kernel attributes.
1054 See
1055 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1056 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001057 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001058 mapping kernel arguments. See
1059 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1060 for the definition of the mapping.
1061 "CodeProps" mapping Mapping of properties related to
1062 the kernel code. See
1063 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1064 for the mapping definition.
Tony Tyef16a45e2017-06-06 20:31:59 +00001065 ================= ============== ========= ================================
1066
1067..
1068
1069 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1070 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1071
1072 =================== ============== ========= ==============================
1073 String Key Value Type Required? Description
1074 =================== ============== ========= ==============================
Tony Tyee039d0e2018-01-30 23:07:10 +00001075 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
1076 3 integers must be >=1 and the dispatch
1077 work-group size X, Y, Z must
1078 correspond to the specified
1079 values. Defaults to 0, 0, 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001080
1081 Corresponds to the OpenCL
1082 ``reqd_work_group_size``
1083 attribute.
1084 "WorkGroupSizeHint" sequence of The dispatch work-group size
1085 3 integers X, Y, Z is likely to be the
1086 specified values.
1087
1088 Corresponds to the OpenCL
1089 ``work_group_size_hint``
1090 attribute.
1091 "VecTypeHint" string The name of a scalar or vector
1092 type.
1093
1094 Corresponds to the OpenCL
1095 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +00001096
1097 "RuntimeHandle" string The external symbol name
1098 associated with a kernel.
1099 OpenCL runtime allocates a
1100 global buffer for the symbol
1101 and saves the kernel's address
1102 to it, which is used for
1103 device side enqueueing. Only
1104 available for device side
1105 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +00001106 =================== ============== ========= ==============================
1107
1108..
1109
1110 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1111 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1112
1113 ================= ============== ========= ================================
1114 String Key Value Type Required? Description
1115 ================= ============== ========= ================================
1116 "Name" string Kernel argument name.
1117 "TypeName" string Kernel argument type name.
1118 "Size" integer Required Kernel argument size in bytes.
1119 "Align" integer Required Kernel argument alignment in
1120 bytes. Must be a power of two.
1121 "ValueKind" string Required Kernel argument kind that
1122 specifies how to set up the
1123 corresponding argument.
1124 Values include:
1125
1126 "ByValue"
1127 The argument is copied
1128 directly into the kernarg.
1129
1130 "GlobalBuffer"
1131 A global address space pointer
1132 to the buffer data is passed
1133 in the kernarg.
1134
1135 "DynamicSharedPointer"
1136 A group address space pointer
1137 to dynamically allocated LDS
1138 is passed in the kernarg.
1139
1140 "Sampler"
1141 A global address space
1142 pointer to a S# is passed in
1143 the kernarg.
1144
1145 "Image"
1146 A global address space
1147 pointer to a T# is passed in
1148 the kernarg.
1149
1150 "Pipe"
1151 A global address space pointer
1152 to an OpenCL pipe is passed in
1153 the kernarg.
1154
1155 "Queue"
1156 A global address space pointer
1157 to an OpenCL device enqueue
1158 queue is passed in the
1159 kernarg.
1160
1161 "HiddenGlobalOffsetX"
1162 The OpenCL grid dispatch
1163 global offset for the X
1164 dimension is passed in the
1165 kernarg.
1166
1167 "HiddenGlobalOffsetY"
1168 The OpenCL grid dispatch
1169 global offset for the Y
1170 dimension is passed in the
1171 kernarg.
1172
1173 "HiddenGlobalOffsetZ"
1174 The OpenCL grid dispatch
1175 global offset for the Z
1176 dimension is passed in the
1177 kernarg.
1178
1179 "HiddenNone"
1180 An argument that is not used
1181 by the kernel. Space needs to
1182 be left for it, but it does
1183 not need to be set up.
1184
1185 "HiddenPrintfBuffer"
1186 A global address space pointer
1187 to the runtime printf buffer
1188 is passed in kernarg.
1189
1190 "HiddenDefaultQueue"
1191 A global address space pointer
1192 to the OpenCL device enqueue
1193 queue that should be used by
1194 the kernel by default is
1195 passed in the kernarg.
1196
1197 "HiddenCompletionAction"
Yaxun Liuc928f2a2017-10-30 14:30:28 +00001198 A global address space pointer
1199 to help link enqueued kernels into
1200 the ancestor tree for determining
1201 when the parent kernel has finished.
Tony Tyef16a45e2017-06-06 20:31:59 +00001202
1203 "ValueType" string Required Kernel argument value type. Only
1204 present if "ValueKind" is
1205 "ByValue". For vector data
1206 types, the value is for the
1207 element type. Values include:
1208
1209 - "Struct"
1210 - "I8"
1211 - "U8"
1212 - "I16"
1213 - "U16"
1214 - "F16"
1215 - "I32"
1216 - "U32"
1217 - "F32"
1218 - "I64"
1219 - "U64"
1220 - "F64"
1221
1222 .. TODO
1223 How can it be determined if a
1224 vector type, and what size
1225 vector?
1226 "PointeeAlign" integer Alignment in bytes of pointee
1227 type for pointer type kernel
1228 argument. Must be a power
1229 of 2. Only present if
1230 "ValueKind" is
1231 "DynamicSharedPointer".
1232 "AddrSpaceQual" string Kernel argument address space
1233 qualifier. Only present if
1234 "ValueKind" is "GlobalBuffer" or
1235 "DynamicSharedPointer". Values
1236 are:
1237
1238 - "Private"
1239 - "Global"
1240 - "Constant"
1241 - "Local"
1242 - "Generic"
1243 - "Region"
1244
1245 .. TODO
1246 Is GlobalBuffer only Global
1247 or Constant? Is
1248 DynamicSharedPointer always
1249 Local? Can HCC allow Generic?
1250 How can Private or Region
1251 ever happen?
1252 "AccQual" string Kernel argument access
1253 qualifier. Only present if
1254 "ValueKind" is "Image" or
1255 "Pipe". Values
1256 are:
1257
1258 - "ReadOnly"
1259 - "WriteOnly"
1260 - "ReadWrite"
1261
1262 .. TODO
1263 Does this apply to
1264 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001265 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001266 performed by the kernel on the
1267 kernel argument. Only present if
1268 "ValueKind" is "GlobalBuffer",
1269 "Image", or "Pipe". This may be
1270 more restrictive than indicated
1271 by "AccQual" to reflect what the
1272 kernel actual does. If not
1273 present then the runtime must
1274 assume what is implied by
1275 "AccQual" and "IsConst". Values
1276 are:
1277
1278 - "ReadOnly"
1279 - "WriteOnly"
1280 - "ReadWrite"
1281
1282 "IsConst" boolean Indicates if the kernel argument
1283 is const qualified. Only present
1284 if "ValueKind" is
1285 "GlobalBuffer".
1286
1287 "IsRestrict" boolean Indicates if the kernel argument
1288 is restrict qualified. Only
1289 present if "ValueKind" is
1290 "GlobalBuffer".
1291
1292 "IsVolatile" boolean Indicates if the kernel argument
1293 is volatile qualified. Only
1294 present if "ValueKind" is
1295 "GlobalBuffer".
1296
1297 "IsPipe" boolean Indicates if the kernel argument
1298 is pipe qualified. Only present
1299 if "ValueKind" is "Pipe".
1300
1301 .. TODO
1302 Can GlobalBuffer be pipe
1303 qualified?
1304 ================= ============== ========= ================================
1305
1306..
1307
1308 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1309 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1310
1311 ============================ ============== ========= =====================
1312 String Key Value Type Required? Description
1313 ============================ ============== ========= =====================
1314 "KernargSegmentSize" integer Required The size in bytes of
1315 the kernarg segment
1316 that holds the values
1317 of the arguments to
1318 the kernel.
1319 "GroupSegmentFixedSize" integer Required The amount of group
1320 segment memory
1321 required by a
1322 work-group in
1323 bytes. This does not
1324 include any
1325 dynamically allocated
1326 group segment memory
1327 that may be added
1328 when the kernel is
1329 dispatched.
1330 "PrivateSegmentFixedSize" integer Required The amount of fixed
1331 private address space
1332 memory required for a
1333 work-item in
Tony Tye07d9f102017-11-10 01:00:54 +00001334 bytes. If the kernel
1335 uses a dynamic call
1336 stack then additional
Tony Tyef16a45e2017-06-06 20:31:59 +00001337 space must be added
1338 to this value for the
1339 call stack.
1340 "KernargSegmentAlign" integer Required The maximum byte
1341 alignment of
1342 arguments in the
1343 kernarg segment. Must
1344 be a power of 2.
1345 "WavefrontSize" integer Required Wavefront size. Must
1346 be a power of 2.
Tony Tye07d9f102017-11-10 01:00:54 +00001347 "NumSGPRs" integer Required Number of scalar
Tony Tyef16a45e2017-06-06 20:31:59 +00001348 registers used by a
1349 wavefront for
1350 GFX6-GFX9. This
1351 includes the special
1352 SGPRs for VCC, Flat
1353 Scratch (GFX7-GFX9)
1354 and XNACK (for
1355 GFX8-GFX9). It does
1356 not include the 16
1357 SGPR added if a trap
1358 handler is
1359 enabled. It is not
1360 rounded up to the
1361 allocation
1362 granularity.
Tony Tye07d9f102017-11-10 01:00:54 +00001363 "NumVGPRs" integer Required Number of vector
Tony Tyef16a45e2017-06-06 20:31:59 +00001364 registers used by
1365 each work-item for
1366 GFX6-GFX9
Tony Tye07d9f102017-11-10 01:00:54 +00001367 "MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001368 work-group size
1369 supported by the
1370 kernel in work-items.
Tony Tye07d9f102017-11-10 01:00:54 +00001371 Must be >=1 and
Tony Tyee039d0e2018-01-30 23:07:10 +00001372 consistent with
1373 ReqdWorkGroupSize if
1374 not 0, 0, 0.
Konstantin Zhuravlyov06ae4ec2017-11-28 17:51:08 +00001375 "NumSpilledSGPRs" integer Number of stores from
1376 a scalar register to
1377 a register allocator
1378 created spill
1379 location.
1380 "NumSpilledVGPRs" integer Number of stores from
1381 a vector register to
1382 a register allocator
1383 created spill
1384 location.
Tony Tyef16a45e2017-06-06 20:31:59 +00001385 ============================ ============== ========= =====================
1386
1387..
1388
Tony Tyef16a45e2017-06-06 20:31:59 +00001389Kernel Dispatch
1390~~~~~~~~~~~~~~~
1391
1392The HSA architected queuing language (AQL) defines a user space memory interface
1393that can be used to control the dispatch of kernels, in an agent independent
1394way. An agent can have zero or more AQL queues created for it using the ROCm
1395runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1396*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1397mechanics and packet layouts.
1398
1399The packet processor of a kernel agent is responsible for detecting and
1400dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1401packet processor is implemented by the hardware command processor (CP),
1402asynchronous dispatch controller (ADC) and shader processor input controller
1403(SPI).
1404
1405The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1406mode driver to initialize and register the AQL queue with CP.
1407
1408To dispatch a kernel the following actions are performed. This can occur in the
1409CPU host program, or from an HSA kernel executing on a GPU.
1410
14111. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1412 executed is obtained.
14132. A pointer to the kernel descriptor (see
1414 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1415 obtained. It must be for a kernel that is contained in a code object that that
1416 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1417 associated.
14183. Space is allocated for the kernel arguments using the ROCm runtime allocator
1419 for a memory region with the kernarg property for the kernel agent that will
1420 execute the kernel. It must be at least 16 byte aligned.
14214. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00001422 allocation. The layout is defined in the *HSA Programmer's Language Reference*
Tony Tyef16a45e2017-06-06 20:31:59 +00001423 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1424 memory in the same way constant memory is accessed. (Note that the HSA
1425 specification allows an implementation to copy the kernel argument contents to
1426 another location that is accessed by the kernel.)
14275. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1428 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1429 packet. The packet must be set up, and the final write must use an atomic
1430 store release to set the packet kind to ensure the packet contents are
1431 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1432 notify the kernel agent that the AQL queue has been updated. These rules, and
1433 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1434 System Architecture Specification* [HSA]_.
14356. A kernel dispatch packet includes information about the actual dispatch,
1436 such as grid and work-group size, together with information from the code
1437 object about the kernel, such as segment sizes. The ROCm runtime queries on
1438 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001439 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000014407. CP executes micro-code and is responsible for detecting and setting up the
1441 GPU to execute the wavefronts of a kernel dispatch.
14428. CP ensures that when the a wavefront starts executing the kernel machine
1443 code, the scalar general purpose registers (SGPR) and vector general purpose
1444 registers (VGPR) are set up as required by the machine code. The required
1445 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1446 register state is defined in
1447 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
14489. The prolog of the kernel machine code (see
1449 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1450 before continuing executing the machine code that corresponds to the kernel.
145110. When the kernel dispatch has completed execution, CP signals the completion
1452 signal specified in the kernel dispatch packet if not 0.
1453
1454.. _amdgpu-amdhsa-memory-spaces:
1455
1456Memory Spaces
1457~~~~~~~~~~~~~
1458
1459The memory space properties are:
1460
1461 .. table:: AMDHSA Memory Spaces
1462 :name: amdgpu-amdhsa-memory-spaces-table
1463
1464 ================= =========== ======== ======= ==================
1465 Memory Space Name HSA Segment Hardware Address NULL Value
1466 Name Name Size
1467 ================= =========== ======== ======= ==================
1468 Private private scratch 32 0x00000000
1469 Local group LDS 32 0xFFFFFFFF
1470 Global global global 64 0x0000000000000000
1471 Constant constant *same as 64 0x0000000000000000
1472 global*
1473 Generic flat flat 64 0x0000000000000000
1474 Region N/A GDS 32 *not implemented
1475 for AMDHSA*
1476 ================= =========== ======== ======= ==================
1477
1478The global and constant memory spaces both use global virtual addresses, which
1479are the same virtual address space used by the CPU. However, some virtual
1480addresses may only be accessible to the CPU, some only accessible by the GPU,
1481and some by both.
1482
1483Using the constant memory space indicates that the data will not change during
1484the execution of the kernel. This allows scalar read instructions to be
1485used. The vector and scalar L1 caches are invalidated of volatile data before
1486each kernel dispatch execution to allow constant memory to change values between
1487kernel dispatches.
1488
1489The local memory space uses the hardware Local Data Store (LDS) which is
1490automatically allocated when the hardware creates work-groups of wavefronts, and
1491freed when all the wavefronts of a work-group have terminated. The data store
1492(DS) instructions can be used to access it.
1493
1494The private memory space uses the hardware scratch memory support. If the kernel
1495uses scratch, then the hardware allocates memory that is accessed using
1496wavefront lane dword (4 byte) interleaving. The mapping used from private
1497address to physical address is:
1498
1499 ``wavefront-scratch-base +
1500 (private-address * wavefront-size * 4) +
1501 (wavefront-lane-id * 4)``
1502
1503There are different ways that the wavefront scratch base address is determined
1504by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1505memory can be accessed in an interleaved manner using buffer instruction with
Tony Tye5bbcca62018-03-08 05:46:01 +00001506the scratch buffer descriptor and per wavefront scratch offset, by the scratch
Tony Tyef16a45e2017-06-06 20:31:59 +00001507instructions, or by flat instructions. If each lane of a wavefront accesses the
1508same private address, the interleaving results in adjacent dwords being accessed
1509and hence requires fewer cache lines to be fetched. Multi-dword access is not
1510supported except by flat and scratch instructions in GFX9.
1511
1512The generic address space uses the hardware flat address support available in
1513GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1514local appertures), that are outside the range of addressible global memory, to
1515map from a flat address to a private or local address.
1516
1517FLAT instructions can take a flat address and access global, private (scratch)
1518and group (LDS) memory depending in if the address is within one of the
1519apperture ranges. Flat access to scratch requires hardware aperture setup and
1520setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1521access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1522(see :ref:`amdgpu-amdhsa-m0`).
1523
1524To convert between a segment address and a flat address the base address of the
1525appertures address can be used. For GFX7-GFX8 these are available in the
1526:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1527Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1528GFX9 the appature base addresses are directly available as inline constant
1529registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1530address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1531which makes it easier to convert from flat to segment or segment to flat.
1532
Tony Tye46d35762017-08-15 20:47:41 +00001533Image and Samplers
1534~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001535
1536Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1537hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1538HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1539enumeration values for the queries that are not trivially deducible from the S#
1540representation.
1541
1542HSA Signals
1543~~~~~~~~~~~
1544
Tony Tye46d35762017-08-15 20:47:41 +00001545HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1546structure allocated in memory accessible from both the CPU and GPU. The
1547structure is defined by the ROCm runtime and subject to change between releases
1548(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001549
1550.. _amdgpu-amdhsa-hsa-aql-queue:
1551
1552HSA AQL Queue
1553~~~~~~~~~~~~~
1554
Tony Tye46d35762017-08-15 20:47:41 +00001555The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001556between releases (see [AMD-ROCm-github]_). For some processors it contains
1557fields needed to implement certain language features such as the flat address
1558aperture bases. It also contains fields used by CP such as managing the
1559allocation of scratch memory.
1560
1561.. _amdgpu-amdhsa-kernel-descriptor:
1562
1563Kernel Descriptor
1564~~~~~~~~~~~~~~~~~
1565
1566A kernel descriptor consists of the information needed by CP to initiate the
1567execution of a kernel, including the entry point address of the machine code
1568that implements the kernel.
1569
1570Kernel Descriptor for GFX6-GFX9
1571+++++++++++++++++++++++++++++++
1572
1573CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1574
1575 .. table:: Kernel Descriptor for GFX6-GFX9
1576 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1577
Tony Tye6baa6d22017-10-18 22:16:55 +00001578 ======= ======= =============================== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001579 Bits Size Field Name Description
Tony Tye6baa6d22017-10-18 22:16:55 +00001580 ======= ======= =============================== ============================
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00001581 31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001582 address space memory
1583 required for a work-group
1584 in bytes. This does not
1585 include any dynamically
1586 allocated local address
1587 space memory that may be
1588 added when the kernel is
1589 dispatched.
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00001590 63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001591 private address space
1592 memory required for a
1593 work-item in bytes. If
1594 is_dynamic_callstack is 1
1595 then additional space must
1596 be added to this value for
1597 the call stack.
Tony Tye07d9f102017-11-10 01:00:54 +00001598 127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00001599 191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001600 negative) from base
1601 address of kernel
1602 descriptor to kernel's
1603 entry point instruction
1604 which must be 256 byte
1605 aligned.
Tony Tyee039d0e2018-01-30 23:07:10 +00001606 383:192 24 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001607 bytes
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00001608 415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001609 program settings used by
1610 CP to set up
1611 ``COMPUTE_PGM_RSRC1``
1612 configuration
1613 register. See
Tony Tye6baa6d22017-10-18 22:16:55 +00001614 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00001615 447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001616 program settings used by
1617 CP to set up
1618 ``COMPUTE_PGM_RSRC2``
1619 configuration
1620 register. See
1621 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00001622 448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
1623 _BUFFER SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001624 (see
1625 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1626
1627 The total number of SGPR
1628 user data registers
1629 requested must not exceed
1630 16 and match value in
1631 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1632 Any requests beyond 16
1633 will be ignored.
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00001634 449 1 bit ENABLE_SGPR_DISPATCH_PTR *see above*
1635 450 1 bit ENABLE_SGPR_QUEUE_PTR *see above*
1636 451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR *see above*
1637 452 1 bit ENABLE_SGPR_DISPATCH_ID *see above*
1638 453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT *see above*
1639 454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT *see above*
1640 _SIZE
1641 455 1 bit ENABLE_SGPR_GRID_WORKGROUP Not implemented in CP and
1642 _COUNT_X should always be 0.
1643 456 1 bit ENABLE_SGPR_GRID_WORKGROUP Not implemented in CP and
1644 _COUNT_Y should always be 0.
1645 457 1 bit ENABLE_SGPR_GRID_WORKGROUP Not implemented in CP and
1646 _COUNT_Z should always be 0.
Tony Tye31105cc2017-12-11 15:35:27 +00001647 463:458 6 bits Reserved, must be 0.
Tony Tye6baa6d22017-10-18 22:16:55 +00001648 511:464 6 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001649 bytes
1650 512 **Total size 64 bytes.**
Tony Tye6baa6d22017-10-18 22:16:55 +00001651 ======= ====================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001652
1653..
1654
1655 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001656 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tyef16a45e2017-06-06 20:31:59 +00001657
Tony Tye3b340612017-06-07 00:46:08 +00001658 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001659 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001660 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001661 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001662 used by each work-item,
1663 granularity is device
1664 specific:
1665
Tony Tye07d9f102017-11-10 01:00:54 +00001666 GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001667 - max_vgpr 1..256
1668 - roundup((max_vgpg + 1)
1669 / 4) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001670
1671 Used by CP to set up
1672 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001673 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001674 used by a wavefront,
1675 granularity is device
1676 specific:
1677
Tony Tye07d9f102017-11-10 01:00:54 +00001678 GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001679 - max_sgpr 1..112
1680 - roundup((max_sgpg + 1)
1681 / 8) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001682 GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001683 - max_sgpr 1..112
1684 - roundup((max_sgpg + 1)
1685 / 16) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001686
1687 Includes the special SGPRs
1688 for VCC, Flat Scratch (for
1689 GFX7 onwards) and XNACK
1690 (for GFX8 onwards). It does
1691 not include the 16 SGPR
1692 added if a trap handler is
1693 enabled.
1694
1695 Used by CP to set up
1696 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001697 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001698
1699 Start executing wavefront
1700 at the specified priority.
1701
1702 CP is responsible for
1703 filling in
1704 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001705 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001706 with specified rounding
1707 mode for single (32
1708 bit) floating point
1709 precision floating point
1710 operations.
1711
1712 Floating point rounding
1713 mode values are defined in
1714 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1715
1716 Used by CP to set up
1717 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001718 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001719 with specified rounding
1720 denorm mode for half/double (16
1721 and 64 bit) floating point
1722 precision floating point
1723 operations.
1724
1725 Floating point rounding
1726 mode values are defined in
1727 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1728
1729 Used by CP to set up
1730 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001731 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001732 with specified denorm mode
1733 for single (32
1734 bit) floating point
1735 precision floating point
1736 operations.
1737
1738 Floating point denorm mode
1739 values are defined in
1740 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1741
1742 Used by CP to set up
1743 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001744 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001745 with specified denorm mode
1746 for half/double (16
1747 and 64 bit) floating point
1748 precision floating point
1749 operations.
1750
1751 Floating point denorm mode
1752 values are defined in
1753 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1754
1755 Used by CP to set up
1756 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001757 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001758
1759 Start executing wavefront
1760 in privilege trap handler
1761 mode.
1762
1763 CP is responsible for
1764 filling in
1765 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001766 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001767 with DX10 clamp mode
1768 enabled. Used by the vector
Tony Tye6baa6d22017-10-18 22:16:55 +00001769 ALU to force DX10 style
Tony Tyef16a45e2017-06-06 20:31:59 +00001770 treatment of NaN's (when
1771 set, clamp NaN to zero,
1772 otherwise pass NaN
1773 through).
1774
1775 Used by CP to set up
1776 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001777 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001778
1779 Start executing wavefront
1780 in single step mode.
1781
1782 CP is responsible for
1783 filling in
1784 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001785 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001786 with IEEE mode
1787 enabled. Floating point
1788 opcodes that support
1789 exception flag gathering
1790 will quiet and propagate
1791 signaling-NaN inputs per
1792 IEEE 754-2008. Min_dx10 and
1793 max_dx10 become IEEE
1794 754-2008 compliant due to
1795 signaling-NaN propagation
1796 and quieting.
1797
1798 Used by CP to set up
1799 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001800 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001801
1802 Only one work-group allowed
1803 to execute on a compute
1804 unit.
1805
1806 CP is responsible for
1807 filling in
1808 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001809 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001810
1811 Flag that can be used to
1812 control debugging code.
1813
1814 CP is responsible for
1815 filling in
1816 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye07d9f102017-11-10 01:00:54 +00001817 26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001818 Reserved, must be 0.
1819 GFX9
1820 Wavefront starts execution
1821 with specified fp16 overflow
1822 mode.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001823
Tony Tye6baa6d22017-10-18 22:16:55 +00001824 - If 0, fp16 overflow generates
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001825 +/-INF values.
Tony Tye6baa6d22017-10-18 22:16:55 +00001826 - If 1, fp16 overflow that is the
1827 result of an +/-INF input value
1828 or divide by 0 produces a +/-INF,
1829 otherwise clamps computed
1830 overflow to +/-MAX_FP16 as
1831 appropriate.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001832
1833 Used by CP to set up
1834 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye6baa6d22017-10-18 22:16:55 +00001835 31:27 5 bits Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001836 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001837 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001838
1839..
1840
1841 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1842 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1843
Tony Tye3b340612017-06-07 00:46:08 +00001844 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001845 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001846 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001847 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye5bbcca62018-03-08 05:46:01 +00001848 _WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001849 system register (see
1850 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1851
1852 Used by CP to set up
1853 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001854 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001855 user data registers
1856 requested. This number must
1857 match the number of user
1858 data registers enabled.
1859
1860 Used by CP to set up
1861 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov2ca6b1f2018-05-29 19:09:13 +00001862 6 1 bit ENABLE_TRAP_HANDLER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001863
Konstantin Zhuravlyov2ca6b1f2018-05-29 19:09:13 +00001864 This bit represents
1865 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
1866 which is set by the CP if
1867 the runtime has installed a
1868 trap handler.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001869 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001870 system SGPR register for
1871 the work-group id in the X
1872 dimension (see
1873 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1874
1875 Used by CP to set up
1876 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001877 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001878 system SGPR register for
1879 the work-group id in the Y
1880 dimension (see
1881 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1882
1883 Used by CP to set up
1884 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001885 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001886 system SGPR register for
1887 the work-group id in the Z
1888 dimension (see
1889 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1890
1891 Used by CP to set up
1892 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001893 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001894 system SGPR register for
1895 work-group information (see
1896 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1897
1898 Used by CP to set up
1899 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001900 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001901 VGPR system registers used
1902 for the work-item ID.
1903 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1904 defines the values.
1905
1906 Used by CP to set up
1907 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001908 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001909
1910 Wavefront starts execution
1911 with address watch
1912 exceptions enabled which
1913 are generated when L1 has
1914 witnessed a thread access
1915 an *address of
1916 interest*.
1917
1918 CP is responsible for
1919 filling in the address
1920 watch bit in
1921 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1922 according to what the
1923 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001924 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001925
1926 Wavefront starts execution
1927 with memory violation
1928 exceptions exceptions
1929 enabled which are generated
1930 when a memory violation has
Tony Tye5bbcca62018-03-08 05:46:01 +00001931 occurred for this wavefront from
Tony Tyef16a45e2017-06-06 20:31:59 +00001932 L1 or LDS
1933 (write-to-read-only-memory,
1934 mis-aligned atomic, LDS
1935 address out of range,
1936 illegal address, etc.).
1937
1938 CP sets the memory
1939 violation bit in
1940 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1941 according to what the
1942 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001943 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001944
1945 CP uses the rounded value
1946 from the dispatch packet,
1947 not this value, as the
1948 dispatch may contain
1949 dynamically allocated group
1950 segment memory. CP writes
1951 directly to
1952 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1953
1954 Amount of group segment
1955 (LDS) to allocate for each
1956 work-group. Granularity is
1957 device specific:
1958
1959 GFX6:
1960 roundup(lds-size / (64 * 4))
1961 GFX7-GFX9:
1962 roundup(lds-size / (128 * 4))
1963
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001964 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1965 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001966 enabled.
1967
1968 Used by CP to set up
1969 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1970 (set from bits 0..6).
1971
1972 IEEE 754 FP Invalid
1973 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001974 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1975 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001976 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001977 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1978 _DIVISION_BY_ZERO Zero
1979 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1980 _OVERFLOW
1981 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1982 _UNDERFLOW
1983 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1984 _INEXACT
1985 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1986 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001987 only)
Tony Tye6baa6d22017-10-18 22:16:55 +00001988 31 1 bit Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001989 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001990 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001991
1992..
1993
1994 .. table:: Floating Point Rounding Mode Enumeration Values
1995 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1996
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001997 ====================================== ===== ==============================
1998 Enumeration Name Value Description
1999 ====================================== ===== ==============================
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00002000 FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
2001 FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
2002 FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
2003 FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002004 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002005
2006..
2007
2008 .. table:: Floating Point Denorm Mode Enumeration Values
2009 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
2010
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002011 ====================================== ===== ==============================
2012 Enumeration Name Value Description
2013 ====================================== ===== ==============================
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00002014 FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002015 Denorms
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00002016 FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
2017 FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
2018 FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002019 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002020
2021..
2022
2023 .. table:: System VGPR Work-Item ID Enumeration Values
2024 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
2025
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002026 ======================================== ===== ============================
2027 Enumeration Name Value Description
2028 ======================================== ===== ============================
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00002029 SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002030 ID.
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00002031 SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002032 dimensions ID.
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00002033 SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002034 dimensions ID.
Konstantin Zhuravlyov00f2cb12018-06-12 18:02:46 +00002035 SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002036 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002037
2038.. _amdgpu-amdhsa-initial-kernel-execution-state:
2039
2040Initial Kernel Execution State
2041~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2042
2043This section defines the register state that will be set up by the packet
2044processor prior to the start of execution of every wavefront. This is limited by
2045the constraints of the hardware controllers of CP/ADC/SPI.
2046
2047The order of the SGPR registers is defined, but the compiler can specify which
2048ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2049fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2050for enabled registers are dense starting at SGPR0: the first enabled register is
2051SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2052an SGPR number.
2053
2054The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
Tony Tye5bbcca62018-03-08 05:46:01 +00002055all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tyef16a45e2017-06-06 20:31:59 +00002056the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2057initialized. These are then immediately followed by the System SGPRs that are
Tony Tye5bbcca62018-03-08 05:46:01 +00002058set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tyef16a45e2017-06-06 20:31:59 +00002059dispatch.
2060
2061SGPR register initial state is defined in
2062:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2063
2064 .. table:: SGPR Register Set Up Order
2065 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2066
2067 ========== ========================== ====== ==============================
2068 SGPR Order Name Number Description
2069 (kernel descriptor enable of
2070 field) SGPRs
2071 ========== ========================== ====== ==============================
2072 First Private Segment Buffer 4 V# that can be used, together
Tony Tye5bbcca62018-03-08 05:46:01 +00002073 (enable_sgpr_private with Scratch Wavefront Offset
2074 _segment_buffer) as an offset, to access the
2075 private memory space using a
2076 segment address.
Tony Tyef16a45e2017-06-06 20:31:59 +00002077
2078 CP uses the value provided by
2079 the runtime.
2080 then Dispatch Ptr 2 64 bit address of AQL dispatch
2081 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2082 actually executing.
2083 then Queue Ptr 2 64 bit address of amd_queue_t
2084 (enable_sgpr_queue_ptr) object for AQL queue on which
2085 the dispatch packet was
2086 queued.
2087 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2088 (enable_sgpr_kernarg segment. This is directly
2089 _segment_ptr) copied from the
2090 kernarg_address in the kernel
2091 dispatch packet.
2092
2093 Having CP load it once avoids
2094 loading it at the beginning of
2095 every wavefront.
2096 then Dispatch Id 2 64 bit Dispatch ID of the
2097 (enable_sgpr_dispatch_id) dispatch packet being
2098 executed.
2099 then Flat Scratch Init 2 This is 2 SGPRs:
2100 (enable_sgpr_flat_scratch
2101 _init) GFX6
2102 Not supported.
2103 GFX7-GFX8
2104 The first SGPR is a 32 bit
2105 byte offset from
2106 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2107 to per SPI base of memory
2108 for scratch for the queue
2109 executing the kernel
2110 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00002111 from the runtime. (The
2112 Scratch Segment Buffer base
2113 address is
2114 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2115 plus this offset.) The value
Tony Tye5bbcca62018-03-08 05:46:01 +00002116 of Scratch Wavefront Offset must
Tony Tye46d35762017-08-15 20:47:41 +00002117 be added to this offset by
2118 the kernel machine code,
2119 right shifted by 8, and
2120 moved to the FLAT_SCRATCH_HI
2121 SGPR register.
2122 FLAT_SCRATCH_HI corresponds
2123 to SGPRn-4 on GFX7, and
2124 SGPRn-6 on GFX8 (where SGPRn
2125 is the highest numbered SGPR
Tony Tye5bbcca62018-03-08 05:46:01 +00002126 allocated to the wavefront).
Tony Tye46d35762017-08-15 20:47:41 +00002127 FLAT_SCRATCH_HI is
2128 multiplied by 256 (as it is
2129 in units of 256 bytes) and
2130 added to
2131 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye5bbcca62018-03-08 05:46:01 +00002132 to calculate the per wavefront
Tony Tye46d35762017-08-15 20:47:41 +00002133 FLAT SCRATCH BASE in flat
2134 memory instructions that
2135 access the scratch
2136 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002137
2138 The second SGPR is 32 bit
2139 byte size of a single
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002140 work-item's scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002141 usage. CP obtains this from
2142 the runtime, and it is
2143 always a multiple of DWORD.
2144 CP checks that the value in
2145 the kernel dispatch packet
2146 Private Segment Byte Size is
2147 not larger, and requests the
2148 runtime to increase the
2149 queue's scratch size if
2150 necessary. The kernel code
2151 must move it to
2152 FLAT_SCRATCH_LO which is
2153 SGPRn-3 on GFX7 and SGPRn-5
2154 on GFX8. FLAT_SCRATCH_LO is
2155 used as the FLAT SCRATCH
2156 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002157 instructions. Having CP load
2158 it once avoids loading it at
2159 the beginning of every
Tony Tyef59d0712017-11-10 20:51:43 +00002160 wavefront.
2161 GFX9
2162 This is the
Tony Tye46d35762017-08-15 20:47:41 +00002163 64 bit base address of the
2164 per SPI scratch backing
2165 memory managed by SPI for
2166 the queue executing the
2167 kernel dispatch. CP obtains
2168 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002169 divides it if there are
2170 multiple Shader Arrays each
2171 with its own SPI). The value
Tony Tye5bbcca62018-03-08 05:46:01 +00002172 of Scratch Wavefront Offset must
Tony Tyef16a45e2017-06-06 20:31:59 +00002173 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002174 machine code and the result
2175 moved to the FLAT_SCRATCH
2176 SGPR which is SGPRn-6 and
2177 SGPRn-5. It is used as the
2178 FLAT SCRATCH BASE in flat
Tony Tyef59d0712017-11-10 20:51:43 +00002179 memory instructions.
2180 then Private Segment Size 1 The 32 bit byte size of a
2181 (enable_sgpr_private single
2182 work-item's
2183 scratch_segment_size) memory
2184 allocation. This is the
2185 value from the kernel
2186 dispatch packet Private
2187 Segment Byte Size rounded up
2188 by CP to a multiple of
2189 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002190
2191 Having CP load it once avoids
2192 loading it at the beginning of
2193 every wavefront.
2194
2195 This is not used for
2196 GFX7-GFX8 since it is the same
2197 value as the second SGPR of
2198 Flat Scratch Init. However, it
2199 may be needed for GFX9 which
2200 changes the meaning of the
2201 Flat Scratch Init value.
2202 then Grid Work-Group Count X 1 32 bit count of the number of
2203 (enable_sgpr_grid work-groups in the X dimension
2204 _workgroup_count_X) for the grid being
2205 executed. Computed from the
2206 fields in the kernel dispatch
2207 packet as ((grid_size.x +
2208 workgroup_size.x - 1) /
2209 workgroup_size.x).
2210 then Grid Work-Group Count Y 1 32 bit count of the number of
2211 (enable_sgpr_grid work-groups in the Y dimension
2212 _workgroup_count_Y && for the grid being
2213 less than 16 previous executed. Computed from the
2214 SGPRs) fields in the kernel dispatch
2215 packet as ((grid_size.y +
2216 workgroup_size.y - 1) /
2217 workgroupSize.y).
2218
2219 Only initialized if <16
2220 previous SGPRs initialized.
2221 then Grid Work-Group Count Z 1 32 bit count of the number of
2222 (enable_sgpr_grid work-groups in the Z dimension
2223 _workgroup_count_Z && for the grid being
2224 less than 16 previous executed. Computed from the
2225 SGPRs) fields in the kernel dispatch
2226 packet as ((grid_size.z +
2227 workgroup_size.z - 1) /
2228 workgroupSize.z).
2229
2230 Only initialized if <16
2231 previous SGPRs initialized.
2232 then Work-Group Id X 1 32 bit work-group id in X
2233 (enable_sgpr_workgroup_id dimension of grid for
2234 _X) wavefront.
2235 then Work-Group Id Y 1 32 bit work-group id in Y
2236 (enable_sgpr_workgroup_id dimension of grid for
2237 _Y) wavefront.
2238 then Work-Group Id Z 1 32 bit work-group id in Z
2239 (enable_sgpr_workgroup_id dimension of grid for
2240 _Z) wavefront.
Tony Tye5bbcca62018-03-08 05:46:01 +00002241 then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tyef16a45e2017-06-06 20:31:59 +00002242 (enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye5bbcca62018-03-08 05:46:01 +00002243 _info) threadgroup_size_in_wavefronts[5:0]}
2244 then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tyef16a45e2017-06-06 20:31:59 +00002245 (enable_sgpr_private of scratch base of queue
Tony Tye5bbcca62018-03-08 05:46:01 +00002246 _segment_wavefront_offset) executing the kernel
Tony Tyef16a45e2017-06-06 20:31:59 +00002247 dispatch. Must be used as an
2248 offset with Private
2249 segment address when using
2250 Scratch Segment Buffer. It
2251 must be used to set up FLAT
2252 SCRATCH for flat addressing
2253 (see
2254 :ref:`amdgpu-amdhsa-flat-scratch`).
2255 ========== ========================== ====== ==============================
2256
2257The order of the VGPR registers is defined, but the compiler can specify which
2258ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2259fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2260for enabled registers are dense starting at VGPR0: the first enabled register is
2261VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2262VGPR number.
2263
2264VGPR register initial state is defined in
2265:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2266
2267 .. table:: VGPR Register Set Up Order
2268 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2269
2270 ========== ========================== ====== ==============================
2271 VGPR Order Name Number Description
2272 (kernel descriptor enable of
2273 field) VGPRs
2274 ========== ========================== ====== ==============================
2275 First Work-Item Id X 1 32 bit work item id in X
2276 (Always initialized) dimension of work-group for
2277 wavefront lane.
2278 then Work-Item Id Y 1 32 bit work item id in Y
2279 (enable_vgpr_workitem_id dimension of work-group for
2280 > 0) wavefront lane.
2281 then Work-Item Id Z 1 32 bit work item id in Z
2282 (enable_vgpr_workitem_id dimension of work-group for
2283 > 1) wavefront lane.
2284 ========== ========================== ====== ==============================
2285
Hiroshi Inouebcadfee2018-04-12 05:53:20 +00002286The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
Tony Tyef16a45e2017-06-06 20:31:59 +00002287
22881. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2289 registers.
22902. Work-group Id registers X, Y, Z are set by ADC which supports any
2291 combination including none.
Tony Tye5bbcca62018-03-08 05:46:01 +000022923. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
2293 its value cannot included with the flat scratch init value which is per queue.
Tony Tyef16a45e2017-06-06 20:31:59 +000022944. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2295 or (X, Y, Z).
2296
2297Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2298value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2299
2300The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye07d9f102017-11-10 01:00:54 +00002301has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tyef16a45e2017-06-06 20:31:59 +00002302instructions (GFX9).
2303
2304If buffer operations are used then the compiler can generate a V# with the
2305following properties:
2306
2307* base address of 0
2308* no swizzle
2309* ATC: 1 if IOMMU present (such as APU)
2310* ptr64: 1
2311* MTYPE set to support memory coherence that matches the runtime (such as CC for
2312 APU and NC for dGPU).
2313
2314.. _amdgpu-amdhsa-kernel-prolog:
2315
2316Kernel Prolog
2317~~~~~~~~~~~~~
2318
2319.. _amdgpu-amdhsa-m0:
2320
2321M0
2322++
2323
2324GFX6-GFX8
2325 The M0 register must be initialized with a value at least the total LDS size
2326 if the kernel may access LDS via DS or flat operations. Total LDS size is
2327 available in dispatch packet. For M0, it is also possible to use maximum
2328 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2329 GFX7-GFX8).
2330GFX9
2331 The M0 register is not used for range checking LDS accesses and so does not
2332 need to be initialized in the prolog.
2333
2334.. _amdgpu-amdhsa-flat-scratch:
2335
2336Flat Scratch
2337++++++++++++
2338
2339If the kernel may use flat operations to access scratch memory, the prolog code
2340must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye5bbcca62018-03-08 05:46:01 +00002341are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tyef16a45e2017-06-06 20:31:59 +00002342Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2343
2344GFX6
2345 Flat scratch is not supported.
2346
Tony Tye07d9f102017-11-10 01:00:54 +00002347GFX7-GFX8
Tony Tyef16a45e2017-06-06 20:31:59 +00002348 1. The low word of Flat Scratch Init is 32 bit byte offset from
2349 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2350 being managed by SPI for the queue executing the kernel dispatch. This is
2351 the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye5bbcca62018-03-08 05:46:01 +00002352 prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tyef16a45e2017-06-06 20:31:59 +00002353 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2354 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2355 by 8 before moving into FLAT_SCRATCH_LO.
2356 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2357 work-items scratch memory usage. This is directly loaded from the kernel
2358 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2359 DWORD. Having CP load it once avoids loading it at the beginning of every
2360 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2361 SIZE.
Tony Tyef59d0712017-11-10 20:51:43 +00002362
Tony Tyef16a45e2017-06-06 20:31:59 +00002363GFX9
2364 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2365 memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye5bbcca62018-03-08 05:46:01 +00002366 prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
Tony Tyef16a45e2017-06-06 20:31:59 +00002367 pair for use as the flat scratch base in flat memory instructions.
2368
2369.. _amdgpu-amdhsa-memory-model:
2370
2371Memory Model
2372~~~~~~~~~~~~
2373
2374This section describes the mapping of LLVM memory model onto AMDGPU machine code
2375(see :ref:`memmodel`). *The implementation is WIP.*
2376
2377.. TODO
2378 Update when implementation complete.
2379
Tony Tyef16a45e2017-06-06 20:31:59 +00002380The AMDGPU backend supports the memory synchronization scopes specified in
2381:ref:`amdgpu-memory-scopes`.
2382
2383The code sequences used to implement the memory model are defined in table
2384:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2385
2386The sequences specify the order of instructions that a single thread must
2387execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2388to other memory instructions executed by the same thread. This allows them to be
2389moved earlier or later which can allow them to be combined with other instances
2390of the same instruction, or hoisted/sunk out of loops to improve
2391performance. Only the instructions related to the memory model are given;
2392additional ``s_waitcnt`` instructions are required to ensure registers are
2393defined before being used. These may be able to be combined with the memory
2394model ``s_waitcnt`` instructions as described above.
2395
Tony Tye6baa6d22017-10-18 22:16:55 +00002396The AMDGPU backend supports the following memory models:
2397
2398 HSA Memory Model [HSA]_
2399 The HSA memory model uses a single happens-before relation for all address
2400 spaces (see :ref:`amdgpu-address-spaces`).
2401 OpenCL Memory Model [OpenCL]_
2402 The OpenCL memory model which has separate happens-before relations for the
2403 global and local address spaces. Only a fence specifying both global and
2404 local address space, and seq_cst instructions join the relationships. Since
2405 the LLVM ``memfence`` instruction does not allow an address space to be
2406 specified the OpenCL fence has to convervatively assume both local and
2407 global address space was specified. However, optimizations can often be
2408 done to eliminate the additional ``s_waitcnt`` instructions when there are
2409 no intervening memory instructions which access the corresponding address
2410 space. The code sequences in the table indicate what can be omitted for the
2411 OpenCL memory. The target triple environment is used to determine if the
2412 source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tyef16a45e2017-06-06 20:31:59 +00002413
2414``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2415operations.
2416
2417``buffer/global/flat_load/store/atomic`` instructions to global memory are
2418termed vector memory operations.
2419
2420For GFX6-GFX9:
2421
2422* Each agent has multiple compute units (CU).
2423* Each CU has multiple SIMDs that execute wavefronts.
2424* The wavefronts for a single work-group are executed in the same CU but may be
2425 executed by different SIMDs.
2426* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2427 executing on it.
2428* All LDS operations of a CU are performed as wavefront wide operations in a
2429 global order and involve no caching. Completion is reported to a wavefront in
2430 execution order.
2431* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye5bbcca62018-03-08 05:46:01 +00002432 CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tyef16a45e2017-06-06 20:31:59 +00002433 can be reordered relative to each other, which can result in reordering the
2434 visibility of vector memory operations with respect to LDS operations of other
2435 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002436 ensure synchronization between LDS operations and vector memory operations
Tony Tye5bbcca62018-03-08 05:46:01 +00002437 between wavefronts of a work-group, but not between operations performed by the
Tony Tyef16a45e2017-06-06 20:31:59 +00002438 same wavefront.
2439* The vector memory operations are performed as wavefront wide operations and
2440 completion is reported to a wavefront in execution order. The exception is
Tony Tye07d9f102017-11-10 01:00:54 +00002441 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tyef16a45e2017-06-06 20:31:59 +00002442 vector memory order if they access LDS memory, and out of LDS operation order
2443 if they access global memory.
Tony Tye6baa6d22017-10-18 22:16:55 +00002444* The vector memory operations access a single vector L1 cache shared by all
2445 SIMDs a CU. Therefore, no special action is required for coherence between the
2446 lanes of a single wavefront, or for coherence between wavefronts in the same
Tony Tye5bbcca62018-03-08 05:46:01 +00002447 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
Tony Tye6baa6d22017-10-18 22:16:55 +00002448 executing in different work-groups as they may be executing on different CUs.
Tony Tyef16a45e2017-06-06 20:31:59 +00002449* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2450 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2451 scalar operations are used in a restricted way so do not impact the memory
2452 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2453* The vector and scalar memory operations use an L2 cache shared by all CUs on
2454 the same agent.
2455* The L2 cache has independent channels to service disjoint ranges of virtual
2456 addresses.
2457* Each CU has a separate request queue per channel. Therefore, the vector and
Tony Tye5bbcca62018-03-08 05:46:01 +00002458 scalar memory operations performed by wavefronts executing in different work-groups
Tony Tyef16a45e2017-06-06 20:31:59 +00002459 (which may be executing on different CUs) of an agent can be reordered
2460 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002461 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002462 previous vector memory operation has completed before executing a subsequent
2463 vector memory or LDS operation and so can be used to meet the requirements of
2464 acquire and release.
2465* The L2 cache can be kept coherent with other agents on some targets, or ranges
2466 of virtual addresses can be set up to bypass it to ensure system coherence.
2467
Tony Tye07d9f102017-11-10 01:00:54 +00002468Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tyef16a45e2017-06-06 20:31:59 +00002469or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2470memory, atomic memory orderings are not meaningful and all accesses are treated
2471as non-atomic.
2472
2473Constant address space uses ``buffer/global_load`` instructions (or equivalent
2474scalar memory instructions). Since the constant address space contents do not
2475change during the execution of a kernel dispatch it is not legal to perform
2476stores, and atomic memory orderings are not meaningful and all access are
2477treated as non-atomic.
2478
2479A memory synchronization scope wider than work-group is not meaningful for the
2480group (LDS) address space and is treated as work-group.
2481
2482The memory model does not support the region address space which is treated as
2483non-atomic.
2484
2485Acquire memory ordering is not meaningful on store atomic instructions and is
2486treated as non-atomic.
2487
2488Release memory ordering is not meaningful on load atomic instructions and is
2489treated a non-atomic.
2490
2491Acquire-release memory ordering is not meaningful on load or store atomic
2492instructions and is treated as acquire and release respectively.
2493
2494AMDGPU backend only uses scalar memory operations to access memory that is
2495proven to not change during the execution of the kernel dispatch. This includes
2496constant address space and global address space for program scope const
2497variables. Therefore the kernel machine code does not have to maintain the
2498scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2499and vector L1 caches are invalidated between kernel dispatches by CP since
2500constant address space data may change between kernel dispatch executions. See
2501:ref:`amdgpu-amdhsa-memory-spaces`.
2502
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002503The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002504case the AMDGPU backend ensures the memory location used to spill is never
2505accessed by vector memory operations at the same time. If scalar writes are used
2506then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2507return since the locations may be used for vector memory instructions by a
Tony Tye5bbcca62018-03-08 05:46:01 +00002508future wavefront that uses the same scratch area, or a function call that creates a
Tony Tyef16a45e2017-06-06 20:31:59 +00002509frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2510as all scalar writes are write-before-read in the same thread.
2511
Tony Tye6baa6d22017-10-18 22:16:55 +00002512Scratch backing memory (which is used for the private address space)
2513is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2514address space is only accessed by a single thread, and is always
2515write-before-read, there is never a need to invalidate these entries from the L1
2516cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2517volatile cache lines.
Tony Tyef16a45e2017-06-06 20:31:59 +00002518
2519On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye6baa6d22017-10-18 22:16:55 +00002520to invalidate the L2 cache. This also causes it to be treated as
2521non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2522(cache coherent) and so the L2 cache will coherent with the CPU and other
2523agents.
Tony Tyef16a45e2017-06-06 20:31:59 +00002524
2525 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2526 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2527
Tony Tye6baa6d22017-10-18 22:16:55 +00002528 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002529 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2530 Ordering Sync Scope Address
2531 Space
Tony Tye6baa6d22017-10-18 22:16:55 +00002532 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002533 **Non-Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002534 -----------------------------------------------------------------------------------
2535 load *none* *none* - global - !volatile & !nontemporal
2536 - generic
2537 - private 1. buffer/global/flat_load
2538 - constant
2539 - volatile & !nontemporal
2540
Tony Tyef16a45e2017-06-06 20:31:59 +00002541 1. buffer/global/flat_load
2542 glc=1
Tony Tye6baa6d22017-10-18 22:16:55 +00002543
2544 - nontemporal
2545
2546 1. buffer/global/flat_load
2547 glc=1 slc=1
2548
Tony Tyef16a45e2017-06-06 20:31:59 +00002549 load *none* *none* - local 1. ds_load
Tony Tye6baa6d22017-10-18 22:16:55 +00002550 store *none* *none* - global - !nontemporal
Tony Tyef16a45e2017-06-06 20:31:59 +00002551 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002552 - private 1. buffer/global/flat_store
2553 - constant
2554 - nontemporal
2555
2556 1. buffer/global/flat_stote
2557 glc=1 slc=1
2558
Tony Tyef16a45e2017-06-06 20:31:59 +00002559 store *none* *none* - local 1. ds_store
2560 **Unordered Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002561 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002562 load atomic unordered *any* *any* *Same as non-atomic*.
2563 store atomic unordered *any* *any* *Same as non-atomic*.
2564 atomicrmw unordered *any* *any* *Same as monotonic
2565 atomic*.
2566 **Monotonic Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002567 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002568 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2569 - wavefront - generic
2570 - workgroup
2571 load atomic monotonic - singlethread - local 1. ds_load
2572 - wavefront
2573 - workgroup
2574 load atomic monotonic - agent - global 1. buffer/global/flat_load
2575 - system - generic glc=1
2576 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2577 - wavefront - generic
2578 - workgroup
2579 - agent
2580 - system
2581 store atomic monotonic - singlethread - local 1. ds_store
2582 - wavefront
2583 - workgroup
2584 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2585 - wavefront - generic
2586 - workgroup
2587 - agent
2588 - system
2589 atomicrmw monotonic - singlethread - local 1. ds_atomic
2590 - wavefront
2591 - workgroup
2592 **Acquire Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002593 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002594 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2595 - wavefront - local
2596 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002597 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2598 load atomic acquire - workgroup - local 1. ds_load
2599 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002600
Tony Tye6baa6d22017-10-18 22:16:55 +00002601 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002602 - Must happen before
2603 any following
2604 global/generic
2605 load/load
2606 atomic/store/store
2607 atomic/atomicrmw.
2608 - Ensures any
2609 following global
2610 data read is no
2611 older than the load
2612 atomic value being
2613 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00002614 load atomic acquire - workgroup - generic 1. flat_load
2615 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002616
Tony Tye6baa6d22017-10-18 22:16:55 +00002617 - If OpenCL, omit.
2618 - Must happen before
2619 any following
2620 global/generic
2621 load/load
2622 atomic/store/store
2623 atomic/atomicrmw.
2624 - Ensures any
2625 following global
2626 data read is no
2627 older than the load
2628 atomic value being
2629 acquired.
2630 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tyef16a45e2017-06-06 20:31:59 +00002631 - system glc=1
2632 2. s_waitcnt vmcnt(0)
2633
2634 - Must happen before
2635 following
2636 buffer_wbinvl1_vol.
2637 - Ensures the load
2638 has completed
2639 before invalidating
2640 the cache.
2641
2642 3. buffer_wbinvl1_vol
2643
2644 - Must happen before
2645 any following
2646 global/generic
2647 load/load
2648 atomic/atomicrmw.
2649 - Ensures that
2650 following
2651 loads will not see
2652 stale global data.
2653
2654 load atomic acquire - agent - generic 1. flat_load glc=1
2655 - system 2. s_waitcnt vmcnt(0) &
2656 lgkmcnt(0)
2657
2658 - If OpenCL omit
2659 lgkmcnt(0).
2660 - Must happen before
2661 following
2662 buffer_wbinvl1_vol.
2663 - Ensures the flat_load
2664 has completed
2665 before invalidating
2666 the cache.
2667
2668 3. buffer_wbinvl1_vol
2669
2670 - Must happen before
2671 any following
2672 global/generic
2673 load/load
2674 atomic/atomicrmw.
2675 - Ensures that
2676 following loads
2677 will not see stale
2678 global data.
2679
2680 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2681 - wavefront - local
2682 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002683 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2684 atomicrmw acquire - workgroup - local 1. ds_atomic
2685 2. waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002686
Tony Tye6baa6d22017-10-18 22:16:55 +00002687 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002688 - Must happen before
2689 any following
2690 global/generic
2691 load/load
2692 atomic/store/store
2693 atomic/atomicrmw.
2694 - Ensures any
2695 following global
2696 data read is no
2697 older than the
2698 atomicrmw value
2699 being acquired.
2700
Tony Tye6baa6d22017-10-18 22:16:55 +00002701 atomicrmw acquire - workgroup - generic 1. flat_atomic
2702 2. waitcnt lgkmcnt(0)
2703
2704 - If OpenCL, omit.
2705 - Must happen before
2706 any following
2707 global/generic
2708 load/load
2709 atomic/store/store
2710 atomic/atomicrmw.
2711 - Ensures any
2712 following global
2713 data read is no
2714 older than the
2715 atomicrmw value
2716 being acquired.
2717
2718 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002719 - system 2. s_waitcnt vmcnt(0)
2720
2721 - Must happen before
2722 following
2723 buffer_wbinvl1_vol.
2724 - Ensures the
2725 atomicrmw has
2726 completed before
2727 invalidating the
2728 cache.
2729
2730 3. buffer_wbinvl1_vol
2731
2732 - Must happen before
2733 any following
2734 global/generic
2735 load/load
2736 atomic/atomicrmw.
2737 - Ensures that
2738 following loads
2739 will not see stale
2740 global data.
2741
2742 atomicrmw acquire - agent - generic 1. flat_atomic
2743 - system 2. s_waitcnt vmcnt(0) &
2744 lgkmcnt(0)
2745
2746 - If OpenCL, omit
2747 lgkmcnt(0).
2748 - Must happen before
2749 following
2750 buffer_wbinvl1_vol.
2751 - Ensures the
2752 atomicrmw has
2753 completed before
2754 invalidating the
2755 cache.
2756
2757 3. buffer_wbinvl1_vol
2758
2759 - Must happen before
2760 any following
2761 global/generic
2762 load/load
2763 atomic/atomicrmw.
2764 - Ensures that
2765 following loads
2766 will not see stale
2767 global data.
2768
2769 fence acquire - singlethread *none* *none*
2770 - wavefront
2771 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2772
2773 - If OpenCL and
2774 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002775 not generic, omit.
2776 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002777 currently has no
2778 address space on
2779 the fence need to
2780 conservatively
2781 always generate. If
2782 fence had an
2783 address space then
2784 set to address
2785 space of OpenCL
2786 fence flag, or to
2787 generic if both
2788 local and global
2789 flags are
2790 specified.
2791 - Must happen after
2792 any preceding
2793 local/generic load
2794 atomic/atomicrmw
2795 with an equal or
2796 wider sync scope
2797 and memory ordering
2798 stronger than
2799 unordered (this is
2800 termed the
2801 fence-paired-atomic).
2802 - Must happen before
2803 any following
2804 global/generic
2805 load/load
2806 atomic/store/store
2807 atomic/atomicrmw.
2808 - Ensures any
2809 following global
2810 data read is no
2811 older than the
2812 value read by the
2813 fence-paired-atomic.
2814
Tony Tye6baa6d22017-10-18 22:16:55 +00002815 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2816 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002817
2818 - If OpenCL and
2819 address space is
2820 not generic, omit
2821 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002822 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002823 currently has no
2824 address space on
2825 the fence need to
2826 conservatively
2827 always generate
2828 (see comment for
2829 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002830 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002831 separate s_waitcnt
2832 vmcnt(0) and
2833 s_waitcnt
2834 lgkmcnt(0) to allow
2835 them to be
2836 independently moved
2837 according to the
2838 following rules.
2839 - s_waitcnt vmcnt(0)
2840 must happen after
2841 any preceding
2842 global/generic load
2843 atomic/atomicrmw
2844 with an equal or
2845 wider sync scope
2846 and memory ordering
2847 stronger than
2848 unordered (this is
2849 termed the
2850 fence-paired-atomic).
2851 - s_waitcnt lgkmcnt(0)
2852 must happen after
2853 any preceding
Tony Tye6baa6d22017-10-18 22:16:55 +00002854 local/generic load
Tony Tyef16a45e2017-06-06 20:31:59 +00002855 atomic/atomicrmw
2856 with an equal or
2857 wider sync scope
2858 and memory ordering
2859 stronger than
2860 unordered (this is
2861 termed the
2862 fence-paired-atomic).
2863 - Must happen before
2864 the following
2865 buffer_wbinvl1_vol.
2866 - Ensures that the
2867 fence-paired atomic
2868 has completed
2869 before invalidating
2870 the
2871 cache. Therefore
2872 any following
2873 locations read must
2874 be no older than
2875 the value read by
2876 the
2877 fence-paired-atomic.
2878
2879 2. buffer_wbinvl1_vol
2880
Tony Tye6baa6d22017-10-18 22:16:55 +00002881 - Must happen before any
2882 following global/generic
Tony Tyef16a45e2017-06-06 20:31:59 +00002883 load/load
2884 atomic/store/store
2885 atomic/atomicrmw.
2886 - Ensures that
2887 following loads
2888 will not see stale
2889 global data.
2890
2891 **Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002892 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002893 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2894 - wavefront - local
2895 - generic
2896 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002897
2898 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002899 - Must happen after
2900 any preceding
2901 local/generic
2902 load/store/load
2903 atomic/store
2904 atomic/atomicrmw.
2905 - Must happen before
2906 the following
2907 store.
2908 - Ensures that all
2909 memory operations
2910 to local have
2911 completed before
2912 performing the
2913 store that is being
2914 released.
2915
2916 2. buffer/global/flat_store
2917 store atomic release - workgroup - local 1. ds_store
Tony Tye6baa6d22017-10-18 22:16:55 +00002918 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2919
2920 - If OpenCL, omit.
2921 - Must happen after
2922 any preceding
2923 local/generic
2924 load/store/load
2925 atomic/store
2926 atomic/atomicrmw.
2927 - Must happen before
2928 the following
2929 store.
2930 - Ensures that all
2931 memory operations
2932 to local have
2933 completed before
2934 performing the
2935 store that is being
2936 released.
2937
2938 2. flat_store
2939 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2940 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002941
2942 - If OpenCL, omit
2943 lgkmcnt(0).
2944 - Could be split into
2945 separate s_waitcnt
2946 vmcnt(0) and
2947 s_waitcnt
2948 lgkmcnt(0) to allow
2949 them to be
2950 independently moved
2951 according to the
2952 following rules.
2953 - s_waitcnt vmcnt(0)
2954 must happen after
2955 any preceding
2956 global/generic
2957 load/store/load
2958 atomic/store
2959 atomic/atomicrmw.
2960 - s_waitcnt lgkmcnt(0)
2961 must happen after
2962 any preceding
2963 local/generic
2964 load/store/load
2965 atomic/store
2966 atomic/atomicrmw.
2967 - Must happen before
2968 the following
2969 store.
2970 - Ensures that all
2971 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00002972 to memory have
Tony Tyef16a45e2017-06-06 20:31:59 +00002973 completed before
2974 performing the
2975 store that is being
2976 released.
2977
2978 2. buffer/global/ds/flat_store
2979 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2980 - wavefront - local
2981 - generic
2982 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002983
2984 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002985 - Must happen after
2986 any preceding
2987 local/generic
2988 load/store/load
2989 atomic/store
2990 atomic/atomicrmw.
2991 - Must happen before
2992 the following
2993 atomicrmw.
2994 - Ensures that all
2995 memory operations
2996 to local have
2997 completed before
2998 performing the
2999 atomicrmw that is
3000 being released.
3001
3002 2. buffer/global/flat_atomic
3003 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye6baa6d22017-10-18 22:16:55 +00003004 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3005
3006 - If OpenCL, omit.
3007 - Must happen after
3008 any preceding
3009 local/generic
3010 load/store/load
3011 atomic/store
3012 atomic/atomicrmw.
3013 - Must happen before
3014 the following
3015 atomicrmw.
3016 - Ensures that all
3017 memory operations
3018 to local have
3019 completed before
3020 performing the
3021 atomicrmw that is
3022 being released.
3023
3024 2. flat_atomic
3025 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
3026 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003027
3028 - If OpenCL, omit
3029 lgkmcnt(0).
3030 - Could be split into
3031 separate s_waitcnt
3032 vmcnt(0) and
3033 s_waitcnt
3034 lgkmcnt(0) to allow
3035 them to be
3036 independently moved
3037 according to the
3038 following rules.
3039 - s_waitcnt vmcnt(0)
3040 must happen after
3041 any preceding
3042 global/generic
3043 load/store/load
3044 atomic/store
3045 atomic/atomicrmw.
3046 - s_waitcnt lgkmcnt(0)
3047 must happen after
3048 any preceding
3049 local/generic
3050 load/store/load
3051 atomic/store
3052 atomic/atomicrmw.
3053 - Must happen before
3054 the following
3055 atomicrmw.
3056 - Ensures that all
3057 memory operations
3058 to global and local
3059 have completed
3060 before performing
3061 the atomicrmw that
3062 is being released.
3063
Tony Tye6baa6d22017-10-18 22:16:55 +00003064 2. buffer/global/ds/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003065 fence release - singlethread *none* *none*
3066 - wavefront
3067 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3068
3069 - If OpenCL and
3070 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003071 not generic, omit.
3072 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003073 currently has no
3074 address space on
3075 the fence need to
3076 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003077 always generate. If
3078 fence had an
3079 address space then
3080 set to address
3081 space of OpenCL
3082 fence flag, or to
3083 generic if both
3084 local and global
3085 flags are
3086 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003087 - Must happen after
3088 any preceding
3089 local/generic
3090 load/load
3091 atomic/store/store
3092 atomic/atomicrmw.
3093 - Must happen before
3094 any following store
3095 atomic/atomicrmw
3096 with an equal or
3097 wider sync scope
3098 and memory ordering
3099 stronger than
3100 unordered (this is
3101 termed the
3102 fence-paired-atomic).
3103 - Ensures that all
3104 memory operations
3105 to local have
3106 completed before
3107 performing the
3108 following
3109 fence-paired-atomic.
3110
Tony Tye6baa6d22017-10-18 22:16:55 +00003111 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3112 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003113
3114 - If OpenCL and
3115 address space is
3116 not generic, omit
3117 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003118 - If OpenCL and
3119 address space is
3120 local, omit
3121 vmcnt(0).
3122 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003123 currently has no
3124 address space on
3125 the fence need to
3126 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003127 always generate. If
3128 fence had an
3129 address space then
3130 set to address
3131 space of OpenCL
3132 fence flag, or to
3133 generic if both
3134 local and global
3135 flags are
3136 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003137 - Could be split into
3138 separate s_waitcnt
3139 vmcnt(0) and
3140 s_waitcnt
3141 lgkmcnt(0) to allow
3142 them to be
3143 independently moved
3144 according to the
3145 following rules.
3146 - s_waitcnt vmcnt(0)
3147 must happen after
3148 any preceding
3149 global/generic
3150 load/store/load
3151 atomic/store
3152 atomic/atomicrmw.
3153 - s_waitcnt lgkmcnt(0)
3154 must happen after
3155 any preceding
3156 local/generic
3157 load/store/load
3158 atomic/store
3159 atomic/atomicrmw.
3160 - Must happen before
3161 any following store
3162 atomic/atomicrmw
3163 with an equal or
3164 wider sync scope
3165 and memory ordering
3166 stronger than
3167 unordered (this is
3168 termed the
3169 fence-paired-atomic).
3170 - Ensures that all
3171 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00003172 have
Tony Tyef16a45e2017-06-06 20:31:59 +00003173 completed before
3174 performing the
3175 following
3176 fence-paired-atomic.
3177
3178 **Acquire-Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003179 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003180 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3181 - wavefront - local
3182 - generic
3183 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3184
Tony Tye6baa6d22017-10-18 22:16:55 +00003185 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003186 - Must happen after
3187 any preceding
3188 local/generic
3189 load/store/load
3190 atomic/store
3191 atomic/atomicrmw.
3192 - Must happen before
3193 the following
3194 atomicrmw.
3195 - Ensures that all
3196 memory operations
3197 to local have
3198 completed before
3199 performing the
3200 atomicrmw that is
3201 being released.
3202
Tony Tye6baa6d22017-10-18 22:16:55 +00003203 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003204 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3205 2. s_waitcnt lgkmcnt(0)
3206
Tony Tye6baa6d22017-10-18 22:16:55 +00003207 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003208 - Must happen before
3209 any following
3210 global/generic
3211 load/load
3212 atomic/store/store
3213 atomic/atomicrmw.
3214 - Ensures any
3215 following global
3216 data read is no
3217 older than the load
3218 atomic value being
3219 acquired.
3220
3221 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3222
Tony Tye6baa6d22017-10-18 22:16:55 +00003223 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003224 - Must happen after
3225 any preceding
3226 local/generic
3227 load/store/load
3228 atomic/store
3229 atomic/atomicrmw.
3230 - Must happen before
3231 the following
3232 atomicrmw.
3233 - Ensures that all
3234 memory operations
3235 to local have
3236 completed before
3237 performing the
3238 atomicrmw that is
3239 being released.
3240
3241 2. flat_atomic
3242 3. s_waitcnt lgkmcnt(0)
3243
Tony Tye6baa6d22017-10-18 22:16:55 +00003244 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003245 - Must happen before
3246 any following
3247 global/generic
3248 load/load
3249 atomic/store/store
3250 atomic/atomicrmw.
3251 - Ensures any
3252 following global
3253 data read is no
3254 older than the load
3255 atomic value being
3256 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00003257
3258 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3259 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003260
3261 - If OpenCL, omit
3262 lgkmcnt(0).
3263 - Could be split into
3264 separate s_waitcnt
3265 vmcnt(0) and
3266 s_waitcnt
3267 lgkmcnt(0) to allow
3268 them to be
3269 independently moved
3270 according to the
3271 following rules.
3272 - s_waitcnt vmcnt(0)
3273 must happen after
3274 any preceding
3275 global/generic
3276 load/store/load
3277 atomic/store
3278 atomic/atomicrmw.
3279 - s_waitcnt lgkmcnt(0)
3280 must happen after
3281 any preceding
3282 local/generic
3283 load/store/load
3284 atomic/store
3285 atomic/atomicrmw.
3286 - Must happen before
3287 the following
3288 atomicrmw.
3289 - Ensures that all
3290 memory operations
3291 to global have
3292 completed before
3293 performing the
3294 atomicrmw that is
3295 being released.
3296
Tony Tye6baa6d22017-10-18 22:16:55 +00003297 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003298 3. s_waitcnt vmcnt(0)
3299
3300 - Must happen before
3301 following
3302 buffer_wbinvl1_vol.
3303 - Ensures the
3304 atomicrmw has
3305 completed before
3306 invalidating the
3307 cache.
3308
3309 4. buffer_wbinvl1_vol
3310
3311 - Must happen before
3312 any following
3313 global/generic
3314 load/load
3315 atomic/atomicrmw.
3316 - Ensures that
3317 following loads
3318 will not see stale
3319 global data.
3320
Tony Tye6baa6d22017-10-18 22:16:55 +00003321 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3322 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003323
3324 - If OpenCL, omit
3325 lgkmcnt(0).
3326 - Could be split into
3327 separate s_waitcnt
3328 vmcnt(0) and
3329 s_waitcnt
3330 lgkmcnt(0) to allow
3331 them to be
3332 independently moved
3333 according to the
3334 following rules.
3335 - s_waitcnt vmcnt(0)
3336 must happen after
3337 any preceding
3338 global/generic
3339 load/store/load
3340 atomic/store
3341 atomic/atomicrmw.
3342 - s_waitcnt lgkmcnt(0)
3343 must happen after
3344 any preceding
3345 local/generic
3346 load/store/load
3347 atomic/store
3348 atomic/atomicrmw.
3349 - Must happen before
3350 the following
3351 atomicrmw.
3352 - Ensures that all
3353 memory operations
3354 to global have
3355 completed before
3356 performing the
3357 atomicrmw that is
3358 being released.
3359
3360 2. flat_atomic
3361 3. s_waitcnt vmcnt(0) &
3362 lgkmcnt(0)
3363
3364 - If OpenCL, omit
3365 lgkmcnt(0).
3366 - Must happen before
3367 following
3368 buffer_wbinvl1_vol.
3369 - Ensures the
3370 atomicrmw has
3371 completed before
3372 invalidating the
3373 cache.
3374
3375 4. buffer_wbinvl1_vol
3376
3377 - Must happen before
3378 any following
3379 global/generic
3380 load/load
3381 atomic/atomicrmw.
3382 - Ensures that
3383 following loads
3384 will not see stale
3385 global data.
3386
3387 fence acq_rel - singlethread *none* *none*
3388 - wavefront
3389 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3390
3391 - If OpenCL and
3392 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003393 not generic, omit.
3394 - However,
Tony Tyef16a45e2017-06-06 20:31:59 +00003395 since LLVM
3396 currently has no
3397 address space on
3398 the fence need to
3399 conservatively
3400 always generate
3401 (see comment for
3402 previous fence).
3403 - Must happen after
3404 any preceding
3405 local/generic
3406 load/load
3407 atomic/store/store
3408 atomic/atomicrmw.
3409 - Must happen before
3410 any following
3411 global/generic
3412 load/load
3413 atomic/store/store
3414 atomic/atomicrmw.
3415 - Ensures that all
3416 memory operations
3417 to local have
3418 completed before
3419 performing any
3420 following global
3421 memory operations.
3422 - Ensures that the
3423 preceding
3424 local/generic load
3425 atomic/atomicrmw
3426 with an equal or
3427 wider sync scope
3428 and memory ordering
3429 stronger than
3430 unordered (this is
3431 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003432 acquire-fence-paired-atomic
3433 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003434 before following
3435 global memory
3436 operations. This
3437 satisfies the
3438 requirements of
3439 acquire.
3440 - Ensures that all
3441 previous memory
3442 operations have
3443 completed before a
3444 following
3445 local/generic store
3446 atomic/atomicrmw
3447 with an equal or
3448 wider sync scope
3449 and memory ordering
3450 stronger than
3451 unordered (this is
3452 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003453 release-fence-paired-atomic
3454 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003455 requirements of
3456 release.
3457
Tony Tye6baa6d22017-10-18 22:16:55 +00003458 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3459 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003460
3461 - If OpenCL and
3462 address space is
3463 not generic, omit
3464 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003465 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003466 currently has no
3467 address space on
3468 the fence need to
3469 conservatively
3470 always generate
3471 (see comment for
3472 previous fence).
3473 - Could be split into
3474 separate s_waitcnt
3475 vmcnt(0) and
3476 s_waitcnt
3477 lgkmcnt(0) to allow
3478 them to be
3479 independently moved
3480 according to the
3481 following rules.
3482 - s_waitcnt vmcnt(0)
3483 must happen after
3484 any preceding
3485 global/generic
3486 load/store/load
3487 atomic/store
3488 atomic/atomicrmw.
3489 - s_waitcnt lgkmcnt(0)
3490 must happen after
3491 any preceding
3492 local/generic
3493 load/store/load
3494 atomic/store
3495 atomic/atomicrmw.
3496 - Must happen before
3497 the following
3498 buffer_wbinvl1_vol.
3499 - Ensures that the
3500 preceding
3501 global/local/generic
3502 load
3503 atomic/atomicrmw
3504 with an equal or
3505 wider sync scope
3506 and memory ordering
3507 stronger than
3508 unordered (this is
3509 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003510 acquire-fence-paired-atomic
3511 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003512 before invalidating
3513 the cache. This
3514 satisfies the
3515 requirements of
3516 acquire.
3517 - Ensures that all
3518 previous memory
3519 operations have
3520 completed before a
3521 following
3522 global/local/generic
3523 store
3524 atomic/atomicrmw
3525 with an equal or
3526 wider sync scope
3527 and memory ordering
3528 stronger than
3529 unordered (this is
3530 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003531 release-fence-paired-atomic
3532 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003533 requirements of
3534 release.
3535
3536 2. buffer_wbinvl1_vol
3537
3538 - Must happen before
3539 any following
3540 global/generic
3541 load/load
3542 atomic/store/store
3543 atomic/atomicrmw.
3544 - Ensures that
3545 following loads
3546 will not see stale
3547 global data. This
3548 satisfies the
3549 requirements of
3550 acquire.
3551
3552 **Sequential Consistent Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003553 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003554 load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003555 - wavefront - local load atomic acquire,
3556 - generic except must generated
3557 all instructions even
3558 for OpenCL.*
3559 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3560 - generic
3561 - Must
3562 happen after
3563 preceding
3564 global/generic load
3565 atomic/store
3566 atomic/atomicrmw
3567 with memory
3568 ordering of seq_cst
3569 and with equal or
3570 wider sync scope.
3571 (Note that seq_cst
3572 fences have their
3573 own s_waitcnt
3574 lgkmcnt(0) and so do
3575 not need to be
3576 considered.)
3577 - Ensures any
3578 preceding
3579 sequential
3580 consistent local
3581 memory instructions
3582 have completed
3583 before executing
3584 this sequentially
3585 consistent
3586 instruction. This
3587 prevents reordering
3588 a seq_cst store
3589 followed by a
3590 seq_cst load. (Note
3591 that seq_cst is
3592 stronger than
3593 acquire/release as
3594 the reordering of
3595 load acquire
3596 followed by a store
3597 release is
3598 prevented by the
3599 waitcnt of
3600 the release, but
3601 there is nothing
3602 preventing a store
3603 release followed by
3604 load acquire from
3605 competing out of
3606 order.)
3607
3608 2. *Following
3609 instructions same as
3610 corresponding load
3611 atomic acquire,
3612 except must generated
3613 all instructions even
3614 for OpenCL.*
3615 load atomic seq_cst - workgroup - local *Same as corresponding
3616 load atomic acquire,
3617 except must generated
3618 all instructions even
3619 for OpenCL.*
3620 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3621 - system - generic vmcnt(0)
3622
3623 - Could be split into
3624 separate s_waitcnt
3625 vmcnt(0)
3626 and s_waitcnt
3627 lgkmcnt(0) to allow
3628 them to be
3629 independently moved
3630 according to the
3631 following rules.
3632 - waitcnt lgkmcnt(0)
3633 must happen after
3634 preceding
3635 global/generic load
3636 atomic/store
3637 atomic/atomicrmw
3638 with memory
3639 ordering of seq_cst
3640 and with equal or
3641 wider sync scope.
3642 (Note that seq_cst
3643 fences have their
3644 own s_waitcnt
3645 lgkmcnt(0) and so do
3646 not need to be
3647 considered.)
3648 - waitcnt vmcnt(0)
3649 must happen after
Tony Tyef16a45e2017-06-06 20:31:59 +00003650 preceding
3651 global/generic load
3652 atomic/store
3653 atomic/atomicrmw
3654 with memory
3655 ordering of seq_cst
3656 and with equal or
3657 wider sync scope.
3658 (Note that seq_cst
3659 fences have their
3660 own s_waitcnt
3661 vmcnt(0) and so do
3662 not need to be
3663 considered.)
3664 - Ensures any
3665 preceding
3666 sequential
3667 consistent global
3668 memory instructions
3669 have completed
3670 before executing
3671 this sequentially
3672 consistent
3673 instruction. This
3674 prevents reordering
3675 a seq_cst store
3676 followed by a
Tony Tye6baa6d22017-10-18 22:16:55 +00003677 seq_cst load. (Note
Tony Tyef16a45e2017-06-06 20:31:59 +00003678 that seq_cst is
3679 stronger than
3680 acquire/release as
3681 the reordering of
3682 load acquire
3683 followed by a store
3684 release is
3685 prevented by the
Tony Tye6baa6d22017-10-18 22:16:55 +00003686 waitcnt of
Tony Tyef16a45e2017-06-06 20:31:59 +00003687 the release, but
3688 there is nothing
3689 preventing a store
3690 release followed by
3691 load acquire from
3692 competing out of
3693 order.)
3694
3695 2. *Following
3696 instructions same as
3697 corresponding load
Tony Tye6baa6d22017-10-18 22:16:55 +00003698 atomic acquire,
3699 except must generated
3700 all instructions even
3701 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003702 store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003703 - wavefront - local store atomic release,
3704 - workgroup - generic except must generated
3705 all instructions even
3706 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003707 store atomic seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003708 - system - generic store atomic release,
3709 except must generated
3710 all instructions even
3711 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003712 atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003713 - wavefront - local atomicrmw acq_rel,
3714 - workgroup - generic except must generated
3715 all instructions even
3716 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003717 atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003718 - system - generic atomicrmw acq_rel,
3719 except must generated
3720 all instructions even
3721 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003722 fence seq_cst - singlethread *none* *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003723 - wavefront fence acq_rel,
3724 - workgroup except must generated
3725 - agent all instructions even
3726 - system for OpenCL.*
3727 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00003728
3729The memory order also adds the single thread optimization constrains defined in
3730table
3731:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3732
3733 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3734 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3735
3736 ============ ==============================================================
3737 LLVM Memory Optimization Constraints
3738 Ordering
3739 ============ ==============================================================
3740 unordered *none*
3741 monotonic *none*
3742 acquire - If a load atomic/atomicrmw then no following load/load
3743 atomic/store/ store atomic/atomicrmw/fence instruction can
3744 be moved before the acquire.
3745 - If a fence then same as load atomic, plus no preceding
3746 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003747 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003748 atomic/store/ store atomic/atomicrmw/fence instruction can
3749 be moved after the release.
3750 - If a fence then same as store atomic, plus no following
3751 associated fence-paired-atomic can be moved before the
3752 fence.
3753 acq_rel Same constraints as both acquire and release.
3754 seq_cst - If a load atomic then same constraints as acquire, plus no
3755 preceding sequentially consistent load atomic/store
3756 atomic/atomicrmw/fence instruction can be moved after the
3757 seq_cst.
3758 - If a store atomic then the same constraints as release, plus
3759 no following sequentially consistent load atomic/store
3760 atomic/atomicrmw/fence instruction can be moved before the
3761 seq_cst.
3762 - If an atomicrmw/fence then same constraints as acq_rel.
3763 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003764
Wei Ding16289cf2017-02-21 18:48:01 +00003765Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003766~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003767
Tony Tyef16a45e2017-06-06 20:31:59 +00003768For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3769(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3770the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003771
Tony Tyef16a45e2017-06-06 20:31:59 +00003772 .. table:: AMDGPU Trap Handler for AMDHSA OS
3773 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003774
Tony Tyef16a45e2017-06-06 20:31:59 +00003775 =================== =============== =============== =======================
3776 Usage Code Sequence Trap Handler Description
3777 Inputs
3778 =================== =============== =============== =======================
3779 reserved ``s_trap 0x00`` Reserved by hardware.
3780 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3781 ``queue_ptr`` ``debugtrap``
3782 ``VGPR0``: intrinsic (not
3783 ``arg`` implemented).
3784 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3785 ``queue_ptr`` terminated and its
3786 associated queue put
3787 into the error state.
Tony Tye43259df2018-05-16 16:19:34 +00003788 ``llvm.debugtrap`` ``s_trap 0x03`` - If debugger not
3789 installed then
3790 behaves as a
3791 no-operation. The
3792 trap handler is
3793 entered and
3794 immediately returns
3795 to continue
3796 execution of the
3797 wavefront.
3798 - If the debugger is
3799 installed, causes
3800 the debug trap to be
3801 reported by the
3802 debugger and the
3803 wavefront is put in
3804 the halt state until
3805 resumed by the
3806 debugger.
3807 reserved ``s_trap 0x04`` Reserved.
3808 reserved ``s_trap 0x05`` Reserved.
3809 reserved ``s_trap 0x06`` Reserved.
3810 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
Tony Tyef16a45e2017-06-06 20:31:59 +00003811 breakpoints.
Tony Tye43259df2018-05-16 16:19:34 +00003812 reserved ``s_trap 0x08`` Reserved.
3813 reserved ``s_trap 0xfe`` Reserved.
3814 reserved ``s_trap 0xff`` Reserved.
Tony Tyef16a45e2017-06-06 20:31:59 +00003815 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003816
Tim Corringhamaf2dfc62018-04-04 13:02:09 +00003817AMDPAL
3818------
3819
3820This section provides code conventions used when the target triple OS is
3821``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
3822from the application/runtime to each invocation of a hardware shader. These
3823parameters include both generic, application-controlled parameters called
3824*user data* as well as system-generated parameters that are a product of the
3825draw or dispatch execution.
3826
3827User Data
3828~~~~~~~~~
3829
3830Each hardware stage has a set of 32-bit *user data registers* which can be
3831written from a command buffer and then loaded into SGPRs when waves are launched
3832via a subsequent dispatch or draw operation. This is the way most arguments are
3833passed from the application/runtime to a hardware shader.
3834
3835Compute User Data
3836~~~~~~~~~~~~~~~~~
3837
3838Compute shader user data mappings are simpler than graphics shaders, and have a
3839fixed mapping.
3840
3841Note that there are always 10 available *user data entries* in registers -
3842entries beyond that limit must be fetched from memory (via the spill table
3843pointer) by the shader.
3844
3845 .. table:: PAL Compute Shader User Data Registers
3846 :name: pal-compute-user-data-registers
3847
3848 ============= ================================
3849 User Register Description
3850 ============= ================================
3851 0 Global Internal Table (32-bit pointer)
3852 1 Per-Shader Internal Table (32-bit pointer)
3853 2 - 11 Application-Controlled User Data (10 32-bit values)
3854 12 Spill Table (32-bit pointer)
3855 13 - 14 Thread Group Count (64-bit pointer)
3856 15 GDS Range
3857 ============= ================================
3858
3859Graphics User Data
3860~~~~~~~~~~~~~~~~~~
3861
3862Graphics pipelines support a much more flexible user data mapping:
3863
3864 .. table:: PAL Graphics Shader User Data Registers
3865 :name: pal-graphics-user-data-registers
3866
3867 ============= ================================
3868 User Register Description
3869 ============= ================================
3870 0 Global Internal Table (32-bit pointer)
3871 + Per-Shader Internal Table (32-bit pointer)
3872 + 1-15 Application Controlled User Data
3873 (1-15 Contiguous 32-bit Values in Registers)
3874 + Spill Table (32-bit pointer)
3875 + Draw Index (First Stage Only)
3876 + Vertex Offset (First Stage Only)
3877 + Instance Offset (First Stage Only)
3878 ============= ================================
3879
3880 The placement of the global internal table remains fixed in the first *user
3881 data SGPR register*. Otherwise all parameters are optional, and can be mapped
3882 to any desired *user data SGPR register*, with the following regstrictions:
3883
3884 * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
3885 activehardware stage in a graphics pipeline (i.e. where the API vertex
3886 shader runs).
3887
3888 * Application-controlled user data must be mapped into a contiguous range of
3889 user data registers.
3890
3891 * The application-controlled user data range supports compaction remapping, so
3892 only *entries* that are actually consumed by the shader must be assigned to
3893 corresponding *registers*. Note that in order to support an efficient runtime
3894 implementation, the remapping must pack *registers* in the same order as
3895 *entries*, with unused *entries* removed.
3896
3897.. _pal_global_internal_table:
3898
3899Global Internal Table
3900~~~~~~~~~~~~~~~~~~~~~
3901
3902The global internal table is a table of *shader resource descriptors* (SRDs) that
3903define how certain engine-wide, runtime-managed resources should be accessed
3904from a shader. The majority of these resources have HW-defined formats, and it
3905is up to the compiler to write/read data as required by the target hardware.
3906
3907The following table illustrates the required format:
3908
3909 .. table:: PAL Global Internal Table
3910 :name: pal-git-table
3911
3912 ============= ================================
3913 Offset Description
3914 ============= ================================
3915 0-3 Graphics Scratch SRD
3916 4-7 Compute Scratch SRD
3917 8-11 ES/GS Ring Output SRD
3918 12-15 ES/GS Ring Input SRD
3919 16-19 GS/VS Ring Output #0
3920 20-23 GS/VS Ring Output #1
3921 24-27 GS/VS Ring Output #2
3922 28-31 GS/VS Ring Output #3
3923 32-35 GS/VS Ring Input SRD
3924 36-39 Tessellation Factor Buffer SRD
3925 40-43 Off-Chip LDS Buffer SRD
3926 44-47 Off-Chip Param Cache Buffer SRD
3927 48-51 Sample Position Buffer SRD
3928 52 vaRange::ShadowDescriptorTable High Bits
3929 ============= ================================
3930
3931 The pointer to the global internal table passed to the shader as user data
3932 is a 32-bit pointer. The top 32 bits should be assumed to be the same as
3933 the top 32 bits of the pipeline, so the shader may use the program
3934 counter's top 32 bits.
3935
Tony Tye46d35762017-08-15 20:47:41 +00003936Unspecified OS
3937--------------
3938
3939This section provides code conventions used when the target triple OS is
3940empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003941
3942Trap Handler ABI
3943~~~~~~~~~~~~~~~~
3944
3945For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3946not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3947instructions are handled as follows:
3948
3949 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3950 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3951
3952 =============== =============== ===========================================
3953 Usage Code Sequence Description
3954 =============== =============== ===========================================
3955 llvm.trap s_endpgm Causes wavefront to be terminated.
3956 llvm.debugtrap *none* Compiler warning given that there is no
3957 trap handler installed.
3958 =============== =============== ===========================================
3959
3960Source Languages
3961================
3962
3963.. _amdgpu-opencl:
3964
3965OpenCL
3966------
3967
Tony Tyef16a45e2017-06-06 20:31:59 +00003968When the language is OpenCL the following differences occur:
3969
39701. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
Tony Tye7a893d42018-03-23 18:45:18 +000039712. The AMDGPU backend appends additional arguments to the kernel's explicit
3972 arguments for the AMDHSA OS (see
3973 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
Tony Tye46d35762017-08-15 20:47:41 +000039743. Additional metadata is generated
Tony Tye7a893d42018-03-23 18:45:18 +00003975 (see :ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003976
Tony Tye7a893d42018-03-23 18:45:18 +00003977 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
3978 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
3979
3980 ======== ==== ========= ===========================================
3981 Position Byte Byte Description
3982 Size Alignment
3983 ======== ==== ========= ===========================================
Tony Tye88441a32018-03-23 18:58:47 +00003984 1 8 8 OpenCL Global Offset X
3985 2 8 8 OpenCL Global Offset Y
3986 3 8 8 OpenCL Global Offset Z
3987 4 8 8 OpenCL address of printf buffer
3988 5 8 8 OpenCL address of virtual queue used by
3989 enqueue_kernel.
3990 6 8 8 OpenCL address of AqlWrap struct used by
3991 enqueue_kernel.
Tony Tye7a893d42018-03-23 18:45:18 +00003992 ======== ==== ========= ===========================================
Tony Tyef16a45e2017-06-06 20:31:59 +00003993
3994.. _amdgpu-hcc:
3995
3996HCC
3997---
3998
Tony Tye7a893d42018-03-23 18:45:18 +00003999When the language is HCC the following differences occur:
Tony Tyef16a45e2017-06-06 20:31:59 +00004000
40011. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
4002
Tom Stellard45bb48e2015-06-13 03:28:10 +00004003Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00004004---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00004005
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004006AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef59d0712017-11-10 20:51:43 +00004007It supports AMDGCN GFX6-GFX9.
Tom Stellard45bb48e2015-06-13 03:28:10 +00004008
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004009This section describes general syntax for instructions and operands.
4010
4011Instructions
4012~~~~~~~~~~~~
4013
4014.. toctree::
4015 :hidden:
4016
4017 AMDGPUAsmGFX7
4018 AMDGPUAsmGFX8
4019 AMDGPUAsmGFX9
4020 AMDGPUOperandSyntax
4021
4022An instruction has the following syntax:
4023
4024 *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*
4025
4026Note that operands are normally comma-separated while modifiers are space-separated.
4027
4028The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
4029
4030See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
4031:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
4032
4033Note that features under development are not included in this description.
4034
4035For more information about instructions, their semantics and supported combinations of
Tony Tyef16a45e2017-06-06 20:31:59 +00004036operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004037[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00004038
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004039Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00004040~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00004041
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004042The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004043
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004044* SGPR registers: s0, ... or s[0], ...
4045* VGPR registers: v0, ... or v[0], ...
4046* TTMP registers: ttmp0, ... or ttmp[0], ...
4047* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
4048* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
4049* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
4050* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
4051* Register index expressions: v[2*2], s[1-1:2-1]
4052* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00004053
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004054Modifiers
4055~~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00004056
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004057Detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004058
Tony Tyef16a45e2017-06-06 20:31:59 +00004059Instruction Examples
4060~~~~~~~~~~~~~~~~~~~~
4061
4062DS
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004063++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004064
4065.. code-block:: nasm
4066
4067 ds_add_u32 v2, v4 offset:16
4068 ds_write_src2_b64 v2 offset0:4 offset1:8
4069 ds_cmpst_f32 v2, v4, v6
4070 ds_min_rtn_f64 v[8:9], v2, v[4:5]
4071
4072
4073For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
4074
Tony Tyef16a45e2017-06-06 20:31:59 +00004075FLAT
4076++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004077
4078.. code-block:: nasm
4079
4080 flat_load_dword v1, v[3:4]
4081 flat_store_dwordx3 v[3:4], v[5:7]
4082 flat_atomic_swap v1, v[3:4], v5 glc
4083 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
4084 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
4085
4086For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
4087
Tony Tyef16a45e2017-06-06 20:31:59 +00004088MUBUF
4089+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004090
4091.. code-block:: nasm
4092
4093 buffer_load_dword v1, off, s[4:7], s1
4094 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
4095 buffer_store_format_xy v[1:2], off, s[4:7], s1
4096 buffer_wbinvl1
4097 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
4098
4099For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
4100
Tony Tyef16a45e2017-06-06 20:31:59 +00004101SMRD/SMEM
4102+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004103
4104.. code-block:: nasm
4105
4106 s_load_dword s1, s[2:3], 0xfc
4107 s_load_dwordx8 s[8:15], s[2:3], s4
4108 s_load_dwordx16 s[88:103], s[2:3], s4
4109 s_dcache_inv_vol
4110 s_memtime s[4:5]
4111
4112For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
4113
Tony Tyef16a45e2017-06-06 20:31:59 +00004114SOP1
4115++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004116
4117.. code-block:: nasm
4118
4119 s_mov_b32 s1, s2
4120 s_mov_b64 s[0:1], 0x80000000
4121 s_cmov_b32 s1, 200
4122 s_wqm_b64 s[2:3], s[4:5]
4123 s_bcnt0_i32_b64 s1, s[2:3]
4124 s_swappc_b64 s[2:3], s[4:5]
4125 s_cbranch_join s[4:5]
4126
4127For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
4128
Tony Tyef16a45e2017-06-06 20:31:59 +00004129SOP2
4130++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004131
4132.. code-block:: nasm
4133
4134 s_add_u32 s1, s2, s3
4135 s_and_b64 s[2:3], s[4:5], s[6:7]
4136 s_cselect_b32 s1, s2, s3
4137 s_andn2_b32 s2, s4, s6
4138 s_lshr_b64 s[2:3], s[4:5], s6
4139 s_ashr_i32 s2, s4, s6
4140 s_bfm_b64 s[2:3], s4, s6
4141 s_bfe_i64 s[2:3], s[4:5], s6
4142 s_cbranch_g_fork s[4:5], s[6:7]
4143
4144For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
4145
Tony Tyef16a45e2017-06-06 20:31:59 +00004146SOPC
4147++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004148
4149.. code-block:: nasm
4150
4151 s_cmp_eq_i32 s1, s2
4152 s_bitcmp1_b32 s1, s2
4153 s_bitcmp0_b64 s[2:3], s4
4154 s_setvskip s3, s5
4155
4156For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
4157
Tony Tyef16a45e2017-06-06 20:31:59 +00004158SOPP
4159++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004160
4161.. code-block:: nasm
4162
4163 s_barrier
4164 s_nop 2
4165 s_endpgm
4166 s_waitcnt 0 ; Wait for all counters to be 0
4167 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
4168 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
4169 s_sethalt 9
4170 s_sleep 10
4171 s_sendmsg 0x1
4172 s_sendmsg sendmsg(MSG_INTERRUPT)
4173 s_trap 1
4174
4175For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
4176
4177Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00004178of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00004179range or acceptable values.
4180
Tony Tyef16a45e2017-06-06 20:31:59 +00004181VALU
4182++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00004183
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004184For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
4185the assembler will automatically use optimal encoding based on its operands.
4186To force specific encoding, one can add a suffix to the opcode of the instruction:
4187
4188* _e32 for 32-bit VOP1/VOP2/VOPC
4189* _e64 for 64-bit VOP3
4190* _dpp for VOP_DPP
4191* _sdwa for VOP_SDWA
4192
4193VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004194
4195.. code-block:: nasm
4196
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004197 v_mov_b32 v1, v2
4198 v_mov_b32_e32 v1, v2
4199 v_nop
4200 v_cvt_f64_i32_e32 v[1:2], v2
4201 v_floor_f32_e32 v1, v2
4202 v_bfrev_b32_e32 v1, v2
4203 v_add_f32_e32 v1, v2, v3
4204 v_mul_i32_i24_e64 v1, v2, 3
4205 v_mul_i32_i24_e32 v1, -3, v3
4206 v_mul_i32_i24_e32 v1, -100, v3
4207 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4208 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00004209
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004210VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004211
4212.. code-block:: nasm
4213
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004214 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4215 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4216 v_mov_b32 v0, v0 wave_shl:1
4217 v_mov_b32 v0, v0 row_mirror
4218 v_mov_b32 v0, v0 row_bcast:31
4219 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4220 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4221 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00004222
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004223VOP_SDWA examples:
4224
4225.. code-block:: nasm
4226
4227 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4228 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4229 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4230 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4231 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4232
4233For full list of supported instructions, refer to "Vector ALU instructions".
4234
4235HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00004236~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004237
4238AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4239one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00004240
4241.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00004242+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00004243
4244*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004245object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004246
4247.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00004248+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4249
Tom Stellard347ac792015-06-26 21:15:07 +00004250
4251*major*, *minor*, and *stepping* are all integers that describe the instruction
4252set architecture (ISA) version of the assembly program.
4253
4254*vendor* and *arch* are quoted strings. *vendor* should always be equal to
4255"AMD" and *arch* should always be equal to "AMDGPU".
4256
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004257By default, the assembler will derive the ISA version, *vendor*, and *arch*
4258from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004259
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004260.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00004261+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004262
4263This directives specifies that the symbol with given name is a kernel entry point
4264(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00004265
4266.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00004267++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00004268
4269This directive marks the beginning of a list of key / value pairs that are used
4270to specify the amd_kernel_code_t object that will be emitted by the assembler.
4271The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4272any amd_kernel_code_t values that are unspecified a default value will be
4273used. The default value for all keys is 0, with the following exceptions:
4274
4275- *kernel_code_version_major* defaults to 1.
4276- *machine_kind* defaults to 1.
4277- *machine_version_major*, *machine_version_minor*, and
4278 *machine_version_stepping* are derived from the value of the -mcpu option
4279 that is passed to the assembler.
4280- *kernel_code_entry_byte_offset* defaults to 256.
4281- *wavefront_size* defaults to 6.
4282- *kernarg_segment_alignment*, *group_segment_alignment*, and
Tony Tye6baa6d22017-10-18 22:16:55 +00004283 *private_segment_alignment* default to 4. Note that alignments are specified
Tom Stellardff7416b2015-06-26 21:58:31 +00004284 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4285
4286The *.amd_kernel_code_t* directive must be placed immediately after the
4287function label and before any instructions.
4288
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004289For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4290comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00004291
4292Here is an example of a minimal amd_kernel_code_t specification:
4293
Aaron Ballman887ad0e2016-07-19 17:46:55 +00004294.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00004295
4296 .hsa_code_object_version 1,0
4297 .hsa_code_object_isa
4298
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004299 .hsatext
4300 .globl hello_world
4301 .p2align 8
4302 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00004303
4304 hello_world:
4305
4306 .amd_kernel_code_t
4307 enable_sgpr_kernarg_segment_ptr = 1
4308 is_ptr64 = 1
4309 compute_pgm_rsrc1_vgprs = 0
4310 compute_pgm_rsrc1_sgprs = 0
4311 compute_pgm_rsrc2_user_sgpr = 2
4312 kernarg_segment_byte_size = 8
4313 wavefront_sgpr_count = 2
4314 workitem_vgpr_count = 3
4315 .end_amd_kernel_code_t
4316
4317 s_load_dwordx2 s[0:1], s[0:1] 0x0
4318 v_mov_b32 v0, 3.14159
4319 s_waitcnt lgkmcnt(0)
4320 v_mov_b32 v1, s0
4321 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004322 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00004323 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00004324 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004325 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00004326
4327Additional Documentation
4328========================
4329
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004330.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4331.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4332.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4333.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4334.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4335.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4336.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4337.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004338.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4339.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4340.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4341.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4342.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00004343.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004344.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4345.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__