blob: 7a264af62f40cbfde335918ae8ff53eb60c95197 [file] [log] [blame]
Eugene Zelenko3507b042018-03-21 17:09:35 +00001=============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002User Guide for AMDGPU Backend
3=============================
4
5.. contents::
6 :local:
Tom Stellard45bb48e2015-06-13 03:28:10 +00007
8Introduction
9============
10
Tony Tyef16a45e2017-06-06 20:31:59 +000011The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
12R600 family up until the current GCN families. It lives in the
13``lib/Target/AMDGPU`` directory.
Tom Stellard45bb48e2015-06-13 03:28:10 +000014
Tony Tyef16a45e2017-06-06 20:31:59 +000015LLVM
16====
Tom Stellard45bb48e2015-06-13 03:28:10 +000017
Tony Tyef16a45e2017-06-06 20:31:59 +000018.. _amdgpu-target-triples:
19
20Target Triples
21--------------
22
23Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
24specify the target triple:
25
Tony Tye07d9f102017-11-10 01:00:54 +000026 .. table:: AMDGPU Architectures
27 :name: amdgpu-architecture-table
Tony Tyef16a45e2017-06-06 20:31:59 +000028
Tony Tye07d9f102017-11-10 01:00:54 +000029 ============ ==============================================================
30 Architecture Description
31 ============ ==============================================================
32 ``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
33 ``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
34 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000035
Tony Tye07d9f102017-11-10 01:00:54 +000036 .. table:: AMDGPU Vendors
37 :name: amdgpu-vendor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000038
Tony Tye07d9f102017-11-10 01:00:54 +000039 ============ ==============================================================
40 Vendor Description
41 ============ ==============================================================
42 ``amd`` Can be used for all AMD GPU usage.
43 ``mesa3d`` Can be used if the OS is ``mesa3d``.
44 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000045
Tony Tye07d9f102017-11-10 01:00:54 +000046 .. table:: AMDGPU Operating Systems
47 :name: amdgpu-os-table
Tony Tyef16a45e2017-06-06 20:31:59 +000048
Tony Tye07d9f102017-11-10 01:00:54 +000049 ============== ============================================================
50 OS Description
51 ============== ============================================================
52 *<empty>* Defaults to the *unknown* OS.
53 ``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
54 such as AMD's ROCm [AMD-ROCm]_.
55 ``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
56 runtime.
57 ``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
58 runtime.
59 ============== ============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000060
Tony Tye07d9f102017-11-10 01:00:54 +000061 .. table:: AMDGPU Environments
62 :name: amdgpu-environment-table
Tony Tyef16a45e2017-06-06 20:31:59 +000063
Tony Tye07d9f102017-11-10 01:00:54 +000064 ============ ==============================================================
65 Environment Description
66 ============ ==============================================================
Tony Tye7a893d42018-03-23 18:45:18 +000067 *<empty>* Default.
Tony Tye07d9f102017-11-10 01:00:54 +000068 ============ ==============================================================
Tony Tyef16a45e2017-06-06 20:31:59 +000069
70.. _amdgpu-processors:
71
72Processors
73----------
74
75Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
76names from both the *Processor* and *Alternative Processor* can be used.
77
78 .. table:: AMDGPU Processors
Tony Tye07d9f102017-11-10 01:00:54 +000079 :name: amdgpu-processor-table
Tony Tyef16a45e2017-06-06 20:31:59 +000080
Tony Tye31105cc2017-12-11 15:35:27 +000081 =========== =============== ============ ===== ========= ======= ==================
82 Processor Alternative Target dGPU/ Target ROCm Example
83 Processor Triple APU Features Support Products
84 Architecture Supported
85 [Default]
86 =========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000087 **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
Tony Tye31105cc2017-12-11 15:35:27 +000088 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +000089 ``r600`` ``r600`` dGPU
90 ``r630`` ``r600`` dGPU
91 ``rs880`` ``r600`` dGPU
92 ``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000093 **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
Tony Tye31105cc2017-12-11 15:35:27 +000094 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +000095 ``rv710`` ``r600`` dGPU
96 ``rv730`` ``r600`` dGPU
97 ``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +000098 **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
Tony Tye31105cc2017-12-11 15:35:27 +000099 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000100 ``cedar`` ``r600`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000101 ``cypress`` ``r600`` dGPU
102 ``juniper`` ``r600`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000103 ``redwood`` ``r600`` dGPU
104 ``sumo`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000105 **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
Tony Tye31105cc2017-12-11 15:35:27 +0000106 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000107 ``barts`` ``r600`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000108 ``caicos`` ``r600`` dGPU
109 ``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000110 ``turks`` ``r600`` dGPU
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000111 **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
Tony Tye31105cc2017-12-11 15:35:27 +0000112 -----------------------------------------------------------------------------------
Tony Tye07d9f102017-11-10 01:00:54 +0000113 ``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000114 ``gfx601`` - ``hainan`` ``amdgcn`` dGPU
Tony Tye07d9f102017-11-10 01:00:54 +0000115 - ``oland``
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000116 - ``pitcairn``
117 - ``verde``
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000118 **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
Tony Tye31105cc2017-12-11 15:35:27 +0000119 -----------------------------------------------------------------------------------
120 ``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
121 - A6 Pro-7050B
122 - A8-7100
123 - A8 Pro-7150B
124 - A10-7300
125 - A10 Pro-7350B
126 - FX-7500
127 - A8-7200P
128 - A10-7400P
129 - FX-7600P
130 ``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
131 - FirePro W9100
132 - FirePro S9150
133 - FirePro S9170
134 ``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
135 - Radeon R9 290x
136 - Radeon R390
137 - Radeon R390x
138 ``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
139 - ``mullins`` - E1-2200
140 - E1-2500
141 - E2-3000
142 - E2-3800
143 - A4-5000
144 - A4-5100
145 - A6-5200
146 - A4 Pro-3340B
147 ``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
148 - Radeon HD 8770
149 - R7 260
150 - R7 260X
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000151 **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
Tony Tye31105cc2017-12-11 15:35:27 +0000152 -----------------------------------------------------------------------------------
Tony Tye31105cc2017-12-11 15:35:27 +0000153 ``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
154 [on] - Pro A6-8500B
155 - A8-8600P
156 - Pro A8-8600B
157 - FX-8800P
158 - Pro A12-8800B
159 \ ``amdgcn`` APU - xnack ROCm - A10-8700P
160 [on] - Pro A10-8700B
161 - A10-8780P
162 \ ``amdgcn`` APU - xnack - A10-9600P
163 [on] - A10-9630P
164 - A12-9700P
165 - A12-9730P
166 - FX-9800P
167 - FX-9830P
168 \ ``amdgcn`` APU - xnack - E2-9010
169 [on] - A6-9210
170 - A9-9410
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000171 ``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
172 - ``tonga`` [off] - FirePro S7100
Tony Tye31105cc2017-12-11 15:35:27 +0000173 - FirePro W7100
174 - Radeon R285
175 - Radeon R9 380
176 - Radeon R9 385
177 - Mobile FirePro
178 M7170
179 ``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
180 [off] - Radeon R9 Fury
181 - Radeon R9 FuryX
182 - Radeon Pro Duo
183 - FirePro S9300x2
184 - Radeon Instinct MI8
185 \ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
186 [off] - Radeon RX 480
187 - Radeon Instinct MI6
188 \ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
189 [off]
190 ``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
191 [on]
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +0000192 **GCN GFX9** [AMD-GCN-GFX9]_
Tony Tye31105cc2017-12-11 15:35:27 +0000193 -----------------------------------------------------------------------------------
194 ``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
195 [off] Frontier Edition
196 - Radeon RX Vega 56
197 - Radeon RX Vega 64
198 - Radeon RX Vega 64
199 Liquid
200 - Radeon Instinct MI25
Tony Tyeb6efb902018-04-14 01:58:10 +0000201 ``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G
202 [on] - Ryzen 5 2400G
Matt Arsenault0084adc2018-04-30 19:08:16 +0000203 ``gfx904`` ``amdgcn`` dGPU - xnack *TBA*
204 [off]
205 .. TODO
206 Add product
207 names.
208 ``gfx906`` ``amdgcn`` dGPU - xnack *TBA*
209 [off]
210 .. TODO
211 Add product
212 names.
Tony Tye31105cc2017-12-11 15:35:27 +0000213 =========== =============== ============ ===== ========= ======= ==================
Tony Tye07d9f102017-11-10 01:00:54 +0000214
215.. _amdgpu-target-features:
216
217Target Features
218---------------
219
220Target features control how code is generated to support certain
Tony Tye31105cc2017-12-11 15:35:27 +0000221processor specific features. Not all target features are supported by
222all processors. The runtime must ensure that the features supported by
223the device used to execute the code match the features enabled when
224generating the code. A mismatch of features may result in incorrect
225execution, or a reduction in performance.
226
227The target features supported by each processor, and the default value
228used if not specified explicitly, is listed in
229:ref:`amdgpu-processor-table`.
Tony Tye07d9f102017-11-10 01:00:54 +0000230
231Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
232target features.
233
234For example:
235
236``-mxnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000237 Enable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000238``-mno-xnack``
Tony Tye31105cc2017-12-11 15:35:27 +0000239 Disable the ``xnack`` feature.
Tony Tye07d9f102017-11-10 01:00:54 +0000240
241 .. table:: AMDGPU Target Features
242 :name: amdgpu-target-feature-table
243
Tony Tye31105cc2017-12-11 15:35:27 +0000244 ============== ==================================================
245 Target Feature Description
246 ============== ==================================================
247 -m[no-]xnack Enable/disable generating code that has
248 memory clauses that are compatible with
249 having XNACK replay enabled.
Tony Tye07d9f102017-11-10 01:00:54 +0000250
Tony Tye31105cc2017-12-11 15:35:27 +0000251 This is used for demand paging and page
252 migration. If XNACK replay is enabled in
253 the device, then if a page fault occurs
254 the code may execute incorrectly if the
255 ``xnack`` feature is not enabled. Executing
256 code that has the feature enabled on a
257 device that does not have XNACK replay
258 enabled will execute correctly, but may
259 be less performant than code with the
260 feature disabled.
261 ============== ==================================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000262
263.. _amdgpu-address-spaces:
Tom Stellard3ec09e62016-04-06 01:29:19 +0000264
265Address Spaces
266--------------
267
Tony Tyef16a45e2017-06-06 20:31:59 +0000268The AMDGPU backend uses the following address space mappings.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000269
Tony Tyef16a45e2017-06-06 20:31:59 +0000270The memory space names used in the table, aside from the region memory space, is
271from the OpenCL standard.
Tom Stellard3ec09e62016-04-06 01:29:19 +0000272
Tony Tyef16a45e2017-06-06 20:31:59 +0000273LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard3ec09e62016-04-06 01:29:19 +0000274
Tony Tyef16a45e2017-06-06 20:31:59 +0000275 .. table:: Address Space Mapping
276 :name: amdgpu-address-space-mapping-table
277
Yaxun Liu0124b542018-02-13 18:00:25 +0000278 ================== =================
Tony Tyef16a45e2017-06-06 20:31:59 +0000279 LLVM Address Space Memory Space
Yaxun Liu0124b542018-02-13 18:00:25 +0000280 ================== =================
281 0 Generic (Flat)
282 1 Global
283 2 Region (GDS)
284 3 Local (group/LDS)
285 4 Constant
286 5 Private (Scratch)
287 6 Constant 32-bit
288 ================== =================
Tony Tyef16a45e2017-06-06 20:31:59 +0000289
290.. _amdgpu-memory-scopes:
291
292Memory Scopes
293-------------
294
295This section provides LLVM memory synchronization scopes supported by the AMDGPU
296backend memory model when the target triple OS is ``amdhsa`` (see
297:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
298
299The memory model supported is based on the HSA memory model [HSA]_ which is
300based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
301relation is transitive over the synchonizes-with relation independent of scope,
302and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye07d9f102017-11-10 01:00:54 +0000303table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000304
305This is different to the OpenCL [OpenCL]_ memory model which does not have scope
306inclusion and requires the memory scopes to exactly match. However, this
307is conservatively correct for OpenCL.
308
Tony Tye07d9f102017-11-10 01:00:54 +0000309 .. table:: AMDHSA LLVM Sync Scopes
310 :name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tyef16a45e2017-06-06 20:31:59 +0000311
312 ================ ==========================================================
313 LLVM Sync Scope Description
314 ================ ==========================================================
315 *none* The default: ``system``.
316
317 Synchronizes with, and participates in modification and
318 seq_cst total orderings with, other operations (except
319 image operations) for all address spaces (except private,
320 or generic that accesses private) provided the other
321 operation's sync scope is:
322
323 - ``system``.
324 - ``agent`` and executed by a thread on the same agent.
325 - ``workgroup`` and executed by a thread in the same
326 workgroup.
327 - ``wavefront`` and executed by a thread in the same
328 wavefront.
329
330 ``agent`` Synchronizes with, and participates in modification and
331 seq_cst total orderings with, other operations (except
332 image operations) for all address spaces (except private,
333 or generic that accesses private) provided the other
334 operation's sync scope is:
335
336 - ``system`` or ``agent`` and executed by a thread on the
337 same agent.
338 - ``workgroup`` and executed by a thread in the same
339 workgroup.
340 - ``wavefront`` and executed by a thread in the same
341 wavefront.
342
343 ``workgroup`` Synchronizes with, and participates in modification and
344 seq_cst total orderings with, other operations (except
345 image operations) for all address spaces (except private,
346 or generic that accesses private) provided the other
347 operation's sync scope is:
348
349 - ``system``, ``agent`` or ``workgroup`` and executed by a
350 thread in the same workgroup.
351 - ``wavefront`` and executed by a thread in the same
352 wavefront.
353
354 ``wavefront`` Synchronizes with, and participates in modification and
355 seq_cst total orderings with, other operations (except
356 image operations) for all address spaces (except private,
357 or generic that accesses private) provided the other
358 operation's sync scope is:
359
360 - ``system``, ``agent``, ``workgroup`` or ``wavefront``
361 and executed by a thread in the same wavefront.
362
363 ``singlethread`` Only synchronizes with, and participates in modification
364 and seq_cst total orderings with, other operations (except
365 image operations) running in the same thread for all
366 address spaces (for example, in signal handlers).
367 ================ ==========================================================
368
369AMDGPU Intrinsics
370-----------------
371
372The AMDGPU backend implements the following intrinsics.
373
374*This section is WIP.*
375
376.. TODO
377 List AMDGPU intrinsics
378
379Code Object
380===========
381
382The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
383can be linked by ``lld`` to produce a standard ELF shared code object which can
384be loaded and executed on an AMDGPU target.
385
386Header
387------
388
389The AMDGPU backend uses the following ELF header:
390
391 .. table:: AMDGPU ELF Header
392 :name: amdgpu-elf-header-table
393
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000394 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000395 Field Value
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000396 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000397 ``e_ident[EI_CLASS]`` ``ELFCLASS64``
398 ``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye07d9f102017-11-10 01:00:54 +0000399 ``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
400 - ``ELFOSABI_AMDGPU_HSA``
401 - ``ELFOSABI_AMDGPU_PAL``
402 - ``ELFOSABI_AMDGPU_MESA3D``
403 ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
404 - ``ELFABIVERSION_AMDGPU_PAL``
405 - ``ELFABIVERSION_AMDGPU_MESA3D``
406 ``e_type`` - ``ET_REL``
407 - ``ET_DYN``
Tony Tyef16a45e2017-06-06 20:31:59 +0000408 ``e_machine`` ``EM_AMDGPU``
409 ``e_entry`` 0
Tony Tye07d9f102017-11-10 01:00:54 +0000410 ``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000411 ========================== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000412
413..
414
415 .. table:: AMDGPU ELF Header Enumeration Values
416 :name: amdgpu-elf-header-enumeration-values-table
417
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000418 =============================== =====
419 Name Value
420 =============================== =====
421 ``EM_AMDGPU`` 224
Tony Tye07d9f102017-11-10 01:00:54 +0000422 ``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000423 ``ELFOSABI_AMDGPU_HSA`` 64
424 ``ELFOSABI_AMDGPU_PAL`` 65
425 ``ELFOSABI_AMDGPU_MESA3D`` 66
426 ``ELFABIVERSION_AMDGPU_HSA`` 1
427 ``ELFABIVERSION_AMDGPU_PAL`` 0
428 ``ELFABIVERSION_AMDGPU_MESA3D`` 0
429 =============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000430
431``e_ident[EI_CLASS]``
Tony Tye07d9f102017-11-10 01:00:54 +0000432 The ELF class is:
433
434 * ``ELFCLASS32`` for ``r600`` architecture.
435
436 * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
437 bit applications.
Tony Tyef16a45e2017-06-06 20:31:59 +0000438
439``e_ident[EI_DATA]``
Tony Tye07d9f102017-11-10 01:00:54 +0000440 All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tyef16a45e2017-06-06 20:31:59 +0000441
442``e_ident[EI_OSABI]``
Tony Tye07d9f102017-11-10 01:00:54 +0000443 One of the following AMD GPU architecture specific OS ABIs
444 (see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000445
Tony Tye07d9f102017-11-10 01:00:54 +0000446 * ``ELFOSABI_NONE`` for *unknown* OS.
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000447
Tony Tye07d9f102017-11-10 01:00:54 +0000448 * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tyef16a45e2017-06-06 20:31:59 +0000449
Tony Tye07d9f102017-11-10 01:00:54 +0000450 * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
451
452 * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000453
Tony Tyef16a45e2017-06-06 20:31:59 +0000454``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyova952b442017-10-03 20:54:07 +0000455 The ABI version of the AMD GPU architecture specific OS ABI to which the code
456 object conforms:
457
458 * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
459 runtime ABI.
460
461 * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
462 runtime ABI.
Tony Tyef16a45e2017-06-06 20:31:59 +0000463
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000464 * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye07d9f102017-11-10 01:00:54 +0000465 3D runtime ABI.
Konstantin Zhuravlyov0aa94d32017-10-03 21:14:14 +0000466
Tony Tyef16a45e2017-06-06 20:31:59 +0000467``e_type``
468 Can be one of the following values:
469
470
471 ``ET_REL``
472 The type produced by the AMD GPU backend compiler as it is relocatable code
473 object.
474
475 ``ET_DYN``
476 The type produced by the linker as it is a shared code object.
477
478 The AMD HSA runtime loader requires a ``ET_DYN`` code object.
479
480``e_machine``
Tony Tye07d9f102017-11-10 01:00:54 +0000481 The value ``EM_AMDGPU`` is used for the machine for all processors supported
482 by the ``r600`` and ``amdgcn`` architectures (see
483 :ref:`amdgpu-processor-table`). The specific processor is specified in the
484 ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
485 :ref:`amdgpu-elf-header-e_flags-table`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000486
487``e_entry``
488 The entry point is 0 as the entry points for individual kernels must be
489 selected in order to invoke them through AQL packets.
490
491``e_flags``
Tony Tye07d9f102017-11-10 01:00:54 +0000492 The AMDGPU backend uses the following ELF header flags:
493
494 .. table:: AMDGPU ELF Header ``e_flags``
495 :name: amdgpu-elf-header-e_flags-table
496
497 ================================= ========== =============================
498 Name Value Description
499 ================================= ========== =============================
500 **AMDGPU Processor Flag** See :ref:`amdgpu-processor-table`.
501 -------------------------------------------- -----------------------------
502 ``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
503 mask for
504 ``EF_AMDGPU_MACH_xxx`` values
505 defined in
506 :ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye31105cc2017-12-11 15:35:27 +0000507 ``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
508 target feature is
509 enabled for all code
510 contained in the code object.
Tony Tye5bbcca62018-03-08 05:46:01 +0000511 If the processor
512 does not support the
513 ``xnack`` target
514 feature then must
515 be 0.
Tony Tye31105cc2017-12-11 15:35:27 +0000516 See
517 :ref:`amdgpu-target-features`.
Tony Tye07d9f102017-11-10 01:00:54 +0000518 ================================= ========== =============================
519
520 .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
521 :name: amdgpu-ef-amdgpu-mach-table
522
523 ================================= ========== =============================
524 Name Value Description (see
525 :ref:`amdgpu-processor-table`)
526 ================================= ========== =============================
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000527 ``EF_AMDGPU_MACH_NONE`` 0x000 *not specified*
528 ``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
529 ``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
530 ``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
531 ``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
532 ``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
533 ``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
534 ``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
535 ``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
536 ``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
537 ``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
538 ``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
539 ``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
540 ``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
541 ``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
542 ``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
543 ``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
544 *reserved* 0x011 - Reserved for ``r600``
545 0x01f architecture processors.
546 ``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
547 ``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
548 ``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
549 ``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
550 ``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
551 ``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
552 ``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
553 *reserved* 0x027 Reserved.
554 ``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
555 ``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
556 ``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
557 ``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
558 ``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
559 ``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
Matt Arsenault0084adc2018-04-30 19:08:16 +0000560 ``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
561 ``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
Konstantin Zhuravlyov9122a632018-02-16 22:33:59 +0000562 *reserved* 0x030 Reserved.
Tony Tye07d9f102017-11-10 01:00:54 +0000563 ================================= ========== =============================
Tony Tyef16a45e2017-06-06 20:31:59 +0000564
565Sections
566--------
567
568An AMDGPU target ELF code object has the standard ELF sections which include:
569
570 .. table:: AMDGPU ELF Sections
571 :name: amdgpu-elf-sections-table
572
573 ================== ================ =================================
574 Name Type Attributes
575 ================== ================ =================================
576 ``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
577 ``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
578 ``.debug_``\ *\** ``SHT_PROGBITS`` *none*
579 ``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
580 ``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
581 ``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
582 ``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
583 ``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
584 ``.note`` ``SHT_NOTE`` *none*
585 ``.rela``\ *name* ``SHT_RELA`` *none*
586 ``.rela.dyn`` ``SHT_RELA`` *none*
587 ``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
588 ``.shstrtab`` ``SHT_STRTAB`` *none*
589 ``.strtab`` ``SHT_STRTAB`` *none*
590 ``.symtab`` ``SHT_SYMTAB`` *none*
591 ``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
592 ================== ================ =================================
593
594These sections have their standard meanings (see [ELF]_) and are only generated
595if needed.
596
597``.debug``\ *\**
598 The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
599 DWARF produced by the AMDGPU backend.
600
Tony Tye46d35762017-08-15 20:47:41 +0000601``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tyef16a45e2017-06-06 20:31:59 +0000602 The standard sections used by a dynamic loader.
603
604``.note``
605 See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
606 backend.
607
608``.rela``\ *name*, ``.rela.dyn``
609 For relocatable code objects, *name* is the name of the section that the
610 relocation records apply. For example, ``.rela.text`` is the section name for
611 relocation records associated with the ``.text`` section.
612
613 For linked shared code objects, ``.rela.dyn`` contains all the relocation
614 records from each of the relocatable code object's ``.rela``\ *name* sections.
615
616 See :ref:`amdgpu-relocation-records` for the relocation records supported by
617 the AMDGPU backend.
618
619``.text``
620 The executable machine code for the kernels and functions they call. Generated
621 as position independent code. See :ref:`amdgpu-code-conventions` for
622 information on conventions used in the isa generation.
623
624.. _amdgpu-note-records:
625
626Note Records
627------------
628
Tony Tye07d9f102017-11-10 01:00:54 +0000629As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
630be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
631aligned. In addition, minimal zero byte padding must be generated to ensure the
632``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
633``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tyef16a45e2017-06-06 20:31:59 +0000634
635The AMDGPU backend code object uses the following ELF note records in the
636``.note`` section. The *Description* column specifies the layout of the note
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000637record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tyef16a45e2017-06-06 20:31:59 +0000638variable size strings have a corresponding ``*_size`` field that specifies the
639number of bytes, including the terminating null character, in the string. The
640string(s) come immediately after the preceding fields.
641
642Additional note records can be present.
643
644 .. table:: AMDGPU ELF Note Records
645 :name: amdgpu-elf-note-records-table
646
Tony Tye46d35762017-08-15 20:47:41 +0000647 ===== ============================== ======================================
648 Name Type Description
649 ===== ============================== ======================================
650 "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye46d35762017-08-15 20:47:41 +0000651 ===== ============================== ======================================
Tony Tyef16a45e2017-06-06 20:31:59 +0000652
653..
654
655 .. table:: AMDGPU ELF Note Record Enumeration Values
656 :name: amdgpu-elf-note-record-enumeration-values-table
657
Tony Tye46d35762017-08-15 20:47:41 +0000658 ============================== =====
659 Name Value
660 ============================== =====
661 *reserved* 0-9
662 ``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye07d9f102017-11-10 01:00:54 +0000663 *reserved* 11
Tony Tye46d35762017-08-15 20:47:41 +0000664 ============================== =====
Tony Tyef16a45e2017-06-06 20:31:59 +0000665
Tony Tye46d35762017-08-15 20:47:41 +0000666``NT_AMD_AMDGPU_HSA_METADATA``
667 Specifies extensible metadata associated with the code objects executed on HSA
668 [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
669 the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
670 :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
671 object metadata string.
Tony Tyef16a45e2017-06-06 20:31:59 +0000672
Tony Tye46d35762017-08-15 20:47:41 +0000673.. _amdgpu-symbols:
674
675Symbols
676-------
677
678Symbols include the following:
679
680 .. table:: AMDGPU ELF Symbols
681 :name: amdgpu-elf-symbols-table
682
683 ===================== ============== ============= ==================
684 Name Type Section Description
685 ===================== ============== ============= ==================
686 *link-name* ``STT_OBJECT`` - ``.data`` Global variable
687 - ``.rodata``
688 - ``.bss``
689 *link-name*\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
690 *link-name* ``STT_FUNC`` - ``.text`` Kernel entry point
691 ===================== ============== ============= ==================
692
693Global variable
694 Global variables both used and defined by the compilation unit.
695
696 If the symbol is defined in the compilation unit then it is allocated in the
697 appropriate section according to if it has initialized data or is readonly.
698
699 If the symbol is external then its section is ``STN_UNDEF`` and the loader
700 will resolve relocations using the definition provided by another code object
701 or explicitly defined by the runtime.
702
703 All global symbols, whether defined in the compilation unit or external, are
704 accessed by the machine code indirectly through a GOT table entry. This
705 allows them to be preemptable. The GOT table is only supported when the target
706 triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +0000707
708 .. TODO
Tony Tye46d35762017-08-15 20:47:41 +0000709 Add description of linked shared object symbols. Seems undefined symbols
710 are marked as STT_NOTYPE.
Tony Tyef16a45e2017-06-06 20:31:59 +0000711
Tony Tye46d35762017-08-15 20:47:41 +0000712Kernel descriptor
713 Every HSA kernel has an associated kernel descriptor. It is the address of the
714 kernel descriptor that is used in the AQL dispatch packet used to invoke the
715 kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
716 defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
717
718Kernel entry point
719 Every HSA kernel also has a symbol for its machine code entry point.
720
721.. _amdgpu-relocation-records:
722
723Relocation Records
724------------------
725
726AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
727relocatable fields are:
728
729``word32``
730 This specifies a 32-bit field occupying 4 bytes with arbitrary byte
731 alignment. These values use the same byte order as other word values in the
732 AMD GPU architecture.
733
734``word64``
735 This specifies a 64-bit field occupying 8 bytes with arbitrary byte
736 alignment. These values use the same byte order as other word values in the
737 AMD GPU architecture.
738
739Following notations are used for specifying relocation calculations:
740
741**A**
742 Represents the addend used to compute the value of the relocatable field.
743
744**G**
745 Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +0000746 entry's symbol will reside during execution.
Tony Tye46d35762017-08-15 20:47:41 +0000747
748**GOT**
749 Represents the address of the global offset table.
750
751**P**
752 Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
753 of the storage unit being relocated (computed using ``r_offset``).
754
755**S**
756 Represents the value of the symbol whose index resides in the relocation
Tony Tyed2884302017-10-16 20:44:29 +0000757 entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
758
759**B**
760 Represents the base address of a loaded executable or shared object which is
761 the difference between the ELF address and the actual load address. Relocations
762 using this are only valid in executable or shared objects.
Tony Tye46d35762017-08-15 20:47:41 +0000763
764The following relocation types are supported:
765
766 .. table:: AMDGPU ELF Relocation Records
767 :name: amdgpu-elf-relocation-records-table
768
Tony Tyedb6c9932018-01-30 23:59:43 +0000769 ========================== ======= ===== ========== ==============================
770 Relocation Type Kind Value Field Calculation
771 ========================== ======= ===== ========== ==============================
772 ``R_AMDGPU_NONE`` 0 *none* *none*
Tony Tye223f4c72018-04-13 01:01:27 +0000773 ``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
774 Dynamic
775 ``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
776 Dynamic
777 ``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
Matt Arsenault0084adc2018-04-30 19:08:16 +0000778 Dynamic
Tony Tyedb6c9932018-01-30 23:59:43 +0000779 ``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
780 ``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
Tony Tye223f4c72018-04-13 01:01:27 +0000781 ``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
782 Dynamic
Tony Tyedb6c9932018-01-30 23:59:43 +0000783 ``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
784 ``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
785 ``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
786 ``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
787 ``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
788 *reserved* 12
789 ``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
790 ========================== ======= ===== ========== ==============================
Tony Tye46d35762017-08-15 20:47:41 +0000791
Tony Tye223f4c72018-04-13 01:01:27 +0000792``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
793the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
794
795There is no current OS loader support for 32 bit programs and so
796``R_AMDGPU_ABS32`` is not used.
Matt Arsenault0084adc2018-04-30 19:08:16 +0000797
Tony Tye46d35762017-08-15 20:47:41 +0000798.. _amdgpu-dwarf:
799
800DWARF
801-----
802
Scott Linder16c7bda2018-02-23 23:01:06 +0000803Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
Tony Tye46d35762017-08-15 20:47:41 +0000804information that maps the code object executable code and data to the source
805language constructs. It can be used by tools such as debuggers and profilers.
806
807Address Space Mapping
808~~~~~~~~~~~~~~~~~~~~~
809
810The following address space mapping is used:
811
812 .. table:: AMDGPU DWARF Address Space Mapping
813 :name: amdgpu-dwarf-address-space-mapping-table
814
815 =================== =================
816 DWARF Address Space Memory Space
817 =================== =================
818 1 Private (Scratch)
819 2 Local (group/LDS)
820 *omitted* Global
821 *omitted* Constant
822 *omitted* Generic (Flat)
823 *not supported* Region (GDS)
824 =================== =================
825
826See :ref:`amdgpu-address-spaces` for information on the memory space terminology
827used in the table.
828
829An ``address_class`` attribute is generated on pointer type DIEs to specify the
830DWARF address space of the value of the pointer when it is in the *private* or
831*local* address space. Otherwise the attribute is omitted.
832
833An ``XDEREF`` operation is generated in location list expressions for variables
834that are allocated in the *private* and *local* address space. Otherwise no
835``XDREF`` is omitted.
836
837Register Mapping
838~~~~~~~~~~~~~~~~
839
840*This section is WIP.*
841
842.. TODO
843 Define DWARF register enumeration.
844
845 If want to present a wavefront state then should expose vector registers as
846 64 wide (rather than per work-item view that LLVM uses). Either as separate
847 registers, or a 64x4 byte single register. In either case use a new LANE op
848 (akin to XDREF) to select the current lane usage in a location
849 expression. This would also allow scalar register spilling to vector register
850 lanes to be expressed (currently no debug information is being generated for
851 spilling). If choose a wide single register approach then use LANE in
852 conjunction with PIECE operation to select the dword part of the register for
853 the current lane. If the separate register approach then use LANE to select
854 the register.
855
856Source Text
857~~~~~~~~~~~
858
Scott Linder16c7bda2018-02-23 23:01:06 +0000859Source text for online-compiled programs (e.g. those compiled by the OpenCL
860runtime) may be embedded into the DWARF v5 line table using the ``clang
861-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye46d35762017-08-15 20:47:41 +0000862
Scott Linder16c7bda2018-02-23 23:01:06 +0000863For example:
864
865``-gembed-source``
866 Enable the embedded source DWARF v5 extension.
867``-gno-embed-source``
868 Disable the embedded source DWARF v5 extension.
869
870 .. table:: AMDGPU Debug Options
871 :name: amdgpu-debug-options
872
873 ==================== ==================================================
874 Debug Flag Description
875 ==================== ==================================================
876 -g[no-]embed-source Enable/disable embedding source text in DWARF
877 debug sections. Useful for environments where
878 source cannot be written to disk, such as
879 when performing online compilation.
880 ==================== ==================================================
881
882This option enables one extended content types in the DWARF v5 Line Number
883Program Header, which is used to encode embedded source.
884
885 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
886 :name: amdgpu-dwarf-extended-content-types
887
888 ============================ ======================
889 Content Type Form
890 ============================ ======================
891 ``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
892 ============================ ======================
893
894The source field will contain the UTF-8 encoded, null-terminated source text
895with ``'\n'`` line endings. When the source field is present, consumers can use
896the embedded source instead of attempting to discover the source on disk. When
897the source field is absent, consumers can access the file to get the source
898text.
899
900The above content type appears in the ``file_name_entry_format`` field of the
901line table prologue, and its corresponding value appear in the ``file_names``
902field. The current encoding of the content type is documented in table
903:ref:`amdgpu-dwarf-extended-content-types-encoding`
904
905 .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
906 :name: amdgpu-dwarf-extended-content-types-encoding
907
908 ============================ ====================
909 Content Type Value
910 ============================ ====================
911 ``DW_LNCT_LLVM_source`` 0x2001
912 ============================ ====================
Tony Tye46d35762017-08-15 20:47:41 +0000913
914.. _amdgpu-code-conventions:
915
916Code Conventions
917================
918
919This section provides code conventions used for each supported target triple OS
920(see :ref:`amdgpu-target-triples`).
921
922AMDHSA
923------
924
925This section provides code conventions used when the target triple OS is
926``amdhsa`` (see :ref:`amdgpu-target-triples`).
927
928.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tyef16a45e2017-06-06 20:31:59 +0000929
Tony Tye01bfd6c2018-03-27 21:20:46 +0000930Code Object Target Identification
931~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
932
933The AMDHSA OS uses the following syntax to specify the code object
934target as a single string:
935
936 ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
937
938Where:
939
940 - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
941 are the same as the *Target Triple* (see
942 :ref:`amdgpu-target-triples`).
943
944 - ``<Processor>`` is the same as the *Processor* (see
945 :ref:`amdgpu-processors`).
946
947 - ``<Target Features>`` is a list of the enabled *Target Features*
948 (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
949 apply to *Processor*. The list must be in the same order as listed
950 in the table :ref:`amdgpu-target-feature-table`. Note that *Target
951 Features* must be included in the list if they are enabled even if
952 that is the default for *Processor*.
953
954For example:
955
956 ``"amdgcn-amd-amdhsa--gfx902+xnack"``
957
Tony Tyef16a45e2017-06-06 20:31:59 +0000958Code Object Metadata
Tony Tye46d35762017-08-15 20:47:41 +0000959~~~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +0000960
Tony Tye46d35762017-08-15 20:47:41 +0000961The code object metadata specifies extensible metadata associated with the code
962objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
963[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
964(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
965``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
966information necessary to support the ROCM kernel queries. For example, the
967segment sizes needed in a dispatch packet. In addition, a high level language
968runtime may require other information to be included. For example, the AMD
969OpenCL runtime records kernel argument information.
Tony Tyef16a45e2017-06-06 20:31:59 +0000970
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +0000971The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tyef16a45e2017-06-06 20:31:59 +0000972:doc:`YamlIO`).
973
Tony Tye46d35762017-08-15 20:47:41 +0000974.. TODO
975 Is the string null terminated? It probably should not if YAML allows it to
976 contain null characters, otherwise it should be.
977
Tony Tyef16a45e2017-06-06 20:31:59 +0000978The metadata is represented as a single YAML document comprised of the mapping
979defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
980referenced tables.
981
982For boolean values, the string values of ``false`` and ``true`` are used for
983false and true respectively.
984
985Additional information can be added to the mappings. To avoid conflicts, any
986non-AMD key names should be prefixed by "*vendor-name*.".
987
988 .. table:: AMDHSA Code Object Metadata Mapping
989 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
990
991 ========== ============== ========= =======================================
992 String Key Value Type Required? Description
993 ========== ============== ========= =======================================
994 "Version" sequence of Required - The first integer is the major
995 2 integers version. Currently 1.
996 - The second integer is the minor
997 version. Currently 0.
998 "Printf" sequence of Each string is encoded information
999 strings about a printf function call. The
1000 encoded information is organized as
1001 fields separated by colon (':'):
1002
1003 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
1004
1005 where:
1006
1007 ``ID``
1008 A 32 bit integer as a unique id for
1009 each printf function call
1010
1011 ``N``
1012 A 32 bit integer equal to the number
1013 of arguments of printf function call
1014 minus 1
1015
1016 ``S[i]`` (where i = 0, 1, ... , N-1)
1017 32 bit integers for the size in bytes
1018 of the i-th FormatString argument of
1019 the printf function call
1020
1021 FormatString
1022 The format string passed to the
1023 printf function call.
1024 "Kernels" sequence of Required Sequence of the mappings for each
1025 mapping kernel in the code object. See
1026 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
1027 for the definition of the mapping.
1028 ========== ============== ========= =======================================
1029
1030..
1031
1032 .. table:: AMDHSA Code Object Kernel Metadata Mapping
1033 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
1034
1035 ================= ============== ========= ================================
1036 String Key Value Type Required? Description
1037 ================= ============== ========= ================================
1038 "Name" string Required Source name of the kernel.
1039 "SymbolName" string Required Name of the kernel
1040 descriptor ELF symbol.
1041 "Language" string Source language of the kernel.
1042 Values include:
1043
1044 - "OpenCL C"
1045 - "OpenCL C++"
1046 - "HCC"
1047 - "OpenMP"
1048
1049 "LanguageVersion" sequence of - The first integer is the major
1050 2 integers version.
1051 - The second integer is the
1052 minor version.
1053 "Attrs" mapping Mapping of kernel attributes.
1054 See
1055 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1056 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001057 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001058 mapping kernel arguments. See
1059 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1060 for the definition of the mapping.
1061 "CodeProps" mapping Mapping of properties related to
1062 the kernel code. See
1063 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1064 for the mapping definition.
Tony Tyef16a45e2017-06-06 20:31:59 +00001065 ================= ============== ========= ================================
1066
1067..
1068
1069 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1070 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1071
1072 =================== ============== ========= ==============================
1073 String Key Value Type Required? Description
1074 =================== ============== ========= ==============================
Tony Tyee039d0e2018-01-30 23:07:10 +00001075 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
1076 3 integers must be >=1 and the dispatch
1077 work-group size X, Y, Z must
1078 correspond to the specified
1079 values. Defaults to 0, 0, 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001080
1081 Corresponds to the OpenCL
1082 ``reqd_work_group_size``
1083 attribute.
1084 "WorkGroupSizeHint" sequence of The dispatch work-group size
1085 3 integers X, Y, Z is likely to be the
1086 specified values.
1087
1088 Corresponds to the OpenCL
1089 ``work_group_size_hint``
1090 attribute.
1091 "VecTypeHint" string The name of a scalar or vector
1092 type.
1093
1094 Corresponds to the OpenCL
1095 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +00001096
1097 "RuntimeHandle" string The external symbol name
1098 associated with a kernel.
1099 OpenCL runtime allocates a
1100 global buffer for the symbol
1101 and saves the kernel's address
1102 to it, which is used for
1103 device side enqueueing. Only
1104 available for device side
1105 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +00001106 =================== ============== ========= ==============================
1107
1108..
1109
1110 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1111 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1112
1113 ================= ============== ========= ================================
1114 String Key Value Type Required? Description
1115 ================= ============== ========= ================================
1116 "Name" string Kernel argument name.
1117 "TypeName" string Kernel argument type name.
1118 "Size" integer Required Kernel argument size in bytes.
1119 "Align" integer Required Kernel argument alignment in
1120 bytes. Must be a power of two.
1121 "ValueKind" string Required Kernel argument kind that
1122 specifies how to set up the
1123 corresponding argument.
1124 Values include:
1125
1126 "ByValue"
1127 The argument is copied
1128 directly into the kernarg.
1129
1130 "GlobalBuffer"
1131 A global address space pointer
1132 to the buffer data is passed
1133 in the kernarg.
1134
1135 "DynamicSharedPointer"
1136 A group address space pointer
1137 to dynamically allocated LDS
1138 is passed in the kernarg.
1139
1140 "Sampler"
1141 A global address space
1142 pointer to a S# is passed in
1143 the kernarg.
1144
1145 "Image"
1146 A global address space
1147 pointer to a T# is passed in
1148 the kernarg.
1149
1150 "Pipe"
1151 A global address space pointer
1152 to an OpenCL pipe is passed in
1153 the kernarg.
1154
1155 "Queue"
1156 A global address space pointer
1157 to an OpenCL device enqueue
1158 queue is passed in the
1159 kernarg.
1160
1161 "HiddenGlobalOffsetX"
1162 The OpenCL grid dispatch
1163 global offset for the X
1164 dimension is passed in the
1165 kernarg.
1166
1167 "HiddenGlobalOffsetY"
1168 The OpenCL grid dispatch
1169 global offset for the Y
1170 dimension is passed in the
1171 kernarg.
1172
1173 "HiddenGlobalOffsetZ"
1174 The OpenCL grid dispatch
1175 global offset for the Z
1176 dimension is passed in the
1177 kernarg.
1178
1179 "HiddenNone"
1180 An argument that is not used
1181 by the kernel. Space needs to
1182 be left for it, but it does
1183 not need to be set up.
1184
1185 "HiddenPrintfBuffer"
1186 A global address space pointer
1187 to the runtime printf buffer
1188 is passed in kernarg.
1189
1190 "HiddenDefaultQueue"
1191 A global address space pointer
1192 to the OpenCL device enqueue
1193 queue that should be used by
1194 the kernel by default is
1195 passed in the kernarg.
1196
1197 "HiddenCompletionAction"
Yaxun Liuc928f2a2017-10-30 14:30:28 +00001198 A global address space pointer
1199 to help link enqueued kernels into
1200 the ancestor tree for determining
1201 when the parent kernel has finished.
Tony Tyef16a45e2017-06-06 20:31:59 +00001202
1203 "ValueType" string Required Kernel argument value type. Only
1204 present if "ValueKind" is
1205 "ByValue". For vector data
1206 types, the value is for the
1207 element type. Values include:
1208
1209 - "Struct"
1210 - "I8"
1211 - "U8"
1212 - "I16"
1213 - "U16"
1214 - "F16"
1215 - "I32"
1216 - "U32"
1217 - "F32"
1218 - "I64"
1219 - "U64"
1220 - "F64"
1221
1222 .. TODO
1223 How can it be determined if a
1224 vector type, and what size
1225 vector?
1226 "PointeeAlign" integer Alignment in bytes of pointee
1227 type for pointer type kernel
1228 argument. Must be a power
1229 of 2. Only present if
1230 "ValueKind" is
1231 "DynamicSharedPointer".
1232 "AddrSpaceQual" string Kernel argument address space
1233 qualifier. Only present if
1234 "ValueKind" is "GlobalBuffer" or
1235 "DynamicSharedPointer". Values
1236 are:
1237
1238 - "Private"
1239 - "Global"
1240 - "Constant"
1241 - "Local"
1242 - "Generic"
1243 - "Region"
1244
1245 .. TODO
1246 Is GlobalBuffer only Global
1247 or Constant? Is
1248 DynamicSharedPointer always
1249 Local? Can HCC allow Generic?
1250 How can Private or Region
1251 ever happen?
1252 "AccQual" string Kernel argument access
1253 qualifier. Only present if
1254 "ValueKind" is "Image" or
1255 "Pipe". Values
1256 are:
1257
1258 - "ReadOnly"
1259 - "WriteOnly"
1260 - "ReadWrite"
1261
1262 .. TODO
1263 Does this apply to
1264 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001265 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001266 performed by the kernel on the
1267 kernel argument. Only present if
1268 "ValueKind" is "GlobalBuffer",
1269 "Image", or "Pipe". This may be
1270 more restrictive than indicated
1271 by "AccQual" to reflect what the
1272 kernel actual does. If not
1273 present then the runtime must
1274 assume what is implied by
1275 "AccQual" and "IsConst". Values
1276 are:
1277
1278 - "ReadOnly"
1279 - "WriteOnly"
1280 - "ReadWrite"
1281
1282 "IsConst" boolean Indicates if the kernel argument
1283 is const qualified. Only present
1284 if "ValueKind" is
1285 "GlobalBuffer".
1286
1287 "IsRestrict" boolean Indicates if the kernel argument
1288 is restrict qualified. Only
1289 present if "ValueKind" is
1290 "GlobalBuffer".
1291
1292 "IsVolatile" boolean Indicates if the kernel argument
1293 is volatile qualified. Only
1294 present if "ValueKind" is
1295 "GlobalBuffer".
1296
1297 "IsPipe" boolean Indicates if the kernel argument
1298 is pipe qualified. Only present
1299 if "ValueKind" is "Pipe".
1300
1301 .. TODO
1302 Can GlobalBuffer be pipe
1303 qualified?
1304 ================= ============== ========= ================================
1305
1306..
1307
1308 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1309 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1310
1311 ============================ ============== ========= =====================
1312 String Key Value Type Required? Description
1313 ============================ ============== ========= =====================
1314 "KernargSegmentSize" integer Required The size in bytes of
1315 the kernarg segment
1316 that holds the values
1317 of the arguments to
1318 the kernel.
1319 "GroupSegmentFixedSize" integer Required The amount of group
1320 segment memory
1321 required by a
1322 work-group in
1323 bytes. This does not
1324 include any
1325 dynamically allocated
1326 group segment memory
1327 that may be added
1328 when the kernel is
1329 dispatched.
1330 "PrivateSegmentFixedSize" integer Required The amount of fixed
1331 private address space
1332 memory required for a
1333 work-item in
Tony Tye07d9f102017-11-10 01:00:54 +00001334 bytes. If the kernel
1335 uses a dynamic call
1336 stack then additional
Tony Tyef16a45e2017-06-06 20:31:59 +00001337 space must be added
1338 to this value for the
1339 call stack.
1340 "KernargSegmentAlign" integer Required The maximum byte
1341 alignment of
1342 arguments in the
1343 kernarg segment. Must
1344 be a power of 2.
1345 "WavefrontSize" integer Required Wavefront size. Must
1346 be a power of 2.
Tony Tye07d9f102017-11-10 01:00:54 +00001347 "NumSGPRs" integer Required Number of scalar
Tony Tyef16a45e2017-06-06 20:31:59 +00001348 registers used by a
1349 wavefront for
1350 GFX6-GFX9. This
1351 includes the special
1352 SGPRs for VCC, Flat
1353 Scratch (GFX7-GFX9)
1354 and XNACK (for
1355 GFX8-GFX9). It does
1356 not include the 16
1357 SGPR added if a trap
1358 handler is
1359 enabled. It is not
1360 rounded up to the
1361 allocation
1362 granularity.
Tony Tye07d9f102017-11-10 01:00:54 +00001363 "NumVGPRs" integer Required Number of vector
Tony Tyef16a45e2017-06-06 20:31:59 +00001364 registers used by
1365 each work-item for
1366 GFX6-GFX9
Tony Tye07d9f102017-11-10 01:00:54 +00001367 "MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001368 work-group size
1369 supported by the
1370 kernel in work-items.
Tony Tye07d9f102017-11-10 01:00:54 +00001371 Must be >=1 and
Tony Tyee039d0e2018-01-30 23:07:10 +00001372 consistent with
1373 ReqdWorkGroupSize if
1374 not 0, 0, 0.
Konstantin Zhuravlyov06ae4ec2017-11-28 17:51:08 +00001375 "NumSpilledSGPRs" integer Number of stores from
1376 a scalar register to
1377 a register allocator
1378 created spill
1379 location.
1380 "NumSpilledVGPRs" integer Number of stores from
1381 a vector register to
1382 a register allocator
1383 created spill
1384 location.
Tony Tyef16a45e2017-06-06 20:31:59 +00001385 ============================ ============== ========= =====================
1386
1387..
1388
Tony Tyef16a45e2017-06-06 20:31:59 +00001389Kernel Dispatch
1390~~~~~~~~~~~~~~~
1391
1392The HSA architected queuing language (AQL) defines a user space memory interface
1393that can be used to control the dispatch of kernels, in an agent independent
1394way. An agent can have zero or more AQL queues created for it using the ROCm
1395runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
1396*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
1397mechanics and packet layouts.
1398
1399The packet processor of a kernel agent is responsible for detecting and
1400dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
1401packet processor is implemented by the hardware command processor (CP),
1402asynchronous dispatch controller (ADC) and shader processor input controller
1403(SPI).
1404
1405The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
1406mode driver to initialize and register the AQL queue with CP.
1407
1408To dispatch a kernel the following actions are performed. This can occur in the
1409CPU host program, or from an HSA kernel executing on a GPU.
1410
14111. A pointer to an AQL queue for the kernel agent on which the kernel is to be
1412 executed is obtained.
14132. A pointer to the kernel descriptor (see
1414 :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
1415 obtained. It must be for a kernel that is contained in a code object that that
1416 was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
1417 associated.
14183. Space is allocated for the kernel arguments using the ROCm runtime allocator
1419 for a memory region with the kernarg property for the kernel agent that will
1420 execute the kernel. It must be at least 16 byte aligned.
14214. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00001422 allocation. The layout is defined in the *HSA Programmer's Language Reference*
Tony Tyef16a45e2017-06-06 20:31:59 +00001423 [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
1424 memory in the same way constant memory is accessed. (Note that the HSA
1425 specification allows an implementation to copy the kernel argument contents to
1426 another location that is accessed by the kernel.)
14275. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
1428 api uses 64 bit atomic operations to reserve space in the AQL queue for the
1429 packet. The packet must be set up, and the final write must use an atomic
1430 store release to set the packet kind to ensure the packet contents are
1431 visible to the kernel agent. AQL defines a doorbell signal mechanism to
1432 notify the kernel agent that the AQL queue has been updated. These rules, and
1433 the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
1434 System Architecture Specification* [HSA]_.
14356. A kernel dispatch packet includes information about the actual dispatch,
1436 such as grid and work-group size, together with information from the code
1437 object about the kernel, such as segment sizes. The ROCm runtime queries on
1438 the kernel symbol can be used to obtain the code object values which are
Tony Tye46d35762017-08-15 20:47:41 +00001439 recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tyef16a45e2017-06-06 20:31:59 +000014407. CP executes micro-code and is responsible for detecting and setting up the
1441 GPU to execute the wavefronts of a kernel dispatch.
14428. CP ensures that when the a wavefront starts executing the kernel machine
1443 code, the scalar general purpose registers (SGPR) and vector general purpose
1444 registers (VGPR) are set up as required by the machine code. The required
1445 setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
1446 register state is defined in
1447 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
14489. The prolog of the kernel machine code (see
1449 :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
1450 before continuing executing the machine code that corresponds to the kernel.
145110. When the kernel dispatch has completed execution, CP signals the completion
1452 signal specified in the kernel dispatch packet if not 0.
1453
1454.. _amdgpu-amdhsa-memory-spaces:
1455
1456Memory Spaces
1457~~~~~~~~~~~~~
1458
1459The memory space properties are:
1460
1461 .. table:: AMDHSA Memory Spaces
1462 :name: amdgpu-amdhsa-memory-spaces-table
1463
1464 ================= =========== ======== ======= ==================
1465 Memory Space Name HSA Segment Hardware Address NULL Value
1466 Name Name Size
1467 ================= =========== ======== ======= ==================
1468 Private private scratch 32 0x00000000
1469 Local group LDS 32 0xFFFFFFFF
1470 Global global global 64 0x0000000000000000
1471 Constant constant *same as 64 0x0000000000000000
1472 global*
1473 Generic flat flat 64 0x0000000000000000
1474 Region N/A GDS 32 *not implemented
1475 for AMDHSA*
1476 ================= =========== ======== ======= ==================
1477
1478The global and constant memory spaces both use global virtual addresses, which
1479are the same virtual address space used by the CPU. However, some virtual
1480addresses may only be accessible to the CPU, some only accessible by the GPU,
1481and some by both.
1482
1483Using the constant memory space indicates that the data will not change during
1484the execution of the kernel. This allows scalar read instructions to be
1485used. The vector and scalar L1 caches are invalidated of volatile data before
1486each kernel dispatch execution to allow constant memory to change values between
1487kernel dispatches.
1488
1489The local memory space uses the hardware Local Data Store (LDS) which is
1490automatically allocated when the hardware creates work-groups of wavefronts, and
1491freed when all the wavefronts of a work-group have terminated. The data store
1492(DS) instructions can be used to access it.
1493
1494The private memory space uses the hardware scratch memory support. If the kernel
1495uses scratch, then the hardware allocates memory that is accessed using
1496wavefront lane dword (4 byte) interleaving. The mapping used from private
1497address to physical address is:
1498
1499 ``wavefront-scratch-base +
1500 (private-address * wavefront-size * 4) +
1501 (wavefront-lane-id * 4)``
1502
1503There are different ways that the wavefront scratch base address is determined
1504by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1505memory can be accessed in an interleaved manner using buffer instruction with
Tony Tye5bbcca62018-03-08 05:46:01 +00001506the scratch buffer descriptor and per wavefront scratch offset, by the scratch
Tony Tyef16a45e2017-06-06 20:31:59 +00001507instructions, or by flat instructions. If each lane of a wavefront accesses the
1508same private address, the interleaving results in adjacent dwords being accessed
1509and hence requires fewer cache lines to be fetched. Multi-dword access is not
1510supported except by flat and scratch instructions in GFX9.
1511
1512The generic address space uses the hardware flat address support available in
1513GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
1514local appertures), that are outside the range of addressible global memory, to
1515map from a flat address to a private or local address.
1516
1517FLAT instructions can take a flat address and access global, private (scratch)
1518and group (LDS) memory depending in if the address is within one of the
1519apperture ranges. Flat access to scratch requires hardware aperture setup and
1520setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
1521access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
1522(see :ref:`amdgpu-amdhsa-m0`).
1523
1524To convert between a segment address and a flat address the base address of the
1525appertures address can be used. For GFX7-GFX8 these are available in the
1526:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
1527Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
1528GFX9 the appature base addresses are directly available as inline constant
1529registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
1530address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
1531which makes it easier to convert from flat to segment or segment to flat.
1532
Tony Tye46d35762017-08-15 20:47:41 +00001533Image and Samplers
1534~~~~~~~~~~~~~~~~~~
Tony Tyef16a45e2017-06-06 20:31:59 +00001535
1536Image and sample handles created by the ROCm runtime are 64 bit addresses of a
1537hardware 32 byte V# and 48 byte S# object respectively. In order to support the
1538HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
1539enumeration values for the queries that are not trivially deducible from the S#
1540representation.
1541
1542HSA Signals
1543~~~~~~~~~~~
1544
Tony Tye46d35762017-08-15 20:47:41 +00001545HSA signal handles created by the ROCm runtime are 64 bit addresses of a
1546structure allocated in memory accessible from both the CPU and GPU. The
1547structure is defined by the ROCm runtime and subject to change between releases
1548(see [AMD-ROCm-github]_).
Tony Tyef16a45e2017-06-06 20:31:59 +00001549
1550.. _amdgpu-amdhsa-hsa-aql-queue:
1551
1552HSA AQL Queue
1553~~~~~~~~~~~~~
1554
Tony Tye46d35762017-08-15 20:47:41 +00001555The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tyef16a45e2017-06-06 20:31:59 +00001556between releases (see [AMD-ROCm-github]_). For some processors it contains
1557fields needed to implement certain language features such as the flat address
1558aperture bases. It also contains fields used by CP such as managing the
1559allocation of scratch memory.
1560
1561.. _amdgpu-amdhsa-kernel-descriptor:
1562
1563Kernel Descriptor
1564~~~~~~~~~~~~~~~~~
1565
1566A kernel descriptor consists of the information needed by CP to initiate the
1567execution of a kernel, including the entry point address of the machine code
1568that implements the kernel.
1569
1570Kernel Descriptor for GFX6-GFX9
1571+++++++++++++++++++++++++++++++
1572
1573CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
1574
1575 .. table:: Kernel Descriptor for GFX6-GFX9
1576 :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
1577
Tony Tye6baa6d22017-10-18 22:16:55 +00001578 ======= ======= =============================== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00001579 Bits Size Field Name Description
Tony Tye6baa6d22017-10-18 22:16:55 +00001580 ======= ======= =============================== ============================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001581 31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tyef16a45e2017-06-06 20:31:59 +00001582 address space memory
1583 required for a work-group
1584 in bytes. This does not
1585 include any dynamically
1586 allocated local address
1587 space memory that may be
1588 added when the kernel is
1589 dispatched.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001590 63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tyef16a45e2017-06-06 20:31:59 +00001591 private address space
1592 memory required for a
1593 work-item in bytes. If
1594 is_dynamic_callstack is 1
1595 then additional space must
1596 be added to this value for
1597 the call stack.
Tony Tye07d9f102017-11-10 01:00:54 +00001598 127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001599 191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tyef16a45e2017-06-06 20:31:59 +00001600 negative) from base
1601 address of kernel
1602 descriptor to kernel's
1603 entry point instruction
1604 which must be 256 byte
1605 aligned.
Tony Tyee039d0e2018-01-30 23:07:10 +00001606 383:192 24 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001607 bytes
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001608 415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001609 program settings used by
1610 CP to set up
1611 ``COMPUTE_PGM_RSRC1``
1612 configuration
1613 register. See
Tony Tye6baa6d22017-10-18 22:16:55 +00001614 :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001615 447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tyef16a45e2017-06-06 20:31:59 +00001616 program settings used by
1617 CP to set up
1618 ``COMPUTE_PGM_RSRC2``
1619 configuration
1620 register. See
1621 :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001622 448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
1623 SGPR user data registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001624 (see
1625 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1626
1627 The total number of SGPR
1628 user data registers
1629 requested must not exceed
1630 16 and match value in
1631 ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
1632 Any requests beyond 16
1633 will be ignored.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001634 449 1 bit EnableSGPRDispatchPtr *see above*
1635 450 1 bit EnableSGPRQueuePtr *see above*
1636 451 1 bit EnableSGPRKernargSegmentPtr *see above*
1637 452 1 bit EnableSGPRDispatchID *see above*
1638 453 1 bit EnableSGPRFlatScratchInit *see above*
1639 454 1 bit EnableSGPRPrivateSegmentSize *see above*
1640 455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
1641 should always be 0.
1642 456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
1643 should always be 0.
1644 457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
1645 should always be 0.
Tony Tye31105cc2017-12-11 15:35:27 +00001646 463:458 6 bits Reserved, must be 0.
Tony Tye6baa6d22017-10-18 22:16:55 +00001647 511:464 6 Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001648 bytes
1649 512 **Total size 64 bytes.**
Tony Tye6baa6d22017-10-18 22:16:55 +00001650 ======= ====================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001651
1652..
1653
1654 .. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001655 :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tyef16a45e2017-06-06 20:31:59 +00001656
Tony Tye3b340612017-06-07 00:46:08 +00001657 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001658 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001659 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001660 5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001661 used by each work-item,
1662 granularity is device
1663 specific:
1664
Tony Tye07d9f102017-11-10 01:00:54 +00001665 GFX6-GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001666 - max_vgpr 1..256
1667 - roundup((max_vgpg + 1)
1668 / 4) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001669
1670 Used by CP to set up
1671 ``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001672 9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tyef16a45e2017-06-06 20:31:59 +00001673 used by a wavefront,
1674 granularity is device
1675 specific:
1676
Tony Tye07d9f102017-11-10 01:00:54 +00001677 GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001678 - max_sgpr 1..112
1679 - roundup((max_sgpg + 1)
1680 / 8) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001681 GFX9
Tony Tye6baa6d22017-10-18 22:16:55 +00001682 - max_sgpr 1..112
1683 - roundup((max_sgpg + 1)
1684 / 16) - 1
Tony Tyef16a45e2017-06-06 20:31:59 +00001685
1686 Includes the special SGPRs
1687 for VCC, Flat Scratch (for
1688 GFX7 onwards) and XNACK
1689 (for GFX8 onwards). It does
1690 not include the 16 SGPR
1691 added if a trap handler is
1692 enabled.
1693
1694 Used by CP to set up
1695 ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001696 11:10 2 bits PRIORITY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001697
1698 Start executing wavefront
1699 at the specified priority.
1700
1701 CP is responsible for
1702 filling in
1703 ``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001704 13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001705 with specified rounding
1706 mode for single (32
1707 bit) floating point
1708 precision floating point
1709 operations.
1710
1711 Floating point rounding
1712 mode values are defined in
1713 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1714
1715 Used by CP to set up
1716 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001717 15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001718 with specified rounding
1719 denorm mode for half/double (16
1720 and 64 bit) floating point
1721 precision floating point
1722 operations.
1723
1724 Floating point rounding
1725 mode values are defined in
1726 :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
1727
1728 Used by CP to set up
1729 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001730 17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001731 with specified denorm mode
1732 for single (32
1733 bit) floating point
1734 precision floating point
1735 operations.
1736
1737 Floating point denorm mode
1738 values are defined in
1739 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1740
1741 Used by CP to set up
1742 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001743 19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001744 with specified denorm mode
1745 for half/double (16
1746 and 64 bit) floating point
1747 precision floating point
1748 operations.
1749
1750 Floating point denorm mode
1751 values are defined in
1752 :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
1753
1754 Used by CP to set up
1755 ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001756 20 1 bit PRIV Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001757
1758 Start executing wavefront
1759 in privilege trap handler
1760 mode.
1761
1762 CP is responsible for
1763 filling in
1764 ``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001765 21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001766 with DX10 clamp mode
1767 enabled. Used by the vector
Tony Tye6baa6d22017-10-18 22:16:55 +00001768 ALU to force DX10 style
Tony Tyef16a45e2017-06-06 20:31:59 +00001769 treatment of NaN's (when
1770 set, clamp NaN to zero,
1771 otherwise pass NaN
1772 through).
1773
1774 Used by CP to set up
1775 ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001776 22 1 bit DEBUG_MODE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001777
1778 Start executing wavefront
1779 in single step mode.
1780
1781 CP is responsible for
1782 filling in
1783 ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001784 23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tyef16a45e2017-06-06 20:31:59 +00001785 with IEEE mode
1786 enabled. Floating point
1787 opcodes that support
1788 exception flag gathering
1789 will quiet and propagate
1790 signaling-NaN inputs per
1791 IEEE 754-2008. Min_dx10 and
1792 max_dx10 become IEEE
1793 754-2008 compliant due to
1794 signaling-NaN propagation
1795 and quieting.
1796
1797 Used by CP to set up
1798 ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001799 24 1 bit BULKY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001800
1801 Only one work-group allowed
1802 to execute on a compute
1803 unit.
1804
1805 CP is responsible for
1806 filling in
1807 ``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001808 25 1 bit CDBG_USER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001809
1810 Flag that can be used to
1811 control debugging code.
1812
1813 CP is responsible for
1814 filling in
1815 ``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye07d9f102017-11-10 01:00:54 +00001816 26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye6baa6d22017-10-18 22:16:55 +00001817 Reserved, must be 0.
1818 GFX9
1819 Wavefront starts execution
1820 with specified fp16 overflow
1821 mode.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001822
Tony Tye6baa6d22017-10-18 22:16:55 +00001823 - If 0, fp16 overflow generates
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001824 +/-INF values.
Tony Tye6baa6d22017-10-18 22:16:55 +00001825 - If 1, fp16 overflow that is the
1826 result of an +/-INF input value
1827 or divide by 0 produces a +/-INF,
1828 otherwise clamps computed
1829 overflow to +/-MAX_FP16 as
1830 appropriate.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001831
1832 Used by CP to set up
1833 ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye6baa6d22017-10-18 22:16:55 +00001834 31:27 5 bits Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001835 32 **Total size 4 bytes**
Tony Tye3b340612017-06-07 00:46:08 +00001836 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001837
1838..
1839
1840 .. table:: compute_pgm_rsrc2 for GFX6-GFX9
1841 :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
1842
Tony Tye3b340612017-06-07 00:46:08 +00001843 ======= ======= =============================== ===========================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001844 Bits Size Field Name Description
Tony Tye3b340612017-06-07 00:46:08 +00001845 ======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001846 0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye5bbcca62018-03-08 05:46:01 +00001847 _WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tyef16a45e2017-06-06 20:31:59 +00001848 system register (see
1849 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1850
1851 Used by CP to set up
1852 ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001853 5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tyef16a45e2017-06-06 20:31:59 +00001854 user data registers
1855 requested. This number must
1856 match the number of user
1857 data registers enabled.
1858
1859 Used by CP to set up
1860 ``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov2ca6b1f2018-05-29 19:09:13 +00001861 6 1 bit ENABLE_TRAP_HANDLER Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001862
Konstantin Zhuravlyov2ca6b1f2018-05-29 19:09:13 +00001863 This bit represents
1864 ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
1865 which is set by the CP if
1866 the runtime has installed a
1867 trap handler.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001868 7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001869 system SGPR register for
1870 the work-group id in the X
1871 dimension (see
1872 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1873
1874 Used by CP to set up
1875 ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001876 8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001877 system SGPR register for
1878 the work-group id in the Y
1879 dimension (see
1880 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1881
1882 Used by CP to set up
1883 ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001884 9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001885 system SGPR register for
1886 the work-group id in the Z
1887 dimension (see
1888 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1889
1890 Used by CP to set up
1891 ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001892 10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001893 system SGPR register for
1894 work-group information (see
1895 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
1896
1897 Used by CP to set up
1898 ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001899 12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001900 VGPR system registers used
1901 for the work-item ID.
1902 :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
1903 defines the values.
1904
1905 Used by CP to set up
1906 ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001907 13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001908
1909 Wavefront starts execution
1910 with address watch
1911 exceptions enabled which
1912 are generated when L1 has
1913 witnessed a thread access
1914 an *address of
1915 interest*.
1916
1917 CP is responsible for
1918 filling in the address
1919 watch bit in
1920 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1921 according to what the
1922 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001923 14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001924
1925 Wavefront starts execution
1926 with memory violation
1927 exceptions exceptions
1928 enabled which are generated
1929 when a memory violation has
Tony Tye5bbcca62018-03-08 05:46:01 +00001930 occurred for this wavefront from
Tony Tyef16a45e2017-06-06 20:31:59 +00001931 L1 or LDS
1932 (write-to-read-only-memory,
1933 mis-aligned atomic, LDS
1934 address out of range,
1935 illegal address, etc.).
1936
1937 CP sets the memory
1938 violation bit in
1939 ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
1940 according to what the
1941 runtime requests.
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001942 23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001943
1944 CP uses the rounded value
1945 from the dispatch packet,
1946 not this value, as the
1947 dispatch may contain
1948 dynamically allocated group
1949 segment memory. CP writes
1950 directly to
1951 ``COMPUTE_PGM_RSRC2.LDS_SIZE``.
1952
1953 Amount of group segment
1954 (LDS) to allocate for each
1955 work-group. Granularity is
1956 device specific:
1957
1958 GFX6:
1959 roundup(lds-size / (64 * 4))
1960 GFX7-GFX9:
1961 roundup(lds-size / (128 * 4))
1962
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001963 24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
1964 _INVALID_OPERATION with specified exceptions
Tony Tyef16a45e2017-06-06 20:31:59 +00001965 enabled.
1966
1967 Used by CP to set up
1968 ``COMPUTE_PGM_RSRC2.EXCP_EN``
1969 (set from bits 0..6).
1970
1971 IEEE 754 FP Invalid
1972 Operation
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001973 25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
1974 _SOURCE input operands is a
Tony Tyef16a45e2017-06-06 20:31:59 +00001975 denormal number
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001976 26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
1977 _DIVISION_BY_ZERO Zero
1978 27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
1979 _OVERFLOW
1980 28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
1981 _UNDERFLOW
1982 29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
1983 _INEXACT
1984 30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
1985 _ZERO (rcp_iflag_f32 instruction
Tony Tyef16a45e2017-06-06 20:31:59 +00001986 only)
Tony Tye6baa6d22017-10-18 22:16:55 +00001987 31 1 bit Reserved, must be 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001988 32 **Total size 4 bytes.**
Tony Tye3b340612017-06-07 00:46:08 +00001989 ======= ===================================================================================================================
Tony Tyef16a45e2017-06-06 20:31:59 +00001990
1991..
1992
1993 .. table:: Floating Point Rounding Mode Enumeration Values
1994 :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
1995
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00001996 ====================================== ===== ==============================
1997 Enumeration Name Value Description
1998 ====================================== ===== ==============================
1999 AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
2000 AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
2001 AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
2002 AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
2003 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002004
2005..
2006
2007 .. table:: Floating Point Denorm Mode Enumeration Values
2008 :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
2009
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002010 ====================================== ===== ==============================
2011 Enumeration Name Value Description
2012 ====================================== ===== ==============================
2013 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
2014 Denorms
2015 AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
2016 AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
2017 AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
2018 ====================================== ===== ==============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002019
2020..
2021
2022 .. table:: System VGPR Work-Item ID Enumeration Values
2023 :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
2024
Konstantin Zhuravlyov13376a42017-10-14 19:17:08 +00002025 ======================================== ===== ============================
2026 Enumeration Name Value Description
2027 ======================================== ===== ============================
2028 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
2029 ID.
2030 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
2031 dimensions ID.
2032 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
2033 dimensions ID.
2034 AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
2035 ======================================== ===== ============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002036
2037.. _amdgpu-amdhsa-initial-kernel-execution-state:
2038
2039Initial Kernel Execution State
2040~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2041
2042This section defines the register state that will be set up by the packet
2043processor prior to the start of execution of every wavefront. This is limited by
2044the constraints of the hardware controllers of CP/ADC/SPI.
2045
2046The order of the SGPR registers is defined, but the compiler can specify which
2047ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
2048fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2049for enabled registers are dense starting at SGPR0: the first enabled register is
2050SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
2051an SGPR number.
2052
2053The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
Tony Tye5bbcca62018-03-08 05:46:01 +00002054all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tyef16a45e2017-06-06 20:31:59 +00002055the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
2056initialized. These are then immediately followed by the System SGPRs that are
Tony Tye5bbcca62018-03-08 05:46:01 +00002057set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tyef16a45e2017-06-06 20:31:59 +00002058dispatch.
2059
2060SGPR register initial state is defined in
2061:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2062
2063 .. table:: SGPR Register Set Up Order
2064 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2065
2066 ========== ========================== ====== ==============================
2067 SGPR Order Name Number Description
2068 (kernel descriptor enable of
2069 field) SGPRs
2070 ========== ========================== ====== ==============================
2071 First Private Segment Buffer 4 V# that can be used, together
Tony Tye5bbcca62018-03-08 05:46:01 +00002072 (enable_sgpr_private with Scratch Wavefront Offset
2073 _segment_buffer) as an offset, to access the
2074 private memory space using a
2075 segment address.
Tony Tyef16a45e2017-06-06 20:31:59 +00002076
2077 CP uses the value provided by
2078 the runtime.
2079 then Dispatch Ptr 2 64 bit address of AQL dispatch
2080 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2081 actually executing.
2082 then Queue Ptr 2 64 bit address of amd_queue_t
2083 (enable_sgpr_queue_ptr) object for AQL queue on which
2084 the dispatch packet was
2085 queued.
2086 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2087 (enable_sgpr_kernarg segment. This is directly
2088 _segment_ptr) copied from the
2089 kernarg_address in the kernel
2090 dispatch packet.
2091
2092 Having CP load it once avoids
2093 loading it at the beginning of
2094 every wavefront.
2095 then Dispatch Id 2 64 bit Dispatch ID of the
2096 (enable_sgpr_dispatch_id) dispatch packet being
2097 executed.
2098 then Flat Scratch Init 2 This is 2 SGPRs:
2099 (enable_sgpr_flat_scratch
2100 _init) GFX6
2101 Not supported.
2102 GFX7-GFX8
2103 The first SGPR is a 32 bit
2104 byte offset from
2105 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2106 to per SPI base of memory
2107 for scratch for the queue
2108 executing the kernel
2109 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00002110 from the runtime. (The
2111 Scratch Segment Buffer base
2112 address is
2113 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2114 plus this offset.) The value
Tony Tye5bbcca62018-03-08 05:46:01 +00002115 of Scratch Wavefront Offset must
Tony Tye46d35762017-08-15 20:47:41 +00002116 be added to this offset by
2117 the kernel machine code,
2118 right shifted by 8, and
2119 moved to the FLAT_SCRATCH_HI
2120 SGPR register.
2121 FLAT_SCRATCH_HI corresponds
2122 to SGPRn-4 on GFX7, and
2123 SGPRn-6 on GFX8 (where SGPRn
2124 is the highest numbered SGPR
Tony Tye5bbcca62018-03-08 05:46:01 +00002125 allocated to the wavefront).
Tony Tye46d35762017-08-15 20:47:41 +00002126 FLAT_SCRATCH_HI is
2127 multiplied by 256 (as it is
2128 in units of 256 bytes) and
2129 added to
2130 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye5bbcca62018-03-08 05:46:01 +00002131 to calculate the per wavefront
Tony Tye46d35762017-08-15 20:47:41 +00002132 FLAT SCRATCH BASE in flat
2133 memory instructions that
2134 access the scratch
2135 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002136
2137 The second SGPR is 32 bit
2138 byte size of a single
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002139 work-item's scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002140 usage. CP obtains this from
2141 the runtime, and it is
2142 always a multiple of DWORD.
2143 CP checks that the value in
2144 the kernel dispatch packet
2145 Private Segment Byte Size is
2146 not larger, and requests the
2147 runtime to increase the
2148 queue's scratch size if
2149 necessary. The kernel code
2150 must move it to
2151 FLAT_SCRATCH_LO which is
2152 SGPRn-3 on GFX7 and SGPRn-5
2153 on GFX8. FLAT_SCRATCH_LO is
2154 used as the FLAT SCRATCH
2155 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002156 instructions. Having CP load
2157 it once avoids loading it at
2158 the beginning of every
Tony Tyef59d0712017-11-10 20:51:43 +00002159 wavefront.
2160 GFX9
2161 This is the
Tony Tye46d35762017-08-15 20:47:41 +00002162 64 bit base address of the
2163 per SPI scratch backing
2164 memory managed by SPI for
2165 the queue executing the
2166 kernel dispatch. CP obtains
2167 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002168 divides it if there are
2169 multiple Shader Arrays each
2170 with its own SPI). The value
Tony Tye5bbcca62018-03-08 05:46:01 +00002171 of Scratch Wavefront Offset must
Tony Tyef16a45e2017-06-06 20:31:59 +00002172 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002173 machine code and the result
2174 moved to the FLAT_SCRATCH
2175 SGPR which is SGPRn-6 and
2176 SGPRn-5. It is used as the
2177 FLAT SCRATCH BASE in flat
Tony Tyef59d0712017-11-10 20:51:43 +00002178 memory instructions.
2179 then Private Segment Size 1 The 32 bit byte size of a
2180 (enable_sgpr_private single
2181 work-item's
2182 scratch_segment_size) memory
2183 allocation. This is the
2184 value from the kernel
2185 dispatch packet Private
2186 Segment Byte Size rounded up
2187 by CP to a multiple of
2188 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002189
2190 Having CP load it once avoids
2191 loading it at the beginning of
2192 every wavefront.
2193
2194 This is not used for
2195 GFX7-GFX8 since it is the same
2196 value as the second SGPR of
2197 Flat Scratch Init. However, it
2198 may be needed for GFX9 which
2199 changes the meaning of the
2200 Flat Scratch Init value.
2201 then Grid Work-Group Count X 1 32 bit count of the number of
2202 (enable_sgpr_grid work-groups in the X dimension
2203 _workgroup_count_X) for the grid being
2204 executed. Computed from the
2205 fields in the kernel dispatch
2206 packet as ((grid_size.x +
2207 workgroup_size.x - 1) /
2208 workgroup_size.x).
2209 then Grid Work-Group Count Y 1 32 bit count of the number of
2210 (enable_sgpr_grid work-groups in the Y dimension
2211 _workgroup_count_Y && for the grid being
2212 less than 16 previous executed. Computed from the
2213 SGPRs) fields in the kernel dispatch
2214 packet as ((grid_size.y +
2215 workgroup_size.y - 1) /
2216 workgroupSize.y).
2217
2218 Only initialized if <16
2219 previous SGPRs initialized.
2220 then Grid Work-Group Count Z 1 32 bit count of the number of
2221 (enable_sgpr_grid work-groups in the Z dimension
2222 _workgroup_count_Z && for the grid being
2223 less than 16 previous executed. Computed from the
2224 SGPRs) fields in the kernel dispatch
2225 packet as ((grid_size.z +
2226 workgroup_size.z - 1) /
2227 workgroupSize.z).
2228
2229 Only initialized if <16
2230 previous SGPRs initialized.
2231 then Work-Group Id X 1 32 bit work-group id in X
2232 (enable_sgpr_workgroup_id dimension of grid for
2233 _X) wavefront.
2234 then Work-Group Id Y 1 32 bit work-group id in Y
2235 (enable_sgpr_workgroup_id dimension of grid for
2236 _Y) wavefront.
2237 then Work-Group Id Z 1 32 bit work-group id in Z
2238 (enable_sgpr_workgroup_id dimension of grid for
2239 _Z) wavefront.
Tony Tye5bbcca62018-03-08 05:46:01 +00002240 then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tyef16a45e2017-06-06 20:31:59 +00002241 (enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye5bbcca62018-03-08 05:46:01 +00002242 _info) threadgroup_size_in_wavefronts[5:0]}
2243 then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tyef16a45e2017-06-06 20:31:59 +00002244 (enable_sgpr_private of scratch base of queue
Tony Tye5bbcca62018-03-08 05:46:01 +00002245 _segment_wavefront_offset) executing the kernel
Tony Tyef16a45e2017-06-06 20:31:59 +00002246 dispatch. Must be used as an
2247 offset with Private
2248 segment address when using
2249 Scratch Segment Buffer. It
2250 must be used to set up FLAT
2251 SCRATCH for flat addressing
2252 (see
2253 :ref:`amdgpu-amdhsa-flat-scratch`).
2254 ========== ========================== ====== ==============================
2255
2256The order of the VGPR registers is defined, but the compiler can specify which
2257ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
2258fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
2259for enabled registers are dense starting at VGPR0: the first enabled register is
2260VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
2261VGPR number.
2262
2263VGPR register initial state is defined in
2264:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
2265
2266 .. table:: VGPR Register Set Up Order
2267 :name: amdgpu-amdhsa-vgpr-register-set-up-order-table
2268
2269 ========== ========================== ====== ==============================
2270 VGPR Order Name Number Description
2271 (kernel descriptor enable of
2272 field) VGPRs
2273 ========== ========================== ====== ==============================
2274 First Work-Item Id X 1 32 bit work item id in X
2275 (Always initialized) dimension of work-group for
2276 wavefront lane.
2277 then Work-Item Id Y 1 32 bit work item id in Y
2278 (enable_vgpr_workitem_id dimension of work-group for
2279 > 0) wavefront lane.
2280 then Work-Item Id Z 1 32 bit work item id in Z
2281 (enable_vgpr_workitem_id dimension of work-group for
2282 > 1) wavefront lane.
2283 ========== ========================== ====== ==============================
2284
Hiroshi Inouebcadfee2018-04-12 05:53:20 +00002285The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
Tony Tyef16a45e2017-06-06 20:31:59 +00002286
22871. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
2288 registers.
22892. Work-group Id registers X, Y, Z are set by ADC which supports any
2290 combination including none.
Tony Tye5bbcca62018-03-08 05:46:01 +000022913. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
2292 its value cannot included with the flat scratch init value which is per queue.
Tony Tyef16a45e2017-06-06 20:31:59 +000022934. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
2294 or (X, Y, Z).
2295
2296Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
2297value to the hardware required SGPRn-3 and SGPRn-4 respectively.
2298
2299The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye07d9f102017-11-10 01:00:54 +00002300has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tyef16a45e2017-06-06 20:31:59 +00002301instructions (GFX9).
2302
2303If buffer operations are used then the compiler can generate a V# with the
2304following properties:
2305
2306* base address of 0
2307* no swizzle
2308* ATC: 1 if IOMMU present (such as APU)
2309* ptr64: 1
2310* MTYPE set to support memory coherence that matches the runtime (such as CC for
2311 APU and NC for dGPU).
2312
2313.. _amdgpu-amdhsa-kernel-prolog:
2314
2315Kernel Prolog
2316~~~~~~~~~~~~~
2317
2318.. _amdgpu-amdhsa-m0:
2319
2320M0
2321++
2322
2323GFX6-GFX8
2324 The M0 register must be initialized with a value at least the total LDS size
2325 if the kernel may access LDS via DS or flat operations. Total LDS size is
2326 available in dispatch packet. For M0, it is also possible to use maximum
2327 possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
2328 GFX7-GFX8).
2329GFX9
2330 The M0 register is not used for range checking LDS accesses and so does not
2331 need to be initialized in the prolog.
2332
2333.. _amdgpu-amdhsa-flat-scratch:
2334
2335Flat Scratch
2336++++++++++++
2337
2338If the kernel may use flat operations to access scratch memory, the prolog code
2339must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye5bbcca62018-03-08 05:46:01 +00002340are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tyef16a45e2017-06-06 20:31:59 +00002341Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
2342
2343GFX6
2344 Flat scratch is not supported.
2345
Tony Tye07d9f102017-11-10 01:00:54 +00002346GFX7-GFX8
Tony Tyef16a45e2017-06-06 20:31:59 +00002347 1. The low word of Flat Scratch Init is 32 bit byte offset from
2348 ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
2349 being managed by SPI for the queue executing the kernel dispatch. This is
2350 the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye5bbcca62018-03-08 05:46:01 +00002351 prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tyef16a45e2017-06-06 20:31:59 +00002352 scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
2353 FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
2354 by 8 before moving into FLAT_SCRATCH_LO.
2355 2. The second word of Flat Scratch Init is 32 bit byte size of a single
2356 work-items scratch memory usage. This is directly loaded from the kernel
2357 dispatch packet Private Segment Byte Size and rounded up to a multiple of
2358 DWORD. Having CP load it once avoids loading it at the beginning of every
2359 wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
2360 SIZE.
Tony Tyef59d0712017-11-10 20:51:43 +00002361
Tony Tyef16a45e2017-06-06 20:31:59 +00002362GFX9
2363 The Flat Scratch Init is the 64 bit address of the base of scratch backing
2364 memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye5bbcca62018-03-08 05:46:01 +00002365 prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
Tony Tyef16a45e2017-06-06 20:31:59 +00002366 pair for use as the flat scratch base in flat memory instructions.
2367
2368.. _amdgpu-amdhsa-memory-model:
2369
2370Memory Model
2371~~~~~~~~~~~~
2372
2373This section describes the mapping of LLVM memory model onto AMDGPU machine code
2374(see :ref:`memmodel`). *The implementation is WIP.*
2375
2376.. TODO
2377 Update when implementation complete.
2378
Tony Tyef16a45e2017-06-06 20:31:59 +00002379The AMDGPU backend supports the memory synchronization scopes specified in
2380:ref:`amdgpu-memory-scopes`.
2381
2382The code sequences used to implement the memory model are defined in table
2383:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
2384
2385The sequences specify the order of instructions that a single thread must
2386execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
2387to other memory instructions executed by the same thread. This allows them to be
2388moved earlier or later which can allow them to be combined with other instances
2389of the same instruction, or hoisted/sunk out of loops to improve
2390performance. Only the instructions related to the memory model are given;
2391additional ``s_waitcnt`` instructions are required to ensure registers are
2392defined before being used. These may be able to be combined with the memory
2393model ``s_waitcnt`` instructions as described above.
2394
Tony Tye6baa6d22017-10-18 22:16:55 +00002395The AMDGPU backend supports the following memory models:
2396
2397 HSA Memory Model [HSA]_
2398 The HSA memory model uses a single happens-before relation for all address
2399 spaces (see :ref:`amdgpu-address-spaces`).
2400 OpenCL Memory Model [OpenCL]_
2401 The OpenCL memory model which has separate happens-before relations for the
2402 global and local address spaces. Only a fence specifying both global and
2403 local address space, and seq_cst instructions join the relationships. Since
2404 the LLVM ``memfence`` instruction does not allow an address space to be
2405 specified the OpenCL fence has to convervatively assume both local and
2406 global address space was specified. However, optimizations can often be
2407 done to eliminate the additional ``s_waitcnt`` instructions when there are
2408 no intervening memory instructions which access the corresponding address
2409 space. The code sequences in the table indicate what can be omitted for the
2410 OpenCL memory. The target triple environment is used to determine if the
2411 source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tyef16a45e2017-06-06 20:31:59 +00002412
2413``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
2414operations.
2415
2416``buffer/global/flat_load/store/atomic`` instructions to global memory are
2417termed vector memory operations.
2418
2419For GFX6-GFX9:
2420
2421* Each agent has multiple compute units (CU).
2422* Each CU has multiple SIMDs that execute wavefronts.
2423* The wavefronts for a single work-group are executed in the same CU but may be
2424 executed by different SIMDs.
2425* Each CU has a single LDS memory shared by the wavefronts of the work-groups
2426 executing on it.
2427* All LDS operations of a CU are performed as wavefront wide operations in a
2428 global order and involve no caching. Completion is reported to a wavefront in
2429 execution order.
2430* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye5bbcca62018-03-08 05:46:01 +00002431 CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tyef16a45e2017-06-06 20:31:59 +00002432 can be reordered relative to each other, which can result in reordering the
2433 visibility of vector memory operations with respect to LDS operations of other
2434 wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002435 ensure synchronization between LDS operations and vector memory operations
Tony Tye5bbcca62018-03-08 05:46:01 +00002436 between wavefronts of a work-group, but not between operations performed by the
Tony Tyef16a45e2017-06-06 20:31:59 +00002437 same wavefront.
2438* The vector memory operations are performed as wavefront wide operations and
2439 completion is reported to a wavefront in execution order. The exception is
Tony Tye07d9f102017-11-10 01:00:54 +00002440 that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tyef16a45e2017-06-06 20:31:59 +00002441 vector memory order if they access LDS memory, and out of LDS operation order
2442 if they access global memory.
Tony Tye6baa6d22017-10-18 22:16:55 +00002443* The vector memory operations access a single vector L1 cache shared by all
2444 SIMDs a CU. Therefore, no special action is required for coherence between the
2445 lanes of a single wavefront, or for coherence between wavefronts in the same
Tony Tye5bbcca62018-03-08 05:46:01 +00002446 work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
Tony Tye6baa6d22017-10-18 22:16:55 +00002447 executing in different work-groups as they may be executing on different CUs.
Tony Tyef16a45e2017-06-06 20:31:59 +00002448* The scalar memory operations access a scalar L1 cache shared by all wavefronts
2449 on a group of CUs. The scalar and vector L1 caches are not coherent. However,
2450 scalar operations are used in a restricted way so do not impact the memory
2451 model. See :ref:`amdgpu-amdhsa-memory-spaces`.
2452* The vector and scalar memory operations use an L2 cache shared by all CUs on
2453 the same agent.
2454* The L2 cache has independent channels to service disjoint ranges of virtual
2455 addresses.
2456* Each CU has a separate request queue per channel. Therefore, the vector and
Tony Tye5bbcca62018-03-08 05:46:01 +00002457 scalar memory operations performed by wavefronts executing in different work-groups
Tony Tyef16a45e2017-06-06 20:31:59 +00002458 (which may be executing on different CUs) of an agent can be reordered
2459 relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002460 synchronization between vector memory operations of different CUs. It ensures a
Tony Tyef16a45e2017-06-06 20:31:59 +00002461 previous vector memory operation has completed before executing a subsequent
2462 vector memory or LDS operation and so can be used to meet the requirements of
2463 acquire and release.
2464* The L2 cache can be kept coherent with other agents on some targets, or ranges
2465 of virtual addresses can be set up to bypass it to ensure system coherence.
2466
Tony Tye07d9f102017-11-10 01:00:54 +00002467Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tyef16a45e2017-06-06 20:31:59 +00002468or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
2469memory, atomic memory orderings are not meaningful and all accesses are treated
2470as non-atomic.
2471
2472Constant address space uses ``buffer/global_load`` instructions (or equivalent
2473scalar memory instructions). Since the constant address space contents do not
2474change during the execution of a kernel dispatch it is not legal to perform
2475stores, and atomic memory orderings are not meaningful and all access are
2476treated as non-atomic.
2477
2478A memory synchronization scope wider than work-group is not meaningful for the
2479group (LDS) address space and is treated as work-group.
2480
2481The memory model does not support the region address space which is treated as
2482non-atomic.
2483
2484Acquire memory ordering is not meaningful on store atomic instructions and is
2485treated as non-atomic.
2486
2487Release memory ordering is not meaningful on load atomic instructions and is
2488treated a non-atomic.
2489
2490Acquire-release memory ordering is not meaningful on load or store atomic
2491instructions and is treated as acquire and release respectively.
2492
2493AMDGPU backend only uses scalar memory operations to access memory that is
2494proven to not change during the execution of the kernel dispatch. This includes
2495constant address space and global address space for program scope const
2496variables. Therefore the kernel machine code does not have to maintain the
2497scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
2498and vector L1 caches are invalidated between kernel dispatches by CP since
2499constant address space data may change between kernel dispatch executions. See
2500:ref:`amdgpu-amdhsa-memory-spaces`.
2501
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00002502The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tyef16a45e2017-06-06 20:31:59 +00002503case the AMDGPU backend ensures the memory location used to spill is never
2504accessed by vector memory operations at the same time. If scalar writes are used
2505then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
2506return since the locations may be used for vector memory instructions by a
Tony Tye5bbcca62018-03-08 05:46:01 +00002507future wavefront that uses the same scratch area, or a function call that creates a
Tony Tyef16a45e2017-06-06 20:31:59 +00002508frame at the same address, respectively. There is no need for a ``s_dcache_inv``
2509as all scalar writes are write-before-read in the same thread.
2510
Tony Tye6baa6d22017-10-18 22:16:55 +00002511Scratch backing memory (which is used for the private address space)
2512is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
2513address space is only accessed by a single thread, and is always
2514write-before-read, there is never a need to invalidate these entries from the L1
2515cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
2516volatile cache lines.
Tony Tyef16a45e2017-06-06 20:31:59 +00002517
2518On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye6baa6d22017-10-18 22:16:55 +00002519to invalidate the L2 cache. This also causes it to be treated as
2520non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
2521(cache coherent) and so the L2 cache will coherent with the CPU and other
2522agents.
Tony Tyef16a45e2017-06-06 20:31:59 +00002523
2524 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2525 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2526
Tony Tye6baa6d22017-10-18 22:16:55 +00002527 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002528 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2529 Ordering Sync Scope Address
2530 Space
Tony Tye6baa6d22017-10-18 22:16:55 +00002531 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002532 **Non-Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002533 -----------------------------------------------------------------------------------
2534 load *none* *none* - global - !volatile & !nontemporal
2535 - generic
2536 - private 1. buffer/global/flat_load
2537 - constant
2538 - volatile & !nontemporal
2539
Tony Tyef16a45e2017-06-06 20:31:59 +00002540 1. buffer/global/flat_load
2541 glc=1
Tony Tye6baa6d22017-10-18 22:16:55 +00002542
2543 - nontemporal
2544
2545 1. buffer/global/flat_load
2546 glc=1 slc=1
2547
Tony Tyef16a45e2017-06-06 20:31:59 +00002548 load *none* *none* - local 1. ds_load
Tony Tye6baa6d22017-10-18 22:16:55 +00002549 store *none* *none* - global - !nontemporal
Tony Tyef16a45e2017-06-06 20:31:59 +00002550 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002551 - private 1. buffer/global/flat_store
2552 - constant
2553 - nontemporal
2554
2555 1. buffer/global/flat_stote
2556 glc=1 slc=1
2557
Tony Tyef16a45e2017-06-06 20:31:59 +00002558 store *none* *none* - local 1. ds_store
2559 **Unordered Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002560 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002561 load atomic unordered *any* *any* *Same as non-atomic*.
2562 store atomic unordered *any* *any* *Same as non-atomic*.
2563 atomicrmw unordered *any* *any* *Same as monotonic
2564 atomic*.
2565 **Monotonic Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002566 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002567 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2568 - wavefront - generic
2569 - workgroup
2570 load atomic monotonic - singlethread - local 1. ds_load
2571 - wavefront
2572 - workgroup
2573 load atomic monotonic - agent - global 1. buffer/global/flat_load
2574 - system - generic glc=1
2575 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2576 - wavefront - generic
2577 - workgroup
2578 - agent
2579 - system
2580 store atomic monotonic - singlethread - local 1. ds_store
2581 - wavefront
2582 - workgroup
2583 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2584 - wavefront - generic
2585 - workgroup
2586 - agent
2587 - system
2588 atomicrmw monotonic - singlethread - local 1. ds_atomic
2589 - wavefront
2590 - workgroup
2591 **Acquire Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002592 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002593 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2594 - wavefront - local
2595 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002596 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2597 load atomic acquire - workgroup - local 1. ds_load
2598 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002599
Tony Tye6baa6d22017-10-18 22:16:55 +00002600 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002601 - Must happen before
2602 any following
2603 global/generic
2604 load/load
2605 atomic/store/store
2606 atomic/atomicrmw.
2607 - Ensures any
2608 following global
2609 data read is no
2610 older than the load
2611 atomic value being
2612 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00002613 load atomic acquire - workgroup - generic 1. flat_load
2614 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002615
Tony Tye6baa6d22017-10-18 22:16:55 +00002616 - If OpenCL, omit.
2617 - Must happen before
2618 any following
2619 global/generic
2620 load/load
2621 atomic/store/store
2622 atomic/atomicrmw.
2623 - Ensures any
2624 following global
2625 data read is no
2626 older than the load
2627 atomic value being
2628 acquired.
2629 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tyef16a45e2017-06-06 20:31:59 +00002630 - system glc=1
2631 2. s_waitcnt vmcnt(0)
2632
2633 - Must happen before
2634 following
2635 buffer_wbinvl1_vol.
2636 - Ensures the load
2637 has completed
2638 before invalidating
2639 the cache.
2640
2641 3. buffer_wbinvl1_vol
2642
2643 - Must happen before
2644 any following
2645 global/generic
2646 load/load
2647 atomic/atomicrmw.
2648 - Ensures that
2649 following
2650 loads will not see
2651 stale global data.
2652
2653 load atomic acquire - agent - generic 1. flat_load glc=1
2654 - system 2. s_waitcnt vmcnt(0) &
2655 lgkmcnt(0)
2656
2657 - If OpenCL omit
2658 lgkmcnt(0).
2659 - Must happen before
2660 following
2661 buffer_wbinvl1_vol.
2662 - Ensures the flat_load
2663 has completed
2664 before invalidating
2665 the cache.
2666
2667 3. buffer_wbinvl1_vol
2668
2669 - Must happen before
2670 any following
2671 global/generic
2672 load/load
2673 atomic/atomicrmw.
2674 - Ensures that
2675 following loads
2676 will not see stale
2677 global data.
2678
2679 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2680 - wavefront - local
2681 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002682 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2683 atomicrmw acquire - workgroup - local 1. ds_atomic
2684 2. waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002685
Tony Tye6baa6d22017-10-18 22:16:55 +00002686 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002687 - Must happen before
2688 any following
2689 global/generic
2690 load/load
2691 atomic/store/store
2692 atomic/atomicrmw.
2693 - Ensures any
2694 following global
2695 data read is no
2696 older than the
2697 atomicrmw value
2698 being acquired.
2699
Tony Tye6baa6d22017-10-18 22:16:55 +00002700 atomicrmw acquire - workgroup - generic 1. flat_atomic
2701 2. waitcnt lgkmcnt(0)
2702
2703 - If OpenCL, omit.
2704 - Must happen before
2705 any following
2706 global/generic
2707 load/load
2708 atomic/store/store
2709 atomic/atomicrmw.
2710 - Ensures any
2711 following global
2712 data read is no
2713 older than the
2714 atomicrmw value
2715 being acquired.
2716
2717 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002718 - system 2. s_waitcnt vmcnt(0)
2719
2720 - Must happen before
2721 following
2722 buffer_wbinvl1_vol.
2723 - Ensures the
2724 atomicrmw has
2725 completed before
2726 invalidating the
2727 cache.
2728
2729 3. buffer_wbinvl1_vol
2730
2731 - Must happen before
2732 any following
2733 global/generic
2734 load/load
2735 atomic/atomicrmw.
2736 - Ensures that
2737 following loads
2738 will not see stale
2739 global data.
2740
2741 atomicrmw acquire - agent - generic 1. flat_atomic
2742 - system 2. s_waitcnt vmcnt(0) &
2743 lgkmcnt(0)
2744
2745 - If OpenCL, omit
2746 lgkmcnt(0).
2747 - Must happen before
2748 following
2749 buffer_wbinvl1_vol.
2750 - Ensures the
2751 atomicrmw has
2752 completed before
2753 invalidating the
2754 cache.
2755
2756 3. buffer_wbinvl1_vol
2757
2758 - Must happen before
2759 any following
2760 global/generic
2761 load/load
2762 atomic/atomicrmw.
2763 - Ensures that
2764 following loads
2765 will not see stale
2766 global data.
2767
2768 fence acquire - singlethread *none* *none*
2769 - wavefront
2770 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2771
2772 - If OpenCL and
2773 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002774 not generic, omit.
2775 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002776 currently has no
2777 address space on
2778 the fence need to
2779 conservatively
2780 always generate. If
2781 fence had an
2782 address space then
2783 set to address
2784 space of OpenCL
2785 fence flag, or to
2786 generic if both
2787 local and global
2788 flags are
2789 specified.
2790 - Must happen after
2791 any preceding
2792 local/generic load
2793 atomic/atomicrmw
2794 with an equal or
2795 wider sync scope
2796 and memory ordering
2797 stronger than
2798 unordered (this is
2799 termed the
2800 fence-paired-atomic).
2801 - Must happen before
2802 any following
2803 global/generic
2804 load/load
2805 atomic/store/store
2806 atomic/atomicrmw.
2807 - Ensures any
2808 following global
2809 data read is no
2810 older than the
2811 value read by the
2812 fence-paired-atomic.
2813
Tony Tye6baa6d22017-10-18 22:16:55 +00002814 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2815 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002816
2817 - If OpenCL and
2818 address space is
2819 not generic, omit
2820 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002821 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002822 currently has no
2823 address space on
2824 the fence need to
2825 conservatively
2826 always generate
2827 (see comment for
2828 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002829 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002830 separate s_waitcnt
2831 vmcnt(0) and
2832 s_waitcnt
2833 lgkmcnt(0) to allow
2834 them to be
2835 independently moved
2836 according to the
2837 following rules.
2838 - s_waitcnt vmcnt(0)
2839 must happen after
2840 any preceding
2841 global/generic load
2842 atomic/atomicrmw
2843 with an equal or
2844 wider sync scope
2845 and memory ordering
2846 stronger than
2847 unordered (this is
2848 termed the
2849 fence-paired-atomic).
2850 - s_waitcnt lgkmcnt(0)
2851 must happen after
2852 any preceding
Tony Tye6baa6d22017-10-18 22:16:55 +00002853 local/generic load
Tony Tyef16a45e2017-06-06 20:31:59 +00002854 atomic/atomicrmw
2855 with an equal or
2856 wider sync scope
2857 and memory ordering
2858 stronger than
2859 unordered (this is
2860 termed the
2861 fence-paired-atomic).
2862 - Must happen before
2863 the following
2864 buffer_wbinvl1_vol.
2865 - Ensures that the
2866 fence-paired atomic
2867 has completed
2868 before invalidating
2869 the
2870 cache. Therefore
2871 any following
2872 locations read must
2873 be no older than
2874 the value read by
2875 the
2876 fence-paired-atomic.
2877
2878 2. buffer_wbinvl1_vol
2879
Tony Tye6baa6d22017-10-18 22:16:55 +00002880 - Must happen before any
2881 following global/generic
Tony Tyef16a45e2017-06-06 20:31:59 +00002882 load/load
2883 atomic/store/store
2884 atomic/atomicrmw.
2885 - Ensures that
2886 following loads
2887 will not see stale
2888 global data.
2889
2890 **Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002891 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002892 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2893 - wavefront - local
2894 - generic
2895 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002896
2897 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002898 - Must happen after
2899 any preceding
2900 local/generic
2901 load/store/load
2902 atomic/store
2903 atomic/atomicrmw.
2904 - Must happen before
2905 the following
2906 store.
2907 - Ensures that all
2908 memory operations
2909 to local have
2910 completed before
2911 performing the
2912 store that is being
2913 released.
2914
2915 2. buffer/global/flat_store
2916 store atomic release - workgroup - local 1. ds_store
Tony Tye6baa6d22017-10-18 22:16:55 +00002917 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2918
2919 - If OpenCL, omit.
2920 - Must happen after
2921 any preceding
2922 local/generic
2923 load/store/load
2924 atomic/store
2925 atomic/atomicrmw.
2926 - Must happen before
2927 the following
2928 store.
2929 - Ensures that all
2930 memory operations
2931 to local have
2932 completed before
2933 performing the
2934 store that is being
2935 released.
2936
2937 2. flat_store
2938 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2939 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002940
2941 - If OpenCL, omit
2942 lgkmcnt(0).
2943 - Could be split into
2944 separate s_waitcnt
2945 vmcnt(0) and
2946 s_waitcnt
2947 lgkmcnt(0) to allow
2948 them to be
2949 independently moved
2950 according to the
2951 following rules.
2952 - s_waitcnt vmcnt(0)
2953 must happen after
2954 any preceding
2955 global/generic
2956 load/store/load
2957 atomic/store
2958 atomic/atomicrmw.
2959 - s_waitcnt lgkmcnt(0)
2960 must happen after
2961 any preceding
2962 local/generic
2963 load/store/load
2964 atomic/store
2965 atomic/atomicrmw.
2966 - Must happen before
2967 the following
2968 store.
2969 - Ensures that all
2970 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00002971 to memory have
Tony Tyef16a45e2017-06-06 20:31:59 +00002972 completed before
2973 performing the
2974 store that is being
2975 released.
2976
2977 2. buffer/global/ds/flat_store
2978 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2979 - wavefront - local
2980 - generic
2981 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002982
2983 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002984 - Must happen after
2985 any preceding
2986 local/generic
2987 load/store/load
2988 atomic/store
2989 atomic/atomicrmw.
2990 - Must happen before
2991 the following
2992 atomicrmw.
2993 - Ensures that all
2994 memory operations
2995 to local have
2996 completed before
2997 performing the
2998 atomicrmw that is
2999 being released.
3000
3001 2. buffer/global/flat_atomic
3002 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye6baa6d22017-10-18 22:16:55 +00003003 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3004
3005 - If OpenCL, omit.
3006 - Must happen after
3007 any preceding
3008 local/generic
3009 load/store/load
3010 atomic/store
3011 atomic/atomicrmw.
3012 - Must happen before
3013 the following
3014 atomicrmw.
3015 - Ensures that all
3016 memory operations
3017 to local have
3018 completed before
3019 performing the
3020 atomicrmw that is
3021 being released.
3022
3023 2. flat_atomic
3024 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
3025 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003026
3027 - If OpenCL, omit
3028 lgkmcnt(0).
3029 - Could be split into
3030 separate s_waitcnt
3031 vmcnt(0) and
3032 s_waitcnt
3033 lgkmcnt(0) to allow
3034 them to be
3035 independently moved
3036 according to the
3037 following rules.
3038 - s_waitcnt vmcnt(0)
3039 must happen after
3040 any preceding
3041 global/generic
3042 load/store/load
3043 atomic/store
3044 atomic/atomicrmw.
3045 - s_waitcnt lgkmcnt(0)
3046 must happen after
3047 any preceding
3048 local/generic
3049 load/store/load
3050 atomic/store
3051 atomic/atomicrmw.
3052 - Must happen before
3053 the following
3054 atomicrmw.
3055 - Ensures that all
3056 memory operations
3057 to global and local
3058 have completed
3059 before performing
3060 the atomicrmw that
3061 is being released.
3062
Tony Tye6baa6d22017-10-18 22:16:55 +00003063 2. buffer/global/ds/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003064 fence release - singlethread *none* *none*
3065 - wavefront
3066 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3067
3068 - If OpenCL and
3069 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003070 not generic, omit.
3071 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003072 currently has no
3073 address space on
3074 the fence need to
3075 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003076 always generate. If
3077 fence had an
3078 address space then
3079 set to address
3080 space of OpenCL
3081 fence flag, or to
3082 generic if both
3083 local and global
3084 flags are
3085 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003086 - Must happen after
3087 any preceding
3088 local/generic
3089 load/load
3090 atomic/store/store
3091 atomic/atomicrmw.
3092 - Must happen before
3093 any following store
3094 atomic/atomicrmw
3095 with an equal or
3096 wider sync scope
3097 and memory ordering
3098 stronger than
3099 unordered (this is
3100 termed the
3101 fence-paired-atomic).
3102 - Ensures that all
3103 memory operations
3104 to local have
3105 completed before
3106 performing the
3107 following
3108 fence-paired-atomic.
3109
Tony Tye6baa6d22017-10-18 22:16:55 +00003110 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3111 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003112
3113 - If OpenCL and
3114 address space is
3115 not generic, omit
3116 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003117 - If OpenCL and
3118 address space is
3119 local, omit
3120 vmcnt(0).
3121 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003122 currently has no
3123 address space on
3124 the fence need to
3125 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003126 always generate. If
3127 fence had an
3128 address space then
3129 set to address
3130 space of OpenCL
3131 fence flag, or to
3132 generic if both
3133 local and global
3134 flags are
3135 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003136 - Could be split into
3137 separate s_waitcnt
3138 vmcnt(0) and
3139 s_waitcnt
3140 lgkmcnt(0) to allow
3141 them to be
3142 independently moved
3143 according to the
3144 following rules.
3145 - s_waitcnt vmcnt(0)
3146 must happen after
3147 any preceding
3148 global/generic
3149 load/store/load
3150 atomic/store
3151 atomic/atomicrmw.
3152 - s_waitcnt lgkmcnt(0)
3153 must happen after
3154 any preceding
3155 local/generic
3156 load/store/load
3157 atomic/store
3158 atomic/atomicrmw.
3159 - Must happen before
3160 any following store
3161 atomic/atomicrmw
3162 with an equal or
3163 wider sync scope
3164 and memory ordering
3165 stronger than
3166 unordered (this is
3167 termed the
3168 fence-paired-atomic).
3169 - Ensures that all
3170 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00003171 have
Tony Tyef16a45e2017-06-06 20:31:59 +00003172 completed before
3173 performing the
3174 following
3175 fence-paired-atomic.
3176
3177 **Acquire-Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003178 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003179 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3180 - wavefront - local
3181 - generic
3182 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3183
Tony Tye6baa6d22017-10-18 22:16:55 +00003184 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003185 - Must happen after
3186 any preceding
3187 local/generic
3188 load/store/load
3189 atomic/store
3190 atomic/atomicrmw.
3191 - Must happen before
3192 the following
3193 atomicrmw.
3194 - Ensures that all
3195 memory operations
3196 to local have
3197 completed before
3198 performing the
3199 atomicrmw that is
3200 being released.
3201
Tony Tye6baa6d22017-10-18 22:16:55 +00003202 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003203 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3204 2. s_waitcnt lgkmcnt(0)
3205
Tony Tye6baa6d22017-10-18 22:16:55 +00003206 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003207 - Must happen before
3208 any following
3209 global/generic
3210 load/load
3211 atomic/store/store
3212 atomic/atomicrmw.
3213 - Ensures any
3214 following global
3215 data read is no
3216 older than the load
3217 atomic value being
3218 acquired.
3219
3220 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3221
Tony Tye6baa6d22017-10-18 22:16:55 +00003222 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003223 - Must happen after
3224 any preceding
3225 local/generic
3226 load/store/load
3227 atomic/store
3228 atomic/atomicrmw.
3229 - Must happen before
3230 the following
3231 atomicrmw.
3232 - Ensures that all
3233 memory operations
3234 to local have
3235 completed before
3236 performing the
3237 atomicrmw that is
3238 being released.
3239
3240 2. flat_atomic
3241 3. s_waitcnt lgkmcnt(0)
3242
Tony Tye6baa6d22017-10-18 22:16:55 +00003243 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003244 - Must happen before
3245 any following
3246 global/generic
3247 load/load
3248 atomic/store/store
3249 atomic/atomicrmw.
3250 - Ensures any
3251 following global
3252 data read is no
3253 older than the load
3254 atomic value being
3255 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00003256
3257 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3258 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003259
3260 - If OpenCL, omit
3261 lgkmcnt(0).
3262 - Could be split into
3263 separate s_waitcnt
3264 vmcnt(0) and
3265 s_waitcnt
3266 lgkmcnt(0) to allow
3267 them to be
3268 independently moved
3269 according to the
3270 following rules.
3271 - s_waitcnt vmcnt(0)
3272 must happen after
3273 any preceding
3274 global/generic
3275 load/store/load
3276 atomic/store
3277 atomic/atomicrmw.
3278 - s_waitcnt lgkmcnt(0)
3279 must happen after
3280 any preceding
3281 local/generic
3282 load/store/load
3283 atomic/store
3284 atomic/atomicrmw.
3285 - Must happen before
3286 the following
3287 atomicrmw.
3288 - Ensures that all
3289 memory operations
3290 to global have
3291 completed before
3292 performing the
3293 atomicrmw that is
3294 being released.
3295
Tony Tye6baa6d22017-10-18 22:16:55 +00003296 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003297 3. s_waitcnt vmcnt(0)
3298
3299 - Must happen before
3300 following
3301 buffer_wbinvl1_vol.
3302 - Ensures the
3303 atomicrmw has
3304 completed before
3305 invalidating the
3306 cache.
3307
3308 4. buffer_wbinvl1_vol
3309
3310 - Must happen before
3311 any following
3312 global/generic
3313 load/load
3314 atomic/atomicrmw.
3315 - Ensures that
3316 following loads
3317 will not see stale
3318 global data.
3319
Tony Tye6baa6d22017-10-18 22:16:55 +00003320 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3321 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003322
3323 - If OpenCL, omit
3324 lgkmcnt(0).
3325 - Could be split into
3326 separate s_waitcnt
3327 vmcnt(0) and
3328 s_waitcnt
3329 lgkmcnt(0) to allow
3330 them to be
3331 independently moved
3332 according to the
3333 following rules.
3334 - s_waitcnt vmcnt(0)
3335 must happen after
3336 any preceding
3337 global/generic
3338 load/store/load
3339 atomic/store
3340 atomic/atomicrmw.
3341 - s_waitcnt lgkmcnt(0)
3342 must happen after
3343 any preceding
3344 local/generic
3345 load/store/load
3346 atomic/store
3347 atomic/atomicrmw.
3348 - Must happen before
3349 the following
3350 atomicrmw.
3351 - Ensures that all
3352 memory operations
3353 to global have
3354 completed before
3355 performing the
3356 atomicrmw that is
3357 being released.
3358
3359 2. flat_atomic
3360 3. s_waitcnt vmcnt(0) &
3361 lgkmcnt(0)
3362
3363 - If OpenCL, omit
3364 lgkmcnt(0).
3365 - Must happen before
3366 following
3367 buffer_wbinvl1_vol.
3368 - Ensures the
3369 atomicrmw has
3370 completed before
3371 invalidating the
3372 cache.
3373
3374 4. buffer_wbinvl1_vol
3375
3376 - Must happen before
3377 any following
3378 global/generic
3379 load/load
3380 atomic/atomicrmw.
3381 - Ensures that
3382 following loads
3383 will not see stale
3384 global data.
3385
3386 fence acq_rel - singlethread *none* *none*
3387 - wavefront
3388 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3389
3390 - If OpenCL and
3391 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003392 not generic, omit.
3393 - However,
Tony Tyef16a45e2017-06-06 20:31:59 +00003394 since LLVM
3395 currently has no
3396 address space on
3397 the fence need to
3398 conservatively
3399 always generate
3400 (see comment for
3401 previous fence).
3402 - Must happen after
3403 any preceding
3404 local/generic
3405 load/load
3406 atomic/store/store
3407 atomic/atomicrmw.
3408 - Must happen before
3409 any following
3410 global/generic
3411 load/load
3412 atomic/store/store
3413 atomic/atomicrmw.
3414 - Ensures that all
3415 memory operations
3416 to local have
3417 completed before
3418 performing any
3419 following global
3420 memory operations.
3421 - Ensures that the
3422 preceding
3423 local/generic load
3424 atomic/atomicrmw
3425 with an equal or
3426 wider sync scope
3427 and memory ordering
3428 stronger than
3429 unordered (this is
3430 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003431 acquire-fence-paired-atomic
3432 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003433 before following
3434 global memory
3435 operations. This
3436 satisfies the
3437 requirements of
3438 acquire.
3439 - Ensures that all
3440 previous memory
3441 operations have
3442 completed before a
3443 following
3444 local/generic store
3445 atomic/atomicrmw
3446 with an equal or
3447 wider sync scope
3448 and memory ordering
3449 stronger than
3450 unordered (this is
3451 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003452 release-fence-paired-atomic
3453 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003454 requirements of
3455 release.
3456
Tony Tye6baa6d22017-10-18 22:16:55 +00003457 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3458 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003459
3460 - If OpenCL and
3461 address space is
3462 not generic, omit
3463 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003464 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003465 currently has no
3466 address space on
3467 the fence need to
3468 conservatively
3469 always generate
3470 (see comment for
3471 previous fence).
3472 - Could be split into
3473 separate s_waitcnt
3474 vmcnt(0) and
3475 s_waitcnt
3476 lgkmcnt(0) to allow
3477 them to be
3478 independently moved
3479 according to the
3480 following rules.
3481 - s_waitcnt vmcnt(0)
3482 must happen after
3483 any preceding
3484 global/generic
3485 load/store/load
3486 atomic/store
3487 atomic/atomicrmw.
3488 - s_waitcnt lgkmcnt(0)
3489 must happen after
3490 any preceding
3491 local/generic
3492 load/store/load
3493 atomic/store
3494 atomic/atomicrmw.
3495 - Must happen before
3496 the following
3497 buffer_wbinvl1_vol.
3498 - Ensures that the
3499 preceding
3500 global/local/generic
3501 load
3502 atomic/atomicrmw
3503 with an equal or
3504 wider sync scope
3505 and memory ordering
3506 stronger than
3507 unordered (this is
3508 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003509 acquire-fence-paired-atomic
3510 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003511 before invalidating
3512 the cache. This
3513 satisfies the
3514 requirements of
3515 acquire.
3516 - Ensures that all
3517 previous memory
3518 operations have
3519 completed before a
3520 following
3521 global/local/generic
3522 store
3523 atomic/atomicrmw
3524 with an equal or
3525 wider sync scope
3526 and memory ordering
3527 stronger than
3528 unordered (this is
3529 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003530 release-fence-paired-atomic
3531 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003532 requirements of
3533 release.
3534
3535 2. buffer_wbinvl1_vol
3536
3537 - Must happen before
3538 any following
3539 global/generic
3540 load/load
3541 atomic/store/store
3542 atomic/atomicrmw.
3543 - Ensures that
3544 following loads
3545 will not see stale
3546 global data. This
3547 satisfies the
3548 requirements of
3549 acquire.
3550
3551 **Sequential Consistent Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003552 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003553 load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003554 - wavefront - local load atomic acquire,
3555 - generic except must generated
3556 all instructions even
3557 for OpenCL.*
3558 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3559 - generic
3560 - Must
3561 happen after
3562 preceding
3563 global/generic load
3564 atomic/store
3565 atomic/atomicrmw
3566 with memory
3567 ordering of seq_cst
3568 and with equal or
3569 wider sync scope.
3570 (Note that seq_cst
3571 fences have their
3572 own s_waitcnt
3573 lgkmcnt(0) and so do
3574 not need to be
3575 considered.)
3576 - Ensures any
3577 preceding
3578 sequential
3579 consistent local
3580 memory instructions
3581 have completed
3582 before executing
3583 this sequentially
3584 consistent
3585 instruction. This
3586 prevents reordering
3587 a seq_cst store
3588 followed by a
3589 seq_cst load. (Note
3590 that seq_cst is
3591 stronger than
3592 acquire/release as
3593 the reordering of
3594 load acquire
3595 followed by a store
3596 release is
3597 prevented by the
3598 waitcnt of
3599 the release, but
3600 there is nothing
3601 preventing a store
3602 release followed by
3603 load acquire from
3604 competing out of
3605 order.)
3606
3607 2. *Following
3608 instructions same as
3609 corresponding load
3610 atomic acquire,
3611 except must generated
3612 all instructions even
3613 for OpenCL.*
3614 load atomic seq_cst - workgroup - local *Same as corresponding
3615 load atomic acquire,
3616 except must generated
3617 all instructions even
3618 for OpenCL.*
3619 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3620 - system - generic vmcnt(0)
3621
3622 - Could be split into
3623 separate s_waitcnt
3624 vmcnt(0)
3625 and s_waitcnt
3626 lgkmcnt(0) to allow
3627 them to be
3628 independently moved
3629 according to the
3630 following rules.
3631 - waitcnt lgkmcnt(0)
3632 must happen after
3633 preceding
3634 global/generic load
3635 atomic/store
3636 atomic/atomicrmw
3637 with memory
3638 ordering of seq_cst
3639 and with equal or
3640 wider sync scope.
3641 (Note that seq_cst
3642 fences have their
3643 own s_waitcnt
3644 lgkmcnt(0) and so do
3645 not need to be
3646 considered.)
3647 - waitcnt vmcnt(0)
3648 must happen after
Tony Tyef16a45e2017-06-06 20:31:59 +00003649 preceding
3650 global/generic load
3651 atomic/store
3652 atomic/atomicrmw
3653 with memory
3654 ordering of seq_cst
3655 and with equal or
3656 wider sync scope.
3657 (Note that seq_cst
3658 fences have their
3659 own s_waitcnt
3660 vmcnt(0) and so do
3661 not need to be
3662 considered.)
3663 - Ensures any
3664 preceding
3665 sequential
3666 consistent global
3667 memory instructions
3668 have completed
3669 before executing
3670 this sequentially
3671 consistent
3672 instruction. This
3673 prevents reordering
3674 a seq_cst store
3675 followed by a
Tony Tye6baa6d22017-10-18 22:16:55 +00003676 seq_cst load. (Note
Tony Tyef16a45e2017-06-06 20:31:59 +00003677 that seq_cst is
3678 stronger than
3679 acquire/release as
3680 the reordering of
3681 load acquire
3682 followed by a store
3683 release is
3684 prevented by the
Tony Tye6baa6d22017-10-18 22:16:55 +00003685 waitcnt of
Tony Tyef16a45e2017-06-06 20:31:59 +00003686 the release, but
3687 there is nothing
3688 preventing a store
3689 release followed by
3690 load acquire from
3691 competing out of
3692 order.)
3693
3694 2. *Following
3695 instructions same as
3696 corresponding load
Tony Tye6baa6d22017-10-18 22:16:55 +00003697 atomic acquire,
3698 except must generated
3699 all instructions even
3700 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003701 store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003702 - wavefront - local store atomic release,
3703 - workgroup - generic except must generated
3704 all instructions even
3705 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003706 store atomic seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003707 - system - generic store atomic release,
3708 except must generated
3709 all instructions even
3710 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003711 atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003712 - wavefront - local atomicrmw acq_rel,
3713 - workgroup - generic except must generated
3714 all instructions even
3715 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003716 atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003717 - system - generic atomicrmw acq_rel,
3718 except must generated
3719 all instructions even
3720 for OpenCL.*
Tony Tyef16a45e2017-06-06 20:31:59 +00003721 fence seq_cst - singlethread *none* *Same as corresponding
Tony Tye6baa6d22017-10-18 22:16:55 +00003722 - wavefront fence acq_rel,
3723 - workgroup except must generated
3724 - agent all instructions even
3725 - system for OpenCL.*
3726 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00003727
3728The memory order also adds the single thread optimization constrains defined in
3729table
3730:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
3731
3732 .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
3733 :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
3734
3735 ============ ==============================================================
3736 LLVM Memory Optimization Constraints
3737 Ordering
3738 ============ ==============================================================
3739 unordered *none*
3740 monotonic *none*
3741 acquire - If a load atomic/atomicrmw then no following load/load
3742 atomic/store/ store atomic/atomicrmw/fence instruction can
3743 be moved before the acquire.
3744 - If a fence then same as load atomic, plus no preceding
3745 associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledrue3fdbae2017-06-26 02:45:39 +00003746 release - If a store atomic/atomicrmw then no preceding load/load
Tony Tyef16a45e2017-06-06 20:31:59 +00003747 atomic/store/ store atomic/atomicrmw/fence instruction can
3748 be moved after the release.
3749 - If a fence then same as store atomic, plus no following
3750 associated fence-paired-atomic can be moved before the
3751 fence.
3752 acq_rel Same constraints as both acquire and release.
3753 seq_cst - If a load atomic then same constraints as acquire, plus no
3754 preceding sequentially consistent load atomic/store
3755 atomic/atomicrmw/fence instruction can be moved after the
3756 seq_cst.
3757 - If a store atomic then the same constraints as release, plus
3758 no following sequentially consistent load atomic/store
3759 atomic/atomicrmw/fence instruction can be moved before the
3760 seq_cst.
3761 - If an atomicrmw/fence then same constraints as acq_rel.
3762 ============ ==============================================================
Konstantin Zhuravlyovd5561e02017-03-08 23:55:44 +00003763
Wei Ding16289cf2017-02-21 18:48:01 +00003764Trap Handler ABI
Tony Tyef16a45e2017-06-06 20:31:59 +00003765~~~~~~~~~~~~~~~~
Wei Ding16289cf2017-02-21 18:48:01 +00003766
Tony Tyef16a45e2017-06-06 20:31:59 +00003767For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
3768(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
3769the ``s_trap`` instruction with the following usage:
Wei Ding16289cf2017-02-21 18:48:01 +00003770
Tony Tyef16a45e2017-06-06 20:31:59 +00003771 .. table:: AMDGPU Trap Handler for AMDHSA OS
3772 :name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding16289cf2017-02-21 18:48:01 +00003773
Tony Tyef16a45e2017-06-06 20:31:59 +00003774 =================== =============== =============== =======================
3775 Usage Code Sequence Trap Handler Description
3776 Inputs
3777 =================== =============== =============== =======================
3778 reserved ``s_trap 0x00`` Reserved by hardware.
3779 ``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
3780 ``queue_ptr`` ``debugtrap``
3781 ``VGPR0``: intrinsic (not
3782 ``arg`` implemented).
3783 ``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
3784 ``queue_ptr`` terminated and its
3785 associated queue put
3786 into the error state.
Tony Tye43259df2018-05-16 16:19:34 +00003787 ``llvm.debugtrap`` ``s_trap 0x03`` - If debugger not
3788 installed then
3789 behaves as a
3790 no-operation. The
3791 trap handler is
3792 entered and
3793 immediately returns
3794 to continue
3795 execution of the
3796 wavefront.
3797 - If the debugger is
3798 installed, causes
3799 the debug trap to be
3800 reported by the
3801 debugger and the
3802 wavefront is put in
3803 the halt state until
3804 resumed by the
3805 debugger.
3806 reserved ``s_trap 0x04`` Reserved.
3807 reserved ``s_trap 0x05`` Reserved.
3808 reserved ``s_trap 0x06`` Reserved.
3809 debugger breakpoint ``s_trap 0x07`` Reserved for debugger
Tony Tyef16a45e2017-06-06 20:31:59 +00003810 breakpoints.
Tony Tye43259df2018-05-16 16:19:34 +00003811 reserved ``s_trap 0x08`` Reserved.
3812 reserved ``s_trap 0xfe`` Reserved.
3813 reserved ``s_trap 0xff`` Reserved.
Tony Tyef16a45e2017-06-06 20:31:59 +00003814 =================== =============== =============== =======================
Wei Ding16289cf2017-02-21 18:48:01 +00003815
Tim Corringhamaf2dfc62018-04-04 13:02:09 +00003816AMDPAL
3817------
3818
3819This section provides code conventions used when the target triple OS is
3820``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
3821from the application/runtime to each invocation of a hardware shader. These
3822parameters include both generic, application-controlled parameters called
3823*user data* as well as system-generated parameters that are a product of the
3824draw or dispatch execution.
3825
3826User Data
3827~~~~~~~~~
3828
3829Each hardware stage has a set of 32-bit *user data registers* which can be
3830written from a command buffer and then loaded into SGPRs when waves are launched
3831via a subsequent dispatch or draw operation. This is the way most arguments are
3832passed from the application/runtime to a hardware shader.
3833
3834Compute User Data
3835~~~~~~~~~~~~~~~~~
3836
3837Compute shader user data mappings are simpler than graphics shaders, and have a
3838fixed mapping.
3839
3840Note that there are always 10 available *user data entries* in registers -
3841entries beyond that limit must be fetched from memory (via the spill table
3842pointer) by the shader.
3843
3844 .. table:: PAL Compute Shader User Data Registers
3845 :name: pal-compute-user-data-registers
3846
3847 ============= ================================
3848 User Register Description
3849 ============= ================================
3850 0 Global Internal Table (32-bit pointer)
3851 1 Per-Shader Internal Table (32-bit pointer)
3852 2 - 11 Application-Controlled User Data (10 32-bit values)
3853 12 Spill Table (32-bit pointer)
3854 13 - 14 Thread Group Count (64-bit pointer)
3855 15 GDS Range
3856 ============= ================================
3857
3858Graphics User Data
3859~~~~~~~~~~~~~~~~~~
3860
3861Graphics pipelines support a much more flexible user data mapping:
3862
3863 .. table:: PAL Graphics Shader User Data Registers
3864 :name: pal-graphics-user-data-registers
3865
3866 ============= ================================
3867 User Register Description
3868 ============= ================================
3869 0 Global Internal Table (32-bit pointer)
3870 + Per-Shader Internal Table (32-bit pointer)
3871 + 1-15 Application Controlled User Data
3872 (1-15 Contiguous 32-bit Values in Registers)
3873 + Spill Table (32-bit pointer)
3874 + Draw Index (First Stage Only)
3875 + Vertex Offset (First Stage Only)
3876 + Instance Offset (First Stage Only)
3877 ============= ================================
3878
3879 The placement of the global internal table remains fixed in the first *user
3880 data SGPR register*. Otherwise all parameters are optional, and can be mapped
3881 to any desired *user data SGPR register*, with the following regstrictions:
3882
3883 * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
3884 activehardware stage in a graphics pipeline (i.e. where the API vertex
3885 shader runs).
3886
3887 * Application-controlled user data must be mapped into a contiguous range of
3888 user data registers.
3889
3890 * The application-controlled user data range supports compaction remapping, so
3891 only *entries* that are actually consumed by the shader must be assigned to
3892 corresponding *registers*. Note that in order to support an efficient runtime
3893 implementation, the remapping must pack *registers* in the same order as
3894 *entries*, with unused *entries* removed.
3895
3896.. _pal_global_internal_table:
3897
3898Global Internal Table
3899~~~~~~~~~~~~~~~~~~~~~
3900
3901The global internal table is a table of *shader resource descriptors* (SRDs) that
3902define how certain engine-wide, runtime-managed resources should be accessed
3903from a shader. The majority of these resources have HW-defined formats, and it
3904is up to the compiler to write/read data as required by the target hardware.
3905
3906The following table illustrates the required format:
3907
3908 .. table:: PAL Global Internal Table
3909 :name: pal-git-table
3910
3911 ============= ================================
3912 Offset Description
3913 ============= ================================
3914 0-3 Graphics Scratch SRD
3915 4-7 Compute Scratch SRD
3916 8-11 ES/GS Ring Output SRD
3917 12-15 ES/GS Ring Input SRD
3918 16-19 GS/VS Ring Output #0
3919 20-23 GS/VS Ring Output #1
3920 24-27 GS/VS Ring Output #2
3921 28-31 GS/VS Ring Output #3
3922 32-35 GS/VS Ring Input SRD
3923 36-39 Tessellation Factor Buffer SRD
3924 40-43 Off-Chip LDS Buffer SRD
3925 44-47 Off-Chip Param Cache Buffer SRD
3926 48-51 Sample Position Buffer SRD
3927 52 vaRange::ShadowDescriptorTable High Bits
3928 ============= ================================
3929
3930 The pointer to the global internal table passed to the shader as user data
3931 is a 32-bit pointer. The top 32 bits should be assumed to be the same as
3932 the top 32 bits of the pipeline, so the shader may use the program
3933 counter's top 32 bits.
3934
Tony Tye46d35762017-08-15 20:47:41 +00003935Unspecified OS
3936--------------
3937
3938This section provides code conventions used when the target triple OS is
3939empty (see :ref:`amdgpu-target-triples`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003940
3941Trap Handler ABI
3942~~~~~~~~~~~~~~~~
3943
3944For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
3945not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
3946instructions are handled as follows:
3947
3948 .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
3949 :name: amdgpu-trap-handler-for-non-amdhsa-os-table
3950
3951 =============== =============== ===========================================
3952 Usage Code Sequence Description
3953 =============== =============== ===========================================
3954 llvm.trap s_endpgm Causes wavefront to be terminated.
3955 llvm.debugtrap *none* Compiler warning given that there is no
3956 trap handler installed.
3957 =============== =============== ===========================================
3958
3959Source Languages
3960================
3961
3962.. _amdgpu-opencl:
3963
3964OpenCL
3965------
3966
Tony Tyef16a45e2017-06-06 20:31:59 +00003967When the language is OpenCL the following differences occur:
3968
39691. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
Tony Tye7a893d42018-03-23 18:45:18 +000039702. The AMDGPU backend appends additional arguments to the kernel's explicit
3971 arguments for the AMDHSA OS (see
3972 :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
Tony Tye46d35762017-08-15 20:47:41 +000039733. Additional metadata is generated
Tony Tye7a893d42018-03-23 18:45:18 +00003974 (see :ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tyef16a45e2017-06-06 20:31:59 +00003975
Tony Tye7a893d42018-03-23 18:45:18 +00003976 .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
3977 :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
3978
3979 ======== ==== ========= ===========================================
3980 Position Byte Byte Description
3981 Size Alignment
3982 ======== ==== ========= ===========================================
Tony Tye88441a32018-03-23 18:58:47 +00003983 1 8 8 OpenCL Global Offset X
3984 2 8 8 OpenCL Global Offset Y
3985 3 8 8 OpenCL Global Offset Z
3986 4 8 8 OpenCL address of printf buffer
3987 5 8 8 OpenCL address of virtual queue used by
3988 enqueue_kernel.
3989 6 8 8 OpenCL address of AqlWrap struct used by
3990 enqueue_kernel.
Tony Tye7a893d42018-03-23 18:45:18 +00003991 ======== ==== ========= ===========================================
Tony Tyef16a45e2017-06-06 20:31:59 +00003992
3993.. _amdgpu-hcc:
3994
3995HCC
3996---
3997
Tony Tye7a893d42018-03-23 18:45:18 +00003998When the language is HCC the following differences occur:
Tony Tyef16a45e2017-06-06 20:31:59 +00003999
40001. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
4001
Tom Stellard45bb48e2015-06-13 03:28:10 +00004002Assembler
Tony Tyef16a45e2017-06-06 20:31:59 +00004003---------
Tom Stellard45bb48e2015-06-13 03:28:10 +00004004
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004005AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tyef59d0712017-11-10 20:51:43 +00004006It supports AMDGCN GFX6-GFX9.
Tom Stellard45bb48e2015-06-13 03:28:10 +00004007
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004008This section describes general syntax for instructions and operands.
4009
4010Instructions
4011~~~~~~~~~~~~
4012
4013.. toctree::
4014 :hidden:
4015
4016 AMDGPUAsmGFX7
4017 AMDGPUAsmGFX8
4018 AMDGPUAsmGFX9
4019 AMDGPUOperandSyntax
4020
4021An instruction has the following syntax:
4022
4023 *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*
4024
4025Note that operands are normally comma-separated while modifiers are space-separated.
4026
4027The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
4028
4029See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
4030:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
4031
4032Note that features under development are not included in this description.
4033
4034For more information about instructions, their semantics and supported combinations of
Tony Tyef16a45e2017-06-06 20:31:59 +00004035operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004036[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard45bb48e2015-06-13 03:28:10 +00004037
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004038Operands
Tony Tyef16a45e2017-06-06 20:31:59 +00004039~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00004040
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004041The following syntax for register operands is supported:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004042
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004043* SGPR registers: s0, ... or s[0], ...
4044* VGPR registers: v0, ... or v[0], ...
4045* TTMP registers: ttmp0, ... or ttmp[0], ...
4046* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
4047* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
4048* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
4049* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
4050* Register index expressions: v[2*2], s[1-1:2-1]
4051* 'off' indicates that an operand is not enabled
Tom Stellard45bb48e2015-06-13 03:28:10 +00004052
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004053Modifiers
4054~~~~~~~~~
Tom Stellard45bb48e2015-06-13 03:28:10 +00004055
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004056Detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004057
Tony Tyef16a45e2017-06-06 20:31:59 +00004058Instruction Examples
4059~~~~~~~~~~~~~~~~~~~~
4060
4061DS
Dmitry Preobrazhenskyc6d31e62018-03-12 15:55:08 +00004062++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004063
4064.. code-block:: nasm
4065
4066 ds_add_u32 v2, v4 offset:16
4067 ds_write_src2_b64 v2 offset0:4 offset1:8
4068 ds_cmpst_f32 v2, v4, v6
4069 ds_min_rtn_f64 v[8:9], v2, v[4:5]
4070
4071
4072For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
4073
Tony Tyef16a45e2017-06-06 20:31:59 +00004074FLAT
4075++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004076
4077.. code-block:: nasm
4078
4079 flat_load_dword v1, v[3:4]
4080 flat_store_dwordx3 v[3:4], v[5:7]
4081 flat_atomic_swap v1, v[3:4], v5 glc
4082 flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
4083 flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
4084
4085For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
4086
Tony Tyef16a45e2017-06-06 20:31:59 +00004087MUBUF
4088+++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004089
4090.. code-block:: nasm
4091
4092 buffer_load_dword v1, off, s[4:7], s1
4093 buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
4094 buffer_store_format_xy v[1:2], off, s[4:7], s1
4095 buffer_wbinvl1
4096 buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
4097
4098For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
4099
Tony Tyef16a45e2017-06-06 20:31:59 +00004100SMRD/SMEM
4101+++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004102
4103.. code-block:: nasm
4104
4105 s_load_dword s1, s[2:3], 0xfc
4106 s_load_dwordx8 s[8:15], s[2:3], s4
4107 s_load_dwordx16 s[88:103], s[2:3], s4
4108 s_dcache_inv_vol
4109 s_memtime s[4:5]
4110
4111For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
4112
Tony Tyef16a45e2017-06-06 20:31:59 +00004113SOP1
4114++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004115
4116.. code-block:: nasm
4117
4118 s_mov_b32 s1, s2
4119 s_mov_b64 s[0:1], 0x80000000
4120 s_cmov_b32 s1, 200
4121 s_wqm_b64 s[2:3], s[4:5]
4122 s_bcnt0_i32_b64 s1, s[2:3]
4123 s_swappc_b64 s[2:3], s[4:5]
4124 s_cbranch_join s[4:5]
4125
4126For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
4127
Tony Tyef16a45e2017-06-06 20:31:59 +00004128SOP2
4129++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004130
4131.. code-block:: nasm
4132
4133 s_add_u32 s1, s2, s3
4134 s_and_b64 s[2:3], s[4:5], s[6:7]
4135 s_cselect_b32 s1, s2, s3
4136 s_andn2_b32 s2, s4, s6
4137 s_lshr_b64 s[2:3], s[4:5], s6
4138 s_ashr_i32 s2, s4, s6
4139 s_bfm_b64 s[2:3], s4, s6
4140 s_bfe_i64 s[2:3], s[4:5], s6
4141 s_cbranch_g_fork s[4:5], s[6:7]
4142
4143For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
4144
Tony Tyef16a45e2017-06-06 20:31:59 +00004145SOPC
4146++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004147
4148.. code-block:: nasm
4149
4150 s_cmp_eq_i32 s1, s2
4151 s_bitcmp1_b32 s1, s2
4152 s_bitcmp0_b64 s[2:3], s4
4153 s_setvskip s3, s5
4154
4155For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
4156
Tony Tyef16a45e2017-06-06 20:31:59 +00004157SOPP
4158++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004159
4160.. code-block:: nasm
4161
4162 s_barrier
4163 s_nop 2
4164 s_endpgm
4165 s_waitcnt 0 ; Wait for all counters to be 0
4166 s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
4167 s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
4168 s_sethalt 9
4169 s_sleep 10
4170 s_sendmsg 0x1
4171 s_sendmsg sendmsg(MSG_INTERRUPT)
4172 s_trap 1
4173
4174For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
4175
4176Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledrue6ec4412017-01-14 11:37:01 +00004177of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard45bb48e2015-06-13 03:28:10 +00004178range or acceptable values.
4179
Tony Tyef16a45e2017-06-06 20:31:59 +00004180VALU
4181++++
Tom Stellard45bb48e2015-06-13 03:28:10 +00004182
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004183For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
4184the assembler will automatically use optimal encoding based on its operands.
4185To force specific encoding, one can add a suffix to the opcode of the instruction:
4186
4187* _e32 for 32-bit VOP1/VOP2/VOPC
4188* _e64 for 64-bit VOP3
4189* _dpp for VOP_DPP
4190* _sdwa for VOP_SDWA
4191
4192VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004193
4194.. code-block:: nasm
4195
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004196 v_mov_b32 v1, v2
4197 v_mov_b32_e32 v1, v2
4198 v_nop
4199 v_cvt_f64_i32_e32 v[1:2], v2
4200 v_floor_f32_e32 v1, v2
4201 v_bfrev_b32_e32 v1, v2
4202 v_add_f32_e32 v1, v2, v3
4203 v_mul_i32_i24_e64 v1, v2, 3
4204 v_mul_i32_i24_e32 v1, -3, v3
4205 v_mul_i32_i24_e32 v1, -100, v3
4206 v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
4207 v_max_f16_e32 v1, v2, v3
Tom Stellard45bb48e2015-06-13 03:28:10 +00004208
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004209VOP_DPP examples:
Tom Stellard45bb48e2015-06-13 03:28:10 +00004210
4211.. code-block:: nasm
4212
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004213 v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
4214 v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4215 v_mov_b32 v0, v0 wave_shl:1
4216 v_mov_b32 v0, v0 row_mirror
4217 v_mov_b32 v0, v0 row_bcast:31
4218 v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
4219 v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
4220 v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard347ac792015-06-26 21:15:07 +00004221
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004222VOP_SDWA examples:
4223
4224.. code-block:: nasm
4225
4226 v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
4227 v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
4228 v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
4229 v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
4230 v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
4231
4232For full list of supported instructions, refer to "Vector ALU instructions".
4233
4234HSA Code Object Directives
Tony Tyef16a45e2017-06-06 20:31:59 +00004235~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004236
4237AMDGPU ABI defines auxiliary data in output code object. In assembly source,
4238one can specify them with assembler directives.
Tom Stellard347ac792015-06-26 21:15:07 +00004239
4240.hsa_code_object_version major, minor
Tony Tyef16a45e2017-06-06 20:31:59 +00004241+++++++++++++++++++++++++++++++++++++
Tom Stellard347ac792015-06-26 21:15:07 +00004242
4243*major* and *minor* are integers that specify the version of the HSA code
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004244object that will be generated by the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004245
4246.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tyef16a45e2017-06-06 20:31:59 +00004247+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4248
Tom Stellard347ac792015-06-26 21:15:07 +00004249
4250*major*, *minor*, and *stepping* are all integers that describe the instruction
4251set architecture (ISA) version of the assembly program.
4252
4253*vendor* and *arch* are quoted strings. *vendor* should always be equal to
4254"AMD" and *arch* should always be equal to "AMDGPU".
4255
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004256By default, the assembler will derive the ISA version, *vendor*, and *arch*
4257from the value of the -mcpu option that is passed to the assembler.
Tom Stellard347ac792015-06-26 21:15:07 +00004258
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004259.amdgpu_hsa_kernel (name)
Tony Tyef16a45e2017-06-06 20:31:59 +00004260+++++++++++++++++++++++++
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004261
4262This directives specifies that the symbol with given name is a kernel entry point
4263(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellardff7416b2015-06-26 21:58:31 +00004264
4265.amd_kernel_code_t
Tony Tyef16a45e2017-06-06 20:31:59 +00004266++++++++++++++++++
Tom Stellardff7416b2015-06-26 21:58:31 +00004267
4268This directive marks the beginning of a list of key / value pairs that are used
4269to specify the amd_kernel_code_t object that will be emitted by the assembler.
4270The list must be terminated by the *.end_amd_kernel_code_t* directive. For
4271any amd_kernel_code_t values that are unspecified a default value will be
4272used. The default value for all keys is 0, with the following exceptions:
4273
4274- *kernel_code_version_major* defaults to 1.
4275- *machine_kind* defaults to 1.
4276- *machine_version_major*, *machine_version_minor*, and
4277 *machine_version_stepping* are derived from the value of the -mcpu option
4278 that is passed to the assembler.
4279- *kernel_code_entry_byte_offset* defaults to 256.
4280- *wavefront_size* defaults to 6.
4281- *kernarg_segment_alignment*, *group_segment_alignment*, and
Tony Tye6baa6d22017-10-18 22:16:55 +00004282 *private_segment_alignment* default to 4. Note that alignments are specified
Tom Stellardff7416b2015-06-26 21:58:31 +00004283 as a power of two, so a value of **n** means an alignment of 2^ **n**.
4284
4285The *.amd_kernel_code_t* directive must be placed immediately after the
4286function label and before any instructions.
4287
Nikolay Haustov96a56bd2016-09-20 09:04:51 +00004288For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
4289comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellardff7416b2015-06-26 21:58:31 +00004290
4291Here is an example of a minimal amd_kernel_code_t specification:
4292
Aaron Ballman887ad0e2016-07-19 17:46:55 +00004293.. code-block:: none
Tom Stellardff7416b2015-06-26 21:58:31 +00004294
4295 .hsa_code_object_version 1,0
4296 .hsa_code_object_isa
4297
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004298 .hsatext
4299 .globl hello_world
4300 .p2align 8
4301 .amdgpu_hsa_kernel hello_world
Tom Stellardff7416b2015-06-26 21:58:31 +00004302
4303 hello_world:
4304
4305 .amd_kernel_code_t
4306 enable_sgpr_kernarg_segment_ptr = 1
4307 is_ptr64 = 1
4308 compute_pgm_rsrc1_vgprs = 0
4309 compute_pgm_rsrc1_sgprs = 0
4310 compute_pgm_rsrc2_user_sgpr = 2
4311 kernarg_segment_byte_size = 8
4312 wavefront_sgpr_count = 2
4313 workitem_vgpr_count = 3
4314 .end_amd_kernel_code_t
4315
4316 s_load_dwordx2 s[0:1], s[0:1] 0x0
4317 v_mov_b32 v0, 3.14159
4318 s_waitcnt lgkmcnt(0)
4319 v_mov_b32 v1, s0
4320 v_mov_b32 v2, s1
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004321 flat_store_dword v[1:2], v0
Tom Stellardff7416b2015-06-26 21:58:31 +00004322 s_endpgm
Sylvestre Ledrua7de9822016-02-23 11:17:27 +00004323 .Lfunc_end0:
Tom Stellardb8a91bb2016-02-22 18:36:00 +00004324 .size hello_world, .Lfunc_end0-hello_world
Tony Tyef16a45e2017-06-06 20:31:59 +00004325
4326Additional Documentation
4327========================
4328
Konstantin Zhuravlyov265d2532017-10-18 17:59:20 +00004329.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
4330.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
4331.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
4332.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
4333.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
4334.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
4335.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
4336.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004337.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
4338.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
4339.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
4340.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
4341.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00004342.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tyef16a45e2017-06-06 20:31:59 +00004343.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
4344.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__