=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
family through the current GCN families. It lives in the
``lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
specify the target triple:

  .. table:: AMDGPU Architectures
     :name: amdgpu-architecture-table

     ============ ==============================================================
     Architecture Description
     ============ ==============================================================
     ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
     ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
     ============ ==============================================================

  .. table:: AMDGPU Vendors
     :name: amdgpu-vendor-table

     ============ ==============================================================
     Vendor       Description
     ============ ==============================================================
     ``amd``      Can be used for all AMD GPU usage.
     ``mesa3d``   Can be used if the OS is ``mesa3d``.
     ============ ==============================================================

  .. table:: AMDGPU Operating Systems
     :name: amdgpu-os-table

     ============== ============================================================
     OS             Description
     ============== ============================================================
     *<empty>*      Defaults to the *unknown* OS.
     ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                    such as AMD's ROCm [AMD-ROCm]_.
     ``amdpal``     Graphic shaders and compute kernels executed on AMD PAL
                    runtime.
     ``mesa3d``     Graphic shaders and compute kernels executed on Mesa 3D
                    runtime.
     ============== ============================================================

  .. table:: AMDGPU Environments
     :name: amdgpu-environment-table

     ============ ==============================================================
     Environment  Description
     ============ ==============================================================
     *<empty>*    Default.
     ============ ==============================================================

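
As a rough illustration of how the four tables above combine, the following
Python sketch joins the components into a triple string and rejects values that
the tables do not list. The helper name ``make_triple`` is hypothetical, and
dropping trailing empty components is an assumption made for this example
rather than a rule stated by this guide.

.. code-block:: python

   # Values copied from the architecture, vendor, OS and environment tables.
   ARCHITECTURES = {"r600", "amdgcn"}
   VENDORS = {"amd", "mesa3d"}
   OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}
   ENVIRONMENTS = {""}


   def make_triple(arch, vendor, os="", environment=""):
       """Hypothetical helper: build a triple from the listed components."""
       if arch not in ARCHITECTURES:
           raise ValueError(f"unknown architecture: {arch!r}")
       if vendor not in VENDORS:
           raise ValueError(f"unknown vendor: {vendor!r}")
       if os not in OPERATING_SYSTEMS:
           raise ValueError(f"unknown OS: {os!r}")
       if environment not in ENVIRONMENTS:
           raise ValueError(f"unknown environment: {environment!r}")
       # Per the vendor table, ``mesa3d`` can be used if the OS is ``mesa3d``.
       if vendor == "mesa3d" and os != "mesa3d":
           raise ValueError("vendor 'mesa3d' requires OS 'mesa3d'")
       parts = [arch, vendor, os, environment]
       # Assumption: trailing empty components are simply omitted.
       while parts and parts[-1] == "":
           parts.pop()
       return "-".join(parts)

For instance, ``make_triple("amdgcn", "amd", "amdhsa")`` yields
``amdgcn-amd-amdhsa``.
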
.. _amdgpu-processors:

Processors
----------

Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor.
Names from either the *Processor* or the *Alternative Processor* column can be
used.

  .. table:: AMDGPU Processors
     :name: amdgpu-processor-table

     =========== =============== ============ ===== ========= ======= ==================
     Processor   Alternative     Target       dGPU/ Target    ROCm    Example
                 Processor       Triple       APU   Features  Support Products
                                 Architecture       Supported
                                                    [Default]
     =========== =============== ============ ===== ========= ======= ==================
     **Radeon HD 2000/3000 Series (R600)** [AMD-RADEON-HD-2000-3000]_
     -----------------------------------------------------------------------------------
     ``r600``                    ``r600``     dGPU
     ``r630``                    ``r600``     dGPU
     ``rs880``                   ``r600``     dGPU
     ``rv670``                   ``r600``     dGPU
     **Radeon HD 4000 Series (R700)** [AMD-RADEON-HD-4000]_
     -----------------------------------------------------------------------------------
     ``rv710``                   ``r600``     dGPU
     ``rv730``                   ``r600``     dGPU
     ``rv770``                   ``r600``     dGPU
     **Radeon HD 5000 Series (Evergreen)** [AMD-RADEON-HD-5000]_
     -----------------------------------------------------------------------------------
     ``cedar``                   ``r600``     dGPU
     ``cypress``                 ``r600``     dGPU
     ``juniper``                 ``r600``     dGPU
     ``redwood``                 ``r600``     dGPU
     ``sumo``                    ``r600``     dGPU
     **Radeon HD 6000 Series (Northern Islands)** [AMD-RADEON-HD-6000]_
     -----------------------------------------------------------------------------------
     ``barts``                   ``r600``     dGPU
     ``caicos``                  ``r600``     dGPU
     ``cayman``                  ``r600``     dGPU
     ``turks``                   ``r600``     dGPU
     **GCN GFX6 (Southern Islands (SI))** [AMD-GCN-GFX6]_
     -----------------------------------------------------------------------------------
     ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
     ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
                 - ``oland``
                 - ``pitcairn``
                 - ``verde``
     **GCN GFX7 (Sea Islands (CI))** [AMD-GCN-GFX7]_
     -----------------------------------------------------------------------------------
     ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - A6-7000
                                                                      - A6 Pro-7050B
                                                                      - A8-7100
                                                                      - A8 Pro-7150B
                                                                      - A10-7300
                                                                      - A10 Pro-7350B
                                                                      - FX-7500
                                                                      - A8-7200P
                                                                      - A10-7400P
                                                                      - FX-7600P
     ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU            ROCm    - FirePro W8100
                                                                      - FirePro W9100
                                                                      - FirePro S9150
                                                                      - FirePro S9170
     ``gfx702``                  ``amdgcn``   dGPU            ROCm    - Radeon R9 290
                                                                      - Radeon R9 290x
                                                                      - Radeon R390
                                                                      - Radeon R390x
     ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - E1-2100
                 - ``mullins``                                        - E1-2200
                                                                      - E1-2500
                                                                      - E2-3000
                                                                      - E2-3800
                                                                      - A4-5000
                                                                      - A4-5100
                                                                      - A6-5200
                                                                      - A4 Pro-3340B
     ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Radeon HD 7790
                                                                      - Radeon HD 8770
                                                                      - R7 260
                                                                      - R7 260X
     **GCN GFX8 (Volcanic Islands (VI))** [AMD-GCN-GFX8]_
     -----------------------------------------------------------------------------------
     ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - A6-8500P
                                                      [on]            - Pro A6-8500B
                                                                      - A8-8600P
                                                                      - Pro A8-8600B
                                                                      - FX-8800P
                                                                      - Pro A12-8800B
     \                           ``amdgcn``   APU   - xnack   ROCm    - A10-8700P
                                                      [on]            - Pro A10-8700B
                                                                      - A10-8780P
     \                           ``amdgcn``   APU   - xnack           - A10-9600P
                                                      [on]            - A10-9630P
                                                                      - A12-9700P
                                                                      - A12-9730P
                                                                      - FX-9800P
                                                                      - FX-9830P
     \                           ``amdgcn``   APU   - xnack           - E2-9010
                                                      [on]            - A6-9210
                                                                      - A9-9410
     ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack   ROCm    - FirePro S7150
                 - ``tonga``                          [off]           - FirePro S7100
                                                                      - FirePro W7100
                                                                      - Radeon R285
                                                                      - Radeon R9 380
                                                                      - Radeon R9 385
                                                                      - Mobile FirePro
                                                                        M7170
     ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack   ROCm    - Radeon R9 Nano
                                                      [off]           - Radeon R9 Fury
                                                                      - Radeon R9 FuryX
                                                                      - Radeon Pro Duo
                                                                      - FirePro S9300x2
                                                                      - Radeon Instinct MI8
     \           - ``polaris10`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 470
                                                      [off]           - Radeon RX 480
                                                                      - Radeon Instinct MI6
     \           - ``polaris11`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 460
                                                      [off]
     ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
                                                      [on]
     **GCN GFX9** [AMD-GCN-GFX9]_
     -----------------------------------------------------------------------------------
     ``gfx900``                  ``amdgcn``   dGPU  - xnack   ROCm    - Radeon Vega
                                                      [off]             Frontier Edition
                                                                      - Radeon RX Vega 56
                                                                      - Radeon RX Vega 64
                                                                      - Radeon RX Vega 64
                                                                        Liquid
                                                                      - Radeon Instinct MI25
     ``gfx902``                  ``amdgcn``   APU   - xnack           - Ryzen 3 2200G
                                                      [on]            - Ryzen 5 2400G
     =========== =============== ============ ===== ========= ======= ==================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor-specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor, and the default value
used if not specified explicitly, are listed in
:ref:`amdgpu-processor-table`.

Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
target features.

For example:

``-mxnack``
  Enable the ``xnack`` feature.
``-mno-xnack``
  Disable the ``xnack`` feature.

  .. table:: AMDGPU Target Features
     :name: amdgpu-target-feature-table

     ============== ==================================================
     Target Feature Description
     ============== ==================================================
     -m[no-]xnack   Enable/disable generating code that has
                    memory clauses that are compatible with
                    having XNACK replay enabled.

                    This is used for demand paging and page
                    migration. If XNACK replay is enabled in
                    the device, then if a page fault occurs
                    the code may execute incorrectly if the
                    ``xnack`` feature is not enabled. Executing
                    code that has the feature enabled on a
                    device that does not have XNACK replay
                    enabled will execute correctly, but may
                    be less performant than code with the
                    feature disabled.
     ============== ==================================================

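
The options described so far can be assembled into a single compiler
invocation. The Python sketch below only builds the argument list and does not
run anything. The helper name ``clang_args`` is hypothetical, and the
``-mcpu=<Processor>`` spelling with an equals sign is an assumption of this
example.

.. code-block:: python

   def clang_args(triple, processor, xnack=None):
       """Hypothetical helper: build a clang argument list for an AMD GPU.

       ``xnack`` may be True (-mxnack), False (-mno-xnack), or None to keep
       the processor's default listed in the processor table.
       """
       # Assumption: the processor is passed in the ``-mcpu=<name>`` form.
       args = ["clang", "-target", triple, f"-mcpu={processor}"]
       if xnack is True:
           args.append("-mxnack")
       elif xnack is False:
           args.append("-mno-xnack")
       return args

Leaving ``xnack`` as ``None`` adds neither option, so the default for the
selected processor applies.
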
.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU backend uses the following address space mappings.

The memory space names used in the table, aside from the region memory space,
are from the OpenCL standard.

The LLVM address space number is used throughout LLVM (for example, in LLVM IR).

  .. table:: Address Space Mapping
     :name: amdgpu-address-space-mapping-table

     ================== =================
     LLVM Address Space Memory Space
     ================== =================
     0                  Generic (Flat)
     1                  Global
     2                  Region (GDS)
     3                  Local (group/LDS)
     4                  Constant
     5                  Private (Scratch)
     6                  Constant 32-bit
     ================== =================

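
A tool that processes LLVM IR for this target might mirror the mapping above as
an enumeration. The following Python sketch is one way to do that. The names
``AddressSpace`` and ``ir_pointer`` are hypothetical, and the typed-pointer
spelling ``<type> addrspace(N)*`` is assumed for illustration.

.. code-block:: python

   from enum import IntEnum


   class AddressSpace(IntEnum):
       """LLVM address space numbers from the Address Space Mapping table."""
       GENERIC = 0         # Generic (Flat)
       GLOBAL = 1          # Global
       REGION = 2          # Region (GDS)
       LOCAL = 3           # Local (group/LDS)
       CONSTANT = 4        # Constant
       PRIVATE = 5         # Private (Scratch)
       CONSTANT_32BIT = 6  # Constant 32-bit


   def ir_pointer(pointee, space):
       """Hypothetical helper: spell a typed pointer in LLVM IR text."""
       # Assumption: address space 0 is written without a qualifier.
       if space == AddressSpace.GENERIC:
           return f"{pointee}*"
       return f"{pointee} addrspace({int(space)})*"

For example, ``ir_pointer("i32", AddressSpace.LOCAL)`` gives
``i32 addrspace(3)*``.
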
.. _amdgpu-memory-scopes:

Memory Scopes
-------------

This section lists the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).

The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
relation is transitive over the synchronizes-with relation independent of scope,
and synchronizes-with allows the memory scope instances to be inclusive (see
table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This differs from the OpenCL [OpenCL]_ memory model, which does not have scope
inclusion and requires the memory scopes to exactly match. However, this
is conservatively correct for OpenCL.

  .. table:: AMDHSA LLVM Sync Scopes
     :name: amdgpu-amdhsa-llvm-sync-scopes-table

     ================ ==========================================================
     LLVM Sync Scope  Description
     ================ ==========================================================
     *none*           The default: ``system``.

                      Synchronizes with, and participates in modification and
                      seq_cst total orderings with, other operations (except
                      image operations) for all address spaces (except private,
                      or generic that accesses private) provided the other
                      operation's sync scope is:

                      - ``system``.
                      - ``agent`` and executed by a thread on the same agent.
                      - ``workgroup`` and executed by a thread in the same
                        workgroup.
                      - ``wavefront`` and executed by a thread in the same
                        wavefront.

     ``agent``        Synchronizes with, and participates in modification and
                      seq_cst total orderings with, other operations (except
                      image operations) for all address spaces (except private,
                      or generic that accesses private) provided the other
                      operation's sync scope is:

                      - ``system`` or ``agent`` and executed by a thread on the
                        same agent.
                      - ``workgroup`` and executed by a thread in the same
                        workgroup.
                      - ``wavefront`` and executed by a thread in the same
                        wavefront.

     ``workgroup``    Synchronizes with, and participates in modification and
                      seq_cst total orderings with, other operations (except
                      image operations) for all address spaces (except private,
                      or generic that accesses private) provided the other
                      operation's sync scope is:

                      - ``system``, ``agent`` or ``workgroup`` and executed by a
                        thread in the same workgroup.
                      - ``wavefront`` and executed by a thread in the same
                        wavefront.

     ``wavefront``    Synchronizes with, and participates in modification and
                      seq_cst total orderings with, other operations (except
                      image operations) for all address spaces (except private,
                      or generic that accesses private) provided the other
                      operation's sync scope is:

                      - ``system``, ``agent``, ``workgroup`` or ``wavefront``
                        and executed by a thread in the same wavefront.

     ``singlethread`` Only synchronizes with, and participates in modification
                      and seq_cst total orderings with, other operations (except
                      image operations) running in the same thread for all
                      address spaces (for example, in signal handlers).
     ================ ==========================================================

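
The scope-inclusion rule in the table can be read as follows: two operations
can synchronize when both threads belong to the same instance of the narrower
of the two scopes. The Python sketch below models only that rule. The
``Thread`` tuple and function names are hypothetical, and letting
``singlethread`` pair with any wider scope in the same thread is an assumption
of this example. The address space and image operation exceptions are not
modeled.

.. code-block:: python

   from collections import namedtuple

   # Hypothetical description of where a thread executes.
   Thread = namedtuple("Thread", "agent workgroup wavefront lane")

   # Sync scopes ordered from narrowest to widest. The *none* scope of the
   # table is the default and is spelled "system" here.
   SCOPES = ["singlethread", "wavefront", "workgroup", "agent", "system"]

   # Number of leading Thread fields that must match for each scope.
   _DEPTH = {"system": 0, "agent": 1, "workgroup": 2,
             "wavefront": 3, "singlethread": 4}


   def same_instance(scope, a, b):
       """True if threads a and b are in the same instance of ``scope``."""
       depth = _DEPTH[scope]
       return a[:depth] == b[:depth]


   def scopes_inclusive(scope_a, thread_a, scope_b, thread_b):
       """True if the two operations' scopes include each other's thread."""
       narrower = min(scope_a, scope_b, key=SCOPES.index)
       return same_instance(narrower, thread_a, thread_b)

Under this model an ``agent`` operation and a ``workgroup`` operation can
synchronize only when both threads are in the same workgroup, matching the
``agent`` row of the table.
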
AMDGPU Intrinsics
-----------------

The AMDGPU backend implements the following intrinsics.

*This section is WIP.*

.. TODO
   List AMDGPU intrinsics

Code Object
===========

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

Header
------

The AMDGPU backend uses the following ELF header:

  .. table:: AMDGPU ELF Header
     :name: amdgpu-elf-header-table

     ========================== ===============================
     Field                      Value
     ========================== ===============================
     ``e_ident[EI_CLASS]``      - ``ELFCLASS32``
                                - ``ELFCLASS64``
     ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
     ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                                - ``ELFOSABI_AMDGPU_HSA``
                                - ``ELFOSABI_AMDGPU_PAL``
                                - ``ELFOSABI_AMDGPU_MESA3D``
     ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
                                - ``ELFABIVERSION_AMDGPU_PAL``
                                - ``ELFABIVERSION_AMDGPU_MESA3D``
     ``e_type``                 - ``ET_REL``
                                - ``ET_DYN``
     ``e_machine``              ``EM_AMDGPU``
     ``e_entry``                0
     ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
     ========================== ===============================

..

  .. table:: AMDGPU ELF Header Enumeration Values
     :name: amdgpu-elf-header-enumeration-values-table

     =============================== =====
     Name                            Value
     =============================== =====
     ``EM_AMDGPU``                   224
     ``ELFOSABI_NONE``               0
     ``ELFOSABI_AMDGPU_HSA``         64
     ``ELFOSABI_AMDGPU_PAL``         65
     ``ELFOSABI_AMDGPU_MESA3D``      66
     ``ELFABIVERSION_AMDGPU_HSA``    1
     ``ELFABIVERSION_AMDGPU_PAL``    0
     ``ELFABIVERSION_AMDGPU_MESA3D`` 0
     =============================== =====

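
To show how these values appear in a file, the Python sketch below inspects the
first 20 bytes of a code object. The ``e_ident`` indices and the ``ET_REL`` and
``ET_DYN`` numbers come from the generic ELF specification rather than from
this guide, and the helper name ``describe_header`` is hypothetical.

.. code-block:: python

   import struct

   # Generic ELF constants (from the ELF specification).
   EI_CLASS, EI_DATA, EI_OSABI, EI_ABIVERSION = 4, 5, 7, 8
   ELFCLASS32, ELFCLASS64 = 1, 2
   ELFDATA2LSB = 1
   ET_REL, ET_DYN = 1, 3

   # AMDGPU values from the enumeration table above.
   EM_AMDGPU = 224
   OSABI_NAMES = {
       0: "ELFOSABI_NONE",
       64: "ELFOSABI_AMDGPU_HSA",
       65: "ELFOSABI_AMDGPU_PAL",
       66: "ELFOSABI_AMDGPU_MESA3D",
   }


   def describe_header(data):
       """Hypothetical helper: summarize the start of an AMDGPU ELF header."""
       if data[:4] != b"\x7fELF":
           raise ValueError("not an ELF file")
       if data[EI_DATA] != ELFDATA2LSB:
           raise ValueError("AMDGPU code objects are little-endian")
       # e_type and e_machine are 16-bit fields after the 16-byte e_ident.
       e_type, e_machine = struct.unpack_from("<HH", data, 16)
       if e_machine != EM_AMDGPU:
           raise ValueError("e_machine is not EM_AMDGPU")
       classes = {ELFCLASS32: "ELFCLASS32", ELFCLASS64: "ELFCLASS64"}
       types = {ET_REL: "ET_REL", ET_DYN: "ET_DYN"}
       return {
           "class": classes.get(data[EI_CLASS], "unknown"),
           "osabi": OSABI_NAMES.get(data[EI_OSABI], "unknown"),
           "abiversion": data[EI_ABIVERSION],
           "type": types.get(e_type, "other"),
       }

The function accepts any bytes-like object, so the first 20 bytes read from a
file opened in binary mode are enough.
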
``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for ``r600`` architecture.

  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64-bit
    applications.

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMD GPU architecture specific OS ABIs
  (see :ref:`amdgpu-os-table`):

  * ``ELFOSABI_NONE`` for *unknown* OS.

  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMD GPU architecture specific OS ABI to which the code
  object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
    3D runtime ABI.

``e_type``
  Can be one of the following values:

  ``ET_REL``
    The type produced by the AMD GPU backend compiler as it is a relocatable
    code object.

  ``ET_DYN``
    The type produced by the linker as it is a shared code object.

  The AMD HSA runtime loader requires an ``ET_DYN`` code object.

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
  :ref:`amdgpu-elf-header-e_flags-table`).

``e_entry``
  The entry point is 0 as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags``
     :name: amdgpu-elf-header-e_flags-table

     ================================= ========== =============================
     Name                              Value      Description
     ================================= ========== =============================
     **AMDGPU Processor Flag**                    See :ref:`amdgpu-processor-table`.
     -------------------------------------------- -----------------------------
     ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
                                                  mask for
                                                  ``EF_AMDGPU_MACH_xxx`` values
                                                  defined in
                                                  :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
                                                  target feature is
                                                  enabled for all code
                                                  contained in the code object.
                                                  If the processor
                                                  does not support the
                                                  ``xnack`` target
                                                  feature then must
                                                  be 0.
                                                  See
                                                  :ref:`amdgpu-target-features`.
     ================================= ========== =============================

  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
     :name: amdgpu-ef-amdgpu-mach-table

     ================================= ========== =============================
     Name                              Value      Description (see
                                                  :ref:`amdgpu-processor-table`)
     ================================= ========== =============================
     ``EF_AMDGPU_MACH_NONE``           0x000      *not specified*
     ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
     ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
     ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
     ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
     ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
     ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
     ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
     ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
     ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
     ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
     ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
     ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
     ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
     ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
     ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
     ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
     *reserved*                        0x011 -    Reserved for ``r600``
                                       0x01f      architecture processors.
     ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
     ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
     ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
     ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
     ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
     ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
     ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
     *reserved*                        0x027      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
     ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
     ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
     ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
     ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
     ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
     *reserved*                        0x02e      Reserved.
     *reserved*                        0x02f      Reserved.
     *reserved*                        0x030      Reserved.
     ================================= ========== =============================

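
The two ``e_flags`` fields can be separated with the masks from the table
above. The Python sketch below is a minimal illustration. The helper name
``decode_e_flags`` is hypothetical, and ``MACH_NAMES`` deliberately holds only
a handful of the ``EF_AMDGPU_MACH`` values.

.. code-block:: python

   EF_AMDGPU_MACH = 0x000000FF   # processor selection mask
   EF_AMDGPU_XNACK = 0x00000100  # xnack target feature enabled

   # Illustrative subset of the EF_AMDGPU_MACH values table.
   MACH_NAMES = {
       0x020: "gfx600",
       0x028: "gfx801",
       0x02A: "gfx803",
       0x02C: "gfx900",
       0x02D: "gfx902",
   }


   def decode_e_flags(e_flags):
       """Hypothetical helper: return (processor name, xnack enabled)."""
       mach = e_flags & EF_AMDGPU_MACH
       name = MACH_NAMES.get(mach, f"unlisted (0x{mach:03x})")
       return name, bool(e_flags & EF_AMDGPU_XNACK)

With these definitions, an ``e_flags`` value of ``0x12d`` decodes to ``gfx902``
with ``xnack`` enabled.
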
Sections
--------

An AMDGPU target ELF code object has the standard ELF sections which include:

  .. table:: AMDGPU ELF Sections
     :name: amdgpu-elf-sections-table

     ================== ================ =================================
     Name               Type             Attributes
     ================== ================ =================================
     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.debug_``\ *\**  ``SHT_PROGBITS`` *none*
     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
     ``.note``          ``SHT_NOTE``     *none*
     ``.rela``\ *name*  ``SHT_RELA``     *none*
     ``.rela.dyn``      ``SHT_RELA``     *none*
     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.shstrtab``      ``SHT_STRTAB``   *none*
     ``.strtab``        ``SHT_STRTAB``   *none*
     ``.symtab``        ``SHT_SYMTAB``   *none*
     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
     ================== ================ =================================

These sections have their standard meanings (see [ELF]_) and are only generated
if needed.

``.debug_``\ *\**
  The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
  DWARF produced by the AMDGPU backend.

``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
  The standard sections used by a dynamic loader.

``.note``
  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
  backend.

``.rela``\ *name*, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.

  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code object's ``.rela``\ *name* sections.

  See :ref:`amdgpu-relocation-records` for the relocation records supported by
  the AMDGPU backend.

``.text``
  The executable machine code for the kernels and functions they call. Generated
  as position independent code. See :ref:`amdgpu-code-conventions` for
  information on conventions used in the ISA generation.

.. _amdgpu-note-records:

Note Records
------------

As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
aligned. In addition, minimal zero byte padding must be generated to ensure the
``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
``.note`` section must be at least 4 to indicate at least 4 byte alignment.

The AMDGPU backend code object uses the following ELF note records in the
``.note`` section. The *Description* column specifies the layout of the note
record's ``desc`` field. All fields are consecutive bytes. Note records with
variable size strings have a corresponding ``*_size`` field that specifies the
number of bytes, including the terminating null character, in the string. The
string(s) come immediately after the preceding fields.

Additional note records can be present.

  .. table:: AMDGPU ELF Note Records
     :name: amdgpu-elf-note-records-table

     ===== ============================== ======================================
     Name  Type                           Description
     ===== ============================== ======================================
     "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
     ===== ============================== ======================================

..

  .. table:: AMDGPU ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-table

     ============================== =====
     Name                           Value
     ============================== =====
     *reserved*                     0-9
     ``NT_AMD_AMDGPU_HSA_METADATA`` 10
     *reserved*                     11
     ============================== =====

``NT_AMD_AMDGPU_HSA_METADATA``
  Specifies extensible metadata associated with the code objects executed on HSA
  [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
  the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
  :ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
  object metadata string.

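
The padding rules above can be illustrated with a small Python sketch that lays
out one note record. It assumes the generic ELF note layout of three 32-bit
words (``namesz``, ``descsz``, ``type``) followed by the padded ``name`` and
``desc`` fields. The helper names are hypothetical, and the metadata text
passed in is treated as an opaque string.

.. code-block:: python

   import struct

   NT_AMD_AMDGPU_HSA_METADATA = 10  # from the enumeration values table


   def pad4(data):
       """Append the minimal zero byte padding to reach a multiple of 4."""
       return data + b"\0" * (-len(data) % 4)


   def build_metadata_note(metadata):
       """Hypothetical helper: encode one "AMD" metadata note record."""
       name = b"AMD\0"                            # size includes the null
       desc = metadata.encode("utf-8") + b"\0"    # null terminated string
       # namesz and descsz record the unpadded sizes; little-endian words.
       header = struct.pack("<III", len(name), len(desc),
                            NT_AMD_AMDGPU_HSA_METADATA)
       return header + pad4(name) + pad4(desc)

The size fields record the unpadded lengths, while the emitted ``name`` and
``desc`` bytes are each padded so that whatever follows stays 4 byte aligned.
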
.. _amdgpu-symbols:

Symbols
-------

Symbols include the following:

  .. table:: AMDGPU ELF Symbols
     :name: amdgpu-elf-symbols-table

     ===================== ============== ============= ==================
     Name                  Type           Section       Description
     ===================== ============== ============= ==================
     *link-name*           ``STT_OBJECT`` - ``.data``   Global variable
                                          - ``.rodata``
                                          - ``.bss``
     *link-name*\ ``@kd``  ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
     *link-name*           ``STT_FUNC``   - ``.text``   Kernel entry point
     ===================== ============== ============= ==================

Global variable
  Global variables both used and defined by the compilation unit.

  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.

  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
  will resolve relocations using the definition provided by another code object
  or explicitly defined by the runtime.

  All global symbols, whether defined in the compilation unit or external, are
  accessed by the machine code indirectly through a GOT entry. This allows them
  to be preemptable. The GOT is only supported when the target triple OS is
  ``amdhsa`` (see :ref:`amdgpu-target-triples`).

  .. TODO
     Add description of linked shared object symbols. Seems undefined symbols
     are marked as STT_NOTYPE.

Kernel descriptor
  Every HSA kernel has an associated kernel descriptor. It is the address of the
  kernel descriptor that is used in the AQL dispatch packet used to invoke the
  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.

Kernel entry point
  Every HSA kernel also has a symbol for its machine code entry point.

.. _amdgpu-relocation-records:

Relocation Records
------------------

The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
relocatable fields are:

``word32``
  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMD GPU architecture.

``word64``
  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMD GPU architecture.

The following notations are used for specifying relocation calculations:

**A**
  Represents the addend used to compute the value of the relocatable field.

**G**
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.

**GOT**
  Represents the address of the global offset table.

**P**
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
  of the storage unit being relocated (computed using ``r_offset``).

**S**
  Represents the value of the symbol whose index resides in the relocation
  entry. Relocations not using this must specify a symbol index of
  ``STN_UNDEF``.

**B**
  Represents the base address of a loaded executable or shared object which is
  the difference between the ELF address and the actual load address.
  Relocations using this are only valid in executable or shared objects.

The following relocation types are supported:

  .. table:: AMDGPU ELF Relocation Records
     :name: amdgpu-elf-relocation-records-table

     ========================== ======= ===== ========== ==============================
     Relocation Type            Kind    Value Field      Calculation
     ========================== ======= ===== ========== ==============================
     ``R_AMDGPU_NONE``                  0     *none*     *none*
     ``R_AMDGPU_ABS32_LO``      Static, 1     ``word32`` (S + A) & 0xFFFFFFFF
                                Dynamic
     ``R_AMDGPU_ABS32_HI``      Static, 2     ``word32`` (S + A) >> 32
                                Dynamic
     ``R_AMDGPU_ABS64``         Static, 3     ``word64`` S + A
                                Dynamic
     ``R_AMDGPU_REL32``         Static  4     ``word32`` S + A - P
     ``R_AMDGPU_REL64``         Static  5     ``word64`` S + A - P
     ``R_AMDGPU_ABS32``         Static, 6     ``word32`` S + A
                                Dynamic
     ``R_AMDGPU_GOTPCREL``      Static  7     ``word32`` G + GOT + A - P
     ``R_AMDGPU_GOTPCREL32_LO`` Static  8     ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_GOTPCREL32_HI`` Static  9     ``word32`` (G + GOT + A - P) >> 32
     ``R_AMDGPU_REL32_LO``      Static  10    ``word32`` (S + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_REL32_HI``      Static  11    ``word32`` (S + A - P) >> 32
     *reserved*                         12
     ``R_AMDGPU_RELATIVE64``    Dynamic 13    ``word64`` B + A
     ========================== ======= ===== ========== ==============================

``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.

There is no current OS loader support for 32-bit programs and so
``R_AMDGPU_ABS32`` is not used.

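
The calculations in the relocation table can be sketched in Python as shown
below. This is only an illustration, not code taken from LLVM or any loader:
the function name ``relocation_value`` is hypothetical, and the sketch assumes
64 bit two's complement arithmetic with the result truncated to the width of
the relocated field.

.. code-block:: python

   MASK32 = 0xFFFFFFFF
   MASK64 = 0xFFFFFFFFFFFFFFFF


   def relocation_value(rtype, S=0, A=0, P=0, G=0, GOT=0, B=0):
       """Evaluate the Calculation column for one relocation type.

       Hypothetical helper: the argument names follow the notation above.
       Intermediate results wrap to 64 bits (an assumption of this sketch).
       """
       abs_value = (S + A) & MASK64            # S + A
       rel_value = (S + A - P) & MASK64        # S + A - P
       got_value = (G + GOT + A - P) & MASK64  # G + GOT + A - P
       calculations = {
           "R_AMDGPU_ABS32_LO": abs_value & MASK32,
           "R_AMDGPU_ABS32_HI": abs_value >> 32,
           "R_AMDGPU_ABS64": abs_value,
           "R_AMDGPU_REL32": rel_value & MASK32,
           "R_AMDGPU_REL64": rel_value,
           "R_AMDGPU_ABS32": abs_value & MASK32,
           "R_AMDGPU_GOTPCREL": got_value & MASK32,
           "R_AMDGPU_GOTPCREL32_LO": got_value & MASK32,
           "R_AMDGPU_GOTPCREL32_HI": got_value >> 32,
           "R_AMDGPU_REL32_LO": rel_value & MASK32,
           "R_AMDGPU_REL32_HI": rel_value >> 32,
           "R_AMDGPU_RELATIVE64": (B + A) & MASK64,
       }
       return calculations[rtype]

A ``_LO``/``_HI`` pair splits one 64 bit result across two ``word32`` fields,
so calling the helper with the same arguments for both types yields the two
halves of the same value.
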
.. _amdgpu-dwarf:

DWARF
-----

Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
information that maps the code object executable code and data to the source
language constructs. It can be used by tools such as debuggers and profilers.

Address Space Mapping
~~~~~~~~~~~~~~~~~~~~~

The following address space mapping is used:

  .. table:: AMDGPU DWARF Address Space Mapping
     :name: amdgpu-dwarf-address-space-mapping-table

     =================== =================
     DWARF Address Space Memory Space
     =================== =================
     1                   Private (Scratch)
     2                   Local (group/LDS)
     *omitted*           Global
     *omitted*           Constant
     *omitted*           Generic (Flat)
     *not supported*     Region (GDS)
     =================== =================

See :ref:`amdgpu-address-spaces` for information on the memory space terminology
used in the table.

An ``address_class`` attribute is generated on pointer type DIEs to specify the
DWARF address space of the value of the pointer when it is in the *private* or
*local* address space. Otherwise the attribute is omitted.

An ``XDEREF`` operation is generated in location list expressions for variables
that are allocated in the *private* or *local* address space. Otherwise no
``XDEREF`` operation is generated.

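
As a small illustration of the mapping table, the following Python sketch
returns the DWARF address space for a memory space, or ``None`` when the
``address_class`` attribute is omitted. The function and dictionary names are
hypothetical and are not part of LLVM.

.. code-block:: python

   # DWARF address space numbers from the mapping table above.
   DWARF_ADDRESS_SPACE = {"private": 1, "local": 2}
   # Memory spaces for which the attribute is omitted.
   OMITTED = {"global", "constant", "generic"}


   def dwarf_address_class(memory_space):
       """Return the DWARF address space, or None if the attribute is omitted."""
       space = memory_space.lower()
       if space in DWARF_ADDRESS_SPACE:
           return DWARF_ADDRESS_SPACE[space]
       if space in OMITTED:
           return None
       # The table lists Region (GDS) as not supported.
       raise ValueError("no DWARF address space mapping for %r" % memory_space)

A producer following the table would emit ``address_class`` only when the
helper returns a number.
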
Register Mapping
~~~~~~~~~~~~~~~~

*This section is WIP.*

.. TODO
   Define DWARF register enumeration.

   If want to present a wavefront state then should expose vector registers as
   64 wide (rather than per work-item view that LLVM uses). Either as separate
   registers, or a 64x4 byte single register. In either case use a new LANE op
   (akin to XDEREF) to select the current lane usage in a location
   expression. This would also allow scalar register spilling to vector register
   lanes to be expressed (currently no debug information is being generated for
   spilling). If choose a wide single register approach then use LANE in
   conjunction with PIECE operation to select the dword part of the register for
   the current lane. If the separate register approach then use LANE to select
   the register.

Source Text
~~~~~~~~~~~

Source text for online-compiled programs (e.g. those compiled by the OpenCL
runtime) may be embedded into the DWARF v5 line table using the ``clang
-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.

For example:

``-gembed-source``
  Enable the embedded source DWARF v5 extension.
``-gno-embed-source``
  Disable the embedded source DWARF v5 extension.

  .. table:: AMDGPU Debug Options
     :name: amdgpu-debug-options

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as
                          when performing online compilation.
     ==================== ==================================================

This option enables one extended content type in the DWARF v5 Line Number
Program Header, which is used to encode embedded source.

  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
     :name: amdgpu-dwarf-extended-content-types

     ============================ ======================
     Content Type                 Form
     ============================ ======================
     ``DW_LNCT_LLVM_source``      ``DW_FORM_line_strp``
     ============================ ======================

The source field will contain the UTF-8 encoded, null-terminated source text
with ``'\n'`` line endings. When the source field is present, consumers can use
the embedded source instead of attempting to discover the source on disk. When
the source field is absent, consumers can access the file to get the source
text.

The above content type appears in the ``file_name_entry_format`` field of the
line table prologue, and its corresponding value appears in the ``file_names``
field. The current encoding of the content type is documented in table
:ref:`amdgpu-dwarf-extended-content-types-encoding`.

  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
     :name: amdgpu-dwarf-extended-content-types-encoding

     ============================ ====================
     Content Type                 Value
     ============================ ====================
     ``DW_LNCT_LLVM_source``      0x2001
     ============================ ====================

.. _amdgpu-code-conventions:

Code Conventions
================

This section provides code conventions used for each supported target triple OS
(see :ref:`amdgpu-target-triples`).

AMDHSA
------

This section provides code conventions used when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`).

.. _amdgpu-amdhsa-hsa-code-object-metadata:

Code Object Target Identification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDHSA OS uses the following syntax to specify the code object
target as a single string:

  ``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``

Where:

  - ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
    are the same as the *Target Triple* (see
    :ref:`amdgpu-target-triples`).

  - ``<Processor>`` is the same as the *Processor* (see
    :ref:`amdgpu-processors`).

  - ``<Target Features>`` is a list of the enabled *Target Features*
    (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
    apply to *Processor*. The list must be in the same order as listed
    in the table :ref:`amdgpu-target-feature-table`. Note that *Target
    Features* must be included in the list if they are enabled even if
    that is the default for *Processor*.

For example:

  ``"amdgcn-amd-amdhsa--gfx902+xnack"``

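
A minimal Python sketch of how such a string could be assembled is shown
below. The function name ``code_object_target_id`` is hypothetical, and the
sketch assumes the caller already supplies the enabled target features in the
order required by :ref:`amdgpu-target-feature-table`.

.. code-block:: python

   def code_object_target_id(architecture, vendor, os, environment,
                             processor, target_features=()):
       """Join the target triple, processor and enabled features.

       Hypothetical helper. ``target_features`` must already be in the order
       of the target feature table; this sketch does not sort or check them.
       """
       triple_and_processor = "-".join(
           [architecture, vendor, os, environment, processor])
       # Each enabled feature is prefixed by a plus.
       features = "".join("+" + feature for feature in target_features)
       return triple_and_processor + features


   target = code_object_target_id(
       "amdgcn", "amd", "amdhsa", "", "gfx902", ["xnack"])

An empty environment still contributes its separator, which is why the example
string contains two consecutive hyphens before the processor name.
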
Code Object Metadata
~~~~~~~~~~~~~~~~~~~~

The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the ROCm kernel queries, for example the
segment sizes needed in a dispatch packet. In addition, a high level language
runtime may require other information to be included. For example, the AMD
OpenCL runtime records kernel argument information.

The metadata is specified as a YAML formatted string (see [YAML]_ and
:doc:`YamlIO`).

.. TODO
   Is the string null terminated? It probably should not if YAML allows it to
   contain null characters, otherwise it should be.

The metadata is represented as a single YAML document comprised of the mapping
defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
referenced tables.

For boolean values, the string values of ``false`` and ``true`` are used for
false and true respectively.

Additional information can be added to the mappings. To avoid conflicts, any
non-AMD key names should be prefixed by "*vendor-name*.".

978 .. table:: AMDHSA Code Object Metadata Mapping
979 :name: amdgpu-amdhsa-code-object-metadata-mapping-table
980
981 ========== ============== ========= =======================================
982 String Key Value Type Required? Description
983 ========== ============== ========= =======================================
984 "Version" sequence of Required - The first integer is the major
985 2 integers version. Currently 1.
986 - The second integer is the minor
987 version. Currently 0.
988 "Printf" sequence of Each string is encoded information
989 strings about a printf function call. The
990 encoded information is organized as
991 fields separated by colon (':'):
992
993 ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
994
995 where:
996
997 ``ID``
998 A 32 bit integer as a unique id for
999 each printf function call
1000
1001 ``N``
1002 A 32 bit integer equal to the number
1003 of arguments of printf function call
1004 minus 1
1005
1006 ``S[i]`` (where i = 0, 1, ... , N-1)
1007 32 bit integers for the size in bytes
1008 of the i-th FormatString argument of
1009 the printf function call
1010
1011 FormatString
1012 The format string passed to the
1013 printf function call.
1014 "Kernels" sequence of Required Sequence of the mappings for each
1015 mapping kernel in the code object. See
1016 :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
1017 for the definition of the mapping.
1018 ========== ============== ========= =======================================
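
The "Printf" string encoding described in the table above can be decoded with
a few lines of Python. This is a hedged sketch rather than the runtime's
parser: ``PrintfInfo`` and ``parse_printf_metadata`` are hypothetical names,
and the sketch assumes that any colon after the ``N`` size fields belongs to
the format string.

.. code-block:: python

   from typing import List, NamedTuple


   class PrintfInfo(NamedTuple):
       id: int                # unique id of the printf call
       arg_sizes: List[int]   # size in bytes of each of the N arguments
       format_string: str     # the format string passed to printf


   def parse_printf_metadata(entry):
       """Decode ``ID:N:S[0]:...:S[N-1]:FormatString`` (hypothetical helper)."""
       id_text, n_text, rest = entry.split(":", 2)
       n = int(n_text)
       # Split off exactly N size fields; the remainder is the format string,
       # which may itself contain colons.
       parts = rest.split(":", n)
       if len(parts) != n + 1:
           raise ValueError("expected %d size fields in %r" % (n, entry))
       return PrintfInfo(int(id_text), [int(p) for p in parts[:n]], parts[n])


   info = parse_printf_metadata("1:2:4:8:x=%d y=%ld: done")

Limiting the number of splits keeps a format string that contains ``:`` intact
instead of breaking it into extra fields.
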
1019
1020..
1021
1022 .. table:: AMDHSA Code Object Kernel Metadata Mapping
1023 :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
1024
1025 ================= ============== ========= ================================
1026 String Key Value Type Required? Description
1027 ================= ============== ========= ================================
1028 "Name" string Required Source name of the kernel.
1029 "SymbolName" string Required Name of the kernel
1030 descriptor ELF symbol.
1031 "Language" string Source language of the kernel.
1032 Values include:
1033
1034 - "OpenCL C"
1035 - "OpenCL C++"
1036 - "HCC"
1037 - "OpenMP"
1038
1039 "LanguageVersion" sequence of - The first integer is the major
1040 2 integers version.
1041 - The second integer is the
1042 minor version.
1043 "Attrs" mapping Mapping of kernel attributes.
1044 See
1045 :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
1046 for the mapping definition.
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001047 "Args" sequence of Sequence of mappings of the
Tony Tyef16a45e2017-06-06 20:31:59 +00001048 mapping kernel arguments. See
1049 :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
1050 for the definition of the mapping.
1051 "CodeProps" mapping Mapping of properties related to
1052 the kernel code. See
1053 :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
1054 for the mapping definition.
Tony Tyef16a45e2017-06-06 20:31:59 +00001055 ================= ============== ========= ================================
1056
1057..
1058
1059 .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
1060 :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
1061
1062 =================== ============== ========= ==============================
1063 String Key Value Type Required? Description
1064 =================== ============== ========= ==============================
Tony Tyee039d0e2018-01-30 23:07:10 +00001065 "ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
1066 3 integers must be >=1 and the dispatch
1067 work-group size X, Y, Z must
1068 correspond to the specified
1069 values. Defaults to 0, 0, 0.
Tony Tyef16a45e2017-06-06 20:31:59 +00001070
1071 Corresponds to the OpenCL
1072 ``reqd_work_group_size``
1073 attribute.
1074 "WorkGroupSizeHint" sequence of The dispatch work-group size
1075 3 integers X, Y, Z is likely to be the
1076 specified values.
1077
1078 Corresponds to the OpenCL
1079 ``work_group_size_hint``
1080 attribute.
1081 "VecTypeHint" string The name of a scalar or vector
1082 type.
1083
1084 Corresponds to the OpenCL
1085 ``vec_type_hint`` attribute.
Yaxun Liude4b88d2017-10-10 19:39:48 +00001086
1087 "RuntimeHandle" string The external symbol name
1088 associated with a kernel.
1089 OpenCL runtime allocates a
1090 global buffer for the symbol
1091 and saves the kernel's address
1092 to it, which is used for
1093 device side enqueueing. Only
1094 available for device side
1095 enqueued kernels.
Tony Tyef16a45e2017-06-06 20:31:59 +00001096 =================== ============== ========= ==============================
1097
1098..
1099
1100 .. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
1101 :name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
1102
1103 ================= ============== ========= ================================
1104 String Key Value Type Required? Description
1105 ================= ============== ========= ================================
1106 "Name" string Kernel argument name.
1107 "TypeName" string Kernel argument type name.
1108 "Size" integer Required Kernel argument size in bytes.
1109 "Align" integer Required Kernel argument alignment in
1110 bytes. Must be a power of two.
1111 "ValueKind" string Required Kernel argument kind that
1112 specifies how to set up the
1113 corresponding argument.
1114 Values include:
1115
1116 "ByValue"
1117 The argument is copied
1118 directly into the kernarg.
1119
1120 "GlobalBuffer"
1121 A global address space pointer
1122 to the buffer data is passed
1123 in the kernarg.
1124
1125 "DynamicSharedPointer"
1126 A group address space pointer
1127 to dynamically allocated LDS
1128 is passed in the kernarg.
1129
1130 "Sampler"
1131 A global address space
1132 pointer to a S# is passed in
1133 the kernarg.
1134
1135 "Image"
1136 A global address space
1137 pointer to a T# is passed in
1138 the kernarg.
1139
1140 "Pipe"
1141 A global address space pointer
1142 to an OpenCL pipe is passed in
1143 the kernarg.
1144
1145 "Queue"
1146 A global address space pointer
1147 to an OpenCL device enqueue
1148 queue is passed in the
1149 kernarg.
1150
1151 "HiddenGlobalOffsetX"
1152 The OpenCL grid dispatch
1153 global offset for the X
1154 dimension is passed in the
1155 kernarg.
1156
1157 "HiddenGlobalOffsetY"
1158 The OpenCL grid dispatch
1159 global offset for the Y
1160 dimension is passed in the
1161 kernarg.
1162
1163 "HiddenGlobalOffsetZ"
1164 The OpenCL grid dispatch
1165 global offset for the Z
1166 dimension is passed in the
1167 kernarg.
1168
1169 "HiddenNone"
1170 An argument that is not used
1171 by the kernel. Space needs to
1172 be left for it, but it does
1173 not need to be set up.
1174
1175 "HiddenPrintfBuffer"
1176 A global address space pointer
1177 to the runtime printf buffer
1178 is passed in kernarg.
1179
1180 "HiddenDefaultQueue"
1181 A global address space pointer
1182 to the OpenCL device enqueue
1183 queue that should be used by
1184 the kernel by default is
1185 passed in the kernarg.
1186
1187 "HiddenCompletionAction"
Yaxun Liuc928f2a2017-10-30 14:30:28 +00001188 A global address space pointer
1189 to help link enqueued kernels into
1190 the ancestor tree for determining
1191 when the parent kernel has finished.
Tony Tyef16a45e2017-06-06 20:31:59 +00001192
1193 "ValueType" string Required Kernel argument value type. Only
1194 present if "ValueKind" is
1195 "ByValue". For vector data
1196 types, the value is for the
1197 element type. Values include:
1198
1199 - "Struct"
1200 - "I8"
1201 - "U8"
1202 - "I16"
1203 - "U16"
1204 - "F16"
1205 - "I32"
1206 - "U32"
1207 - "F32"
1208 - "I64"
1209 - "U64"
1210 - "F64"
1211
1212 .. TODO
1213 How can it be determined if a
1214 vector type, and what size
1215 vector?
1216 "PointeeAlign" integer Alignment in bytes of pointee
1217 type for pointer type kernel
1218 argument. Must be a power
1219 of 2. Only present if
1220 "ValueKind" is
1221 "DynamicSharedPointer".
1222 "AddrSpaceQual" string Kernel argument address space
1223 qualifier. Only present if
1224 "ValueKind" is "GlobalBuffer" or
1225 "DynamicSharedPointer". Values
1226 are:
1227
1228 - "Private"
1229 - "Global"
1230 - "Constant"
1231 - "Local"
1232 - "Generic"
1233 - "Region"
1234
1235 .. TODO
1236 Is GlobalBuffer only Global
1237 or Constant? Is
1238 DynamicSharedPointer always
1239 Local? Can HCC allow Generic?
1240 How can Private or Region
1241 ever happen?
1242 "AccQual" string Kernel argument access
1243 qualifier. Only present if
1244 "ValueKind" is "Image" or
1245 "Pipe". Values
1246 are:
1247
1248 - "ReadOnly"
1249 - "WriteOnly"
1250 - "ReadWrite"
1251
1252 .. TODO
1253 Does this apply to
1254 GlobalBuffer?
Konstantin Zhuravlyova01d8b02017-10-14 19:03:51 +00001255 "ActualAccQual" string The actual memory accesses
Tony Tyef16a45e2017-06-06 20:31:59 +00001256 performed by the kernel on the
1257 kernel argument. Only present if
1258 "ValueKind" is "GlobalBuffer",
1259 "Image", or "Pipe". This may be
1260 more restrictive than indicated
1261 by "AccQual" to reflect what the
                                                kernel actually does. If not
1263 present then the runtime must
1264 assume what is implied by
1265 "AccQual" and "IsConst". Values
1266 are:
1267
1268 - "ReadOnly"
1269 - "WriteOnly"
1270 - "ReadWrite"
1271
1272 "IsConst" boolean Indicates if the kernel argument
1273 is const qualified. Only present
1274 if "ValueKind" is
1275 "GlobalBuffer".
1276
1277 "IsRestrict" boolean Indicates if the kernel argument
1278 is restrict qualified. Only
1279 present if "ValueKind" is
1280 "GlobalBuffer".
1281
1282 "IsVolatile" boolean Indicates if the kernel argument
1283 is volatile qualified. Only
1284 present if "ValueKind" is
1285 "GlobalBuffer".
1286
1287 "IsPipe" boolean Indicates if the kernel argument
1288 is pipe qualified. Only present
1289 if "ValueKind" is "Pipe".
1290
1291 .. TODO
1292 Can GlobalBuffer be pipe
1293 qualified?
1294 ================= ============== ========= ================================
1295
1296..
1297
1298 .. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
1299 :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
1300
1301 ============================ ============== ========= =====================
1302 String Key Value Type Required? Description
1303 ============================ ============== ========= =====================
1304 "KernargSegmentSize" integer Required The size in bytes of
1305 the kernarg segment
1306 that holds the values
1307 of the arguments to
1308 the kernel.
1309 "GroupSegmentFixedSize" integer Required The amount of group
1310 segment memory
1311 required by a
1312 work-group in
1313 bytes. This does not
1314 include any
1315 dynamically allocated
1316 group segment memory
1317 that may be added
1318 when the kernel is
1319 dispatched.
1320 "PrivateSegmentFixedSize" integer Required The amount of fixed
1321 private address space
1322 memory required for a
1323 work-item in
Tony Tye07d9f102017-11-10 01:00:54 +00001324 bytes. If the kernel
1325 uses a dynamic call
1326 stack then additional
Tony Tyef16a45e2017-06-06 20:31:59 +00001327 space must be added
1328 to this value for the
1329 call stack.
1330 "KernargSegmentAlign" integer Required The maximum byte
1331 alignment of
1332 arguments in the
1333 kernarg segment. Must
1334 be a power of 2.
1335 "WavefrontSize" integer Required Wavefront size. Must
1336 be a power of 2.
Tony Tye07d9f102017-11-10 01:00:54 +00001337 "NumSGPRs" integer Required Number of scalar
Tony Tyef16a45e2017-06-06 20:31:59 +00001338 registers used by a
1339 wavefront for
1340 GFX6-GFX9. This
1341 includes the special
1342 SGPRs for VCC, Flat
1343 Scratch (GFX7-GFX9)
1344 and XNACK (for
1345 GFX8-GFX9). It does
1346 not include the 16
1347 SGPR added if a trap
1348 handler is
1349 enabled. It is not
1350 rounded up to the
1351 allocation
1352 granularity.
Tony Tye07d9f102017-11-10 01:00:54 +00001353 "NumVGPRs" integer Required Number of vector
Tony Tyef16a45e2017-06-06 20:31:59 +00001354 registers used by
1355 each work-item for
1356 GFX6-GFX9
Tony Tye07d9f102017-11-10 01:00:54 +00001357 "MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tyef16a45e2017-06-06 20:31:59 +00001358 work-group size
1359 supported by the
1360 kernel in work-items.
Tony Tye07d9f102017-11-10 01:00:54 +00001361 Must be >=1 and
Tony Tyee039d0e2018-01-30 23:07:10 +00001362 consistent with
1363 ReqdWorkGroupSize if
1364 not 0, 0, 0.
Konstantin Zhuravlyov06ae4ec2017-11-28 17:51:08 +00001365 "NumSpilledSGPRs" integer Number of stores from
1366 a scalar register to
1367 a register allocator
1368 created spill
1369 location.
1370 "NumSpilledVGPRs" integer Number of stores from
1371 a vector register to
1372 a register allocator
1373 created spill
1374 location.
Tony Tyef16a45e2017-06-06 20:31:59 +00001375 ============================ ============== ========= =====================
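
The "Required?" columns of the metadata tables above can be checked
mechanically once the YAML document has been loaded into nested mappings and
sequences. The Python sketch below works on plain dictionaries and lists so
that it does not depend on a particular YAML library. The function name
``missing_required_keys`` and the path notation in its result are hypothetical,
and only the presence of keys is checked, not their values.

.. code-block:: python

   REQUIRED_TOP = ("Version", "Kernels")
   REQUIRED_KERNEL = ("Name", "SymbolName")
   REQUIRED_ARG = ("Size", "Align", "ValueKind")
   REQUIRED_CODE_PROPS = (
       "KernargSegmentSize", "GroupSegmentFixedSize",
       "PrivateSegmentFixedSize", "KernargSegmentAlign", "WavefrontSize",
       "NumSGPRs", "NumVGPRs", "MaxFlatWorkGroupSize",
   )


   def _missing(mapping, keys, path):
       return [path + key for key in keys if key not in mapping]


   def missing_required_keys(metadata):
       """Return the paths of required keys absent from a metadata mapping."""
       missing = _missing(metadata, REQUIRED_TOP, "")
       for k, kernel in enumerate(metadata.get("Kernels", [])):
           kpath = "Kernels[%d]." % k
           missing += _missing(kernel, REQUIRED_KERNEL, kpath)
           for a, arg in enumerate(kernel.get("Args", [])):
               apath = "%sArgs[%d]." % (kpath, a)
               missing += _missing(arg, REQUIRED_ARG, apath)
               # "ValueType" is only present when "ValueKind" is "ByValue".
               if arg.get("ValueKind") == "ByValue":
                   missing += _missing(arg, ("ValueType",), apath)
           # "CodeProps" itself is optional; check its keys only if present.
           if "CodeProps" in kernel:
               missing += _missing(kernel["CodeProps"], REQUIRED_CODE_PROPS,
                                   kpath + "CodeProps.")
       return missing


   example = {
       "Version": [1, 0],
       "Kernels": [{
           "Name": "foo",
           "SymbolName": "foo@kd",
           "Args": [
               {"Size": 8, "Align": 8, "ValueKind": "GlobalBuffer"},
               {"Size": 4, "Align": 4, "ValueKind": "ByValue"},
           ],
       }],
   }
   problems = missing_required_keys(example)

In this example the second argument is passed by value but has no
"ValueType", so that is the only path reported.
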
1376
1377..

Kernel Dispatch
~~~~~~~~~~~~~~~

The HSA architected queuing language (AQL) defines a user space memory interface
that can be used to control the dispatch of kernels, in an agent independent
way. An agent can have zero or more AQL queues created for it using the ROCm
runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
*HSA Platform System Architecture Specification* [HSA]_ for the AQL queue
mechanics and packet layouts.

The packet processor of a kernel agent is responsible for detecting and
dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
packet processor is implemented by the hardware command processor (CP),
asynchronous dispatch controller (ADC) and shader processor input controller
(SPI).

The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
mode driver to initialize and register the AQL queue with CP.

To dispatch a kernel the following actions are performed. This can occur in the
CPU host program, or from an HSA kernel executing on a GPU.

1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
   obtained. It must be for a kernel that is contained in a code object that
   was loaded by the ROCm runtime on the kernel agent with which the AQL queue
   is associated.
3. Space is allocated for the kernel arguments using the ROCm runtime allocator
   for a memory region with the kernarg property for the kernel agent that will
   execute the kernel. It must be at least 16 byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the *HSA Programmer's Language Reference*
   [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
   memory in the same way constant memory is accessed. (Note that the HSA
   specification allows an implementation to copy the kernel argument contents to
   another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
   API uses 64 bit atomic operations to reserve space in the AQL queue for the
   packet. The packet must be set up, and the final write must use an atomic
   store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The ROCm runtime queries on
   the kernel symbol can be used to obtain the code object values which are
   recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing executing the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.

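
The ordering constraints in steps 3 to 5 above can be illustrated with a toy,
single-threaded Python model. It is not the ROCm runtime API: the class
``ToyAqlQueue``, the function ``dispatch`` and all packet field names are
hypothetical, plain assignments stand in for the 64 bit atomic operations and
the atomic store release, and the packet is modelled as a dictionary rather
than the 64 byte layout defined by the HSA specification.

.. code-block:: python

   class ToyAqlQueue:
       """Toy stand-in for an AQL queue (hypothetical, single-threaded)."""

       def __init__(self, num_slots):
           self.slots = [{"kind": "invalid"} for _ in range(num_slots)]
           self.write_index = 0
           self.doorbell = None

       def reserve_slot(self):
           # A real producer reserves space with a 64 bit atomic operation.
           index = self.write_index
           self.write_index += 1
           return index

       def ring_doorbell(self, index):
           # Notifies the kernel agent that the queue has been updated.
           self.doorbell = index


   def dispatch(queue, kernel_descriptor_address, kernarg_address,
                grid_size, workgroup_size):
       # Step 3: the kernarg allocation must be at least 16 byte aligned.
       if kernarg_address % 16 != 0:
           raise ValueError("kernarg memory must be at least 16 byte aligned")
       # Step 5: reserve space, then set up the packet contents.
       index = queue.reserve_slot()
       slot = queue.slots[index % len(queue.slots)]
       slot["kernel_descriptor"] = kernel_descriptor_address
       slot["kernarg_address"] = kernarg_address
       slot["grid_size"] = grid_size
       slot["workgroup_size"] = workgroup_size
       # The final write sets the packet kind (an atomic store release in a
       # real implementation) so the contents are visible before the kind.
       slot["kind"] = "kernel_dispatch"
       queue.ring_doorbell(index)
       return index


   queue = ToyAqlQueue(num_slots=4)
   first = dispatch(queue, 0x1000, 0x2000, (256, 1, 1), (64, 1, 1))

The point of the model is only the order of the writes: the packet kind is
written last, and the doorbell is signalled after that.
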
.. _amdgpu-amdhsa-memory-spaces:

Memory Spaces
~~~~~~~~~~~~~

The memory space properties are:

  .. table:: AMDHSA Memory Spaces
     :name: amdgpu-amdhsa-memory-spaces-table

     ================= =========== ======== ======= ==================
     Memory Space Name HSA Segment Hardware Address NULL Value
                       Name        Name     Size
     ================= =========== ======== ======= ==================
     Private           private     scratch  32      0x00000000
     Local             group       LDS      32      0xFFFFFFFF
     Global            global      global   64      0x0000000000000000
     Constant          constant    *same as 64      0x0000000000000000
                                   global*
     Generic           flat        flat     64      0x0000000000000000
     Region            N/A         GDS      32      *not implemented
                                                    for AMDHSA*
     ================= =========== ======== ======= ==================

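
Because the NULL value differs between memory spaces, a null check has to know
which space a pointer belongs to. The Python sketch below encodes the address
sizes and NULL values from the table. The names ``MEMORY_SPACES`` and
``is_null`` are hypothetical, and the Region space is left out because the
table gives no NULL value for it.

.. code-block:: python

   # memory space name -> (address size in bits, NULL value), from the table.
   MEMORY_SPACES = {
       "private": (32, 0x00000000),
       "local": (32, 0xFFFFFFFF),
       "global": (64, 0x0000000000000000),
       "constant": (64, 0x0000000000000000),
       "generic": (64, 0x0000000000000000),
   }


   def is_null(memory_space, address):
       """Return True if ``address`` is the NULL value of ``memory_space``."""
       bits, null_value = MEMORY_SPACES[memory_space.lower()]
       if not 0 <= address < (1 << bits):
           raise ValueError("address does not fit in %d bits" % bits)
       return address == null_value

Note that address 0 is a valid, non-NULL address in the local memory space,
whose NULL value is ``0xFFFFFFFF``.
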
The global and constant memory spaces both use global virtual addresses, which
are the same virtual address space used by the CPU. However, some virtual
addresses may only be accessible to the CPU, some only accessible by the GPU,
and some by both.

Using the constant memory space indicates that the data will not change during
the execution of the kernel. This allows scalar read instructions to be
used. The vector and scalar L1 caches are invalidated of volatile data before
each kernel dispatch execution to allow constant memory to change values between
kernel dispatches.

The local memory space uses the hardware Local Data Store (LDS) which is
automatically allocated when the hardware creates work-groups of wavefronts, and
freed when all the wavefronts of a work-group have terminated. The data store
(DS) instructions can be used to access it.

The private memory space uses the hardware scratch memory support. If the kernel
uses scratch, then the hardware allocates memory that is accessed using
wavefront lane dword (4 byte) interleaving. The mapping used from private
address to physical address is:

  ``wavefront-scratch-base +
  (private-address * wavefront-size * 4) +
  (wavefront-lane-id * 4)``

1493There are different ways that the wavefront scratch base address is determined
1494by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
1495memory can be accessed in an interleaved manner using buffer instruction with
Tony Tye5bbcca62018-03-08 05:46:01 +00001496the scratch buffer descriptor and per wavefront scratch offset, by the scratch
Tony Tyef16a45e2017-06-06 20:31:59 +00001497instructions, or by flat instructions. If each lane of a wavefront accesses the
1498same private address, the interleaving results in adjacent dwords being accessed
1499and hence requires fewer cache lines to be fetched. Multi-dword access is not
1500supported except by flat and scratch instructions in GFX9.
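
The private-to-physical mapping can be sketched in a few lines of Python. This
is only an illustration that applies the formula above literally; the function
name, the default wavefront size of 64, and the example base address are
assumptions made for the sketch and are not part of the ABI.

```python
def private_to_physical(scratch_base, private_address, lane_id, wavefront_size=64):
    """Apply the mapping formula from the text literally."""
    if not 0 <= lane_id < wavefront_size:
        raise ValueError("lane_id must be in [0, wavefront_size)")
    return scratch_base + (private_address * wavefront_size * 4) + (lane_id * 4)


# Four lanes reading the same private address touch four adjacent dwords.
# 0x1000 is a hypothetical wavefront scratch base used only for illustration.
addresses = [private_to_physical(0x1000, 1, lane) for lane in range(4)]
```

Consecutive lanes differ by exactly 4 bytes, which is the adjacency property
that lets a wavefront-wide access to one private address use fewer cache lines.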

The generic address space uses the hardware flat address support available in
GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.

FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
(see :ref:`amdgpu-amdhsa-m0`).

To convert between a segment address and a flat address the base addresses of
the apertures can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
the Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
GFX9 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.
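
Because each aperture is 2^32 bytes in size and 2^32 aligned in 64 bit address
mode, the conversion reduces to combining or masking the low 32 bits. The
following Python sketch shows the arithmetic only; the function names and the
example aperture base are hypothetical, and NULL pointer handling (the segment
and flat NULL values differ) is deliberately not modeled.

```python
APERTURE_SIZE = 1 << 32  # aperture size and alignment in 64 bit address mode


def segment_to_flat(aperture_base, segment_address):
    """Combine an aligned aperture base with a 32 bit segment address."""
    if aperture_base % APERTURE_SIZE:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    return aperture_base | segment_address


def flat_to_segment(aperture_base, flat_address):
    """Recover the 32 bit segment address from a flat address in the aperture."""
    if flat_address - aperture_base not in range(APERTURE_SIZE):
        raise ValueError("flat address is outside this aperture")
    return flat_address & (APERTURE_SIZE - 1)


# Hypothetical private aperture base, chosen only to be 2^32 aligned.
private_base = 0x0000_1000_0000_0000
flat = segment_to_flat(private_base, 0x1234)
segment = flat_to_segment(private_base, flat)
```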

Image and Samplers
~~~~~~~~~~~~~~~~~~

Image and sampler handles created by the ROCm runtime are 64 bit addresses of a
hardware 32 byte V# and 48 byte S# object respectively. In order to support the
HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
enumeration values for the queries that are not trivially deducible from the S#
representation.

HSA Signals
~~~~~~~~~~~

HSA signal handles created by the ROCm runtime are 64 bit addresses of a
structure allocated in memory accessible from both the CPU and GPU. The
structure is defined by the ROCm runtime and subject to change between releases
(see [AMD-ROCm-github]_).

.. _amdgpu-amdhsa-hsa-aql-queue:

HSA AQL Queue
~~~~~~~~~~~~~

The HSA AQL queue structure is defined by the ROCm runtime and subject to change
between releases (see [AMD-ROCm-github]_). For some processors it contains
fields needed to implement certain language features such as the flat address
aperture bases. It also contains fields used by CP, such as those for managing
the allocation of scratch memory.

.. _amdgpu-amdhsa-kernel-descriptor:

Kernel Descriptor
~~~~~~~~~~~~~~~~~

A kernel descriptor consists of the information needed by CP to initiate the
execution of a kernel, including the entry point address of the machine code
that implements the kernel.

Kernel Descriptor for GFX6-GFX9
+++++++++++++++++++++++++++++++

CP microcode requires the kernel descriptor to be allocated on 64 byte alignment.

  .. table:: Kernel Descriptor for GFX6-GFX9
     :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table

     ======= ======= =============================== ============================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ============================
     31:0    4 bytes GroupSegmentFixedSize           The amount of fixed local
                                                     address space memory
                                                     required for a work-group
                                                     in bytes. This does not
                                                     include any dynamically
                                                     allocated local address
                                                     space memory that may be
                                                     added when the kernel is
                                                     dispatched.
     63:32   4 bytes PrivateSegmentFixedSize         The amount of fixed
                                                     private address space
                                                     memory required for a
                                                     work-item in bytes. If
                                                     is_dynamic_callstack is 1
                                                     then additional space must
                                                     be added to this value for
                                                     the call stack.
     127:64  8 bytes                                 Reserved, must be 0.
     191:128 8 bytes KernelCodeEntryByteOffset       Byte offset (possibly
                                                     negative) from base
                                                     address of kernel
                                                     descriptor to kernel's
                                                     entry point instruction
                                                     which must be 256 byte
                                                     aligned.
     383:192 24                                      Reserved, must be 0.
             bytes
     415:384 4 bytes ComputePgmRsrc1                 Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC1``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     447:416 4 bytes ComputePgmRsrc2                 Compute Shader (CS)
                                                     program settings used by
                                                     CP to set up
                                                     ``COMPUTE_PGM_RSRC2``
                                                     configuration
                                                     register. See
                                                     :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     448     1 bit   EnableSGPRPrivateSegmentBuffer  Enable the setup of the
                                                     SGPR user data registers
                                                     (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     The total number of SGPR
                                                     user data registers
                                                     requested must not exceed
                                                     16 and must match the value
                                                     in
                                                     ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
                                                     Any requests beyond 16
                                                     will be ignored.
     449     1 bit   EnableSGPRDispatchPtr           *see above*
     450     1 bit   EnableSGPRQueuePtr              *see above*
     451     1 bit   EnableSGPRKernargSegmentPtr     *see above*
     452     1 bit   EnableSGPRDispatchID            *see above*
     453     1 bit   EnableSGPRFlatScratchInit       *see above*
     454     1 bit   EnableSGPRPrivateSegmentSize    *see above*
     455     1 bit   EnableSGPRGridWorkgroupCountX   Not implemented in CP and
                                                     should always be 0.
     456     1 bit   EnableSGPRGridWorkgroupCountY   Not implemented in CP and
                                                     should always be 0.
     457     1 bit   EnableSGPRGridWorkgroupCountZ   Not implemented in CP and
                                                     should always be 0.
     463:458 6 bits                                  Reserved, must be 0.
     511:464 6                                       Reserved, must be 0.
             bytes
     512             **Total size 64 bytes.**
     ======= ====================================================================
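
As a cross-check of the bit ranges in the table, the layout can be expressed as
a Python ``struct`` format and packed into 64 bytes. This is a sketch rather
than how any tool builds descriptors: little-endian byte order is an assumption
(the table only gives bit positions), the function and parameter names are
invented for illustration, and the example field values are arbitrary.

```python
import struct

# Field order follows the table: two 4 byte sizes, 8 reserved bytes, a signed
# 8 byte entry offset, 24 reserved bytes, the two rsrc words, 16 bits holding
# the enable flags (bits 448-463), and 6 reserved bytes.
# Assumption: little-endian layout with no padding.
KERNEL_DESCRIPTOR = struct.Struct("<IIQq24sIIH6s")


def pack_kernel_descriptor(group_segment_fixed_size, private_segment_fixed_size,
                           kernel_code_entry_byte_offset, compute_pgm_rsrc1,
                           compute_pgm_rsrc2, enable_sgpr_bits):
    """Pack the fields; enable_sgpr_bits holds bits 448-457 in its low 10 bits."""
    if enable_sgpr_bits >> 10:
        raise ValueError("bits 458-463 are reserved and must be 0")
    return KERNEL_DESCRIPTOR.pack(
        group_segment_fixed_size, private_segment_fixed_size, 0,
        kernel_code_entry_byte_offset, bytes(24),
        compute_pgm_rsrc1, compute_pgm_rsrc2, enable_sgpr_bits, bytes(6))


# Request the private segment buffer (bit 448) and the kernarg segment
# pointer (bit 451); the remaining values are arbitrary illustrations.
descriptor = pack_kernel_descriptor(1024, 16, -256, 0x00AC00C5, 0x8C, 0b1001)
```

The format string adds up to exactly 64 bytes, matching the total size given in
the last row of the table.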

..

  .. table:: compute_pgm_rsrc1 for GFX6-GFX9
     :name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector registers
                                                     used by each work-item,
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX9
                                                       - max_vgpr 1..256
                                                       - roundup((max_vgpr + 1)
                                                         / 4) - 1

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.
     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
                                                     used by a wavefront,
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX8
                                                       - max_sgpr 1..112
                                                       - roundup((max_sgpr + 1)
                                                         / 8) - 1
                                                     GFX9
                                                       - max_sgpr 1..112
                                                       - roundup((max_sgpr + 1)
                                                         / 16) - 1

                                                     Includes the special SGPRs
                                                     for VCC, Flat Scratch (for
                                                     GFX7 onwards) and XNACK
                                                     (for GFX8 onwards). It does
                                                     not include the 16 SGPR
                                                     added if a trap handler is
                                                     enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.
     11:10   2 bits  PRIORITY                        Must be 0.

                                                     Start executing wavefront
                                                     at the specified priority.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIORITY``.
     13:12   2 bits  FLOAT_ROUND_MODE_32             Wavefront starts execution
                                                     with specified rounding
                                                     mode for single (32
                                                     bit) precision floating
                                                     point operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with specified rounding
                                                     mode for half/double (16
                                                     and 64 bit) precision
                                                     floating point operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     17:16   2 bits  FLOAT_DENORM_MODE_32            Wavefront starts execution
                                                     with specified denorm mode
                                                     for single (32
                                                     bit) precision floating
                                                     point operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     19:18   2 bits  FLOAT_DENORM_MODE_16_64         Wavefront starts execution
                                                     with specified denorm mode
                                                     for half/double (16
                                                     and 64 bit) precision
                                                     floating point operations.

                                                     Floating point denorm mode
                                                     values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     20      1 bit   PRIV                            Must be 0.

                                                     Start executing wavefront
                                                     in privilege trap handler
                                                     mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.PRIV``.
     21      1 bit   ENABLE_DX10_CLAMP               Wavefront starts execution
                                                     with DX10 clamp mode
                                                     enabled. Used by the vector
                                                     ALU to force DX10 style
                                                     treatment of NaNs (when
                                                     set, clamp NaN to zero,
                                                     otherwise pass NaN
                                                     through).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
     22      1 bit   DEBUG_MODE                      Must be 0.

                                                     Start executing wavefront
                                                     in single step mode.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
     23      1 bit   ENABLE_IEEE_MODE                Wavefront starts execution
                                                     with IEEE mode
                                                     enabled. Floating point
                                                     opcodes that support
                                                     exception flag gathering
                                                     will quiet and propagate
                                                     signaling-NaN inputs per
                                                     IEEE 754-2008. Min_dx10 and
                                                     max_dx10 become IEEE
                                                     754-2008 compliant due to
                                                     signaling-NaN propagation
                                                     and quieting.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.IEEE_MODE``.
     24      1 bit   BULKY                           Must be 0.

                                                     Only one work-group allowed
                                                     to execute on a compute
                                                     unit.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.BULKY``.
     25      1 bit   CDBG_USER                       Must be 0.

                                                     Flag that can be used to
                                                     control debugging code.

                                                     CP is responsible for
                                                     filling in
                                                     ``COMPUTE_PGM_RSRC1.CDBG_USER``.
     26      1 bit   FP16_OVFL                       GFX6-GFX8
                                                       Reserved, must be 0.
                                                     GFX9
                                                       Wavefront starts execution
                                                       with specified fp16 overflow
                                                       mode.

                                                       - If 0, fp16 overflow generates
                                                         +/-INF values.
                                                       - If 1, fp16 overflow that is the
                                                         result of a +/-INF input value
                                                         or divide by 0 produces a +/-INF,
                                                         otherwise clamps computed
                                                         overflow to +/-MAX_FP16 as
                                                         appropriate.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FP16_OVFL``.
     31:27   5 bits                                  Reserved, must be 0.
     32              **Total size 4 bytes.**
     ======= ===================================================================================================================
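
The granulated register counts and the field positions in the table can be
sketched in Python as follows. The helper names are hypothetical, only the
fields that are not required to be 0 are modeled, and ``max_vgpr`` and
``max_sgpr`` are taken to be exactly the quantities used in the table's
formulas. The sketch simply rejects any computed value that does not fit its
bit field rather than guessing how such a case should be handled.

```python
def granulated_count(max_reg, granule):
    """roundup((max_reg + 1) / granule) - 1, using integer arithmetic."""
    return (max_reg + granule) // granule - 1


def field(value, low_bit, width):
    """Place value in a bit field, rejecting values that do not fit."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} bits")
    return value << low_bit


def pack_compute_pgm_rsrc1(max_vgpr, max_sgpr, sgpr_granule, *,
                           float_round_mode_32=0, float_round_mode_16_64=0,
                           float_denorm_mode_32=0, float_denorm_mode_16_64=0,
                           enable_dx10_clamp=0, enable_ieee_mode=0, fp16_ovfl=0):
    """Assemble the 32 bit value; PRIORITY, PRIV, DEBUG_MODE etc. stay 0."""
    return (field(granulated_count(max_vgpr, 4), 0, 6)
            | field(granulated_count(max_sgpr, sgpr_granule), 6, 4)
            | field(float_round_mode_32, 12, 2)
            | field(float_round_mode_16_64, 14, 2)
            | field(float_denorm_mode_32, 16, 2)
            | field(float_denorm_mode_16_64, 18, 2)
            | field(enable_dx10_clamp, 21, 1)
            | field(enable_ieee_mode, 23, 1)
            | field(fp16_ovfl, 26, 1))


# The SGPR granule is 8 for GFX6-GFX8 and 16 for GFX9 (per the table).
rsrc1 = pack_compute_pgm_rsrc1(23, 31, 8, float_denorm_mode_16_64=3,
                               enable_dx10_clamp=1, enable_ieee_mode=1)
```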

..

  .. table:: compute_pgm_rsrc2 for GFX6-GFX9
     :name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table

     ======= ======= =============================== ===========================================================================
     Bits    Size    Field Name                      Description
     ======= ======= =============================== ===========================================================================
     0       1 bit   ENABLE_SGPR_PRIVATE_SEGMENT     Enable the setup of the
                     _WAVEFRONT_OFFSET               SGPR wavefront scratch offset
                                                     system register (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
     5:1     5 bits  USER_SGPR_COUNT                 The total number of SGPR
                                                     user data registers
                                                     requested. This number must
                                                     match the number of user
                                                     data registers enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.USER_SGPR``.
     6       1 bit   ENABLE_TRAP_HANDLER             Set to 1 if code contains a
                                                     TRAP instruction which
                                                     requires a trap handler to
                                                     be enabled.

                                                     CP sets
                                                     ``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
                                                     if the runtime has
                                                     installed a trap handler
                                                     regardless of the setting
                                                     of this field.
     7       1 bit   ENABLE_SGPR_WORKGROUP_ID_X      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the X
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_X_EN``.
     8       1 bit   ENABLE_SGPR_WORKGROUP_ID_Y      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Y
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Z
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
                                                     system SGPR register for
                                                     work-group information (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
                                                     VGPR system registers used
                                                     for the work-item ID.
                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
                                                     defines the values.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.

                                                     Wavefront starts execution
                                                     with address watch
                                                     exceptions enabled which
                                                     are generated when L1 has
                                                     witnessed a thread access
                                                     an *address of
                                                     interest*.

                                                     CP is responsible for
                                                     filling in the address
                                                     watch bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.

                                                     Wavefront starts execution
                                                     with memory violation
                                                     exceptions
                                                     enabled which are generated
                                                     when a memory violation has
                                                     occurred for this wavefront from
                                                     L1 or LDS
                                                     (write-to-read-only-memory,
                                                     mis-aligned atomic, LDS
                                                     address out of range,
                                                     illegal address, etc.).

                                                     CP sets the memory
                                                     violation bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.

                                                     CP uses the rounded value
                                                     from the dispatch packet,
                                                     not this value, as the
                                                     dispatch may contain
                                                     dynamically allocated group
                                                     segment memory. CP writes
                                                     directly to
                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.

                                                     Amount of group segment
                                                     (LDS) to allocate for each
                                                     work-group. Granularity is
                                                     device specific:

                                                     GFX6:
                                                       roundup(lds-size / (64 * 4))
                                                     GFX7-GFX9:
                                                       roundup(lds-size / (128 * 4))

     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
                     _INVALID_OPERATION              with specified exceptions
                                                     enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
                                                     (set from bits 0..6).

                                                     IEEE 754 FP Invalid
                                                     Operation
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
                     _SOURCE                         input operands is a
                                                     denormal number
     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
                     _DIVISION_BY_ZERO               Zero
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
                     _OVERFLOW
     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
                     _UNDERFLOW
     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
                     _INEXACT
     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
                     _ZERO                           (rcp_iflag_f32 instruction
                                                     only)
     31      1 bit                                   Reserved, must be 0.
     32              **Total size 4 bytes.**
     ======= ===================================================================================================================
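
The device specific LDS granularity described for ``GRANULATED_LDS_SIZE`` can
be illustrated with a small Python helper. The function name and the
``gfx_major`` parameter are hypothetical; note that, per the table, the kernel
descriptor field itself must be 0 and CP derives the value it uses from the
dispatch packet, so this only shows the rounding arithmetic.

```python
def granulated_lds_size(lds_size_bytes, gfx_major):
    """Round an LDS size in bytes up to the device specific granule count."""
    if gfx_major not in (6, 7, 8, 9):
        raise ValueError("this sketch only covers GFX6-GFX9")
    # GFX6 allocates in units of 64 dwords, GFX7-GFX9 in units of 128 dwords.
    granule = 64 * 4 if gfx_major == 6 else 128 * 4
    return -(-lds_size_bytes // granule)  # integer ceiling division


gfx6_granules = granulated_lds_size(1000, 6)
gfx9_granules = granulated_lds_size(1000, 9)
```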

..

  .. table:: Floating Point Rounding Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN      0     Round Ties To Even
     AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY  1     Round Toward +infinity
     AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2     Round Toward -infinity
     AMDGPU_FLOAT_ROUND_MODE_ZERO           3     Round Toward 0
     ====================================== ===== ==============================

..

  .. table:: Floating Point Denorm Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0     Flush Source and Destination
                                                  Denorms
     AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST     1     Flush Output Denorms
     AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC     2     Flush Source Denorms
     AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE    3     No Flush
     ====================================== ===== ==============================

..

  .. table:: System VGPR Work-Item ID Enumeration Values
     :name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table

     ======================================== ===== ============================
     Enumeration Name                         Value Description
     ======================================== ===== ============================
     AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X         0     Set work-item X dimension
                                                    ID.
     AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y       1     Set work-item X and Y
                                                    dimensions ID.
     AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z     2     Set work-item X, Y and Z
                                                    dimensions ID.
     AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3     Undefined.
     ======================================== ===== ============================

.. _amdgpu-amdhsa-initial-kernel-execution-state:

Initial Kernel Execution State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section defines the register state that will be set up by the packet
processor prior to the start of execution of every wavefront. This is limited by
the constraints of the hardware controllers of CP/ADC/SPI.

The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*``
bit fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers
used for enabled registers are dense starting at SGPR0: the first enabled
register is SGPR0, the next enabled register is SGPR1 etc.; disabled registers
do not have an SGPR number.

The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
initialized. These are then immediately followed by the System SGPRs that are
set up by ADC/SPI and can have different values for each wavefront of the grid
dispatch.
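
The dense numbering rule can be sketched in Python. The register names below
are shortened forms of the ``enable_sgpr_*`` fields and the register counts are
those listed in the SGPR Register Set Up Order table; the function itself is
hypothetical. The grid work-group count registers are omitted because their
enable bits are documented as not implemented in CP, and the sketch raises an
error rather than modeling what happens when more than 16 User SGPRs are
requested.

```python
# (name, number of SGPRs), in the defined set-up order.
USER_SGPRS = [
    ("private_segment_buffer", 4),
    ("dispatch_ptr", 2),
    ("queue_ptr", 2),
    ("kernarg_segment_ptr", 2),
    ("dispatch_id", 2),
    ("flat_scratch_init", 2),
    ("private_segment_size", 1),
]
SYSTEM_SGPRS = [
    ("workgroup_id_x", 1),
    ("workgroup_id_y", 1),
    ("workgroup_id_z", 1),
    ("workgroup_info", 1),
    ("private_segment_wavefront_offset", 1),
]


def assign_sgprs(enabled):
    """Map each enabled register to its (first, last) SGPR number."""
    user_count = sum(n for name, n in USER_SGPRS if name in enabled)
    if user_count > 16:
        raise ValueError("more than 16 User SGPRs is not modeled by this sketch")
    layout, next_sgpr = {}, 0
    for name, count in USER_SGPRS + SYSTEM_SGPRS:
        if name in enabled:  # disabled registers get no SGPR number
            layout[name] = (next_sgpr, next_sgpr + count - 1)
            next_sgpr += count
    return layout


layout = assign_sgprs({"private_segment_buffer", "kernarg_segment_ptr",
                       "workgroup_id_x", "private_segment_wavefront_offset"})
```

With these four registers enabled, the private segment buffer occupies
SGPR0-SGPR3, the kernarg segment pointer SGPR4-SGPR5, and the two System SGPRs
follow immediately in SGPR6 and SGPR7.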
2053
2054SGPR register initial state is defined in
2055:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
2056
2057 .. table:: SGPR Register Set Up Order
2058 :name: amdgpu-amdhsa-sgpr-register-set-up-order-table
2059
2060 ========== ========================== ====== ==============================
2061 SGPR Order Name Number Description
2062 (kernel descriptor enable of
2063 field) SGPRs
2064 ========== ========================== ====== ==============================
2065 First Private Segment Buffer 4 V# that can be used, together
Tony Tye5bbcca62018-03-08 05:46:01 +00002066 (enable_sgpr_private with Scratch Wavefront Offset
2067 _segment_buffer) as an offset, to access the
2068 private memory space using a
2069 segment address.
Tony Tyef16a45e2017-06-06 20:31:59 +00002070
2071 CP uses the value provided by
2072 the runtime.
2073 then Dispatch Ptr 2 64 bit address of AQL dispatch
2074 (enable_sgpr_dispatch_ptr) packet for kernel dispatch
2075 actually executing.
2076 then Queue Ptr 2 64 bit address of amd_queue_t
2077 (enable_sgpr_queue_ptr) object for AQL queue on which
2078 the dispatch packet was
2079 queued.
2080 then Kernarg Segment Ptr 2 64 bit address of Kernarg
2081 (enable_sgpr_kernarg segment. This is directly
2082 _segment_ptr) copied from the
2083 kernarg_address in the kernel
2084 dispatch packet.
2085
2086 Having CP load it once avoids
2087 loading it at the beginning of
2088 every wavefront.
2089 then Dispatch Id 2 64 bit Dispatch ID of the
2090 (enable_sgpr_dispatch_id) dispatch packet being
2091 executed.
2092 then Flat Scratch Init 2 This is 2 SGPRs:
2093 (enable_sgpr_flat_scratch
2094 _init) GFX6
2095 Not supported.
2096 GFX7-GFX8
2097 The first SGPR is a 32 bit
2098 byte offset from
2099 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2100 to per SPI base of memory
2101 for scratch for the queue
2102 executing the kernel
2103 dispatch. CP obtains this
Tony Tye46d35762017-08-15 20:47:41 +00002104 from the runtime. (The
2105 Scratch Segment Buffer base
2106 address is
2107 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
2108 plus this offset.) The value
Tony Tye5bbcca62018-03-08 05:46:01 +00002109 of Scratch Wavefront Offset must
Tony Tye46d35762017-08-15 20:47:41 +00002110 be added to this offset by
2111 the kernel machine code,
2112 right shifted by 8, and
2113 moved to the FLAT_SCRATCH_HI
2114 SGPR register.
2115 FLAT_SCRATCH_HI corresponds
2116 to SGPRn-4 on GFX7, and
2117 SGPRn-6 on GFX8 (where SGPRn
2118 is the highest numbered SGPR
Tony Tye5bbcca62018-03-08 05:46:01 +00002119 allocated to the wavefront).
Tony Tye46d35762017-08-15 20:47:41 +00002120 FLAT_SCRATCH_HI is
2121 multiplied by 256 (as it is
2122 in units of 256 bytes) and
2123 added to
2124 ``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye5bbcca62018-03-08 05:46:01 +00002125 to calculate the per wavefront
Tony Tye46d35762017-08-15 20:47:41 +00002126 FLAT SCRATCH BASE in flat
2127 memory instructions that
2128 access the scratch
2129 apperture.
Tony Tyef16a45e2017-06-06 20:31:59 +00002130
2131 The second SGPR is 32 bit
2132 byte size of a single
Konstantin Zhuravlyovea35e462017-10-19 17:12:55 +00002133 work-item's scratch memory
Tony Tye46d35762017-08-15 20:47:41 +00002134 usage. CP obtains this from
2135 the runtime, and it is
2136 always a multiple of DWORD.
2137 CP checks that the value in
2138 the kernel dispatch packet
2139 Private Segment Byte Size is
2140 not larger, and requests the
2141 runtime to increase the
2142 queue's scratch size if
2143 necessary. The kernel code
2144 must move it to
2145 FLAT_SCRATCH_LO which is
2146 SGPRn-3 on GFX7 and SGPRn-5
2147 on GFX8. FLAT_SCRATCH_LO is
2148 used as the FLAT SCRATCH
2149 SIZE in flat memory
Tony Tyef16a45e2017-06-06 20:31:59 +00002150 instructions. Having CP load
2151 it once avoids loading it at
2152 the beginning of every
Tony Tyef59d0712017-11-10 20:51:43 +00002153 wavefront.
2154 GFX9
2155 This is the
Tony Tye46d35762017-08-15 20:47:41 +00002156 64 bit base address of the
2157 per SPI scratch backing
2158 memory managed by SPI for
2159 the queue executing the
2160 kernel dispatch. CP obtains
2161 this from the runtime (and
Tony Tyef16a45e2017-06-06 20:31:59 +00002162 divides it if there are
2163 multiple Shader Arrays each
2164 with its own SPI). The value
Tony Tye5bbcca62018-03-08 05:46:01 +00002165 of Scratch Wavefront Offset must
Tony Tyef16a45e2017-06-06 20:31:59 +00002166 be added by the kernel
Tony Tye46d35762017-08-15 20:47:41 +00002167 machine code and the result
2168 moved to the FLAT_SCRATCH
2169 SGPR which is SGPRn-6 and
2170 SGPRn-5. It is used as the
2171 FLAT SCRATCH BASE in flat
Tony Tyef59d0712017-11-10 20:51:43 +00002172 memory instructions.
2173 then Private Segment Size 1 The 32 bit byte size of a
2174 (enable_sgpr_private single
2175 work-item's
2176 scratch_segment_size) memory
2177 allocation. This is the
2178 value from the kernel
2179 dispatch packet Private
2180 Segment Byte Size rounded up
2181 by CP to a multiple of
2182 DWORD.
Tony Tyef16a45e2017-06-06 20:31:59 +00002183
2184 Having CP load it once avoids
2185 loading it at the beginning of
2186 every wavefront.
2187
2188 This is not used for
2189 GFX7-GFX8 since it is the same
2190 value as the second SGPR of
2191 Flat Scratch Init. However, it
2192 may be needed for GFX9 which
2193 changes the meaning of the
2194 Flat Scratch Init value.
2195 then Grid Work-Group Count X 1 32 bit count of the number of
2196 (enable_sgpr_grid work-groups in the X dimension
2197 _workgroup_count_X) for the grid being
2198 executed. Computed from the
2199 fields in the kernel dispatch
2200 packet as ((grid_size.x +
2201 workgroup_size.x - 1) /
2202 workgroup_size.x).
2203 then Grid Work-Group Count Y 1 32 bit count of the number of
2204 (enable_sgpr_grid work-groups in the Y dimension
2205 _workgroup_count_Y && for the grid being
2206 less than 16 previous executed. Computed from the
2207 SGPRs) fields in the kernel dispatch
2208 packet as ((grid_size.y +
2209 workgroup_size.y - 1) /
                                                  workgroup_size.y).
2211
2212 Only initialized if <16
2213 previous SGPRs initialized.
2214 then Grid Work-Group Count Z 1 32 bit count of the number of
2215 (enable_sgpr_grid work-groups in the Z dimension
2216 _workgroup_count_Z && for the grid being
2217 less than 16 previous executed. Computed from the
2218 SGPRs) fields in the kernel dispatch
2219 packet as ((grid_size.z +
2220 workgroup_size.z - 1) /
                                                  workgroup_size.z).
2222
2223 Only initialized if <16
2224 previous SGPRs initialized.
2225 then Work-Group Id X 1 32 bit work-group id in X
2226 (enable_sgpr_workgroup_id dimension of grid for
2227 _X) wavefront.
2228 then Work-Group Id Y 1 32 bit work-group id in Y
2229 (enable_sgpr_workgroup_id dimension of grid for
2230 _Y) wavefront.
2231 then Work-Group Id Z 1 32 bit work-group id in Z
2232 (enable_sgpr_workgroup_id dimension of grid for
2233 _Z) wavefront.
Tony Tye5bbcca62018-03-08 05:46:01 +00002234 then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tyef16a45e2017-06-06 20:31:59 +00002235 (enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye5bbcca62018-03-08 05:46:01 +00002236 _info) threadgroup_size_in_wavefronts[5:0]}
2237 then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tyef16a45e2017-06-06 20:31:59 +00002238 (enable_sgpr_private of scratch base of queue
Tony Tye5bbcca62018-03-08 05:46:01 +00002239 _segment_wavefront_offset) executing the kernel
Tony Tyef16a45e2017-06-06 20:31:59 +00002240 dispatch. Must be used as an
2241 offset with Private
2242 segment address when using
2243 Scratch Segment Buffer. It
2244 must be used to set up FLAT
2245 SCRATCH for flat addressing
2246 (see
2247 :ref:`amdgpu-amdhsa-flat-scratch`).
2248 ========== ========================== ====== ==============================
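
As a minimal sketch of two entries in the table above, the Python below
evaluates the Grid Work-Group Count expression and unpacks a Work-Group Info
value. The function names are hypothetical, and the bit positions assume that
the braces list fields from the most significant bit down (Verilog-style
concatenation), which the table does not state explicitly.

```python
def grid_workgroup_count(grid_size: int, workgroup_size: int) -> int:
    """Round-up division used by the Grid Work-Group Count X/Y/Z rows."""
    if workgroup_size <= 0:
        raise ValueError("workgroup_size must be positive")
    return (grid_size + workgroup_size - 1) // workgroup_size


def decode_workgroup_info(value: int) -> dict:
    """Split a 32 bit Work-Group Info value into its named fields.

    Assumed layout (most significant bit first):
    first_wavefront (1 bit), 14 zero bits, ordered_append_term (11 bits),
    threadgroup_size_in_wavefronts (6 bits).
    """
    value &= 0xFFFFFFFF
    return {
        "first_wavefront": (value >> 31) & 0x1,
        "ordered_append_term": (value >> 6) & 0x7FF,
        "threadgroup_size_in_wavefronts": value & 0x3F,
    }
```

For example, a grid of 1000 work-items with a work-group size of 256 needs
``grid_workgroup_count(1000, 256)`` work-groups in that dimension, with the last
one only partially filled.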

The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
for enabled registers are dense starting at VGPR0: the first enabled register is
VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
VGPR number.

VGPR register initial state is defined in
:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.

  .. table:: VGPR Register Set Up Order
     :name: amdgpu-amdhsa-vgpr-register-set-up-order-table

     ========== ========================== ====== ==============================
     VGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     VGPRs
     ========== ========================== ====== ==============================
     First      Work-Item Id X             1      32 bit work item id in X
                (Always initialized)              dimension of work-group for
                                                  wavefront lane.
     then       Work-Item Id Y             1      32 bit work item id in Y
                (enable_vgpr_workitem_id          dimension of work-group for
                > 0)                              wavefront lane.
     then       Work-Item Id Z             1      32 bit work item id in Z
                (enable_vgpr_workitem_id          dimension of work-group for
                > 1)                              wavefront lane.
     ========== ========================== ====== ==============================
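
The dense numbering rule can be illustrated with a short Python sketch. It is
not taken from the backend; the helper names and the field-name strings are
hypothetical, and only the ordering and enable conditions come from the VGPR
table above.

```python
def dense_register_numbers(ordered_fields, enabled):
    """Assign register numbers to enabled fields, densely from 0.

    ordered_fields: list of (name, register_count) in the defined set-up order.
    enabled: set of field names that the kernel descriptor enables.
    Disabled fields get no number at all.
    """
    numbers = {}
    next_reg = 0
    for name, count in ordered_fields:
        if name in enabled:
            numbers[name] = next_reg
            next_reg += count
    return numbers


def workitem_id_vgprs(enable_vgpr_workitem_id: int) -> dict:
    """VGPR numbers for the work-item ids, following the table's conditions."""
    order = [("workitem_id_x", 1), ("workitem_id_y", 1), ("workitem_id_z", 1)]
    enabled = {"workitem_id_x"}  # always initialized
    if enable_vgpr_workitem_id > 0:
        enabled.add("workitem_id_y")
    if enable_vgpr_workitem_id > 1:
        enabled.add("workitem_id_z")
    return dense_register_numbers(order, enabled)
```

The same helper applies to the SGPR order, where a field may occupy more than
one register and so advances the next free number by its register count.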

The setting of registers is done by GPU CP/ADC/SPI hardware as follows:

1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
   registers.
2. Work-group Id registers X, Y, Z are set by ADC which supports any
   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per wavefront basis which is why
   its value cannot be included with the flat scratch init value which is per
   queue.
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
   or (X, Y, Z).

The Flat Scratch register pair are adjacent SGPRs so they can be moved as a 64
bit value to the hardware required SGPRn-3 and SGPRn-4 respectively.

The global segment can be accessed either using buffer instructions (GFX6 which
has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
instructions (GFX9).

If buffer operations are used then the compiler can generate a V# with the
following properties:

* base address of 0
* no swizzle
* ATC: 1 if IOMMU present (such as APU)
* ptr64: 1
* MTYPE set to support memory coherence that matches the runtime (such as CC for
  APU and NC for dGPU).

.. _amdgpu-amdhsa-kernel-prolog:

Kernel Prolog
~~~~~~~~~~~~~

.. _amdgpu-amdhsa-m0:

M0
++

GFX6-GFX8
  The M0 register must be initialized with a value at least the total LDS size
  if the kernel may access LDS via DS or flat operations. Total LDS size is
  available in the dispatch packet. For M0, it is also possible to use the
  maximum possible value of LDS for the given target (0x7FFF for GFX6 and
  0xFFFF for GFX7-GFX8).
GFX9
  The M0 register is not used for range checking LDS accesses and so does not
  need to be initialized in the prolog.
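
A small Python sketch of the per-target choice is shown below. The function
name and the integer encoding of the GFX generation are assumptions made for
illustration; the constants are the maximum values quoted above.

```python
def m0_lds_init_value(gfx_generation: int):
    """Return the maximum LDS value usable for M0, or None if M0 is unused.

    gfx_generation is a hypothetical encoding: 6 for GFX6, 7 for GFX7, etc.
    """
    if gfx_generation == 6:
        return 0x7FFF
    if gfx_generation in (7, 8):
        return 0xFFFF
    if gfx_generation == 9:
        # GFX9 does not use M0 for LDS range checking.
        return None
    raise ValueError("unsupported GFX generation: %r" % gfx_generation)
```

Using the maximum value avoids reading the total LDS size from the dispatch
packet, at the cost of giving up the tighter range check.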

.. _amdgpu-amdhsa-flat-scratch:

Flat Scratch
++++++++++++

If the kernel may use flat operations to access scratch memory, the prolog code
must set up the FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI
which are in SGPRn-4/SGPRn-3). Initialization uses the Flat Scratch Init and
Scratch Wavefront Offset SGPR registers (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

GFX6
  Flat scratch is not supported.

GFX7-GFX8
  1. The low word of Flat Scratch Init is the 32 bit byte offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
     being managed by SPI for the queue executing the kernel dispatch. This is
     the same value used in the Scratch Segment Buffer V# base address. The
     prolog must add the value of Scratch Wavefront Offset to get the
     wavefront's byte scratch backing memory offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since FLAT_SCRATCH_LO is in units of 256
     bytes, the offset must be right shifted by 8 before moving into
     FLAT_SCRATCH_LO.
  2. The second word of Flat Scratch Init is the 32 bit byte size of a single
     work-item's scratch memory usage. This is directly loaded from the kernel
     dispatch packet Private Segment Byte Size and rounded up to a multiple of
     DWORD. Having CP load it once avoids loading it at the beginning of every
     wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT
     SCRATCH SIZE.

GFX9
  The Flat Scratch Init is the 64 bit address of the base of scratch backing
  memory being managed by SPI for the queue executing the kernel dispatch. The
  prolog must add the value of Scratch Wavefront Offset and move the result to
  the FLAT_SCRATCH pair for use as the flat scratch base in flat memory
  instructions.
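
The arithmetic the prolog performs can be sketched in Python as follows. This
models only the values, not the instructions a real prolog would emit; the
function names are hypothetical, and wrapping the sums to 32 or 64 bits is an
assumption about how the scalar additions behave.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def flat_scratch_gfx7_gfx8(init_word0: int, init_word1: int,
                           wavefront_offset: int):
    """Return (base_in_256_byte_units, size_in_bytes) for GFX7-GFX8.

    init_word0: low word of Flat Scratch Init (byte offset of the queue's
                scratch backing memory).
    init_word1: second word of Flat Scratch Init (per work-item byte size).
    """
    byte_offset = (init_word0 + wavefront_offset) & MASK32
    return byte_offset >> 8, init_word1 & MASK32


def flat_scratch_gfx9(init_word0: int, init_word1: int,
                      wavefront_offset: int):
    """Return (low_word, high_word) of the 64 bit flat scratch base for GFX9."""
    base = ((init_word1 & MASK32) << 32) | (init_word0 & MASK32)
    base = (base + wavefront_offset) & MASK64
    return base & MASK32, base >> 32
```

Note that on GFX9 the addition can carry from the low word into the high word,
so the two halves cannot be computed independently.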

.. _amdgpu-amdhsa-memory-model:

Memory Model
~~~~~~~~~~~~

This section describes the mapping of the LLVM memory model onto AMDGPU machine
code (see :ref:`memmodel`). *The implementation is WIP.*

.. TODO
   Update when implementation complete.

The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.

The code sequences used to implement the memory model are defined in table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.

The sequences specify the order of instructions that a single thread must
execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
to other memory instructions executed by the same thread. This allows them to be
moved earlier or later which can allow them to be combined with other instances
of the same instruction, or hoisted/sunk out of loops to improve
performance. Only the instructions related to the memory model are given;
additional ``s_waitcnt`` instructions are required to ensure registers are
defined before being used. These may be able to be combined with the memory
model ``s_waitcnt`` instructions as described above.
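
To show how one group of rows in that table reads, the Python sketch below
returns the sequence for a ``load atomic acquire`` on the global address space
as a list of strings. The function name is hypothetical and the strings simply
repeat the table's shorthand (``buffer/global/flat_load`` stands for whichever
of those load instructions the compiler selected); it is an illustration, not
the backend's implementation.

```python
def load_atomic_acquire_global(sync_scope: str) -> list:
    """Code sequence for 'load atomic acquire' on the global address space."""
    if sync_scope in ("singlethread", "wavefront", "workgroup"):
        # Wavefronts of a work-group share a vector L1 cache, so the load
        # alone is sufficient.
        return ["buffer/global/flat_load"]
    if sync_scope in ("agent", "system"):
        return [
            "buffer/global/flat_load glc=1",
            # The load must complete before the cache is invalidated.
            "s_waitcnt vmcnt(0)",
            # Following loads must not see stale global data.
            "buffer_wbinvl1_vol",
        ]
    raise ValueError("unknown sync scope: %r" % sync_scope)
```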

The AMDGPU backend supports the following memory models:

  HSA Memory Model [HSA]_
    The HSA memory model uses a single happens-before relation for all address
    spaces (see :ref:`amdgpu-address-spaces`).
  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions join the relationships. Since
    the LLVM ``memfence`` instruction does not allow an address space to be
    specified the OpenCL fence has to conservatively assume both local and
    global address space was specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).

``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.

``buffer/global/flat_load/store/atomic`` instructions to global memory are
termed vector memory operations.

For GFX6-GFX9:

* Each agent has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A
  ``s_waitcnt lgkmcnt(0)`` is required to ensure synchronization between LDS
  operations and vector memory operations between wavefronts of a work-group,
  but not between operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
  vector memory order if they access LDS memory, and out of LDS operation order
  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each CU has a separate request queue per channel. Therefore, the vector and
  scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different CUs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
  ensure synchronization between vector memory operations of different CUs. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire and release.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.

Private address space uses ``buffer_load/store`` using the scratch V#
(GFX6-GFX8), or ``scratch_load/store`` (GFX9). Since only a single thread is
accessing the memory, atomic memory orderings are not meaningful and all
accesses are treated as non-atomic.

Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch it is not legal to perform
stores, and atomic memory orderings are not meaningful and all accesses are
treated as non-atomic.

A memory synchronization scope wider than work-group is not meaningful for the
group (LDS) address space and is treated as work-group.

The memory model does not support the region address space which is treated as
non-atomic.

Acquire memory ordering is not meaningful on store atomic instructions and is
treated as non-atomic.

Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.

Acquire-release memory ordering is not meaningful on load or store atomic
instructions and is treated as acquire and release respectively.
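
These normalization rules can be summarized in a short Python sketch. The
function name, the string spellings of orderings and address spaces, and the
use of ``None`` for the scope of a non-atomic access are all assumptions made
for illustration.

```python
def normalize_atomic(instr: str, ordering: str, scope: str,
                     address_space: str):
    """Return the (ordering, scope) actually implemented for an atomic access.

    instr is 'load', 'store' or 'atomicrmw'. A result ordering of
    'non-atomic' carries no scope.
    """
    # Private, constant and region accesses are always treated as non-atomic.
    if address_space in ("private", "constant", "region"):
        return "non-atomic", None
    # Orderings that are not meaningful for the instruction kind.
    if instr == "store" and ordering == "acquire":
        return "non-atomic", None
    if instr == "load" and ordering == "release":
        return "non-atomic", None
    if ordering == "acq_rel":
        if instr == "load":
            ordering = "acquire"
        elif instr == "store":
            ordering = "release"
    # The group (LDS) address space cannot be shared wider than a work-group.
    if address_space == "local" and scope in ("agent", "system"):
        scope = "workgroup"
    return ordering, scope
```

For instance, an acquire-release atomic load from LDS at system scope is
handled as an acquire load at work-group scope.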

The AMDGPU backend only uses scalar memory operations to access memory that is
proven to not change during the execution of the kernel dispatch. This includes
constant address space and global address space for program scope const
variables. Therefore the kernel machine code does not have to maintain the
scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
and vector L1 caches are invalidated between kernel dispatches by CP since
constant address space data may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
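
The placement rule for ``s_dcache_wb`` can be sketched over a list of
instruction strings as below. This is a toy model rather than the backend's
pass: the function name is hypothetical, and identifying a function return by
the mnemonic ``s_setpc_b64`` is an assumption exposed as a parameter.

```python
def insert_scalar_writeback(instructions, uses_scalar_writes,
                            return_mnemonic="s_setpc_b64"):
    """Insert s_dcache_wb before s_endpgm and before function returns.

    instructions: list of assembly strings such as 's_endpgm'.
    uses_scalar_writes: True if scalar writes were used to spill SGPRs.
    return_mnemonic: assumed mnemonic that marks a function return.
    """
    if not uses_scalar_writes:
        return list(instructions)
    result = []
    for inst in instructions:
        parts = inst.split()
        mnemonic = parts[0] if parts else ""
        if mnemonic in ("s_endpgm", return_mnemonic):
            result.append("s_dcache_wb")
        result.append(inst)
    return result
```

No ``s_dcache_inv`` is modelled because, as stated above, spilled values are
always written before they are read by the same thread.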

Scratch backing memory (which is used for the private address space)
is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the L1
cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
volatile cache lines.

On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
to invalidate the L2 cache. This also causes it to be treated as
non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
(cache coherent) and so the L2 cache will be coherent with the CPU and other
agents.

2518 .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
2519 :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
2520
Tony Tye6baa6d22017-10-18 22:16:55 +00002521 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002522 LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
2523 Ordering Sync Scope Address
2524 Space
Tony Tye6baa6d22017-10-18 22:16:55 +00002525 ============ ============ ============== ========== ===============================
Tony Tyef16a45e2017-06-06 20:31:59 +00002526 **Non-Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002527 -----------------------------------------------------------------------------------
2528 load *none* *none* - global - !volatile & !nontemporal
2529 - generic
2530 - private 1. buffer/global/flat_load
2531 - constant
2532 - volatile & !nontemporal
2533
Tony Tyef16a45e2017-06-06 20:31:59 +00002534 1. buffer/global/flat_load
2535 glc=1
Tony Tye6baa6d22017-10-18 22:16:55 +00002536
2537 - nontemporal
2538
2539 1. buffer/global/flat_load
2540 glc=1 slc=1
2541
Tony Tyef16a45e2017-06-06 20:31:59 +00002542 load *none* *none* - local 1. ds_load
Tony Tye6baa6d22017-10-18 22:16:55 +00002543 store *none* *none* - global - !nontemporal
Tony Tyef16a45e2017-06-06 20:31:59 +00002544 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002545 - private 1. buffer/global/flat_store
2546 - constant
2547 - nontemporal
2548
                                                            1. buffer/global/flat_store
2550 glc=1 slc=1
2551
Tony Tyef16a45e2017-06-06 20:31:59 +00002552 store *none* *none* - local 1. ds_store
2553 **Unordered Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002554 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002555 load atomic unordered *any* *any* *Same as non-atomic*.
2556 store atomic unordered *any* *any* *Same as non-atomic*.
2557 atomicrmw unordered *any* *any* *Same as monotonic
2558 atomic*.
2559 **Monotonic Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002560 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002561 load atomic monotonic - singlethread - global 1. buffer/global/flat_load
2562 - wavefront - generic
2563 - workgroup
2564 load atomic monotonic - singlethread - local 1. ds_load
2565 - wavefront
2566 - workgroup
2567 load atomic monotonic - agent - global 1. buffer/global/flat_load
2568 - system - generic glc=1
2569 store atomic monotonic - singlethread - global 1. buffer/global/flat_store
2570 - wavefront - generic
2571 - workgroup
2572 - agent
2573 - system
2574 store atomic monotonic - singlethread - local 1. ds_store
2575 - wavefront
2576 - workgroup
2577 atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
2578 - wavefront - generic
2579 - workgroup
2580 - agent
2581 - system
2582 atomicrmw monotonic - singlethread - local 1. ds_atomic
2583 - wavefront
2584 - workgroup
2585 **Acquire Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002586 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002587 load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
2588 - wavefront - local
2589 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002590 load atomic acquire - workgroup - global 1. buffer/global/flat_load
2591 load atomic acquire - workgroup - local 1. ds_load
2592 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002593
Tony Tye6baa6d22017-10-18 22:16:55 +00002594 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002595 - Must happen before
2596 any following
2597 global/generic
2598 load/load
2599 atomic/store/store
2600 atomic/atomicrmw.
2601 - Ensures any
2602 following global
2603 data read is no
2604 older than the load
2605 atomic value being
2606 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00002607 load atomic acquire - workgroup - generic 1. flat_load
2608 2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002609
Tony Tye6baa6d22017-10-18 22:16:55 +00002610 - If OpenCL, omit.
2611 - Must happen before
2612 any following
2613 global/generic
2614 load/load
2615 atomic/store/store
2616 atomic/atomicrmw.
2617 - Ensures any
2618 following global
2619 data read is no
2620 older than the load
2621 atomic value being
2622 acquired.
2623 load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tyef16a45e2017-06-06 20:31:59 +00002624 - system glc=1
2625 2. s_waitcnt vmcnt(0)
2626
2627 - Must happen before
2628 following
2629 buffer_wbinvl1_vol.
2630 - Ensures the load
2631 has completed
2632 before invalidating
2633 the cache.
2634
2635 3. buffer_wbinvl1_vol
2636
2637 - Must happen before
2638 any following
2639 global/generic
2640 load/load
2641 atomic/atomicrmw.
2642 - Ensures that
2643 following
2644 loads will not see
2645 stale global data.
2646
2647 load atomic acquire - agent - generic 1. flat_load glc=1
2648 - system 2. s_waitcnt vmcnt(0) &
2649 lgkmcnt(0)
2650
2651 - If OpenCL omit
2652 lgkmcnt(0).
2653 - Must happen before
2654 following
2655 buffer_wbinvl1_vol.
2656 - Ensures the flat_load
2657 has completed
2658 before invalidating
2659 the cache.
2660
2661 3. buffer_wbinvl1_vol
2662
2663 - Must happen before
2664 any following
2665 global/generic
2666 load/load
2667 atomic/atomicrmw.
2668 - Ensures that
2669 following loads
2670 will not see stale
2671 global data.
2672
2673 atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
2674 - wavefront - local
2675 - generic
Tony Tye6baa6d22017-10-18 22:16:55 +00002676 atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
2677 atomicrmw acquire - workgroup - local 1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002679
Tony Tye6baa6d22017-10-18 22:16:55 +00002680 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002681 - Must happen before
2682 any following
2683 global/generic
2684 load/load
2685 atomic/store/store
2686 atomic/atomicrmw.
2687 - Ensures any
2688 following global
2689 data read is no
2690 older than the
2691 atomicrmw value
2692 being acquired.
2693
Tony Tye6baa6d22017-10-18 22:16:55 +00002694 atomicrmw acquire - workgroup - generic 1. flat_atomic
                                                         2. s_waitcnt lgkmcnt(0)
2696
2697 - If OpenCL, omit.
2698 - Must happen before
2699 any following
2700 global/generic
2701 load/load
2702 atomic/store/store
2703 atomic/atomicrmw.
2704 - Ensures any
2705 following global
2706 data read is no
2707 older than the
2708 atomicrmw value
2709 being acquired.
2710
2711 atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00002712 - system 2. s_waitcnt vmcnt(0)
2713
2714 - Must happen before
2715 following
2716 buffer_wbinvl1_vol.
2717 - Ensures the
2718 atomicrmw has
2719 completed before
2720 invalidating the
2721 cache.
2722
2723 3. buffer_wbinvl1_vol
2724
2725 - Must happen before
2726 any following
2727 global/generic
2728 load/load
2729 atomic/atomicrmw.
2730 - Ensures that
2731 following loads
2732 will not see stale
2733 global data.
2734
2735 atomicrmw acquire - agent - generic 1. flat_atomic
2736 - system 2. s_waitcnt vmcnt(0) &
2737 lgkmcnt(0)
2738
2739 - If OpenCL, omit
2740 lgkmcnt(0).
2741 - Must happen before
2742 following
2743 buffer_wbinvl1_vol.
2744 - Ensures the
2745 atomicrmw has
2746 completed before
2747 invalidating the
2748 cache.
2749
2750 3. buffer_wbinvl1_vol
2751
2752 - Must happen before
2753 any following
2754 global/generic
2755 load/load
2756 atomic/atomicrmw.
2757 - Ensures that
2758 following loads
2759 will not see stale
2760 global data.
2761
2762 fence acquire - singlethread *none* *none*
2763 - wavefront
2764 fence acquire - workgroup *none* 1. s_waitcnt lgkmcnt(0)
2765
2766 - If OpenCL and
2767 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00002768 not generic, omit.
2769 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002770 currently has no
2771 address space on
2772 the fence need to
2773 conservatively
2774 always generate. If
2775 fence had an
2776 address space then
2777 set to address
2778 space of OpenCL
2779 fence flag, or to
2780 generic if both
2781 local and global
2782 flags are
2783 specified.
2784 - Must happen after
2785 any preceding
2786 local/generic load
2787 atomic/atomicrmw
2788 with an equal or
2789 wider sync scope
2790 and memory ordering
2791 stronger than
2792 unordered (this is
2793 termed the
2794 fence-paired-atomic).
2795 - Must happen before
2796 any following
2797 global/generic
2798 load/load
2799 atomic/store/store
2800 atomic/atomicrmw.
2801 - Ensures any
2802 following global
2803 data read is no
2804 older than the
2805 value read by the
2806 fence-paired-atomic.
2807
Tony Tye6baa6d22017-10-18 22:16:55 +00002808 fence acquire - agent *none* 1. s_waitcnt lgkmcnt(0) &
2809 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002810
2811 - If OpenCL and
2812 address space is
2813 not generic, omit
2814 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00002815 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00002816 currently has no
2817 address space on
2818 the fence need to
2819 conservatively
2820 always generate
2821 (see comment for
2822 previous fence).
Tony Tyed9c251f2017-06-07 00:08:35 +00002823 - Could be split into
Tony Tyef16a45e2017-06-06 20:31:59 +00002824 separate s_waitcnt
2825 vmcnt(0) and
2826 s_waitcnt
2827 lgkmcnt(0) to allow
2828 them to be
2829 independently moved
2830 according to the
2831 following rules.
2832 - s_waitcnt vmcnt(0)
2833 must happen after
2834 any preceding
2835 global/generic load
2836 atomic/atomicrmw
2837 with an equal or
2838 wider sync scope
2839 and memory ordering
2840 stronger than
2841 unordered (this is
2842 termed the
2843 fence-paired-atomic).
2844 - s_waitcnt lgkmcnt(0)
2845 must happen after
2846 any preceding
Tony Tye6baa6d22017-10-18 22:16:55 +00002847 local/generic load
Tony Tyef16a45e2017-06-06 20:31:59 +00002848 atomic/atomicrmw
2849 with an equal or
2850 wider sync scope
2851 and memory ordering
2852 stronger than
2853 unordered (this is
2854 termed the
2855 fence-paired-atomic).
2856 - Must happen before
2857 the following
2858 buffer_wbinvl1_vol.
2859 - Ensures that the
2860 fence-paired atomic
2861 has completed
2862 before invalidating
2863 the
2864 cache. Therefore
2865 any following
2866 locations read must
2867 be no older than
2868 the value read by
2869 the
2870 fence-paired-atomic.
2871
2872 2. buffer_wbinvl1_vol
2873
Tony Tye6baa6d22017-10-18 22:16:55 +00002874 - Must happen before any
2875 following global/generic
Tony Tyef16a45e2017-06-06 20:31:59 +00002876 load/load
2877 atomic/store/store
2878 atomic/atomicrmw.
2879 - Ensures that
2880 following loads
2881 will not see stale
2882 global data.
2883
2884 **Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00002885 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00002886 store atomic release - singlethread - global 1. buffer/global/ds/flat_store
2887 - wavefront - local
2888 - generic
2889 store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002890
2891 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002892 - Must happen after
2893 any preceding
2894 local/generic
2895 load/store/load
2896 atomic/store
2897 atomic/atomicrmw.
2898 - Must happen before
2899 the following
2900 store.
2901 - Ensures that all
2902 memory operations
2903 to local have
2904 completed before
2905 performing the
2906 store that is being
2907 released.
2908
2909 2. buffer/global/flat_store
2910 store atomic release - workgroup - local 1. ds_store
Tony Tye6baa6d22017-10-18 22:16:55 +00002911 store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2912
2913 - If OpenCL, omit.
2914 - Must happen after
2915 any preceding
2916 local/generic
2917 load/store/load
2918 atomic/store
2919 atomic/atomicrmw.
2920 - Must happen before
2921 the following
2922 store.
2923 - Ensures that all
2924 memory operations
2925 to local have
2926 completed before
2927 performing the
2928 store that is being
2929 released.
2930
2931 2. flat_store
2932 store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
2933 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00002934
2935 - If OpenCL, omit
2936 lgkmcnt(0).
2937 - Could be split into
2938 separate s_waitcnt
2939 vmcnt(0) and
2940 s_waitcnt
2941 lgkmcnt(0) to allow
2942 them to be
2943 independently moved
2944 according to the
2945 following rules.
2946 - s_waitcnt vmcnt(0)
2947 must happen after
2948 any preceding
2949 global/generic
2950 load/store/load
2951 atomic/store
2952 atomic/atomicrmw.
2953 - s_waitcnt lgkmcnt(0)
2954 must happen after
2955 any preceding
2956 local/generic
2957 load/store/load
2958 atomic/store
2959 atomic/atomicrmw.
2960 - Must happen before
2961 the following
2962 store.
2963 - Ensures that all
2964 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00002965 to memory have
Tony Tyef16a45e2017-06-06 20:31:59 +00002966 completed before
2967 performing the
2968 store that is being
2969 released.
2970
2971 2. buffer/global/ds/flat_store
2972 atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
2973 - wavefront - local
2974 - generic
2975 atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye6baa6d22017-10-18 22:16:55 +00002976
2977 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00002978 - Must happen after
2979 any preceding
2980 local/generic
2981 load/store/load
2982 atomic/store
2983 atomic/atomicrmw.
2984 - Must happen before
2985 the following
2986 atomicrmw.
2987 - Ensures that all
2988 memory operations
2989 to local have
2990 completed before
2991 performing the
2992 atomicrmw that is
2993 being released.
2994
2995 2. buffer/global/flat_atomic
2996 atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye6baa6d22017-10-18 22:16:55 +00002997 atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
2998
2999 - If OpenCL, omit.
3000 - Must happen after
3001 any preceding
3002 local/generic
3003 load/store/load
3004 atomic/store
3005 atomic/atomicrmw.
3006 - Must happen before
3007 the following
3008 atomicrmw.
3009 - Ensures that all
3010 memory operations
3011 to local have
3012 completed before
3013 performing the
3014 atomicrmw that is
3015 being released.
3016
3017 2. flat_atomic
3018 atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
3019 - system - generic vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003020
3021 - If OpenCL, omit
3022 lgkmcnt(0).
3023 - Could be split into
3024 separate s_waitcnt
3025 vmcnt(0) and
3026 s_waitcnt
3027 lgkmcnt(0) to allow
3028 them to be
3029 independently moved
3030 according to the
3031 following rules.
3032 - s_waitcnt vmcnt(0)
3033 must happen after
3034 any preceding
3035 global/generic
3036 load/store/load
3037 atomic/store
3038 atomic/atomicrmw.
3039 - s_waitcnt lgkmcnt(0)
3040 must happen after
3041 any preceding
3042 local/generic
3043 load/store/load
3044 atomic/store
3045 atomic/atomicrmw.
3046 - Must happen before
3047 the following
3048 atomicrmw.
3049 - Ensures that all
3050 memory operations
3051 to global and local
3052 have completed
3053 before performing
3054 the atomicrmw that
3055 is being released.
3056
Tony Tye6baa6d22017-10-18 22:16:55 +00003057 2. buffer/global/ds/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003058 fence release - singlethread *none* *none*
3059 - wavefront
3060 fence release - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3061
3062 - If OpenCL and
3063 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003064 not generic, omit.
3065 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003066 currently has no
3067 address space on
3068 the fence need to
3069 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003070 always generate. If
3071 fence had an
3072 address space then
3073 set to address
3074 space of OpenCL
3075 fence flag, or to
3076 generic if both
3077 local and global
3078 flags are
3079 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003080 - Must happen after
3081 any preceding
3082 local/generic
3083 load/load
3084 atomic/store/store
3085 atomic/atomicrmw.
3086 - Must happen before
3087 any following store
3088 atomic/atomicrmw
3089 with an equal or
3090 wider sync scope
3091 and memory ordering
3092 stronger than
3093 unordered (this is
3094 termed the
3095 fence-paired-atomic).
3096 - Ensures that all
3097 memory operations
3098 to local have
3099 completed before
3100 performing the
3101 following
3102 fence-paired-atomic.
3103
Tony Tye6baa6d22017-10-18 22:16:55 +00003104 fence release - agent *none* 1. s_waitcnt lgkmcnt(0) &
3105 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003106
3107 - If OpenCL and
3108 address space is
3109 not generic, omit
3110 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003111 - If OpenCL and
3112 address space is
3113 local, omit
3114 vmcnt(0).
3115 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003116 currently has no
3117 address space on
3118 the fence need to
3119 conservatively
Tony Tye6baa6d22017-10-18 22:16:55 +00003120 always generate. If
3121 fence had an
3122 address space then
3123 set to address
3124 space of OpenCL
3125 fence flag, or to
3126 generic if both
3127 local and global
3128 flags are
3129 specified.
Tony Tyef16a45e2017-06-06 20:31:59 +00003130 - Could be split into
3131 separate s_waitcnt
3132 vmcnt(0) and
3133 s_waitcnt
3134 lgkmcnt(0) to allow
3135 them to be
3136 independently moved
3137 according to the
3138 following rules.
3139 - s_waitcnt vmcnt(0)
3140 must happen after
3141 any preceding
3142 global/generic
3143 load/store/load
3144 atomic/store
3145 atomic/atomicrmw.
3146 - s_waitcnt lgkmcnt(0)
3147 must happen after
3148 any preceding
3149 local/generic
3150 load/store/load
3151 atomic/store
3152 atomic/atomicrmw.
3153 - Must happen before
3154 any following store
3155 atomic/atomicrmw
3156 with an equal or
3157 wider sync scope
3158 and memory ordering
3159 stronger than
3160 unordered (this is
3161 termed the
3162 fence-paired-atomic).
3163 - Ensures that all
3164 memory operations
Tony Tye6baa6d22017-10-18 22:16:55 +00003165 have
Tony Tyef16a45e2017-06-06 20:31:59 +00003166 completed before
3167 performing the
3168 following
3169 fence-paired-atomic.
3170
3171 **Acquire-Release Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003172 -----------------------------------------------------------------------------------
Tony Tyef16a45e2017-06-06 20:31:59 +00003173 atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
3174 - wavefront - local
3175 - generic
3176 atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
3177
Tony Tye6baa6d22017-10-18 22:16:55 +00003178 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003179 - Must happen after
3180 any preceding
3181 local/generic
3182 load/store/load
3183 atomic/store
3184 atomic/atomicrmw.
3185 - Must happen before
3186 the following
3187 atomicrmw.
3188 - Ensures that all
3189 memory operations
3190 to local have
3191 completed before
3192 performing the
3193 atomicrmw that is
3194 being released.
3195
Tony Tye6baa6d22017-10-18 22:16:55 +00003196 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003197 atomicrmw acq_rel - workgroup - local 1. ds_atomic
3198 2. s_waitcnt lgkmcnt(0)
3199
Tony Tye6baa6d22017-10-18 22:16:55 +00003200 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003201 - Must happen before
3202 any following
3203 global/generic
3204 load/load
3205 atomic/store/store
3206 atomic/atomicrmw.
3207 - Ensures any
3208 following global
3209 data read is no
3210 older than the load
3211 atomic value being
3212 acquired.
3213
3214 atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
3215
Tony Tye6baa6d22017-10-18 22:16:55 +00003216 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003217 - Must happen after
3218 any preceding
3219 local/generic
3220 load/store/load
3221 atomic/store
3222 atomic/atomicrmw.
3223 - Must happen before
3224 the following
3225 atomicrmw.
3226 - Ensures that all
3227 memory operations
3228 to local have
3229 completed before
3230 performing the
3231 atomicrmw that is
3232 being released.
3233
3234 2. flat_atomic
3235 3. s_waitcnt lgkmcnt(0)
3236
Tony Tye6baa6d22017-10-18 22:16:55 +00003237 - If OpenCL, omit.
Tony Tyef16a45e2017-06-06 20:31:59 +00003238 - Must happen before
3239 any following
3240 global/generic
3241 load/load
3242 atomic/store/store
3243 atomic/atomicrmw.
3244 - Ensures any
3245 following global
3246 data read is no
3247 older than the load
3248 atomic value being
3249 acquired.
Tony Tye6baa6d22017-10-18 22:16:55 +00003250
3251 atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
3252 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003253
3254 - If OpenCL, omit
3255 lgkmcnt(0).
3256 - Could be split into
3257 separate s_waitcnt
3258 vmcnt(0) and
3259 s_waitcnt
3260 lgkmcnt(0) to allow
3261 them to be
3262 independently moved
3263 according to the
3264 following rules.
3265 - s_waitcnt vmcnt(0)
3266 must happen after
3267 any preceding
3268 global/generic
3269 load/store/load
3270 atomic/store
3271 atomic/atomicrmw.
3272 - s_waitcnt lgkmcnt(0)
3273 must happen after
3274 any preceding
3275 local/generic
3276 load/store/load
3277 atomic/store
3278 atomic/atomicrmw.
3279 - Must happen before
3280 the following
3281 atomicrmw.
3282 - Ensures that all
3283 memory operations
3284 to global have
3285 completed before
3286 performing the
3287 atomicrmw that is
3288 being released.
3289
Tony Tye6baa6d22017-10-18 22:16:55 +00003290 2. buffer/global/flat_atomic
Tony Tyef16a45e2017-06-06 20:31:59 +00003291 3. s_waitcnt vmcnt(0)
3292
3293 - Must happen before
3294 following
3295 buffer_wbinvl1_vol.
3296 - Ensures the
3297 atomicrmw has
3298 completed before
3299 invalidating the
3300 cache.
3301
3302 4. buffer_wbinvl1_vol
3303
3304 - Must happen before
3305 any following
3306 global/generic
3307 load/load
3308 atomic/atomicrmw.
3309 - Ensures that
3310 following loads
3311 will not see stale
3312 global data.
3313
Tony Tye6baa6d22017-10-18 22:16:55 +00003314 atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
3315 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003316
3317 - If OpenCL, omit
3318 lgkmcnt(0).
3319 - Could be split into
3320 separate s_waitcnt
3321 vmcnt(0) and
3322 s_waitcnt
3323 lgkmcnt(0) to allow
3324 them to be
3325 independently moved
3326 according to the
3327 following rules.
3328 - s_waitcnt vmcnt(0)
3329 must happen after
3330 any preceding
3331 global/generic
3332 load/store/load
3333 atomic/store
3334 atomic/atomicrmw.
3335 - s_waitcnt lgkmcnt(0)
3336 must happen after
3337 any preceding
3338 local/generic
3339 load/store/load
3340 atomic/store
3341 atomic/atomicrmw.
3342 - Must happen before
3343 the following
3344 atomicrmw.
3345 - Ensures that all
3346 memory operations
3347 to global have
3348 completed before
3349 performing the
3350 atomicrmw that is
3351 being released.
3352
3353 2. flat_atomic
3354 3. s_waitcnt vmcnt(0) &
3355 lgkmcnt(0)
3356
3357 - If OpenCL, omit
3358 lgkmcnt(0).
3359 - Must happen before
3360 following
3361 buffer_wbinvl1_vol.
3362 - Ensures the
3363 atomicrmw has
3364 completed before
3365 invalidating the
3366 cache.
3367
3368 4. buffer_wbinvl1_vol
3369
3370 - Must happen before
3371 any following
3372 global/generic
3373 load/load
3374 atomic/atomicrmw.
3375 - Ensures that
3376 following loads
3377 will not see stale
3378 global data.
3379
3380 fence acq_rel - singlethread *none* *none*
3381 - wavefront
3382 fence acq_rel - workgroup *none* 1. s_waitcnt lgkmcnt(0)
3383
3384 - If OpenCL and
3385 address space is
Tony Tye6baa6d22017-10-18 22:16:55 +00003386 not generic, omit.
3387 - However,
Tony Tyef16a45e2017-06-06 20:31:59 +00003388 since LLVM
3389 currently has no
3390 address space on
3391 the fence need to
3392 conservatively
3393 always generate
3394 (see comment for
3395 previous fence).
3396 - Must happen after
3397 any preceding
3398 local/generic
3399 load/load
3400 atomic/store/store
3401 atomic/atomicrmw.
3402 - Must happen before
3403 any following
3404 global/generic
3405 load/load
3406 atomic/store/store
3407 atomic/atomicrmw.
3408 - Ensures that all
3409 memory operations
3410 to local have
3411 completed before
3412 performing any
3413 following global
3414 memory operations.
3415 - Ensures that the
3416 preceding
3417 local/generic load
3418 atomic/atomicrmw
3419 with an equal or
3420 wider sync scope
3421 and memory ordering
3422 stronger than
3423 unordered (this is
3424 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003425 acquire-fence-paired-atomic
3426 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003427 before following
3428 global memory
3429 operations. This
3430 satisfies the
3431 requirements of
3432 acquire.
3433 - Ensures that all
3434 previous memory
3435 operations have
3436 completed before a
3437 following
3438 local/generic store
3439 atomic/atomicrmw
3440 with an equal or
3441 wider sync scope
3442 and memory ordering
3443 stronger than
3444 unordered (this is
3445 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003446 release-fence-paired-atomic
3447 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003448 requirements of
3449 release.
3450
Tony Tye6baa6d22017-10-18 22:16:55 +00003451 fence acq_rel - agent *none* 1. s_waitcnt lgkmcnt(0) &
3452 - system vmcnt(0)
Tony Tyef16a45e2017-06-06 20:31:59 +00003453
3454 - If OpenCL and
3455 address space is
3456 not generic, omit
3457 lgkmcnt(0).
Tony Tye6baa6d22017-10-18 22:16:55 +00003458 - However, since LLVM
Tony Tyef16a45e2017-06-06 20:31:59 +00003459 currently has no
3460 address space on
3461 the fence need to
3462 conservatively
3463 always generate
3464 (see comment for
3465 previous fence).
3466 - Could be split into
3467 separate s_waitcnt
3468 vmcnt(0) and
3469 s_waitcnt
3470 lgkmcnt(0) to allow
3471 them to be
3472 independently moved
3473 according to the
3474 following rules.
3475 - s_waitcnt vmcnt(0)
3476 must happen after
3477 any preceding
3478 global/generic
3479 load/store/load
3480 atomic/store
3481 atomic/atomicrmw.
3482 - s_waitcnt lgkmcnt(0)
3483 must happen after
3484 any preceding
3485 local/generic
3486 load/store/load
3487 atomic/store
3488 atomic/atomicrmw.
3489 - Must happen before
3490 the following
3491 buffer_wbinvl1_vol.
3492 - Ensures that the
3493 preceding
3494 global/local/generic
3495 load
3496 atomic/atomicrmw
3497 with an equal or
3498 wider sync scope
3499 and memory ordering
3500 stronger than
3501 unordered (this is
3502 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003503 acquire-fence-paired-atomic
3504 ) has completed
Tony Tyef16a45e2017-06-06 20:31:59 +00003505 before invalidating
3506 the cache. This
3507 satisfies the
3508 requirements of
3509 acquire.
3510 - Ensures that all
3511 previous memory
3512 operations have
3513 completed before a
3514 following
3515 global/local/generic
3516 store
3517 atomic/atomicrmw
3518 with an equal or
3519 wider sync scope
3520 and memory ordering
3521 stronger than
3522 unordered (this is
3523 termed the
Tony Tye6baa6d22017-10-18 22:16:55 +00003524 release-fence-paired-atomic
3525 ). This satisfies the
Tony Tyef16a45e2017-06-06 20:31:59 +00003526 requirements of
3527 release.
3528
3529 2. buffer_wbinvl1_vol
3530
3531 - Must happen before
3532 any following
3533 global/generic
3534 load/load
3535 atomic/store/store
3536 atomic/atomicrmw.
3537 - Ensures that
3538 following loads
3539 will not see stale
3540 global data. This
3541 satisfies the
3542 requirements of
3543 acquire.
3544
3545 **Sequential Consistent Atomic**
Tony Tye6baa6d22017-10-18 22:16:55 +00003546 -----------------------------------------------------------------------------------
     load atomic  seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    load atomic acquire,
                                              - generic  except must generate
                                                         all instructions even
                                                         for OpenCL.*
3552 load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
3553 - generic
3554 - Must
3555 happen after
3556 preceding
3557 global/generic load
3558 atomic/store
3559 atomic/atomicrmw
3560 with memory
3561 ordering of seq_cst
3562 and with equal or
3563 wider sync scope.
3564 (Note that seq_cst
3565 fences have their
3566 own s_waitcnt
3567 lgkmcnt(0) and so do
3568 not need to be
3569 considered.)
3570 - Ensures any
3571 preceding
3572 sequential
3573 consistent local
3574 memory instructions
3575 have completed
3576 before executing
3577 this sequentially
3578 consistent
3579 instruction. This
3580 prevents reordering
3581 a seq_cst store
3582 followed by a
3583 seq_cst load. (Note
3584 that seq_cst is
3585 stronger than
3586 acquire/release as
3587 the reordering of
3588 load acquire
3589 followed by a store
3590 release is
3591 prevented by the
3592 waitcnt of
3593 the release, but
3594 there is nothing
3595 preventing a store
3596 release followed by
3597 load acquire from
                                                             completing out of
3599 order.)
3600
                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     load atomic  seq_cst      - workgroup    - local    *Same as corresponding
                                                         load atomic acquire,
                                                         except must generate
                                                         all instructions even
                                                         for OpenCL.*
3613 load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
3614 - system - generic vmcnt(0)
3615
3616 - Could be split into
3617 separate s_waitcnt
3618 vmcnt(0)
3619 and s_waitcnt
3620 lgkmcnt(0) to allow
3621 them to be
3622 independently moved
3623 according to the
3624 following rules.
                                                           - s_waitcnt lgkmcnt(0)
3626 must happen after
3627 preceding
3628 global/generic load
3629 atomic/store
3630 atomic/atomicrmw
3631 with memory
3632 ordering of seq_cst
3633 and with equal or
3634 wider sync scope.
3635 (Note that seq_cst
3636 fences have their
3637 own s_waitcnt
3638 lgkmcnt(0) and so do
3639 not need to be
3640 considered.)
                                                           - s_waitcnt vmcnt(0)
3642 must happen after
Tony Tyef16a45e2017-06-06 20:31:59 +00003643 preceding
3644 global/generic load
3645 atomic/store
3646 atomic/atomicrmw
3647 with memory
3648 ordering of seq_cst
3649 and with equal or
3650 wider sync scope.
3651 (Note that seq_cst
3652 fences have their
3653 own s_waitcnt
3654 vmcnt(0) and so do
3655 not need to be
3656 considered.)
3657 - Ensures any
3658 preceding
3659 sequential
3660 consistent global
3661 memory instructions
3662 have completed
3663 before executing
3664 this sequentially
3665 consistent
3666 instruction. This
3667 prevents reordering
3668 a seq_cst store
3669 followed by a
Tony Tye6baa6d22017-10-18 22:16:55 +00003670 seq_cst load. (Note
Tony Tyef16a45e2017-06-06 20:31:59 +00003671 that seq_cst is
3672 stronger than
3673 acquire/release as
3674 the reordering of
3675 load acquire
3676 followed by a store
3677 release is
3678 prevented by the
Tony Tye6baa6d22017-10-18 22:16:55 +00003679 waitcnt of
Tony Tyef16a45e2017-06-06 20:31:59 +00003680 the release, but
3681 there is nothing
3682 preventing a store
3683 release followed by
3684 load acquire from
                                                             completing out of
3686 order.)
3687
                                                         2. *Following
                                                            instructions same as
                                                            corresponding load
                                                            atomic acquire,
                                                            except must generate
                                                            all instructions even
                                                            for OpenCL.*
     store atomic seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    store atomic release,
                               - workgroup    - generic  except must generate
                                                         all instructions even
                                                         for OpenCL.*
     store atomic seq_cst      - agent        - global   *Same as corresponding
                               - system       - generic  store atomic release,
                                                         except must generate
                                                         all instructions even
                                                         for OpenCL.*
     atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                               - wavefront    - local    atomicrmw acq_rel,
                               - workgroup    - generic  except must generate
                                                         all instructions even
                                                         for OpenCL.*
     atomicrmw    seq_cst      - agent        - global   *Same as corresponding
                               - system       - generic  atomicrmw acq_rel,
                                                         except must generate
                                                         all instructions even
                                                         for OpenCL.*
     fence        seq_cst      - singlethread *none*     *Same as corresponding
                               - wavefront               fence acq_rel,
                               - workgroup               except must generate
                               - agent                   all instructions even
                               - system                  for OpenCL.*
     ============ ============ ============== ========== ===============================
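As a reading aid, the ``fence release`` rows of the table above can be
summarized in a few lines of Python. This is only an illustrative sketch, not
backend code: the function name ``fence_release_code`` is hypothetical, and the
returned strings use the table's notation rather than exact assembler syntax.

.. code-block:: python

  def fence_release_code(sync_scope):
      """Return the machine code the table lists for ``fence release``."""
      if sync_scope in ("singlethread", "wavefront"):
          return []  # *none*: no instructions are required
      if sync_scope == "workgroup":
          # Wait for preceding local/generic memory operations.
          return ["s_waitcnt lgkmcnt(0)"]
      if sync_scope in ("agent", "system"):
          # Wait for preceding global, local and generic memory operations.
          return ["s_waitcnt lgkmcnt(0) & vmcnt(0)"]
      raise ValueError("unknown sync scope: " + repr(sync_scope))

The sketch does not model the OpenCL omissions because, as the table notes, the
fence carries no address space and the waits are conservatively always
generated.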

The memory order also adds the single thread optimization constraints defined in
table
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.

  .. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
     :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table

     ============ ==============================================================
     LLVM Memory  Optimization Constraints
     Ordering
     ============ ==============================================================
     unordered    *none*
     monotonic    *none*
     acquire      - If a load atomic/atomicrmw then no following load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can
                    be moved before the acquire.
                  - If a fence then same as load atomic, plus no preceding
                    associated fence-paired-atomic can be moved after the fence.
     release      - If a store atomic/atomicrmw then no preceding load/load
                    atomic/store/store atomic/atomicrmw/fence instruction can
                    be moved after the release.
                  - If a fence then same as store atomic, plus no following
                    associated fence-paired-atomic can be moved before the
                    fence.
     acq_rel      Same constraints as both acquire and release.
     seq_cst      - If a load atomic then same constraints as acquire, plus no
                    preceding sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved after the
                    seq_cst.
                  - If a store atomic then the same constraints as release, plus
                    no following sequentially consistent load atomic/store
                    atomic/atomicrmw/fence instruction can be moved before the
                    seq_cst.
                  - If an atomicrmw/fence then same constraints as acq_rel.
     ============ ==============================================================
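The constraints for load atomic, store atomic and atomicrmw instructions can be
expressed as a small predicate. The following Python sketch is illustrative
only: the function ``may_cross`` and its parameters are hypothetical names,
fences and their fence-paired-atomics are deliberately left out, and "the other
instruction" stands for any load, store, atomic or fence on the same thread.

.. code-block:: python

  def may_cross(kind, ordering, other_side, other_is_seq_cst=False):
      """Return True if another memory instruction may be moved across an atomic.

      kind: "load atomic", "store atomic" or "atomicrmw".
      other_side: "following" if the other instruction comes after the atomic
      and would be moved before it, "preceding" for the opposite direction.
      """
      strong = ("acq_rel", "seq_cst")
      acquires = (kind in ("load atomic", "atomicrmw")
                  and ordering in ("acquire",) + strong)
      releases = (kind in ("store atomic", "atomicrmw")
                  and ordering in ("release",) + strong)
      if other_side == "following":
          if acquires:
              return False
          # A seq_cst store additionally pins following seq_cst instructions.
          return not (kind == "store atomic" and ordering == "seq_cst"
                      and other_is_seq_cst)
      if other_side == "preceding":
          if releases:
              return False
          # A seq_cst load additionally pins preceding seq_cst instructions.
          return not (kind == "load atomic" and ordering == "seq_cst"
                      and other_is_seq_cst)
      raise ValueError("other_side must be 'following' or 'preceding'")

For example, ``may_cross("load atomic", "acquire", "following")`` is ``False``,
while a preceding instruction may still be moved after the same acquire load.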

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
runtimes (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that
supports the ``s_trap`` instruction with the following usage:

  .. table:: AMDGPU Trap Handler for AMDHSA OS
     :name: amdgpu-trap-handler-for-amdhsa-os-table

     =================== =============== =============== =======================
     Usage               Code Sequence   Trap Handler    Description
                                         Inputs
     =================== =============== =============== =======================
     reserved            ``s_trap 0x00``                 Reserved by hardware.
     ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
                                           ``queue_ptr`` ``debugtrap``
                                         ``VGPR0``:      intrinsic (not
                                           ``arg``       implemented).
     ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
                                           ``queue_ptr`` terminated and its
                                                         associated queue put
                                                         into the error state.
     ``llvm.debugtrap``  ``s_trap 0x03`` ``SGPR0-1``:    If debugger not
                                           ``queue_ptr`` installed, handled
                                                         same as ``llvm.trap``.
     debugger breakpoint ``s_trap 0x07``                 Reserved for debugger
                                                         breakpoints.
     debugger            ``s_trap 0x08``                 Reserved for debugger.
     debugger            ``s_trap 0xfe``                 Reserved for debugger.
     debugger            ``s_trap 0xff``                 Reserved for debugger.
     =================== =============== =============== =======================
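For tooling that inspects disassembly, the table can be turned into a simple
lookup. The sketch below is a hypothetical helper (``trap_usage`` is not part
of LLVM or any runtime) and only mirrors the trap IDs listed above.

.. code-block:: python

  # Trap IDs and usages exactly as listed in the table above.
  TRAP_USAGE = {
      0x00: "reserved",
      0x01: "debugtrap(arg)",
      0x02: "llvm.trap",
      0x03: "llvm.debugtrap",
      0x07: "debugger breakpoint",
      0x08: "debugger",
      0xFE: "debugger",
      0xFF: "debugger",
  }

  def trap_usage(instruction):
      """Map an ``s_trap <id>`` instruction string to its usage in the table."""
      opcode, trap_id = instruction.split()
      if opcode != "s_trap":
          raise ValueError("not an s_trap instruction: " + instruction)
      # int(..., 0) accepts both hexadecimal (0x02) and decimal trap IDs.
      return TRAP_USAGE.get(int(trap_id, 0), "not listed")

Trap IDs that do not appear in the table are reported as ``"not listed"``
rather than guessed.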

AMDPAL
------

This section provides code conventions used when the target triple OS is
``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
from the application/runtime to each invocation of a hardware shader. These
parameters include both generic, application-controlled parameters called
*user data* as well as system-generated parameters that are a product of the
draw or dispatch execution.

User Data
~~~~~~~~~

Each hardware stage has a set of 32-bit *user data registers* which can be
written from a command buffer and then loaded into SGPRs when waves are launched
via a subsequent dispatch or draw operation. This is the way most arguments are
passed from the application/runtime to a hardware shader.

Compute User Data
~~~~~~~~~~~~~~~~~

Compute shader user data mappings are simpler than those of graphics shaders and
are fixed.

Note that there are always 10 available *user data entries* in registers;
entries beyond that limit must be fetched from memory (via the spill table
pointer) by the shader.

  .. table:: PAL Compute Shader User Data Registers
     :name: pal-compute-user-data-registers

     ============= ================================
     User Register Description
     ============= ================================
     0             Global Internal Table (32-bit pointer)
     1             Per-Shader Internal Table (32-bit pointer)
     2 - 11        Application-Controlled User Data (10 32-bit values)
     12            Spill Table (32-bit pointer)
     13 - 14       Thread Group Count (64-bit pointer)
     15            GDS Range
     ============= ================================
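The fixed compute mapping can be illustrated with a short Python sketch. It
assumes that application entry ``i`` lives in user register ``2 + i``, which
the table suggests but does not state explicitly; the function name and the
returned tuples are hypothetical, and the layout of the spill table itself is
not modeled.

.. code-block:: python

  FIRST_APP_REGISTER = 2     # user registers 2 - 11 hold application data
  REGISTER_ENTRIES = 10      # number of entries available in registers
  SPILL_TABLE_REGISTER = 12  # user register holding the spill table pointer

  def locate_user_data_entry(entry):
      """Say where a compute shader finds application user data entry ``entry``."""
      if entry < 0:
          raise ValueError("entry index must be non-negative")
      if entry < REGISTER_ENTRIES:
          return ("register", FIRST_APP_REGISTER + entry)
      # Entries beyond the register limit are fetched from memory through
      # the spill table pointer.
      return ("spill table", SPILL_TABLE_REGISTER)

Entries 0 through 9 resolve to registers 2 through 11, and any later entry
resolves to the spill table pointer in register 12.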

Graphics User Data
~~~~~~~~~~~~~~~~~~

Graphics pipelines support a much more flexible user data mapping:

  .. table:: PAL Graphics Shader User Data Registers
     :name: pal-graphics-user-data-registers

     ============= ================================
     User Register Description
     ============= ================================
     0             Global Internal Table (32-bit pointer)
     +             Per-Shader Internal Table (32-bit pointer)
     + 1-15        Application Controlled User Data
                   (1-15 Contiguous 32-bit Values in Registers)
     +             Spill Table (32-bit pointer)
     +             Draw Index (First Stage Only)
     +             Vertex Offset (First Stage Only)
     +             Instance Offset (First Stage Only)
     ============= ================================

  The placement of the global internal table remains fixed in the first *user
  data SGPR register*. Otherwise all parameters are optional, and can be mapped
  to any desired *user data SGPR register*, with the following restrictions:

  * Draw Index, Vertex Offset, and Instance Offset can only be used by the first
    active hardware stage in a graphics pipeline (i.e. where the API vertex
    shader runs).

  * Application-controlled user data must be mapped into a contiguous range of
    user data registers.

  * The application-controlled user data range supports compaction remapping, so
    only *entries* that are actually consumed by the shader must be assigned to
    corresponding *registers*. Note that in order to support an efficient runtime
    implementation, the remapping must pack *registers* in the same order as
    *entries*, with unused *entries* removed.
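The compaction rule in the last restriction can be sketched as follows. The
function name ``compact_user_data`` and the choice of the first register are
assumptions made for illustration; the sketch only shows the ordering
requirement, not how a real compiler or runtime records the mapping.

.. code-block:: python

  def compact_user_data(used_entries, first_register):
      """Assign registers to the consumed entries, preserving entry order."""
      # Drop duplicates and keep entries in ascending order so that the
      # registers are packed in the same order as the entries.
      ordered = sorted(set(used_entries))
      return {entry: first_register + slot
              for slot, entry in enumerate(ordered)}

If a shader consumes only entries 7, 0 and 3 and the range starts at user
register 2, entry 0 is placed in register 2, entry 3 in register 3 and entry 7
in register 4.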

.. _pal_global_internal_table:

Global Internal Table
~~~~~~~~~~~~~~~~~~~~~

The global internal table is a table of *shader resource descriptors* (SRDs) that
define how certain engine-wide, runtime-managed resources should be accessed
from a shader. The majority of these resources have HW-defined formats, and it
is up to the compiler to write/read data as required by the target hardware.

The following table illustrates the required format:

  .. table:: PAL Global Internal Table
     :name: pal-git-table

     ============= ================================
     Offset        Description
     ============= ================================
     0-3           Graphics Scratch SRD
     4-7           Compute Scratch SRD
     8-11          ES/GS Ring Output SRD
     12-15         ES/GS Ring Input SRD
     16-19         GS/VS Ring Output #0
     20-23         GS/VS Ring Output #1
     24-27         GS/VS Ring Output #2
     28-31         GS/VS Ring Output #3
     32-35         GS/VS Ring Input SRD
     36-39         Tessellation Factor Buffer SRD
     40-43         Off-Chip LDS Buffer SRD
     44-47         Off-Chip Param Cache Buffer SRD
     48-51         Sample Position Buffer SRD
     52            vaRange::ShadowDescriptorTable High Bits
     ============= ================================

  The pointer to the global internal table passed to the shader as user data
  is a 32-bit pointer. The top 32 bits should be assumed to be the same as
  the top 32 bits of the pipeline, so the shader may use the program
  counter's top 32 bits.
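The pointer reconstruction described above amounts to a simple bit operation.
The Python sketch below illustrates it with hypothetical names
(``global_table_address``, ``table_ptr32``, ``program_counter``); a real shader
would perform the equivalent operation on SGPRs.

.. code-block:: python

  HIGH_MASK = 0xFFFFFFFF00000000
  LOW_MASK = 0x00000000FFFFFFFF

  def global_table_address(table_ptr32, program_counter):
      """Combine the 32-bit user data pointer with the PC's top 32 bits."""
      return (program_counter & HIGH_MASK) | (table_ptr32 & LOW_MASK)

For instance, a user data value of ``0x00001000`` and a program counter of
``0x00007FFF12345678`` give the 64-bit address ``0x00007FFF00001000``.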

Unspecified OS
--------------

This section provides code conventions used when the target triple OS is
empty (see :ref:`amdgpu-target-triples`).

Trap Handler ABI
~~~~~~~~~~~~~~~~

For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` intrinsics are handled as follows:

  .. table:: AMDGPU Trap Handler for Non-AMDHSA OS
     :name: amdgpu-trap-handler-for-non-amdhsa-os-table

     =============== =============== ===========================================
     Usage           Code Sequence   Description
     =============== =============== ===========================================
     llvm.trap       s_endpgm        Causes wavefront to be terminated.
     llvm.debugtrap  *none*          Compiler warning given that there is no
                                     trap handler installed.
     =============== =============== ===========================================

Source Languages
================

.. _amdgpu-opencl:

OpenCL
------

When the language is OpenCL, the following differences occur:

1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
2. The AMDGPU backend appends additional arguments to the kernel's explicit
   arguments for the AMDHSA OS (see
   :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
3. Additional metadata is generated
   (see :ref:`amdgpu-amdhsa-hsa-code-object-metadata`).

  .. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
     :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

     ======== ==== ========= ===========================================
     Position Byte Byte      Description
              Size Alignment
     ======== ==== ========= ===========================================
     1        8    8         OpenCL Global Offset X
     2        8    8         OpenCL Global Offset Y
     3        8    8         OpenCL Global Offset Z
     4        8    8         OpenCL address of printf buffer
     5        8    8         OpenCL address of virtual queue used by
                             enqueue_kernel.
     6        8    8         OpenCL address of AqlWrap struct used by
                             enqueue_kernel.
     ======== ==== ========= ===========================================
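The sizes and alignments in the table determine where each implicit argument
lands relative to the explicit arguments. The Python sketch below assumes the
implicit arguments are laid out contiguously, in table order, starting at the
first suitably aligned offset after the explicit arguments. The argument names
and the function ``implicit_arg_offsets`` are hypothetical and used only for
illustration.

.. code-block:: python

  # (name, byte size, byte alignment) in the order given by the table.
  IMPLICIT_ARGS = [
      ("global_offset_x", 8, 8),
      ("global_offset_y", 8, 8),
      ("global_offset_z", 8, 8),
      ("printf_buffer", 8, 8),
      ("virtual_queue", 8, 8),
      ("aql_wrap", 8, 8),
  ]

  def implicit_arg_offsets(explicit_size):
      """Return the byte offset of each implicit argument."""
      offset = explicit_size
      offsets = {}
      for name, size, align in IMPLICIT_ARGS:
          # Round the running offset up to the argument's alignment.
          offset = (offset + align - 1) // align * align
          offsets[name] = offset
          offset += size
      return offsets

With 12 bytes of explicit arguments, the first implicit argument starts at
byte 16 and the last one at byte 56.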

.. _amdgpu-hcc:

HCC
---

When the language is HCC, the following differences occur:

1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).

Assembler
---------

The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX9.

This section describes general syntax for instructions and operands.

Instructions
~~~~~~~~~~~~

.. toctree::
   :hidden:

   AMDGPUAsmGFX7
   AMDGPUAsmGFX8
   AMDGPUAsmGFX9
   AMDGPUOperandSyntax

An instruction has the following syntax:

    *<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...*

Note that operands are normally comma-separated while modifiers are space-separated.

The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
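The general shape of an instruction can be illustrated with a small splitter
written in Python. This is a simplified sketch rather than the assembler's
parser: the function ``split_instruction`` is hypothetical, and it does not
handle operands that themselves contain commas or spaces, such as register
lists.

.. code-block:: python

  def split_instruction(text):
      """Split an instruction into (opcode, operands, modifiers)."""
      opcode, _, rest = text.strip().partition(" ")
      chunks = [chunk.strip() for chunk in rest.split(",")] if rest.strip() else []
      operands = chunks[:-1]
      modifiers = []
      if chunks:
          # The last comma-separated chunk holds the final operand followed
          # by any space-separated modifiers.
          tail = chunks[-1].split()
          if tail:
              operands.append(tail[0])
              modifiers = tail[1:]
      return opcode, operands, modifiers

Applied to ``ds_add_u32 v2, v4 offset:16``, the sketch yields the opcode
``ds_add_u32``, the operands ``v2`` and ``v4``, and the modifier ``offset:16``.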

See the detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.

Note that features under development are not included in this description.

For more information about instructions, their semantics and supported combinations of
operands, refer to one of the instruction set architecture manuals
[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.

Operands
~~~~~~~~

The following syntax for register operands is supported:

* SGPR registers: ``s0``, ... or ``s[0]``, ...
* VGPR registers: ``v0``, ... or ``v[0]``, ...
* TTMP registers: ``ttmp0``, ... or ``ttmp[0]``, ...
* Special registers: ``exec`` (``exec_lo``, ``exec_hi``), ``vcc`` (``vcc_lo``, ``vcc_hi``), ``flat_scratch`` (``flat_scratch_lo``, ``flat_scratch_hi``)
* Special trap registers: ``tba`` (``tba_lo``, ``tba_hi``), ``tma`` (``tma_lo``, ``tma_hi``)
* Register pairs, quads, etc: ``s[2:3]``, ``v[10:11]``, ``ttmp[5:6]``, ``s[4:7]``, ``v[12:15]``, ``ttmp[4:7]``, ``s[8:15]``, ...
* Register lists: ``[s0, s1]``, ``[ttmp0, ttmp1, ttmp2, ttmp3]``
* Register index expressions: ``v[2*2]``, ``s[1-1:2-1]``
* ``off`` indicates that an operand is not enabled
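
As an illustration of the range notation, the following Python sketch decodes
the plain and bracketed register forms into a register kind, a first index and
a register count. The name ``parse_register`` is hypothetical, and the sketch
deliberately covers only literal indices of ``s``, ``v`` and ``ttmp``
registers; special registers, register lists and index expressions such as
``v[2*2]`` are rejected.

.. code-block:: python

  import re

  # Matches s0 / v12 / ttmp3 as well as s[0] / v[10:11] / ttmp[4:7].
  _REGISTER = re.compile(r"^(s|v|ttmp)(?:(\d+)|\[(\d+)(?::(\d+))?\])$")


  def parse_register(operand):
      """Return (kind, first_index, count) for a simple register operand."""
      match = _REGISTER.match(operand)
      if match is None:
          raise ValueError("unsupported register operand: " + operand)
      kind, plain, low, high = match.groups()
      first = int(plain if plain is not None else low)
      # A range lo:hi is inclusive, so s[4:7] names four registers.
      last = int(high) if high is not None else first
      if last < first:
          raise ValueError("empty register range: " + operand)
      return kind, first, last - first + 1

Under these assumptions, ``parse_register("s[4:7]")`` describes a quad of SGPRs
starting at ``s4``, while ``parse_register("v0")`` describes a single VGPR.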

Modifiers
~~~~~~~~~

A detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.

Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For a full list of supported instructions, refer to "LDS/GDS instructions" in the ISA manual.

FLAT
++++

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For a full list of supported instructions, refer to "FLAT instructions" in the ISA manual.

MUBUF
+++++

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For a full list of supported instructions, refer to "MUBUF Instructions" in the ISA manual.

SMRD/SMEM
+++++++++

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations" in the ISA manual.

SOP1
++++

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in the ISA manual.

SOP2
++++

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in the ISA manual.

SOPC
++++

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in the ISA manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0                                 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1)                          ; Wait until the vmcnt counter is at most 1
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in the ISA manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically uses the optimal encoding based on the operands.
To force a specific encoding, add a suffix to the opcode of the instruction:

* ``_e32`` for 32-bit VOP1/VOP2/VOPC
* ``_e64`` for 64-bit VOP3
* ``_dpp`` for VOP_DPP
* ``_sdwa`` for VOP_SDWA
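
The suffix convention can be summarised as a lookup. The Python sketch below
only restates the list above; the function name ``forced_encoding`` and the
returned strings are illustrative and are not part of the assembler.

.. code-block:: python

  # Suffix -> encoding, exactly as listed above.
  _ENCODING_SUFFIXES = {
      "_e32": "VOP1/VOP2/VOPC (32-bit)",
      "_e64": "VOP3 (64-bit)",
      "_dpp": "VOP_DPP",
      "_sdwa": "VOP_SDWA",
  }


  def forced_encoding(opcode):
      """Return the encoding forced by the opcode suffix, or None."""
      for suffix, encoding in _ENCODING_SUFFIXES.items():
          if opcode.endswith(suffix):
              return encoding
      # No suffix: the assembler is free to pick the encoding itself.
      return None

For instance, ``forced_encoding("v_mul_i32_i24_e64")`` reports the VOP3
encoding, while ``forced_encoding("v_mov_b32")`` returns ``None`` because the
opcode carries no suffix.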

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For a full list of supported instructions, refer to "Vector ALU instructions" in the ISA manual.

HSA Code Object Directives
~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

*major* and *minor* are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

*major*, *minor*, and *stepping* are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

*vendor* and *arch* are quoted strings. *vendor* should always be equal to
"AMD" and *arch* should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, *vendor*, and *arch*
from the value of the ``-mcpu`` option that is passed to the assembler.
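
To make the relationship between a processor name and the ISA version triple
concrete, here is a hedged Python sketch. It assumes the ``-mcpu`` value uses
the three-digit ``gfx<major><minor><stepping>`` naming (for example
``gfx803``); it does not handle other processor names, and it is not the
mechanism the assembler itself uses. The function name
``isa_version_from_mcpu`` is hypothetical.

.. code-block:: python

  import re


  def isa_version_from_mcpu(mcpu):
      """Return (major, minor, stepping) for names such as 'gfx803'.

      Returns None for any name that does not follow the assumed
      gfx<major><minor><stepping> pattern.
      """
      match = re.fullmatch(r"gfx(\d)(\d)(\d)", mcpu)
      if match is None:
          return None
      return tuple(int(digit) for digit in match.groups())

With this assumption, ``isa_version_from_mcpu("gfx803")`` yields major 8,
minor 0 and stepping 3, which are the values one would otherwise write
explicitly in ``.hsa_code_object_isa``.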

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the *.end_amd_kernel_code_t* directive. For
any amd_kernel_code_t values that are unspecified, a default value will be
used. The default value for all keys is 0, with the following exceptions:

- *kernel_code_version_major* defaults to 1.
- *machine_kind* defaults to 1.
- *machine_version_major*, *machine_version_minor*, and
  *machine_version_stepping* are derived from the value of the ``-mcpu`` option
  that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults to 6.
- *kernarg_segment_alignment*, *group_segment_alignment*, and
  *private_segment_alignment* default to 4. Note that alignments are specified
  as a power of two, so a value of **n** means an alignment of 2^ **n**.
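
The power-of-two encoding of the alignment keys can be checked with a one-line
conversion. The Python sketch below assumes the resulting alignment is counted
in bytes, which the text above does not state explicitly; the function name
``alignment_from_log2`` is hypothetical.

.. code-block:: python

  def alignment_from_log2(n):
      """Convert an alignment key value n into the alignment 2**n."""
      if n < 0:
          raise ValueError("alignment exponent must be non-negative")
      return 1 << n

The default value of 4 for the three segment alignment keys therefore
corresponds to an alignment of 16.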

The *.amd_kernel_code_t* directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in ``lib/Target/AMDGPU/AmdKernelCodeT.h`` and
``test/CodeGen/AMDGPU/hsa.s``.

Here is an example of a minimal amd_kernel_code_t specification:

.. code-block:: none

   .hsa_code_object_version 1,0
   .hsa_code_object_isa

   .hsatext
   .globl hello_world
   .p2align 8
   .amdgpu_hsa_kernel hello_world

   hello_world:

      .amd_kernel_code_t
         enable_sgpr_kernarg_segment_ptr = 1
         is_ptr64 = 1
         compute_pgm_rsrc1_vgprs = 0
         compute_pgm_rsrc1_sgprs = 0
         compute_pgm_rsrc2_user_sgpr = 2
         kernarg_segment_byte_size = 8
         wavefront_sgpr_count = 2
         workitem_vgpr_count = 3
      .end_amd_kernel_code_t

      ; Load the 8-byte kernel argument (a pointer) from the kernarg segment.
      s_load_dwordx2 s[0:1], s[0:1], 0x0
      v_mov_b32 v0, 3.14159
      ; Wait for the scalar load to complete before using s[0:1].
      s_waitcnt lgkmcnt(0)
      v_mov_b32 v1, s0
      v_mov_b32 v2, s1
      ; Store the constant to the address held in v[1:2].
      flat_store_dword v[1:2], v0
      s_endpgm
   .Lfunc_end0:
      .size hello_world, .Lfunc_end0-hello_world
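
When such blocks are produced by a script rather than written by hand, the
key / value list can be emitted mechanically. The following Python sketch is
only an illustration: ``format_kernel_code_t`` is a hypothetical helper, it
performs no validation of key names or values, and the indentation it emits is
purely cosmetic.

.. code-block:: python

  def format_kernel_code_t(values, indent="   "):
      """Render a dict of amd_kernel_code_t keys as assembler text."""
      lines = [indent + ".amd_kernel_code_t"]
      for key, value in values.items():
          # One "key = value" pair per line, nested one level deeper.
          lines.append(f"{indent}{indent}{key} = {value}")
      lines.append(indent + ".end_amd_kernel_code_t")
      return "\n".join(lines)


  print(format_kernel_code_t({
      "enable_sgpr_kernarg_segment_ptr": 1,
      "is_ptr64": 1,
      "kernarg_segment_byte_size": 8,
  }))

Keys that are left out of the dictionary are simply not emitted, so the
assembler applies the defaults listed earlier in this section.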

Additional Documentation
========================

.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__