Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: d0671b0a74e0c0998dcaea2eb0e4855fe7ee7180 [file] [log] [blame]

Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1	=============================
				2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	26	.. table:: AMDGPU Architectures
				27	:name: amdgpu-architecture-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	28
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	29	============ ==============================================================
				30	Architecture Description
				31	============ ==============================================================
				32	``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
				33	``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
				34	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	35
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	36	.. table:: AMDGPU Vendors
				37	:name: amdgpu-vendor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	38
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	39	============ ==============================================================
				40	Vendor Description
				41	============ ==============================================================
				42	``amd`` Can be used for all AMD GPU usage.
				43	``mesa3d`` Can be used if the OS is ``mesa3d``.
				44	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	45
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	46	.. table:: AMDGPU Operating Systems
				47	:name: amdgpu-os-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	48
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	49	============== ============================================================
				50	OS Description
				51	============== ============================================================
				52	<empty> Defaults to the unknown OS.
				53	``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
				54	such as AMD's ROCm [AMD-ROCm]_.
				55	``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
				56	runtime.
				57	``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
				58	runtime.
				59	============== ============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	60
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	61	.. table:: AMDGPU Environments
				62	:name: amdgpu-environment-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	63
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	64	============ ==============================================================
				65	Environment Description
				66	============ ==============================================================
				67	<empty> Defaults to ``opencl``.
				68	``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	69	``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
				70	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	71
				72	.. _amdgpu-processors:
				73
				74	Processors
				75	----------
				76
				77	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				78	names from both the Processor and Alternative Processor can be used.
				79
				80	.. table:: AMDGPU Processors
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	81	:name: amdgpu-processor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	82
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	83	=========== =============== ============ ===== ========= ======= ==================
				84	Processor Alternative Target dGPU/ Target ROCm Example
				85	Processor Triple APU Features Support Products
				86	Architecture Supported
				87	[Default]
				88	=========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	89	Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	90	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	91	``r600`` ``r600`` dGPU
				92	``r630`` ``r600`` dGPU
				93	``rs880`` ``r600`` dGPU
				94	``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	95	Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	96	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	97	``rv710`` ``r600`` dGPU
				98	``rv730`` ``r600`` dGPU
				99	``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	100	Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	101	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	102	``cedar`` ``r600`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	103	``cypress`` ``r600`` dGPU
				104	``juniper`` ``r600`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	105	``redwood`` ``r600`` dGPU
				106	``sumo`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	107	Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	108	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	109	``barts`` ``r600`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	110	``caicos`` ``r600`` dGPU
				111	``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	112	``turks`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	113	GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	114	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	115	``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	116	``gfx601`` - ``hainan`` ``amdgcn`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	117	- ``oland``
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	118	- ``pitcairn``
				119	- ``verde``
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	120	GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	121	-----------------------------------------------------------------------------------
				122	``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
				123	- A6 Pro-7050B
				124	- A8-7100
				125	- A8 Pro-7150B
				126	- A10-7300
				127	- A10 Pro-7350B
				128	- FX-7500
				129	- A8-7200P
				130	- A10-7400P
				131	- FX-7600P
				132	``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
				133	- FirePro W9100
				134	- FirePro S9150
				135	- FirePro S9170
				136	``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
				137	- Radeon R9 290x
				138	- Radeon R390
				139	- Radeon R390x
				140	``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
				141	- ``mullins`` - E1-2200
				142	- E1-2500
				143	- E2-3000
				144	- E2-3800
				145	- A4-5000
				146	- A4-5100
				147	- A6-5200
				148	- A4 Pro-3340B
				149	``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
				150	- Radeon HD 8770
				151	- R7 260
				152	- R7 260X
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	153	GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	154	-----------------------------------------------------------------------------------
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	155	``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
				156	[on] - Pro A6-8500B
				157	- A8-8600P
				158	- Pro A8-8600B
				159	- FX-8800P
				160	- Pro A12-8800B
				161	\ ``amdgcn`` APU - xnack ROCm - A10-8700P
				162	[on] - Pro A10-8700B
				163	- A10-8780P
				164	\ ``amdgcn`` APU - xnack - A10-9600P
				165	[on] - A10-9630P
				166	- A12-9700P
				167	- A12-9730P
				168	- FX-9800P
				169	- FX-9830P
				170	\ ``amdgcn`` APU - xnack - E2-9010
				171	[on] - A6-9210
				172	- A9-9410
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	173	``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
				174	- ``tonga`` [off] - FirePro S7100
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	175	- FirePro W7100
				176	- Radeon R285
				177	- Radeon R9 380
				178	- Radeon R9 385
				179	- Mobile FirePro
				180	M7170
				181	``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
				182	[off] - Radeon R9 Fury
				183	- Radeon R9 FuryX
				184	- Radeon Pro Duo
				185	- FirePro S9300x2
				186	- Radeon Instinct MI8
				187	\ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
				188	[off] - Radeon RX 480
				189	- Radeon Instinct MI6
				190	\ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
				191	[off]
				192	``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
				193	[on]
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	194	GCN GFX9 [AMD-GCN-GFX9]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	195	-----------------------------------------------------------------------------------
				196	``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
				197	[off] Frontier Edition
				198	- Radeon RX Vega 56
				199	- Radeon RX Vega 64
				200	- Radeon RX Vega 64
				201	Liquid
				202	- Radeon Instinct MI25
				203	``gfx902`` ``amdgcn`` APU - xnack TBA
				204	[on]
				205	.. TODO
				206	Add product
				207	names.
				208	=========== =============== ============ ===== ========= ======= ==================
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	209
				210	.. _amdgpu-target-features:
				211
				212	Target Features
				213	---------------
				214
				215	Target features control how code is generated to support certain
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	216	processor specific features. Not all target features are supported by
				217	all processors. The runtime must ensure that the features supported by
				218	the device used to execute the code match the features enabled when
				219	generating the code. A mismatch of features may result in incorrect
				220	execution, or a reduction in performance.
				221
				222	The target features supported by each processor, and the default value
				223	used if not specified explicitly, is listed in
				224	:ref:`amdgpu-processor-table`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	225
				226	Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
				227	target features.
				228
				229	For example:
				230
				231	``-mxnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	232	Enable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	233	``-mno-xnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	234	Disable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	235
				236	.. table:: AMDGPU Target Features
				237	:name: amdgpu-target-feature-table
				238
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	239	============== ==================================================
				240	Target Feature Description
				241	============== ==================================================
				242	-m[no-]xnack Enable/disable generating code that has
				243	memory clauses that are compatible with
				244	having XNACK replay enabled.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	245
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	246	This is used for demand paging and page
				247	migration. If XNACK replay is enabled in
				248	the device, then if a page fault occurs
				249	the code may execute incorrectly if the
				250	``xnack`` feature is not enabled. Executing
				251	code that has the feature enabled on a
				252	device that does not have XNACK replay
				253	enabled will execute correctly, but may
				254	be less performant than code with the
				255	feature disabled.
				256	============== ==================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	257
				258	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	259
				260	Address Spaces
				261	--------------
				262
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	263	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	264
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	265	The memory space names used in the table, aside from the region memory space, is
				266	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	267
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	268	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	269
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	270	.. table:: Address Space Mapping
				271	:name: amdgpu-address-space-mapping-table
				272
Yaxun Liu	0124b54	2018-02-13 18:00:25 +0000	[diff] [blame]	273	================== =================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	274	LLVM Address Space Memory Space
Yaxun Liu	0124b54	2018-02-13 18:00:25 +0000	[diff] [blame]	275	================== =================
				276	0 Generic (Flat)
				277	1 Global
				278	2 Region (GDS)
				279	3 Local (group/LDS)
				280	4 Constant
				281	5 Private (Scratch)
				282	6 Constant 32-bit
				283	================== =================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	284
				285	.. _amdgpu-memory-scopes:
				286
				287	Memory Scopes
				288	-------------
				289
				290	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				291	backend memory model when the target triple OS is ``amdhsa`` (see
				292	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				293
				294	The memory model supported is based on the HSA memory model [HSA]_ which is
				295	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				296	relation is transitive over the synchonizes-with relation independent of scope,
				297	and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	298	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	299
				300	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				301	inclusion and requires the memory scopes to exactly match. However, this
				302	is conservatively correct for OpenCL.
				303
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	304	.. table:: AMDHSA LLVM Sync Scopes
				305	:name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	306
				307	================ ==========================================================
				308	LLVM Sync Scope Description
				309	================ ==========================================================
				310	none The default: ``system``.
				311
				312	Synchronizes with, and participates in modification and
				313	seq_cst total orderings with, other operations (except
				314	image operations) for all address spaces (except private,
				315	or generic that accesses private) provided the other
				316	operation's sync scope is:
				317
				318	- ``system``.
				319	- ``agent`` and executed by a thread on the same agent.
				320	- ``workgroup`` and executed by a thread in the same
				321	workgroup.
				322	- ``wavefront`` and executed by a thread in the same
				323	wavefront.
				324
				325	``agent`` Synchronizes with, and participates in modification and
				326	seq_cst total orderings with, other operations (except
				327	image operations) for all address spaces (except private,
				328	or generic that accesses private) provided the other
				329	operation's sync scope is:
				330
				331	- ``system`` or ``agent`` and executed by a thread on the
				332	same agent.
				333	- ``workgroup`` and executed by a thread in the same
				334	workgroup.
				335	- ``wavefront`` and executed by a thread in the same
				336	wavefront.
				337
				338	``workgroup`` Synchronizes with, and participates in modification and
				339	seq_cst total orderings with, other operations (except
				340	image operations) for all address spaces (except private,
				341	or generic that accesses private) provided the other
				342	operation's sync scope is:
				343
				344	- ``system``, ``agent`` or ``workgroup`` and executed by a
				345	thread in the same workgroup.
				346	- ``wavefront`` and executed by a thread in the same
				347	wavefront.
				348
				349	``wavefront`` Synchronizes with, and participates in modification and
				350	seq_cst total orderings with, other operations (except
				351	image operations) for all address spaces (except private,
				352	or generic that accesses private) provided the other
				353	operation's sync scope is:
				354
				355	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				356	and executed by a thread in the same wavefront.
				357
				358	``singlethread`` Only synchronizes with, and participates in modification
				359	and seq_cst total orderings with, other operations (except
				360	image operations) running in the same thread for all
				361	address spaces (for example, in signal handlers).
				362	================ ==========================================================
				363
				364	AMDGPU Intrinsics
				365	-----------------
				366
				367	The AMDGPU backend implements the following intrinsics.
				368
				369	This section is WIP.
				370
				371	.. TODO
				372	List AMDGPU intrinsics
				373
				374	Code Object
				375	===========
				376
				377	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				378	can be linked by ``lld`` to produce a standard ELF shared code object which can
				379	be loaded and executed on an AMDGPU target.
				380
				381	Header
				382	------
				383
				384	The AMDGPU backend uses the following ELF header:
				385
				386	.. table:: AMDGPU ELF Header
				387	:name: amdgpu-elf-header-table
				388
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	389	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	390	Field Value
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	391	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	392	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				393	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	394	``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
				395	- ``ELFOSABI_AMDGPU_HSA``
				396	- ``ELFOSABI_AMDGPU_PAL``
				397	- ``ELFOSABI_AMDGPU_MESA3D``
				398	``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
				399	- ``ELFABIVERSION_AMDGPU_PAL``
				400	- ``ELFABIVERSION_AMDGPU_MESA3D``
				401	``e_type`` - ``ET_REL``
				402	- ``ET_DYN``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	403	``e_machine`` ``EM_AMDGPU``
				404	``e_entry`` 0
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	405	``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	406	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	407
				408	..
				409
				410	.. table:: AMDGPU ELF Header Enumeration Values
				411	:name: amdgpu-elf-header-enumeration-values-table
				412
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	413	=============================== =====
				414	Name Value
				415	=============================== =====
				416	``EM_AMDGPU`` 224
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	417	``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	418	``ELFOSABI_AMDGPU_HSA`` 64
				419	``ELFOSABI_AMDGPU_PAL`` 65
				420	``ELFOSABI_AMDGPU_MESA3D`` 66
				421	``ELFABIVERSION_AMDGPU_HSA`` 1
				422	``ELFABIVERSION_AMDGPU_PAL`` 0
				423	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				424	=============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	425
				426	``e_ident[EI_CLASS]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	427	The ELF class is:
				428
				429	* ``ELFCLASS32`` for ``r600`` architecture.
				430
				431	* ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
				432	bit applications.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	433
				434	``e_ident[EI_DATA]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	435	All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	436
				437	``e_ident[EI_OSABI]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	438	One of the following AMD GPU architecture specific OS ABIs
				439	(see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	440
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	441	* ``ELFOSABI_NONE`` for unknown OS.
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	442
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	443	* ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	444
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	445	* ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
				446
				447	* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	448
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	449	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	450	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				451	object conforms:
				452
				453	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				454	runtime ABI.
				455
				456	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				457	runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	458
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	459	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	460	3D runtime ABI.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	461
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	462	``e_type``
				463	Can be one of the following values:
				464
				465
				466	``ET_REL``
				467	The type produced by the AMD GPU backend compiler as it is relocatable code
				468	object.
				469
				470	``ET_DYN``
				471	The type produced by the linker as it is a shared code object.
				472
				473	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				474
				475	``e_machine``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	476	The value ``EM_AMDGPU`` is used for the machine for all processors supported
				477	by the ``r600`` and ``amdgcn`` architectures (see
				478	:ref:`amdgpu-processor-table`). The specific processor is specified in the
				479	``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
				480	:ref:`amdgpu-elf-header-e_flags-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	481
				482	``e_entry``
				483	The entry point is 0 as the entry points for individual kernels must be
				484	selected in order to invoke them through AQL packets.
				485
				486	``e_flags``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	487	The AMDGPU backend uses the following ELF header flags:
				488
				489	.. table:: AMDGPU ELF Header ``e_flags``
				490	:name: amdgpu-elf-header-e_flags-table
				491
				492	================================= ========== =============================
				493	Name Value Description
				494	================================= ========== =============================
				495	AMDGPU Processor Flag See :ref:`amdgpu-processor-table`.
				496	-------------------------------------------- -----------------------------
				497	``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
				498	mask for
				499	``EF_AMDGPU_MACH_xxx`` values
				500	defined in
				501	:ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	502	``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
				503	target feature is
				504	enabled for all code
				505	contained in the code object.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	506	If the processor
				507	does not support the
				508	``xnack`` target
				509	feature then must
				510	be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	511	See
				512	:ref:`amdgpu-target-features`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	513	================================= ========== =============================
				514
				515	.. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
				516	:name: amdgpu-ef-amdgpu-mach-table
				517
				518	================================= ========== =============================
				519	Name Value Description (see
				520	:ref:`amdgpu-processor-table`)
				521	================================= ========== =============================
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	522	``EF_AMDGPU_MACH_NONE`` 0x000 not specified
				523	``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
				524	``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
				525	``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
				526	``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
				527	``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
				528	``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
				529	``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
				530	``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
				531	``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
				532	``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
				533	``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
				534	``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
				535	``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
				536	``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
				537	``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
				538	``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
				539	reserved 0x011 - Reserved for ``r600``
				540	0x01f architecture processors.
				541	``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
				542	``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
				543	``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
				544	``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
				545	``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
				546	``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
				547	``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
				548	reserved 0x027 Reserved.
				549	``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
				550	``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
				551	``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
				552	``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
				553	``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
				554	``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
				555	reserved 0x02e Reserved.
				556	reserved 0x02f Reserved.
				557	reserved 0x030 Reserved.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	558	================================= ========== =============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	559
				560	Sections
				561	--------
				562
				563	An AMDGPU target ELF code object has the standard ELF sections which include:
				564
				565	.. table:: AMDGPU ELF Sections
				566	:name: amdgpu-elf-sections-table
				567
				568	================== ================ =================================
				569	Name Type Attributes
				570	================== ================ =================================
				571	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				572	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				573	``.debug_``\ \* ``SHT_PROGBITS`` none
				574	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				575	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				576	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				577	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				578	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				579	``.note`` ``SHT_NOTE`` none
				580	``.rela``\ name ``SHT_RELA`` none
				581	``.rela.dyn`` ``SHT_RELA`` none
				582	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				583	``.shstrtab`` ``SHT_STRTAB`` none
				584	``.strtab`` ``SHT_STRTAB`` none
				585	``.symtab`` ``SHT_SYMTAB`` none
				586	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				587	================== ================ =================================
				588
				589	These sections have their standard meanings (see [ELF]_) and are only generated
				590	if needed.
				591
				592	``.debug``\ \*
				593	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				594	DWARF produced by the AMDGPU backend.
				595
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	596	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	597	The standard sections used by a dynamic loader.
				598
				599	``.note``
				600	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				601	backend.
				602
				603	``.rela``\ name, ``.rela.dyn``
				604	For relocatable code objects, name is the name of the section that the
				605	relocation records apply. For example, ``.rela.text`` is the section name for
				606	relocation records associated with the ``.text`` section.
				607
				608	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				609	records from each of the relocatable code object's ``.rela``\ name sections.
				610
				611	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				612	the AMDGPU backend.
				613
				614	``.text``
				615	The executable machine code for the kernels and functions they call. Generated
				616	as position independent code. See :ref:`amdgpu-code-conventions` for
				617	information on conventions used in the isa generation.
				618
				619	.. _amdgpu-note-records:
				620
				621	Note Records
				622	------------
				623
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	624	As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
				625	be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
				626	aligned. In addition, minimal zero byte padding must be generated to ensure the
				627	``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
				628	``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	629
				630	The AMDGPU backend code object uses the following ELF note records in the
				631	``.note`` section. The Description column specifies the layout of the note
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	632	record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	633	variable size strings have a corresponding ``*_size`` field that specifies the
				634	number of bytes, including the terminating null character, in the string. The
				635	string(s) come immediately after the preceding fields.
				636
				637	Additional note records can be present.
				638
				639	.. table:: AMDGPU ELF Note Records
				640	:name: amdgpu-elf-note-records-table
				641
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	642	===== ============================== ======================================
				643	Name Type Description
				644	===== ============================== ======================================
				645	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	646	===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	647
				648	..
				649
				650	.. table:: AMDGPU ELF Note Record Enumeration Values
				651	:name: amdgpu-elf-note-record-enumeration-values-table
				652
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	653	============================== =====
				654	Name Value
				655	============================== =====
				656	reserved 0-9
				657	``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	658	reserved 11
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	659	============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	660
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	661	``NT_AMD_AMDGPU_HSA_METADATA``
				662	Specifies extensible metadata associated with the code objects executed on HSA
				663	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				664	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				665	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				666	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	667
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	668	.. _amdgpu-symbols:
				669
				670	Symbols
				671	-------
				672
				673	Symbols include the following:
				674
				675	.. table:: AMDGPU ELF Symbols
				676	:name: amdgpu-elf-symbols-table
				677
				678	===================== ============== ============= ==================
				679	Name Type Section Description
				680	===================== ============== ============= ==================
				681	link-name ``STT_OBJECT`` - ``.data`` Global variable
				682	- ``.rodata``
				683	- ``.bss``
				684	link-name\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
				685	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				686	===================== ============== ============= ==================
				687
				688	Global variable
				689	Global variables both used and defined by the compilation unit.
				690
				691	If the symbol is defined in the compilation unit then it is allocated in the
				692	appropriate section according to if it has initialized data or is readonly.
				693
				694	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				695	will resolve relocations using the definition provided by another code object
				696	or explicitly defined by the runtime.
				697
				698	All global symbols, whether defined in the compilation unit or external, are
				699	accessed by the machine code indirectly through a GOT table entry. This
				700	allows them to be preemptable. The GOT table is only supported when the target
				701	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	702
				703	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	704	Add description of linked shared object symbols. Seems undefined symbols
				705	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	706
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	707	Kernel descriptor
				708	Every HSA kernel has an associated kernel descriptor. It is the address of the
				709	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				710	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				711	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				712
				713	Kernel entry point
				714	Every HSA kernel also has a symbol for its machine code entry point.
				715
				716	.. _amdgpu-relocation-records:
				717
				718	Relocation Records
				719	------------------
				720
				721	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				722	relocatable fields are:
				723
				724	``word32``
				725	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				726	alignment. These values use the same byte order as other word values in the
				727	AMD GPU architecture.
				728
				729	``word64``
				730	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				731	alignment. These values use the same byte order as other word values in the
				732	AMD GPU architecture.
				733
				734	Following notations are used for specifying relocation calculations:
				735
				736	A
				737	Represents the addend used to compute the value of the relocatable field.
				738
				739	G
				740	Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	741	entry's symbol will reside during execution.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	742
				743	GOT
				744	Represents the address of the global offset table.
				745
				746	P
				747	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				748	of the storage unit being relocated (computed using ``r_offset``).
				749
				750	S
				751	Represents the value of the symbol whose index resides in the relocation
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	752	entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				753
				754	B
				755	Represents the base address of a loaded executable or shared object which is
				756	the difference between the ELF address and the actual load address. Relocations
				757	using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	758
				759	The following relocation types are supported:
				760
				761	.. table:: AMDGPU ELF Relocation Records
				762	:name: amdgpu-elf-relocation-records-table
				763
Tony Tye	db6c993	2018-01-30 23:59:43 +0000	[diff] [blame]	764	========================== ======= ===== ========== ==============================
				765	Relocation Type Kind Value Field Calculation
				766	========================== ======= ===== ========== ==============================
				767	``R_AMDGPU_NONE`` 0 none none
				768	``R_AMDGPU_ABS32_LO`` Dynamic 1 ``word32`` (S + A) & 0xFFFFFFFF
				769	``R_AMDGPU_ABS32_HI`` Dynamic 2 ``word32`` (S + A) >> 32
				770	``R_AMDGPU_ABS64`` Dynamic 3 ``word64`` S + A
				771	``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
				772	``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
				773	``R_AMDGPU_ABS32`` Static 6 ``word32`` S + A
				774	``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
				775	``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				776	``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
				777	``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				778	``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
				779	reserved 12
				780	``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
				781	========================== ======= ===== ========== ==============================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	782
				783	.. _amdgpu-dwarf:
				784
				785	DWARF
				786	-----
				787
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	788	Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	789	information that maps the code object executable code and data to the source
				790	language constructs. It can be used by tools such as debuggers and profilers.
				791
				792	Address Space Mapping
				793	~~~~~~~~~~~~~~~~~~~~~
				794
				795	The following address space mapping is used:
				796
				797	.. table:: AMDGPU DWARF Address Space Mapping
				798	:name: amdgpu-dwarf-address-space-mapping-table
				799
				800	=================== =================
				801	DWARF Address Space Memory Space
				802	=================== =================
				803	1 Private (Scratch)
				804	2 Local (group/LDS)
				805	omitted Global
				806	omitted Constant
				807	omitted Generic (Flat)
				808	not supported Region (GDS)
				809	=================== =================
				810
				811	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				812	used in the table.
				813
				814	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				815	DWARF address space of the value of the pointer when it is in the private or
				816	local address space. Otherwise the attribute is omitted.
				817
				818	An ``XDEREF`` operation is generated in location list expressions for variables
				819	that are allocated in the private and local address space. Otherwise no
				820	``XDREF`` is omitted.
				821
				822	Register Mapping
				823	~~~~~~~~~~~~~~~~
				824
				825	This section is WIP.
				826
				827	.. TODO
				828	Define DWARF register enumeration.
				829
				830	If want to present a wavefront state then should expose vector registers as
				831	64 wide (rather than per work-item view that LLVM uses). Either as separate
				832	registers, or a 64x4 byte single register. In either case use a new LANE op
				833	(akin to XDREF) to select the current lane usage in a location
				834	expression. This would also allow scalar register spilling to vector register
				835	lanes to be expressed (currently no debug information is being generated for
				836	spilling). If choose a wide single register approach then use LANE in
				837	conjunction with PIECE operation to select the dword part of the register for
				838	the current lane. If the separate register approach then use LANE to select
				839	the register.
				840
				841	Source Text
				842	~~~~~~~~~~~
				843
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	844	Source text for online-compiled programs (e.g. those compiled by the OpenCL
				845	runtime) may be embedded into the DWARF v5 line table using the ``clang
				846	-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	847
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	848	For example:
				849
				850	``-gembed-source``
				851	Enable the embedded source DWARF v5 extension.
				852	``-gno-embed-source``
				853	Disable the embedded source DWARF v5 extension.
				854
				855	.. table:: AMDGPU Debug Options
				856	:name: amdgpu-debug-options
				857
				858	==================== ==================================================
				859	Debug Flag Description
				860	==================== ==================================================
				861	-g[no-]embed-source Enable/disable embedding source text in DWARF
				862	debug sections. Useful for environments where
				863	source cannot be written to disk, such as
				864	when performing online compilation.
				865	==================== ==================================================
				866
				867	This option enables one extended content types in the DWARF v5 Line Number
				868	Program Header, which is used to encode embedded source.
				869
				870	.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
				871	:name: amdgpu-dwarf-extended-content-types
				872
				873	============================ ======================
				874	Content Type Form
				875	============================ ======================
				876	``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
				877	============================ ======================
				878
				879	The source field will contain the UTF-8 encoded, null-terminated source text
				880	with ``'\n'`` line endings. When the source field is present, consumers can use
				881	the embedded source instead of attempting to discover the source on disk. When
				882	the source field is absent, consumers can access the file to get the source
				883	text.
				884
				885	The above content type appears in the ``file_name_entry_format`` field of the
				886	line table prologue, and its corresponding value appear in the ``file_names``
				887	field. The current encoding of the content type is documented in table
				888	:ref:`amdgpu-dwarf-extended-content-types-encoding`
				889
				890	.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
				891	:name: amdgpu-dwarf-extended-content-types-encoding
				892
				893	============================ ====================
				894	Content Type Value
				895	============================ ====================
				896	``DW_LNCT_LLVM_source`` 0x2001
				897	============================ ====================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	898
				899	.. _amdgpu-code-conventions:
				900
				901	Code Conventions
				902	================
				903
				904	This section provides code conventions used for each supported target triple OS
				905	(see :ref:`amdgpu-target-triples`).
				906
				907	AMDHSA
				908	------
				909
				910	This section provides code conventions used when the target triple OS is
				911	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				912
				913	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	914
				915	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	916	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	917
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	918	The code object metadata specifies extensible metadata associated with the code
				919	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
				920	[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
				921	(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
				922	``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
				923	information necessary to support the ROCM kernel queries. For example, the
				924	segment sizes needed in a dispatch packet. In addition, a high level language
				925	runtime may require other information to be included. For example, the AMD
				926	OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	927
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	928	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	929	:doc:`YamlIO`).
				930
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	931	.. TODO
				932	Is the string null terminated? It probably should not if YAML allows it to
				933	contain null characters, otherwise it should be.
				934
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	935	The metadata is represented as a single YAML document comprised of the mapping
				936	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				937	referenced tables.
				938
				939	For boolean values, the string values of ``false`` and ``true`` are used for
				940	false and true respectively.
				941
				942	Additional information can be added to the mappings. To avoid conflicts, any
				943	non-AMD key names should be prefixed by "vendor-name.".
				944
				945	.. table:: AMDHSA Code Object Metadata Mapping
				946	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				947
				948	========== ============== ========= =======================================
				949	String Key Value Type Required? Description
				950	========== ============== ========= =======================================
				951	"Version" sequence of Required - The first integer is the major
				952	2 integers version. Currently 1.
				953	- The second integer is the minor
				954	version. Currently 0.
				955	"Printf" sequence of Each string is encoded information
				956	strings about a printf function call. The
				957	encoded information is organized as
				958	fields separated by colon (':'):
				959
				960	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				961
				962	where:
				963
				964	``ID``
				965	A 32 bit integer as a unique id for
				966	each printf function call
				967
				968	``N``
				969	A 32 bit integer equal to the number
				970	of arguments of printf function call
				971	minus 1
				972
				973	``S[i]`` (where i = 0, 1, ... , N-1)
				974	32 bit integers for the size in bytes
				975	of the i-th FormatString argument of
				976	the printf function call
				977
				978	FormatString
				979	The format string passed to the
				980	printf function call.
				981	"Kernels" sequence of Required Sequence of the mappings for each
				982	mapping kernel in the code object. See
				983	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				984	for the definition of the mapping.
				985	========== ============== ========= =======================================
				986
				987	..
				988
				989	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				990	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				991
				992	================= ============== ========= ================================
				993	String Key Value Type Required? Description
				994	================= ============== ========= ================================
				995	"Name" string Required Source name of the kernel.
				996	"SymbolName" string Required Name of the kernel
				997	descriptor ELF symbol.
				998	"Language" string Source language of the kernel.
				999	Values include:
				1000
				1001	- "OpenCL C"
				1002	- "OpenCL C++"
				1003	- "HCC"
				1004	- "OpenMP"
				1005
				1006	"LanguageVersion" sequence of - The first integer is the major
				1007	2 integers version.
				1008	- The second integer is the
				1009	minor version.
				1010	"Attrs" mapping Mapping of kernel attributes.
				1011	See
				1012	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				1013	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1014	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1015	mapping kernel arguments. See
				1016	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				1017	for the definition of the mapping.
				1018	"CodeProps" mapping Mapping of properties related to
				1019	the kernel code. See
				1020	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				1021	for the mapping definition.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1022	================= ============== ========= ================================
				1023
				1024	..
				1025
				1026	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				1027	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				1028
				1029	=================== ============== ========= ==============================
				1030	String Key Value Type Required? Description
				1031	=================== ============== ========= ==============================
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1032	"ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
				1033	3 integers must be >=1 and the dispatch
				1034	work-group size X, Y, Z must
				1035	correspond to the specified
				1036	values. Defaults to 0, 0, 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1037
				1038	Corresponds to the OpenCL
				1039	``reqd_work_group_size``
				1040	attribute.
				1041	"WorkGroupSizeHint" sequence of The dispatch work-group size
				1042	3 integers X, Y, Z is likely to be the
				1043	specified values.
				1044
				1045	Corresponds to the OpenCL
				1046	``work_group_size_hint``
				1047	attribute.
				1048	"VecTypeHint" string The name of a scalar or vector
				1049	type.
				1050
				1051	Corresponds to the OpenCL
				1052	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	1053
				1054	"RuntimeHandle" string The external symbol name
				1055	associated with a kernel.
				1056	OpenCL runtime allocates a
				1057	global buffer for the symbol
				1058	and saves the kernel's address
				1059	to it, which is used for
				1060	device side enqueueing. Only
				1061	available for device side
				1062	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1063	=================== ============== ========= ==============================
				1064
				1065	..
				1066
				1067	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				1068	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				1069
				1070	================= ============== ========= ================================
				1071	String Key Value Type Required? Description
				1072	================= ============== ========= ================================
				1073	"Name" string Kernel argument name.
				1074	"TypeName" string Kernel argument type name.
				1075	"Size" integer Required Kernel argument size in bytes.
				1076	"Align" integer Required Kernel argument alignment in
				1077	bytes. Must be a power of two.
				1078	"ValueKind" string Required Kernel argument kind that
				1079	specifies how to set up the
				1080	corresponding argument.
				1081	Values include:
				1082
				1083	"ByValue"
				1084	The argument is copied
				1085	directly into the kernarg.
				1086
				1087	"GlobalBuffer"
				1088	A global address space pointer
				1089	to the buffer data is passed
				1090	in the kernarg.
				1091
				1092	"DynamicSharedPointer"
				1093	A group address space pointer
				1094	to dynamically allocated LDS
				1095	is passed in the kernarg.
				1096
				1097	"Sampler"
				1098	A global address space
				1099	pointer to a S# is passed in
				1100	the kernarg.
				1101
				1102	"Image"
				1103	A global address space
				1104	pointer to a T# is passed in
				1105	the kernarg.
				1106
				1107	"Pipe"
				1108	A global address space pointer
				1109	to an OpenCL pipe is passed in
				1110	the kernarg.
				1111
				1112	"Queue"
				1113	A global address space pointer
				1114	to an OpenCL device enqueue
				1115	queue is passed in the
				1116	kernarg.
				1117
				1118	"HiddenGlobalOffsetX"
				1119	The OpenCL grid dispatch
				1120	global offset for the X
				1121	dimension is passed in the
				1122	kernarg.
				1123
				1124	"HiddenGlobalOffsetY"
				1125	The OpenCL grid dispatch
				1126	global offset for the Y
				1127	dimension is passed in the
				1128	kernarg.
				1129
				1130	"HiddenGlobalOffsetZ"
				1131	The OpenCL grid dispatch
				1132	global offset for the Z
				1133	dimension is passed in the
				1134	kernarg.
				1135
				1136	"HiddenNone"
				1137	An argument that is not used
				1138	by the kernel. Space needs to
				1139	be left for it, but it does
				1140	not need to be set up.
				1141
				1142	"HiddenPrintfBuffer"
				1143	A global address space pointer
				1144	to the runtime printf buffer
				1145	is passed in kernarg.
				1146
				1147	"HiddenDefaultQueue"
				1148	A global address space pointer
				1149	to the OpenCL device enqueue
				1150	queue that should be used by
				1151	the kernel by default is
				1152	passed in the kernarg.
				1153
				1154	"HiddenCompletionAction"
Yaxun Liu	c928f2a	2017-10-30 14:30:28 +0000	[diff] [blame]	1155	A global address space pointer
				1156	to help link enqueued kernels into
				1157	the ancestor tree for determining
				1158	when the parent kernel has finished.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1159
				1160	"ValueType" string Required Kernel argument value type. Only
				1161	present if "ValueKind" is
				1162	"ByValue". For vector data
				1163	types, the value is for the
				1164	element type. Values include:
				1165
				1166	- "Struct"
				1167	- "I8"
				1168	- "U8"
				1169	- "I16"
				1170	- "U16"
				1171	- "F16"
				1172	- "I32"
				1173	- "U32"
				1174	- "F32"
				1175	- "I64"
				1176	- "U64"
				1177	- "F64"
				1178
				1179	.. TODO
				1180	How can it be determined if a
				1181	vector type, and what size
				1182	vector?
				1183	"PointeeAlign" integer Alignment in bytes of pointee
				1184	type for pointer type kernel
				1185	argument. Must be a power
				1186	of 2. Only present if
				1187	"ValueKind" is
				1188	"DynamicSharedPointer".
				1189	"AddrSpaceQual" string Kernel argument address space
				1190	qualifier. Only present if
				1191	"ValueKind" is "GlobalBuffer" or
				1192	"DynamicSharedPointer". Values
				1193	are:
				1194
				1195	- "Private"
				1196	- "Global"
				1197	- "Constant"
				1198	- "Local"
				1199	- "Generic"
				1200	- "Region"
				1201
				1202	.. TODO
				1203	Is GlobalBuffer only Global
				1204	or Constant? Is
				1205	DynamicSharedPointer always
				1206	Local? Can HCC allow Generic?
				1207	How can Private or Region
				1208	ever happen?
				1209	"AccQual" string Kernel argument access
				1210	qualifier. Only present if
				1211	"ValueKind" is "Image" or
				1212	"Pipe". Values
				1213	are:
				1214
				1215	- "ReadOnly"
				1216	- "WriteOnly"
				1217	- "ReadWrite"
				1218
				1219	.. TODO
				1220	Does this apply to
				1221	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1222	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1223	performed by the kernel on the
				1224	kernel argument. Only present if
				1225	"ValueKind" is "GlobalBuffer",
				1226	"Image", or "Pipe". This may be
				1227	more restrictive than indicated
				1228	by "AccQual" to reflect what the
				1229	kernel actual does. If not
				1230	present then the runtime must
				1231	assume what is implied by
				1232	"AccQual" and "IsConst". Values
				1233	are:
				1234
				1235	- "ReadOnly"
				1236	- "WriteOnly"
				1237	- "ReadWrite"
				1238
				1239	"IsConst" boolean Indicates if the kernel argument
				1240	is const qualified. Only present
				1241	if "ValueKind" is
				1242	"GlobalBuffer".
				1243
				1244	"IsRestrict" boolean Indicates if the kernel argument
				1245	is restrict qualified. Only
				1246	present if "ValueKind" is
				1247	"GlobalBuffer".
				1248
				1249	"IsVolatile" boolean Indicates if the kernel argument
				1250	is volatile qualified. Only
				1251	present if "ValueKind" is
				1252	"GlobalBuffer".
				1253
				1254	"IsPipe" boolean Indicates if the kernel argument
				1255	is pipe qualified. Only present
				1256	if "ValueKind" is "Pipe".
				1257
				1258	.. TODO
				1259	Can GlobalBuffer be pipe
				1260	qualified?
				1261	================= ============== ========= ================================
				1262
				1263	..
				1264
				1265	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1266	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1267
				1268	============================ ============== ========= =====================
				1269	String Key Value Type Required? Description
				1270	============================ ============== ========= =====================
				1271	"KernargSegmentSize" integer Required The size in bytes of
				1272	the kernarg segment
				1273	that holds the values
				1274	of the arguments to
				1275	the kernel.
				1276	"GroupSegmentFixedSize" integer Required The amount of group
				1277	segment memory
				1278	required by a
				1279	work-group in
				1280	bytes. This does not
				1281	include any
				1282	dynamically allocated
				1283	group segment memory
				1284	that may be added
				1285	when the kernel is
				1286	dispatched.
				1287	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1288	private address space
				1289	memory required for a
				1290	work-item in
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1291	bytes. If the kernel
				1292	uses a dynamic call
				1293	stack then additional
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1294	space must be added
				1295	to this value for the
				1296	call stack.
				1297	"KernargSegmentAlign" integer Required The maximum byte
				1298	alignment of
				1299	arguments in the
				1300	kernarg segment. Must
				1301	be a power of 2.
				1302	"WavefrontSize" integer Required Wavefront size. Must
				1303	be a power of 2.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1304	"NumSGPRs" integer Required Number of scalar
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1305	registers used by a
				1306	wavefront for
				1307	GFX6-GFX9. This
				1308	includes the special
				1309	SGPRs for VCC, Flat
				1310	Scratch (GFX7-GFX9)
				1311	and XNACK (for
				1312	GFX8-GFX9). It does
				1313	not include the 16
				1314	SGPR added if a trap
				1315	handler is
				1316	enabled. It is not
				1317	rounded up to the
				1318	allocation
				1319	granularity.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1320	"NumVGPRs" integer Required Number of vector
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1321	registers used by
				1322	each work-item for
				1323	GFX6-GFX9
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1324	"MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1325	work-group size
				1326	supported by the
				1327	kernel in work-items.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1328	Must be >=1 and
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1329	consistent with
				1330	ReqdWorkGroupSize if
				1331	not 0, 0, 0.
Konstantin Zhuravlyov	06ae4ec	2017-11-28 17:51:08 +0000	[diff] [blame]	1332	"NumSpilledSGPRs" integer Number of stores from
				1333	a scalar register to
				1334	a register allocator
				1335	created spill
				1336	location.
				1337	"NumSpilledVGPRs" integer Number of stores from
				1338	a vector register to
				1339	a register allocator
				1340	created spill
				1341	location.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1342	============================ ============== ========= =====================
				1343
				1344	..
				1345
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1346	Kernel Dispatch
				1347	~~~~~~~~~~~~~~~
				1348
				1349	The HSA architected queuing language (AQL) defines a user space memory interface
				1350	that can be used to control the dispatch of kernels, in an agent independent
				1351	way. An agent can have zero or more AQL queues created for it using the ROCm
				1352	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1353	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1354	mechanics and packet layouts.
				1355
				1356	The packet processor of a kernel agent is responsible for detecting and
				1357	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1358	packet processor is implemented by the hardware command processor (CP),
				1359	asynchronous dispatch controller (ADC) and shader processor input controller
				1360	(SPI).
				1361
				1362	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1363	mode driver to initialize and register the AQL queue with CP.
				1364
				1365	To dispatch a kernel the following actions are performed. This can occur in the
				1366	CPU host program, or from an HSA kernel executing on a GPU.
				1367
				1368	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1369	executed is obtained.
				1370	2. A pointer to the kernel descriptor (see
				1371	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1372	obtained. It must be for a kernel that is contained in a code object that that
				1373	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1374	associated.
				1375	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1376	for a memory region with the kernarg property for the kernel agent that will
				1377	execute the kernel. It must be at least 16 byte aligned.
				1378	4. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	1379	allocation. The layout is defined in the HSA Programmer's Language Reference
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1380	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1381	memory in the same way constant memory is accessed. (Note that the HSA
				1382	specification allows an implementation to copy the kernel argument contents to
				1383	another location that is accessed by the kernel.)
				1384	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1385	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1386	packet. The packet must be set up, and the final write must use an atomic
				1387	store release to set the packet kind to ensure the packet contents are
				1388	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1389	notify the kernel agent that the AQL queue has been updated. These rules, and
				1390	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1391	System Architecture Specification* [HSA]_.
				1392	6. A kernel dispatch packet includes information about the actual dispatch,
				1393	such as grid and work-group size, together with information from the code
				1394	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1395	the kernel symbol can be used to obtain the code object values which are
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1396	recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1397	7. CP executes micro-code and is responsible for detecting and setting up the
				1398	GPU to execute the wavefronts of a kernel dispatch.
				1399	8. CP ensures that when the a wavefront starts executing the kernel machine
				1400	code, the scalar general purpose registers (SGPR) and vector general purpose
				1401	registers (VGPR) are set up as required by the machine code. The required
				1402	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1403	register state is defined in
				1404	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1405	9. The prolog of the kernel machine code (see
				1406	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1407	before continuing executing the machine code that corresponds to the kernel.
				1408	10. When the kernel dispatch has completed execution, CP signals the completion
				1409	signal specified in the kernel dispatch packet if not 0.
				1410
				1411	.. _amdgpu-amdhsa-memory-spaces:
				1412
				1413	Memory Spaces
				1414	~~~~~~~~~~~~~
				1415
				1416	The memory space properties are:
				1417
				1418	.. table:: AMDHSA Memory Spaces
				1419	:name: amdgpu-amdhsa-memory-spaces-table
				1420
				1421	================= =========== ======== ======= ==================
				1422	Memory Space Name HSA Segment Hardware Address NULL Value
				1423	Name Name Size
				1424	================= =========== ======== ======= ==================
				1425	Private private scratch 32 0x00000000
				1426	Local group LDS 32 0xFFFFFFFF
				1427	Global global global 64 0x0000000000000000
				1428	Constant constant *same as 64 0x0000000000000000
				1429	global*
				1430	Generic flat flat 64 0x0000000000000000
				1431	Region N/A GDS 32 *not implemented
				1432	for AMDHSA*
				1433	================= =========== ======== ======= ==================
				1434
				1435	The global and constant memory spaces both use global virtual addresses, which
				1436	are the same virtual address space used by the CPU. However, some virtual
				1437	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1438	and some by both.
				1439
				1440	Using the constant memory space indicates that the data will not change during
				1441	the execution of the kernel. This allows scalar read instructions to be
				1442	used. The vector and scalar L1 caches are invalidated of volatile data before
				1443	each kernel dispatch execution to allow constant memory to change values between
				1444	kernel dispatches.
				1445
				1446	The local memory space uses the hardware Local Data Store (LDS) which is
				1447	automatically allocated when the hardware creates work-groups of wavefronts, and
				1448	freed when all the wavefronts of a work-group have terminated. The data store
				1449	(DS) instructions can be used to access it.
				1450
				1451	The private memory space uses the hardware scratch memory support. If the kernel
				1452	uses scratch, then the hardware allocates memory that is accessed using
				1453	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1454	address to physical address is:
				1455
				1456	``wavefront-scratch-base +
				1457	(private-address * wavefront-size * 4) +
				1458	(wavefront-lane-id * 4)``
				1459
				1460	There are different ways that the wavefront scratch base address is determined
				1461	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1462	memory can be accessed in an interleaved manner using buffer instruction with
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1463	the scratch buffer descriptor and per wavefront scratch offset, by the scratch
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1464	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1465	same private address, the interleaving results in adjacent dwords being accessed
				1466	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1467	supported except by flat and scratch instructions in GFX9.
				1468
				1469	The generic address space uses the hardware flat address support available in
				1470	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1471	local appertures), that are outside the range of addressible global memory, to
				1472	map from a flat address to a private or local address.
				1473
				1474	FLAT instructions can take a flat address and access global, private (scratch)
				1475	and group (LDS) memory depending in if the address is within one of the
				1476	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1477	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1478	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1479	(see :ref:`amdgpu-amdhsa-m0`).
				1480
				1481	To convert between a segment address and a flat address the base address of the
				1482	appertures address can be used. For GFX7-GFX8 these are available in the
				1483	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1484	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1485	GFX9 the appature base addresses are directly available as inline constant
				1486	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1487	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1488	which makes it easier to convert from flat to segment or segment to flat.
				1489
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1490	Image and Samplers
				1491	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1492
				1493	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1494	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1495	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1496	enumeration values for the queries that are not trivially deducible from the S#
				1497	representation.
				1498
				1499	HSA Signals
				1500	~~~~~~~~~~~
				1501
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1502	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1503	structure allocated in memory accessible from both the CPU and GPU. The
				1504	structure is defined by the ROCm runtime and subject to change between releases
				1505	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1506
				1507	.. _amdgpu-amdhsa-hsa-aql-queue:
				1508
				1509	HSA AQL Queue
				1510	~~~~~~~~~~~~~
				1511
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1512	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1513	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1514	fields needed to implement certain language features such as the flat address
				1515	aperture bases. It also contains fields used by CP such as managing the
				1516	allocation of scratch memory.
				1517
				1518	.. _amdgpu-amdhsa-kernel-descriptor:
				1519
				1520	Kernel Descriptor
				1521	~~~~~~~~~~~~~~~~~
				1522
				1523	A kernel descriptor consists of the information needed by CP to initiate the
				1524	execution of a kernel, including the entry point address of the machine code
				1525	that implements the kernel.
				1526
				1527	Kernel Descriptor for GFX6-GFX9
				1528	+++++++++++++++++++++++++++++++
				1529
				1530	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1531
				1532	.. table:: Kernel Descriptor for GFX6-GFX9
				1533	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1534
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1535	======= ======= =============================== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1536	Bits Size Field Name Description
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1537	======= ======= =============================== ============================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1538	31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1539	address space memory
				1540	required for a work-group
				1541	in bytes. This does not
				1542	include any dynamically
				1543	allocated local address
				1544	space memory that may be
				1545	added when the kernel is
				1546	dispatched.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1547	63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1548	private address space
				1549	memory required for a
				1550	work-item in bytes. If
				1551	is_dynamic_callstack is 1
				1552	then additional space must
				1553	be added to this value for
				1554	the call stack.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1555	127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1556	191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1557	negative) from base
				1558	address of kernel
				1559	descriptor to kernel's
				1560	entry point instruction
				1561	which must be 256 byte
				1562	aligned.
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1563	383:192 24 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1564	bytes
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1565	415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1566	program settings used by
				1567	CP to set up
				1568	``COMPUTE_PGM_RSRC1``
				1569	configuration
				1570	register. See
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1571	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1572	447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1573	program settings used by
				1574	CP to set up
				1575	``COMPUTE_PGM_RSRC2``
				1576	configuration
				1577	register. See
				1578	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1579	448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
				1580	SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1581	(see
				1582	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1583
				1584	The total number of SGPR
				1585	user data registers
				1586	requested must not exceed
				1587	16 and match value in
				1588	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1589	Any requests beyond 16
				1590	will be ignored.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1591	449 1 bit EnableSGPRDispatchPtr see above
				1592	450 1 bit EnableSGPRQueuePtr see above
				1593	451 1 bit EnableSGPRKernargSegmentPtr see above
				1594	452 1 bit EnableSGPRDispatchID see above
				1595	453 1 bit EnableSGPRFlatScratchInit see above
				1596	454 1 bit EnableSGPRPrivateSegmentSize see above
				1597	455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
				1598	should always be 0.
				1599	456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
				1600	should always be 0.
				1601	457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
				1602	should always be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	1603	463:458 6 bits Reserved, must be 0.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1604	511:464 6 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1605	bytes
				1606	512 Total size 64 bytes.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1607	======= ====================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1608
				1609	..
				1610
				1611	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1612	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1613
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1614	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1615	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1616	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1617	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1618	used by each work-item,
				1619	granularity is device
				1620	specific:
				1621
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1622	GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1623	- max_vgpr 1..256
				1624	- roundup((max_vgpg + 1)
				1625	/ 4) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1626
				1627	Used by CP to set up
				1628	``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1629	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1630	used by a wavefront,
				1631	granularity is device
				1632	specific:
				1633
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1634	GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1635	- max_sgpr 1..112
				1636	- roundup((max_sgpg + 1)
				1637	/ 8) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1638	GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1639	- max_sgpr 1..112
				1640	- roundup((max_sgpg + 1)
				1641	/ 16) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1642
				1643	Includes the special SGPRs
				1644	for VCC, Flat Scratch (for
				1645	GFX7 onwards) and XNACK
				1646	(for GFX8 onwards). It does
				1647	not include the 16 SGPR
				1648	added if a trap handler is
				1649	enabled.
				1650
				1651	Used by CP to set up
				1652	``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1653	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1654
				1655	Start executing wavefront
				1656	at the specified priority.
				1657
				1658	CP is responsible for
				1659	filling in
				1660	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1661	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1662	with specified rounding
				1663	mode for single (32
				1664	bit) floating point
				1665	precision floating point
				1666	operations.
				1667
				1668	Floating point rounding
				1669	mode values are defined in
				1670	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1671
				1672	Used by CP to set up
				1673	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1674	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1675	with specified rounding
				1676	denorm mode for half/double (16
				1677	and 64 bit) floating point
				1678	precision floating point
				1679	operations.
				1680
				1681	Floating point rounding
				1682	mode values are defined in
				1683	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1684
				1685	Used by CP to set up
				1686	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1687	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1688	with specified denorm mode
				1689	for single (32
				1690	bit) floating point
				1691	precision floating point
				1692	operations.
				1693
				1694	Floating point denorm mode
				1695	values are defined in
				1696	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1697
				1698	Used by CP to set up
				1699	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1700	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1701	with specified denorm mode
				1702	for half/double (16
				1703	and 64 bit) floating point
				1704	precision floating point
				1705	operations.
				1706
				1707	Floating point denorm mode
				1708	values are defined in
				1709	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1710
				1711	Used by CP to set up
				1712	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1713	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1714
				1715	Start executing wavefront
				1716	in privilege trap handler
				1717	mode.
				1718
				1719	CP is responsible for
				1720	filling in
				1721	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1722	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1723	with DX10 clamp mode
				1724	enabled. Used by the vector
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1725	ALU to force DX10 style
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1726	treatment of NaN's (when
				1727	set, clamp NaN to zero,
				1728	otherwise pass NaN
				1729	through).
				1730
				1731	Used by CP to set up
				1732	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1733	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1734
				1735	Start executing wavefront
				1736	in single step mode.
				1737
				1738	CP is responsible for
				1739	filling in
				1740	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1741	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1742	with IEEE mode
				1743	enabled. Floating point
				1744	opcodes that support
				1745	exception flag gathering
				1746	will quiet and propagate
				1747	signaling-NaN inputs per
				1748	IEEE 754-2008. Min_dx10 and
				1749	max_dx10 become IEEE
				1750	754-2008 compliant due to
				1751	signaling-NaN propagation
				1752	and quieting.
				1753
				1754	Used by CP to set up
				1755	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1756	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1757
				1758	Only one work-group allowed
				1759	to execute on a compute
				1760	unit.
				1761
				1762	CP is responsible for
				1763	filling in
				1764	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1765	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1766
				1767	Flag that can be used to
				1768	control debugging code.
				1769
				1770	CP is responsible for
				1771	filling in
				1772	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1773	26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1774	Reserved, must be 0.
				1775	GFX9
				1776	Wavefront starts execution
				1777	with specified fp16 overflow
				1778	mode.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1779
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1780	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1781	+/-INF values.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1782	- If 1, fp16 overflow that is the
				1783	result of an +/-INF input value
				1784	or divide by 0 produces a +/-INF,
				1785	otherwise clamps computed
				1786	overflow to +/-MAX_FP16 as
				1787	appropriate.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1788
				1789	Used by CP to set up
				1790	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1791	31:27 5 bits Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1792	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1793	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1794
				1795	..
				1796
				1797	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1798	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1799
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1800	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1801	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1802	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1803	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1804	_WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1805	system register (see
				1806	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1807
				1808	Used by CP to set up
				1809	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1810	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1811	user data registers
				1812	requested. This number must
				1813	match the number of user
				1814	data registers enabled.
				1815
				1816	Used by CP to set up
				1817	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1818	6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1819	TRAP instruction which
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1820	requires a trap handler to
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1821	be enabled.
				1822
				1823	CP sets
				1824	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
				1825	if the runtime has
				1826	installed a trap handler
				1827	regardless of the setting
				1828	of this field.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1829	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1830	system SGPR register for
				1831	the work-group id in the X
				1832	dimension (see
				1833	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1834
				1835	Used by CP to set up
				1836	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1837	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1838	system SGPR register for
				1839	the work-group id in the Y
				1840	dimension (see
				1841	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1842
				1843	Used by CP to set up
				1844	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1845	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1846	system SGPR register for
				1847	the work-group id in the Z
				1848	dimension (see
				1849	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1850
				1851	Used by CP to set up
				1852	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1853	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1854	system SGPR register for
				1855	work-group information (see
				1856	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1857
				1858	Used by CP to set up
				1859	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1860	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1861	VGPR system registers used
				1862	for the work-item ID.
				1863	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1864	defines the values.
				1865
				1866	Used by CP to set up
				1867	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1868	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1869
				1870	Wavefront starts execution
				1871	with address watch
				1872	exceptions enabled which
				1873	are generated when L1 has
				1874	witnessed a thread access
				1875	an *address of
				1876	interest*.
				1877
				1878	CP is responsible for
				1879	filling in the address
				1880	watch bit in
				1881	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1882	according to what the
				1883	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1884	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1885
				1886	Wavefront starts execution
				1887	with memory violation
				1888	exceptions exceptions
				1889	enabled which are generated
				1890	when a memory violation has
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1891	occurred for this wavefront from
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1892	L1 or LDS
				1893	(write-to-read-only-memory,
				1894	mis-aligned atomic, LDS
				1895	address out of range,
				1896	illegal address, etc.).
				1897
				1898	CP sets the memory
				1899	violation bit in
				1900	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1901	according to what the
				1902	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1903	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1904
				1905	CP uses the rounded value
				1906	from the dispatch packet,
				1907	not this value, as the
				1908	dispatch may contain
				1909	dynamically allocated group
				1910	segment memory. CP writes
				1911	directly to
				1912	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1913
				1914	Amount of group segment
				1915	(LDS) to allocate for each
				1916	work-group. Granularity is
				1917	device specific:
				1918
				1919	GFX6:
				1920	roundup(lds-size / (64 * 4))
				1921	GFX7-GFX9:
				1922	roundup(lds-size / (128 * 4))
				1923
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1924	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				1925	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1926	enabled.
				1927
				1928	Used by CP to set up
				1929	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1930	(set from bits 0..6).
				1931
				1932	IEEE 754 FP Invalid
				1933	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1934	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				1935	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1936	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1937	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				1938	_DIVISION_BY_ZERO Zero
				1939	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
				1940	_OVERFLOW
				1941	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				1942	_UNDERFLOW
				1943	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				1944	_INEXACT
				1945	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				1946	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1947	only)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1948	31 1 bit Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1949	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1950	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1951
				1952	..
				1953
				1954	.. table:: Floating Point Rounding Mode Enumeration Values
				1955	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1956
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1957	====================================== ===== ==============================
				1958	Enumeration Name Value Description
				1959	====================================== ===== ==============================
				1960	AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				1961	AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				1962	AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				1963	AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
				1964	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1965
				1966	..
				1967
				1968	.. table:: Floating Point Denorm Mode Enumeration Values
				1969	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				1970
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1971	====================================== ===== ==============================
				1972	Enumeration Name Value Description
				1973	====================================== ===== ==============================
				1974	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
				1975	Denorms
				1976	AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				1977	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				1978	AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
				1979	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1980
				1981	..
				1982
				1983	.. table:: System VGPR Work-Item ID Enumeration Values
				1984	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				1985
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1986	======================================== ===== ============================
				1987	Enumeration Name Value Description
				1988	======================================== ===== ============================
				1989	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
				1990	ID.
				1991	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				1992	dimensions ID.
				1993	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				1994	dimensions ID.
				1995	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				1996	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1997
				1998	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				1999
				2000	Initial Kernel Execution State
				2001	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				2002
				2003	This section defines the register state that will be set up by the packet
				2004	processor prior to the start of execution of every wavefront. This is limited by
				2005	the constraints of the hardware controllers of CP/ADC/SPI.
				2006
				2007	The order of the SGPR registers is defined, but the compiler can specify which
				2008	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				2009	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2010	for enabled registers are dense starting at SGPR0: the first enabled register is
				2011	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				2012	an SGPR number.
				2013
				2014	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2015	all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2016	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				2017	initialized. These are then immediately followed by the System SGPRs that are
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2018	set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2019	dispatch.
				2020
				2021	SGPR register initial state is defined in
				2022	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				2023
				2024	.. table:: SGPR Register Set Up Order
				2025	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				2026
				2027	========== ========================== ====== ==============================
				2028	SGPR Order Name Number Description
				2029	(kernel descriptor enable of
				2030	field) SGPRs
				2031	========== ========================== ====== ==============================
				2032	First Private Segment Buffer 4 V# that can be used, together
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2033	(enable_sgpr_private with Scratch Wavefront Offset
				2034	_segment_buffer) as an offset, to access the
				2035	private memory space using a
				2036	segment address.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2037
				2038	CP uses the value provided by
				2039	the runtime.
				2040	then Dispatch Ptr 2 64 bit address of AQL dispatch
				2041	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				2042	actually executing.
				2043	then Queue Ptr 2 64 bit address of amd_queue_t
				2044	(enable_sgpr_queue_ptr) object for AQL queue on which
				2045	the dispatch packet was
				2046	queued.
				2047	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				2048	(enable_sgpr_kernarg segment. This is directly
				2049	_segment_ptr) copied from the
				2050	kernarg_address in the kernel
				2051	dispatch packet.
				2052
				2053	Having CP load it once avoids
				2054	loading it at the beginning of
				2055	every wavefront.
				2056	then Dispatch Id 2 64 bit Dispatch ID of the
				2057	(enable_sgpr_dispatch_id) dispatch packet being
				2058	executed.
				2059	then Flat Scratch Init 2 This is 2 SGPRs:
				2060	(enable_sgpr_flat_scratch
				2061	_init) GFX6
				2062	Not supported.
				2063	GFX7-GFX8
				2064	The first SGPR is a 32 bit
				2065	byte offset from
				2066	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2067	to per SPI base of memory
				2068	for scratch for the queue
				2069	executing the kernel
				2070	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2071	from the runtime. (The
				2072	Scratch Segment Buffer base
				2073	address is
				2074	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2075	plus this offset.) The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2076	of Scratch Wavefront Offset must
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2077	be added to this offset by
				2078	the kernel machine code,
				2079	right shifted by 8, and
				2080	moved to the FLAT_SCRATCH_HI
				2081	SGPR register.
				2082	FLAT_SCRATCH_HI corresponds
				2083	to SGPRn-4 on GFX7, and
				2084	SGPRn-6 on GFX8 (where SGPRn
				2085	is the highest numbered SGPR
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2086	allocated to the wavefront).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2087	FLAT_SCRATCH_HI is
				2088	multiplied by 256 (as it is
				2089	in units of 256 bytes) and
				2090	added to
				2091	``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2092	to calculate the per wavefront
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2093	FLAT SCRATCH BASE in flat
				2094	memory instructions that
				2095	access the scratch
				2096	apperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2097
				2098	The second SGPR is 32 bit
				2099	byte size of a single
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2100	work-item's scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2101	usage. CP obtains this from
				2102	the runtime, and it is
				2103	always a multiple of DWORD.
				2104	CP checks that the value in
				2105	the kernel dispatch packet
				2106	Private Segment Byte Size is
				2107	not larger, and requests the
				2108	runtime to increase the
				2109	queue's scratch size if
				2110	necessary. The kernel code
				2111	must move it to
				2112	FLAT_SCRATCH_LO which is
				2113	SGPRn-3 on GFX7 and SGPRn-5
				2114	on GFX8. FLAT_SCRATCH_LO is
				2115	used as the FLAT SCRATCH
				2116	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2117	instructions. Having CP load
				2118	it once avoids loading it at
				2119	the beginning of every
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2120	wavefront.
				2121	GFX9
				2122	This is the
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2123	64 bit base address of the
				2124	per SPI scratch backing
				2125	memory managed by SPI for
				2126	the queue executing the
				2127	kernel dispatch. CP obtains
				2128	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2129	divides it if there are
				2130	multiple Shader Arrays each
				2131	with its own SPI). The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2132	of Scratch Wavefront Offset must
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2133	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2134	machine code and the result
				2135	moved to the FLAT_SCRATCH
				2136	SGPR which is SGPRn-6 and
				2137	SGPRn-5. It is used as the
				2138	FLAT SCRATCH BASE in flat
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2139	memory instructions.
				2140	then Private Segment Size 1 The 32 bit byte size of a
				2141	(enable_sgpr_private single
				2142	work-item's
				2143	scratch_segment_size) memory
				2144	allocation. This is the
				2145	value from the kernel
				2146	dispatch packet Private
				2147	Segment Byte Size rounded up
				2148	by CP to a multiple of
				2149	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2150
				2151	Having CP load it once avoids
				2152	loading it at the beginning of
				2153	every wavefront.
				2154
				2155	This is not used for
				2156	GFX7-GFX8 since it is the same
				2157	value as the second SGPR of
				2158	Flat Scratch Init. However, it
				2159	may be needed for GFX9 which
				2160	changes the meaning of the
				2161	Flat Scratch Init value.
				2162	then Grid Work-Group Count X 1 32 bit count of the number of
				2163	(enable_sgpr_grid work-groups in the X dimension
				2164	_workgroup_count_X) for the grid being
				2165	executed. Computed from the
				2166	fields in the kernel dispatch
				2167	packet as ((grid_size.x +
				2168	workgroup_size.x - 1) /
				2169	workgroup_size.x).
				2170	then Grid Work-Group Count Y 1 32 bit count of the number of
				2171	(enable_sgpr_grid work-groups in the Y dimension
				2172	_workgroup_count_Y && for the grid being
				2173	less than 16 previous executed. Computed from the
				2174	SGPRs) fields in the kernel dispatch
				2175	packet as ((grid_size.y +
				2176	workgroup_size.y - 1) /
				2177	workgroupSize.y).
				2178
				2179	Only initialized if <16
				2180	previous SGPRs initialized.
				2181	then Grid Work-Group Count Z 1 32 bit count of the number of
				2182	(enable_sgpr_grid work-groups in the Z dimension
				2183	_workgroup_count_Z && for the grid being
				2184	less than 16 previous executed. Computed from the
				2185	SGPRs) fields in the kernel dispatch
				2186	packet as ((grid_size.z +
				2187	workgroup_size.z - 1) /
				2188	workgroupSize.z).
				2189
				2190	Only initialized if <16
				2191	previous SGPRs initialized.
				2192	then Work-Group Id X 1 32 bit work-group id in X
				2193	(enable_sgpr_workgroup_id dimension of grid for
				2194	_X) wavefront.
				2195	then Work-Group Id Y 1 32 bit work-group id in Y
				2196	(enable_sgpr_workgroup_id dimension of grid for
				2197	_Y) wavefront.
				2198	then Work-Group Id Z 1 32 bit work-group id in Z
				2199	(enable_sgpr_workgroup_id dimension of grid for
				2200	_Z) wavefront.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2201	then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2202	(enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2203	_info) threadgroup_size_in_wavefronts[5:0]}
				2204	then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2205	(enable_sgpr_private of scratch base of queue
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2206	_segment_wavefront_offset) executing the kernel
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2207	dispatch. Must be used as an
				2208	offset with Private
				2209	segment address when using
				2210	Scratch Segment Buffer. It
				2211	must be used to set up FLAT
				2212	SCRATCH for flat addressing
				2213	(see
				2214	:ref:`amdgpu-amdhsa-flat-scratch`).
				2215	========== ========================== ====== ==============================
				2216
				2217	The order of the VGPR registers is defined, but the compiler can specify which
				2218	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2219	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2220	for enabled registers are dense starting at VGPR0: the first enabled register is
				2221	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2222	VGPR number.
				2223
				2224	VGPR register initial state is defined in
				2225	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2226
				2227	.. table:: VGPR Register Set Up Order
				2228	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2229
				2230	========== ========================== ====== ==============================
				2231	VGPR Order Name Number Description
				2232	(kernel descriptor enable of
				2233	field) VGPRs
				2234	========== ========================== ====== ==============================
				2235	First Work-Item Id X 1 32 bit work item id in X
				2236	(Always initialized) dimension of work-group for
				2237	wavefront lane.
				2238	then Work-Item Id Y 1 32 bit work item id in Y
				2239	(enable_vgpr_workitem_id dimension of work-group for
				2240	> 0) wavefront lane.
				2241	then Work-Item Id Z 1 32 bit work item id in Z
				2242	(enable_vgpr_workitem_id dimension of work-group for
				2243	> 1) wavefront lane.
				2244	========== ========================== ====== ==============================
				2245
				2246	The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
				2247
				2248	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2249	registers.
				2250	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2251	combination including none.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2252	3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
				2253	its value cannot included with the flat scratch init value which is per queue.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2254	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2255	or (X, Y, Z).
				2256
				2257	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2258	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2259
				2260	The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2261	has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2262	instructions (GFX9).
				2263
				2264	If buffer operations are used then the compiler can generate a V# with the
				2265	following properties:
				2266
				2267	* base address of 0
				2268	* no swizzle
				2269	* ATC: 1 if IOMMU present (such as APU)
				2270	* ptr64: 1
				2271	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2272	APU and NC for dGPU).
				2273
				2274	.. _amdgpu-amdhsa-kernel-prolog:
				2275
				2276	Kernel Prolog
				2277	~~~~~~~~~~~~~
				2278
				2279	.. _amdgpu-amdhsa-m0:
				2280
				2281	M0
				2282	++
				2283
				2284	GFX6-GFX8
				2285	The M0 register must be initialized with a value at least the total LDS size
				2286	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2287	available in dispatch packet. For M0, it is also possible to use maximum
				2288	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2289	GFX7-GFX8).
				2290	GFX9
				2291	The M0 register is not used for range checking LDS accesses and so does not
				2292	need to be initialized in the prolog.
				2293
				2294	.. _amdgpu-amdhsa-flat-scratch:
				2295
				2296	Flat Scratch
				2297	++++++++++++
				2298
				2299	If the kernel may use flat operations to access scratch memory, the prolog code
				2300	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2301	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2302	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2303
				2304	GFX6
				2305	Flat scratch is not supported.
				2306
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2307	GFX7-GFX8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2308	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2309	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2310	being managed by SPI for the queue executing the kernel dispatch. This is
				2311	the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2312	prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2313	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2314	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2315	by 8 before moving into FLAT_SCRATCH_LO.
				2316	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2317	work-items scratch memory usage. This is directly loaded from the kernel
				2318	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2319	DWORD. Having CP load it once avoids loading it at the beginning of every
				2320	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2321	SIZE.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2322
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2323	GFX9
				2324	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2325	memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2326	prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2327	pair for use as the flat scratch base in flat memory instructions.
				2328
				2329	.. _amdgpu-amdhsa-memory-model:
				2330
				2331	Memory Model
				2332	~~~~~~~~~~~~
				2333
				2334	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2335	(see :ref:`memmodel`). The implementation is WIP.
				2336
				2337	.. TODO
				2338	Update when implementation complete.
				2339
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2340	The AMDGPU backend supports the memory synchronization scopes specified in
				2341	:ref:`amdgpu-memory-scopes`.
				2342
				2343	The code sequences used to implement the memory model are defined in table
				2344	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2345
				2346	The sequences specify the order of instructions that a single thread must
				2347	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2348	to other memory instructions executed by the same thread. This allows them to be
				2349	moved earlier or later which can allow them to be combined with other instances
				2350	of the same instruction, or hoisted/sunk out of loops to improve
				2351	performance. Only the instructions related to the memory model are given;
				2352	additional ``s_waitcnt`` instructions are required to ensure registers are
				2353	defined before being used. These may be able to be combined with the memory
				2354	model ``s_waitcnt`` instructions as described above.
				2355
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2356	The AMDGPU backend supports the following memory models:
				2357
				2358	HSA Memory Model [HSA]_
				2359	The HSA memory model uses a single happens-before relation for all address
				2360	spaces (see :ref:`amdgpu-address-spaces`).
				2361	OpenCL Memory Model [OpenCL]_
				2362	The OpenCL memory model which has separate happens-before relations for the
				2363	global and local address spaces. Only a fence specifying both global and
				2364	local address space, and seq_cst instructions join the relationships. Since
				2365	the LLVM ``memfence`` instruction does not allow an address space to be
				2366	specified the OpenCL fence has to convervatively assume both local and
				2367	global address space was specified. However, optimizations can often be
				2368	done to eliminate the additional ``s_waitcnt`` instructions when there are
				2369	no intervening memory instructions which access the corresponding address
				2370	space. The code sequences in the table indicate what can be omitted for the
				2371	OpenCL memory. The target triple environment is used to determine if the
				2372	source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2373
				2374	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2375	operations.
				2376
				2377	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2378	termed vector memory operations.
				2379
				2380	For GFX6-GFX9:
				2381
				2382	* Each agent has multiple compute units (CU).
				2383	* Each CU has multiple SIMDs that execute wavefronts.
				2384	* The wavefronts for a single work-group are executed in the same CU but may be
				2385	executed by different SIMDs.
				2386	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2387	executing on it.
				2388	* All LDS operations of a CU are performed as wavefront wide operations in a
				2389	global order and involve no caching. Completion is reported to a wavefront in
				2390	execution order.
				2391	* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2392	CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2393	can be reordered relative to each other, which can result in reordering the
				2394	visibility of vector memory operations with respect to LDS operations of other
				2395	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2396	ensure synchronization between LDS operations and vector memory operations
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2397	between wavefronts of a work-group, but not between operations performed by the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2398	same wavefront.
				2399	* The vector memory operations are performed as wavefront wide operations and
				2400	completion is reported to a wavefront in execution order. The exception is
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2401	that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2402	vector memory order if they access LDS memory, and out of LDS operation order
				2403	if they access global memory.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2404	* The vector memory operations access a single vector L1 cache shared by all
				2405	SIMDs a CU. Therefore, no special action is required for coherence between the
				2406	lanes of a single wavefront, or for coherence between wavefronts in the same
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2407	work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2408	executing in different work-groups as they may be executing on different CUs.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2409	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2410	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2411	scalar operations are used in a restricted way so do not impact the memory
				2412	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2413	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2414	the same agent.
				2415	* The L2 cache has independent channels to service disjoint ranges of virtual
				2416	addresses.
				2417	* Each CU has a separate request queue per channel. Therefore, the vector and
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2418	scalar memory operations performed by wavefronts executing in different work-groups
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2419	(which may be executing on different CUs) of an agent can be reordered
				2420	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2421	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2422	previous vector memory operation has completed before executing a subsequent
				2423	vector memory or LDS operation and so can be used to meet the requirements of
				2424	acquire and release.
				2425	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2426	of virtual addresses can be set up to bypass it to ensure system coherence.
				2427
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2428	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2429	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2430	memory, atomic memory orderings are not meaningful and all accesses are treated
				2431	as non-atomic.
				2432
				2433	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2434	scalar memory instructions). Since the constant address space contents do not
				2435	change during the execution of a kernel dispatch it is not legal to perform
				2436	stores, and atomic memory orderings are not meaningful and all access are
				2437	treated as non-atomic.
				2438
				2439	A memory synchronization scope wider than work-group is not meaningful for the
				2440	group (LDS) address space and is treated as work-group.
				2441
				2442	The memory model does not support the region address space which is treated as
				2443	non-atomic.
				2444
				2445	Acquire memory ordering is not meaningful on store atomic instructions and is
				2446	treated as non-atomic.
				2447
				2448	Release memory ordering is not meaningful on load atomic instructions and is
				2449	treated a non-atomic.
				2450
				2451	Acquire-release memory ordering is not meaningful on load or store atomic
				2452	instructions and is treated as acquire and release respectively.
				2453
				2454	AMDGPU backend only uses scalar memory operations to access memory that is
				2455	proven to not change during the execution of the kernel dispatch. This includes
				2456	constant address space and global address space for program scope const
				2457	variables. Therefore the kernel machine code does not have to maintain the
				2458	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2459	and vector L1 caches are invalidated between kernel dispatches by CP since
				2460	constant address space data may change between kernel dispatch executions. See
				2461	:ref:`amdgpu-amdhsa-memory-spaces`.
				2462
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2463	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2464	case the AMDGPU backend ensures the memory location used to spill is never
				2465	accessed by vector memory operations at the same time. If scalar writes are used
				2466	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2467	return since the locations may be used for vector memory instructions by a
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2468	future wavefront that uses the same scratch area, or a function call that creates a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2469	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2470	as all scalar writes are write-before-read in the same thread.
				2471
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2472	Scratch backing memory (which is used for the private address space)
				2473	is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
				2474	address space is only accessed by a single thread, and is always
				2475	write-before-read, there is never a need to invalidate these entries from the L1
				2476	cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
				2477	volatile cache lines.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2478
				2479	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2480	to invalidate the L2 cache. This also causes it to be treated as
				2481	non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
				2482	(cache coherent) and so the L2 cache will coherent with the CPU and other
				2483	agents.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2484
				2485	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2486	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2487
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2488	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2489	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2490	Ordering Sync Scope Address
				2491	Space
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2492	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2493	Non-Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2494	-----------------------------------------------------------------------------------
				2495	load none none - global - !volatile & !nontemporal
				2496	- generic
				2497	- private 1. buffer/global/flat_load
				2498	- constant
				2499	- volatile & !nontemporal
				2500
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2501	1. buffer/global/flat_load
				2502	glc=1
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2503
				2504	- nontemporal
				2505
				2506	1. buffer/global/flat_load
				2507	glc=1 slc=1
				2508
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2509	load none none - local 1. ds_load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2510	store none none - global - !nontemporal
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2511	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2512	- private 1. buffer/global/flat_store
				2513	- constant
				2514	- nontemporal
				2515
				2516	1. buffer/global/flat_stote
				2517	glc=1 slc=1
				2518
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2519	store none none - local 1. ds_store
				2520	Unordered Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2521	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2522	load atomic unordered any any Same as non-atomic.
				2523	store atomic unordered any any Same as non-atomic.
				2524	atomicrmw unordered any any *Same as monotonic
				2525	atomic*.
				2526	Monotonic Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2527	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2528	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2529	- wavefront - generic
				2530	- workgroup
				2531	load atomic monotonic - singlethread - local 1. ds_load
				2532	- wavefront
				2533	- workgroup
				2534	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2535	- system - generic glc=1
				2536	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2537	- wavefront - generic
				2538	- workgroup
				2539	- agent
				2540	- system
				2541	store atomic monotonic - singlethread - local 1. ds_store
				2542	- wavefront
				2543	- workgroup
				2544	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2545	- wavefront - generic
				2546	- workgroup
				2547	- agent
				2548	- system
				2549	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2550	- wavefront
				2551	- workgroup
				2552	Acquire Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2553	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2554	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2555	- wavefront - local
				2556	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2557	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				2558	load atomic acquire - workgroup - local 1. ds_load
				2559	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2560
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2561	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2562	- Must happen before
				2563	any following
				2564	global/generic
				2565	load/load
				2566	atomic/store/store
				2567	atomic/atomicrmw.
				2568	- Ensures any
				2569	following global
				2570	data read is no
				2571	older than the load
				2572	atomic value being
				2573	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2574	load atomic acquire - workgroup - generic 1. flat_load
				2575	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2576
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2577	- If OpenCL, omit.
				2578	- Must happen before
				2579	any following
				2580	global/generic
				2581	load/load
				2582	atomic/store/store
				2583	atomic/atomicrmw.
				2584	- Ensures any
				2585	following global
				2586	data read is no
				2587	older than the load
				2588	atomic value being
				2589	acquired.
				2590	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2591	- system glc=1
				2592	2. s_waitcnt vmcnt(0)
				2593
				2594	- Must happen before
				2595	following
				2596	buffer_wbinvl1_vol.
				2597	- Ensures the load
				2598	has completed
				2599	before invalidating
				2600	the cache.
				2601
				2602	3. buffer_wbinvl1_vol
				2603
				2604	- Must happen before
				2605	any following
				2606	global/generic
				2607	load/load
				2608	atomic/atomicrmw.
				2609	- Ensures that
				2610	following
				2611	loads will not see
				2612	stale global data.
				2613
				2614	load atomic acquire - agent - generic 1. flat_load glc=1
				2615	- system 2. s_waitcnt vmcnt(0) &
				2616	lgkmcnt(0)
				2617
				2618	- If OpenCL omit
				2619	lgkmcnt(0).
				2620	- Must happen before
				2621	following
				2622	buffer_wbinvl1_vol.
				2623	- Ensures the flat_load
				2624	has completed
				2625	before invalidating
				2626	the cache.
				2627
				2628	3. buffer_wbinvl1_vol
				2629
				2630	- Must happen before
				2631	any following
				2632	global/generic
				2633	load/load
				2634	atomic/atomicrmw.
				2635	- Ensures that
				2636	following loads
				2637	will not see stale
				2638	global data.
				2639
				2640	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2641	- wavefront - local
				2642	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2643	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				2644	atomicrmw acquire - workgroup - local 1. ds_atomic
				2645	2. waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2646
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2647	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2648	- Must happen before
				2649	any following
				2650	global/generic
				2651	load/load
				2652	atomic/store/store
				2653	atomic/atomicrmw.
				2654	- Ensures any
				2655	following global
				2656	data read is no
				2657	older than the
				2658	atomicrmw value
				2659	being acquired.
				2660
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2661	atomicrmw acquire - workgroup - generic 1. flat_atomic
				2662	2. waitcnt lgkmcnt(0)
				2663
				2664	- If OpenCL, omit.
				2665	- Must happen before
				2666	any following
				2667	global/generic
				2668	load/load
				2669	atomic/store/store
				2670	atomic/atomicrmw.
				2671	- Ensures any
				2672	following global
				2673	data read is no
				2674	older than the
				2675	atomicrmw value
				2676	being acquired.
				2677
				2678	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2679	- system 2. s_waitcnt vmcnt(0)
				2680
				2681	- Must happen before
				2682	following
				2683	buffer_wbinvl1_vol.
				2684	- Ensures the
				2685	atomicrmw has
				2686	completed before
				2687	invalidating the
				2688	cache.
				2689
				2690	3. buffer_wbinvl1_vol
				2691
				2692	- Must happen before
				2693	any following
				2694	global/generic
				2695	load/load
				2696	atomic/atomicrmw.
				2697	- Ensures that
				2698	following loads
				2699	will not see stale
				2700	global data.
				2701
				2702	atomicrmw acquire - agent - generic 1. flat_atomic
				2703	- system 2. s_waitcnt vmcnt(0) &
				2704	lgkmcnt(0)
				2705
				2706	- If OpenCL, omit
				2707	lgkmcnt(0).
				2708	- Must happen before
				2709	following
				2710	buffer_wbinvl1_vol.
				2711	- Ensures the
				2712	atomicrmw has
				2713	completed before
				2714	invalidating the
				2715	cache.
				2716
				2717	3. buffer_wbinvl1_vol
				2718
				2719	- Must happen before
				2720	any following
				2721	global/generic
				2722	load/load
				2723	atomic/atomicrmw.
				2724	- Ensures that
				2725	following loads
				2726	will not see stale
				2727	global data.
				2728
				2729	fence acquire - singlethread none none
				2730	- wavefront
				2731	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2732
				2733	- If OpenCL and
				2734	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2735	not generic, omit.
				2736	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2737	currently has no
				2738	address space on
				2739	the fence need to
				2740	conservatively
				2741	always generate. If
				2742	fence had an
				2743	address space then
				2744	set to address
				2745	space of OpenCL
				2746	fence flag, or to
				2747	generic if both
				2748	local and global
				2749	flags are
				2750	specified.
				2751	- Must happen after
				2752	any preceding
				2753	local/generic load
				2754	atomic/atomicrmw
				2755	with an equal or
				2756	wider sync scope
				2757	and memory ordering
				2758	stronger than
				2759	unordered (this is
				2760	termed the
				2761	fence-paired-atomic).
				2762	- Must happen before
				2763	any following
				2764	global/generic
				2765	load/load
				2766	atomic/store/store
				2767	atomic/atomicrmw.
				2768	- Ensures any
				2769	following global
				2770	data read is no
				2771	older than the
				2772	value read by the
				2773	fence-paired-atomic.
				2774
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2775	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				2776	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2777
				2778	- If OpenCL and
				2779	address space is
				2780	not generic, omit
				2781	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2782	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2783	currently has no
				2784	address space on
				2785	the fence need to
				2786	conservatively
				2787	always generate
				2788	(see comment for
				2789	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2790	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2791	separate s_waitcnt
				2792	vmcnt(0) and
				2793	s_waitcnt
				2794	lgkmcnt(0) to allow
				2795	them to be
				2796	independently moved
				2797	according to the
				2798	following rules.
				2799	- s_waitcnt vmcnt(0)
				2800	must happen after
				2801	any preceding
				2802	global/generic load
				2803	atomic/atomicrmw
				2804	with an equal or
				2805	wider sync scope
				2806	and memory ordering
				2807	stronger than
				2808	unordered (this is
				2809	termed the
				2810	fence-paired-atomic).
				2811	- s_waitcnt lgkmcnt(0)
				2812	must happen after
				2813	any preceding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2814	local/generic load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2815	atomic/atomicrmw
				2816	with an equal or
				2817	wider sync scope
				2818	and memory ordering
				2819	stronger than
				2820	unordered (this is
				2821	termed the
				2822	fence-paired-atomic).
				2823	- Must happen before
				2824	the following
				2825	buffer_wbinvl1_vol.
				2826	- Ensures that the
				2827	fence-paired atomic
				2828	has completed
				2829	before invalidating
				2830	the
				2831	cache. Therefore
				2832	any following
				2833	locations read must
				2834	be no older than
				2835	the value read by
				2836	the
				2837	fence-paired-atomic.
				2838
				2839	2. buffer_wbinvl1_vol
				2840
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2841	- Must happen before any
				2842	following global/generic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2843	load/load
				2844	atomic/store/store
				2845	atomic/atomicrmw.
				2846	- Ensures that
				2847	following loads
				2848	will not see stale
				2849	global data.
				2850
				2851	Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2852	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2853	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2854	- wavefront - local
				2855	- generic
				2856	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2857
				2858	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2859	- Must happen after
				2860	any preceding
				2861	local/generic
				2862	load/store/load
				2863	atomic/store
				2864	atomic/atomicrmw.
				2865	- Must happen before
				2866	the following
				2867	store.
				2868	- Ensures that all
				2869	memory operations
				2870	to local have
				2871	completed before
				2872	performing the
				2873	store that is being
				2874	released.
				2875
				2876	2. buffer/global/flat_store
				2877	store atomic release - workgroup - local 1. ds_store
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2878	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2879
				2880	- If OpenCL, omit.
				2881	- Must happen after
				2882	any preceding
				2883	local/generic
				2884	load/store/load
				2885	atomic/store
				2886	atomic/atomicrmw.
				2887	- Must happen before
				2888	the following
				2889	store.
				2890	- Ensures that all
				2891	memory operations
				2892	to local have
				2893	completed before
				2894	performing the
				2895	store that is being
				2896	released.
				2897
				2898	2. flat_store
				2899	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2900	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2901
				2902	- If OpenCL, omit
				2903	lgkmcnt(0).
				2904	- Could be split into
				2905	separate s_waitcnt
				2906	vmcnt(0) and
				2907	s_waitcnt
				2908	lgkmcnt(0) to allow
				2909	them to be
				2910	independently moved
				2911	according to the
				2912	following rules.
				2913	- s_waitcnt vmcnt(0)
				2914	must happen after
				2915	any preceding
				2916	global/generic
				2917	load/store/load
				2918	atomic/store
				2919	atomic/atomicrmw.
				2920	- s_waitcnt lgkmcnt(0)
				2921	must happen after
				2922	any preceding
				2923	local/generic
				2924	load/store/load
				2925	atomic/store
				2926	atomic/atomicrmw.
				2927	- Must happen before
				2928	the following
				2929	store.
				2930	- Ensures that all
				2931	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2932	to memory have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2933	completed before
				2934	performing the
				2935	store that is being
				2936	released.
				2937
				2938	2. buffer/global/ds/flat_store
				2939	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2940	- wavefront - local
				2941	- generic
				2942	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2943
				2944	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2945	- Must happen after
				2946	any preceding
				2947	local/generic
				2948	load/store/load
				2949	atomic/store
				2950	atomic/atomicrmw.
				2951	- Must happen before
				2952	the following
				2953	atomicrmw.
				2954	- Ensures that all
				2955	memory operations
				2956	to local have
				2957	completed before
				2958	performing the
				2959	atomicrmw that is
				2960	being released.
				2961
				2962	2. buffer/global/flat_atomic
				2963	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2964	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2965
				2966	- If OpenCL, omit.
				2967	- Must happen after
				2968	any preceding
				2969	local/generic
				2970	load/store/load
				2971	atomic/store
				2972	atomic/atomicrmw.
				2973	- Must happen before
				2974	the following
				2975	atomicrmw.
				2976	- Ensures that all
				2977	memory operations
				2978	to local have
				2979	completed before
				2980	performing the
				2981	atomicrmw that is
				2982	being released.
				2983
				2984	2. flat_atomic
				2985	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2986	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2987
				2988	- If OpenCL, omit
				2989	lgkmcnt(0).
				2990	- Could be split into
				2991	separate s_waitcnt
				2992	vmcnt(0) and
				2993	s_waitcnt
				2994	lgkmcnt(0) to allow
				2995	them to be
				2996	independently moved
				2997	according to the
				2998	following rules.
				2999	- s_waitcnt vmcnt(0)
				3000	must happen after
				3001	any preceding
				3002	global/generic
				3003	load/store/load
				3004	atomic/store
				3005	atomic/atomicrmw.
				3006	- s_waitcnt lgkmcnt(0)
				3007	must happen after
				3008	any preceding
				3009	local/generic
				3010	load/store/load
				3011	atomic/store
				3012	atomic/atomicrmw.
				3013	- Must happen before
				3014	the following
				3015	atomicrmw.
				3016	- Ensures that all
				3017	memory operations
				3018	to global and local
				3019	have completed
				3020	before performing
				3021	the atomicrmw that
				3022	is being released.
				3023
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3024	2. buffer/global/ds/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3025	fence release - singlethread none none
				3026	- wavefront
				3027	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				3028
				3029	- If OpenCL and
				3030	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3031	not generic, omit.
				3032	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3033	currently has no
				3034	address space on
				3035	the fence need to
				3036	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3037	always generate. If
				3038	fence had an
				3039	address space then
				3040	set to address
				3041	space of OpenCL
				3042	fence flag, or to
				3043	generic if both
				3044	local and global
				3045	flags are
				3046	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3047	- Must happen after
				3048	any preceding
				3049	local/generic
				3050	load/load
				3051	atomic/store/store
				3052	atomic/atomicrmw.
				3053	- Must happen before
				3054	any following store
				3055	atomic/atomicrmw
				3056	with an equal or
				3057	wider sync scope
				3058	and memory ordering
				3059	stronger than
				3060	unordered (this is
				3061	termed the
				3062	fence-paired-atomic).
				3063	- Ensures that all
				3064	memory operations
				3065	to local have
				3066	completed before
				3067	performing the
				3068	following
				3069	fence-paired-atomic.
				3070
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3071	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				3072	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3073
				3074	- If OpenCL and
				3075	address space is
				3076	not generic, omit
				3077	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3078	- If OpenCL and
				3079	address space is
				3080	local, omit
				3081	vmcnt(0).
				3082	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3083	currently has no
				3084	address space on
				3085	the fence need to
				3086	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3087	always generate. If
				3088	fence had an
				3089	address space then
				3090	set to address
				3091	space of OpenCL
				3092	fence flag, or to
				3093	generic if both
				3094	local and global
				3095	flags are
				3096	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3097	- Could be split into
				3098	separate s_waitcnt
				3099	vmcnt(0) and
				3100	s_waitcnt
				3101	lgkmcnt(0) to allow
				3102	them to be
				3103	independently moved
				3104	according to the
				3105	following rules.
				3106	- s_waitcnt vmcnt(0)
				3107	must happen after
				3108	any preceding
				3109	global/generic
				3110	load/store/load
				3111	atomic/store
				3112	atomic/atomicrmw.
				3113	- s_waitcnt lgkmcnt(0)
				3114	must happen after
				3115	any preceding
				3116	local/generic
				3117	load/store/load
				3118	atomic/store
				3119	atomic/atomicrmw.
				3120	- Must happen before
				3121	any following store
				3122	atomic/atomicrmw
				3123	with an equal or
				3124	wider sync scope
				3125	and memory ordering
				3126	stronger than
				3127	unordered (this is
				3128	termed the
				3129	fence-paired-atomic).
				3130	- Ensures that all
				3131	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3132	have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3133	completed before
				3134	performing the
				3135	following
				3136	fence-paired-atomic.
				3137
				3138	Acquire-Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3139	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3140	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3141	- wavefront - local
				3142	- generic
				3143	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3144
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3145	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3146	- Must happen after
				3147	any preceding
				3148	local/generic
				3149	load/store/load
				3150	atomic/store
				3151	atomic/atomicrmw.
				3152	- Must happen before
				3153	the following
				3154	atomicrmw.
				3155	- Ensures that all
				3156	memory operations
				3157	to local have
				3158	completed before
				3159	performing the
				3160	atomicrmw that is
				3161	being released.
				3162
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3163	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3164	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3165	2. s_waitcnt lgkmcnt(0)
				3166
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3167	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3168	- Must happen before
				3169	any following
				3170	global/generic
				3171	load/load
				3172	atomic/store/store
				3173	atomic/atomicrmw.
				3174	- Ensures any
				3175	following global
				3176	data read is no
				3177	older than the load
				3178	atomic value being
				3179	acquired.
				3180
				3181	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3182
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3183	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3184	- Must happen after
				3185	any preceding
				3186	local/generic
				3187	load/store/load
				3188	atomic/store
				3189	atomic/atomicrmw.
				3190	- Must happen before
				3191	the following
				3192	atomicrmw.
				3193	- Ensures that all
				3194	memory operations
				3195	to local have
				3196	completed before
				3197	performing the
				3198	atomicrmw that is
				3199	being released.
				3200
				3201	2. flat_atomic
				3202	3. s_waitcnt lgkmcnt(0)
				3203
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3204	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3205	- Must happen before
				3206	any following
				3207	global/generic
				3208	load/load
				3209	atomic/store/store
				3210	atomic/atomicrmw.
				3211	- Ensures any
				3212	following global
				3213	data read is no
				3214	older than the load
				3215	atomic value being
				3216	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3217
				3218	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3219	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3220
				3221	- If OpenCL, omit
				3222	lgkmcnt(0).
				3223	- Could be split into
				3224	separate s_waitcnt
				3225	vmcnt(0) and
				3226	s_waitcnt
				3227	lgkmcnt(0) to allow
				3228	them to be
				3229	independently moved
				3230	according to the
				3231	following rules.
				3232	- s_waitcnt vmcnt(0)
				3233	must happen after
				3234	any preceding
				3235	global/generic
				3236	load/store/load
				3237	atomic/store
				3238	atomic/atomicrmw.
				3239	- s_waitcnt lgkmcnt(0)
				3240	must happen after
				3241	any preceding
				3242	local/generic
				3243	load/store/load
				3244	atomic/store
				3245	atomic/atomicrmw.
				3246	- Must happen before
				3247	the following
				3248	atomicrmw.
				3249	- Ensures that all
				3250	memory operations
				3251	to global have
				3252	completed before
				3253	performing the
				3254	atomicrmw that is
				3255	being released.
				3256
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3257	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3258	3. s_waitcnt vmcnt(0)
				3259
				3260	- Must happen before
				3261	following
				3262	buffer_wbinvl1_vol.
				3263	- Ensures the
				3264	atomicrmw has
				3265	completed before
				3266	invalidating the
				3267	cache.
				3268
				3269	4. buffer_wbinvl1_vol
				3270
				3271	- Must happen before
				3272	any following
				3273	global/generic
				3274	load/load
				3275	atomic/atomicrmw.
				3276	- Ensures that
				3277	following loads
				3278	will not see stale
				3279	global data.
				3280
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3281	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3282	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3283
				3284	- If OpenCL, omit
				3285	lgkmcnt(0).
				3286	- Could be split into
				3287	separate s_waitcnt
				3288	vmcnt(0) and
				3289	s_waitcnt
				3290	lgkmcnt(0) to allow
				3291	them to be
				3292	independently moved
				3293	according to the
				3294	following rules.
				3295	- s_waitcnt vmcnt(0)
				3296	must happen after
				3297	any preceding
				3298	global/generic
				3299	load/store/load
				3300	atomic/store
				3301	atomic/atomicrmw.
				3302	- s_waitcnt lgkmcnt(0)
				3303	must happen after
				3304	any preceding
				3305	local/generic
				3306	load/store/load
				3307	atomic/store
				3308	atomic/atomicrmw.
				3309	- Must happen before
				3310	the following
				3311	atomicrmw.
				3312	- Ensures that all
				3313	memory operations
				3314	to global have
				3315	completed before
				3316	performing the
				3317	atomicrmw that is
				3318	being released.
				3319
				3320	2. flat_atomic
				3321	3. s_waitcnt vmcnt(0) &
				3322	lgkmcnt(0)
				3323
				3324	- If OpenCL, omit
				3325	lgkmcnt(0).
				3326	- Must happen before
				3327	following
				3328	buffer_wbinvl1_vol.
				3329	- Ensures the
				3330	atomicrmw has
				3331	completed before
				3332	invalidating the
				3333	cache.
				3334
				3335	4. buffer_wbinvl1_vol
				3336
				3337	- Must happen before
				3338	any following
				3339	global/generic
				3340	load/load
				3341	atomic/atomicrmw.
				3342	- Ensures that
				3343	following loads
				3344	will not see stale
				3345	global data.
				3346
				3347	fence acq_rel - singlethread none none
				3348	- wavefront
				3349	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3350
				3351	- If OpenCL and
				3352	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3353	not generic, omit.
				3354	- However,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3355	since LLVM
				3356	currently has no
				3357	address space on
				3358	the fence need to
				3359	conservatively
				3360	always generate
				3361	(see comment for
				3362	previous fence).
				3363	- Must happen after
				3364	any preceding
				3365	local/generic
				3366	load/load
				3367	atomic/store/store
				3368	atomic/atomicrmw.
				3369	- Must happen before
				3370	any following
				3371	global/generic
				3372	load/load
				3373	atomic/store/store
				3374	atomic/atomicrmw.
				3375	- Ensures that all
				3376	memory operations
				3377	to local have
				3378	completed before
				3379	performing any
				3380	following global
				3381	memory operations.
				3382	- Ensures that the
				3383	preceding
				3384	local/generic load
				3385	atomic/atomicrmw
				3386	with an equal or
				3387	wider sync scope
				3388	and memory ordering
				3389	stronger than
				3390	unordered (this is
				3391	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3392	acquire-fence-paired-atomic
				3393	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3394	before following
				3395	global memory
				3396	operations. This
				3397	satisfies the
				3398	requirements of
				3399	acquire.
				3400	- Ensures that all
				3401	previous memory
				3402	operations have
				3403	completed before a
				3404	following
				3405	local/generic store
				3406	atomic/atomicrmw
				3407	with an equal or
				3408	wider sync scope
				3409	and memory ordering
				3410	stronger than
				3411	unordered (this is
				3412	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3413	release-fence-paired-atomic
				3414	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3415	requirements of
				3416	release.
				3417
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3418	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				3419	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3420
				3421	- If OpenCL and
				3422	address space is
				3423	not generic, omit
				3424	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3425	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3426	currently has no
				3427	address space on
				3428	the fence need to
				3429	conservatively
				3430	always generate
				3431	(see comment for
				3432	previous fence).
				3433	- Could be split into
				3434	separate s_waitcnt
				3435	vmcnt(0) and
				3436	s_waitcnt
				3437	lgkmcnt(0) to allow
				3438	them to be
				3439	independently moved
				3440	according to the
				3441	following rules.
				3442	- s_waitcnt vmcnt(0)
				3443	must happen after
				3444	any preceding
				3445	global/generic
				3446	load/store/load
				3447	atomic/store
				3448	atomic/atomicrmw.
				3449	- s_waitcnt lgkmcnt(0)
				3450	must happen after
				3451	any preceding
				3452	local/generic
				3453	load/store/load
				3454	atomic/store
				3455	atomic/atomicrmw.
				3456	- Must happen before
				3457	the following
				3458	buffer_wbinvl1_vol.
				3459	- Ensures that the
				3460	preceding
				3461	global/local/generic
				3462	load
				3463	atomic/atomicrmw
				3464	with an equal or
				3465	wider sync scope
				3466	and memory ordering
				3467	stronger than
				3468	unordered (this is
				3469	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3470	acquire-fence-paired-atomic
				3471	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3472	before invalidating
				3473	the cache. This
				3474	satisfies the
				3475	requirements of
				3476	acquire.
				3477	- Ensures that all
				3478	previous memory
				3479	operations have
				3480	completed before a
				3481	following
				3482	global/local/generic
				3483	store
				3484	atomic/atomicrmw
				3485	with an equal or
				3486	wider sync scope
				3487	and memory ordering
				3488	stronger than
				3489	unordered (this is
				3490	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3491	release-fence-paired-atomic
				3492	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3493	requirements of
				3494	release.
				3495
				3496	2. buffer_wbinvl1_vol
				3497
				3498	- Must happen before
				3499	any following
				3500	global/generic
				3501	load/load
				3502	atomic/store/store
				3503	atomic/atomicrmw.
				3504	- Ensures that
				3505	following loads
				3506	will not see stale
				3507	global data. This
				3508	satisfies the
				3509	requirements of
				3510	acquire.
				3511
				3512	Sequential Consistent Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3513	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3514	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3515	- wavefront - local load atomic acquire,
				3516	- generic except must generated
				3517	all instructions even
				3518	for OpenCL.*
				3519	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3520	- generic
				3521	- Must
				3522	happen after
				3523	preceding
				3524	global/generic load
				3525	atomic/store
				3526	atomic/atomicrmw
				3527	with memory
				3528	ordering of seq_cst
				3529	and with equal or
				3530	wider sync scope.
				3531	(Note that seq_cst
				3532	fences have their
				3533	own s_waitcnt
				3534	lgkmcnt(0) and so do
				3535	not need to be
				3536	considered.)
				3537	- Ensures any
				3538	preceding
				3539	sequential
				3540	consistent local
				3541	memory instructions
				3542	have completed
				3543	before executing
				3544	this sequentially
				3545	consistent
				3546	instruction. This
				3547	prevents reordering
				3548	a seq_cst store
				3549	followed by a
				3550	seq_cst load. (Note
				3551	that seq_cst is
				3552	stronger than
				3553	acquire/release as
				3554	the reordering of
				3555	load acquire
				3556	followed by a store
				3557	release is
				3558	prevented by the
				3559	waitcnt of
				3560	the release, but
				3561	there is nothing
				3562	preventing a store
				3563	release followed by
				3564	load acquire from
				3565	competing out of
				3566	order.)
				3567
				3568	2. *Following
				3569	instructions same as
				3570	corresponding load
				3571	atomic acquire,
				3572	except must generated
				3573	all instructions even
				3574	for OpenCL.*
				3575	load atomic seq_cst - workgroup - local *Same as corresponding
				3576	load atomic acquire,
				3577	except must generated
				3578	all instructions even
				3579	for OpenCL.*
				3580	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				3581	- system - generic vmcnt(0)
				3582
				3583	- Could be split into
				3584	separate s_waitcnt
				3585	vmcnt(0)
				3586	and s_waitcnt
				3587	lgkmcnt(0) to allow
				3588	them to be
				3589	independently moved
				3590	according to the
				3591	following rules.
				3592	- waitcnt lgkmcnt(0)
				3593	must happen after
				3594	preceding
				3595	global/generic load
				3596	atomic/store
				3597	atomic/atomicrmw
				3598	with memory
				3599	ordering of seq_cst
				3600	and with equal or
				3601	wider sync scope.
				3602	(Note that seq_cst
				3603	fences have their
				3604	own s_waitcnt
				3605	lgkmcnt(0) and so do
				3606	not need to be
				3607	considered.)
				3608	- waitcnt vmcnt(0)
				3609	must happen after
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3610	preceding
				3611	global/generic load
				3612	atomic/store
				3613	atomic/atomicrmw
				3614	with memory
				3615	ordering of seq_cst
				3616	and with equal or
				3617	wider sync scope.
				3618	(Note that seq_cst
				3619	fences have their
				3620	own s_waitcnt
				3621	vmcnt(0) and so do
				3622	not need to be
				3623	considered.)
				3624	- Ensures any
				3625	preceding
				3626	sequential
				3627	consistent global
				3628	memory instructions
				3629	have completed
				3630	before executing
				3631	this sequentially
				3632	consistent
				3633	instruction. This
				3634	prevents reordering
				3635	a seq_cst store
				3636	followed by a
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3637	seq_cst load. (Note
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3638	that seq_cst is
				3639	stronger than
				3640	acquire/release as
				3641	the reordering of
				3642	load acquire
				3643	followed by a store
				3644	release is
				3645	prevented by the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3646	waitcnt of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3647	the release, but
				3648	there is nothing
				3649	preventing a store
				3650	release followed by
				3651	load acquire from
				3652	competing out of
				3653	order.)
				3654
				3655	2. *Following
				3656	instructions same as
				3657	corresponding load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3658	atomic acquire,
				3659	except must generated
				3660	all instructions even
				3661	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3662	store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3663	- wavefront - local store atomic release,
				3664	- workgroup - generic except must generated
				3665	all instructions even
				3666	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3667	store atomic seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3668	- system - generic store atomic release,
				3669	except must generated
				3670	all instructions even
				3671	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3672	atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3673	- wavefront - local atomicrmw acq_rel,
				3674	- workgroup - generic except must generated
				3675	all instructions even
				3676	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3677	atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3678	- system - generic atomicrmw acq_rel,
				3679	except must generated
				3680	all instructions even
				3681	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3682	fence seq_cst - singlethread none *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3683	- wavefront fence acq_rel,
				3684	- workgroup except must generated
				3685	- agent all instructions even
				3686	- system for OpenCL.*
				3687	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3688
				3689	The memory order also adds the single thread optimization constrains defined in
				3690	table
				3691	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3692
				3693	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3694	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3695
				3696	============ ==============================================================
				3697	LLVM Memory Optimization Constraints
				3698	Ordering
				3699	============ ==============================================================
				3700	unordered none
				3701	monotonic none
				3702	acquire - If a load atomic/atomicrmw then no following load/load
				3703	atomic/store/ store atomic/atomicrmw/fence instruction can
				3704	be moved before the acquire.
				3705	- If a fence then same as load atomic, plus no preceding
				3706	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3707	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3708	atomic/store/ store atomic/atomicrmw/fence instruction can
				3709	be moved after the release.
				3710	- If a fence then same as store atomic, plus no following
				3711	associated fence-paired-atomic can be moved before the
				3712	fence.
				3713	acq_rel Same constraints as both acquire and release.
				3714	seq_cst - If a load atomic then same constraints as acquire, plus no
				3715	preceding sequentially consistent load atomic/store
				3716	atomic/atomicrmw/fence instruction can be moved after the
				3717	seq_cst.
				3718	- If a store atomic then the same constraints as release, plus
				3719	no following sequentially consistent load atomic/store
				3720	atomic/atomicrmw/fence instruction can be moved before the
				3721	seq_cst.
				3722	- If an atomicrmw/fence then same constraints as acq_rel.
				3723	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3724
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3725	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3726	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3727
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3728	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3729	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3730	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3731
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3732	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3733	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3734
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3735	=================== =============== =============== =======================
				3736	Usage Code Sequence Trap Handler Description
				3737	Inputs
				3738	=================== =============== =============== =======================
				3739	reserved ``s_trap 0x00`` Reserved by hardware.
				3740	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3741	``queue_ptr`` ``debugtrap``
				3742	``VGPR0``: intrinsic (not
				3743	``arg`` implemented).
				3744	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3745	``queue_ptr`` terminated and its
				3746	associated queue put
				3747	into the error state.
				3748	``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
				3749	``queue_ptr`` installed handled
				3750	same as ``llvm.trap``.
				3751	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
				3752	breakpoints.
				3753	debugger ``s_trap 0x08`` Reserved for debugger.
				3754	debugger ``s_trap 0xfe`` Reserved for debugger.
				3755	debugger ``s_trap 0xff`` Reserved for debugger.
				3756	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3757
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3758	Unspecified OS
				3759	--------------
				3760
				3761	This section provides code conventions used when the target triple OS is
				3762	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3763
				3764	Trap Handler ABI
				3765	~~~~~~~~~~~~~~~~
				3766
				3767	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3768	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3769	instructions are handled as follows:
				3770
				3771	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3772	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3773
				3774	=============== =============== ===========================================
				3775	Usage Code Sequence Description
				3776	=============== =============== ===========================================
				3777	llvm.trap s_endpgm Causes wavefront to be terminated.
				3778	llvm.debugtrap none Compiler warning given that there is no
				3779	trap handler installed.
				3780	=============== =============== ===========================================
				3781
				3782	Source Languages
				3783	================
				3784
				3785	.. _amdgpu-opencl:
				3786
				3787	OpenCL
				3788	------
				3789
				3790	When generating code for the OpenCL language the target triple environment
				3791	should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
				3792
				3793	When the language is OpenCL the following differences occur:
				3794
				3795	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3796	2. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3797	3. Additional metadata is generated
				3798	(:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3799
				3800	.. TODO
				3801	Specify what affect this has. Hidden arguments added. Additional metadata
				3802	generated.
				3803
				3804	.. _amdgpu-hcc:
				3805
				3806	HCC
				3807	---
				3808
				3809	When generating code for the OpenCL language the target triple environment
				3810	should be ``hcc`` (see :ref:`amdgpu-target-triples`).
				3811
				3812	When the language is OpenCL the following differences occur:
				3813
				3814	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3815
				3816	.. TODO
				3817	Specify what affect this has.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	3818
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3819	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3820	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3821
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3822	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	3823	It supports AMDGCN GFX6-GFX9.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3824
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	3825	This section describes general syntax for instructions and operands.
				3826
				3827	Instructions
				3828	~~~~~~~~~~~~
				3829
				3830	.. toctree::
				3831	:hidden:
				3832
				3833	AMDGPUAsmGFX7
				3834	AMDGPUAsmGFX8
				3835	AMDGPUAsmGFX9
				3836	AMDGPUOperandSyntax
				3837
				3838	An instruction has the following syntax:
				3839
				3840	<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...
				3841
				3842	Note that operands are normally comma-separated while modifiers are space-separated.
				3843
				3844	The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
				3845
				3846	See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
				3847	:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
				3848
				3849	Note that features under development are not included in this description.
				3850
				3851	For more information about instructions, their semantics and supported combinations of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3852	operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	3853	[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3854
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3855	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3856	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3857
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3858	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3859
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3860	* SGPR registers: s0, ... or s[0], ...
				3861	* VGPR registers: v0, ... or v[0], ...
				3862	* TTMP registers: ttmp0, ... or ttmp[0], ...
				3863	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				3864	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				3865	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				3866	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				3867	* Register index expressions: v[2*2], s[1-1:2-1]
				3868	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3869
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	3870	Modifiers
				3871	~~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3872
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	3873	Detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3874
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3875	Instruction Examples
				3876	~~~~~~~~~~~~~~~~~~~~
				3877
				3878	DS
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	3879	++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3880
				3881	.. code-block:: nasm
				3882
				3883	ds_add_u32 v2, v4 offset:16
				3884	ds_write_src2_b64 v2 offset0:4 offset1:8
				3885	ds_cmpst_f32 v2, v4, v6
				3886	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				3887
				3888
				3889	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				3890
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3891	FLAT
				3892	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3893
				3894	.. code-block:: nasm
				3895
				3896	flat_load_dword v1, v[3:4]
				3897	flat_store_dwordx3 v[3:4], v[5:7]
				3898	flat_atomic_swap v1, v[3:4], v5 glc
				3899	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				3900	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				3901
				3902	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				3903
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3904	MUBUF
				3905	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3906
				3907	.. code-block:: nasm
				3908
				3909	buffer_load_dword v1, off, s[4:7], s1
				3910	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				3911	buffer_store_format_xy v[1:2], off, s[4:7], s1
				3912	buffer_wbinvl1
				3913	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				3914
				3915	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				3916
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3917	SMRD/SMEM
				3918	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3919
				3920	.. code-block:: nasm
				3921
				3922	s_load_dword s1, s[2:3], 0xfc
				3923	s_load_dwordx8 s[8:15], s[2:3], s4
				3924	s_load_dwordx16 s[88:103], s[2:3], s4
				3925	s_dcache_inv_vol
				3926	s_memtime s[4:5]
				3927
				3928	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				3929
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3930	SOP1
				3931	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3932
				3933	.. code-block:: nasm
				3934
				3935	s_mov_b32 s1, s2
				3936	s_mov_b64 s[0:1], 0x80000000
				3937	s_cmov_b32 s1, 200
				3938	s_wqm_b64 s[2:3], s[4:5]
				3939	s_bcnt0_i32_b64 s1, s[2:3]
				3940	s_swappc_b64 s[2:3], s[4:5]
				3941	s_cbranch_join s[4:5]
				3942
				3943	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				3944
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3945	SOP2
				3946	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3947
				3948	.. code-block:: nasm
				3949
				3950	s_add_u32 s1, s2, s3
				3951	s_and_b64 s[2:3], s[4:5], s[6:7]
				3952	s_cselect_b32 s1, s2, s3
				3953	s_andn2_b32 s2, s4, s6
				3954	s_lshr_b64 s[2:3], s[4:5], s6
				3955	s_ashr_i32 s2, s4, s6
				3956	s_bfm_b64 s[2:3], s4, s6
				3957	s_bfe_i64 s[2:3], s[4:5], s6
				3958	s_cbranch_g_fork s[4:5], s[6:7]
				3959
				3960	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				3961
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3962	SOPC
				3963	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3964
				3965	.. code-block:: nasm
				3966
				3967	s_cmp_eq_i32 s1, s2
				3968	s_bitcmp1_b32 s1, s2
				3969	s_bitcmp0_b64 s[2:3], s4
				3970	s_setvskip s3, s5
				3971
				3972	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				3973
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3974	SOPP
				3975	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3976
				3977	.. code-block:: nasm
				3978
				3979	s_barrier
				3980	s_nop 2
				3981	s_endpgm
				3982	s_waitcnt 0 ; Wait for all counters to be 0
				3983	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				3984	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				3985	s_sethalt 9
				3986	s_sleep 10
				3987	s_sendmsg 0x1
				3988	s_sendmsg sendmsg(MSG_INTERRUPT)
				3989	s_trap 1
				3990
				3991	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				3992
				3993	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	3994	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3995	range or acceptable values.
				3996
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3997	VALU
				3998	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3999
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4000	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				4001	the assembler will automatically use optimal encoding based on its operands.
				4002	To force specific encoding, one can add a suffix to the opcode of the instruction:
				4003
				4004	* _e32 for 32-bit VOP1/VOP2/VOPC
				4005	* _e64 for 64-bit VOP3
				4006	* _dpp for VOP_DPP
				4007	* _sdwa for VOP_SDWA
				4008
				4009	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4010
				4011	.. code-block:: nasm
				4012
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4013	v_mov_b32 v1, v2
				4014	v_mov_b32_e32 v1, v2
				4015	v_nop
				4016	v_cvt_f64_i32_e32 v[1:2], v2
				4017	v_floor_f32_e32 v1, v2
				4018	v_bfrev_b32_e32 v1, v2
				4019	v_add_f32_e32 v1, v2, v3
				4020	v_mul_i32_i24_e64 v1, v2, 3
				4021	v_mul_i32_i24_e32 v1, -3, v3
				4022	v_mul_i32_i24_e32 v1, -100, v3
				4023	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				4024	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4025
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4026	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4027
				4028	.. code-block:: nasm
				4029
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4030	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				4031	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4032	v_mov_b32 v0, v0 wave_shl:1
				4033	v_mov_b32 v0, v0 row_mirror
				4034	v_mov_b32 v0, v0 row_bcast:31
				4035	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4036	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4037	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4038
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4039	VOP_SDWA examples:
				4040
				4041	.. code-block:: nasm
				4042
				4043	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				4044	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				4045	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				4046	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				4047	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				4048
				4049	For full list of supported instructions, refer to "Vector ALU instructions".
				4050
				4051	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4052	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4053
				4054	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				4055	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4056
				4057	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4058	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4059
				4060	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4061	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4062
				4063	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4064	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				4065
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4066
				4067	major, minor, and stepping are all integers that describe the instruction
				4068	set architecture (ISA) version of the assembly program.
				4069
				4070	vendor and arch are quoted strings. vendor should always be equal to
				4071	"AMD" and arch should always be equal to "AMDGPU".
				4072
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4073	By default, the assembler will derive the ISA version, vendor, and arch
				4074	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4075
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4076	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4077	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4078
				4079	This directives specifies that the symbol with given name is a kernel entry point
				4080	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4081
				4082	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4083	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4084
				4085	This directive marks the beginning of a list of key / value pairs that are used
				4086	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				4087	The list must be terminated by the .end_amd_kernel_code_t directive. For
				4088	any amd_kernel_code_t values that are unspecified a default value will be
				4089	used. The default value for all keys is 0, with the following exceptions:
				4090
				4091	- kernel_code_version_major defaults to 1.
				4092	- machine_kind defaults to 1.
				4093	- machine_version_major, machine_version_minor, and
				4094	machine_version_stepping are derived from the value of the -mcpu option
				4095	that is passed to the assembler.
				4096	- kernel_code_entry_byte_offset defaults to 256.
				4097	- wavefront_size defaults to 6.
				4098	- kernarg_segment_alignment, group_segment_alignment, and
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	4099	private_segment_alignment default to 4. Note that alignments are specified
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4100	as a power of two, so a value of n means an alignment of 2^ n.
				4101
				4102	The .amd_kernel_code_t directive must be placed immediately after the
				4103	function label and before any instructions.
				4104
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4105	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				4106	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4107
				4108	Here is an example of a minimal amd_kernel_code_t specification:
				4109
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	4110	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4111
				4112	.hsa_code_object_version 1,0
				4113	.hsa_code_object_isa
				4114
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4115	.hsatext
				4116	.globl hello_world
				4117	.p2align 8
				4118	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4119
				4120	hello_world:
				4121
				4122	.amd_kernel_code_t
				4123	enable_sgpr_kernarg_segment_ptr = 1
				4124	is_ptr64 = 1
				4125	compute_pgm_rsrc1_vgprs = 0
				4126	compute_pgm_rsrc1_sgprs = 0
				4127	compute_pgm_rsrc2_user_sgpr = 2
				4128	kernarg_segment_byte_size = 8
				4129	wavefront_sgpr_count = 2
				4130	workitem_vgpr_count = 3
				4131	.end_amd_kernel_code_t
				4132
				4133	s_load_dwordx2 s[0:1], s[0:1] 0x0
				4134	v_mov_b32 v0, 3.14159
				4135	s_waitcnt lgkmcnt(0)
				4136	v_mov_b32 v1, s0
				4137	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4138	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4139	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	4140	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4141	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4142
				4143	Additional Documentation
				4144	========================
				4145
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4146	.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				4147	.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				4148	.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				4149	.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				4150	.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				4151	.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				4152	.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
				4153	.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4154	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				4155	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				4156	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				4157	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				4158	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	4159	.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4160	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				4161	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__