=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, from the R600
family up to the current GCN families. It lives in the
``lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
specify the target triple:

.. table:: AMDGPU Architectures
   :name: amdgpu-architecture-table

   ============ ==============================================================
   Architecture Description
   ============ ==============================================================
   ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
   ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
   ============ ==============================================================

.. table:: AMDGPU Vendors
   :name: amdgpu-vendor-table

   ============ ==============================================================
   Vendor       Description
   ============ ==============================================================
   ``amd``      Can be used for all AMD GPU usage.
   ``mesa3d``   Can be used if the OS is ``mesa3d``.
   ============ ==============================================================

.. table:: AMDGPU Operating Systems
   :name: amdgpu-os-table

   ============== ============================================================
   OS             Description
   ============== ============================================================
   <empty>        Defaults to the unknown OS.
   ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                  such as AMD's ROCm [AMD-ROCm]_.
   ``amdpal``     Graphics shaders and compute kernels executed on AMD PAL
                  runtime.
   ``mesa3d``     Graphics shaders and compute kernels executed on Mesa 3D
                  runtime.
   ============== ============================================================

.. table:: AMDGPU Environments
   :name: amdgpu-environment-table

   ============ ==============================================================
   Environment  Description
   ============ ==============================================================
   <empty>      Default.
   ============ ==============================================================
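
As a rough illustration of how the four components combine, the sketch below
joins values taken from the tables above into a triple string. The helper name
``make_triple`` is hypothetical, and dropping trailing empty components is an
assumption made for readability; it is not a statement about how clang
normalizes triples.

```python
# Component values copied from the tables above. The helper itself is a
# hypothetical illustration and is not part of clang or LLVM.
ARCHITECTURES = ("r600", "amdgcn")
VENDORS = ("amd", "mesa3d")
OPERATING_SYSTEMS = ("", "amdhsa", "amdpal", "mesa3d")


def make_triple(arch, vendor="amd", os_name="", environment=""):
    """Build <Architecture>-<Vendor>-<OS>-<Environment> from documented values."""
    if arch not in ARCHITECTURES:
        raise ValueError(f"unknown architecture: {arch!r}")
    if vendor not in VENDORS:
        raise ValueError(f"unknown vendor: {vendor!r}")
    if os_name not in OPERATING_SYSTEMS:
        raise ValueError(f"unknown OS: {os_name!r}")
    if environment != "":
        raise ValueError("only the empty (default) environment is documented")
    # The vendor table only documents ``mesa3d`` together with the mesa3d OS.
    if vendor == "mesa3d" and os_name != "mesa3d":
        raise ValueError("vendor mesa3d is only documented for OS mesa3d")
    # Assumption: trailing empty components are simply omitted.
    return "-".join([arch, vendor, os_name, environment]).rstrip("-")


print(make_triple("amdgcn", os_name="amdhsa"))
```

The resulting string would be passed to ``clang -target``.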

.. _amdgpu-processors:

Processors
----------

Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor.
Names from either the Processor or the Alternative Processor column can be
used.

.. table:: AMDGPU Processors
   :name: amdgpu-processor-table

   =========== =============== ============ ===== ========= ======= ==================
   Processor   Alternative     Target       dGPU/ Target    ROCm    Example
               Processor       Triple       APU   Features  Support Products
                               Architecture       Supported
                                                  [Default]
   =========== =============== ============ ===== ========= ======= ==================
   Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
   -----------------------------------------------------------------------------------
   ``r600``                    ``r600``     dGPU
   ``r630``                    ``r600``     dGPU
   ``rs880``                   ``r600``     dGPU
   ``rv670``                   ``r600``     dGPU
   Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
   -----------------------------------------------------------------------------------
   ``rv710``                   ``r600``     dGPU
   ``rv730``                   ``r600``     dGPU
   ``rv770``                   ``r600``     dGPU
   Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
   -----------------------------------------------------------------------------------
   ``cedar``                   ``r600``     dGPU
   ``cypress``                 ``r600``     dGPU
   ``juniper``                 ``r600``     dGPU
   ``redwood``                 ``r600``     dGPU
   ``sumo``                    ``r600``     dGPU
   Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
   -----------------------------------------------------------------------------------
   ``barts``                   ``r600``     dGPU
   ``caicos``                  ``r600``     dGPU
   ``cayman``                  ``r600``     dGPU
   ``turks``                   ``r600``     dGPU
   GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
   -----------------------------------------------------------------------------------
   ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
   ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
               - ``oland``
               - ``pitcairn``
               - ``verde``
   GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
   -----------------------------------------------------------------------------------
   ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - A6-7000
                                                                    - A6 Pro-7050B
                                                                    - A8-7100
                                                                    - A8 Pro-7150B
                                                                    - A10-7300
                                                                    - A10 Pro-7350B
                                                                    - FX-7500
                                                                    - A8-7200P
                                                                    - A10-7400P
                                                                    - FX-7600P
   ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU            ROCm    - FirePro W8100
                                                                    - FirePro W9100
                                                                    - FirePro S9150
                                                                    - FirePro S9170
   ``gfx702``                  ``amdgcn``   dGPU            ROCm    - Radeon R9 290
                                                                    - Radeon R9 290x
                                                                    - Radeon R390
                                                                    - Radeon R390x
   ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - E1-2100
               - ``mullins``                                        - E1-2200
                                                                    - E1-2500
                                                                    - E2-3000
                                                                    - E2-3800
                                                                    - A4-5000
                                                                    - A4-5100
                                                                    - A6-5200
                                                                    - A4 Pro-3340B
   ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Radeon HD 7790
                                                                    - Radeon HD 8770
                                                                    - R7 260
                                                                    - R7 260X
   GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
   -----------------------------------------------------------------------------------
   ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - A6-8500P
                                                    [on]            - Pro A6-8500B
                                                                    - A8-8600P
                                                                    - Pro A8-8600B
                                                                    - FX-8800P
                                                                    - Pro A12-8800B
   \                           ``amdgcn``   APU   - xnack   ROCm    - A10-8700P
                                                    [on]            - Pro A10-8700B
                                                                    - A10-8780P
   \                           ``amdgcn``   APU   - xnack           - A10-9600P
                                                    [on]            - A10-9630P
                                                                    - A12-9700P
                                                                    - A12-9730P
                                                                    - FX-9800P
                                                                    - FX-9830P
   \                           ``amdgcn``   APU   - xnack           - E2-9010
                                                    [on]            - A6-9210
                                                                    - A9-9410
   ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack   ROCm    - FirePro S7150
               - ``tonga``                          [off]           - FirePro S7100
                                                                    - FirePro W7100
                                                                    - Radeon R285
                                                                    - Radeon R9 380
                                                                    - Radeon R9 385
                                                                    - Mobile FirePro
                                                                      M7170
   ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack   ROCm    - Radeon R9 Nano
                                                    [off]           - Radeon R9 Fury
                                                                    - Radeon R9 FuryX
                                                                    - Radeon Pro Duo
                                                                    - FirePro S9300x2
                                                                    - Radeon Instinct MI8
   \           - ``polaris10`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 470
                                                    [off]           - Radeon RX 480
                                                                    - Radeon Instinct MI6
   \           - ``polaris11`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 460
                                                    [off]
   ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
                                                    [on]
   GCN GFX9 [AMD-GCN-GFX9]_
   -----------------------------------------------------------------------------------
   ``gfx900``                  ``amdgcn``   dGPU  - xnack   ROCm    - Radeon Vega
                                                    [off]             Frontier Edition
                                                                    - Radeon RX Vega 56
                                                                    - Radeon RX Vega 64
                                                                    - Radeon RX Vega 64
                                                                      Liquid
                                                                    - Radeon Instinct MI25
   ``gfx902``                  ``amdgcn``   APU   - xnack           - Ryzen 3 2200G
                                                    [on]            - Ryzen 5 2400G
   ``gfx904``                  ``amdgcn``   dGPU  - xnack           TBA
                                                    [off]
                                                                    .. TODO
                                                                       Add product
                                                                       names.
   ``gfx906``                  ``amdgcn``   dGPU  - xnack           TBA
                                                    [off]
                                                                    .. TODO
                                                                       Add product
                                                                       names.
   =========== =============== ============ ===== ========= ======= ==================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor-specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution or reduced performance.

The target features supported by each processor, and the default value
used if not specified explicitly, are listed in
:ref:`amdgpu-processor-table`.

Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
target features.

For example:

``-mxnack``
  Enable the ``xnack`` feature.
``-mno-xnack``
  Disable the ``xnack`` feature.

.. table:: AMDGPU Target Features
   :name: amdgpu-target-feature-table

   ============== ==================================================
   Target Feature Description
   ============== ==================================================
   -m[no-]xnack   Enable/disable generating code that has
                  memory clauses that are compatible with
                  having XNACK replay enabled.

                  This is used for demand paging and page
                  migration. If XNACK replay is enabled in
                  the device, then if a page fault occurs
                  the code may execute incorrectly if the
                  ``xnack`` feature is not enabled. Executing
                  code that has the feature enabled on a
                  device that does not have XNACK replay
                  enabled will execute correctly, but may
                  be less performant than code with the
                  feature disabled.
   ============== ==================================================
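
The compatibility rule described for ``xnack`` can be summarized as a small
decision function. This is only a sketch of the rule as stated in the table;
the function name and the three result strings are hypothetical and do not
correspond to any runtime API.

```python
def xnack_compatibility(code_has_xnack, device_xnack_replay):
    """Classify a code object/device pairing according to the xnack rule.

    code_has_xnack: the code was generated with the xnack feature enabled.
    device_xnack_replay: XNACK replay is enabled in the device.
    """
    if device_xnack_replay and not code_has_xnack:
        # A page fault may cause the code to execute incorrectly.
        return "may-execute-incorrectly"
    if code_has_xnack and not device_xnack_replay:
        # Executes correctly, but may be slower than code without the feature.
        return "correct-but-possibly-slower"
    return "matched"


for code in (False, True):
    for device in (False, True):
        print(code, device, xnack_compatibility(code, device))
```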

.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU backend uses the following address space mappings.

The memory space names used in the table, aside from the region memory space,
are from the OpenCL standard.

The LLVM address space number is used throughout LLVM (for example, in LLVM IR).

.. table:: Address Space Mapping
   :name: amdgpu-address-space-mapping-table

   ================== =================
   LLVM Address Space Memory Space
   ================== =================
   0                  Generic (Flat)
   1                  Global
   2                  Region (GDS)
   3                  Local (group/LDS)
   4                  Constant
   5                  Private (Scratch)
   6                  Constant 32-bit
   ================== =================
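
To make the mapping concrete, the sketch below stores the table as a dictionary
and spells an LLVM IR pointer type for a given address space. The helper name
is hypothetical, and the typed-pointer spelling (``float addrspace(1)*``) is an
assumption that matches LLVM releases contemporary with this document.

```python
# Address space numbers and names copied from the table above.
ADDRESS_SPACES = {
    0: "Generic (Flat)",
    1: "Global",
    2: "Region (GDS)",
    3: "Local (group/LDS)",
    4: "Constant",
    5: "Private (Scratch)",
    6: "Constant 32-bit",
}


def pointer_type(pointee, address_space):
    """Spell a typed LLVM IR pointer to `pointee` in the given address space."""
    if address_space not in ADDRESS_SPACES:
        raise ValueError(f"unknown AMDGPU address space: {address_space}")
    if address_space == 0:
        # Address space 0 is the default and is normally left implicit in IR.
        return f"{pointee}*"
    return f"{pointee} addrspace({address_space})*"


for number, name in ADDRESS_SPACES.items():
    print(f"{name:18} {pointer_type('float', number)}")
```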

.. _amdgpu-memory-scopes:

Memory Scopes
-------------

This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).

The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
relation is transitive over the synchronizes-with relation independent of scope,
and synchronizes-with allows the memory scope instances to be inclusive (see
table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This differs from the OpenCL [OpenCL]_ memory model, which does not have scope
inclusion and requires the memory scopes to match exactly. However, this
is conservatively correct for OpenCL.

.. table:: AMDHSA LLVM Sync Scopes
   :name: amdgpu-amdhsa-llvm-sync-scopes-table

   ================ ==========================================================
   LLVM Sync Scope  Description
   ================ ==========================================================
   none             The default: ``system``.

                    Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system``.
                    - ``agent`` and executed by a thread on the same agent.
                    - ``workgroup`` and executed by a thread in the same
                      workgroup.
                    - ``wavefront`` and executed by a thread in the same
                      wavefront.

   ``agent``        Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system`` or ``agent`` and executed by a thread on the
                      same agent.
                    - ``workgroup`` and executed by a thread in the same
                      workgroup.
                    - ``wavefront`` and executed by a thread in the same
                      wavefront.

   ``workgroup``    Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system``, ``agent`` or ``workgroup`` and executed by a
                      thread in the same workgroup.
                    - ``wavefront`` and executed by a thread in the same
                      wavefront.

   ``wavefront``    Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system``, ``agent``, ``workgroup`` or ``wavefront``
                      and executed by a thread in the same wavefront.

   ``singlethread`` Only synchronizes with, and participates in modification
                    and seq_cst total orderings with, other operations (except
                    image operations) running in the same thread for all
                    address spaces (for example, in signal handlers).
   ================ ==========================================================
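
One way to read the table is that two operations can synchronize when the two
threads share an instance of the narrower of the two scopes. The sketch below
models only that scope-inclusion rule; it deliberately ignores the address
space and image operation exclusions. The ``Thread`` tuple and the function
name are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class Thread(NamedTuple):
    """Hypothetical coordinates locating a thread in the execution hierarchy."""
    agent: int
    workgroup: int
    wavefront: int
    lane: int


# Scopes ordered from narrowest to widest.
_RANK = {"singlethread": 0, "wavefront": 1, "workgroup": 2, "agent": 3, "system": 4}


def may_synchronize(scope_a, thread_a, scope_b, thread_b):
    """Return True if the scope-inclusion rule in the table is satisfied.

    An empty scope string stands for "none", which defaults to ``system``.
    """
    narrowest = min(_RANK[scope_a or "system"], _RANK[scope_b or "system"])
    # Number of leading coordinates that must match for the narrowest scope:
    # system -> 0, agent -> 1, workgroup -> 2, wavefront -> 3, singlethread -> 4.
    depth = 4 - narrowest
    return thread_a[:depth] == thread_b[:depth]


t0 = Thread(agent=0, workgroup=0, wavefront=0, lane=0)
t1 = Thread(agent=0, workgroup=0, wavefront=1, lane=0)
print(may_synchronize("workgroup", t0, "agent", t1))
print(may_synchronize("wavefront", t0, "system", t1))
```

In this example the two threads share a workgroup but not a wavefront, so the
first pairing satisfies the rule and the second does not.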

AMDGPU Intrinsics
-----------------

The AMDGPU backend implements the following LLVM IR intrinsics.

This section is a work in progress.

.. TODO
   List AMDGPU intrinsics

AMDGPU Attributes
-----------------

The AMDGPU backend supports the following LLVM IR attributes.

.. table:: AMDGPU LLVM IR Attributes
   :name: amdgpu-llvm-ir-attributes-table

   ======================================= ==========================================================
   LLVM Attribute                          Description
   ======================================= ==========================================================
   "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                           will be specified when the kernel is dispatched. Generated
                                           by the ``amdgpu_flat_work_group_size`` Clang attribute [CLANG-ATTR]_.
   "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                           argument block size for the implicit arguments. This
                                           varies by OS and language (for OpenCL see
                                           :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
   "amdgpu-max-work-group-size"="n"        Specify the maximum work-group size that will be specified
                                           when the kernel is dispatched.
   "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                           the ``amdgpu_num_sgpr`` Clang attribute [CLANG-ATTR]_.
   "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                           ``amdgpu_num_vgpr`` Clang attribute [CLANG-ATTR]_.
   "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                           execution unit. Generated by the ``amdgpu_waves_per_eu``
                                           Clang attribute [CLANG-ATTR]_.
   ======================================= ==========================================================
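
The attributes in the table are string key/value pairs, and the two range
attributes carry a comma-separated pair. The sketch below formats and parses
such a value. Both helper names are hypothetical, and the check that the
minimum does not exceed the maximum is an assumption rather than something
this document states.

```python
def format_range_attribute(name, minimum, maximum):
    """Spell a "name"="min,max" string attribute as it appears in the table."""
    if minimum > maximum:
        # Assumption: a range with minimum > maximum is not meaningful.
        raise ValueError("minimum must not exceed maximum")
    return f'"{name}"="{minimum},{maximum}"'


def parse_range_value(value):
    """Split a "min,max" attribute value into a pair of integers."""
    minimum, maximum = (int(part) for part in value.split(","))
    return minimum, maximum


print(format_range_attribute("amdgpu-flat-work-group-size", 64, 256))
print(parse_range_value("1,4"))
```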

Code Object
===========

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

Header
------

The AMDGPU backend uses the following ELF header:

.. table:: AMDGPU ELF Header
   :name: amdgpu-elf-header-table

   ========================== ===============================
   Field                      Value
   ========================== ===============================
   ``e_ident[EI_CLASS]``      - ``ELFCLASS32``
                              - ``ELFCLASS64``
   ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
   ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                              - ``ELFOSABI_AMDGPU_HSA``
                              - ``ELFOSABI_AMDGPU_PAL``
                              - ``ELFOSABI_AMDGPU_MESA3D``
   ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
                              - ``ELFABIVERSION_AMDGPU_PAL``
                              - ``ELFABIVERSION_AMDGPU_MESA3D``
   ``e_type``                 - ``ET_REL``
                              - ``ET_DYN``
   ``e_machine``              ``EM_AMDGPU``
   ``e_entry``                0
   ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
   ========================== ===============================

..

.. table:: AMDGPU ELF Header Enumeration Values
   :name: amdgpu-elf-header-enumeration-values-table

   =============================== =====
   Name                            Value
   =============================== =====
   ``EM_AMDGPU``                   224
   ``ELFOSABI_NONE``               0
   ``ELFOSABI_AMDGPU_HSA``         64
   ``ELFOSABI_AMDGPU_PAL``         65
   ``ELFOSABI_AMDGPU_MESA3D``      66
   ``ELFABIVERSION_AMDGPU_HSA``    1
   ``ELFABIVERSION_AMDGPU_PAL``    0
   ``ELFABIVERSION_AMDGPU_MESA3D`` 0
   =============================== =====
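
As an illustration of how these values appear in a file, the sketch below reads
the identification bytes, ``e_type`` and ``e_machine`` from the start of an ELF
header. The byte offsets and the ``ET_REL``/``ET_DYN`` numbers come from the
generic ELF specification [ELF]_; the function name and the returned dictionary
layout are hypothetical.

```python
import struct

# Enumeration values copied from the table above.
EM_AMDGPU = 224
OSABI_NAMES = {
    0: "ELFOSABI_NONE",
    64: "ELFOSABI_AMDGPU_HSA",
    65: "ELFOSABI_AMDGPU_PAL",
    66: "ELFOSABI_AMDGPU_MESA3D",
}
# Generic ELF values (not AMDGPU specific).
CLASS_NAMES = {1: "ELFCLASS32", 2: "ELFCLASS64"}
TYPE_NAMES = {1: "ET_REL", 3: "ET_DYN"}


def describe_header(header):
    """Decode the leading fields of an ELF header given as bytes."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # e_ident indices: EI_CLASS=4, EI_DATA=5, EI_OSABI=7, EI_ABIVERSION=8.
    ei_class, ei_data, ei_osabi, ei_abiversion = (
        header[4], header[5], header[7], header[8])
    if ei_data != 1:  # ELFDATA2LSB
        raise ValueError("AMDGPU code objects are little-endian")
    # e_type and e_machine are 16-bit fields that follow the 16-byte e_ident.
    e_type, e_machine = struct.unpack_from("<HH", header, 16)
    return {
        "class": CLASS_NAMES.get(ei_class, "unknown"),
        "osabi": OSABI_NAMES.get(ei_osabi, "unknown"),
        "abiversion": ei_abiversion,
        "type": TYPE_NAMES.get(e_type, "other"),
        "is_amdgpu": e_machine == EM_AMDGPU,
    }


# A hand-built 20-byte prefix describing a 64-bit HSA shared code object.
sample = (b"\x7fELF" + bytes([2, 1, 1, 64, 1]) + bytes(7)
          + struct.pack("<HH", 3, EM_AMDGPU))
print(describe_header(sample))
```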

``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for ``r600`` architecture.

  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
    bit applications.

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMD GPU architecture specific OS ABIs
  (see :ref:`amdgpu-os-table`):

  * ``ELFOSABI_NONE`` for unknown OS.

  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMD GPU architecture specific OS ABI to which the code
  object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD Mesa
    3D runtime ABI.

``e_type``
  Can be one of the following values:

  ``ET_REL``
    The type produced by the AMD GPU backend compiler as it is a relocatable
    code object.

  ``ET_DYN``
    The type produced by the linker as it is a shared code object.

  The AMD HSA runtime loader requires an ``ET_DYN`` code object.

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
  :ref:`amdgpu-elf-header-e_flags-table`).

``e_entry``
  The entry point is 0 as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags``
     :name: amdgpu-elf-header-e_flags-table

     ================================= ========== =============================
     Name                              Value      Description
     ================================= ========== =============================
     AMDGPU Processor Flag                        See :ref:`amdgpu-processor-table`.
     -------------------------------------------- -----------------------------
     ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
                                                  mask for
                                                  ``EF_AMDGPU_MACH_xxx`` values
                                                  defined in
                                                  :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
                                                  target feature is
                                                  enabled for all code
                                                  contained in the code object.
                                                  If the processor
                                                  does not support the
                                                  ``xnack`` target
                                                  feature then must
                                                  be 0.
                                                  See
                                                  :ref:`amdgpu-target-features`.
     ================================= ========== =============================

  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
     :name: amdgpu-ef-amdgpu-mach-table

     ================================= ========== =============================
     Name                              Value      Description (see
                                                  :ref:`amdgpu-processor-table`)
     ================================= ========== =============================
     ``EF_AMDGPU_MACH_NONE``           0x000      not specified
     ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
     ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
     ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
     ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
     ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
     ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
     ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
     ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
     ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
     ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
     ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
     ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
     ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
     ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
     ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
     ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
     reserved                          0x011 -    Reserved for ``r600``
                                       0x01f      architecture processors.
     ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
     ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
     ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
     ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
     ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
     ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
     ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
     reserved                          0x027      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
     ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
     ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
     ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
     ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
     ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
     ``EF_AMDGPU_MACH_AMDGCN_GFX904``  0x02e      ``gfx904``
     ``EF_AMDGPU_MACH_AMDGCN_GFX906``  0x02f      ``gfx906``
     reserved                          0x030      Reserved.
     ================================= ========== =============================
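
Decoding ``e_flags`` amounts to masking out the processor field and testing the
``xnack`` bit. The sketch below uses the mask values from the ``e_flags`` table
and, for brevity, only the ``amdgcn`` entries of the ``EF_AMDGPU_MACH`` table.
The function name and the ``"unknown"`` fallback are hypothetical.

```python
# Mask values copied from the e_flags table above.
EF_AMDGPU_MACH = 0x000000FF
EF_AMDGPU_XNACK = 0x00000100

# Subset of the EF_AMDGPU_MACH values table (amdgcn processors only).
MACH_NAMES = {
    0x020: "gfx600", 0x021: "gfx601",
    0x022: "gfx700", 0x023: "gfx701", 0x024: "gfx702",
    0x025: "gfx703", 0x026: "gfx704",
    0x028: "gfx801", 0x029: "gfx802", 0x02A: "gfx803", 0x02B: "gfx810",
    0x02C: "gfx900", 0x02D: "gfx902", 0x02E: "gfx904", 0x02F: "gfx906",
}


def decode_e_flags(e_flags):
    """Return (processor name, xnack enabled) for an AMDGPU e_flags value."""
    mach = e_flags & EF_AMDGPU_MACH
    xnack = bool(e_flags & EF_AMDGPU_XNACK)
    return MACH_NAMES.get(mach, "unknown"), xnack


print(decode_e_flags(0x12D))
print(decode_e_flags(0x02A))
```

The first value has the ``xnack`` bit set on top of the ``gfx902`` processor
code; the second is ``gfx803`` with the bit clear.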
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	593
				594	Sections
				595	--------
				596
				597	An AMDGPU target ELF code object has the standard ELF sections which include:
				598
  .. table:: AMDGPU ELF Sections
     :name: amdgpu-elf-sections-table

     ================== ================ =================================
     Name               Type             Attributes
     ================== ================ =================================
     ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.debug_``\ \*    ``SHT_PROGBITS`` none
     ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
     ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
     ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
     ``.note``          ``SHT_NOTE``     none
     ``.rela``\ name    ``SHT_RELA``     none
     ``.rela.dyn``      ``SHT_RELA``     none
     ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
     ``.shstrtab``      ``SHT_STRTAB``   none
     ``.strtab``        ``SHT_STRTAB``   none
     ``.symtab``        ``SHT_SYMTAB``   none
     ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
     ================== ================ =================================
				622
				623	These sections have their standard meanings (see [ELF]_) and are only generated
				624	if needed.
				625
				626	``.debug``\ \*
				627	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				628	DWARF produced by the AMDGPU backend.
				629
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	630	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	631	The standard sections used by a dynamic loader.
				632
				633	``.note``
				634	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				635	backend.
				636
				637	``.rela``\ name, ``.rela.dyn``
  For relocatable code objects, *name* is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.

  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from the ``.rela``\ name sections of each relocatable code object.
				644
				645	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				646	the AMDGPU backend.
				647
				648	``.text``
  The executable machine code for the kernels and the functions they call.
  Generated as position independent code. See :ref:`amdgpu-code-conventions`
  for information on the conventions used in ISA generation.
				652
				653	.. _amdgpu-note-records:
				654
				655	Note Records
				656	------------
				657
As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
aligned. In addition, minimal zero byte padding must be generated to ensure the
``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
``.note`` section must be at least 4 to indicate at least 4 byte alignment.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	663
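As an illustration of the padding rules above, the following Python sketch
packs a single note record. The helper names ``pad4`` and ``pack_note``, the
little-endian byte order, and the placeholder ``desc`` payload are assumptions
made for this example; they are not part of the AMDGPU backend.

```python
import struct

NT_AMD_AMDGPU_HSA_METADATA = 10  # value from the enumeration table in this section


def pad4(data: bytes) -> bytes:
    """Append the minimal number of zero bytes to reach a multiple of 4."""
    return data + b"\x00" * (-len(data) % 4)


def pack_note(name: str, note_type: int, desc: bytes) -> bytes:
    """Pack one ELF note record (assumes little-endian 4-byte header words)."""
    name_bytes = name.encode("ascii") + b"\x00"  # namesz counts the null terminator
    # The header stores the unpadded sizes of the name and desc fields.
    header = struct.pack("<III", len(name_bytes), len(desc), note_type)
    return header + pad4(name_bytes) + pad4(desc)


# Hypothetical payload: a tiny null-terminated string standing in for metadata.
note = pack_note("AMD", NT_AMD_AMDGPU_HSA_METADATA, b"---\n...\n\x00")
```

The header records the unpadded ``name`` and ``desc`` sizes, while the padding
only affects where the next field or record starts.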
				664	The AMDGPU backend code object uses the following ELF note records in the
				665	``.note`` section. The Description column specifies the layout of the note
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	666	record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	667	variable size strings have a corresponding ``*_size`` field that specifies the
				668	number of bytes, including the terminating null character, in the string. The
				669	string(s) come immediately after the preceding fields.
				670
				671	Additional note records can be present.
				672
  .. table:: AMDGPU ELF Note Records
     :name: amdgpu-elf-note-records-table

     ===== ============================== ======================================
     Name  Type                           Description
     ===== ============================== ======================================
     "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
     ===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	681
				682	..
				683
  .. table:: AMDGPU ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-table

     ============================== =====
     Name                           Value
     ============================== =====
     reserved                       0-9
     ``NT_AMD_AMDGPU_HSA_METADATA`` 10
     reserved                       11
     ============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	694
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	695	``NT_AMD_AMDGPU_HSA_METADATA``
				696	Specifies extensible metadata associated with the code objects executed on HSA
				697	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				698	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				699	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				700	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	701
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	702	.. _amdgpu-symbols:
				703
				704	Symbols
				705	-------
				706
				707	Symbols include the following:
				708
  .. table:: AMDGPU ELF Symbols
     :name: amdgpu-elf-symbols-table

     ===================== ============== ============= ==================
     Name                  Type           Section       Description
     ===================== ============== ============= ==================
     link-name             ``STT_OBJECT`` - ``.data``   Global variable
                                          - ``.rodata``
                                          - ``.bss``
     link-name\ ``.kd``    ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
     link-name             ``STT_FUNC``   - ``.text``   Kernel entry point
     ===================== ============== ============= ==================
				721
				722	Global variable
				723	Global variables both used and defined by the compilation unit.
				724
  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.
				727
				728	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				729	will resolve relocations using the definition provided by another code object
				730	or explicitly defined by the runtime.
				731
				732	All global symbols, whether defined in the compilation unit or external, are
				733	accessed by the machine code indirectly through a GOT table entry. This
				734	allows them to be preemptable. The GOT table is only supported when the target
				735	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	736
				737	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	738	Add description of linked shared object symbols. Seems undefined symbols
				739	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	740
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	741	Kernel descriptor
				742	Every HSA kernel has an associated kernel descriptor. It is the address of the
				743	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				744	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				745	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				746
				747	Kernel entry point
				748	Every HSA kernel also has a symbol for its machine code entry point.
				749
				750	.. _amdgpu-relocation-records:
				751
				752	Relocation Records
				753	------------------
				754
The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
relocatable fields are:
				757
				758	``word32``
				759	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				760	alignment. These values use the same byte order as other word values in the
				761	AMD GPU architecture.
				762
				763	``word64``
				764	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				765	alignment. These values use the same byte order as other word values in the
				766	AMD GPU architecture.
				767
The following notation is used for specifying relocation calculations:
				769
				770	A
				771	Represents the addend used to compute the value of the relocatable field.
				772
				773	G
				774	Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	775	entry's symbol will reside during execution.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	776
				777	GOT
				778	Represents the address of the global offset table.
				779
				780	P
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
  of the storage unit being relocated (computed using ``r_offset``).
				783
				784	S
				785	Represents the value of the symbol whose index resides in the relocation
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	786	entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				787
				788	B
				789	Represents the base address of a loaded executable or shared object which is
				790	the difference between the ELF address and the actual load address. Relocations
				791	using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	792
				793	The following relocation types are supported:
				794
  .. table:: AMDGPU ELF Relocation Records
     :name: amdgpu-elf-relocation-records-table

     ========================== ======= ===== ========== ==============================
     Relocation Type            Kind    Value Field      Calculation
     ========================== ======= ===== ========== ==============================
     ``R_AMDGPU_NONE``                  0     none       none
     ``R_AMDGPU_ABS32_LO``      Static, 1     ``word32`` (S + A) & 0xFFFFFFFF
                                Dynamic
     ``R_AMDGPU_ABS32_HI``      Static, 2     ``word32`` (S + A) >> 32
                                Dynamic
     ``R_AMDGPU_ABS64``         Static, 3     ``word64`` S + A
                                Dynamic
     ``R_AMDGPU_REL32``         Static  4     ``word32`` S + A - P
     ``R_AMDGPU_REL64``         Static  5     ``word64`` S + A - P
     ``R_AMDGPU_ABS32``         Static, 6     ``word32`` S + A
                                Dynamic
     ``R_AMDGPU_GOTPCREL``      Static  7     ``word32`` G + GOT + A - P
     ``R_AMDGPU_GOTPCREL32_LO`` Static  8     ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_GOTPCREL32_HI`` Static  9     ``word32`` (G + GOT + A - P) >> 32
     ``R_AMDGPU_REL32_LO``      Static  10    ``word32`` (S + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_REL32_HI``      Static  11    ``word32`` (S + A - P) >> 32
     reserved                           12
     ``R_AMDGPU_RELATIVE64``    Dynamic 13    ``word64`` B + A
     ========================== ======= ===== ========== ==============================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	820
Tony Tye	223f4c7	2018-04-13 01:01:27 +0000	[diff] [blame]	821	``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
				822	the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
				823
				824	There is no current OS loader support for 32 bit programs and so
				825	``R_AMDGPU_ABS32`` is not used.
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	826
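The calculations in the relocation table can be sketched in Python as follows.
The function name ``relocation_value`` is hypothetical, and the sketch assumes
that intermediate values wrap to 64 bits and that ``word32`` fields keep only
the low 32 bits of the result; neither assumption is stated by the table.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def relocation_value(rtype, *, S=0, A=0, P=0, G=0, GOT=0, B=0):
    """Return the value stored in the relocated field for one relocation type."""
    # Assumption: arithmetic wraps as unsigned 64-bit two's complement.
    absolute = (S + A) & MASK64
    relative = (S + A - P) & MASK64
    got_relative = (G + GOT + A - P) & MASK64
    values = {
        "R_AMDGPU_ABS32_LO": absolute & MASK32,
        "R_AMDGPU_ABS32_HI": absolute >> 32,
        "R_AMDGPU_ABS64": absolute,
        "R_AMDGPU_REL32": relative & MASK32,
        "R_AMDGPU_REL64": relative,
        "R_AMDGPU_ABS32": absolute & MASK32,
        "R_AMDGPU_GOTPCREL": got_relative & MASK32,
        "R_AMDGPU_GOTPCREL32_LO": got_relative & MASK32,
        "R_AMDGPU_GOTPCREL32_HI": got_relative >> 32,
        "R_AMDGPU_REL32_LO": relative & MASK32,
        "R_AMDGPU_REL32_HI": relative >> 32,
        "R_AMDGPU_RELATIVE64": (B + A) & MASK64,
    }
    return values[rtype]


# A symbol below the relocated place gives a negative PC-relative offset,
# which shows up as an all-ones high half.
lo = relocation_value("R_AMDGPU_REL32_LO", S=0x1000, P=0x2000)
hi = relocation_value("R_AMDGPU_REL32_HI", S=0x1000, P=0x2000)
```

The ``_LO`` and ``_HI`` pairs split a single 64-bit result across two
``word32`` fields, so a consumer would apply both to reconstruct the full
value.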
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	827	.. _amdgpu-dwarf:
				828
				829	DWARF
				830	-----
				831
Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
information that maps the code object executable code and data to the source
language constructs. This information can be used by tools such as debuggers
and profilers.
				835
				836	Address Space Mapping
				837	~~~~~~~~~~~~~~~~~~~~~
				838
				839	The following address space mapping is used:
				840
  .. table:: AMDGPU DWARF Address Space Mapping
     :name: amdgpu-dwarf-address-space-mapping-table

     =================== =================
     DWARF Address Space Memory Space
     =================== =================
     1                   Private (Scratch)
     2                   Local (group/LDS)
     omitted             Global
     omitted             Constant
     omitted             Generic (Flat)
     not supported       Region (GDS)
     =================== =================
				854
				855	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				856	used in the table.
				857
				858	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				859	DWARF address space of the value of the pointer when it is in the private or
				860	local address space. Otherwise the attribute is omitted.
				861
An ``XDEREF`` operation is generated in location list expressions for variables
that are allocated in the private and local address space. Otherwise no
``XDEREF`` is generated.
				865
				866	Register Mapping
				867	~~~~~~~~~~~~~~~~
				868
				869	This section is WIP.
				870
				871	.. TODO
				872	Define DWARF register enumeration.
				873
   To present wavefront state, vector registers should be exposed as 64 wide
   (rather than the per work-item view that LLVM uses), either as separate
   registers or as a single 64x4 byte register. In either case, use a new LANE
   op (akin to XDEREF) to select the current lane in a location expression.
   This would also allow scalar register spilling to vector register lanes to
   be expressed (currently no debug information is generated for spilling). If
   the wide single register approach is chosen, use LANE in conjunction with
   the PIECE operation to select the dword part of the register for the current
   lane. If the separate register approach is chosen, use LANE to select the
   register.
				884
				885	Source Text
				886	~~~~~~~~~~~
				887
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	888	Source text for online-compiled programs (e.g. those compiled by the OpenCL
				889	runtime) may be embedded into the DWARF v5 line table using the ``clang
				890	-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	891
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	892	For example:
				893
				894	``-gembed-source``
				895	Enable the embedded source DWARF v5 extension.
				896	``-gno-embed-source``
				897	Disable the embedded source DWARF v5 extension.
				898
  .. table:: AMDGPU Debug Options
     :name: amdgpu-debug-options

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as
                          when performing online compilation.
     ==================== ==================================================
				910
This option enables one extended content type in the DWARF v5 Line Number
Program Header, which is used to encode embedded source.
				913
  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
     :name: amdgpu-dwarf-extended-content-types

     ============================ ======================
     Content Type                 Form
     ============================ ======================
     ``DW_LNCT_LLVM_source``      ``DW_FORM_line_strp``
     ============================ ======================
				922
				923	The source field will contain the UTF-8 encoded, null-terminated source text
				924	with ``'\n'`` line endings. When the source field is present, consumers can use
				925	the embedded source instead of attempting to discover the source on disk. When
				926	the source field is absent, consumers can access the file to get the source
				927	text.
				928
The above content type appears in the ``file_name_entry_format`` field of the
line table prologue, and its corresponding value appears in the ``file_names``
field. The current encoding of the content type is documented in table
:ref:`amdgpu-dwarf-extended-content-types-encoding`.
				933
  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
     :name: amdgpu-dwarf-extended-content-types-encoding

     ============================ ====================
     Content Type                 Value
     ============================ ====================
     ``DW_LNCT_LLVM_source``      0x2001
     ============================ ====================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	942
				943	.. _amdgpu-code-conventions:
				944
				945	Code Conventions
				946	================
				947
				948	This section provides code conventions used for each supported target triple OS
				949	(see :ref:`amdgpu-target-triples`).
				950
				951	AMDHSA
				952	------
				953
				954	This section provides code conventions used when the target triple OS is
				955	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				956
				957	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	958
Tony Tye	01bfd6c	2018-03-27 21:20:46 +0000	[diff] [blame]	959	Code Object Target Identification
				960	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				961
				962	The AMDHSA OS uses the following syntax to specify the code object
				963	target as a single string:
				964
				965	``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
				966
				967	Where:
				968
				969	- ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
				970	are the same as the Target Triple (see
				971	:ref:`amdgpu-target-triples`).
				972
				973	- ``<Processor>`` is the same as the Processor (see
				974	:ref:`amdgpu-processors`).
				975
				976	- ``<Target Features>`` is a list of the enabled Target Features
				977	(see :ref:`amdgpu-target-features`), each prefixed by a plus, that
				978	apply to Processor. The list must be in the same order as listed
				979	in the table :ref:`amdgpu-target-feature-table`. Note that *Target
				980	Features* must be included in the list if they are enabled even if
				981	that is the default for Processor.
				982
				983	For example:
				984
				985	``"amdgcn-amd-amdhsa--gfx902+xnack"``
				986
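The target identification syntax above can be exercised with a short Python
sketch. The function names ``format_target_id`` and ``parse_target_id`` are
hypothetical, and the sketch does not check that the features appear in the
order required by :ref:`amdgpu-target-feature-table`.

```python
def format_target_id(arch, vendor, os_name, environment, processor, features=()):
    """Build the code object target string from its components."""
    triple = "-".join([arch, vendor, os_name, environment])
    # Each enabled target feature is appended with a leading plus.
    return f"{triple}-{processor}" + "".join(f"+{feature}" for feature in features)


def parse_target_id(target_id):
    """Split a code object target string back into its components."""
    head, *features = target_id.split("+")
    # The environment may be empty, which leaves two consecutive dashes.
    arch, vendor, os_name, environment, processor = head.split("-", 4)
    return {
        "arch": arch,
        "vendor": vendor,
        "os": os_name,
        "environment": environment,
        "processor": processor,
        "features": features,
    }


target = format_target_id("amdgcn", "amd", "amdhsa", "", "gfx902", ["xnack"])
parsed = parse_target_id(target)
```

With an empty environment the result matches the example string given above.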
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	987	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	988	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	989
The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the ROCm kernel queries, for example the
segment sizes needed in a dispatch packet. In addition, a high level language
runtime may require other information to be included. For example, the AMD
OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	999
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1000	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1001	:doc:`YamlIO`).
				1002
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1003	.. TODO
				1004	Is the string null terminated? It probably should not if YAML allows it to
				1005	contain null characters, otherwise it should be.
				1006
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1007	The metadata is represented as a single YAML document comprised of the mapping
				1008	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				1009	referenced tables.
				1010
				1011	For boolean values, the string values of ``false`` and ``true`` are used for
				1012	false and true respectively.
				1013
				1014	Additional information can be added to the mappings. To avoid conflicts, any
				1015	non-AMD key names should be prefixed by "vendor-name.".
				1016
				1017	.. table:: AMDHSA Code Object Metadata Mapping
				1018	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				1019
				1020	========== ============== ========= =======================================
				1021	String Key Value Type Required? Description
				1022	========== ============== ========= =======================================
				1023	"Version" sequence of Required - The first integer is the major
				1024	2 integers version. Currently 1.
				1025	- The second integer is the minor
				1026	version. Currently 0.
				1027	"Printf" sequence of Each string is encoded information
				1028	strings about a printf function call. The
				1029	encoded information is organized as
				1030	fields separated by colon (':'):
				1031
				1032	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				1033
				1034	where:
				1035
				1036	``ID``
				1037	A 32 bit integer as a unique id for
				1038	each printf function call
				1039
				1040	``N``
				1041	A 32 bit integer equal to the number
				1042	of arguments of printf function call
				1043	minus 1
				1044
				1045	``S[i]`` (where i = 0, 1, ... , N-1)
				1046	32 bit integers for the size in bytes
				1047	of the i-th FormatString argument of
				1048	the printf function call
				1049
				1050	FormatString
				1051	The format string passed to the
				1052	printf function call.
				1053	"Kernels" sequence of Required Sequence of the mappings for each
				1054	mapping kernel in the code object. See
				1055	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				1056	for the definition of the mapping.
				1057	========== ============== ========= =======================================
				1058
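The layout of a "Printf" entry described in the table can be decoded with a
sketch like the one below. The function name ``parse_printf_entry`` and the
sample strings are hypothetical, and the sketch assumes the fields are stored
unescaped, so only the trailing format string may contain colons.

```python
def parse_printf_entry(entry: str) -> dict:
    """Split one "Printf" metadata string into ID, argument sizes, and format."""
    call_id, count, rest = entry.split(":", 2)
    n = int(count)
    # Split off exactly N size fields; whatever remains is the format string,
    # which may itself contain ':' characters.
    parts = rest.split(":", n)
    if len(parts) != n + 1:
        raise ValueError(f"expected {n} size fields in {entry!r}")
    return {
        "id": int(call_id),
        "sizes": [int(size) for size in parts[:n]],
        "format": parts[n],
    }


info = parse_printf_entry("1:2:4:8:x=%d t=%ld")
```

Limiting the number of splits to ``N`` keeps a format string such as
``"time: %d"`` intact.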
				1059	..
				1060
				1061	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				1062	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				1063
				1064	================= ============== ========= ================================
				1065	String Key Value Type Required? Description
				1066	================= ============== ========= ================================
				1067	"Name" string Required Source name of the kernel.
				1068	"SymbolName" string Required Name of the kernel
				1069	descriptor ELF symbol.
				1070	"Language" string Source language of the kernel.
				1071	Values include:
				1072
				1073	- "OpenCL C"
				1074	- "OpenCL C++"
				1075	- "HCC"
				1076	- "OpenMP"
				1077
				1078	"LanguageVersion" sequence of - The first integer is the major
				1079	2 integers version.
				1080	- The second integer is the
				1081	minor version.
				1082	"Attrs" mapping Mapping of kernel attributes.
				1083	See
				1084	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				1085	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1086	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1087	mapping kernel arguments. See
				1088	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				1089	for the definition of the mapping.
				1090	"CodeProps" mapping Mapping of properties related to
				1091	the kernel code. See
				1092	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				1093	for the mapping definition.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1094	================= ============== ========= ================================
				1095
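The required keys of the top-level mapping and of each kernel mapping can be
checked with a sketch such as the following. It operates on an already parsed
YAML document represented as Python dictionaries and lists; the function name
``check_metadata`` and the sample kernel names are hypothetical.

```python
def check_metadata(doc: dict) -> list:
    """Return a list of problems found in a parsed code object metadata mapping."""
    errors = []

    # "Version" is required and is a sequence of 2 integers (major, minor).
    version = doc.get("Version")
    if not (isinstance(version, list) and len(version) == 2
            and all(type(v) is int for v in version)):
        errors.append('"Version" must be a sequence of 2 integers')

    # "Kernels" is required and is a sequence of kernel mappings.
    kernels = doc.get("Kernels")
    if not isinstance(kernels, list):
        errors.append('"Kernels" must be a sequence of mappings')
        kernels = []

    # Each kernel mapping requires the "Name" and "SymbolName" strings.
    for index, kernel in enumerate(kernels):
        for key in ("Name", "SymbolName"):
            if not isinstance(kernel.get(key), str):
                errors.append(f'Kernels[{index}]: missing required string "{key}"')
    return errors


sample = {"Version": [1, 0], "Kernels": [{"Name": "foo", "SymbolName": "foo.kd"}]}
problems = check_metadata(sample)
```

Optional keys such as "Printf", "Language", and "Args" are deliberately left
unchecked here, since the tables mark them as not required.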
				1096	..
				1097
				1098	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				1099	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				1100
				1101	=================== ============== ========= ==============================
				1102	String Key Value Type Required? Description
				1103	=================== ============== ========= ==============================
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1104	"ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
				1105	3 integers must be >=1 and the dispatch
				1106	work-group size X, Y, Z must
				1107	correspond to the specified
				1108	values. Defaults to 0, 0, 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1109
				1110	Corresponds to the OpenCL
				1111	``reqd_work_group_size``
				1112	attribute.
				1113	"WorkGroupSizeHint" sequence of The dispatch work-group size
				1114	3 integers X, Y, Z is likely to be the
				1115	specified values.
				1116
				1117	Corresponds to the OpenCL
				1118	``work_group_size_hint``
				1119	attribute.
				1120	"VecTypeHint" string The name of a scalar or vector
				1121	type.
				1122
				1123	Corresponds to the OpenCL
				1124	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	1125
				1126	"RuntimeHandle" string The external symbol name
				1127	associated with a kernel.
				1128	OpenCL runtime allocates a
				1129	global buffer for the symbol
				1130	and saves the kernel's address
				1131	to it, which is used for
				1132	device side enqueueing. Only
				1133	available for device side
				1134	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1135	=================== ============== ========= ==============================
				1136
				1137	..
				1138
				1139	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				1140	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				1141
				1142	================= ============== ========= ================================
				1143	String Key Value Type Required? Description
				1144	================= ============== ========= ================================
				1145	"Name" string Kernel argument name.
				1146	"TypeName" string Kernel argument type name.
				1147	"Size" integer Required Kernel argument size in bytes.
				1148	"Align" integer Required Kernel argument alignment in
				1149	bytes. Must be a power of two.
				1150	"ValueKind" string Required Kernel argument kind that
				1151	specifies how to set up the
				1152	corresponding argument.
				1153	Values include:
				1154
				1155	"ByValue"
				1156	The argument is copied
				1157	directly into the kernarg.
				1158
				1159	"GlobalBuffer"
				1160	A global address space pointer
				1161	to the buffer data is passed
				1162	in the kernarg.
				1163
				1164	"DynamicSharedPointer"
				1165	A group address space pointer
				1166	to dynamically allocated LDS
				1167	is passed in the kernarg.
				1168
				1169	"Sampler"
				1170	A global address space
				1171	pointer to a S# is passed in
				1172	the kernarg.
				1173
				1174	"Image"
				1175	A global address space
				1176	pointer to a T# is passed in
				1177	the kernarg.
				1178
				1179	"Pipe"
				1180	A global address space pointer
				1181	to an OpenCL pipe is passed in
				1182	the kernarg.
				1183
				1184	"Queue"
				1185	A global address space pointer
				1186	to an OpenCL device enqueue
				1187	queue is passed in the
				1188	kernarg.
				1189
				1190	"HiddenGlobalOffsetX"
				1191	The OpenCL grid dispatch
				1192	global offset for the X
				1193	dimension is passed in the
				1194	kernarg.
				1195
				1196	"HiddenGlobalOffsetY"
				1197	The OpenCL grid dispatch
				1198	global offset for the Y
				1199	dimension is passed in the
				1200	kernarg.
				1201
				1202	"HiddenGlobalOffsetZ"
				1203	The OpenCL grid dispatch
				1204	global offset for the Z
				1205	dimension is passed in the
				1206	kernarg.
				1207
				1208	"HiddenNone"
				1209	An argument that is not used
				1210	by the kernel. Space needs to
				1211	be left for it, but it does
				1212	not need to be set up.
				1213
				1214	"HiddenPrintfBuffer"
				1215	A global address space pointer
				1216	to the runtime printf buffer
				1217	is passed in kernarg.
				1218
				1219	"HiddenDefaultQueue"
				1220	A global address space pointer
				1221	to the OpenCL device enqueue
				1222	queue that should be used by
				1223	the kernel by default is
				1224	passed in the kernarg.
				1225
				1226	"HiddenCompletionAction"
Yaxun Liu	c928f2a	2017-10-30 14:30:28 +0000	[diff] [blame]	1227	A global address space pointer
				1228	to help link enqueued kernels into
				1229	the ancestor tree for determining
				1230	when the parent kernel has finished.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1231
				1232	"ValueType" string Required Kernel argument value type. Only
				1233	present if "ValueKind" is
				1234	"ByValue". For vector data
				1235	types, the value is for the
				1236	element type. Values include:
				1237
				1238	- "Struct"
				1239	- "I8"
				1240	- "U8"
				1241	- "I16"
				1242	- "U16"
				1243	- "F16"
				1244	- "I32"
				1245	- "U32"
				1246	- "F32"
				1247	- "I64"
				1248	- "U64"
				1249	- "F64"
				1250
				1251	.. TODO
				1252	How can it be determined if a
				1253	vector type, and what size
				1254	vector?
				1255	"PointeeAlign" integer Alignment in bytes of pointee
				1256	type for pointer type kernel
				1257	argument. Must be a power
				1258	of 2. Only present if
				1259	"ValueKind" is
				1260	"DynamicSharedPointer".
				1261	"AddrSpaceQual" string Kernel argument address space
				1262	qualifier. Only present if
				1263	"ValueKind" is "GlobalBuffer" or
				1264	"DynamicSharedPointer". Values
				1265	are:
				1266
				1267	- "Private"
				1268	- "Global"
				1269	- "Constant"
				1270	- "Local"
				1271	- "Generic"
				1272	- "Region"
				1273
				1274	.. TODO
				1275	Is GlobalBuffer only Global
				1276	or Constant? Is
				1277	DynamicSharedPointer always
				1278	Local? Can HCC allow Generic?
				1279	How can Private or Region
				1280	ever happen?
				1281	"AccQual" string Kernel argument access
				1282	qualifier. Only present if
				1283	"ValueKind" is "Image" or
				1284	"Pipe". Values
				1285	are:
				1286
				1287	- "ReadOnly"
				1288	- "WriteOnly"
				1289	- "ReadWrite"
				1290
				1291	.. TODO
				1292	Does this apply to
				1293	GlobalBuffer?
     "ActualAccQual"   string                   The actual memory accesses
                                                performed by the kernel on the
                                                kernel argument. Only present if
                                                "ValueKind" is "GlobalBuffer",
                                                "Image", or "Pipe". This may be
                                                more restrictive than indicated
                                                by "AccQual" to reflect what the
                                                kernel actually does. If not
                                                present then the runtime must
                                                assume what is implied by
                                                "AccQual" and "IsConst". Values
                                                are:

                                                - "ReadOnly"
                                                - "WriteOnly"
                                                - "ReadWrite"
				1310
				1311	"IsConst" boolean Indicates if the kernel argument
				1312	is const qualified. Only present
				1313	if "ValueKind" is
				1314	"GlobalBuffer".
				1315
				1316	"IsRestrict" boolean Indicates if the kernel argument
				1317	is restrict qualified. Only
				1318	present if "ValueKind" is
				1319	"GlobalBuffer".
				1320
				1321	"IsVolatile" boolean Indicates if the kernel argument
				1322	is volatile qualified. Only
				1323	present if "ValueKind" is
				1324	"GlobalBuffer".
				1325
				1326	"IsPipe" boolean Indicates if the kernel argument
				1327	is pipe qualified. Only present
				1328	if "ValueKind" is "Pipe".
				1329
				1330	.. TODO
				1331	Can GlobalBuffer be pipe
				1332	qualified?
				1333	================= ============== ========= ================================
				1334
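The "Only present if" notes in the kernel argument rows above can be checked mechanically. The following is a minimal sketch, assuming a kernel argument has already been loaded into a Python mapping keyed by the string keys of the table. The helper name ``check_arg_metadata`` is hypothetical and is not part of any LLVM or ROCm API.

```python
# Which "ValueKind" values allow each optional key, transcribed from the
# "Only present if" notes in the table rows above.
ALLOWED_VALUE_KINDS = {
    "ValueType": {"ByValue"},
    "PointeeAlign": {"DynamicSharedPointer"},
    "AddrSpaceQual": {"GlobalBuffer", "DynamicSharedPointer"},
    "AccQual": {"Image", "Pipe"},
    "ActualAccQual": {"GlobalBuffer", "Image", "Pipe"},
    "IsConst": {"GlobalBuffer"},
    "IsRestrict": {"GlobalBuffer"},
    "IsVolatile": {"GlobalBuffer"},
    "IsPipe": {"Pipe"},
}


def check_arg_metadata(arg):
    """Return a list of problems found in one kernel argument mapping."""
    problems = []
    kind = arg.get("ValueKind")
    for key, kinds in ALLOWED_VALUE_KINDS.items():
        if key in arg and kind not in kinds:
            problems.append(f"{key} is not expected when ValueKind is {kind}")
    align = arg.get("PointeeAlign")
    # A power of 2 has exactly one bit set, so n & (n - 1) is zero.
    if align is not None and (align <= 0 or (align & (align - 1)) != 0):
        problems.append("PointeeAlign must be a power of 2")
    return problems
```

For example, an argument with ``"ValueKind": "ByValue"`` that also carries ``"AccQual"`` would be reported, since the table only allows "AccQual" for "Image" and "Pipe".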
				1335	..
				1336
				1337	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1338	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1339
				1340	============================ ============== ========= =====================
				1341	String Key Value Type Required? Description
				1342	============================ ============== ========= =====================
				1343	"KernargSegmentSize" integer Required The size in bytes of
				1344	the kernarg segment
				1345	that holds the values
				1346	of the arguments to
				1347	the kernel.
				1348	"GroupSegmentFixedSize" integer Required The amount of group
				1349	segment memory
				1350	required by a
				1351	work-group in
				1352	bytes. This does not
				1353	include any
				1354	dynamically allocated
				1355	group segment memory
				1356	that may be added
				1357	when the kernel is
				1358	dispatched.
				1359	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1360	private address space
				1361	memory required for a
				1362	work-item in
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1363	bytes. If the kernel
				1364	uses a dynamic call
				1365	stack then additional
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1366	space must be added
				1367	to this value for the
				1368	call stack.
				1369	"KernargSegmentAlign" integer Required The maximum byte
				1370	alignment of
				1371	arguments in the
				1372	kernarg segment. Must
				1373	be a power of 2.
				1374	"WavefrontSize" integer Required Wavefront size. Must
				1375	be a power of 2.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1376	"NumSGPRs" integer Required Number of scalar
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1377	registers used by a
				1378	wavefront for
				1379	GFX6-GFX9. This
				1380	includes the special
				1381	SGPRs for VCC, Flat
				1382	Scratch (GFX7-GFX9)
				1383	and XNACK (for
				1384	GFX8-GFX9). It does
				1385	not include the 16
				1386	SGPR added if a trap
				1387	handler is
				1388	enabled. It is not
				1389	rounded up to the
				1390	allocation
				1391	granularity.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1392	"NumVGPRs" integer Required Number of vector
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1393	registers used by
				1394	each work-item for
				1395	GFX6-GFX9
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1396	"MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1397	work-group size
				1398	supported by the
				1399	kernel in work-items.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1400	Must be >=1 and
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1401	consistent with
				1402	ReqdWorkGroupSize if
				1403	not 0, 0, 0.
Konstantin Zhuravlyov	06ae4ec	2017-11-28 17:51:08 +0000	[diff] [blame]	1404	"NumSpilledSGPRs" integer Number of stores from
				1405	a scalar register to
				1406	a register allocator
				1407	created spill
				1408	location.
				1409	"NumSpilledVGPRs" integer Number of stores from
				1410	a vector register to
				1411	a register allocator
				1412	created spill
				1413	location.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1414	============================ ============== ========= =====================
				1415
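The constraints in the kernel code properties table can be expressed as a small consistency check. This is a sketch only, assuming the metadata has been parsed into a Python mapping. The function name ``check_code_props`` and the sample values in the usage note are illustrative assumptions, not output of any real tool.

```python
# Keys marked "Required" in the table above.
REQUIRED_KEYS = (
    "KernargSegmentSize",
    "GroupSegmentFixedSize",
    "PrivateSegmentFixedSize",
    "KernargSegmentAlign",
    "WavefrontSize",
    "NumSGPRs",
    "NumVGPRs",
    "MaxFlatWorkGroupSize",
)


def is_power_of_two(n):
    """True if n is a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def check_code_props(props):
    """Return a list of problems found in a code properties mapping."""
    problems = [f"missing required key {key}"
                for key in REQUIRED_KEYS if key not in props]
    for key in ("KernargSegmentAlign", "WavefrontSize"):
        if key in props and not is_power_of_two(props[key]):
            problems.append(f"{key} must be a power of 2")
    if props.get("MaxFlatWorkGroupSize", 1) < 1:
        problems.append("MaxFlatWorkGroupSize must be >= 1")
    return problems
```

A mapping with all required keys, ``"WavefrontSize": 64`` and ``"KernargSegmentAlign": 8`` passes. Changing the wavefront size to 48 is reported because 48 is not a power of 2.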
				1416	..
				1417
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1418	Kernel Dispatch
				1419	~~~~~~~~~~~~~~~
				1420
				1421	The HSA architected queuing language (AQL) defines a user space memory interface
				1422	that can be used to control the dispatch of kernels, in an agent independent
				1423	way. An agent can have zero or more AQL queues created for it using the ROCm
				1424	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1425	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1426	mechanics and packet layouts.
				1427
				1428	The packet processor of a kernel agent is responsible for detecting and
				1429	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1430	packet processor is implemented by the hardware command processor (CP),
				1431	asynchronous dispatch controller (ADC) and shader processor input controller
				1432	(SPI).
				1433
				1434	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1435	mode driver to initialize and register the AQL queue with CP.
				1436
				1437	To dispatch a kernel the following actions are performed. This can occur in the
				1438	CPU host program, or from an HSA kernel executing on a GPU.
				1439
1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
   obtained. It must be for a kernel that is contained in a code object that
   was loaded by the ROCm runtime on the kernel agent with which the AQL queue
   is associated.
3. Space is allocated for the kernel arguments using the ROCm runtime allocator
   for a memory region with the kernarg property for the kernel agent that will
   execute the kernel. It must be at least 16 byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the *HSA Programmer's Language
   Reference* [HSA]_. For AMDGPU the kernel execution directly accesses the
   kernel argument memory in the same way constant memory is accessed. (Note
   that the HSA specification allows an implementation to copy the kernel
   argument contents to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
   API uses 64 bit atomic operations to reserve space in the AQL queue for the
   packet. The packet must be set up, and the final write must use an atomic
   store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules,
   and the layout of the AQL queue and kernel dispatch packet, are defined in
   the *HSA System Architecture Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The ROCm runtime queries on
   the kernel symbol can be used to obtain the code object values which are
   recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine code,
   the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing to execute the machine code that corresponds to the
   kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.
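The ordering rule in step 5 (reserve a slot, fill in the packet, write the packet kind last) can be illustrated with a toy, single-threaded model. This is only a sketch: ``ToyQueue`` is a hypothetical class, not the ROCm runtime API, and plain Python assignments stand in for the 64 bit atomic operations and the atomic store release that real code needs. The packet type values are assumed from the HSA specification, and the doorbell signal is omitted.

```python
import struct

PACKET_SIZE = 64                 # all AQL packets are 64 bytes
PACKET_TYPE_INVALID = 1          # assumed value from the HSA specification
PACKET_TYPE_KERNEL_DISPATCH = 2  # assumed value from the HSA specification


class ToyQueue:
    """Single-threaded stand-in for an AQL queue ring buffer."""

    def __init__(self, num_packets):
        self.num_packets = num_packets
        self.ring = bytearray(num_packets * PACKET_SIZE)
        self.write_index = 0
        # Mark every slot invalid so a consumer would ignore it.
        for slot in range(num_packets):
            struct.pack_into("<H", self.ring, slot * PACKET_SIZE,
                             PACKET_TYPE_INVALID)

    def reserve(self):
        """Reserve one slot (a 64 bit atomic add in a real runtime)."""
        index = self.write_index
        self.write_index += 1
        return index

    def publish(self, index, body, packet_type):
        """Write the packet body first, then the 16 bit header last."""
        assert len(body) == PACKET_SIZE - 2
        offset = (index % self.num_packets) * PACKET_SIZE
        self.ring[offset + 2:offset + PACKET_SIZE] = body
        # Real code must use an atomic store release for this final write.
        struct.pack_into("<H", self.ring, offset, packet_type)

    def packet_type(self, index):
        """Read back the packet type from the low 8 bits of the header."""
        offset = (index % self.num_packets) * PACKET_SIZE
        return struct.unpack_from("<H", self.ring, offset)[0] & 0xFF
```

Until ``publish`` writes the header, the slot still reads as invalid, which is why the header must be the final write.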
				1482
				1483	.. _amdgpu-amdhsa-memory-spaces:
				1484
				1485	Memory Spaces
				1486	~~~~~~~~~~~~~
				1487
				1488	The memory space properties are:
				1489
  .. table:: AMDHSA Memory Spaces
     :name: amdgpu-amdhsa-memory-spaces-table

     ================= =========== ======== ======= ==================
     Memory Space Name HSA Segment Hardware Address NULL Value
                       Name        Name     Size
     ================= =========== ======== ======= ==================
     Private           private     scratch  32      0x00000000
     Local             group       LDS      32      0xFFFFFFFF
     Global            global      global   64      0x0000000000000000
     Constant          constant    *same as 64      0x0000000000000000
                                   global*
     Generic           flat        flat     64      0x0000000000000000
     Region            N/A         GDS      32      *not implemented
                                                    for AMDHSA*
     ================= =========== ======== ======= ==================
				1506
				1507	The global and constant memory spaces both use global virtual addresses, which
				1508	are the same virtual address space used by the CPU. However, some virtual
				1509	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1510	and some by both.
				1511
				1512	Using the constant memory space indicates that the data will not change during
				1513	the execution of the kernel. This allows scalar read instructions to be
				1514	used. The vector and scalar L1 caches are invalidated of volatile data before
				1515	each kernel dispatch execution to allow constant memory to change values between
				1516	kernel dispatches.
				1517
				1518	The local memory space uses the hardware Local Data Store (LDS) which is
				1519	automatically allocated when the hardware creates work-groups of wavefronts, and
				1520	freed when all the wavefronts of a work-group have terminated. The data store
				1521	(DS) instructions can be used to access it.
				1522
				1523	The private memory space uses the hardware scratch memory support. If the kernel
				1524	uses scratch, then the hardware allocates memory that is accessed using
				1525	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1526	address to physical address is:
				1527
				1528	``wavefront-scratch-base +
				1529	(private-address * wavefront-size * 4) +
				1530	(wavefront-lane-id * 4)``
				1531
There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9.
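The private address mapping above can be written out directly. The following sketch transcribes the formula as given, taking ``private_address`` in whatever unit the formula intends. The function name and the default wavefront size of 64 are assumptions made for illustration.

```python
def scratch_physical_address(wavefront_scratch_base, private_address,
                             wavefront_lane_id, wavefront_size=64):
    """Apply the private-to-physical mapping exactly as written above."""
    return (wavefront_scratch_base
            + private_address * wavefront_size * 4
            + wavefront_lane_id * 4)


# Lanes that use the same private address land on adjacent dwords.
lane_addresses = [scratch_physical_address(0x1000, 2, lane)
                  for lane in range(4)]
```

With a base of ``0x1000`` and private address 2, consecutive lanes differ by 4 bytes, which is the adjacent-dword behavior described in the text.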
				1540
The generic address space uses the hardware flat address support available in
GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.
				1545
FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
(see :ref:`amdgpu-amdhsa-m0`).
				1552
To convert between a segment address and a flat address the base addresses of
the apertures can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue`, the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
GFX9 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.
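Because the apertures are 2^32 bytes in size and 2^32 aligned in 64 bit address mode, the conversion reduces to combining or masking bit ranges. The sketch below assumes that mode. The function names and the aperture base used in the usage note are hypothetical, and the special handling needed to map NULL values between address spaces is deliberately left out.

```python
APERTURE_SIZE = 1 << 32  # aperture size in 64 bit address mode


def segment_to_flat(aperture_base, segment_address):
    """Place a 32 bit segment address inside the given aperture."""
    assert aperture_base % APERTURE_SIZE == 0, "base must be 2^32 aligned"
    assert 0 <= segment_address < APERTURE_SIZE
    # The low 32 bits of an aligned base are zero, so OR acts as addition.
    return aperture_base | segment_address


def flat_to_segment(aperture_base, flat_address):
    """Recover the 32 bit segment address from a flat address."""
    assert aperture_base <= flat_address < aperture_base + APERTURE_SIZE
    return flat_address & (APERTURE_SIZE - 1)
```

For an illustrative aperture base of ``0x200000000000``, the segment address ``0x1234`` maps to the flat address ``0x200000001234`` and back again.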
				1561
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1562	Image and Samplers
				1563	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1564
Image and sampler handles created by the ROCm runtime are 64 bit addresses of a
hardware 32 byte V# and 48 byte S# object respectively. In order to support the
HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
enumeration values for the queries that are not trivially deducible from the S#
representation.
				1570
				1571	HSA Signals
				1572	~~~~~~~~~~~
				1573
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1574	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1575	structure allocated in memory accessible from both the CPU and GPU. The
				1576	structure is defined by the ROCm runtime and subject to change between releases
				1577	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1578
				1579	.. _amdgpu-amdhsa-hsa-aql-queue:
				1580
				1581	HSA AQL Queue
				1582	~~~~~~~~~~~~~
				1583
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1584	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1585	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1586	fields needed to implement certain language features such as the flat address
				1587	aperture bases. It also contains fields used by CP such as managing the
				1588	allocation of scratch memory.
				1589
				1590	.. _amdgpu-amdhsa-kernel-descriptor:
				1591
				1592	Kernel Descriptor
				1593	~~~~~~~~~~~~~~~~~
				1594
				1595	A kernel descriptor consists of the information needed by CP to initiate the
				1596	execution of a kernel, including the entry point address of the machine code
				1597	that implements the kernel.
				1598
				1599	Kernel Descriptor for GFX6-GFX9
				1600	+++++++++++++++++++++++++++++++
				1601
CP microcode requires the kernel descriptor to be allocated on 64 byte alignment.
				1603
				1604	.. table:: Kernel Descriptor for GFX6-GFX9
				1605	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1606
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1607	======= ======= =============================== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1608	Bits Size Field Name Description
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1609	======= ======= =============================== ============================
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1610	31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1611	address space memory
				1612	required for a work-group
				1613	in bytes. This does not
				1614	include any dynamically
				1615	allocated local address
				1616	space memory that may be
				1617	added when the kernel is
				1618	dispatched.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1619	63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1620	private address space
				1621	memory required for a
				1622	work-item in bytes. If
				1623	is_dynamic_callstack is 1
				1624	then additional space must
				1625	be added to this value for
				1626	the call stack.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1627	127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1628	191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1629	negative) from base
				1630	address of kernel
				1631	descriptor to kernel's
				1632	entry point instruction
				1633	which must be 256 byte
				1634	aligned.
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1635	383:192 24 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1636	bytes
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1637	415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1638	program settings used by
				1639	CP to set up
				1640	``COMPUTE_PGM_RSRC1``
				1641	configuration
				1642	register. See
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1643	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1644	447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1645	program settings used by
				1646	CP to set up
				1647	``COMPUTE_PGM_RSRC2``
				1648	configuration
				1649	register. See
				1650	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1651	448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
				1652	_BUFFER SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1653	(see
				1654	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1655
				1656	The total number of SGPR
				1657	user data registers
				1658	requested must not exceed
				1659	16 and match value in
				1660	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1661	Any requests beyond 16
				1662	will be ignored.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1663	449 1 bit ENABLE_SGPR_DISPATCH_PTR see above
				1664	450 1 bit ENABLE_SGPR_QUEUE_PTR see above
				1665	451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR see above
				1666	452 1 bit ENABLE_SGPR_DISPATCH_ID see above
				1667	453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT see above
				1668	454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT see above
				1669	_SIZE
				1670	455 1 bit ENABLE_SGPR_GRID_WORKGROUP Not implemented in CP and
				1671	_COUNT_X should always be 0.
				1672	456 1 bit ENABLE_SGPR_GRID_WORKGROUP Not implemented in CP and
				1673	_COUNT_Y should always be 0.
				1674	457 1 bit ENABLE_SGPR_GRID_WORKGROUP Not implemented in CP and
				1675	_COUNT_Z should always be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	1676	463:458 6 bits Reserved, must be 0.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1677	511:464 6 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1678	bytes
				1679	512 Total size 64 bytes.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1680	======= ====================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1681
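The bit ranges in the kernel descriptor table translate directly into byte offsets, so the 64 byte descriptor can be decoded with a fixed layout. The following is a minimal sketch, assuming a little-endian byte order. The names ``KERNEL_DESCRIPTOR`` and ``parse_kernel_descriptor`` are hypothetical helpers, not part of LLVM or the ROCm runtime.

```python
import struct

# "<" assumes little-endian data; "x" pad bytes skip the reserved ranges.
# Layout: bits 31:0, 63:32, 127:64 (reserved), 191:128, 383:192 (reserved),
# 415:384, 447:416, 463:448 (enable bits), 511:464 (reserved).
KERNEL_DESCRIPTOR = struct.Struct("<II8xq24xIIH6x")

# Names of the ENABLE_SGPR_* bits 448..454, in bit order.
ENABLE_SGPR_BITS = (
    "PRIVATE_SEGMENT_BUFFER",
    "DISPATCH_PTR",
    "QUEUE_PTR",
    "KERNARG_SEGMENT_PTR",
    "DISPATCH_ID",
    "FLAT_SCRATCH_INIT",
    "PRIVATE_SEGMENT_SIZE",
)


def parse_kernel_descriptor(data):
    """Decode a 64 byte kernel descriptor into a dictionary."""
    (group_size, private_size, entry_offset,
     rsrc1, rsrc2, enables) = KERNEL_DESCRIPTOR.unpack(data)
    return {
        "GROUP_SEGMENT_FIXED_SIZE": group_size,
        "PRIVATE_SEGMENT_FIXED_SIZE": private_size,
        # Signed, since the table notes the offset may be negative.
        "KERNEL_CODE_ENTRY_BYTE_OFFSET": entry_offset,
        "COMPUTE_PGM_RSRC1": rsrc1,
        "COMPUTE_PGM_RSRC2": rsrc2,
        "ENABLE_SGPR": [name for bit, name in enumerate(ENABLE_SGPR_BITS)
                        if (enables >> bit) & 1],
    }
```

``KERNEL_DESCRIPTOR.size`` evaluates to 64, matching the total size given in the last row of the table.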
				1682	..
				1683
				1684	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1685	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1686
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1687	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1688	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1689	======= ======= =============================== ===========================================================================
     5:0     6 bits  GRANULATED_WORKITEM_VGPR_COUNT  Number of vector registers
                                                     used by each work-item,
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX9
                                                       - max_vgpr 1..256
                                                       - roundup((max_vgpr + 1)
                                                         / 4) - 1

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.VGPRS``.
     9:6     4 bits  GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
                                                     used by a wavefront,
                                                     granularity is device
                                                     specific:

                                                     GFX6-GFX8
                                                       - max_sgpr 1..112
                                                       - roundup((max_sgpr + 1)
                                                         / 8) - 1
                                                     GFX9
                                                       - max_sgpr 1..112
                                                       - roundup((max_sgpr + 1)
                                                         / 16) - 1

                                                     Includes the special SGPRs
                                                     for VCC, Flat Scratch (for
                                                     GFX7 onwards) and XNACK
                                                     (for GFX8 onwards). It does
                                                     not include the 16 SGPR
                                                     added if a trap handler is
                                                     enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1726	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1727
				1728	Start executing wavefront
				1729	at the specified priority.
				1730
				1731	CP is responsible for
				1732	filling in
				1733	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1734	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1735	with specified rounding
				1736	mode for single (32
				1737	bit) floating point
				1738	precision floating point
				1739	operations.
				1740
				1741	Floating point rounding
				1742	mode values are defined in
				1743	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1744
				1745	Used by CP to set up
				1746	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
     15:14   2 bits  FLOAT_ROUND_MODE_16_64          Wavefront starts execution
                                                     with specified rounding
                                                     mode for half/double (16
                                                     and 64 bit) precision
                                                     floating point
                                                     operations.

                                                     Floating point rounding
                                                     mode values are defined in
                                                     :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1760	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1761	with specified denorm mode
				1762	for single (32
				1763	bit) floating point
				1764	precision floating point
				1765	operations.
				1766
				1767	Floating point denorm mode
				1768	values are defined in
				1769	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1770
				1771	Used by CP to set up
				1772	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1773	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1774	with specified denorm mode
				1775	for half/double (16
				1776	and 64 bit) floating point
				1777	precision floating point
				1778	operations.
				1779
				1780	Floating point denorm mode
				1781	values are defined in
				1782	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1783
				1784	Used by CP to set up
				1785	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1786	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1787
				1788	Start executing wavefront
				1789	in privilege trap handler
				1790	mode.
				1791
				1792	CP is responsible for
				1793	filling in
				1794	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1795	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1796	with DX10 clamp mode
				1797	enabled. Used by the vector
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1798	ALU to force DX10 style
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1799	treatment of NaN's (when
				1800	set, clamp NaN to zero,
				1801	otherwise pass NaN
				1802	through).
				1803
				1804	Used by CP to set up
				1805	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1806	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1807
				1808	Start executing wavefront
				1809	in single step mode.
				1810
				1811	CP is responsible for
				1812	filling in
				1813	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1814	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1815	with IEEE mode
				1816	enabled. Floating point
				1817	opcodes that support
				1818	exception flag gathering
				1819	will quiet and propagate
				1820	signaling-NaN inputs per
				1821	IEEE 754-2008. Min_dx10 and
				1822	max_dx10 become IEEE
				1823	754-2008 compliant due to
				1824	signaling-NaN propagation
				1825	and quieting.
				1826
				1827	Used by CP to set up
				1828	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1829	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1830
				1831	Only one work-group allowed
				1832	to execute on a compute
				1833	unit.
				1834
				1835	CP is responsible for
				1836	filling in
				1837	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1838	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1839
				1840	Flag that can be used to
				1841	control debugging code.
				1842
				1843	CP is responsible for
				1844	filling in
				1845	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1846	26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1847	Reserved, must be 0.
				1848	GFX9
				1849	Wavefront starts execution
				1850	with specified fp16 overflow
				1851	mode.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1852
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1853	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1854	+/-INF values.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1855	- If 1, fp16 overflow that is the
				1856	result of an +/-INF input value
				1857	or divide by 0 produces a +/-INF,
				1858	otherwise clamps computed
				1859	overflow to +/-MAX_FP16 as
				1860	appropriate.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1861
				1862	Used by CP to set up
				1863	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1864	31:27 5 bits Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1865	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1866	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1867
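The bit positions in the ``compute_pgm_rsrc1`` table can be used to split a 32 bit word into its named fields. The sketch below transcribes the ``Bits`` column. ``RSRC1_FIELDS`` and ``decode_rsrc1`` are hypothetical names chosen for illustration.

```python
# Field name -> (high bit, low bit), taken from the Bits column above.
RSRC1_FIELDS = {
    "GRANULATED_WORKITEM_VGPR_COUNT": (5, 0),
    "GRANULATED_WAVEFRONT_SGPR_COUNT": (9, 6),
    "PRIORITY": (11, 10),
    "FLOAT_ROUND_MODE_32": (13, 12),
    "FLOAT_ROUND_MODE_16_64": (15, 14),
    "FLOAT_DENORM_MODE_32": (17, 16),
    "FLOAT_DENORM_MODE_16_64": (19, 18),
    "PRIV": (20, 20),
    "ENABLE_DX10_CLAMP": (21, 21),
    "DEBUG_MODE": (22, 22),
    "ENABLE_IEEE_MODE": (23, 23),
    "BULKY": (24, 24),
    "CDBG_USER": (25, 25),
    "FP16_OVFL": (26, 26),
}


def decode_rsrc1(word):
    """Split a 32 bit compute_pgm_rsrc1 value into its named fields."""
    fields = {}
    for name, (high, low) in RSRC1_FIELDS.items():
        width = high - low + 1
        fields[name] = (word >> low) & ((1 << width) - 1)
    return fields
```

For instance, a word built as ``6 | (3 << 6) | (1 << 21) | (1 << 23)`` decodes to a granulated VGPR count of 6, a granulated SGPR count of 3, and both ``ENABLE_DX10_CLAMP`` and ``ENABLE_IEEE_MODE`` set.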
				1868	..
				1869
				1870	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1871	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1872
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1873	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1874	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1875	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1876	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1877	_WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1878	system register (see
				1879	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1880
				1881	Used by CP to set up
				1882	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1883	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1884	user data registers
				1885	requested. This number must
				1886	match the number of user
				1887	data registers enabled.
				1888
				1889	Used by CP to set up
				1890	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	2ca6b1f	2018-05-29 19:09:13 +0000	[diff] [blame]	1891	6 1 bit ENABLE_TRAP_HANDLER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1892
Konstantin Zhuravlyov	2ca6b1f	2018-05-29 19:09:13 +0000	[diff] [blame]	1893	This bit represents
				1894	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
				1895	which is set by the CP if
				1896	the runtime has installed a
				1897	trap handler.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1898	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1899	system SGPR register for
				1900	the work-group id in the X
				1901	dimension (see
				1902	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1903
				1904	Used by CP to set up
				1905	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1906	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1907	system SGPR register for
				1908	the work-group id in the Y
				1909	dimension (see
				1910	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1911
				1912	Used by CP to set up
				1913	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1914	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1915	system SGPR register for
				1916	the work-group id in the Z
				1917	dimension (see
				1918	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1919
				1920	Used by CP to set up
				1921	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1922	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1923	system SGPR register for
				1924	work-group information (see
				1925	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1926
				1927	Used by CP to set up
				1928	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1929	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1930	VGPR system registers used
				1931	for the work-item ID.
				1932	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1933	defines the values.
				1934
				1935	Used by CP to set up
				1936	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1937	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1938
				1939	Wavefront starts execution
				1940	with address watch
				1941	exceptions enabled which
				1942	are generated when L1 has
				1943	witnessed a thread access
				1944	an *address of
				1945	interest*.
				1946
				1947	CP is responsible for
				1948	filling in the address
				1949	watch bit in
				1950	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1951	according to what the
				1952	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1953	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1954
				1955	Wavefront starts execution
				1956	with memory violation
				1957	exceptions
				1958	enabled which are generated
				1959	when a memory violation has
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1960	occurred for this wavefront from
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1961	L1 or LDS
				1962	(write-to-read-only-memory,
				1963	mis-aligned atomic, LDS
				1964	address out of range,
				1965	illegal address, etc.).
				1966
				1967	CP sets the memory
				1968	violation bit in
				1969	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1970	according to what the
				1971	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1972	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1973
				1974	CP uses the rounded value
				1975	from the dispatch packet,
				1976	not this value, as the
				1977	dispatch may contain
				1978	dynamically allocated group
				1979	segment memory. CP writes
				1980	directly to
				1981	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1982
				1983	Amount of group segment
				1984	(LDS) to allocate for each
				1985	work-group. Granularity is
				1986	device specific:
				1987
				1988	GFX6:
				1989	roundup(lds-size / (64 * 4))
				1990	GFX7-GFX9:
				1991	roundup(lds-size / (128 * 4))
				1992
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1993	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				1994	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1995	enabled.
				1996
				1997	Used by CP to set up
				1998	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1999	(set from bits 0..6).
				2000
				2001	IEEE 754 FP Invalid
				2002	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2003	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				2004	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2005	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2006	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				2007	_DIVISION_BY_ZERO Zero
				2008	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Overflow
				2009	_OVERFLOW
				2010	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				2011	_UNDERFLOW
				2012	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				2013	_INEXACT
				2014	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				2015	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2016	only)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2017	31 1 bit Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2018	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	2019	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2020
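The bit positions in the table above can be exercised with a small script. The following is a minimal sketch, not part of the backend: the helper names ``encode_compute_pgm_rsrc2``, ``decode_compute_pgm_rsrc2`` and ``granulated_lds_size`` are hypothetical, and the field list is transcribed from the table (two-line field names are joined). Note that the kernel descriptor requires ``GRANULATED_LDS_SIZE`` to be 0; the granularity helper only illustrates the rounding that the table describes.

```python
# Bit layout of compute_pgm_rsrc2 for GFX6-GFX9, transcribed from the table
# above as (field name, lowest bit, width in bits). Bit 31 is reserved.
COMPUTE_PGM_RSRC2_FIELDS = [
    ("ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET", 0, 1),
    ("USER_SGPR_COUNT", 1, 5),
    ("ENABLE_TRAP_HANDLER", 6, 1),
    ("ENABLE_SGPR_WORKGROUP_ID_X", 7, 1),
    ("ENABLE_SGPR_WORKGROUP_ID_Y", 8, 1),
    ("ENABLE_SGPR_WORKGROUP_ID_Z", 9, 1),
    ("ENABLE_SGPR_WORKGROUP_INFO", 10, 1),
    ("ENABLE_VGPR_WORKITEM_ID", 11, 2),
    ("ENABLE_EXCEPTION_ADDRESS_WATCH", 13, 1),
    ("ENABLE_EXCEPTION_MEMORY", 14, 1),
    ("GRANULATED_LDS_SIZE", 15, 9),
    ("ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION", 24, 1),
    ("ENABLE_EXCEPTION_FP_DENORMAL_SOURCE", 25, 1),
    ("ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO", 26, 1),
    ("ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW", 27, 1),
    ("ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW", 28, 1),
    ("ENABLE_EXCEPTION_IEEE_754_FP_INEXACT", 29, 1),
    ("ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO", 30, 1),
]


def encode_compute_pgm_rsrc2(**fields):
    """Pack named field values into a 32 bit word (hypothetical helper)."""
    layout = {name: (low, width) for name, low, width in COMPUTE_PGM_RSRC2_FIELDS}
    word = 0
    for name, value in fields.items():
        low, width = layout[name]  # KeyError for names that are not in the table
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
        word |= value << low
    return word


def decode_compute_pgm_rsrc2(word):
    """Split a 32 bit word into a dict of field values (hypothetical helper)."""
    return {
        name: (word >> low) & ((1 << width) - 1)
        for name, low, width in COMPUTE_PGM_RSRC2_FIELDS
    }


def granulated_lds_size(lds_size_bytes, gfx_major):
    """roundup(lds-size / granule): 64 dwords on GFX6, 128 dwords on GFX7-GFX9."""
    if gfx_major not in (6, 7, 8, 9):
        raise ValueError("only GFX6-GFX9 are covered by the table")
    granule_bytes = (64 if gfx_major == 6 else 128) * 4
    return (lds_size_bytes + granule_bytes - 1) // granule_bytes


word = encode_compute_pgm_rsrc2(
    USER_SGPR_COUNT=6, ENABLE_SGPR_WORKGROUP_ID_X=1, ENABLE_VGPR_WORKITEM_ID=2
)
print(hex(word), decode_compute_pgm_rsrc2(word)["USER_SGPR_COUNT"])
```

Keeping the layout in a single list lets the encoder and decoder share one definition, so a transcription mistake shows up in both directions.
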
				2021	..
				2022
				2023	.. table:: Floating Point Rounding Mode Enumeration Values
				2024	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				2025
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2026	====================================== ===== ==============================
				2027	Enumeration Name Value Description
				2028	====================================== ===== ==============================
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2029	FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				2030	FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				2031	FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				2032	FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2033	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2034
				2035	..
				2036
				2037	.. table:: Floating Point Denorm Mode Enumeration Values
				2038	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				2039
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2040	====================================== ===== ==============================
				2041	Enumeration Name Value Description
				2042	====================================== ===== ==============================
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2043	FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2044	Denorms
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2045	FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				2046	FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				2047	FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2048	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2049
				2050	..
				2051
				2052	.. table:: System VGPR Work-Item ID Enumeration Values
				2053	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				2054
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2055	======================================== ===== ============================
				2056	Enumeration Name Value Description
				2057	======================================== ===== ============================
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2058	SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2059	ID.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2060	SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2061	dimensions ID.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2062	SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2063	dimensions ID.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2064	SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2065	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2066
				2067	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				2068
				2069	Initial Kernel Execution State
				2070	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				2071
				2072	This section defines the register state that will be set up by the packet
				2073	processor prior to the start of execution of every wavefront. This is limited by
				2074	the constraints of the hardware controllers of CP/ADC/SPI.
				2075
				2076	The order of the SGPR registers is defined, but the compiler can specify which
				2077	ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
				2078	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2079	for enabled registers are dense starting at SGPR0: the first enabled register is
				2080	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				2081	an SGPR number.
				2082
				2083	The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2084	all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2085	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				2086	initialized. These are then immediately followed by the System SGPRs that are
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2087	set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2088	dispatch.
				2089
				2090	SGPR register initial state is defined in
				2091	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				2092
				2093	.. table:: SGPR Register Set Up Order
				2094	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				2095
				2096	========== ========================== ====== ==============================
				2097	SGPR Order Name Number Description
				2098	(kernel descriptor enable of
				2099	field) SGPRs
				2100	========== ========================== ====== ==============================
				2101	First Private Segment Buffer 4 V# that can be used, together
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2102	(enable_sgpr_private with Scratch Wavefront Offset
				2103	_segment_buffer) as an offset, to access the
				2104	private memory space using a
				2105	segment address.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2106
				2107	CP uses the value provided by
				2108	the runtime.
				2109	then Dispatch Ptr 2 64 bit address of AQL dispatch
				2110	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				2111	actually executing.
				2112	then Queue Ptr 2 64 bit address of amd_queue_t
				2113	(enable_sgpr_queue_ptr) object for AQL queue on which
				2114	the dispatch packet was
				2115	queued.
				2116	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				2117	(enable_sgpr_kernarg segment. This is directly
				2118	_segment_ptr) copied from the
				2119	kernarg_address in the kernel
				2120	dispatch packet.
				2121
				2122	Having CP load it once avoids
				2123	loading it at the beginning of
				2124	every wavefront.
				2125	then Dispatch Id 2 64 bit Dispatch ID of the
				2126	(enable_sgpr_dispatch_id) dispatch packet being
				2127	executed.
				2128	then Flat Scratch Init 2 This is 2 SGPRs:
				2129	(enable_sgpr_flat_scratch
				2130	_init) GFX6
				2131	Not supported.
				2132	GFX7-GFX8
				2133	The first SGPR is a 32 bit
				2134	byte offset from
				2135	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2136	to per SPI base of memory
				2137	for scratch for the queue
				2138	executing the kernel
				2139	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2140	from the runtime. (The
				2141	Scratch Segment Buffer base
				2142	address is
				2143	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2144	plus this offset.) The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2145	of Scratch Wavefront Offset must
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2146	be added to this offset by
				2147	the kernel machine code,
				2148	right shifted by 8, and
				2149	moved to the FLAT_SCRATCH_HI
				2150	SGPR register.
				2151	FLAT_SCRATCH_HI corresponds
				2152	to SGPRn-4 on GFX7, and
				2153	SGPRn-6 on GFX8 (where SGPRn
				2154	is the highest numbered SGPR
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2155	allocated to the wavefront).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2156	FLAT_SCRATCH_HI is
				2157	multiplied by 256 (as it is
				2158	in units of 256 bytes) and
				2159	added to
				2160	``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2161	to calculate the per wavefront
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2162	FLAT SCRATCH BASE in flat
				2163	memory instructions that
				2164	access the scratch
				2165	aperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2166
				2167	The second SGPR is 32 bit
				2168	byte size of a single
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2169	work-item's scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2170	usage. CP obtains this from
				2171	the runtime, and it is
				2172	always a multiple of DWORD.
				2173	CP checks that the value in
				2174	the kernel dispatch packet
				2175	Private Segment Byte Size is
				2176	not larger, and requests the
				2177	runtime to increase the
				2178	queue's scratch size if
				2179	necessary. The kernel code
				2180	must move it to
				2181	FLAT_SCRATCH_LO which is
				2182	SGPRn-3 on GFX7 and SGPRn-5
				2183	on GFX8. FLAT_SCRATCH_LO is
				2184	used as the FLAT SCRATCH
				2185	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2186	instructions. Having CP load
				2187	it once avoids loading it at
				2188	the beginning of every
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2189	wavefront.
				2190	GFX9
				2191	This is the
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2192	64 bit base address of the
				2193	per SPI scratch backing
				2194	memory managed by SPI for
				2195	the queue executing the
				2196	kernel dispatch. CP obtains
				2197	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2198	divides it if there are
				2199	multiple Shader Arrays each
				2200	with its own SPI). The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2201	of Scratch Wavefront Offset must
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2202	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2203	machine code and the result
				2204	moved to the FLAT_SCRATCH
				2205	SGPR which is SGPRn-6 and
				2206	SGPRn-5. It is used as the
				2207	FLAT SCRATCH BASE in flat
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2208	memory instructions.
				2209	then Private Segment Size 1 The 32 bit byte size of a
				2210	(enable_sgpr_private single
				2211	_segment_size) work-item's
				2212	scratch memory
				2213	allocation. This is the
				2214	value from the kernel
				2215	dispatch packet Private
				2216	Segment Byte Size rounded up
				2217	by CP to a multiple of
				2218	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2219
				2220	Having CP load it once avoids
				2221	loading it at the beginning of
				2222	every wavefront.
				2223
				2224	This is not used for
				2225	GFX7-GFX8 since it is the same
				2226	value as the second SGPR of
				2227	Flat Scratch Init. However, it
				2228	may be needed for GFX9 which
				2229	changes the meaning of the
				2230	Flat Scratch Init value.
				2231	then Grid Work-Group Count X 1 32 bit count of the number of
				2232	(enable_sgpr_grid work-groups in the X dimension
				2233	_workgroup_count_X) for the grid being
				2234	executed. Computed from the
				2235	fields in the kernel dispatch
				2236	packet as ((grid_size.x +
				2237	workgroup_size.x - 1) /
				2238	workgroup_size.x).
				2239	then Grid Work-Group Count Y 1 32 bit count of the number of
				2240	(enable_sgpr_grid work-groups in the Y dimension
				2241	_workgroup_count_Y && for the grid being
				2242	less than 16 previous executed. Computed from the
				2243	SGPRs) fields in the kernel dispatch
				2244	packet as ((grid_size.y +
				2245	workgroup_size.y - 1) /
				2246	workgroup_size.y).
				2247
				2248	Only initialized if <16
				2249	previous SGPRs initialized.
				2250	then Grid Work-Group Count Z 1 32 bit count of the number of
				2251	(enable_sgpr_grid work-groups in the Z dimension
				2252	_workgroup_count_Z && for the grid being
				2253	less than 16 previous executed. Computed from the
				2254	SGPRs) fields in the kernel dispatch
				2255	packet as ((grid_size.z +
				2256	workgroup_size.z - 1) /
				2257	workgroup_size.z).
				2258
				2259	Only initialized if <16
				2260	previous SGPRs initialized.
				2261	then Work-Group Id X 1 32 bit work-group id in X
				2262	(enable_sgpr_workgroup_id dimension of grid for
				2263	_X) wavefront.
				2264	then Work-Group Id Y 1 32 bit work-group id in Y
				2265	(enable_sgpr_workgroup_id dimension of grid for
				2266	_Y) wavefront.
				2267	then Work-Group Id Z 1 32 bit work-group id in Z
				2268	(enable_sgpr_workgroup_id dimension of grid for
				2269	_Z) wavefront.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2270	then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2271	(enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2272	_info) threadgroup_size_in_wavefronts[5:0]}
				2273	then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2274	(enable_sgpr_private of scratch base of queue
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2275	_segment_wavefront_offset) executing the kernel
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2276	dispatch. Must be used as an
				2277	offset with Private
				2278	segment address when using
				2279	Scratch Segment Buffer. It
				2280	must be used to set up FLAT
				2281	SCRATCH for flat addressing
				2282	(see
				2283	:ref:`amdgpu-amdhsa-flat-scratch`).
				2284	========== ========================== ====== ==============================
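
The dense numbering rule for the SGPRs in the table above can be sketched as follows. This is an illustration only: the function name ``assign_initial_sgprs`` is hypothetical, the field names are transcribed from the table, and the sketch deliberately rejects configurations that request more than 16 User SGPRs instead of modelling which registers are left uninitialized in that case.

```python
# (kernel descriptor enable field, number of SGPRs), in the set-up order of
# the table above. User SGPRs are set by CP, System SGPRs by ADC/SPI.
USER_SGPRS = [
    ("enable_sgpr_private_segment_buffer", 4),
    ("enable_sgpr_dispatch_ptr", 2),
    ("enable_sgpr_queue_ptr", 2),
    ("enable_sgpr_kernarg_segment_ptr", 2),
    ("enable_sgpr_dispatch_id", 2),
    ("enable_sgpr_flat_scratch_init", 2),
    ("enable_sgpr_private_segment_size", 1),
    ("enable_sgpr_grid_workgroup_count_X", 1),
    ("enable_sgpr_grid_workgroup_count_Y", 1),
    ("enable_sgpr_grid_workgroup_count_Z", 1),
]
SYSTEM_SGPRS = [
    ("enable_sgpr_workgroup_id_X", 1),
    ("enable_sgpr_workgroup_id_Y", 1),
    ("enable_sgpr_workgroup_id_Z", 1),
    ("enable_sgpr_workgroup_info", 1),
    ("enable_sgpr_private_segment_wavefront_offset", 1),
]
MAX_USER_SGPRS = 16


def assign_initial_sgprs(enabled):
    """Return ({field: (first SGPR, count)}, user SGPR count) for enabled fields.

    Simplification for this sketch: more than 16 User SGPRs is rejected.
    """
    known = {field for field, _ in USER_SGPRS + SYSTEM_SGPRS}
    unknown = set(enabled) - known
    if unknown:
        raise KeyError(f"unknown enable fields: {sorted(unknown)}")

    layout = {}
    next_sgpr = 0
    # Enabled registers are numbered densely from SGPR0; disabled ones get none.
    for field, count in USER_SGPRS:
        if field in enabled:
            layout[field] = (next_sgpr, count)
            next_sgpr += count
    if next_sgpr > MAX_USER_SGPRS:
        raise ValueError("more than 16 User SGPRs requested (not modelled here)")
    user_sgpr_count = next_sgpr

    # System SGPRs immediately follow the User SGPRs.
    for field, count in SYSTEM_SGPRS:
        if field in enabled:
            layout[field] = (next_sgpr, count)
            next_sgpr += count
    return layout, user_sgpr_count


layout, user_count = assign_initial_sgprs({
    "enable_sgpr_private_segment_buffer",
    "enable_sgpr_kernarg_segment_ptr",
    "enable_sgpr_workgroup_id_X",
    "enable_sgpr_private_segment_wavefront_offset",
})
for field, (first, count) in layout.items():
    print(f"SGPR{first}..SGPR{first + count - 1}: {field}")
```

In this example the kernarg segment pointer lands in SGPR4-SGPR5 because the dispatch and queue pointers are disabled and therefore occupy no register numbers.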
				2285
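Two of the values in the SGPR table are computed or packed rather than copied. The sketch below shows the rounding-up division used for the grid work-group counts and one reading of the Work-Group Info word. The function names are hypothetical, and the bit positions assume the braces in the table use Verilog-style concatenation with the most significant field first (``first_wavefront`` in bit 31, 14 zero bits, an 11 bit ``ordered_append_term``, then a 6 bit ``threadgroup_size_in_wavefronts``).

```python
def grid_workgroup_count(grid_size, workgroup_size):
    """Number of work-groups in one dimension, rounding up partial groups."""
    return (grid_size + workgroup_size - 1) // workgroup_size


def decode_workgroup_info(value):
    """Unpack the Work-Group Info SGPR (assumed MSB-first field order)."""
    return {
        "first_wavefront": (value >> 31) & 0x1,
        "ordered_append_term": (value >> 6) & 0x7FF,
        "threadgroup_size_in_wavefronts": value & 0x3F,
    }


# A grid of 1000 work-items with 256 work-items per work-group needs a
# partial fourth work-group.
print(grid_workgroup_count(1000, 256))
print(decode_workgroup_info((1 << 31) | (5 << 6) | 4))
```
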
				2286	The order of the VGPR registers is defined, but the compiler can specify which
				2287	ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
				2288	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2289	for enabled registers are dense starting at VGPR0: the first enabled register is
				2290	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2291	VGPR number.
				2292
				2293	VGPR register initial state is defined in
				2294	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2295
				2296	.. table:: VGPR Register Set Up Order
				2297	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2298
				2299	========== ========================== ====== ==============================
				2300	VGPR Order Name Number Description
				2301	(kernel descriptor enable of
				2302	field) VGPRs
				2303	========== ========================== ====== ==============================
				2304	First Work-Item Id X 1 32 bit work item id in X
				2305	(Always initialized) dimension of work-group for
				2306	wavefront lane.
				2307	then Work-Item Id Y 1 32 bit work item id in Y
				2308	(enable_vgpr_workitem_id dimension of work-group for
				2309	> 0) wavefront lane.
				2310	then Work-Item Id Z 1 32 bit work item id in Z
				2311	(enable_vgpr_workitem_id dimension of work-group for
				2312	> 1) wavefront lane.
				2313	========== ========================== ====== ==============================
				2314
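The relationship between the ``enable_vgpr_workitem_id`` value and the initialized VGPRs can be sketched as below. The function name ``initial_vgprs`` is hypothetical; the mapping follows the VGPR table above together with the System VGPR Work-Item ID enumeration, in which value 3 is undefined.

```python
def initial_vgprs(enable_vgpr_workitem_id):
    """Map the 2 bit enable value to the VGPRs that are set up at wavefront start."""
    if enable_vgpr_workitem_id not in (0, 1, 2):
        raise ValueError("value 3 (SYSTEM_VGPR_WORKITEM_ID_UNDEFINED) is undefined")
    names = ["Work-Item Id X", "Work-Item Id Y", "Work-Item Id Z"]
    # X is always initialized; Y needs a value > 0 and Z needs a value > 1.
    return {f"VGPR{i}": names[i] for i in range(enable_vgpr_workitem_id + 1)}


for value in (0, 1, 2):
    print(value, initial_vgprs(value))
```
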
Hiroshi Inoue	bcadfee	2018-04-12 05:53:20 +0000	[diff] [blame]	2315	The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2316
				2317	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2318	registers.
				2319	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2320	combination including none.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2321	3. Scratch Wavefront Offset is set by SPI on a per wavefront basis, which is why
				2322	its value cannot be included with the flat scratch init value, which is per queue.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2323	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2324	or (X, Y, Z).
				2325
				2326	The Flat Scratch register pair are adjacent SGPRs so they can be moved as a 64 bit
				2327	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2328
				2329	The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2330	has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2331	instructions (GFX9).
				2332
				2333	If buffer operations are used then the compiler can generate a V# with the
				2334	following properties:
				2335
				2336	* base address of 0
				2337	* no swizzle
				2338	* ATC: 1 if IOMMU present (such as APU)
				2339	* ptr64: 1
				2340	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2341	APU and NC for dGPU).
				2342
				2343	.. _amdgpu-amdhsa-kernel-prolog:
				2344
				2345	Kernel Prolog
				2346	~~~~~~~~~~~~~
				2347
				2348	.. _amdgpu-amdhsa-m0:
				2349
				2350	M0
				2351	++
				2352
				2353	GFX6-GFX8
				2354	The M0 register must be initialized with a value at least the total LDS size
				2355	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2356	available in the dispatch packet. For M0, it is also possible to use the maximum
				2357	possible value of LDS for the given target (0x7FFF for GFX6 and 0xFFFF for
				2358	GFX7-GFX8).
				2359	GFX9
				2360	The M0 register is not used for range checking LDS accesses and so does not
				2361	need to be initialized in the prolog.
				2362
				2363	.. _amdgpu-amdhsa-flat-scratch:
				2364
				2365	Flat Scratch
				2366	++++++++++++
				2367
				2368	If the kernel may use flat operations to access scratch memory, the prolog code
				2369	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2370	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2371	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2372
				2373	GFX6
				2374	Flat scratch is not supported.
				2375
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2376	GFX7-GFX8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2377	1. The low word of Flat Scratch Init is the 32 bit byte offset from
				2378	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2379	being managed by SPI for the queue executing the kernel dispatch. This is
				2380	the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2381	prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2382	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2383	FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right shifted
				2384	by 8 before moving into FLAT_SCRATCH_HI.
				2385	2. The second word of Flat Scratch Init is the 32 bit byte size of a single
				2386	work-item's scratch memory usage. This is directly loaded from the kernel
				2387	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2388	DWORD. Having CP load it once avoids loading it at the beginning of every
				2389	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2390	SIZE.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2391
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2392	GFX9
				2393	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2394	memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2395	prolog must add the value of Scratch Wavefront Offset and move the result to the FLAT_SCRATCH
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2396	pair for use as the flat scratch base in flat memory instructions.
				2397
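The arithmetic the prolog performs can be modelled as below. This is a sketch under stated assumptions, not the code the backend emits: the function names are hypothetical, the GFX7-GFX8 register assignment follows the SGPR Register Set Up Order table (shifted offset to FLAT_SCRATCH_HI, size to FLAT_SCRATCH_LO), the 32 bit addition is assumed to wrap as a scalar add would, and for GFX9 the low half of the 64 bit result is assumed to go to FLAT_SCRATCH_LO.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def flat_scratch_gfx7_gfx8(init_offset, init_size, scratch_wavefront_offset):
    """Model the GFX7-GFX8 prolog.

    init_offset: first SGPR of Flat Scratch Init (byte offset from
                 SH_HIDDEN_PRIVATE_BASE_VIMID).
    init_size:   second SGPR of Flat Scratch Init (per work-item byte size).
    """
    # Add the per wavefront offset, then convert bytes to 256 byte units.
    wave_byte_offset = (init_offset + scratch_wavefront_offset) & MASK32
    return {
        "FLAT_SCRATCH_HI": wave_byte_offset >> 8,
        "FLAT_SCRATCH_LO": init_size & MASK32,
    }


def flat_scratch_gfx9(init_base, scratch_wavefront_offset):
    """Model the GFX9 prolog: 64 bit base plus the per wavefront offset."""
    base = (init_base + scratch_wavefront_offset) & MASK64
    return {
        "FLAT_SCRATCH_LO": base & MASK32,  # assumed: low half of the pair
        "FLAT_SCRATCH_HI": base >> 32,     # assumed: high half of the pair
    }


print(flat_scratch_gfx7_gfx8(0x10000, 64, 0x2000))
print(flat_scratch_gfx9(0xFFFFFF00, 0x200))
```

The GFX9 example is chosen so that the addition carries out of the low 32 bits, which is why the pair has to be treated as a single 64 bit value.
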
				2398	.. _amdgpu-amdhsa-memory-model:
				2399
				2400	Memory Model
				2401	~~~~~~~~~~~~
				2402
				2403	This section describes the mapping of the LLVM memory model onto AMDGPU machine code
				2404	(see :ref:`memmodel`). The implementation is WIP.
				2405
				2406	.. TODO
				2407	Update when implementation complete.
				2408
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2409	The AMDGPU backend supports the memory synchronization scopes specified in
				2410	:ref:`amdgpu-memory-scopes`.
				2411
				2412	The code sequences used to implement the memory model are defined in table
				2413	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2414
				2415	The sequences specify the order of instructions that a single thread must
				2416	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2417	to other memory instructions executed by the same thread. This allows them to be
				2418	moved earlier or later which can allow them to be combined with other instances
				2419	of the same instruction, or hoisted/sunk out of loops to improve
				2420	performance. Only the instructions related to the memory model are given;
				2421	additional ``s_waitcnt`` instructions are required to ensure registers are
				2422	defined before being used. These may be able to be combined with the memory
				2423	model ``s_waitcnt`` instructions as described above.
				2424
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2425	The AMDGPU backend supports the following memory models:
				2426
				2427	HSA Memory Model [HSA]_
				2428	The HSA memory model uses a single happens-before relation for all address
				2429	spaces (see :ref:`amdgpu-address-spaces`).
				2430	OpenCL Memory Model [OpenCL]_
				2431	The OpenCL memory model has separate happens-before relations for the
				2432	global and local address spaces. Only a fence specifying both global and
				2433	local address space, and seq_cst instructions join the relationships. Since
				2434	the LLVM ``memfence`` instruction does not allow an address space to be
				2435	specified the OpenCL fence has to conservatively assume both local and
				2436	global address space was specified. However, optimizations can often be
				2437	done to eliminate the additional ``s_waitcnt`` instructions when there are
				2438	no intervening memory instructions which access the corresponding address
				2439	space. The code sequences in the table indicate what can be omitted for the
				2440	OpenCL memory model. The target triple environment is used to determine if the
				2441	source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2442
				2443	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2444	operations.
				2445
				2446	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2447	termed vector memory operations.
				2448
				2449	For GFX6-GFX9:
				2450
				2451	* Each agent has multiple compute units (CU).
				2452	* Each CU has multiple SIMDs that execute wavefronts.
				2453	* The wavefronts for a single work-group are executed in the same CU but may be
				2454	executed by different SIMDs.
				2455	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2456	executing on it.
				2457	* All LDS operations of a CU are performed as wavefront wide operations in a
				2458	global order and involve no caching. Completion is reported to a wavefront in
				2459	execution order.
				2460	* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2461	CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2462	can be reordered relative to each other, which can result in reordering the
				2463	visibility of vector memory operations with respect to LDS operations of other
				2464	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2465	ensure synchronization between LDS operations and vector memory operations
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2466	between wavefronts of a work-group, but not between operations performed by the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2467	same wavefront.
				2468	* The vector memory operations are performed as wavefront wide operations and
				2469	completion is reported to a wavefront in execution order. The exception is
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2470	that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2471	vector memory order if they access LDS memory, and out of LDS operation order
				2472	if they access global memory.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2473	* The vector memory operations access a single vector L1 cache shared by all
				2474	SIMDs of a CU. Therefore, no special action is required for coherence between the
				2475	lanes of a single wavefront, or for coherence between wavefronts in the same
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2476	work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2477	executing in different work-groups as they may be executing on different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each CU has a separate request queue per channel. Therefore, the vector and
  scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different CUs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
  ensure synchronization between vector memory operations of different CUs. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire and release.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.
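
As a rough illustration of how the cache properties above translate into code
sequences, the following Python sketch models only the ``load atomic acquire``
rows for the global address space from the GFX6-GFX9 code sequence table in
this section. The function name ``acquire_load_global`` and the list-of-strings
representation are assumptions made for this example; they are not part of
LLVM.

```python
# Hypothetical model of the "load atomic acquire" rows (global address space
# only) of the GFX6-GFX9 code sequence table. Not an LLVM API.

def acquire_load_global(scope):
    """Return the machine code steps for a global acquire load at `scope`."""
    if scope in ("singlethread", "wavefront", "workgroup"):
        # Wavefronts of one work-group share a single vector L1 cache, so no
        # wait or cache invalidate is needed.
        return ["buffer/global/flat_load"]
    if scope in ("agent", "system"):
        return [
            "buffer/global/flat_load glc=1",
            # The load must complete before the cache is invalidated.
            "s_waitcnt vmcnt(0)",
            # Following loads must not see stale global data.
            "buffer_wbinvl1_vol",
        ]
    raise ValueError("unknown sync scope: " + scope)


if __name__ == "__main__":
    for s in ("workgroup", "agent"):
        print(s, acquire_load_global(s))
```

The wider scopes need the extra steps because work-groups may execute on
different CUs, each with its own vector L1 cache.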
				2496
Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
memory, atomic memory orderings are not meaningful and all accesses are treated
as non-atomic.
				2501
Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch, it is not legal to perform
stores. Atomic memory orderings are therefore not meaningful and all accesses
are treated as non-atomic.
				2507
A memory synchronization scope wider than work-group is not meaningful for the
group (LDS) address space and is treated as work-group.

The memory model does not support the region address space which is treated as
non-atomic.

Acquire memory ordering is not meaningful on store atomic instructions and is
treated as non-atomic.

Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.

Acquire-release memory ordering is not meaningful on load or store atomic
instructions and is treated as acquire and release respectively.
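
The address space, scope, and ordering rules above can be summarized as a small
normalization step. The Python sketch below is only a reading aid: the function
``normalize``, the string names, and the use of ``("none", "none")`` to stand
for a non-atomic access are assumptions made for this example and do not
correspond to any LLVM interface.

```python
# Hypothetical summary of the normalization rules stated in the text above.
# Not an LLVM API.

SCOPES = ["singlethread", "wavefront", "workgroup", "agent", "system"]
NON_ATOMIC = ("none", "none")  # (ordering, sync scope) for non-atomic access


def normalize(instr, ordering, scope, addrspace):
    """Return the effective (ordering, scope) for a load, store or atomicrmw."""
    # Private, constant and region address spaces are treated as non-atomic.
    if addrspace in ("private", "constant", "region"):
        return NON_ATOMIC
    # Acquire on a store, or release on a load, is not meaningful.
    if instr == "store" and ordering == "acquire":
        return NON_ATOMIC
    if instr == "load" and ordering == "release":
        return NON_ATOMIC
    # Acquire-release degrades to acquire on loads and release on stores.
    if ordering == "acq_rel":
        if instr == "load":
            ordering = "acquire"
        elif instr == "store":
            ordering = "release"
    # LDS is only visible within a work-group, so wider scopes are clamped.
    if addrspace == "local" and SCOPES.index(scope) > SCOPES.index("workgroup"):
        scope = "workgroup"
    return (ordering, scope)


if __name__ == "__main__":
    print(normalize("load", "acq_rel", "agent", "local"))
```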
				2522
The AMDGPU backend only uses scalar memory operations to access memory that is
proven to not change during the execution of the kernel dispatch. This includes
constant address space and global address space for program scope const
variables. Therefore the kernel machine code does not have to maintain the
scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
and vector L1 caches are invalidated between kernel dispatches by CP since
constant address space data may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.
				2531
The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or a function call that
creates a frame at the same address, respectively. There is no need for a
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
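
The write-back placement described above can be pictured with a short Python
sketch. It is a simplified model, not the backend's implementation: the
function ``insert_scalar_writebacks``, the flat list of instruction names, and
the ``FUNCTION_RETURN`` placeholder (standing in for whatever instruction
returns from a function) are all assumptions made for this example.

```python
# Hypothetical model of s_dcache_wb placement when scalar writes are used to
# spill SGPRs. Not the AMDGPU backend implementation.

FUNCTION_RETURN = "<function return>"  # placeholder, not a real mnemonic


def insert_scalar_writebacks(instructions, uses_scalar_spill_writes):
    """Insert s_dcache_wb before s_endpgm and before each function return."""
    if not uses_scalar_spill_writes:
        # Without scalar writes the scalar cache holds no dirty data.
        return list(instructions)
    result = []
    for inst in instructions:
        if inst in ("s_endpgm", FUNCTION_RETURN):
            # Write back so a later wavefront or frame at the same address
            # sees the data through vector memory instructions.
            result.append("s_dcache_wb")
        result.append(inst)
    return result


if __name__ == "__main__":
    print(insert_scalar_writebacks(["<scalar spill write>", "s_endpgm"], True))
```

No ``s_dcache_inv`` appears in the sketch because, as stated above, all scalar
writes are write-before-read in the same thread.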
				2540
Scratch backing memory (which is used for the private address space)
is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the L1
cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
volatile cache lines.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2547
On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
to invalidate the L2 cache. This also causes it to be treated as
non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
(cache coherent) and so the L2 cache will be coherent with the CPU and other
agents.
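
The memory types named in the two paragraphs above can be collected into a
small lookup. This Python sketch is illustrative only: the function
``backing_memory_mtype`` and its ``kind`` and ``is_apu`` parameters are
assumptions made for this example, and only the cases stated in the text are
covered.

```python
# Hypothetical lookup of the MTYPE used for each kind of backing memory, as
# described in the text. Not an LLVM or driver API.

def backing_memory_mtype(kind, is_apu=False):
    """Return the MTYPE for "scratch" or "kernarg" backing memory."""
    if kind == "scratch":
        # Private address space: single thread, always write-before-read.
        return "NC_NV"
    if kind == "kernarg":
        # dGPU: uncached, so no L2 invalidate is needed.
        # APU: cache coherent with the CPU and other agents.
        return "CC" if is_apu else "UC"
    raise ValueError("unknown backing memory kind: " + kind)


if __name__ == "__main__":
    print(backing_memory_mtype("kernarg", is_apu=True))
```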
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2553
				2554	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2555	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2556
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2557	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2558	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2559	Ordering Sync Scope Address
				2560	Space
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2561	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2562	Non-Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2563	-----------------------------------------------------------------------------------
				2564	load none none - global - !volatile & !nontemporal
				2565	- generic
				2566	- private 1. buffer/global/flat_load
				2567	- constant
				2568	- volatile & !nontemporal
				2569
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2570	1. buffer/global/flat_load
				2571	glc=1
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2572
				2573	- nontemporal
				2574
				2575	1. buffer/global/flat_load
				2576	glc=1 slc=1
				2577
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2578	load none none - local 1. ds_load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2579	store none none - global - !nontemporal
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2580	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2581	- private 1. buffer/global/flat_store
				2582	- constant
				2583	- nontemporal
				2584
1. buffer/global/flat_store
				2586	glc=1 slc=1
				2587
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2588	store none none - local 1. ds_store
				2589	Unordered Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2590	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2591	load atomic unordered any any Same as non-atomic.
				2592	store atomic unordered any any Same as non-atomic.
				2593	atomicrmw unordered any any *Same as monotonic
				2594	atomic*.
				2595	Monotonic Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2596	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2597	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2598	- wavefront - generic
				2599	- workgroup
				2600	load atomic monotonic - singlethread - local 1. ds_load
				2601	- wavefront
				2602	- workgroup
				2603	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2604	- system - generic glc=1
				2605	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2606	- wavefront - generic
				2607	- workgroup
				2608	- agent
				2609	- system
				2610	store atomic monotonic - singlethread - local 1. ds_store
				2611	- wavefront
				2612	- workgroup
				2613	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2614	- wavefront - generic
				2615	- workgroup
				2616	- agent
				2617	- system
				2618	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2619	- wavefront
				2620	- workgroup
				2621	Acquire Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2622	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2623	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2624	- wavefront - local
				2625	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2626	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				2627	load atomic acquire - workgroup - local 1. ds_load
				2628	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2629
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2630	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2631	- Must happen before
				2632	any following
				2633	global/generic
				2634	load/load
				2635	atomic/store/store
				2636	atomic/atomicrmw.
				2637	- Ensures any
				2638	following global
				2639	data read is no
				2640	older than the load
				2641	atomic value being
				2642	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2643	load atomic acquire - workgroup - generic 1. flat_load
				2644	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2645
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2646	- If OpenCL, omit.
				2647	- Must happen before
				2648	any following
				2649	global/generic
				2650	load/load
				2651	atomic/store/store
				2652	atomic/atomicrmw.
				2653	- Ensures any
				2654	following global
				2655	data read is no
				2656	older than the load
				2657	atomic value being
				2658	acquired.
				2659	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2660	- system glc=1
				2661	2. s_waitcnt vmcnt(0)
				2662
				2663	- Must happen before
				2664	following
				2665	buffer_wbinvl1_vol.
				2666	- Ensures the load
				2667	has completed
				2668	before invalidating
				2669	the cache.
				2670
				2671	3. buffer_wbinvl1_vol
				2672
				2673	- Must happen before
				2674	any following
				2675	global/generic
				2676	load/load
				2677	atomic/atomicrmw.
				2678	- Ensures that
				2679	following
				2680	loads will not see
				2681	stale global data.
				2682
				2683	load atomic acquire - agent - generic 1. flat_load glc=1
				2684	- system 2. s_waitcnt vmcnt(0) &
				2685	lgkmcnt(0)
				2686
- If OpenCL, omit
				2688	lgkmcnt(0).
				2689	- Must happen before
				2690	following
				2691	buffer_wbinvl1_vol.
				2692	- Ensures the flat_load
				2693	has completed
				2694	before invalidating
				2695	the cache.
				2696
				2697	3. buffer_wbinvl1_vol
				2698
				2699	- Must happen before
				2700	any following
				2701	global/generic
				2702	load/load
				2703	atomic/atomicrmw.
				2704	- Ensures that
				2705	following loads
				2706	will not see stale
				2707	global data.
				2708
				2709	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2710	- wavefront - local
				2711	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2712	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				2713	atomicrmw acquire - workgroup - local 1. ds_atomic
2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2715
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2716	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2717	- Must happen before
				2718	any following
				2719	global/generic
				2720	load/load
				2721	atomic/store/store
				2722	atomic/atomicrmw.
				2723	- Ensures any
				2724	following global
				2725	data read is no
				2726	older than the
				2727	atomicrmw value
				2728	being acquired.
				2729
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2730	atomicrmw acquire - workgroup - generic 1. flat_atomic
2. s_waitcnt lgkmcnt(0)
				2732
				2733	- If OpenCL, omit.
				2734	- Must happen before
				2735	any following
				2736	global/generic
				2737	load/load
				2738	atomic/store/store
				2739	atomic/atomicrmw.
				2740	- Ensures any
				2741	following global
				2742	data read is no
				2743	older than the
				2744	atomicrmw value
				2745	being acquired.
				2746
				2747	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2748	- system 2. s_waitcnt vmcnt(0)
				2749
				2750	- Must happen before
				2751	following
				2752	buffer_wbinvl1_vol.
				2753	- Ensures the
				2754	atomicrmw has
				2755	completed before
				2756	invalidating the
				2757	cache.
				2758
				2759	3. buffer_wbinvl1_vol
				2760
				2761	- Must happen before
				2762	any following
				2763	global/generic
				2764	load/load
				2765	atomic/atomicrmw.
				2766	- Ensures that
				2767	following loads
				2768	will not see stale
				2769	global data.
				2770
				2771	atomicrmw acquire - agent - generic 1. flat_atomic
				2772	- system 2. s_waitcnt vmcnt(0) &
				2773	lgkmcnt(0)
				2774
				2775	- If OpenCL, omit
				2776	lgkmcnt(0).
				2777	- Must happen before
				2778	following
				2779	buffer_wbinvl1_vol.
				2780	- Ensures the
				2781	atomicrmw has
				2782	completed before
				2783	invalidating the
				2784	cache.
				2785
				2786	3. buffer_wbinvl1_vol
				2787
				2788	- Must happen before
				2789	any following
				2790	global/generic
				2791	load/load
				2792	atomic/atomicrmw.
				2793	- Ensures that
				2794	following loads
				2795	will not see stale
				2796	global data.
				2797
				2798	fence acquire - singlethread none none
				2799	- wavefront
				2800	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2801
				2802	- If OpenCL and
				2803	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2804	not generic, omit.
				2805	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2806	currently has no
				2807	address space on
the fence, it must
conservatively
always be generated. If
				2811	fence had an
				2812	address space then
				2813	set to address
				2814	space of OpenCL
				2815	fence flag, or to
				2816	generic if both
				2817	local and global
				2818	flags are
				2819	specified.
				2820	- Must happen after
				2821	any preceding
				2822	local/generic load
				2823	atomic/atomicrmw
				2824	with an equal or
				2825	wider sync scope
				2826	and memory ordering
				2827	stronger than
				2828	unordered (this is
				2829	termed the
				2830	fence-paired-atomic).
				2831	- Must happen before
				2832	any following
				2833	global/generic
				2834	load/load
				2835	atomic/store/store
				2836	atomic/atomicrmw.
				2837	- Ensures any
				2838	following global
				2839	data read is no
				2840	older than the
				2841	value read by the
				2842	fence-paired-atomic.
				2843
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2844	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				2845	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2846
				2847	- If OpenCL and
				2848	address space is
				2849	not generic, omit
				2850	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2851	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2852	currently has no
				2853	address space on
the fence, it must
conservatively
always be generated
				2857	(see comment for
				2858	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2859	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2860	separate s_waitcnt
				2861	vmcnt(0) and
				2862	s_waitcnt
				2863	lgkmcnt(0) to allow
				2864	them to be
				2865	independently moved
				2866	according to the
				2867	following rules.
				2868	- s_waitcnt vmcnt(0)
				2869	must happen after
				2870	any preceding
				2871	global/generic load
				2872	atomic/atomicrmw
				2873	with an equal or
				2874	wider sync scope
				2875	and memory ordering
				2876	stronger than
				2877	unordered (this is
				2878	termed the
				2879	fence-paired-atomic).
				2880	- s_waitcnt lgkmcnt(0)
				2881	must happen after
				2882	any preceding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2883	local/generic load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2884	atomic/atomicrmw
				2885	with an equal or
				2886	wider sync scope
				2887	and memory ordering
				2888	stronger than
				2889	unordered (this is
				2890	termed the
				2891	fence-paired-atomic).
				2892	- Must happen before
				2893	the following
				2894	buffer_wbinvl1_vol.
				2895	- Ensures that the
				2896	fence-paired atomic
				2897	has completed
				2898	before invalidating
				2899	the
				2900	cache. Therefore
				2901	any following
				2902	locations read must
				2903	be no older than
				2904	the value read by
				2905	the
				2906	fence-paired-atomic.
				2907
				2908	2. buffer_wbinvl1_vol
				2909
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2910	- Must happen before any
				2911	following global/generic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2912	load/load
				2913	atomic/store/store
				2914	atomic/atomicrmw.
				2915	- Ensures that
				2916	following loads
				2917	will not see stale
				2918	global data.
				2919
				2920	Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2921	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2922	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2923	- wavefront - local
				2924	- generic
				2925	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2926
				2927	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2928	- Must happen after
				2929	any preceding
				2930	local/generic
				2931	load/store/load
				2932	atomic/store
				2933	atomic/atomicrmw.
				2934	- Must happen before
				2935	the following
				2936	store.
				2937	- Ensures that all
				2938	memory operations
				2939	to local have
				2940	completed before
				2941	performing the
				2942	store that is being
				2943	released.
				2944
				2945	2. buffer/global/flat_store
				2946	store atomic release - workgroup - local 1. ds_store
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2947	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2948
				2949	- If OpenCL, omit.
				2950	- Must happen after
				2951	any preceding
				2952	local/generic
				2953	load/store/load
				2954	atomic/store
				2955	atomic/atomicrmw.
				2956	- Must happen before
				2957	the following
				2958	store.
				2959	- Ensures that all
				2960	memory operations
				2961	to local have
				2962	completed before
				2963	performing the
				2964	store that is being
				2965	released.
				2966
				2967	2. flat_store
				2968	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2969	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2970
				2971	- If OpenCL, omit
				2972	lgkmcnt(0).
				2973	- Could be split into
				2974	separate s_waitcnt
				2975	vmcnt(0) and
				2976	s_waitcnt
				2977	lgkmcnt(0) to allow
				2978	them to be
				2979	independently moved
				2980	according to the
				2981	following rules.
				2982	- s_waitcnt vmcnt(0)
				2983	must happen after
				2984	any preceding
				2985	global/generic
				2986	load/store/load
				2987	atomic/store
				2988	atomic/atomicrmw.
				2989	- s_waitcnt lgkmcnt(0)
				2990	must happen after
				2991	any preceding
				2992	local/generic
				2993	load/store/load
				2994	atomic/store
				2995	atomic/atomicrmw.
				2996	- Must happen before
				2997	the following
				2998	store.
				2999	- Ensures that all
				3000	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3001	to memory have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3002	completed before
				3003	performing the
				3004	store that is being
				3005	released.
				3006
				3007	2. buffer/global/ds/flat_store
				3008	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				3009	- wavefront - local
				3010	- generic
				3011	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3012
				3013	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3014	- Must happen after
				3015	any preceding
				3016	local/generic
				3017	load/store/load
				3018	atomic/store
				3019	atomic/atomicrmw.
				3020	- Must happen before
				3021	the following
				3022	atomicrmw.
				3023	- Ensures that all
				3024	memory operations
				3025	to local have
				3026	completed before
				3027	performing the
				3028	atomicrmw that is
				3029	being released.
				3030
				3031	2. buffer/global/flat_atomic
				3032	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3033	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3034
				3035	- If OpenCL, omit.
				3036	- Must happen after
				3037	any preceding
				3038	local/generic
				3039	load/store/load
				3040	atomic/store
				3041	atomic/atomicrmw.
				3042	- Must happen before
				3043	the following
				3044	atomicrmw.
				3045	- Ensures that all
				3046	memory operations
				3047	to local have
				3048	completed before
				3049	performing the
				3050	atomicrmw that is
				3051	being released.
				3052
				3053	2. flat_atomic
				3054	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				3055	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3056
				3057	- If OpenCL, omit
				3058	lgkmcnt(0).
				3059	- Could be split into
				3060	separate s_waitcnt
				3061	vmcnt(0) and
				3062	s_waitcnt
				3063	lgkmcnt(0) to allow
				3064	them to be
				3065	independently moved
				3066	according to the
				3067	following rules.
				3068	- s_waitcnt vmcnt(0)
				3069	must happen after
				3070	any preceding
				3071	global/generic
				3072	load/store/load
				3073	atomic/store
				3074	atomic/atomicrmw.
				3075	- s_waitcnt lgkmcnt(0)
				3076	must happen after
				3077	any preceding
				3078	local/generic
				3079	load/store/load
				3080	atomic/store
				3081	atomic/atomicrmw.
				3082	- Must happen before
				3083	the following
				3084	atomicrmw.
				3085	- Ensures that all
				3086	memory operations
				3087	to global and local
				3088	have completed
				3089	before performing
				3090	the atomicrmw that
				3091	is being released.
				3092
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3093	2. buffer/global/ds/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3094	fence release - singlethread none none
				3095	- wavefront
				3096	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				3097
				3098	- If OpenCL and
				3099	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3100	not generic, omit.
				3101	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3102	currently has no
				3103	address space on
the fence, it must
conservatively
always be generated. If
				3107	fence had an
				3108	address space then
				3109	set to address
				3110	space of OpenCL
				3111	fence flag, or to
				3112	generic if both
				3113	local and global
				3114	flags are
				3115	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3116	- Must happen after
				3117	any preceding
				3118	local/generic
				3119	load/load
				3120	atomic/store/store
				3121	atomic/atomicrmw.
				3122	- Must happen before
				3123	any following store
				3124	atomic/atomicrmw
				3125	with an equal or
				3126	wider sync scope
				3127	and memory ordering
				3128	stronger than
				3129	unordered (this is
				3130	termed the
				3131	fence-paired-atomic).
				3132	- Ensures that all
				3133	memory operations
				3134	to local have
				3135	completed before
				3136	performing the
				3137	following
				3138	fence-paired-atomic.
				3139
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3140	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				3141	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3142
				3143	- If OpenCL and
				3144	address space is
				3145	not generic, omit
				3146	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3147	- If OpenCL and
				3148	address space is
				3149	local, omit
				3150	vmcnt(0).
				3151	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3152	currently has no
				3153	address space on
the fence, it must
conservatively
always be generated. If
				3157	fence had an
				3158	address space then
				3159	set to address
				3160	space of OpenCL
				3161	fence flag, or to
				3162	generic if both
				3163	local and global
				3164	flags are
				3165	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3166	- Could be split into
				3167	separate s_waitcnt
				3168	vmcnt(0) and
				3169	s_waitcnt
				3170	lgkmcnt(0) to allow
				3171	them to be
				3172	independently moved
				3173	according to the
				3174	following rules.
				3175	- s_waitcnt vmcnt(0)
				3176	must happen after
				3177	any preceding
				3178	global/generic
				3179	load/store/load
				3180	atomic/store
				3181	atomic/atomicrmw.
				3182	- s_waitcnt lgkmcnt(0)
				3183	must happen after
				3184	any preceding
				3185	local/generic
				3186	load/store/load
				3187	atomic/store
				3188	atomic/atomicrmw.
				3189	- Must happen before
				3190	any following store
				3191	atomic/atomicrmw
				3192	with an equal or
				3193	wider sync scope
				3194	and memory ordering
				3195	stronger than
				3196	unordered (this is
				3197	termed the
				3198	fence-paired-atomic).
				3199	- Ensures that all
				3200	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3201	have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3202	completed before
				3203	performing the
				3204	following
				3205	fence-paired-atomic.
				3206
				3207	Acquire-Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3208	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3209	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3210	- wavefront - local
				3211	- generic
				3212	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3213
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3214	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3215	- Must happen after
				3216	any preceding
				3217	local/generic
				3218	load/store/load
				3219	atomic/store
				3220	atomic/atomicrmw.
				3221	- Must happen before
				3222	the following
				3223	atomicrmw.
				3224	- Ensures that all
				3225	memory operations
				3226	to local have
				3227	completed before
				3228	performing the
				3229	atomicrmw that is
				3230	being released.
				3231
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3232	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3233	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3234	2. s_waitcnt lgkmcnt(0)
				3235
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3236	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3237	- Must happen before
				3238	any following
				3239	global/generic
				3240	load/load
				3241	atomic/store/store
				3242	atomic/atomicrmw.
				3243	- Ensures any
				3244	following global
				3245	data read is no
older than the
atomicrmw value
being acquired.
				3249
				3250	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3251
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3252	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3253	- Must happen after
				3254	any preceding
				3255	local/generic
				3256	load/store/load
				3257	atomic/store
				3258	atomic/atomicrmw.
				3259	- Must happen before
				3260	the following
				3261	atomicrmw.
				3262	- Ensures that all
				3263	memory operations
				3264	to local have
				3265	completed before
				3266	performing the
				3267	atomicrmw that is
				3268	being released.
				3269
				3270	2. flat_atomic
				3271	3. s_waitcnt lgkmcnt(0)
				3272
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3273	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3274	- Must happen before
				3275	any following
				3276	global/generic
				3277	load/load
				3278	atomic/store/store
				3279	atomic/atomicrmw.
				3280	- Ensures any
				3281	following global
				3282	data read is no
older than the
atomicrmw value
being acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3286
				3287	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3288	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3289
				3290	- If OpenCL, omit
				3291	lgkmcnt(0).
				3292	- Could be split into
				3293	separate s_waitcnt
				3294	vmcnt(0) and
				3295	s_waitcnt
				3296	lgkmcnt(0) to allow
				3297	them to be
				3298	independently moved
				3299	according to the
				3300	following rules.
				3301	- s_waitcnt vmcnt(0)
				3302	must happen after
				3303	any preceding
				3304	global/generic
				3305	load/store/load
				3306	atomic/store
				3307	atomic/atomicrmw.
				3308	- s_waitcnt lgkmcnt(0)
				3309	must happen after
				3310	any preceding
				3311	local/generic
				3312	load/store/load
				3313	atomic/store
				3314	atomic/atomicrmw.
				3315	- Must happen before
				3316	the following
				3317	atomicrmw.
				3318	- Ensures that all
				3319	memory operations
				3320	to global have
				3321	completed before
				3322	performing the
				3323	atomicrmw that is
				3324	being released.
				3325
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3326	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3327	3. s_waitcnt vmcnt(0)
				3328
				3329	- Must happen before
				3330	following
				3331	buffer_wbinvl1_vol.
				3332	- Ensures the
				3333	atomicrmw has
				3334	completed before
				3335	invalidating the
				3336	cache.
				3337
				3338	4. buffer_wbinvl1_vol
				3339
				3340	- Must happen before
				3341	any following
				3342	global/generic
				3343	load/load
				3344	atomic/atomicrmw.
				3345	- Ensures that
				3346	following loads
				3347	will not see stale
				3348	global data.
				3349
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3350	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3351	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3352
				3353	- If OpenCL, omit
				3354	lgkmcnt(0).
				3355	- Could be split into
				3356	separate s_waitcnt
				3357	vmcnt(0) and
				3358	s_waitcnt
				3359	lgkmcnt(0) to allow
				3360	them to be
				3361	independently moved
				3362	according to the
				3363	following rules.
				3364	- s_waitcnt vmcnt(0)
				3365	must happen after
				3366	any preceding
				3367	global/generic
				3368	load/store/load
				3369	atomic/store
				3370	atomic/atomicrmw.
				3371	- s_waitcnt lgkmcnt(0)
				3372	must happen after
				3373	any preceding
				3374	local/generic
				3375	load/store/load
				3376	atomic/store
				3377	atomic/atomicrmw.
				3378	- Must happen before
				3379	the following
				3380	atomicrmw.
				3381	- Ensures that all
				3382	memory operations
				3383	to global have
				3384	completed before
				3385	performing the
				3386	atomicrmw that is
				3387	being released.
				3388
				3389	2. flat_atomic
				3390	3. s_waitcnt vmcnt(0) &
				3391	lgkmcnt(0)
				3392
				3393	- If OpenCL, omit
				3394	lgkmcnt(0).
				3395	- Must happen before
				3396	following
				3397	buffer_wbinvl1_vol.
				3398	- Ensures the
				3399	atomicrmw has
				3400	completed before
				3401	invalidating the
				3402	cache.
				3403
				3404	4. buffer_wbinvl1_vol
				3405
				3406	- Must happen before
				3407	any following
				3408	global/generic
				3409	load/load
				3410	atomic/atomicrmw.
				3411	- Ensures that
				3412	following loads
				3413	will not see stale
				3414	global data.
				3415
				3416	fence acq_rel - singlethread none none
				3417	- wavefront
				3418	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3419
				3420	- If OpenCL and
				3421	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3422	not generic, omit.
				3423	- However,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3424	since LLVM
				3425	currently has no
				3426	address space on
				3427	the fence need to
				3428	conservatively
				3429	always generate
				3430	(see comment for
				3431	previous fence).
				3432	- Must happen after
				3433	any preceding
				3434	local/generic
				3435	load/load
				3436	atomic/store/store
				3437	atomic/atomicrmw.
				3438	- Must happen before
				3439	any following
				3440	global/generic
				3441	load/load
				3442	atomic/store/store
				3443	atomic/atomicrmw.
				3444	- Ensures that all
				3445	memory operations
				3446	to local have
				3447	completed before
				3448	performing any
				3449	following global
				3450	memory operations.
				3451	- Ensures that the
				3452	preceding
				3453	local/generic load
				3454	atomic/atomicrmw
				3455	with an equal or
				3456	wider sync scope
				3457	and memory ordering
				3458	stronger than
				3459	unordered (this is
				3460	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3461	acquire-fence-paired-atomic
				3462	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3463	before following
				3464	global memory
				3465	operations. This
				3466	satisfies the
				3467	requirements of
				3468	acquire.
				3469	- Ensures that all
				3470	previous memory
				3471	operations have
				3472	completed before a
				3473	following
				3474	local/generic store
				3475	atomic/atomicrmw
				3476	with an equal or
				3477	wider sync scope
				3478	and memory ordering
				3479	stronger than
				3480	unordered (this is
				3481	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3482	release-fence-paired-atomic
				3483	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3484	requirements of
				3485	release.
				3486
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3487	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				3488	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3489
				3490	- If OpenCL and
				3491	address space is
				3492	not generic, omit
				3493	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3494	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3495	currently has no
				3496	address space on
				3497	the fence need to
				3498	conservatively
				3499	always generate
				3500	(see comment for
				3501	previous fence).
				3502	- Could be split into
				3503	separate s_waitcnt
				3504	vmcnt(0) and
				3505	s_waitcnt
				3506	lgkmcnt(0) to allow
				3507	them to be
				3508	independently moved
				3509	according to the
				3510	following rules.
				3511	- s_waitcnt vmcnt(0)
				3512	must happen after
				3513	any preceding
				3514	global/generic
				3515	load/store/load
				3516	atomic/store
				3517	atomic/atomicrmw.
				3518	- s_waitcnt lgkmcnt(0)
				3519	must happen after
				3520	any preceding
				3521	local/generic
				3522	load/store/load
				3523	atomic/store
				3524	atomic/atomicrmw.
				3525	- Must happen before
				3526	the following
				3527	buffer_wbinvl1_vol.
				3528	- Ensures that the
				3529	preceding
				3530	global/local/generic
				3531	load
				3532	atomic/atomicrmw
				3533	with an equal or
				3534	wider sync scope
				3535	and memory ordering
				3536	stronger than
				3537	unordered (this is
				3538	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3539	acquire-fence-paired-atomic
				3540	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3541	before invalidating
				3542	the cache. This
				3543	satisfies the
				3544	requirements of
				3545	acquire.
				3546	- Ensures that all
				3547	previous memory
				3548	operations have
				3549	completed before a
				3550	following
				3551	global/local/generic
				3552	store
				3553	atomic/atomicrmw
				3554	with an equal or
				3555	wider sync scope
				3556	and memory ordering
				3557	stronger than
				3558	unordered (this is
				3559	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3560	release-fence-paired-atomic
				3561	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3562	requirements of
				3563	release.
				3564
				3565	2. buffer_wbinvl1_vol
				3566
				3567	- Must happen before
				3568	any following
				3569	global/generic
				3570	load/load
				3571	atomic/store/store
				3572	atomic/atomicrmw.
				3573	- Ensures that
				3574	following loads
				3575	will not see stale
				3576	global data. This
				3577	satisfies the
				3578	requirements of
				3579	acquire.
				3580
				3581	Sequential Consistent Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3582	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3583	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3584	- wavefront - local load atomic acquire,
- generic except must generate
				3586	all instructions even
				3587	for OpenCL.*
				3588	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3589	- generic
				3590	- Must
				3591	happen after
				3592	preceding
				3593	global/generic load
				3594	atomic/store
				3595	atomic/atomicrmw
				3596	with memory
				3597	ordering of seq_cst
				3598	and with equal or
				3599	wider sync scope.
				3600	(Note that seq_cst
				3601	fences have their
				3602	own s_waitcnt
				3603	lgkmcnt(0) and so do
				3604	not need to be
				3605	considered.)
				3606	- Ensures any
				3607	preceding
sequentially
				3609	consistent local
				3610	memory instructions
				3611	have completed
				3612	before executing
				3613	this sequentially
				3614	consistent
				3615	instruction. This
				3616	prevents reordering
				3617	a seq_cst store
				3618	followed by a
				3619	seq_cst load. (Note
				3620	that seq_cst is
				3621	stronger than
				3622	acquire/release as
				3623	the reordering of
				3624	load acquire
				3625	followed by a store
				3626	release is
				3627	prevented by the
				3628	waitcnt of
				3629	the release, but
				3630	there is nothing
				3631	preventing a store
				3632	release followed by
				3633	load acquire from
completing out of
				3635	order.)
				3636
				3637	2. *Following
				3638	instructions same as
				3639	corresponding load
				3640	atomic acquire,
except must generate
				3642	all instructions even
				3643	for OpenCL.*
				3644	load atomic seq_cst - workgroup - local *Same as corresponding
				3645	load atomic acquire,
except must generate
				3647	all instructions even
				3648	for OpenCL.*
				3649	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				3650	- system - generic vmcnt(0)
				3651
				3652	- Could be split into
				3653	separate s_waitcnt
				3654	vmcnt(0)
				3655	and s_waitcnt
				3656	lgkmcnt(0) to allow
				3657	them to be
				3658	independently moved
				3659	according to the
				3660	following rules.
- s_waitcnt lgkmcnt(0)
				3662	must happen after
				3663	preceding
				3664	global/generic load
				3665	atomic/store
				3666	atomic/atomicrmw
				3667	with memory
				3668	ordering of seq_cst
				3669	and with equal or
				3670	wider sync scope.
				3671	(Note that seq_cst
				3672	fences have their
				3673	own s_waitcnt
				3674	lgkmcnt(0) and so do
				3675	not need to be
				3676	considered.)
- s_waitcnt vmcnt(0)
				3678	must happen after
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3679	preceding
				3680	global/generic load
				3681	atomic/store
				3682	atomic/atomicrmw
				3683	with memory
				3684	ordering of seq_cst
				3685	and with equal or
				3686	wider sync scope.
				3687	(Note that seq_cst
				3688	fences have their
				3689	own s_waitcnt
				3690	vmcnt(0) and so do
				3691	not need to be
				3692	considered.)
				3693	- Ensures any
				3694	preceding
sequentially
				3696	consistent global
				3697	memory instructions
				3698	have completed
				3699	before executing
				3700	this sequentially
				3701	consistent
				3702	instruction. This
				3703	prevents reordering
				3704	a seq_cst store
				3705	followed by a
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3706	seq_cst load. (Note
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3707	that seq_cst is
				3708	stronger than
				3709	acquire/release as
				3710	the reordering of
				3711	load acquire
				3712	followed by a store
				3713	release is
				3714	prevented by the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3715	waitcnt of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3716	the release, but
				3717	there is nothing
				3718	preventing a store
				3719	release followed by
				3720	load acquire from
completing out of
				3722	order.)
				3723
				3724	2. *Following
				3725	instructions same as
				3726	corresponding load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3727	atomic acquire,
except must generate
				3729	all instructions even
				3730	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3731	store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3732	- wavefront - local store atomic release,
- workgroup - generic except must generate
				3734	all instructions even
				3735	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3736	store atomic seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3737	- system - generic store atomic release,
except must generate
				3739	all instructions even
				3740	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3741	atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3742	- wavefront - local atomicrmw acq_rel,
- workgroup - generic except must generate
				3744	all instructions even
				3745	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3746	atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3747	- system - generic atomicrmw acq_rel,
except must generate
				3749	all instructions even
				3750	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3751	fence seq_cst - singlethread none *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3752	- wavefront fence acq_rel,
- workgroup except must generate
				3754	- agent all instructions even
				3755	- system for OpenCL.*
				3756	============ ============ ============== ========== ===============================
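
As a reading aid, the ``atomicrmw acq_rel`` rows at agent/system scope for the
global and generic address spaces can be summarized as a small selection
function. This is only a sketch in Python, not the backend's implementation:
the function name ``atomicrmw_acq_rel_agent_sequence`` and its parameters are
hypothetical, and the returned strings simply repeat the table entries.

```python
def atomicrmw_acq_rel_agent_sequence(address_space, is_opencl=False, seq_cst=False):
    """Return the table's code sequence for an agent/system scope atomicrmw.

    address_space: "global" or "generic" (the two rows covered here).
    seq_cst: the seq_cst rows reuse the acq_rel sequence but never apply
             the OpenCL omissions.
    """
    if address_space not in ("global", "generic"):
        raise ValueError("only the global and generic rows are sketched")

    # "If OpenCL, omit lgkmcnt(0)" applies unless the ordering is seq_cst.
    omit_lgkm = is_opencl and not seq_cst

    pre_wait = "s_waitcnt vmcnt(0)" if omit_lgkm else "s_waitcnt lgkmcnt(0) & vmcnt(0)"
    if address_space == "global":
        atomic = "buffer/global/flat_atomic"
        post_wait = "s_waitcnt vmcnt(0)"
    else:
        atomic = "flat_atomic"
        post_wait = "s_waitcnt vmcnt(0)" if omit_lgkm else "s_waitcnt vmcnt(0) & lgkmcnt(0)"

    # The cache invalidate always follows the wait on the atomic itself.
    return [pre_wait, atomic, post_wait, "buffer_wbinvl1_vol"]
```

For example, ``atomicrmw_acq_rel_agent_sequence("generic", is_opencl=True)``
drops ``lgkmcnt(0)`` from both waits, while passing ``seq_cst=True`` keeps them.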
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3757
The memory order also adds the single thread optimization constraints defined in
table
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3761
.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
   :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table

   ============ ==============================================================
   LLVM Memory  Optimization Constraints
   Ordering
   ============ ==============================================================
   unordered    none
   monotonic    none
   acquire      - If a load atomic/atomicrmw then no following load/load
                  atomic/store/store atomic/atomicrmw/fence instruction can
                  be moved before the acquire.
                - If a fence then same as load atomic, plus no preceding
                  associated fence-paired-atomic can be moved after the fence.
   release      - If a store atomic/atomicrmw then no preceding load/load
                  atomic/store/store atomic/atomicrmw/fence instruction can
                  be moved after the release.
                - If a fence then same as store atomic, plus no following
                  associated fence-paired-atomic can be moved before the
                  fence.
   acq_rel      Same constraints as both acquire and release.
   seq_cst      - If a load atomic then same constraints as acquire, plus no
                  preceding sequentially consistent load atomic/store
                  atomic/atomicrmw/fence instruction can be moved after the
                  seq_cst.
                - If a store atomic then the same constraints as release, plus
                  no following sequentially consistent load atomic/store
                  atomic/atomicrmw/fence instruction can be moved before the
                  seq_cst.
                - If an atomicrmw/fence then same constraints as acq_rel.
   ============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3793
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3794	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3795	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3796
For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
runtimes (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that
supports the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3800
.. table:: AMDGPU Trap Handler for AMDHSA OS
   :name: amdgpu-trap-handler-for-amdhsa-os-table

   =================== =============== =============== =======================
   Usage               Code Sequence   Trap Handler    Description
                                       Inputs
   =================== =============== =============== =======================
   reserved            ``s_trap 0x00``                 Reserved by hardware.
   ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
                                         ``queue_ptr`` ``debugtrap``
                                       ``VGPR0``:      intrinsic (not
                                         ``arg``       implemented).
   ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
                                         ``queue_ptr`` terminated and its
                                                       associated queue put
                                                       into the error state.
   ``llvm.debugtrap``  ``s_trap 0x03``                 - If debugger not
                                                         installed then
                                                         behaves as a
                                                         no-operation. The
                                                         trap handler is
                                                         entered and
                                                         immediately returns
                                                         to continue
                                                         execution of the
                                                         wavefront.
                                                       - If the debugger is
                                                         installed, causes
                                                         the debug trap to be
                                                         reported by the
                                                         debugger and the
                                                         wavefront is put in
                                                         the halt state until
                                                         resumed by the
                                                         debugger.
   reserved            ``s_trap 0x04``                 Reserved.
   reserved            ``s_trap 0x05``                 Reserved.
   reserved            ``s_trap 0x06``                 Reserved.
   debugger breakpoint ``s_trap 0x07``                 Reserved for debugger
                                                       breakpoints.
   reserved            ``s_trap 0x08``                 Reserved.
   reserved            ``s_trap 0xfe``                 Reserved.
   reserved            ``s_trap 0xff``                 Reserved.
   =================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3845
Tim Corringham	af2dfc6	2018-04-04 13:02:09 +0000	[diff] [blame]	3846	AMDPAL
				3847	------
				3848
				3849	This section provides code conventions used when the target triple OS is
				3850	``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
				3851	from the application/runtime to each invocation of a hardware shader. These
				3852	parameters include both generic, application-controlled parameters called
				3853	user data as well as system-generated parameters that are a product of the
				3854	draw or dispatch execution.
				3855
				3856	User Data
				3857	~~~~~~~~~
				3858
				3859	Each hardware stage has a set of 32-bit user data registers which can be
				3860	written from a command buffer and then loaded into SGPRs when waves are launched
				3861	via a subsequent dispatch or draw operation. This is the way most arguments are
				3862	passed from the application/runtime to a hardware shader.
				3863
				3864	Compute User Data
				3865	~~~~~~~~~~~~~~~~~
				3866
Compute shader user data is simpler than graphics shader user data: the
register mapping is fixed.
				3869
				3870	Note that there are always 10 available user data entries in registers -
				3871	entries beyond that limit must be fetched from memory (via the spill table
				3872	pointer) by the shader.
				3873
.. table:: PAL Compute Shader User Data Registers
   :name: pal-compute-user-data-registers

   ============= ================================
   User Register Description
   ============= ================================
   0             Global Internal Table (32-bit pointer)
   1             Per-Shader Internal Table (32-bit pointer)
   2 - 11        Application-Controlled User Data (10 32-bit values)
   12            Spill Table (32-bit pointer)
   13 - 14       Thread Group Count (64-bit pointer)
   15            GDS Range
   ============= ================================
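
The fixed compute mapping can be illustrated with a short lookup. This is a
sketch only: the helper name ``locate_compute_user_data`` is hypothetical, and
it assumes that application entries are numbered from 0 and that entry ``i``
occupies user register ``2 + i``, which the table implies but does not state.

```python
FIRST_APP_REGISTER = 2      # user registers 2 - 11 hold application data
NUM_APP_REGISTERS = 10      # "always 10 available user data entries"
SPILL_TABLE_REGISTER = 12   # 32-bit pointer to the spill table


def locate_compute_user_data(entry):
    """Say where application user data entry `entry` (0-based, assumed) lives."""
    if entry < 0:
        raise ValueError("entry index must be non-negative")
    if entry < NUM_APP_REGISTERS:
        return ("register", FIRST_APP_REGISTER + entry)
    # Beyond the register limit the shader must load the value from memory
    # through the spill table pointer. The in-memory layout is not described
    # here, so only the register holding the pointer is reported.
    return ("spill", SPILL_TABLE_REGISTER)
```

Entry 9 is the last one held in a register (user register 11); entry 10 and
above are reached through the spill table pointer in user register 12.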
				3887
				3888	Graphics User Data
				3889	~~~~~~~~~~~~~~~~~~
				3890
				3891	Graphics pipelines support a much more flexible user data mapping:
				3892
.. table:: PAL Graphics Shader User Data Registers
   :name: pal-graphics-user-data-registers

   ============= ================================
   User Register Description
   ============= ================================
   0             Global Internal Table (32-bit pointer)
   +             Per-Shader Internal Table (32-bit pointer)
   + 1-15        Application Controlled User Data
                 (1-15 Contiguous 32-bit Values in Registers)
   +             Spill Table (32-bit pointer)
   +             Draw Index (First Stage Only)
   +             Vertex Offset (First Stage Only)
   +             Instance Offset (First Stage Only)
   ============= ================================
				3908
The placement of the global internal table remains fixed in the first *user
data SGPR register*. Otherwise all parameters are optional, and can be mapped
to any desired user data SGPR register, with the following restrictions:
				3912
* Draw Index, Vertex Offset, and Instance Offset can only be used by the first
  active hardware stage in a graphics pipeline (i.e. where the API vertex
  shader runs).
				3916
				3917	* Application-controlled user data must be mapped into a contiguous range of
				3918	user data registers.
				3919
				3920	* The application-controlled user data range supports compaction remapping, so
				3921	only entries that are actually consumed by the shader must be assigned to
				3922	corresponding registers. Note that in order to support an efficient runtime
				3923	implementation, the remapping must pack registers in the same order as
				3924	entries, with unused entries removed.
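
The compaction rule in the last restriction can be sketched as follows. The
function name ``compact_user_data`` and the choice of ``first_register`` are
hypothetical. The sketch only shows the ordering requirement: consumed entries
are packed into consecutive registers in entry order, with unused entries
removed.

```python
def compact_user_data(used_entries, first_register):
    """Map consumed user data entries to a contiguous register range.

    used_entries: iterable of entry indices the shader actually reads.
    first_register: first user data register of the application range
                    (a free choice in this sketch).
    """
    mapping = {}
    # Sorting keeps registers in the same order as the entries; iterating
    # only over consumed entries is what removes the gaps.
    for offset, entry in enumerate(sorted(set(used_entries))):
        mapping[entry] = first_register + offset
    return mapping
```

With entries 0, 3 and 7 consumed and the range starting at register 2, the
entries land in registers 2, 3 and 4 respectively.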
				3925
				3926	.. _pal_global_internal_table:
				3927
				3928	Global Internal Table
				3929	~~~~~~~~~~~~~~~~~~~~~
				3930
				3931	The global internal table is a table of shader resource descriptors (SRDs) that
				3932	define how certain engine-wide, runtime-managed resources should be accessed
				3933	from a shader. The majority of these resources have HW-defined formats, and it
				3934	is up to the compiler to write/read data as required by the target hardware.
				3935
				3936	The following table illustrates the required format:
				3937
.. table:: PAL Global Internal Table
   :name: pal-git-table

   ============= ================================
   Offset        Description
   ============= ================================
   0-3           Graphics Scratch SRD
   4-7           Compute Scratch SRD
   8-11          ES/GS Ring Output SRD
   12-15         ES/GS Ring Input SRD
   16-19         GS/VS Ring Output #0
   20-23         GS/VS Ring Output #1
   24-27         GS/VS Ring Output #2
   28-31         GS/VS Ring Output #3
   32-35         GS/VS Ring Input SRD
   36-39         Tessellation Factor Buffer SRD
   40-43         Off-Chip LDS Buffer SRD
   44-47         Off-Chip Param Cache Buffer SRD
   48-51         Sample Position Buffer SRD
   52            vaRange::ShadowDescriptorTable High Bits
   ============= ================================
				3959
				3960	The pointer to the global internal table passed to the shader as user data
				3961	is a 32-bit pointer. The top 32 bits should be assumed to be the same as
				3962	the top 32 bits of the pipeline, so the shader may use the program
				3963	counter's top 32 bits.
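
A minimal sketch of that address reconstruction, written in Python for
illustration only (a real shader would do this with scalar instructions). The
function and parameter names are hypothetical.

```python
def full_table_address(table_ptr_lo32, program_counter):
    """Combine the 32-bit user data pointer with the PC's top 32 bits."""
    high = program_counter & 0xFFFFFFFF00000000   # keep only the top 32 bits
    return high | (table_ptr_lo32 & 0xFFFFFFFF)   # low 32 bits from user data
```

For a program counter of ``0x00007F0012345678`` and a user data value of
``0x1000``, the reconstructed table address is ``0x00007F0000001000``.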
				3964
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3965	Unspecified OS
				3966	--------------
				3967
				3968	This section provides code conventions used when the target triple OS is
				3969	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3970
				3971	Trap Handler ABI
				3972	~~~~~~~~~~~~~~~~
				3973
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` intrinsics are handled as follows:
				3977
.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
   :name: amdgpu-trap-handler-for-non-amdhsa-os-table

   =============== =============== ===========================================
   Usage           Code Sequence   Description
   =============== =============== ===========================================
   llvm.trap       s_endpgm        Causes wavefront to be terminated.
   llvm.debugtrap  none            Compiler warning given that there is no
                                   trap handler installed.
   =============== =============== ===========================================
				3988
				3989	Source Languages
				3990	================
				3991
				3992	.. _amdgpu-opencl:
				3993
				3994	OpenCL
				3995	------
				3996
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3997	When the language is OpenCL the following differences occur:
				3998
				3999	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	4000	2. The AMDGPU backend appends additional arguments to the kernel's explicit
				4001	arguments for the AMDHSA OS (see
				4002	:ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	4003	3. Additional metadata is generated
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	4004	(see :ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4005
.. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
   :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

   ======== ==== ========= ===========================================
   Position Byte Byte      Description
            Size Alignment
   ======== ==== ========= ===========================================
   1        8    8         OpenCL Global Offset X
   2        8    8         OpenCL Global Offset Y
   3        8    8         OpenCL Global Offset Z
   4        8    8         OpenCL address of printf buffer
   5        8    8         OpenCL address of virtual queue used by
                           enqueue_kernel.
   6        8    8         OpenCL address of AqlWrap struct used by
                           enqueue_kernel.
   ======== ==== ========= ===========================================
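
The sizes and alignments in the table can be turned into byte offsets with a
short calculation. This sketch assumes that the implicit arguments start at the
first suitably aligned byte after the explicit arguments and follow one another
in table order; that placement rule and the short argument names used as
dictionary keys are assumptions made for illustration, not taken from the
backend.

```python
IMPLICIT_ARGS = [
    # (hypothetical short name, byte size, byte alignment), in table order
    ("global_offset_x", 8, 8),
    ("global_offset_y", 8, 8),
    ("global_offset_z", 8, 8),
    ("printf_buffer", 8, 8),
    ("virtual_queue", 8, 8),
    ("aqlwrap", 8, 8),
]


def implicit_arg_offsets(explicit_args_size):
    """Return {name: byte offset} for the appended implicit arguments."""
    offsets = {}
    offset = explicit_args_size
    for name, size, align in IMPLICIT_ARGS:
        offset = (offset + align - 1) // align * align  # round up to alignment
        offsets[name] = offset
        offset += size
    return offsets
```

Under these assumptions, a kernel with 20 bytes of explicit arguments would
have Global Offset X at byte 24 and the AqlWrap pointer at byte 64.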
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4022
				4023	.. _amdgpu-hcc:
				4024
				4025	HCC
				4026	---
				4027
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	4028	When the language is HCC the following differences occur:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4029
				4030	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				4031
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4032	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4033	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4034
The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX9.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4037
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4038	This section describes general syntax for instructions and operands.
				4039
				4040	Instructions
				4041	~~~~~~~~~~~~
				4042
				4043	.. toctree::
				4044	:hidden:
				4045
				4046	AMDGPUAsmGFX7
				4047	AMDGPUAsmGFX8
				4048	AMDGPUAsmGFX9
				4049	AMDGPUOperandSyntax
				4050
				4051	An instruction has the following syntax:
				4052
				4053	<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...
				4054
				4055	Note that operands are normally comma-separated while modifiers are space-separated.
				4056
				4057	The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
				4058
				4059	See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
				4060	:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
				4061
				4062	Note that features under development are not included in this description.
				4063
For more information about instructions, their semantics and supported combinations of
operands, refer to one of the instruction set architecture manuals
[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4067
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4068	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4069	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4070
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4071	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4072
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4073	* SGPR registers: s0, ... or s[0], ...
				4074	* VGPR registers: v0, ... or v[0], ...
				4075	* TTMP registers: ttmp0, ... or ttmp[0], ...
				4076	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				4077	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				4078	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				4079	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				4080	* Register index expressions: v[2*2], s[1-1:2-1]
				4081	* 'off' indicates that an operand is not enabled
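
To make the plain register forms concrete, the following sketch recognizes
single registers and ranges for the ``s``, ``v`` and ``ttmp`` files. It is an
illustration, not the assembler's parser: the name ``parse_register`` is
hypothetical, and special registers, register lists and index expressions such
as ``v[2*2]`` are deliberately left out.

```python
import re

# Matches "s0", "v[0]" and ranges such as "ttmp[4:7]".
_REG = re.compile(r"^(s|v|ttmp)(?:(\d+)|\[(\d+)(?::(\d+))?\])$")


def parse_register(text):
    """Return (kind, first, last) for a plain register operand."""
    m = _REG.match(text.strip())
    if m is None:
        raise ValueError(f"unsupported register operand: {text!r}")
    kind, plain, lo, hi = m.groups()
    first = int(plain if plain is not None else lo)
    last = int(hi) if hi is not None else first  # single register: last == first
    if last < first:
        raise ValueError("range end precedes range start")
    return (kind, first, last)
```

For instance, ``parse_register("s[2:3]")`` yields ``("s", 2, 3)``, and ``v0``
and ``v[0]`` both yield ``("v", 0, 0)``.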
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4082
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4083	Modifiers
				4084	~~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4085
A detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4087
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4088	Instruction Examples
				4089	~~~~~~~~~~~~~~~~~~~~
				4090
				4091	DS
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4092	++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4093
.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For the full list of supported instructions, refer to "LDS/GDS instructions" in the ISA manual.
				4103
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4104	FLAT
				4105	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4106
.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For the full list of supported instructions, refer to "FLAT instructions" in the ISA manual.
				4116
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4117	MUBUF
				4118	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4119
.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For the full list of supported instructions, refer to "MUBUF Instructions" in the ISA manual.

SMRD/SMEM
+++++++++

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For a full list of supported instructions, refer to "Scalar Memory Operations" in the ISA Manual.

SOP1
++++

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For a full list of supported instructions, refer to "SOP1 Instructions" in the ISA Manual.

SOP2
++++

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For a full list of supported instructions, refer to "SOP2 Instructions" in the ISA Manual.

SOPC
++++

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For a full list of supported instructions, refer to "SOPC Instructions" in the ISA Manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0                                 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1)                          ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For a full list of supported instructions, refer to "SOPP Instructions" in the ISA Manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
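
The equivalence between ``s_waitcnt 0`` and the named-counter form in the
examples above follows from how the counters are packed into the 16-bit
immediate. The Python sketch below is illustrative only. It assumes the
GFX6-GFX8 field layout (``vmcnt`` in bits 3:0, ``expcnt`` in bits 6:4,
``lgkmcnt`` in bits 11:8) and assumes that a counter omitted from the operand
is left at its maximum value, meaning no wait. ``encode_waitcnt`` is a
hypothetical helper, not part of LLVM; consult the ISA Manual for the layout
on a specific target.

.. code-block:: python

  # Field layout assumed for GFX6-GFX8: name -> (bit shift, bit width).
  WAITCNT_FIELDS = {
      "vmcnt": (0, 4),
      "expcnt": (4, 3),
      "lgkmcnt": (8, 4),
  }


  def encode_waitcnt(**counters):
      """Pack the named counters into an s_waitcnt immediate.

      Counters that are not named keep their maximum value (no wait).
      """
      unknown = set(counters) - set(WAITCNT_FIELDS)
      if unknown:
          raise ValueError(f"unknown counters: {sorted(unknown)}")
      imm = 0
      for name, (shift, width) in WAITCNT_FIELDS.items():
          max_value = (1 << width) - 1
          value = counters.get(name, max_value)
          if not 0 <= value <= max_value:
              raise ValueError(f"{name}({value}) does not fit in {width} bits")
          imm |= value << shift
      return imm


  print(hex(encode_waitcnt(vmcnt=0, expcnt=0, lgkmcnt=0)))  # prints 0x0
  print(hex(encode_waitcnt(vmcnt=1)))                       # prints 0xf71

Under these assumptions, naming all three counters with a value of 0 yields
the same immediate as the literal ``0``, while naming only ``vmcnt`` leaves
the other two fields saturated.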

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically uses the optimal encoding based on the
instruction's operands. To force a specific encoding, add a suffix to the
instruction opcode:

* ``_e32`` for 32-bit VOP1/VOP2/VOPC
* ``_e64`` for 64-bit VOP3
* ``_dpp`` for VOP_DPP
* ``_sdwa`` for VOP_SDWA

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For a full list of supported instructions, refer to "Vector ALU instructions" in the ISA Manual.

HSA Code Object Directives
~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

``major`` and ``minor`` are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

``major``, ``minor``, and ``stepping`` are all integers that describe the
instruction set architecture (ISA) version of the assembly program.

``vendor`` and ``arch`` are quoted strings. ``vendor`` should always be equal
to "AMD" and ``arch`` should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, vendor, and arch
from the value of the -mcpu option that is passed to the assembler.
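
As an illustration of the default described above, the following Python sketch
derives a (major, minor, stepping) triple from a processor name. It assumes the
``-mcpu`` value is a ``gfx``-style name whose last two digits are the minor
version and the stepping, with the leading digits forming the major version.
Other processor names accepted by ``-mcpu`` are not handled.
``isa_version_from_gfx_name`` is a hypothetical helper, not an LLVM API.

.. code-block:: python

  def isa_version_from_gfx_name(name):
      """Split a gfx-style processor name into (major, minor, stepping)."""
      digits = name[3:]
      if not name.startswith("gfx") or len(digits) < 3 or not digits.isdigit():
          raise ValueError(f"not a gfx-style processor name: {name!r}")
      major = int(digits[:-2])   # leading digits
      minor = int(digits[-2])    # second-to-last digit
      stepping = int(digits[-1]) # last digit
      return major, minor, stepping


  print(isa_version_from_gfx_name("gfx803"))  # prints (8, 0, 3)
  print(isa_version_from_gfx_name("gfx900"))  # prints (9, 0, 0)

Under that assumption, writing ``.hsa_code_object_isa 8,0,3,"AMD","AMDGPU"``
explicitly would be expected to match what the assembler derives for
``-mcpu=gfx803``.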

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the .end_amd_kernel_code_t directive. Any
amd_kernel_code_t value that is left unspecified takes a default value. The
default value for all keys is 0, with the following exceptions:

- kernel_code_version_major defaults to 1.
- machine_kind defaults to 1.
- machine_version_major, machine_version_minor, and
  machine_version_stepping are derived from the value of the -mcpu option
  that is passed to the assembler.
- kernel_code_entry_byte_offset defaults to 256.
- wavefront_size defaults to 6.
- kernarg_segment_alignment, group_segment_alignment, and
  private_segment_alignment default to 4. Note that alignments are specified
  as a power of two, so a value of n means an alignment of 2^n.
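
Because the alignment keys hold an exponent rather than a byte count, it can
help to convert between the two forms. The Python sketch below is illustrative
only and relies on nothing beyond the power-of-two rule stated above;
``alignment_bytes`` and ``encode_alignment`` are hypothetical helper names.

.. code-block:: python

  def alignment_bytes(encoded):
      """Return the alignment in bytes for an encoded value n, i.e. 2**n."""
      if encoded < 0:
          raise ValueError("encoded alignment must be non-negative")
      return 1 << encoded


  def encode_alignment(num_bytes):
      """Return n such that 2**n == num_bytes."""
      if num_bytes <= 0 or num_bytes & (num_bytes - 1):
          raise ValueError(f"{num_bytes} is not a power of two")
      return num_bytes.bit_length() - 1


  print(alignment_bytes(4))     # prints 16
  print(encode_alignment(256))  # prints 8

The default value of 4 therefore corresponds to a 16-byte alignment.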

The .amd_kernel_code_t directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h, and
test/CodeGen/AMDGPU/hsa.s.

Here is an example of a minimal amd_kernel_code_t specification:

.. code-block:: none

  .hsa_code_object_version 1,0
  .hsa_code_object_isa

  .hsatext
  .globl hello_world
  .p2align 8
  .amdgpu_hsa_kernel hello_world

  hello_world:

    .amd_kernel_code_t
      enable_sgpr_kernarg_segment_ptr = 1
      is_ptr64 = 1
      compute_pgm_rsrc1_vgprs = 0
      compute_pgm_rsrc1_sgprs = 0
      compute_pgm_rsrc2_user_sgpr = 2
      kernarg_segment_byte_size = 8
      wavefront_sgpr_count = 2
      workitem_vgpr_count = 3
    .end_amd_kernel_code_t

    s_load_dwordx2 s[0:1], s[0:1], 0x0  ; Load the 64-bit kernel argument
    v_mov_b32 v0, 3.14159               ; Value to store
    s_waitcnt lgkmcnt(0)                ; Wait for the scalar load to finish
    v_mov_b32 v1, s0                    ; Copy the loaded address into v[1:2]
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0         ; Store the value to that address
    s_endpgm
  .Lfunc_end0:
    .size hello_world, .Lfunc_end0-hello_world

Additional Documentation
========================

.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__