=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
``lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
specify the target triple:

.. table:: AMDGPU Architectures
   :name: amdgpu-architecture-table

   ============ ==============================================================
   Architecture Description
   ============ ==============================================================
   ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
   ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
   ============ ==============================================================

.. table:: AMDGPU Vendors
   :name: amdgpu-vendor-table

   ============ ==============================================================
   Vendor       Description
   ============ ==============================================================
   ``amd``      Can be used for all AMD GPU usage.
   ``mesa3d``   Can be used if the OS is ``mesa3d``.
   ============ ==============================================================

.. table:: AMDGPU Operating Systems
   :name: amdgpu-os-table

   ============== ============================================================
   OS             Description
   ============== ============================================================
   <empty>        Defaults to the unknown OS.
   ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                  such as AMD's ROCm [AMD-ROCm]_.
   ``amdpal``     Graphics shaders and compute kernels executed on AMD PAL
                  runtime.
   ``mesa3d``     Graphics shaders and compute kernels executed on Mesa 3D
                  runtime.
   ============== ============================================================

.. table:: AMDGPU Environments
   :name: amdgpu-environment-table

   ============ ==============================================================
   Environment  Description
   ============ ==============================================================
   <empty>      Default.
   ============ ==============================================================

.. _amdgpu-processors:

Processors
----------

Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor.
Names from either the Processor or the Alternative Processor column can be
used.

.. table:: AMDGPU Processors
   :name: amdgpu-processor-table

   =========== =============== ============ ===== ========= ======= ==================
   Processor   Alternative     Target       dGPU/ Target    ROCm    Example
               Processor       Triple       APU   Features  Support Products
                               Architecture       Supported
                                                  [Default]
   =========== =============== ============ ===== ========= ======= ==================
   Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
   -----------------------------------------------------------------------------------
   ``r600``                    ``r600``     dGPU
   ``r630``                    ``r600``     dGPU
   ``rs880``                   ``r600``     dGPU
   ``rv670``                   ``r600``     dGPU
   Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
   -----------------------------------------------------------------------------------
   ``rv710``                   ``r600``     dGPU
   ``rv730``                   ``r600``     dGPU
   ``rv770``                   ``r600``     dGPU
   Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
   -----------------------------------------------------------------------------------
   ``cedar``                   ``r600``     dGPU
   ``cypress``                 ``r600``     dGPU
   ``juniper``                 ``r600``     dGPU
   ``redwood``                 ``r600``     dGPU
   ``sumo``                    ``r600``     dGPU
   Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
   -----------------------------------------------------------------------------------
   ``barts``                   ``r600``     dGPU
   ``caicos``                  ``r600``     dGPU
   ``cayman``                  ``r600``     dGPU
   ``turks``                   ``r600``     dGPU
   GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
   -----------------------------------------------------------------------------------
   ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
   ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
               - ``oland``
               - ``pitcairn``
               - ``verde``
   GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
   -----------------------------------------------------------------------------------
   ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - A6-7000
                                                                    - A6 Pro-7050B
                                                                    - A8-7100
                                                                    - A8 Pro-7150B
                                                                    - A10-7300
                                                                    - A10 Pro-7350B
                                                                    - FX-7500
                                                                    - A8-7200P
                                                                    - A10-7400P
                                                                    - FX-7600P
   ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU            ROCm    - FirePro W8100
                                                                    - FirePro W9100
                                                                    - FirePro S9150
                                                                    - FirePro S9170
   ``gfx702``                  ``amdgcn``   dGPU            ROCm    - Radeon R9 290
                                                                    - Radeon R9 290x
                                                                    - Radeon R390
                                                                    - Radeon R390x
   ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - E1-2100
               - ``mullins``                                        - E1-2200
                                                                    - E1-2500
                                                                    - E2-3000
                                                                    - E2-3800
                                                                    - A4-5000
                                                                    - A4-5100
                                                                    - A6-5200
                                                                    - A4 Pro-3340B
   ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Radeon HD 7790
                                                                    - Radeon HD 8770
                                                                    - R7 260
                                                                    - R7 260X
   GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
   -----------------------------------------------------------------------------------
   ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - A6-8500P
                                                    [on]            - Pro A6-8500B
                                                                    - A8-8600P
                                                                    - Pro A8-8600B
                                                                    - FX-8800P
                                                                    - Pro A12-8800B
   \                           ``amdgcn``   APU   - xnack   ROCm    - A10-8700P
                                                    [on]            - Pro A10-8700B
                                                                    - A10-8780P
   \                           ``amdgcn``   APU   - xnack           - A10-9600P
                                                    [on]            - A10-9630P
                                                                    - A12-9700P
                                                                    - A12-9730P
                                                                    - FX-9800P
                                                                    - FX-9830P
   \                           ``amdgcn``   APU   - xnack           - E2-9010
                                                    [on]            - A6-9210
                                                                    - A9-9410
   ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack   ROCm    - FirePro S7150
               - ``tonga``                          [off]           - FirePro S7100
                                                                    - FirePro W7100
                                                                    - Radeon R285
                                                                    - Radeon R9 380
                                                                    - Radeon R9 385
                                                                    - Mobile FirePro
                                                                      M7170
   ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack   ROCm    - Radeon R9 Nano
                                                    [off]           - Radeon R9 Fury
                                                                    - Radeon R9 FuryX
                                                                    - Radeon Pro Duo
                                                                    - FirePro S9300x2
                                                                    - Radeon Instinct MI8
   \           - ``polaris10`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 470
                                                    [off]           - Radeon RX 480
                                                                    - Radeon Instinct MI6
   \           - ``polaris11`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 460
                                                    [off]
   ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
                                                    [on]
   GCN GFX9 [AMD-GCN-GFX9]_
   -----------------------------------------------------------------------------------
   ``gfx900``                  ``amdgcn``   dGPU  - xnack   ROCm    - Radeon Vega
                                                    [off]             Frontier Edition
                                                                    - Radeon RX Vega 56
                                                                    - Radeon RX Vega 64
                                                                    - Radeon RX Vega 64
                                                                      Liquid
                                                                    - Radeon Instinct MI25
   ``gfx902``                  ``amdgcn``   APU   - xnack           - Ryzen 3 2200G
                                                    [on]            - Ryzen 5 2400G
   ``gfx904``                  ``amdgcn``   dGPU  - xnack           TBA
                                                    [off]
                                                                    .. TODO
                                                                       Add product
                                                                       names.
   ``gfx906``                  ``amdgcn``   dGPU  - xnack           TBA
                                                    [off]
                                                                    .. TODO
                                                                       Add product
                                                                       names.
   =========== =============== ============ ===== ========= ======= ==================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor-specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor, and the default value
used if not specified explicitly, are listed in
:ref:`amdgpu-processor-table`.

Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
target features.

For example:

``-mxnack``
  Enable the ``xnack`` feature.
``-mno-xnack``
  Disable the ``xnack`` feature.

.. table:: AMDGPU Target Features
   :name: amdgpu-target-feature-table

   ============== ==================================================
   Target Feature Description
   ============== ==================================================
   -m[no-]xnack   Enable/disable generating code that has
                  memory clauses that are compatible with
                  having XNACK replay enabled.

                  This is used for demand paging and page
                  migration. If XNACK replay is enabled in
                  the device, then if a page fault occurs
                  the code may execute incorrectly if the
                  ``xnack`` feature is not enabled. Executing
                  code that has the feature enabled on a
                  device that does not have XNACK replay
                  enabled will execute correctly, but may
                  be less performant than code with the
                  feature disabled.
   ============== ==================================================

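The options described so far (target triple, processor, and target feature)
can be combined on one command line. The following is a minimal sketch, not
part of the backend, of how a build script might assemble such a command. The
helper name ``clang_args`` and its parameters are hypothetical; the accepted
values are copied from the tables above, and the processor is passed in the
``-mcpu=<Processor>`` spelling.

.. code-block:: python

   # Values taken from the architecture, vendor, and OS tables above.
   ARCHITECTURES = {"r600", "amdgcn"}
   VENDORS = {"amd", "mesa3d"}
   OPERATING_SYSTEMS = {"", "amdhsa", "amdpal", "mesa3d"}


   def clang_args(arch, vendor="amd", os_name="amdhsa", cpu=None, xnack=None):
       """Return a clang argument list for an AMDGPU target (hypothetical helper)."""
       if arch not in ARCHITECTURES:
           raise ValueError("unknown architecture: " + arch)
       if vendor not in VENDORS:
           raise ValueError("unknown vendor: " + vendor)
       if os_name not in OPERATING_SYSTEMS:
           raise ValueError("unknown OS: " + os_name)
       # The vendor table only allows ``mesa3d`` when the OS is also ``mesa3d``.
       if vendor == "mesa3d" and os_name != "mesa3d":
           raise ValueError("vendor mesa3d requires OS mesa3d")

       # Empty components are dropped; the environment is always the default.
       triple = "-".join(part for part in (arch, vendor, os_name) if part)
       args = ["clang", "-target", triple]
       if cpu is not None:
           args.append("-mcpu=" + cpu)
       # xnack=None leaves the processor default from the processor table.
       if xnack is not None:
           args.append("-mxnack" if xnack else "-mno-xnack")
       return args

For example, ``clang_args("amdgcn", cpu="gfx900", xnack=False)`` selects the
``amdgcn-amd-amdhsa`` triple for a ``gfx900`` processor with the ``xnack``
feature disabled. The sketch does not check that the chosen processor supports
the requested feature.
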
.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU backend uses the following address space mappings.

The memory space names used in the table, aside from the region memory space,
are from the OpenCL standard.

The LLVM Address Space number is used throughout LLVM (for example, in LLVM
IR).

.. table:: Address Space Mapping
   :name: amdgpu-address-space-mapping-table

   ================== =================
   LLVM Address Space Memory Space
   ================== =================
   0                  Generic (Flat)
   1                  Global
   2                  Region (GDS)
   3                  Local (group/LDS)
   4                  Constant
   5                  Private (Scratch)
   6                  Constant 32-bit
   ================== =================

.. _amdgpu-memory-scopes:

Memory Scopes
-------------

This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).

The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
relation is transitive over the synchronizes-with relation independent of scope,
and synchronizes-with allows the memory scope instances to be inclusive (see
table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This differs from the OpenCL [OpenCL]_ memory model, which does not have scope
inclusion and requires the memory scopes to match exactly. However, this
is conservatively correct for OpenCL.

.. table:: AMDHSA LLVM Sync Scopes
   :name: amdgpu-amdhsa-llvm-sync-scopes-table

   ================ ==========================================================
   LLVM Sync Scope  Description
   ================ ==========================================================
   none             The default: ``system``.

                    Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system``.
                    - ``agent`` and executed by a thread on the same agent.
                    - ``workgroup`` and executed by a thread in the same
                      workgroup.
                    - ``wavefront`` and executed by a thread in the same
                      wavefront.

   ``agent``        Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system`` or ``agent`` and executed by a thread on the
                      same agent.
                    - ``workgroup`` and executed by a thread in the same
                      workgroup.
                    - ``wavefront`` and executed by a thread in the same
                      wavefront.

   ``workgroup``    Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system``, ``agent`` or ``workgroup`` and executed by a
                      thread in the same workgroup.
                    - ``wavefront`` and executed by a thread in the same
                      wavefront.

   ``wavefront``    Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system``, ``agent``, ``workgroup`` or ``wavefront``
                      and executed by a thread in the same wavefront.

   ``singlethread`` Only synchronizes with, and participates in modification
                    and seq_cst total orderings with, other operations (except
                    image operations) running in the same thread for all
                    address spaces (for example, in signal handlers).
   ================ ==========================================================

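One way to read the sync scope table is that two operations can synchronize
when both executing threads belong to the same instance of the narrower of the
two scopes. The following sketch encodes that reading only. The ``Thread``
record, its fields, and ``may_synchronize`` are hypothetical names introduced
for illustration; the sketch ignores the address space and image operation
exceptions listed in the table, and it assumes that ``singlethread`` on either
side requires both operations to run in the same thread.

.. code-block:: python

   from dataclasses import dataclass

   # Scopes ordered from narrowest to widest; "" stands for the default scope.
   _RANK = {"wavefront": 0, "workgroup": 1, "agent": 2, "system": 3, "": 3}


   @dataclass(frozen=True)
   class Thread:
       """Hypothetical identification of where a thread executes."""
       agent: int
       workgroup: int
       wavefront: int
       lane: int


   def may_synchronize(scope_a, thread_a, scope_b, thread_b):
       """Return True if the two scopes are inclusive for these two threads."""
       if "singlethread" in (scope_a, scope_b):
           # Assumption: singlethread only ever pairs with the same thread.
           return thread_a == thread_b
       narrowest = min(_RANK[scope_a], _RANK[scope_b])
       if narrowest == _RANK["system"]:
           return True
       if narrowest == _RANK["agent"]:
           return thread_a.agent == thread_b.agent
       if narrowest == _RANK["workgroup"]:
           return (thread_a.agent, thread_a.workgroup) == (
               thread_b.agent, thread_b.workgroup)
       # Narrowest scope is wavefront.
       return (thread_a.agent, thread_a.workgroup, thread_a.wavefront) == (
           thread_b.agent, thread_b.workgroup, thread_b.wavefront)

Under this reading an ``agent`` scope operation pairs with a ``system`` scope
operation only when both run on the same agent, which matches the ``agent`` row
of the table.
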
AMDGPU Intrinsics
-----------------

The AMDGPU backend implements the following LLVM IR intrinsics.

This section is WIP.

.. TODO
   List AMDGPU intrinsics

AMDGPU Attributes
-----------------

The AMDGPU backend supports the following LLVM IR attributes.

.. table:: AMDGPU LLVM IR Attributes
   :name: amdgpu-llvm-ir-attributes-table

   ======================================= ==========================================================
   LLVM Attribute                          Description
   ======================================= ==========================================================
   "amdgpu-flat-work-group-size"="min,max" Specify the minimum and maximum flat work group sizes that
                                           will be specified when the kernel is dispatched. Generated
                                           by the ``amdgpu_flat_work_group_size`` CLANG attribute [CLANG-ATTR]_.
   "amdgpu-implicitarg-num-bytes"="n"      Number of kernel argument bytes to add to the kernel
                                           argument block size for the implicit arguments. This
                                           varies by OS and language (for OpenCL see
                                           :ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
   "amdgpu-max-work-group-size"="n"        Specify the maximum work-group size that will be specified
                                           when the kernel is dispatched.
   "amdgpu-num-sgpr"="n"                   Specifies the number of SGPRs to use. Generated by
                                           the ``amdgpu_num_sgpr`` CLANG attribute [CLANG-ATTR]_.
   "amdgpu-num-vgpr"="n"                   Specifies the number of VGPRs to use. Generated by the
                                           ``amdgpu_num_vgpr`` CLANG attribute [CLANG-ATTR]_.
   "amdgpu-waves-per-eu"="m,n"             Specify the minimum and maximum number of waves per
                                           execution unit. Generated by the ``amdgpu_waves_per_eu``
                                           CLANG attribute [CLANG-ATTR]_.
   ======================================= ==========================================================

Code Object
===========

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

Header
------

The AMDGPU backend uses the following ELF header:

.. table:: AMDGPU ELF Header
   :name: amdgpu-elf-header-table

   ========================== ===============================
   Field                      Value
   ========================== ===============================
   ``e_ident[EI_CLASS]``      - ``ELFCLASS32``
                              - ``ELFCLASS64``
   ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
   ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                              - ``ELFOSABI_AMDGPU_HSA``
                              - ``ELFOSABI_AMDGPU_PAL``
                              - ``ELFOSABI_AMDGPU_MESA3D``
   ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
                              - ``ELFABIVERSION_AMDGPU_PAL``
                              - ``ELFABIVERSION_AMDGPU_MESA3D``
   ``e_type``                 - ``ET_REL``
                              - ``ET_DYN``
   ``e_machine``              ``EM_AMDGPU``
   ``e_entry``                0
   ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
   ========================== ===============================

..

.. table:: AMDGPU ELF Header Enumeration Values
   :name: amdgpu-elf-header-enumeration-values-table

   =============================== =====
   Name                            Value
   =============================== =====
   ``EM_AMDGPU``                   224
   ``ELFOSABI_NONE``               0
   ``ELFOSABI_AMDGPU_HSA``         64
   ``ELFOSABI_AMDGPU_PAL``         65
   ``ELFOSABI_AMDGPU_MESA3D``      66
   ``ELFABIVERSION_AMDGPU_HSA``    1
   ``ELFABIVERSION_AMDGPU_PAL``    0
   ``ELFABIVERSION_AMDGPU_MESA3D`` 0
   =============================== =====

``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for ``r600`` architecture.

  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
    bit applications.

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMD GPU architecture specific OS ABIs
  (see :ref:`amdgpu-os-table`):

  * ``ELFOSABI_NONE`` for unknown OS.

  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMD GPU architecture specific OS ABI to which the code
  object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
    3D runtime ABI.

``e_type``
  Can be one of the following values:


  ``ET_REL``
    The type produced by the AMD GPU backend compiler as it is a relocatable
    code object.

  ``ET_DYN``
    The type produced by the linker as it is a shared code object.

  The AMD HSA runtime loader requires an ``ET_DYN`` code object.

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
  :ref:`amdgpu-elf-header-e_flags-table`).

``e_entry``
  The entry point is 0 as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

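The header fields above can be inspected with a few lines of code. The
following is an illustrative sketch, not a tool shipped with LLVM: the function
name ``parse_amdgpu_elf_header`` and the keys of the dictionary it returns are
hypothetical. The AMDGPU values come from the enumeration table above, and the
byte offsets and the ``ELFCLASS``, ``ELFDATA``, and ``ET_`` values are those of
the standard ELF header layout [ELF]_.

.. code-block:: python

   import struct

   ELFCLASS32, ELFCLASS64 = 1, 2          # standard ELF values
   ELFDATA2LSB = 1                        # standard ELF value
   ET_NAMES = {1: "ET_REL", 3: "ET_DYN"}  # standard ELF values
   EM_AMDGPU = 224
   OSABI_NAMES = {
       0: "ELFOSABI_NONE",
       64: "ELFOSABI_AMDGPU_HSA",
       65: "ELFOSABI_AMDGPU_PAL",
       66: "ELFOSABI_AMDGPU_MESA3D",
   }


   def parse_amdgpu_elf_header(data):
       """Decode the header fields discussed above from the start of a file."""
       if data[:4] != b"\x7fELF":
           raise ValueError("not an ELF file")
       # e_ident bytes 4..8: EI_CLASS, EI_DATA, EI_VERSION, EI_OSABI,
       # EI_ABIVERSION.
       ei_class, ei_data, _, ei_osabi, ei_abiversion = struct.unpack_from(
           "5B", data, 4)
       if ei_data != ELFDATA2LSB:
           raise ValueError("AMDGPU code objects are little-endian")
       if ei_class == ELFCLASS64:
           bits, layout = 64, "<HHIQQQI"
       elif ei_class == ELFCLASS32:
           bits, layout = 32, "<HHIIIII"
       else:
           raise ValueError("unknown ELF class")
       # e_type, e_machine, e_version, e_entry, e_phoff, e_shoff, e_flags.
       e_type, e_machine, _, e_entry, _, _, e_flags = struct.unpack_from(
           layout, data, 16)
       if e_machine != EM_AMDGPU:
           raise ValueError("not an AMDGPU code object")
       return {
           "bits": bits,
           "osabi": OSABI_NAMES.get(ei_osabi),
           "abiversion": ei_abiversion,
           "type": ET_NAMES.get(e_type),
           "entry": e_entry,
           "e_flags": e_flags,
       }

Passing the first 64 bytes of a code object is enough for either ELF class.
The raw ``e_flags`` value is returned undecoded; its bit fields are described
under ``e_flags``.
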
``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags``
     :name: amdgpu-elf-header-e_flags-table

     ================================= ========== =============================
     Name                              Value      Description
     ================================= ========== =============================
     AMDGPU Processor Flag                        See :ref:`amdgpu-processor-table`.
     -------------------------------------------- -----------------------------
     ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
                                                  mask for
                                                  ``EF_AMDGPU_MACH_xxx`` values
                                                  defined in
                                                  :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
                                                  target feature is
                                                  enabled for all code
                                                  contained in the code object.
                                                  If the processor
                                                  does not support the
                                                  ``xnack`` target
                                                  feature then must
                                                  be 0.
                                                  See
                                                  :ref:`amdgpu-target-features`.
     ================================= ========== =============================

  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
     :name: amdgpu-ef-amdgpu-mach-table

     ================================= ========== =============================
     Name                              Value      Description (see
                                                  :ref:`amdgpu-processor-table`)
     ================================= ========== =============================
     ``EF_AMDGPU_MACH_NONE``           0x000      not specified
     ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
     ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
     ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
     ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
     ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
     ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
     ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
     ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
     ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
     ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
     ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
     ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
     ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
     ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
     ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
     ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
     reserved                          0x011 -    Reserved for ``r600``
                                       0x01f      architecture processors.
     ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
     ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
     ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
     ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
     ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
     ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
     ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
     reserved                          0x027      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
     ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
     ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
     ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
     ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
     ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
     ``EF_AMDGPU_MACH_AMDGCN_GFX904``  0x02e      ``gfx904``
     ``EF_AMDGPU_MACH_AMDGCN_GFX906``  0x02f      ``gfx906``
     reserved                          0x030      Reserved.
     ================================= ========== =============================

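As an illustration of how the two ``e_flags`` fields combine, the sketch below
splits a flags word into a processor name and the ``xnack`` setting. The
function ``decode_e_flags`` and the lookup table name are hypothetical; the
masks and machine values are copied from the two tables above.

.. code-block:: python

   EF_AMDGPU_MACH = 0x000000FF
   EF_AMDGPU_XNACK = 0x00000100

   # r600 architecture processors occupy 0x001 through 0x010 in table order.
   _R600_NAMES = [
       "r600", "r630", "rs880", "rv670", "rv710", "rv730", "rv770", "cedar",
       "cypress", "juniper", "redwood", "sumo", "barts", "caicos", "cayman",
       "turks",
   ]
   MACH_NAMES = {value: name for value, name in enumerate(_R600_NAMES, start=1)}
   # amdgcn architecture processors; 0x027 is reserved and has no entry.
   MACH_NAMES.update({
       0x020: "gfx600", 0x021: "gfx601", 0x022: "gfx700", 0x023: "gfx701",
       0x024: "gfx702", 0x025: "gfx703", 0x026: "gfx704", 0x028: "gfx801",
       0x029: "gfx802", 0x02A: "gfx803", 0x02B: "gfx810", 0x02C: "gfx900",
       0x02D: "gfx902", 0x02E: "gfx904", 0x02F: "gfx906",
   })


   def decode_e_flags(e_flags):
       """Return (processor name or None, xnack enabled) for an e_flags word."""
       mach = e_flags & EF_AMDGPU_MACH
       xnack = bool(e_flags & EF_AMDGPU_XNACK)
       return MACH_NAMES.get(mach), xnack

A result of ``None`` for the processor name covers both
``EF_AMDGPU_MACH_NONE`` and the reserved values.
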
				594	Sections
				595	--------
				596
				597	An AMDGPU target ELF code object has the standard ELF sections which include:
				598
.. table:: AMDGPU ELF Sections
   :name: amdgpu-elf-sections-table

   ================== ================ =================================
   Name               Type             Attributes
   ================== ================ =================================
   ``.bss``           ``SHT_NOBITS``   ``SHF_ALLOC`` + ``SHF_WRITE``
   ``.data``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
   ``.debug_``\ \*    ``SHT_PROGBITS`` none
   ``.dynamic``       ``SHT_DYNAMIC``  ``SHF_ALLOC``
   ``.dynstr``        ``SHT_PROGBITS`` ``SHF_ALLOC``
   ``.dynsym``        ``SHT_PROGBITS`` ``SHF_ALLOC``
   ``.got``           ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
   ``.hash``          ``SHT_HASH``     ``SHF_ALLOC``
   ``.note``          ``SHT_NOTE``     none
   ``.rela``\ name    ``SHT_RELA``     none
   ``.rela.dyn``      ``SHT_RELA``     none
   ``.rodata``        ``SHT_PROGBITS`` ``SHF_ALLOC``
   ``.shstrtab``      ``SHT_STRTAB``   none
   ``.strtab``        ``SHT_STRTAB``   none
   ``.symtab``        ``SHT_SYMTAB``   none
   ``.text``          ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
   ================== ================ =================================
				622
				623	These sections have their standard meanings (see [ELF]_) and are only generated
				624	if needed.
				625
``.debug``\ \*
  The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
  DWARF produced by the AMDGPU backend.

``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
  The standard sections used by a dynamic loader.

``.note``
  See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
  backend.

``.rela``\ name, ``.rela.dyn``
  For relocatable code objects, name is the name of the section to which the
  relocation records apply. For example, ``.rela.text`` is the section name for
  relocation records associated with the ``.text`` section.

  For linked shared code objects, ``.rela.dyn`` contains all the relocation
  records from each of the relocatable code object's ``.rela``\ name sections.

  See :ref:`amdgpu-relocation-records` for the relocation records supported by
  the AMDGPU backend.

``.text``
  The executable machine code for the kernels and the functions they call.
  Generated as position independent code. See :ref:`amdgpu-code-conventions`
  for information on the conventions used in the ISA generation.
				652
				653	.. _amdgpu-note-records:
				654
				655	Note Records
				656	------------
				657
As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
aligned. In addition, minimal zero byte padding must be generated to ensure the
``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
``.note`` section must be at least 4 to indicate at least 4 byte alignment.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	663
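The padding rules above can be illustrated with a short sketch. It assumes the
usual ELF note layout (three 32-bit words ``namesz``, ``descsz`` and ``type``,
followed by the ``name`` and ``desc`` fields) and little-endian encoding; the
helper names ``pad4`` and ``pack_note`` are hypothetical and this is not the
backend's emitter.

.. code-block:: python

   import struct

   NT_AMD_AMDGPU_HSA_METADATA = 10

   def pad4(data):
       """Append the minimal number of zero bytes to reach a multiple of 4."""
       return data + b"\0" * (-len(data) % 4)

   def pack_note(name, note_type, desc):
       """Pack one note record; little-endian byte order is an assumption."""
       name_bytes = name.encode("ascii") + b"\0"  # namesz counts the null
       header = struct.pack("<III", len(name_bytes), len(desc), note_type)
       # Padding after name keeps desc 4 byte aligned; padding after desc
       # keeps the record size a multiple of 4 bytes.
       return header + pad4(name_bytes) + pad4(desc)

For example, ``pack_note("AMD", NT_AMD_AMDGPU_HSA_METADATA, b"abcde\0")``
records a ``descsz`` of 6 but occupies 8 bytes for the ``desc`` field.
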
				664	The AMDGPU backend code object uses the following ELF note records in the
				665	``.note`` section. The Description column specifies the layout of the note
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	666	record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	667	variable size strings have a corresponding ``*_size`` field that specifies the
				668	number of bytes, including the terminating null character, in the string. The
				669	string(s) come immediately after the preceding fields.
				670
				671	Additional note records can be present.
				672
.. table:: AMDGPU ELF Note Records
   :name: amdgpu-elf-note-records-table

   ===== ============================== ======================================
   Name  Type                           Description
   ===== ============================== ======================================
   "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
   ===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	681
				682	..
				683
.. table:: AMDGPU ELF Note Record Enumeration Values
   :name: amdgpu-elf-note-record-enumeration-values-table

   ============================== =====
   Name                           Value
   ============================== =====
   reserved                       0-9
   ``NT_AMD_AMDGPU_HSA_METADATA`` 10
   reserved                       11
   ============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	694
``NT_AMD_AMDGPU_HSA_METADATA``
  Specifies extensible metadata associated with the code objects executed on HSA
  [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
  the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
  :ref:`amdgpu-amdhsa-code-object-metadata` for the syntax of the code
  object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	701
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	702	.. _amdgpu-symbols:
				703
				704	Symbols
				705	-------
				706
				707	Symbols include the following:
				708
.. table:: AMDGPU ELF Symbols
   :name: amdgpu-elf-symbols-table

   ===================== ============== ============= ==================
   Name                  Type           Section       Description
   ===================== ============== ============= ==================
   link-name             ``STT_OBJECT`` - ``.data``   Global variable
                                        - ``.rodata``
                                        - ``.bss``
   link-name\ ``.kd``    ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
   link-name             ``STT_FUNC``   - ``.text``   Kernel entry point
   ===================== ============== ============= ==================
				721
Global variable
  Global variables both used and defined by the compilation unit.

  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.

  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
  will resolve relocations using the definition provided by another code object
  or explicitly defined by the runtime.

  All global symbols, whether defined in the compilation unit or external, are
  accessed by the machine code indirectly through a GOT entry. This allows
  them to be preemptable. The GOT is only supported when the target triple OS
  is ``amdhsa`` (see :ref:`amdgpu-target-triples`).

  .. TODO
     Add description of linked shared object symbols. Seems undefined symbols
     are marked as STT_NOTYPE.

Kernel descriptor
  Every HSA kernel has an associated kernel descriptor. It is the address of the
  kernel descriptor that is used in the AQL dispatch packet used to invoke the
  kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
  defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.

Kernel entry point
  Every HSA kernel also has a symbol for its machine code entry point.
				749
				750	.. _amdgpu-relocation-records:
				751
				752	Relocation Records
				753	------------------
				754
The AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
relocatable fields are:

``word32``
  This specifies a 32-bit field occupying 4 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMD GPU architecture.

``word64``
  This specifies a 64-bit field occupying 8 bytes with arbitrary byte
  alignment. These values use the same byte order as other word values in the
  AMD GPU architecture.
				767
The following notations are used for specifying relocation calculations:

A
  Represents the addend used to compute the value of the relocatable field.

G
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.

GOT
  Represents the address of the global offset table.

P
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
  of the storage unit being relocated (computed using ``r_offset``).

S
  Represents the value of the symbol whose index resides in the relocation
  entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.

B
  Represents the base address of a loaded executable or shared object which is
  the difference between the ELF address and the actual load address. Relocations
  using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	792
				793	The following relocation types are supported:
				794
.. table:: AMDGPU ELF Relocation Records
   :name: amdgpu-elf-relocation-records-table

   ========================== ======= ===== ========== ==============================
   Relocation Type            Kind    Value Field      Calculation
   ========================== ======= ===== ========== ==============================
   ``R_AMDGPU_NONE``                  0     none       none
   ``R_AMDGPU_ABS32_LO``      Static, 1     ``word32`` (S + A) & 0xFFFFFFFF
                              Dynamic
   ``R_AMDGPU_ABS32_HI``      Static, 2     ``word32`` (S + A) >> 32
                              Dynamic
   ``R_AMDGPU_ABS64``         Static, 3     ``word64`` S + A
                              Dynamic
   ``R_AMDGPU_REL32``         Static  4     ``word32`` S + A - P
   ``R_AMDGPU_REL64``         Static  5     ``word64`` S + A - P
   ``R_AMDGPU_ABS32``         Static, 6     ``word32`` S + A
                              Dynamic
   ``R_AMDGPU_GOTPCREL``      Static  7     ``word32`` G + GOT + A - P
   ``R_AMDGPU_GOTPCREL32_LO`` Static  8     ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
   ``R_AMDGPU_GOTPCREL32_HI`` Static  9     ``word32`` (G + GOT + A - P) >> 32
   ``R_AMDGPU_REL32_LO``      Static  10    ``word32`` (S + A - P) & 0xFFFFFFFF
   ``R_AMDGPU_REL32_HI``      Static  11    ``word32`` (S + A - P) >> 32
   reserved                           12
   ``R_AMDGPU_RELATIVE64``    Dynamic 13    ``word64`` B + A
   ========================== ======= ===== ========== ==============================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	820
Tony Tye	223f4c7	2018-04-13 01:01:27 +0000	[diff] [blame]	821	``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
				822	the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
				823
				824	There is no current OS loader support for 32 bit programs and so
				825	``R_AMDGPU_ABS32`` is not used.
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	826
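The calculations in the relocation table can be written out directly. The
sketch below is illustrative only: the function names are hypothetical, and it
assumes that intermediate results wrap to 64 bits and that a ``word32`` field
stores the low 32 bits of the computed value.

.. code-block:: python

   MASK32 = 0xFFFFFFFF
   MASK64 = 0xFFFFFFFFFFFFFFFF

   def abs64(s, a):
       """R_AMDGPU_ABS64: S + A."""
       return (s + a) & MASK64

   def rel32_lo(s, a, p):
       """R_AMDGPU_REL32_LO: (S + A - P) & 0xFFFFFFFF."""
       return ((s + a - p) & MASK64) & MASK32

   def rel32_hi(s, a, p):
       """R_AMDGPU_REL32_HI: (S + A - P) >> 32, truncated to 32 bits."""
       return (((s + a - p) & MASK64) >> 32) & MASK32

   def gotpcrel32_lo(g, got, a, p):
       """R_AMDGPU_GOTPCREL32_LO: (G + GOT + A - P) & 0xFFFFFFFF."""
       return ((g + got + a - p) & MASK64) & MASK32

   def relative64(b, a):
       """R_AMDGPU_RELATIVE64: B + A."""
       return (b + a) & MASK64

A ``_LO``/``_HI`` pair together covers the full 64-bit result: the low word
goes in one ``word32`` field and the high word in another.
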
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	827	.. _amdgpu-dwarf:
				828
				829	DWARF
				830	-----
				831
Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
information that maps the code object executable code and data to the source
language constructs. This information can be used by tools such as debuggers
and profilers.
				835
				836	Address Space Mapping
				837	~~~~~~~~~~~~~~~~~~~~~
				838
				839	The following address space mapping is used:
				840
.. table:: AMDGPU DWARF Address Space Mapping
   :name: amdgpu-dwarf-address-space-mapping-table

   =================== =================
   DWARF Address Space Memory Space
   =================== =================
   1                   Private (Scratch)
   2                   Local (group/LDS)
   omitted             Global
   omitted             Constant
   omitted             Generic (Flat)
   not supported       Region (GDS)
   =================== =================
				854
				855	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				856	used in the table.
				857
				858	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				859	DWARF address space of the value of the pointer when it is in the private or
				860	local address space. Otherwise the attribute is omitted.
				861
An ``XDEREF`` operation is generated in location list expressions for variables
that are allocated in the private and local address space. Otherwise no
``XDEREF`` is generated.
				865
				866	Register Mapping
				867	~~~~~~~~~~~~~~~~
				868
				869	This section is WIP.
				870
				871	.. TODO
				872	Define DWARF register enumeration.
				873
   If we want to present a wavefront state then we should expose vector
   registers as 64 wide (rather than the per work-item view that LLVM uses),
   either as separate registers or as a single 64x4 byte register. In either
   case, use a new LANE op (akin to XDEREF) to select the current lane in a
   location expression. This would also allow scalar register spilling to
   vector register lanes to be expressed (currently no debug information is
   generated for spilling). If the wide single register approach is chosen,
   use LANE in conjunction with the PIECE operation to select the dword part
   of the register for the current lane. If the separate register approach is
   chosen, use LANE to select the register.
				884
				885	Source Text
				886	~~~~~~~~~~~
				887
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	888	Source text for online-compiled programs (e.g. those compiled by the OpenCL
				889	runtime) may be embedded into the DWARF v5 line table using the ``clang
				890	-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	891
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	892	For example:
				893
``-gembed-source``
  Enable the embedded source DWARF v5 extension.
``-gno-embed-source``
  Disable the embedded source DWARF v5 extension.
				898
.. table:: AMDGPU Debug Options
   :name: amdgpu-debug-options

   ==================== ==================================================
   Debug Flag           Description
   ==================== ==================================================
   -g[no-]embed-source  Enable/disable embedding source text in DWARF
                        debug sections. Useful for environments where
                        source cannot be written to disk, such as
                        when performing online compilation.
   ==================== ==================================================
				910
This option enables one extended content type in the DWARF v5 Line Number
Program Header, which is used to encode embedded source.
				913
.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
   :name: amdgpu-dwarf-extended-content-types

   ============================ ======================
   Content Type                 Form
   ============================ ======================
   ``DW_LNCT_LLVM_source``      ``DW_FORM_line_strp``
   ============================ ======================
				922
				923	The source field will contain the UTF-8 encoded, null-terminated source text
				924	with ``'\n'`` line endings. When the source field is present, consumers can use
				925	the embedded source instead of attempting to discover the source on disk. When
				926	the source field is absent, consumers can access the file to get the source
				927	text.
				928
The above content type appears in the ``file_name_entry_format`` field of the
line table prologue, and its corresponding value appears in the ``file_names``
field. The current encoding of the content type is documented in table
:ref:`amdgpu-dwarf-extended-content-types-encoding`.
				933
.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
   :name: amdgpu-dwarf-extended-content-types-encoding

   ============================ ====================
   Content Type                 Value
   ============================ ====================
   ``DW_LNCT_LLVM_source``      0x2001
   ============================ ====================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	942
				943	.. _amdgpu-code-conventions:
				944
				945	Code Conventions
				946	================
				947
				948	This section provides code conventions used for each supported target triple OS
				949	(see :ref:`amdgpu-target-triples`).
				950
				951	AMDHSA
				952	------
				953
				954	This section provides code conventions used when the target triple OS is
				955	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				956
Scott Linder	1e8c2c7	2018-06-21 19:38:56 +0000	[diff] [blame]	957	.. _amdgpu-amdhsa-code-object-target-identification:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	958
Tony Tye	01bfd6c	2018-03-27 21:20:46 +0000	[diff] [blame]	959	Code Object Target Identification
				960	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				961
				962	The AMDHSA OS uses the following syntax to specify the code object
				963	target as a single string:
				964
				965	``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
				966
				967	Where:
				968
- ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
  are the same as the Target Triple (see
  :ref:`amdgpu-target-triples`).

- ``<Processor>`` is the same as the Processor (see
  :ref:`amdgpu-processors`).

- ``<Target Features>`` is a list of the enabled Target Features
  (see :ref:`amdgpu-target-features`), each prefixed by a plus, that
  apply to Processor. The list must be in the same order as listed
  in the table :ref:`amdgpu-target-feature-table`. Note that *Target
  Features* must be included in the list if they are enabled even if
  that is the default for Processor.
				982
				983	For example:
				984
				985	``"amdgcn-amd-amdhsa--gfx902+xnack"``
				986
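The syntax above amounts to simple string concatenation. As a minimal sketch
(the function name ``code_object_target_id`` is hypothetical, and the caller is
assumed to pass the enabled features already in the order required by the
target feature table):

.. code-block:: python

   def code_object_target_id(arch, vendor, os_name, environment, processor,
                             features=()):
       """Build the code object target string described above.

       Each enabled target feature is appended with a leading plus. The
       features must already be in target feature table order.
       """
       triple = f"{arch}-{vendor}-{os_name}-{environment}"
       return f"{triple}-{processor}" + "".join(f"+{f}" for f in features)

   # An empty environment produces the double dash seen in the example above.
   target = code_object_target_id("amdgcn", "amd", "amdhsa", "", "gfx902",
                                  ["xnack"])
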
Scott Linder	1e8c2c7	2018-06-21 19:38:56 +0000	[diff] [blame]	987	.. _amdgpu-amdhsa-code-object-metadata:
				988
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	989	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	990	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	991
The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the ROCm kernel queries, for example the
segment sizes needed in a dispatch packet. In addition, a high level language
runtime may require other information to be included. For example, the AMD
OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1001
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1002	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1003	:doc:`YamlIO`).
				1004
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1005	.. TODO
				1006	Is the string null terminated? It probably should not if YAML allows it to
				1007	contain null characters, otherwise it should be.
				1008
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1009	The metadata is represented as a single YAML document comprised of the mapping
				1010	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				1011	referenced tables.
				1012
				1013	For boolean values, the string values of ``false`` and ``true`` are used for
				1014	false and true respectively.
				1015
				1016	Additional information can be added to the mappings. To avoid conflicts, any
				1017	non-AMD key names should be prefixed by "vendor-name.".
				1018
				1019	.. table:: AMDHSA Code Object Metadata Mapping
				1020	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				1021
				1022	========== ============== ========= =======================================
				1023	String Key Value Type Required? Description
				1024	========== ============== ========= =======================================
				1025	"Version" sequence of Required - The first integer is the major
				1026	2 integers version. Currently 1.
				1027	- The second integer is the minor
				1028	version. Currently 0.
				1029	"Printf" sequence of Each string is encoded information
				1030	strings about a printf function call. The
				1031	encoded information is organized as
				1032	fields separated by colon (':'):
				1033
				1034	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				1035
				1036	where:
				1037
				1038	``ID``
				1039	A 32 bit integer as a unique id for
				1040	each printf function call
				1041
				1042	``N``
				1043	A 32 bit integer equal to the number
				1044	of arguments of printf function call
				1045	minus 1
				1046
				1047	``S[i]`` (where i = 0, 1, ... , N-1)
				1048	32 bit integers for the size in bytes
				1049	of the i-th FormatString argument of
				1050	the printf function call
				1051
				1052	FormatString
				1053	The format string passed to the
				1054	printf function call.
				1055	"Kernels" sequence of Required Sequence of the mappings for each
				1056	mapping kernel in the code object. See
				1057	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				1058	for the definition of the mapping.
				1059	========== ============== ========= =======================================
				1060
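The "Printf" encoding described in the table can be decoded with a few string
splits. This is a sketch rather than the runtime's parser: the function name
``parse_printf_entry`` is hypothetical, and it assumes the size fields are
decimal integers. Splitting is bounded so that colons inside the format string
are preserved.

.. code-block:: python

   def parse_printf_entry(entry):
       """Split ``ID:N:S[0]:...:S[N-1]:FormatString`` into its fields."""
       printf_id, count, rest = entry.split(":", 2)
       n = int(count)
       # Split off exactly N size fields; whatever remains is the format
       # string, which may itself contain colons.
       parts = rest.split(":", n)
       sizes = [int(size) for size in parts[:n]]
       format_string = parts[n]
       return int(printf_id), sizes, format_string

   # Two 4 byte arguments follow the format string in this made-up entry.
   entry_id, arg_sizes, fmt = parse_printf_entry("1:2:4:4:x=%d y=%d")
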
				1061	..
				1062
				1063	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				1064	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				1065
				1066	================= ============== ========= ================================
				1067	String Key Value Type Required? Description
				1068	================= ============== ========= ================================
				1069	"Name" string Required Source name of the kernel.
				1070	"SymbolName" string Required Name of the kernel
				1071	descriptor ELF symbol.
				1072	"Language" string Source language of the kernel.
				1073	Values include:
				1074
				1075	- "OpenCL C"
				1076	- "OpenCL C++"
				1077	- "HCC"
				1078	- "OpenMP"
				1079
				1080	"LanguageVersion" sequence of - The first integer is the major
				1081	2 integers version.
				1082	- The second integer is the
				1083	minor version.
				1084	"Attrs" mapping Mapping of kernel attributes.
				1085	See
				1086	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				1087	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1088	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1089	mapping kernel arguments. See
				1090	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				1091	for the definition of the mapping.
				1092	"CodeProps" mapping Mapping of properties related to
				1093	the kernel code. See
				1094	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				1095	for the mapping definition.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1096	================= ============== ========= ================================
				1097
				1098	..
				1099
				1100	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				1101	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				1102
				1103	=================== ============== ========= ==============================
				1104	String Key Value Type Required? Description
				1105	=================== ============== ========= ==============================
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1106	"ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
				1107	3 integers must be >=1 and the dispatch
				1108	work-group size X, Y, Z must
				1109	correspond to the specified
				1110	values. Defaults to 0, 0, 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1111
				1112	Corresponds to the OpenCL
				1113	``reqd_work_group_size``
				1114	attribute.
				1115	"WorkGroupSizeHint" sequence of The dispatch work-group size
				1116	3 integers X, Y, Z is likely to be the
				1117	specified values.
				1118
				1119	Corresponds to the OpenCL
				1120	``work_group_size_hint``
				1121	attribute.
				1122	"VecTypeHint" string The name of a scalar or vector
				1123	type.
				1124
				1125	Corresponds to the OpenCL
				1126	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	1127
				1128	"RuntimeHandle" string The external symbol name
				1129	associated with a kernel.
				1130	OpenCL runtime allocates a
				1131	global buffer for the symbol
				1132	and saves the kernel's address
				1133	to it, which is used for
				1134	device side enqueueing. Only
				1135	available for device side
				1136	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1137	=================== ============== ========= ==============================
				1138
				1139	..
				1140
				1141	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				1142	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				1143
				1144	================= ============== ========= ================================
				1145	String Key Value Type Required? Description
				1146	================= ============== ========= ================================
				1147	"Name" string Kernel argument name.
				1148	"TypeName" string Kernel argument type name.
				1149	"Size" integer Required Kernel argument size in bytes.
				1150	"Align" integer Required Kernel argument alignment in
				1151	bytes. Must be a power of two.
				1152	"ValueKind" string Required Kernel argument kind that
				1153	specifies how to set up the
				1154	corresponding argument.
				1155	Values include:
				1156
				1157	"ByValue"
				1158	The argument is copied
				1159	directly into the kernarg.
				1160
				1161	"GlobalBuffer"
				1162	A global address space pointer
				1163	to the buffer data is passed
				1164	in the kernarg.
				1165
				1166	"DynamicSharedPointer"
				1167	A group address space pointer
				1168	to dynamically allocated LDS
				1169	is passed in the kernarg.
				1170
				1171	"Sampler"
				1172	A global address space
				1173	pointer to a S# is passed in
				1174	the kernarg.
				1175
				1176	"Image"
				1177	A global address space
				1178	pointer to a T# is passed in
				1179	the kernarg.
				1180
				1181	"Pipe"
				1182	A global address space pointer
				1183	to an OpenCL pipe is passed in
				1184	the kernarg.
				1185
				1186	"Queue"
				1187	A global address space pointer
				1188	to an OpenCL device enqueue
				1189	queue is passed in the
				1190	kernarg.
				1191
				1192	"HiddenGlobalOffsetX"
				1193	The OpenCL grid dispatch
				1194	global offset for the X
				1195	dimension is passed in the
				1196	kernarg.
				1197
				1198	"HiddenGlobalOffsetY"
				1199	The OpenCL grid dispatch
				1200	global offset for the Y
				1201	dimension is passed in the
				1202	kernarg.
				1203
				1204	"HiddenGlobalOffsetZ"
				1205	The OpenCL grid dispatch
				1206	global offset for the Z
				1207	dimension is passed in the
				1208	kernarg.
				1209
				1210	"HiddenNone"
				1211	An argument that is not used
				1212	by the kernel. Space needs to
				1213	be left for it, but it does
				1214	not need to be set up.
				1215
				1216	"HiddenPrintfBuffer"
				1217	A global address space pointer
				1218	to the runtime printf buffer
				1219	is passed in kernarg.
				1220
				1221	"HiddenDefaultQueue"
				1222	A global address space pointer
				1223	to the OpenCL device enqueue
				1224	queue that should be used by
				1225	the kernel by default is
				1226	passed in the kernarg.
				1227
				1228	"HiddenCompletionAction"
Yaxun Liu	c928f2a	2017-10-30 14:30:28 +0000	[diff] [blame]	1229	A global address space pointer
				1230	to help link enqueued kernels into
				1231	the ancestor tree for determining
				1232	when the parent kernel has finished.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1233
				1234	"ValueType" string Required Kernel argument value type. Only
				1235	present if "ValueKind" is
				1236	"ByValue". For vector data
				1237	types, the value is for the
				1238	element type. Values include:
				1239
				1240	- "Struct"
				1241	- "I8"
				1242	- "U8"
				1243	- "I16"
				1244	- "U16"
				1245	- "F16"
				1246	- "I32"
				1247	- "U32"
				1248	- "F32"
				1249	- "I64"
				1250	- "U64"
				1251	- "F64"
				1252
				1253	.. TODO
				1254	How can it be determined if a
				1255	vector type, and what size
				1256	vector?
				1257	"PointeeAlign" integer Alignment in bytes of pointee
				1258	type for pointer type kernel
				1259	argument. Must be a power
				1260	of 2. Only present if
				1261	"ValueKind" is
				1262	"DynamicSharedPointer".
				1263	"AddrSpaceQual" string Kernel argument address space
				1264	qualifier. Only present if
				1265	"ValueKind" is "GlobalBuffer" or
				1266	"DynamicSharedPointer". Values
				1267	are:
				1268
				1269	- "Private"
				1270	- "Global"
				1271	- "Constant"
				1272	- "Local"
				1273	- "Generic"
				1274	- "Region"
				1275
				1276	.. TODO
				1277	Is GlobalBuffer only Global
				1278	or Constant? Is
				1279	DynamicSharedPointer always
				1280	Local? Can HCC allow Generic?
				1281	How can Private or Region
				1282	ever happen?
				1283	"AccQual" string Kernel argument access
				1284	qualifier. Only present if
				1285	"ValueKind" is "Image" or
				1286	"Pipe". Values
				1287	are:
				1288
				1289	- "ReadOnly"
				1290	- "WriteOnly"
				1291	- "ReadWrite"
				1292
				1293	.. TODO
				1294	Does this apply to
				1295	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1296	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1297	performed by the kernel on the
				1298	kernel argument. Only present if
				1299	"ValueKind" is "GlobalBuffer",
				1300	"Image", or "Pipe". This may be
				1301	more restrictive than indicated
				1302	by "AccQual" to reflect what the
kernel actually does. If not
				1304	present then the runtime must
				1305	assume what is implied by
				1306	"AccQual" and "IsConst". Values
				1307	are:
				1308
				1309	- "ReadOnly"
				1310	- "WriteOnly"
				1311	- "ReadWrite"
				1312
				1313	"IsConst" boolean Indicates if the kernel argument
				1314	is const qualified. Only present
				1315	if "ValueKind" is
				1316	"GlobalBuffer".
				1317
				1318	"IsRestrict" boolean Indicates if the kernel argument
				1319	is restrict qualified. Only
				1320	present if "ValueKind" is
				1321	"GlobalBuffer".
				1322
				1323	"IsVolatile" boolean Indicates if the kernel argument
				1324	is volatile qualified. Only
				1325	present if "ValueKind" is
				1326	"GlobalBuffer".
				1327
				1328	"IsPipe" boolean Indicates if the kernel argument
				1329	is pipe qualified. Only present
				1330	if "ValueKind" is "Pipe".
				1331
				1332	.. TODO
				1333	Can GlobalBuffer be pipe
				1334	qualified?
				1335	================= ============== ========= ================================
				1336
				1337	..
				1338
				1339	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1340	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1341
				1342	============================ ============== ========= =====================
				1343	String Key Value Type Required? Description
				1344	============================ ============== ========= =====================
				1345	"KernargSegmentSize" integer Required The size in bytes of
				1346	the kernarg segment
				1347	that holds the values
				1348	of the arguments to
				1349	the kernel.
				1350	"GroupSegmentFixedSize" integer Required The amount of group
				1351	segment memory
				1352	required by a
				1353	work-group in
				1354	bytes. This does not
				1355	include any
				1356	dynamically allocated
				1357	group segment memory
				1358	that may be added
				1359	when the kernel is
				1360	dispatched.
				1361	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1362	private address space
				1363	memory required for a
				1364	work-item in
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1365	bytes. If the kernel
				1366	uses a dynamic call
				1367	stack then additional
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1368	space must be added
				1369	to this value for the
				1370	call stack.
				1371	"KernargSegmentAlign" integer Required The maximum byte
				1372	alignment of
				1373	arguments in the
				1374	kernarg segment. Must
				1375	be a power of 2.
				1376	"WavefrontSize" integer Required Wavefront size. Must
				1377	be a power of 2.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1378	"NumSGPRs" integer Required Number of scalar
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1379	registers used by a
				1380	wavefront for
				1381	GFX6-GFX9. This
				1382	includes the special
				1383	SGPRs for VCC, Flat
				1384	Scratch (GFX7-GFX9)
				1385	and XNACK (for
				1386	GFX8-GFX9). It does
				1387	not include the 16
SGPRs added if a trap
				1389	handler is
				1390	enabled. It is not
				1391	rounded up to the
				1392	allocation
				1393	granularity.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1394	"NumVGPRs" integer Required Number of vector
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1395	registers used by
				1396	each work-item for
GFX6-GFX9.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1398	"MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1399	work-group size
				1400	supported by the
				1401	kernel in work-items.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1402	Must be >=1 and
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1403	consistent with
				1404	ReqdWorkGroupSize if
				1405	not 0, 0, 0.
Konstantin Zhuravlyov	06ae4ec	2017-11-28 17:51:08 +0000	[diff] [blame]	1406	"NumSpilledSGPRs" integer Number of stores from
				1407	a scalar register to
				1408	a register allocator
				1409	created spill
				1410	location.
				1411	"NumSpilledVGPRs" integer Number of stores from
				1412	a vector register to
				1413	a register allocator
				1414	created spill
				1415	location.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1416	============================ ============== ========= =====================
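
The constraints stated in the table above can be checked mechanically. The
following is a minimal sketch, assuming the metadata has already been parsed
into a Python mapping keyed by the string keys above; the helper names
``check_code_props`` and ``is_power_of_two`` are hypothetical and are not part
of LLVM or the ROCm runtime.

```python
# Keys that the table marks as "Required".
REQUIRED_KEYS = (
    "KernargSegmentSize",
    "GroupSegmentFixedSize",
    "PrivateSegmentFixedSize",
    "KernargSegmentAlign",
    "WavefrontSize",
    "NumSGPRs",
    "NumVGPRs",
    "MaxFlatWorkGroupSize",
)


def is_power_of_two(value):
    """True for 1, 2, 4, 8, ... and False for everything else."""
    return value > 0 and (value & (value - 1)) == 0


def check_code_props(props):
    """Hypothetical checker: returns a list of problems, empty if none found."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in props:
            problems.append("missing required key: " + key)
    # The table requires these two values to be powers of 2.
    for key in ("KernargSegmentAlign", "WavefrontSize"):
        if key in props and not is_power_of_two(props[key]):
            problems.append(key + " must be a power of 2")
    # The table requires MaxFlatWorkGroupSize >= 1.
    if "MaxFlatWorkGroupSize" in props and props["MaxFlatWorkGroupSize"] < 1:
        problems.append("MaxFlatWorkGroupSize must be >= 1")
    return problems
```

This sketch does not check consistency between ``MaxFlatWorkGroupSize`` and
``ReqdWorkGroupSize``, which needs the kernel attribute metadata as well.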
				1417
				1418	..
				1419
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1420	Kernel Dispatch
				1421	~~~~~~~~~~~~~~~
				1422
				1423	The HSA architected queuing language (AQL) defines a user space memory interface
				1424	that can be used to control the dispatch of kernels, in an agent independent
				1425	way. An agent can have zero or more AQL queues created for it using the ROCm
				1426	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1427	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1428	mechanics and packet layouts.
				1429
				1430	The packet processor of a kernel agent is responsible for detecting and
				1431	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1432	packet processor is implemented by the hardware command processor (CP),
				1433	asynchronous dispatch controller (ADC) and shader processor input controller
				1434	(SPI).
				1435
				1436	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1437	mode driver to initialize and register the AQL queue with CP.
				1438
				1439	To dispatch a kernel the following actions are performed. This can occur in the
				1440	CPU host program, or from an HSA kernel executing on a GPU.
				1441
1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
   obtained. It must be for a kernel that is contained in a code object that
   was loaded by the ROCm runtime on the kernel agent with which the AQL queue
   is associated.
3. Space is allocated for the kernel arguments using the ROCm runtime allocator
   for a memory region with the kernarg property for the kernel agent that will
   execute the kernel. It must be at least 16 byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the HSA Programmer's Language Reference
   [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
   memory in the same way constant memory is accessed. (Note that the HSA
   specification allows an implementation to copy the kernel argument contents
   to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
   API uses 64 bit atomic operations to reserve space in the AQL queue for the
   packet. The packet must be set up, and the final write must use an atomic
   store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules, and
   the layout of the AQL queue and kernel dispatch packet, are defined in the
   *HSA System Architecture Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The ROCm runtime queries on
   the kernel symbol can be used to obtain the code object values which are
   recorded in the :ref:`amdgpu-amdhsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing to execute the machine code that corresponds to the
   kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.
				1484
				1485	.. _amdgpu-amdhsa-memory-spaces:
				1486
				1487	Memory Spaces
				1488	~~~~~~~~~~~~~
				1489
				1490	The memory space properties are:
				1491
.. table:: AMDHSA Memory Spaces
   :name: amdgpu-amdhsa-memory-spaces-table

   ================= =========== ======== ======= ==================
   Memory Space Name HSA Segment Hardware Address NULL Value
                     Name        Name     Size
   ================= =========== ======== ======= ==================
   Private           private     scratch  32      0x00000000
   Local             group       LDS      32      0xFFFFFFFF
   Global            global      global   64      0x0000000000000000
   Constant          constant    *same as 64      0x0000000000000000
                                 global*
   Generic           flat        flat     64      0x0000000000000000
   Region            N/A         GDS      32      *not implemented
                                                  for AMDHSA*
   ================= =========== ======== ======= ==================
				1508
				1509	The global and constant memory spaces both use global virtual addresses, which
				1510	are the same virtual address space used by the CPU. However, some virtual
				1511	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1512	and some by both.
				1513
				1514	Using the constant memory space indicates that the data will not change during
				1515	the execution of the kernel. This allows scalar read instructions to be
				1516	used. The vector and scalar L1 caches are invalidated of volatile data before
				1517	each kernel dispatch execution to allow constant memory to change values between
				1518	kernel dispatches.
				1519
				1520	The local memory space uses the hardware Local Data Store (LDS) which is
				1521	automatically allocated when the hardware creates work-groups of wavefronts, and
				1522	freed when all the wavefronts of a work-group have terminated. The data store
				1523	(DS) instructions can be used to access it.
				1524
				1525	The private memory space uses the hardware scratch memory support. If the kernel
				1526	uses scratch, then the hardware allocates memory that is accessed using
				1527	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1528	address to physical address is:
				1529
				1530	``wavefront-scratch-base +
				1531	(private-address * wavefront-size * 4) +
				1532	(wavefront-lane-id * 4)``
				1533
There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9.
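
The private address mapping given above can be written out as a small
function. This is only an illustrative sketch that applies the formula
literally; the function and parameter names are hypothetical, and the units of
``private_address`` are taken to be whatever the formula above intends.

```python
DWORD_BYTES = 4  # scratch is interleaved per lane in dword (4 byte) units


def scratch_physical_address(wavefront_scratch_base, private_address,
                             wavefront_size, wavefront_lane_id):
    """Apply the documented mapping from private address to physical address."""
    if not 0 <= wavefront_lane_id < wavefront_size:
        raise ValueError("lane id must be in [0, wavefront_size)")
    return (wavefront_scratch_base
            + private_address * wavefront_size * DWORD_BYTES
            + wavefront_lane_id * DWORD_BYTES)
```

For a fixed ``private_address``, consecutive lane ids produce addresses that
are 4 bytes apart, which is the adjacency property described above.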
				1542
The generic address space uses the hardware flat address support available in
GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.
				1547
FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
(see :ref:`amdgpu-amdhsa-m0`).
				1554
To convert between a segment address and a flat address the base address of the
apertures can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
GFX9 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.
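
Because the apertures are 2^32 bytes and 2^32 aligned in 64 bit address mode,
the conversion reduces to combining or splitting the upper and lower 32 bits.
The following sketch illustrates that arithmetic only. The function names are
hypothetical, the aperture base would come from the sources described above,
and the handling of NULL values (which differ per memory space) is deliberately
left out.

```python
APERTURE_SIZE = 1 << 32  # aperture size in 64 bit address mode


def segment_to_flat(aperture_base, segment_address):
    """Place a 32 bit segment address inside the aperture starting at base."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    # The low 32 bits of an aligned base are zero, so OR acts like addition.
    return aperture_base | segment_address


def flat_to_segment(aperture_base, flat_address):
    """Recover the 32 bit segment address from a flat address in the aperture."""
    if (flat_address >> 32) << 32 != aperture_base:
        raise ValueError("flat address is not inside this aperture")
    return flat_address & (APERTURE_SIZE - 1)
```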
				1563
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1564	Image and Samplers
				1565	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1566
Image and sampler handles created by the ROCm runtime are 64 bit addresses of a
hardware 32 byte V# and 48 byte S# object respectively. In order to support the
HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
enumeration values for the queries that are not trivially deducible from the S#
representation.
				1572
				1573	HSA Signals
				1574	~~~~~~~~~~~
				1575
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1576	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1577	structure allocated in memory accessible from both the CPU and GPU. The
				1578	structure is defined by the ROCm runtime and subject to change between releases
				1579	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1580
				1581	.. _amdgpu-amdhsa-hsa-aql-queue:
				1582
				1583	HSA AQL Queue
				1584	~~~~~~~~~~~~~
				1585
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1586	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1587	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1588	fields needed to implement certain language features such as the flat address
				1589	aperture bases. It also contains fields used by CP such as managing the
				1590	allocation of scratch memory.
				1591
				1592	.. _amdgpu-amdhsa-kernel-descriptor:
				1593
				1594	Kernel Descriptor
				1595	~~~~~~~~~~~~~~~~~
				1596
				1597	A kernel descriptor consists of the information needed by CP to initiate the
				1598	execution of a kernel, including the entry point address of the machine code
				1599	that implements the kernel.
				1600
				1601	Kernel Descriptor for GFX6-GFX9
				1602	+++++++++++++++++++++++++++++++
				1603
CP microcode requires the kernel descriptor to be allocated on 64 byte
alignment.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1606
				1607	.. table:: Kernel Descriptor for GFX6-GFX9
				1608	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1609
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1610	======= ======= =============================== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1611	Bits Size Field Name Description
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1612	======= ======= =============================== ============================
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1613	31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1614	address space memory
				1615	required for a work-group
				1616	in bytes. This does not
				1617	include any dynamically
				1618	allocated local address
				1619	space memory that may be
				1620	added when the kernel is
				1621	dispatched.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1622	63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1623	private address space
				1624	memory required for a
				1625	work-item in bytes. If
				1626	is_dynamic_callstack is 1
				1627	then additional space must
				1628	be added to this value for
				1629	the call stack.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1630	127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1631	191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1632	negative) from base
				1633	address of kernel
				1634	descriptor to kernel's
				1635	entry point instruction
				1636	which must be 256 byte
				1637	aligned.
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1638	383:192 24 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1639	bytes
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1640	415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1641	program settings used by
				1642	CP to set up
				1643	``COMPUTE_PGM_RSRC1``
				1644	configuration
				1645	register. See
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1646	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1647	447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1648	program settings used by
				1649	CP to set up
				1650	``COMPUTE_PGM_RSRC2``
				1651	configuration
				1652	register. See
				1653	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1654	448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
				1655	_BUFFER SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1656	(see
				1657	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1658
				1659	The total number of SGPR
				1660	user data registers
requested must not exceed
16 and must match the value in
``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1664	Any requests beyond 16
				1665	will be ignored.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1666	449 1 bit ENABLE_SGPR_DISPATCH_PTR see above
				1667	450 1 bit ENABLE_SGPR_QUEUE_PTR see above
				1668	451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR see above
				1669	452 1 bit ENABLE_SGPR_DISPATCH_ID see above
				1670	453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT see above
				1671	454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT see above
				1672	_SIZE
Konstantin Zhuravlyov	766c77e	2018-06-21 18:36:04 +0000	[diff] [blame]	1673	455 1 bit Reserved, must be 0.
511:456 7 bytes Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1675	512 Total size 64 bytes.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1676	======= ====================================================================
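
The field offsets in the table above can be cross-checked by packing a
descriptor with Python's ``struct`` module. This is a sketch under stated
assumptions, not the layout code used by LLVM or the ROCm runtime: it assumes
little-endian byte order and that bit *n* of the descriptor is bit ``n % 8`` of
byte ``n // 8``. The names ``KERNEL_DESCRIPTOR``, ``ENABLE_SGPR_BITS`` and
``pack_kernel_descriptor`` are hypothetical.

```python
import struct

# Little-endian, no padding: 4 + 4 + 8 + 8 + 24 + 4 + 4 + 1 + 7 = 64 bytes.
# The final "B" holds bits 455:448 and "7s" holds the reserved bits 511:456.
KERNEL_DESCRIPTOR = struct.Struct("<IIQq24sIIB7s")

# Bit position within the flags byte (descriptor bit number minus 448).
ENABLE_SGPR_BITS = {
    "PRIVATE_SEGMENT_BUFFER": 0,
    "DISPATCH_PTR": 1,
    "QUEUE_PTR": 2,
    "KERNARG_SEGMENT_PTR": 3,
    "DISPATCH_ID": 4,
    "FLAT_SCRATCH_INIT": 5,
    "PRIVATE_SEGMENT_SIZE": 6,
}


def pack_kernel_descriptor(group_segment_fixed_size,
                           private_segment_fixed_size,
                           kernel_code_entry_byte_offset,
                           compute_pgm_rsrc1, compute_pgm_rsrc2,
                           enabled_sgprs=()):
    """Hypothetical helper: build the 64 byte descriptor image."""
    flags = 0
    for name in enabled_sgprs:
        flags |= 1 << ENABLE_SGPR_BITS[name]
    return KERNEL_DESCRIPTOR.pack(
        group_segment_fixed_size,
        private_segment_fixed_size,
        0,                              # bits 127:64, reserved
        kernel_code_entry_byte_offset,  # signed, may be negative
        bytes(24),                      # bits 383:192, reserved
        compute_pgm_rsrc1,
        compute_pgm_rsrc2,
        flags,                          # bits 455:448 (bit 455 reserved)
        bytes(7),                       # bits 511:456, reserved
    )
```

The packed size equals the 64 byte total in the table. The sketch does not
enforce the 64 byte alignment of the descriptor or the 256 byte alignment of
the entry point.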
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1677
				1678	..
				1679
				1680	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1681	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1682
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1683	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1684	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1685	======= ======= =============================== ===========================================================================
Scott Linder	1e8c2c7	2018-06-21 19:38:56 +0000	[diff] [blame]	1686	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector register
				1687	blocks used by each work-item;
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1688	granularity is device
				1689	specific:
				1690
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1691	GFX6-GFX9
Scott Linder	1e8c2c7	2018-06-21 19:38:56 +0000	[diff] [blame]	1692	- vgprs_used 0..256
				1693	- max(0, ceil(vgprs_used / 4) - 1)
				1694
				1695	Where vgprs_used is defined
				1696	as the highest VGPR number
				1697	explicitly referenced plus
				1698	one.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1699
				1700	Used by CP to set up
				1701	``COMPUTE_PGM_RSRC1.VGPRS``.
Scott Linder	1e8c2c7	2018-06-21 19:38:56 +0000	[diff] [blame]	1702
				1703	The
				1704	:ref:`amdgpu-assembler`
				1705	calculates this
				1706	automatically for the
				1707	selected processor from
				1708	values provided to the
				1709	`.amdhsa_kernel` directive
				1710	by the
				1711	`.amdhsa_next_free_vgpr`
				1712	nested directive (see
				1713	:ref:`amdhsa-kernel-directives-table`).
				1714	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar register
				1715	blocks used by a wavefront;
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1716	granularity is device
				1717	specific:
				1718
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1719	GFX6-GFX8
Scott Linder	1e8c2c7	2018-06-21 19:38:56 +0000	[diff] [blame]	1720	- sgprs_used 0..112
				1721	- max(0, ceil(sgprs_used / 8) - 1)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1722	GFX9
Scott Linder	1e8c2c7	2018-06-21 19:38:56 +0000	[diff] [blame]	1723	- sgprs_used 0..112
				1724	- 2 * max(0, ceil(sgprs_used / 16) - 1)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1725
Scott Linder	1e8c2c7	2018-06-21 19:38:56 +0000	[diff] [blame]	1726	Where sgprs_used is
				1727	defined as the highest
				1728	SGPR number explicitly
				1729	referenced plus one, plus
				1730	a target-specific number
				1731	of additional special
				1732	SGPRs for VCC,
				1733	FLAT_SCRATCH (GFX7+) and
				1734	XNACK_MASK (GFX8+), and
				1735	any additional
				1736	target-specific
				1737	limitations. It does not
				1738	include the 16 SGPRs added
				1739	if a trap handler is
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1740	enabled.
				1741
Scott Linder	1e8c2c7	2018-06-21 19:38:56 +0000	[diff] [blame]	1742	The target-specific
				1743	limitations and special
				1744	SGPR layout are defined in
				1745	the hardware
				1746	documentation, which can
				1747	be found in the
				1748	:ref:`amdgpu-processors`
				1749	table.
				1750
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1751	Used by CP to set up
				1752	``COMPUTE_PGM_RSRC1.SGPRS``.
Scott Linder	1e8c2c7	2018-06-21 19:38:56 +0000	[diff] [blame]	1753
				1754	The
				1755	:ref:`amdgpu-assembler`
				1756	calculates this
				1757	automatically for the
				1758	selected processor from
				1759	values provided to the
				1760	`.amdhsa_kernel` directive
				1761	by the
				1762	`.amdhsa_next_free_sgpr`
				1763	and `.amdhsa_reserve_*`
				1764	nested directives (see
				1765	:ref:`amdhsa-kernel-directives-table`).
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1766	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1767
				1768	Start executing wavefront
				1769	at the specified priority.
				1770
				1771	CP is responsible for
				1772	filling in
				1773	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1774	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1775	with specified rounding
mode for single precision
(32 bit) floating point
operations.
				1780
				1781	Floating point rounding
				1782	mode values are defined in
				1783	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1784
				1785	Used by CP to set up
				1786	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1787	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1788	with specified rounding
mode for half/double precision
(16 and 64 bit) floating point
operations.
				1793
				1794	Floating point rounding
				1795	mode values are defined in
				1796	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1797
				1798	Used by CP to set up
				1799	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1800	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1801	with specified denorm mode
for single precision
(32 bit) floating point
operations.
				1806
				1807	Floating point denorm mode
				1808	values are defined in
				1809	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1810
				1811	Used by CP to set up
				1812	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1813	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1814	with specified denorm mode
for half/double precision
(16 and 64 bit) floating point
operations.
				1819
				1820	Floating point denorm mode
				1821	values are defined in
				1822	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1823
				1824	Used by CP to set up
				1825	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1826	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1827
				1828	Start executing wavefront
				1829	in privilege trap handler
				1830	mode.
				1831
				1832	CP is responsible for
				1833	filling in
				1834	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1835	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1836	with DX10 clamp mode
				1837	enabled. Used by the vector
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1838	ALU to force DX10 style
treatment of NaNs (when
				1840	set, clamp NaN to zero,
				1841	otherwise pass NaN
				1842	through).
				1843
				1844	Used by CP to set up
				1845	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1846	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1847
				1848	Start executing wavefront
				1849	in single step mode.
				1850
				1851	CP is responsible for
				1852	filling in
				1853	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1854	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1855	with IEEE mode
				1856	enabled. Floating point
				1857	opcodes that support
				1858	exception flag gathering
				1859	will quiet and propagate
				1860	signaling-NaN inputs per
				1861	IEEE 754-2008. Min_dx10 and
				1862	max_dx10 become IEEE
				1863	754-2008 compliant due to
				1864	signaling-NaN propagation
				1865	and quieting.
				1866
				1867	Used by CP to set up
				1868	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1869	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1870
				1871	Only one work-group allowed
				1872	to execute on a compute
				1873	unit.
				1874
				1875	CP is responsible for
				1876	filling in
				1877	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1878	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1879
				1880	Flag that can be used to
				1881	control debugging code.
				1882
				1883	CP is responsible for
				1884	filling in
				1885	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1886	26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1887	Reserved, must be 0.
				1888	GFX9
				1889	Wavefront starts execution
				1890	with specified fp16 overflow
				1891	mode.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1892
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1893	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1894	+/-INF values.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1895	- If 1, fp16 overflow that is the
result of a +/-INF input value
				1897	or divide by 0 produces a +/-INF,
				1898	otherwise clamps computed
				1899	overflow to +/-MAX_FP16 as
				1900	appropriate.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1901
				1902	Used by CP to set up
				1903	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1904	31:27 5 bits Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1905	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1906	======= ===================================================================================================================
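
The granulated register count formulas in the table above can be written out
directly. The following sketch is illustrative only: the function names are
hypothetical, ``sgprs_used`` is assumed to already include the special SGPRs
described above, and in practice the assembler calculates these values.

```python
import math


def granulated_vgpr_count(vgprs_used):
    """GFX6-GFX9: max(0, ceil(vgprs_used / 4) - 1)."""
    return max(0, math.ceil(vgprs_used / 4) - 1)


def granulated_sgpr_count(sgprs_used, gfx9=False):
    """GFX6-GFX8 use blocks of 8; GFX9 uses blocks of 16, scaled by 2."""
    if gfx9:
        return 2 * max(0, math.ceil(sgprs_used / 16) - 1)
    return max(0, math.ceil(sgprs_used / 8) - 1)


def pack_rsrc1_counts(vgprs_used, sgprs_used, gfx9=False):
    """Place both counts in bits 5:0 and 9:6; all other bits are left as 0."""
    vgpr = granulated_vgpr_count(vgprs_used)
    sgpr = granulated_sgpr_count(sgprs_used, gfx9)
    if vgpr > 0x3F or sgpr > 0xF:
        raise ValueError("register count does not fit in its field")
    return vgpr | (sgpr << 6)
```

For example, 24 VGPRs and 40 SGPRs on GFX6-GFX8 give granulated counts of 5
and 4 respectively.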
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1907
				1908	..
				1909
				1910	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1911	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1912
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1913	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1914	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1915	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1916	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1917	_WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1918	system register (see
				1919	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1920
				1921	Used by CP to set up
				1922	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1923	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1924	user data registers
				1925	requested. This number must
				1926	match the number of user
				1927	data registers enabled.
				1928
				1929	Used by CP to set up
				1930	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	2ca6b1f	2018-05-29 19:09:13 +0000	[diff] [blame]	1931	6 1 bit ENABLE_TRAP_HANDLER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1932
Konstantin Zhuravlyov	2ca6b1f	2018-05-29 19:09:13 +0000	[diff] [blame]	1933	This bit represents
				1934	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
				1935	which is set by the CP if
				1936	the runtime has installed a
				1937	trap handler.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1938	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1939	system SGPR register for
				1940	the work-group id in the X
				1941	dimension (see
				1942	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1943
				1944	Used by CP to set up
				1945	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1946	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1947	system SGPR register for
				1948	the work-group id in the Y
				1949	dimension (see
				1950	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1951
				1952	Used by CP to set up
				1953	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1954	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1955	system SGPR register for
				1956	the work-group id in the Z
				1957	dimension (see
				1958	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1959
				1960	Used by CP to set up
				1961	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1962	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1963	system SGPR register for
				1964	work-group information (see
				1965	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1966
				1967	Used by CP to set up
				1968	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1969	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1970	VGPR system registers used
				1971	for the work-item ID.
				1972	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1973	defines the values.
				1974
				1975	Used by CP to set up
				1976	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1977	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1978
				1979	Wavefront starts execution
				1980	with address watch
				1981	exceptions enabled which
				1982	are generated when L1 has
				1983	witnessed a thread access
				1984	an *address of
				1985	interest*.
				1986
				1987	CP is responsible for
				1988	filling in the address
				1989	watch bit in
				1990	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1991	according to what the
				1992	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1993	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1994
				1995	Wavefront starts execution
				1996	with memory violation
				1997	exceptions
				1998	enabled which are generated
				1999	when a memory violation has
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2000	occurred for this wavefront from
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2001	L1 or LDS
				2002	(write-to-read-only-memory,
				2003	mis-aligned atomic, LDS
				2004	address out of range,
				2005	illegal address, etc.).
				2006
				2007	CP sets the memory
				2008	violation bit in
				2009	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				2010	according to what the
				2011	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2012	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2013
				2014	CP uses the rounded value
				2015	from the dispatch packet,
				2016	not this value, as the
				2017	dispatch may contain
				2018	dynamically allocated group
				2019	segment memory. CP writes
				2020	directly to
				2021	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				2022
				2023	Amount of group segment
				2024	(LDS) to allocate for each
				2025	work-group. Granularity is
				2026	device specific:
				2027
				2028	GFX6:
				2029	roundup(lds-size / (64 * 4))
				2030	GFX7-GFX9:
				2031	roundup(lds-size / (128 * 4))
				2032
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2033	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				2034	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2035	enabled.
				2036
				2037	Used by CP to set up
				2038	``COMPUTE_PGM_RSRC2.EXCP_EN``
				2039	(set from bits 0..6).
				2040
				2041	IEEE 754 FP Invalid
				2042	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2043	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				2044	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2045	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2046	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				2047	_DIVISION_BY_ZERO Zero
				2048	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Overflow
				2049	_OVERFLOW
				2050	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				2051	_UNDERFLOW
				2052	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				2053	_INEXACT
				2054	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				2055	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2056	only)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2057	31 1 bit Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2058	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	2059	======= ===================================================================================================================
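
The bit layout of ``compute_pgm_rsrc2`` can be written down as a field map. The sketch below uses the bit positions and widths from the table above; the helper names are hypothetical and only illustrate how a 32 bit word would be packed and unpacked, and how the device specific LDS granularity formula rounds a byte size.

```python
# (low bit, width) of each compute_pgm_rsrc2 field, as listed in the table.
COMPUTE_PGM_RSRC2_FIELDS = {
    "ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET": (0, 1),
    "USER_SGPR_COUNT": (1, 5),
    "ENABLE_TRAP_HANDLER": (6, 1),
    "ENABLE_SGPR_WORKGROUP_ID_X": (7, 1),
    "ENABLE_SGPR_WORKGROUP_ID_Y": (8, 1),
    "ENABLE_SGPR_WORKGROUP_ID_Z": (9, 1),
    "ENABLE_SGPR_WORKGROUP_INFO": (10, 1),
    "ENABLE_VGPR_WORKITEM_ID": (11, 2),
    "ENABLE_EXCEPTION_ADDRESS_WATCH": (13, 1),
    "ENABLE_EXCEPTION_MEMORY": (14, 1),
    "GRANULATED_LDS_SIZE": (15, 9),
    "ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION": (24, 1),
    "ENABLE_EXCEPTION_FP_DENORMAL_SOURCE": (25, 1),
    "ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO": (26, 1),
    "ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW": (27, 1),
    "ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW": (28, 1),
    "ENABLE_EXCEPTION_IEEE_754_FP_INEXACT": (29, 1),
    "ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO": (30, 1),
}


def encode_compute_pgm_rsrc2(**fields):
    """Pack named field values into a 32 bit word (hypothetical helper)."""
    word = 0
    for name, value in fields.items():
        low, width = COMPUTE_PGM_RSRC2_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bit(s)")
        word |= value << low
    return word


def decode_compute_pgm_rsrc2(word):
    """Unpack a 32 bit word into a dict of field values (hypothetical helper)."""
    return {
        name: (word >> low) & ((1 << width) - 1)
        for name, (low, width) in COMPUTE_PGM_RSRC2_FIELDS.items()
    }


def granulated_lds_size(lds_size_bytes, gfx_level):
    """roundup(lds-size / granule): granule is 64 * 4 bytes on GFX6 and
    128 * 4 bytes on GFX7-GFX9."""
    granule = 64 * 4 if gfx_level == 6 else 128 * 4
    return -(-lds_size_bytes // granule)  # ceiling division
```

For example, a kernel requesting 6 user SGPRs, the work-group id in X, and work-item ids in X, Y and Z would be described by ``encode_compute_pgm_rsrc2(USER_SGPR_COUNT=6, ENABLE_SGPR_WORKGROUP_ID_X=1, ENABLE_VGPR_WORKITEM_ID=2)``.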
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2060
				2061	..
				2062
				2063	.. table:: Floating Point Rounding Mode Enumeration Values
				2064	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				2065
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2066	====================================== ===== ==============================
				2067	Enumeration Name Value Description
				2068	====================================== ===== ==============================
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2069	FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				2070	FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				2071	FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				2072	FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2073	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2074
				2075	..
				2076
				2077	.. table:: Floating Point Denorm Mode Enumeration Values
				2078	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				2079
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2080	====================================== ===== ==============================
				2081	Enumeration Name Value Description
				2082	====================================== ===== ==============================
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2083	FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2084	Denorms
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2085	FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				2086	FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				2087	FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2088	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2089
				2090	..
				2091
				2092	.. table:: System VGPR Work-Item ID Enumeration Values
				2093	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				2094
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2095	======================================== ===== ============================
				2096	Enumeration Name Value Description
				2097	======================================== ===== ============================
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2098	SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2099	ID.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2100	SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2101	dimensions ID.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2102	SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2103	dimensions ID.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2104	SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2105	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2106
				2107	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				2108
				2109	Initial Kernel Execution State
				2110	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				2111
				2112	This section defines the register state that will be set up by the packet
				2113	processor prior to the start of execution of every wavefront. This is limited by
				2114	the constraints of the hardware controllers of CP/ADC/SPI.
				2115
				2116	The order of the SGPR registers is defined, but the compiler can specify which
				2117	ones are actually set up in the kernel descriptor using the ``enable_sgpr_*`` bit
				2118	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2119	for enabled registers are dense starting at SGPR0: the first enabled register is
				2120	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				2121	an SGPR number.
				2122
				2123	The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2124	all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2125	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				2126	initialized. These are then immediately followed by the System SGPRs that are
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2127	set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2128	dispatch.
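
The dense numbering rule and the limit of 16 User SGPRs can be sketched as follows. The register order and sizes are taken from the SGPR Register Set Up Order table in this section; the function and constant names are hypothetical, and the enable field names are spelled as the table's wrapped names would be joined. It is a model of the rule as described here, not code from the backend.

```python
# (kernel descriptor enable field, number of SGPRs), in set up order.
USER_SGPRS = [
    ("enable_sgpr_private_segment_buffer", 4),
    ("enable_sgpr_dispatch_ptr", 2),
    ("enable_sgpr_queue_ptr", 2),
    ("enable_sgpr_kernarg_segment_ptr", 2),
    ("enable_sgpr_dispatch_id", 2),
    ("enable_sgpr_flat_scratch_init", 2),
    ("enable_sgpr_private_segment_size", 1),
    ("enable_sgpr_grid_workgroup_count_X", 1),
    ("enable_sgpr_grid_workgroup_count_Y", 1),
    ("enable_sgpr_grid_workgroup_count_Z", 1),
]
SYSTEM_SGPRS = [
    ("enable_sgpr_workgroup_id_X", 1),
    ("enable_sgpr_workgroup_id_Y", 1),
    ("enable_sgpr_workgroup_id_Z", 1),
    ("enable_sgpr_workgroup_info", 1),
    ("enable_sgpr_private_segment_wavefront_offset", 1),
]
MAX_USER_SGPRS = 16


def assign_sgprs(enabled):
    """Map each enabled and initialized register to its list of SGPR numbers.

    `enabled` is a set of enable field names. User registers that would start
    at SGPR16 or later are not initialized and get no number; the System
    SGPRs follow immediately after the initialized User SGPRs.
    """
    layout = {}
    next_sgpr = 0
    for name, count in USER_SGPRS:
        if name in enabled and next_sgpr < MAX_USER_SGPRS:
            layout[name] = list(range(next_sgpr, next_sgpr + count))
            next_sgpr += count
    for name, count in SYSTEM_SGPRS:
        if name in enabled:
            layout[name] = list(range(next_sgpr, next_sgpr + count))
            next_sgpr += count
    return layout
```

Enabling only the private segment buffer, the kernarg segment pointer and the work-group id in X gives SGPR0-3, SGPR4-5 and SGPR6 respectively, since disabled registers consume no SGPR number.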
				2129
				2130	SGPR register initial state is defined in
				2131	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				2132
				2133	.. table:: SGPR Register Set Up Order
				2134	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				2135
				2136	========== ========================== ====== ==============================
				2137	SGPR Order Name Number Description
				2138	(kernel descriptor enable of
				2139	field) SGPRs
				2140	========== ========================== ====== ==============================
				2141	First Private Segment Buffer 4 V# that can be used, together
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2142	(enable_sgpr_private with Scratch Wavefront Offset
				2143	_segment_buffer) as an offset, to access the
				2144	private memory space using a
				2145	segment address.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2146
				2147	CP uses the value provided by
				2148	the runtime.
				2149	then Dispatch Ptr 2 64 bit address of AQL dispatch
				2150	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				2151	actually executing.
				2152	then Queue Ptr 2 64 bit address of amd_queue_t
				2153	(enable_sgpr_queue_ptr) object for AQL queue on which
				2154	the dispatch packet was
				2155	queued.
				2156	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				2157	(enable_sgpr_kernarg segment. This is directly
				2158	_segment_ptr) copied from the
				2159	kernarg_address in the kernel
				2160	dispatch packet.
				2161
				2162	Having CP load it once avoids
				2163	loading it at the beginning of
				2164	every wavefront.
				2165	then Dispatch Id 2 64 bit Dispatch ID of the
				2166	(enable_sgpr_dispatch_id) dispatch packet being
				2167	executed.
				2168	then Flat Scratch Init 2 This is 2 SGPRs:
				2169	(enable_sgpr_flat_scratch
				2170	_init) GFX6
				2171	Not supported.
				2172	GFX7-GFX8
				2173	The first SGPR is a 32 bit
				2174	byte offset from
				2175	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2176	to per SPI base of memory
				2177	for scratch for the queue
				2178	executing the kernel
				2179	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2180	from the runtime. (The
				2181	Scratch Segment Buffer base
				2182	address is
				2183	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2184	plus this offset.) The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2185	of Scratch Wavefront Offset must
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2186	be added to this offset by
				2187	the kernel machine code,
				2188	right shifted by 8, and
				2189	moved to the FLAT_SCRATCH_HI
				2190	SGPR register.
				2191	FLAT_SCRATCH_HI corresponds
				2192	to SGPRn-4 on GFX7, and
				2193	SGPRn-6 on GFX8 (where SGPRn
				2194	is the highest numbered SGPR
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2195	allocated to the wavefront).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2196	FLAT_SCRATCH_HI is
				2197	multiplied by 256 (as it is
				2198	in units of 256 bytes) and
				2199	added to
				2200	``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2201	to calculate the per wavefront
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2202	FLAT SCRATCH BASE in flat
				2203	memory instructions that
				2204	access the scratch
				2205	aperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2206
				2207	The second SGPR is 32 bit
				2208	byte size of a single
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2209	work-item's scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2210	usage. CP obtains this from
				2211	the runtime, and it is
				2212	always a multiple of DWORD.
				2213	CP checks that the value in
				2214	the kernel dispatch packet
				2215	Private Segment Byte Size is
				2216	not larger, and requests the
				2217	runtime to increase the
				2218	queue's scratch size if
				2219	necessary. The kernel code
				2220	must move it to
				2221	FLAT_SCRATCH_LO which is
				2222	SGPRn-3 on GFX7 and SGPRn-5
				2223	on GFX8. FLAT_SCRATCH_LO is
				2224	used as the FLAT SCRATCH
				2225	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2226	instructions. Having CP load
				2227	it once avoids loading it at
				2228	the beginning of every
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2229	wavefront.
				2230	GFX9
				2231	This is the
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2232	64 bit base address of the
				2233	per SPI scratch backing
				2234	memory managed by SPI for
				2235	the queue executing the
				2236	kernel dispatch. CP obtains
				2237	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2238	divides it if there are
				2239	multiple Shader Arrays each
				2240	with its own SPI). The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2241	of Scratch Wavefront Offset must
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2242	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2243	machine code and the result
				2244	moved to the FLAT_SCRATCH
				2245	SGPR which is SGPRn-6 and
				2246	SGPRn-5. It is used as the
				2247	FLAT SCRATCH BASE in flat
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2248	memory instructions.
				2249	then Private Segment Size 1 The 32 bit byte size of a
				2250	(enable_sgpr_private single
				2251	_segment_size) work-item's
				2252	scratch memory
				2253	allocation. This is the
				2254	value from the kernel
				2255	dispatch packet Private
				2256	Segment Byte Size rounded up
				2257	by CP to a multiple of
				2258	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2259
				2260	Having CP load it once avoids
				2261	loading it at the beginning of
				2262	every wavefront.
				2263
				2264	This is not used for
				2265	GFX7-GFX8 since it is the same
				2266	value as the second SGPR of
				2267	Flat Scratch Init. However, it
				2268	may be needed for GFX9 which
				2269	changes the meaning of the
				2270	Flat Scratch Init value.
				2271	then Grid Work-Group Count X 1 32 bit count of the number of
				2272	(enable_sgpr_grid work-groups in the X dimension
				2273	_workgroup_count_X) for the grid being
				2274	executed. Computed from the
				2275	fields in the kernel dispatch
				2276	packet as ((grid_size.x +
				2277	workgroup_size.x - 1) /
				2278	workgroup_size.x).
				2279	then Grid Work-Group Count Y 1 32 bit count of the number of
				2280	(enable_sgpr_grid work-groups in the Y dimension
				2281	_workgroup_count_Y && for the grid being
				2282	less than 16 previous executed. Computed from the
				2283	SGPRs) fields in the kernel dispatch
				2284	packet as ((grid_size.y +
				2285	workgroup_size.y - 1) /
				2286	workgroup_size.y).
				2287
				2288	Only initialized if <16
				2289	previous SGPRs initialized.
				2290	then Grid Work-Group Count Z 1 32 bit count of the number of
				2291	(enable_sgpr_grid work-groups in the Z dimension
				2292	_workgroup_count_Z && for the grid being
				2293	less than 16 previous executed. Computed from the
				2294	SGPRs) fields in the kernel dispatch
				2295	packet as ((grid_size.z +
				2296	workgroup_size.z - 1) /
				2297	workgroup_size.z).
				2298
				2299	Only initialized if <16
				2300	previous SGPRs initialized.
				2301	then Work-Group Id X 1 32 bit work-group id in X
				2302	(enable_sgpr_workgroup_id dimension of grid for
				2303	_X) wavefront.
				2304	then Work-Group Id Y 1 32 bit work-group id in Y
				2305	(enable_sgpr_workgroup_id dimension of grid for
				2306	_Y) wavefront.
				2307	then Work-Group Id Z 1 32 bit work-group id in Z
				2308	(enable_sgpr_workgroup_id dimension of grid for
				2309	_Z) wavefront.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2310	then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2311	(enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2312	_info) threadgroup_size_in_wavefronts[5:0]}
				2313	then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2314	(enable_sgpr_private of scratch base of queue
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2315	_segment_wavefront_offset) executing the kernel
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2316	dispatch. Must be used as an
				2317	offset with Private
				2318	segment address when using
				2319	Scratch Segment Buffer. It
				2320	must be used to set up FLAT
				2321	SCRATCH for flat addressing
				2322	(see
				2323	:ref:`amdgpu-amdhsa-flat-scratch`).
				2324	========== ========================== ====== ==============================
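
Two of the table entries above are small computations and can be sketched directly. The work-group count is the ceiling division given in the table. The Work-Group Info bit positions below assume the braces denote a concatenation with the most significant field first, so that ``first_wavefront`` is bit 31 and ``threadgroup_size_in_wavefronts`` occupies bits 5:0; that reading, and the function names, are assumptions.

```python
def workgroup_count(grid_size, workgroup_size):
    """((grid_size + workgroup_size - 1) / workgroup_size) with integer division."""
    return (grid_size + workgroup_size - 1) // workgroup_size


def decode_workgroup_info(sgpr):
    """Split the 32 bit Work-Group Info SGPR into its fields.

    Assumed layout (most significant field first):
      bit  31     first_wavefront
      bits 30:17  zero
      bits 16:6   ordered_append_term[10:0]
      bits 5:0    threadgroup_size_in_wavefronts[5:0]
    """
    return {
        "first_wavefront": (sgpr >> 31) & 0x1,
        "ordered_append_term": (sgpr >> 6) & 0x7FF,
        "threadgroup_size_in_wavefronts": sgpr & 0x3F,
    }
```

A grid of 1000 work-items in X with a work-group size of 256 therefore needs 4 work-groups in that dimension, the last of which is partial.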
				2325
				2326	The order of the VGPR registers is defined, but the compiler can specify which
				2327	ones are actually set up in the kernel descriptor using the ``enable_vgpr*`` bit
				2328	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2329	for enabled registers are dense starting at VGPR0: the first enabled register is
				2330	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2331	VGPR number.
				2332
				2333	VGPR register initial state is defined in
				2334	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2335
				2336	.. table:: VGPR Register Set Up Order
				2337	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2338
				2339	========== ========================== ====== ==============================
				2340	VGPR Order Name Number Description
				2341	(kernel descriptor enable of
				2342	field) VGPRs
				2343	========== ========================== ====== ==============================
				2344	First Work-Item Id X 1 32 bit work item id in X
				2345	(Always initialized) dimension of work-group for
				2346	wavefront lane.
				2347	then Work-Item Id Y 1 32 bit work item id in Y
				2348	(enable_vgpr_workitem_id dimension of work-group for
				2349	> 0) wavefront lane.
				2350	then Work-Item Id Z 1 32 bit work item id in Z
				2351	(enable_vgpr_workitem_id dimension of work-group for
				2352	> 1) wavefront lane.
				2353	========== ========================== ====== ==============================
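
The relation between the ``enable_vgpr_workitem_id`` value and the VGPRs that are initialized can be sketched as below. The function name is hypothetical; the value 3 is rejected because the enumeration table lists it as undefined.

```python
WORKITEM_ID_VGPRS = ["Work-Item Id X", "Work-Item Id Y", "Work-Item Id Z"]


def initial_vgprs(enable_vgpr_workitem_id):
    """Return {register name: VGPR number} for the given enable value.

    0 sets up X only, 1 sets up X and Y, 2 sets up X, Y and Z.
    """
    if enable_vgpr_workitem_id not in (0, 1, 2):
        raise ValueError("enable_vgpr_workitem_id must be 0, 1 or 2")
    enabled = WORKITEM_ID_VGPRS[: enable_vgpr_workitem_id + 1]
    return {name: number for number, name in enumerate(enabled)}
```

Because X is always initialized and Y and Z can only be added in order, the enabled work-item ids always occupy VGPR0 upwards without gaps.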
				2354
Hiroshi Inoue	bcadfee	2018-04-12 05:53:20 +0000	[diff] [blame]	2355	The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2356
				2357	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2358	registers.
				2359	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2360	combination including none.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2361	3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
				2362	its value cannot included with the flat scratch init value which is per queue.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2363	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2364	or (X, Y, Z).
				2365
				2366	Flat Scratch register pair are adjacent SGPRs so they can be moved as a 64 bit
				2367	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2368
				2369	The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2370	has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2371	instructions (GFX9).
				2372
				2373	If buffer operations are used then the compiler can generate a V# with the
				2374	following properties:
				2375
				2376	* base address of 0
				2377	* no swizzle
				2378	* ATC: 1 if IOMMU present (such as APU)
				2379	* ptr64: 1
				2380	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2381	APU and NC for dGPU).
				2382
				2383	.. _amdgpu-amdhsa-kernel-prolog:
				2384
				2385	Kernel Prolog
				2386	~~~~~~~~~~~~~
				2387
				2388	.. _amdgpu-amdhsa-m0:
				2389
				2390	M0
				2391	++
				2392
				2393	GFX6-GFX8
				2394	The M0 register must be initialized with a value at least the total LDS size
				2395	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2396	available in dispatch packet. For M0, it is also possible to use maximum
				2397	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2398	GFX7-GFX8).
				2399	GFX9
				2400	The M0 register is not used for range checking LDS accesses and so does not
				2401	need to be initialized in the prolog.
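
As a sketch of the rule above, the value a prolog would write to M0 can be chosen like this. The function name and its parameters are hypothetical; the maximum values are the ones quoted in the text.

```python
def m0_prolog_value(gfx_level, total_lds_size=None):
    """Return the value to write to M0 in the prolog, or None if M0 does not
    need to be initialized.

    gfx_level      -- 6, 7, 8 or 9.
    total_lds_size -- total LDS size from the dispatch packet, if known.
    """
    if gfx_level == 9:
        return None  # M0 is not used for LDS range checking on GFX9
    if gfx_level not in (6, 7, 8):
        raise ValueError("only GFX6-GFX9 are described here")
    max_value = 0x7FFF if gfx_level == 6 else 0xFFFF
    # Any value that is at least the total LDS size is acceptable, so fall
    # back to the target's maximum when the size is not known.
    return max_value if total_lds_size is None else total_lds_size
```

Using the maximum value avoids reading the dispatch packet in the prolog, at the cost of a less precise range check.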
				2402
				2403	.. _amdgpu-amdhsa-flat-scratch:
				2404
				2405	Flat Scratch
				2406	++++++++++++
				2407
				2408	If the kernel may use flat operations to access scratch memory, the prolog code
				2409	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2410	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2411	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2412
				2413	GFX6
				2414	Flat scratch is not supported.
				2415
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2416	GFX7-GFX8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2417	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2418	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2419	being managed by SPI for the queue executing the kernel dispatch. This is
				2420	the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2421	prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2422	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2423	FLAT_SCRATCH_HI is in units of 256 bytes, the offset must be right shifted
				2424	by 8 before moving into FLAT_SCRATCH_HI.
				2425	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2426	work-item's scratch memory usage. This is directly loaded from the kernel
				2427	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2428	DWORD. Having CP load it once avoids loading it at the beginning of every
				2429	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2430	SIZE.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2431
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2432	GFX9
				2433	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2434	memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2435	prolog must add the value of Scratch Wavefront Offset and move the result to the FLAT_SCRATCH
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2436	pair for use as the flat scratch base in flat memory instructions.
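
The prolog arithmetic can be sketched with plain integers. The register assignment follows the Flat Scratch Init entry of the SGPR Register Set Up Order table: on GFX7-GFX8 the shifted offset goes to FLAT_SCRATCH_HI and the per work-item size goes to FLAT_SCRATCH_LO. The function names are hypothetical, and the sketch models only the values, not the instructions a real prolog would emit.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def flat_scratch_gfx7_gfx8(init_first_sgpr, init_second_sgpr, scratch_wavefront_offset):
    """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for GFX7-GFX8.

    init_first_sgpr  -- byte offset from SH_HIDDEN_PRIVATE_BASE_VIMID.
    init_second_sgpr -- byte size of a single work-item's scratch memory.
    """
    wave_offset = (init_first_sgpr + scratch_wavefront_offset) & MASK32
    flat_scratch_hi = wave_offset >> 8    # offset in units of 256 bytes
    flat_scratch_lo = init_second_sgpr    # used as FLAT SCRATCH SIZE
    return flat_scratch_lo, flat_scratch_hi


def flat_scratch_gfx9(flat_scratch_init, scratch_wavefront_offset):
    """Return (low 32 bits, high 32 bits) of the FLAT_SCRATCH pair for GFX9.

    flat_scratch_init -- 64 bit base address of the scratch backing memory.
    """
    base = (flat_scratch_init + scratch_wavefront_offset) & MASK64
    return base & MASK32, base >> 32
```

For instance, a GFX7 wavefront with a queue offset of ``0x1000``, a wavefront offset of ``0x300`` and 64 bytes of scratch per work-item ends up with ``FLAT_SCRATCH_HI = 0x13`` and ``FLAT_SCRATCH_LO = 64``.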
				2437
				2438	.. _amdgpu-amdhsa-memory-model:
				2439
				2440	Memory Model
				2441	~~~~~~~~~~~~
				2442
				2443	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2444	(see :ref:`memmodel`). The implementation is WIP.
				2445
				2446	.. TODO
				2447	Update when implementation complete.
				2448
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2449	The AMDGPU backend supports the memory synchronization scopes specified in
				2450	:ref:`amdgpu-memory-scopes`.
				2451
				2452	The code sequences used to implement the memory model are defined in table
				2453	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2454
				2455	The sequences specify the order of instructions that a single thread must
				2456	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2457	to other memory instructions executed by the same thread. This allows them to be
				2458	moved earlier or later which can allow them to be combined with other instances
				2459	of the same instruction, or hoisted/sunk out of loops to improve
				2460	performance. Only the instructions related to the memory model are given;
				2461	additional ``s_waitcnt`` instructions are required to ensure registers are
				2462	defined before being used. These may be able to be combined with the memory
				2463	model ``s_waitcnt`` instructions as described above.
				2464
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2465	The AMDGPU backend supports the following memory models:
				2466
				2467	HSA Memory Model [HSA]_
				2468	The HSA memory model uses a single happens-before relation for all address
				2469	spaces (see :ref:`amdgpu-address-spaces`).
				2470	OpenCL Memory Model [OpenCL]_
				2471	The OpenCL memory model has separate happens-before relations for the
				2472	global and local address spaces. Only a fence specifying both global and
				2473	local address space, and seq_cst instructions join the relationships. Since
				2474	the LLVM ``fence`` instruction does not allow an address space to be
				2475	specified the OpenCL fence has to conservatively assume both local and
				2476	global address space was specified. However, optimizations can often be
				2477	done to eliminate the additional ``s_waitcnt`` instructions when there are
				2478	no intervening memory instructions which access the corresponding address
				2479	space. The code sequences in the table indicate what can be omitted for the
				2480	OpenCL memory model. The target triple environment is used to determine if the
				2481	source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2482
				2483	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2484	operations.
				2485
				2486	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2487	termed vector memory operations.
				2488
				2489	For GFX6-GFX9:
				2490
				2491	* Each agent has multiple compute units (CU).
				2492	* Each CU has multiple SIMDs that execute wavefronts.
				2493	* The wavefronts for a single work-group are executed in the same CU but may be
				2494	executed by different SIMDs.
				2495	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2496	executing on it.
				2497	* All LDS operations of a CU are performed as wavefront wide operations in a
				2498	global order and involve no caching. Completion is reported to a wavefront in
				2499	execution order.
				2500	* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2501	CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2502	can be reordered relative to each other, which can result in reordering the
				2503	visibility of vector memory operations with respect to LDS operations of other
				2504	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2505	ensure synchronization between LDS operations and vector memory operations
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2506	between wavefronts of a work-group, but not between operations performed by the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2507	same wavefront.
				2508	* The vector memory operations are performed as wavefront wide operations and
				2509	completion is reported to a wavefront in execution order. The exception is
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2510	that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2511	vector memory order if they access LDS memory, and out of LDS operation order
				2512	if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between the
  lanes of a single wavefront, or for coherence between wavefronts in the same
  work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
  executing in different work-groups as they may be executing on different CUs.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2518	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2519	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2520	scalar operations are used in a restricted way so do not impact the memory
				2521	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2522	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2523	the same agent.
				2524	* The L2 cache has independent channels to service disjoint ranges of virtual
				2525	addresses.
				2526	* Each CU has a separate request queue per channel. Therefore, the vector and
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2527	scalar memory operations performed by wavefronts executing in different work-groups
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2528	(which may be executing on different CUs) of an agent can be reordered
				2529	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2530	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2531	previous vector memory operation has completed before executing a subsequent
				2532	vector memory or LDS operation and so can be used to meet the requirements of
				2533	acquire and release.
				2534	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2535	of virtual addresses can be set up to bypass it to ensure system coherence.
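
As a rough illustration of the counter rules in the list above, the following
Python sketch maps an instruction class to the ``s_waitcnt`` counters that
would need to reach zero before the operation is known to be complete. The
function names and the string labels are hypothetical and are not part of the
backend. The sketch only restates that LDS operations are covered by
``lgkmcnt``, vector memory operations by ``vmcnt``, and that a ``flat`` access
may resolve to either kind of memory.

```python
def counters_to_wait_on(instr_class):
    """Return the s_waitcnt counters covering a GFX6-GFX9 memory operation.

    instr_class is a hypothetical label: "ds", "buffer", "global" or "flat".
    """
    if instr_class == "ds":
        # LDS operations are covered by lgkmcnt.
        return ("lgkmcnt",)
    if instr_class in ("buffer", "global"):
        # Vector memory operations are covered by vmcnt.
        return ("vmcnt",)
    if instr_class == "flat":
        # A flat access may reach LDS or global memory, so wait on both.
        return ("vmcnt", "lgkmcnt")
    raise ValueError("unknown instruction class: %r" % (instr_class,))


def waitcnt_text(instr_class):
    """Format the wait in the notation used by the code sequence table."""
    counters = counters_to_wait_on(instr_class)
    return "s_waitcnt " + " & ".join(c + "(0)" for c in counters)
```

For example, ``waitcnt_text("flat")`` yields the combined
``s_waitcnt vmcnt(0) & lgkmcnt(0)`` form that appears in the generic address
space rows of the table.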
				2536
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2537	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2538	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2539	memory, atomic memory orderings are not meaningful and all accesses are treated
				2540	as non-atomic.
				2541
Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch, it is not legal to perform
stores. Atomic memory orderings are therefore not meaningful and all accesses
are treated as non-atomic.
				2547
				2548	A memory synchronization scope wider than work-group is not meaningful for the
				2549	group (LDS) address space and is treated as work-group.
				2550
				2551	The memory model does not support the region address space which is treated as
				2552	non-atomic.
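
The address space rules above can be summarized by a small Python sketch. It
is only an illustration under stated assumptions: the function name, the
string labels for address spaces and scopes, and the use of ``"none"`` to mean
non-atomic are hypothetical choices made here, not names used by the backend.

```python
# Address spaces whose accesses are always treated as non-atomic.
NON_ATOMIC_SPACES = {"private", "constant", "region"}

# Sync scopes ordered from narrowest to widest.
SCOPES = ["singlethread", "wavefront", "workgroup", "agent", "system"]


def normalize_for_address_space(addr_space, ordering, scope):
    """Return the (ordering, scope) pair the rules above would apply."""
    if addr_space in NON_ATOMIC_SPACES:
        # Private, constant and region accesses are treated as non-atomic.
        return ("none", "none")
    if addr_space == "local" and scope in SCOPES:
        if SCOPES.index(scope) > SCOPES.index("workgroup"):
            # A scope wider than work-group is treated as work-group for LDS.
            scope = "workgroup"
    return (ordering, scope)
```

With these assumptions, an ``acquire`` access to the local address space at
``agent`` scope is handled as ``acquire`` at ``workgroup`` scope, while any
access to the private address space is handled as non-atomic.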
				2553
				2554	Acquire memory ordering is not meaningful on store atomic instructions and is
				2555	treated as non-atomic.
				2556
Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.
				2559
				2560	Acquire-release memory ordering is not meaningful on load or store atomic
				2561	instructions and is treated as acquire and release respectively.
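
The three ordering rules above amount to a simple mapping, sketched below in
Python. The function name and the labels are hypothetical, and ``"none"`` is
used here to stand for non-atomic, following the ordering column of the
code sequence table.

```python
def effective_ordering(instr, ordering):
    """Map a requested memory ordering to the one the rules above would apply.

    instr is "load", "store" or "atomicrmw"; "none" stands for non-atomic.
    """
    if instr == "store" and ordering == "acquire":
        # Acquire is not meaningful on a store atomic.
        return "none"
    if instr == "load" and ordering == "release":
        # Release is not meaningful on a load atomic.
        return "none"
    if ordering == "acq_rel":
        if instr == "load":
            return "acquire"
        if instr == "store":
            return "release"
    # Every other combination is left unchanged.
    return ordering
```

An ``atomicrmw`` both reads and writes, so its ``acq_rel`` ordering is left
unchanged by this sketch.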
				2562
The AMDGPU backend only uses scalar memory operations to access memory that is
				2564	proven to not change during the execution of the kernel dispatch. This includes
				2565	constant address space and global address space for program scope const
				2566	variables. Therefore the kernel machine code does not have to maintain the
				2567	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2568	and vector L1 caches are invalidated between kernel dispatches by CP since
				2569	constant address space data may change between kernel dispatch executions. See
				2570	:ref:`amdgpu-amdhsa-memory-spaces`.
				2571
The one exception is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2573	case the AMDGPU backend ensures the memory location used to spill is never
				2574	accessed by vector memory operations at the same time. If scalar writes are used
				2575	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2576	return since the locations may be used for vector memory instructions by a
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2577	future wavefront that uses the same scratch area, or a function call that creates a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2578	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2579	as all scalar writes are write-before-read in the same thread.
				2580
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2581	Scratch backing memory (which is used for the private address space)
is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
				2583	address space is only accessed by a single thread, and is always
				2584	write-before-read, there is never a need to invalidate these entries from the L1
				2585	cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
				2586	volatile cache lines.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2587
				2588	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2589	to invalidate the L2 cache. This also causes it to be treated as
				2590	non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
(cache coherent) and so the L2 cache will be coherent with the CPU and other
				2592	agents.
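
To show how the rows of the code sequence table are meant to be read, the
following Python sketch reproduces only the ``load atomic acquire`` rows for
GFX6-GFX9. It is a minimal illustration, not the backend's implementation: the
function name, its parameters and the returned strings are hypothetical, and
the instruction names are the abbreviated forms used in the table rather than
real mnemonics.

```python
def load_atomic_acquire_sequence(scope, addr_space, opencl=False):
    """Return the table's steps for a load atomic acquire as a list of strings."""
    load = {
        "global": "buffer/global/flat_load",
        "local": "ds_load",
        "generic": "flat_load",
    }[addr_space]
    if addr_space == "local" and scope in ("agent", "system"):
        # Scopes wider than work-group are treated as work-group for LDS.
        scope = "workgroup"
    if scope in ("singlethread", "wavefront"):
        return [load]
    if scope == "workgroup":
        if addr_space == "global" or opencl:
            # No wait is listed for global, and OpenCL omits the wait.
            return [load]
        return [load, "s_waitcnt lgkmcnt(0)"]
    if scope in ("agent", "system"):
        wait = "s_waitcnt vmcnt(0)"
        if addr_space == "generic" and not opencl:
            # A flat load may have accessed LDS, so also wait on lgkmcnt.
            wait += " & lgkmcnt(0)"
        # Wait for the load to complete, then invalidate the vector L1 cache.
        return [load + " glc=1", wait, "buffer_wbinvl1_vol"]
    raise ValueError("unknown sync scope: %r" % (scope,))
```

For instance, ``load_atomic_acquire_sequence("agent", "global")`` returns the
three steps listed for that row: the load with ``glc=1``, the
``s_waitcnt vmcnt(0)``, and the ``buffer_wbinvl1_vol``.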
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2593
				2594	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2595	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2596
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2597	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2598	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2599	Ordering Sync Scope Address
				2600	Space
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2601	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2602	Non-Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2603	-----------------------------------------------------------------------------------
				2604	load none none - global - !volatile & !nontemporal
				2605	- generic
				2606	- private 1. buffer/global/flat_load
				2607	- constant
				2608	- volatile & !nontemporal
				2609
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2610	1. buffer/global/flat_load
				2611	glc=1
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2612
				2613	- nontemporal
				2614
				2615	1. buffer/global/flat_load
				2616	glc=1 slc=1
				2617
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2618	load none none - local 1. ds_load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2619	store none none - global - !nontemporal
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2620	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2621	- private 1. buffer/global/flat_store
				2622	- constant
				2623	- nontemporal
				2624
1. buffer/global/flat_store
				2626	glc=1 slc=1
				2627
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2628	store none none - local 1. ds_store
				2629	Unordered Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2630	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2631	load atomic unordered any any Same as non-atomic.
				2632	store atomic unordered any any Same as non-atomic.
				2633	atomicrmw unordered any any *Same as monotonic
				2634	atomic*.
				2635	Monotonic Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2636	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2637	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2638	- wavefront - generic
				2639	- workgroup
				2640	load atomic monotonic - singlethread - local 1. ds_load
				2641	- wavefront
				2642	- workgroup
				2643	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2644	- system - generic glc=1
				2645	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2646	- wavefront - generic
				2647	- workgroup
				2648	- agent
				2649	- system
				2650	store atomic monotonic - singlethread - local 1. ds_store
				2651	- wavefront
				2652	- workgroup
				2653	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2654	- wavefront - generic
				2655	- workgroup
				2656	- agent
				2657	- system
				2658	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2659	- wavefront
				2660	- workgroup
				2661	Acquire Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2662	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2663	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2664	- wavefront - local
				2665	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2666	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				2667	load atomic acquire - workgroup - local 1. ds_load
				2668	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2669
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2670	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2671	- Must happen before
				2672	any following
				2673	global/generic
				2674	load/load
				2675	atomic/store/store
				2676	atomic/atomicrmw.
				2677	- Ensures any
				2678	following global
				2679	data read is no
				2680	older than the load
				2681	atomic value being
				2682	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2683	load atomic acquire - workgroup - generic 1. flat_load
				2684	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2685
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2686	- If OpenCL, omit.
				2687	- Must happen before
				2688	any following
				2689	global/generic
				2690	load/load
				2691	atomic/store/store
				2692	atomic/atomicrmw.
				2693	- Ensures any
				2694	following global
				2695	data read is no
				2696	older than the load
				2697	atomic value being
				2698	acquired.
				2699	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2700	- system glc=1
				2701	2. s_waitcnt vmcnt(0)
				2702
				2703	- Must happen before
				2704	following
				2705	buffer_wbinvl1_vol.
				2706	- Ensures the load
				2707	has completed
				2708	before invalidating
				2709	the cache.
				2710
				2711	3. buffer_wbinvl1_vol
				2712
				2713	- Must happen before
				2714	any following
				2715	global/generic
				2716	load/load
				2717	atomic/atomicrmw.
				2718	- Ensures that
				2719	following
				2720	loads will not see
				2721	stale global data.
				2722
				2723	load atomic acquire - agent - generic 1. flat_load glc=1
				2724	- system 2. s_waitcnt vmcnt(0) &
				2725	lgkmcnt(0)
				2726
- If OpenCL, omit
				2728	lgkmcnt(0).
				2729	- Must happen before
				2730	following
				2731	buffer_wbinvl1_vol.
				2732	- Ensures the flat_load
				2733	has completed
				2734	before invalidating
				2735	the cache.
				2736
				2737	3. buffer_wbinvl1_vol
				2738
				2739	- Must happen before
				2740	any following
				2741	global/generic
				2742	load/load
				2743	atomic/atomicrmw.
				2744	- Ensures that
				2745	following loads
				2746	will not see stale
				2747	global data.
				2748
				2749	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2750	- wavefront - local
				2751	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2752	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				2753	atomicrmw acquire - workgroup - local 1. ds_atomic
2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2755
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2756	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2757	- Must happen before
				2758	any following
				2759	global/generic
				2760	load/load
				2761	atomic/store/store
				2762	atomic/atomicrmw.
				2763	- Ensures any
				2764	following global
				2765	data read is no
				2766	older than the
				2767	atomicrmw value
				2768	being acquired.
				2769
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2770	atomicrmw acquire - workgroup - generic 1. flat_atomic
2. s_waitcnt lgkmcnt(0)
				2772
				2773	- If OpenCL, omit.
				2774	- Must happen before
				2775	any following
				2776	global/generic
				2777	load/load
				2778	atomic/store/store
				2779	atomic/atomicrmw.
				2780	- Ensures any
				2781	following global
				2782	data read is no
				2783	older than the
				2784	atomicrmw value
				2785	being acquired.
				2786
				2787	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2788	- system 2. s_waitcnt vmcnt(0)
				2789
				2790	- Must happen before
				2791	following
				2792	buffer_wbinvl1_vol.
				2793	- Ensures the
				2794	atomicrmw has
				2795	completed before
				2796	invalidating the
				2797	cache.
				2798
				2799	3. buffer_wbinvl1_vol
				2800
				2801	- Must happen before
				2802	any following
				2803	global/generic
				2804	load/load
				2805	atomic/atomicrmw.
				2806	- Ensures that
				2807	following loads
				2808	will not see stale
				2809	global data.
				2810
				2811	atomicrmw acquire - agent - generic 1. flat_atomic
				2812	- system 2. s_waitcnt vmcnt(0) &
				2813	lgkmcnt(0)
				2814
				2815	- If OpenCL, omit
				2816	lgkmcnt(0).
				2817	- Must happen before
				2818	following
				2819	buffer_wbinvl1_vol.
				2820	- Ensures the
				2821	atomicrmw has
				2822	completed before
				2823	invalidating the
				2824	cache.
				2825
				2826	3. buffer_wbinvl1_vol
				2827
				2828	- Must happen before
				2829	any following
				2830	global/generic
				2831	load/load
				2832	atomic/atomicrmw.
				2833	- Ensures that
				2834	following loads
				2835	will not see stale
				2836	global data.
				2837
				2838	fence acquire - singlethread none none
				2839	- wavefront
				2840	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2841
				2842	- If OpenCL and
				2843	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2844	not generic, omit.
				2845	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2846	currently has no
				2847	address space on
				2848	the fence need to
				2849	conservatively
				2850	always generate. If
				2851	fence had an
				2852	address space then
				2853	set to address
				2854	space of OpenCL
				2855	fence flag, or to
				2856	generic if both
				2857	local and global
				2858	flags are
				2859	specified.
				2860	- Must happen after
				2861	any preceding
				2862	local/generic load
				2863	atomic/atomicrmw
				2864	with an equal or
				2865	wider sync scope
				2866	and memory ordering
				2867	stronger than
				2868	unordered (this is
				2869	termed the
				2870	fence-paired-atomic).
				2871	- Must happen before
				2872	any following
				2873	global/generic
				2874	load/load
				2875	atomic/store/store
				2876	atomic/atomicrmw.
				2877	- Ensures any
				2878	following global
				2879	data read is no
				2880	older than the
				2881	value read by the
				2882	fence-paired-atomic.
				2883
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2884	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				2885	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2886
				2887	- If OpenCL and
				2888	address space is
				2889	not generic, omit
				2890	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2891	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2892	currently has no
				2893	address space on
				2894	the fence need to
				2895	conservatively
				2896	always generate
				2897	(see comment for
				2898	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2899	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2900	separate s_waitcnt
				2901	vmcnt(0) and
				2902	s_waitcnt
				2903	lgkmcnt(0) to allow
				2904	them to be
				2905	independently moved
				2906	according to the
				2907	following rules.
				2908	- s_waitcnt vmcnt(0)
				2909	must happen after
				2910	any preceding
				2911	global/generic load
				2912	atomic/atomicrmw
				2913	with an equal or
				2914	wider sync scope
				2915	and memory ordering
				2916	stronger than
				2917	unordered (this is
				2918	termed the
				2919	fence-paired-atomic).
				2920	- s_waitcnt lgkmcnt(0)
				2921	must happen after
				2922	any preceding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2923	local/generic load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2924	atomic/atomicrmw
				2925	with an equal or
				2926	wider sync scope
				2927	and memory ordering
				2928	stronger than
				2929	unordered (this is
				2930	termed the
				2931	fence-paired-atomic).
				2932	- Must happen before
				2933	the following
				2934	buffer_wbinvl1_vol.
				2935	- Ensures that the
				2936	fence-paired atomic
				2937	has completed
				2938	before invalidating
				2939	the
				2940	cache. Therefore
				2941	any following
				2942	locations read must
				2943	be no older than
				2944	the value read by
				2945	the
				2946	fence-paired-atomic.
				2947
				2948	2. buffer_wbinvl1_vol
				2949
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2950	- Must happen before any
				2951	following global/generic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2952	load/load
				2953	atomic/store/store
				2954	atomic/atomicrmw.
				2955	- Ensures that
				2956	following loads
				2957	will not see stale
				2958	global data.
				2959
				2960	Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2961	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2962	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2963	- wavefront - local
				2964	- generic
				2965	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2966
				2967	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2968	- Must happen after
				2969	any preceding
				2970	local/generic
				2971	load/store/load
				2972	atomic/store
				2973	atomic/atomicrmw.
				2974	- Must happen before
				2975	the following
				2976	store.
				2977	- Ensures that all
				2978	memory operations
				2979	to local have
				2980	completed before
				2981	performing the
				2982	store that is being
				2983	released.
				2984
				2985	2. buffer/global/flat_store
				2986	store atomic release - workgroup - local 1. ds_store
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2987	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2988
				2989	- If OpenCL, omit.
				2990	- Must happen after
				2991	any preceding
				2992	local/generic
				2993	load/store/load
				2994	atomic/store
				2995	atomic/atomicrmw.
				2996	- Must happen before
				2997	the following
				2998	store.
				2999	- Ensures that all
				3000	memory operations
				3001	to local have
				3002	completed before
				3003	performing the
				3004	store that is being
				3005	released.
				3006
				3007	2. flat_store
				3008	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				3009	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3010
				3011	- If OpenCL, omit
				3012	lgkmcnt(0).
				3013	- Could be split into
				3014	separate s_waitcnt
				3015	vmcnt(0) and
				3016	s_waitcnt
				3017	lgkmcnt(0) to allow
				3018	them to be
				3019	independently moved
				3020	according to the
				3021	following rules.
				3022	- s_waitcnt vmcnt(0)
				3023	must happen after
				3024	any preceding
				3025	global/generic
				3026	load/store/load
				3027	atomic/store
				3028	atomic/atomicrmw.
				3029	- s_waitcnt lgkmcnt(0)
				3030	must happen after
				3031	any preceding
				3032	local/generic
				3033	load/store/load
				3034	atomic/store
				3035	atomic/atomicrmw.
				3036	- Must happen before
				3037	the following
				3038	store.
				3039	- Ensures that all
				3040	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3041	to memory have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3042	completed before
				3043	performing the
				3044	store that is being
				3045	released.
				3046
				3047	2. buffer/global/ds/flat_store
				3048	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				3049	- wavefront - local
				3050	- generic
				3051	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3052
				3053	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3054	- Must happen after
				3055	any preceding
				3056	local/generic
				3057	load/store/load
				3058	atomic/store
				3059	atomic/atomicrmw.
				3060	- Must happen before
				3061	the following
				3062	atomicrmw.
				3063	- Ensures that all
				3064	memory operations
				3065	to local have
				3066	completed before
				3067	performing the
				3068	atomicrmw that is
				3069	being released.
				3070
				3071	2. buffer/global/flat_atomic
				3072	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3073	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3074
				3075	- If OpenCL, omit.
				3076	- Must happen after
				3077	any preceding
				3078	local/generic
				3079	load/store/load
				3080	atomic/store
				3081	atomic/atomicrmw.
				3082	- Must happen before
				3083	the following
				3084	atomicrmw.
				3085	- Ensures that all
				3086	memory operations
				3087	to local have
				3088	completed before
				3089	performing the
				3090	atomicrmw that is
				3091	being released.
				3092
				3093	2. flat_atomic
				3094	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				3095	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3096
				3097	- If OpenCL, omit
				3098	lgkmcnt(0).
				3099	- Could be split into
				3100	separate s_waitcnt
				3101	vmcnt(0) and
				3102	s_waitcnt
				3103	lgkmcnt(0) to allow
				3104	them to be
				3105	independently moved
				3106	according to the
				3107	following rules.
				3108	- s_waitcnt vmcnt(0)
				3109	must happen after
				3110	any preceding
				3111	global/generic
				3112	load/store/load
				3113	atomic/store
				3114	atomic/atomicrmw.
				3115	- s_waitcnt lgkmcnt(0)
				3116	must happen after
				3117	any preceding
				3118	local/generic
				3119	load/store/load
				3120	atomic/store
				3121	atomic/atomicrmw.
				3122	- Must happen before
				3123	the following
				3124	atomicrmw.
				3125	- Ensures that all
				3126	memory operations
				3127	to global and local
				3128	have completed
				3129	before performing
				3130	the atomicrmw that
				3131	is being released.
				3132
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3133	2. buffer/global/ds/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3134	fence release - singlethread none none
				3135	- wavefront
				3136	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				3137
				3138	- If OpenCL and
				3139	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3140	not generic, omit.
				3141	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3142	currently has no
				3143	address space on
				3144	the fence need to
				3145	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3146	always generate. If
				3147	fence had an
				3148	address space then
				3149	set to address
				3150	space of OpenCL
				3151	fence flag, or to
				3152	generic if both
				3153	local and global
				3154	flags are
				3155	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3156	- Must happen after
				3157	any preceding
				3158	local/generic
				3159	load/load
				3160	atomic/store/store
				3161	atomic/atomicrmw.
				3162	- Must happen before
				3163	any following store
				3164	atomic/atomicrmw
				3165	with an equal or
				3166	wider sync scope
				3167	and memory ordering
				3168	stronger than
				3169	unordered (this is
				3170	termed the
				3171	fence-paired-atomic).
				3172	- Ensures that all
				3173	memory operations
				3174	to local have
				3175	completed before
				3176	performing the
				3177	following
				3178	fence-paired-atomic.
				3179
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3180	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				3181	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3182
				3183	- If OpenCL and
				3184	address space is
				3185	not generic, omit
				3186	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3187	- If OpenCL and
				3188	address space is
				3189	local, omit
				3190	vmcnt(0).
				3191	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3192	currently has no
				3193	address space on
				3194	the fence need to
				3195	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3196	always generate. If
				3197	fence had an
				3198	address space then
				3199	set to address
				3200	space of OpenCL
				3201	fence flag, or to
				3202	generic if both
				3203	local and global
				3204	flags are
				3205	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3206	- Could be split into
				3207	separate s_waitcnt
				3208	vmcnt(0) and
				3209	s_waitcnt
				3210	lgkmcnt(0) to allow
				3211	them to be
				3212	independently moved
				3213	according to the
				3214	following rules.
				3215	- s_waitcnt vmcnt(0)
				3216	must happen after
				3217	any preceding
				3218	global/generic
				3219	load/store/load
				3220	atomic/store
				3221	atomic/atomicrmw.
				3222	- s_waitcnt lgkmcnt(0)
				3223	must happen after
				3224	any preceding
				3225	local/generic
				3226	load/store/load
				3227	atomic/store
				3228	atomic/atomicrmw.
				3229	- Must happen before
				3230	any following store
				3231	atomic/atomicrmw
				3232	with an equal or
				3233	wider sync scope
				3234	and memory ordering
				3235	stronger than
				3236	unordered (this is
				3237	termed the
				3238	fence-paired-atomic).
				3239	- Ensures that all
				3240	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3241	have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3242	completed before
				3243	performing the
				3244	following
				3245	fence-paired-atomic.
				3246
				3247	Acquire-Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3248	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3249	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3250	- wavefront - local
				3251	- generic
				3252	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3253
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3254	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3255	- Must happen after
				3256	any preceding
				3257	local/generic
				3258	load/store/load
				3259	atomic/store
				3260	atomic/atomicrmw.
				3261	- Must happen before
				3262	the following
				3263	atomicrmw.
				3264	- Ensures that all
				3265	memory operations
				3266	to local have
				3267	completed before
				3268	performing the
				3269	atomicrmw that is
				3270	being released.
				3271
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3272	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3273	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3274	2. s_waitcnt lgkmcnt(0)
				3275
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3276	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3277	- Must happen before
				3278	any following
				3279	global/generic
				3280	load/load
				3281	atomic/store/store
				3282	atomic/atomicrmw.
				3283	- Ensures any
				3284	following global
				3285	data read is no
older than the
atomicrmw value
being acquired.
				3289
				3290	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3291
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3292	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3293	- Must happen after
				3294	any preceding
				3295	local/generic
				3296	load/store/load
				3297	atomic/store
				3298	atomic/atomicrmw.
				3299	- Must happen before
				3300	the following
				3301	atomicrmw.
				3302	- Ensures that all
				3303	memory operations
				3304	to local have
				3305	completed before
				3306	performing the
				3307	atomicrmw that is
				3308	being released.
				3309
				3310	2. flat_atomic
				3311	3. s_waitcnt lgkmcnt(0)
				3312
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3313	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3314	- Must happen before
				3315	any following
				3316	global/generic
				3317	load/load
				3318	atomic/store/store
				3319	atomic/atomicrmw.
				3320	- Ensures any
				3321	following global
				3322	data read is no
older than the
atomicrmw value
being acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3326
				3327	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3328	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3329
				3330	- If OpenCL, omit
				3331	lgkmcnt(0).
				3332	- Could be split into
				3333	separate s_waitcnt
				3334	vmcnt(0) and
				3335	s_waitcnt
				3336	lgkmcnt(0) to allow
				3337	them to be
				3338	independently moved
				3339	according to the
				3340	following rules.
				3341	- s_waitcnt vmcnt(0)
				3342	must happen after
				3343	any preceding
				3344	global/generic
				3345	load/store/load
				3346	atomic/store
				3347	atomic/atomicrmw.
				3348	- s_waitcnt lgkmcnt(0)
				3349	must happen after
				3350	any preceding
				3351	local/generic
				3352	load/store/load
				3353	atomic/store
				3354	atomic/atomicrmw.
				3355	- Must happen before
				3356	the following
				3357	atomicrmw.
				3358	- Ensures that all
				3359	memory operations
				3360	to global have
				3361	completed before
				3362	performing the
				3363	atomicrmw that is
				3364	being released.
				3365
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3366	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3367	3. s_waitcnt vmcnt(0)
				3368
				3369	- Must happen before
				3370	following
				3371	buffer_wbinvl1_vol.
				3372	- Ensures the
				3373	atomicrmw has
				3374	completed before
				3375	invalidating the
				3376	cache.
				3377
				3378	4. buffer_wbinvl1_vol
				3379
				3380	- Must happen before
				3381	any following
				3382	global/generic
				3383	load/load
				3384	atomic/atomicrmw.
				3385	- Ensures that
				3386	following loads
				3387	will not see stale
				3388	global data.
				3389
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3390	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3391	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3392
				3393	- If OpenCL, omit
				3394	lgkmcnt(0).
				3395	- Could be split into
				3396	separate s_waitcnt
				3397	vmcnt(0) and
				3398	s_waitcnt
				3399	lgkmcnt(0) to allow
				3400	them to be
				3401	independently moved
				3402	according to the
				3403	following rules.
				3404	- s_waitcnt vmcnt(0)
				3405	must happen after
				3406	any preceding
				3407	global/generic
				3408	load/store/load
				3409	atomic/store
				3410	atomic/atomicrmw.
				3411	- s_waitcnt lgkmcnt(0)
				3412	must happen after
				3413	any preceding
				3414	local/generic
				3415	load/store/load
				3416	atomic/store
				3417	atomic/atomicrmw.
				3418	- Must happen before
				3419	the following
				3420	atomicrmw.
				3421	- Ensures that all
				3422	memory operations
				3423	to global have
				3424	completed before
				3425	performing the
				3426	atomicrmw that is
				3427	being released.
				3428
				3429	2. flat_atomic
				3430	3. s_waitcnt vmcnt(0) &
				3431	lgkmcnt(0)
				3432
				3433	- If OpenCL, omit
				3434	lgkmcnt(0).
				3435	- Must happen before
				3436	following
				3437	buffer_wbinvl1_vol.
				3438	- Ensures the
				3439	atomicrmw has
				3440	completed before
				3441	invalidating the
				3442	cache.
				3443
				3444	4. buffer_wbinvl1_vol
				3445
				3446	- Must happen before
				3447	any following
				3448	global/generic
				3449	load/load
				3450	atomic/atomicrmw.
				3451	- Ensures that
				3452	following loads
				3453	will not see stale
				3454	global data.
				3455
				3456	fence acq_rel - singlethread none none
				3457	- wavefront
				3458	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3459
				3460	- If OpenCL and
				3461	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3462	not generic, omit.
				3463	- However,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3464	since LLVM
				3465	currently has no
				3466	address space on
the fence, need to
				3468	conservatively
				3469	always generate
				3470	(see comment for
				3471	previous fence).
				3472	- Must happen after
				3473	any preceding
				3474	local/generic
				3475	load/load
				3476	atomic/store/store
				3477	atomic/atomicrmw.
				3478	- Must happen before
				3479	any following
				3480	global/generic
				3481	load/load
				3482	atomic/store/store
				3483	atomic/atomicrmw.
				3484	- Ensures that all
				3485	memory operations
				3486	to local have
				3487	completed before
				3488	performing any
				3489	following global
				3490	memory operations.
				3491	- Ensures that the
				3492	preceding
				3493	local/generic load
				3494	atomic/atomicrmw
				3495	with an equal or
				3496	wider sync scope
				3497	and memory ordering
				3498	stronger than
				3499	unordered (this is
				3500	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3501	acquire-fence-paired-atomic
				3502	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3503	before following
				3504	global memory
				3505	operations. This
				3506	satisfies the
				3507	requirements of
				3508	acquire.
				3509	- Ensures that all
				3510	previous memory
				3511	operations have
				3512	completed before a
				3513	following
				3514	local/generic store
				3515	atomic/atomicrmw
				3516	with an equal or
				3517	wider sync scope
				3518	and memory ordering
				3519	stronger than
				3520	unordered (this is
				3521	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3522	release-fence-paired-atomic
				3523	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3524	requirements of
				3525	release.
				3526
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3527	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				3528	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3529
				3530	- If OpenCL and
				3531	address space is
				3532	not generic, omit
				3533	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3534	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3535	currently has no
				3536	address space on
the fence, need to
				3538	conservatively
				3539	always generate
				3540	(see comment for
				3541	previous fence).
				3542	- Could be split into
				3543	separate s_waitcnt
				3544	vmcnt(0) and
				3545	s_waitcnt
				3546	lgkmcnt(0) to allow
				3547	them to be
				3548	independently moved
				3549	according to the
				3550	following rules.
				3551	- s_waitcnt vmcnt(0)
				3552	must happen after
				3553	any preceding
				3554	global/generic
				3555	load/store/load
				3556	atomic/store
				3557	atomic/atomicrmw.
				3558	- s_waitcnt lgkmcnt(0)
				3559	must happen after
				3560	any preceding
				3561	local/generic
				3562	load/store/load
				3563	atomic/store
				3564	atomic/atomicrmw.
				3565	- Must happen before
				3566	the following
				3567	buffer_wbinvl1_vol.
				3568	- Ensures that the
				3569	preceding
				3570	global/local/generic
				3571	load
				3572	atomic/atomicrmw
				3573	with an equal or
				3574	wider sync scope
				3575	and memory ordering
				3576	stronger than
				3577	unordered (this is
				3578	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3579	acquire-fence-paired-atomic
				3580	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3581	before invalidating
				3582	the cache. This
				3583	satisfies the
				3584	requirements of
				3585	acquire.
				3586	- Ensures that all
				3587	previous memory
				3588	operations have
				3589	completed before a
				3590	following
				3591	global/local/generic
				3592	store
				3593	atomic/atomicrmw
				3594	with an equal or
				3595	wider sync scope
				3596	and memory ordering
				3597	stronger than
				3598	unordered (this is
				3599	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3600	release-fence-paired-atomic
				3601	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3602	requirements of
				3603	release.
				3604
				3605	2. buffer_wbinvl1_vol
				3606
				3607	- Must happen before
				3608	any following
				3609	global/generic
				3610	load/load
				3611	atomic/store/store
				3612	atomic/atomicrmw.
				3613	- Ensures that
				3614	following loads
				3615	will not see stale
				3616	global data. This
				3617	satisfies the
				3618	requirements of
				3619	acquire.
				3620
Sequentially Consistent Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3622	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3623	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3624	- wavefront - local load atomic acquire,
- generic except must generate
				3626	all instructions even
				3627	for OpenCL.*
				3628	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3629	- generic
				3630	- Must
				3631	happen after
				3632	preceding
local/generic load
				3634	atomic/store
				3635	atomic/atomicrmw
				3636	with memory
				3637	ordering of seq_cst
				3638	and with equal or
				3639	wider sync scope.
				3640	(Note that seq_cst
				3641	fences have their
				3642	own s_waitcnt
				3643	lgkmcnt(0) and so do
				3644	not need to be
				3645	considered.)
				3646	- Ensures any
				3647	preceding
sequentially
				3649	consistent local
				3650	memory instructions
				3651	have completed
				3652	before executing
				3653	this sequentially
				3654	consistent
				3655	instruction. This
				3656	prevents reordering
				3657	a seq_cst store
				3658	followed by a
				3659	seq_cst load. (Note
				3660	that seq_cst is
				3661	stronger than
				3662	acquire/release as
				3663	the reordering of
				3664	load acquire
				3665	followed by a store
				3666	release is
				3667	prevented by the
				3668	waitcnt of
				3669	the release, but
				3670	there is nothing
				3671	preventing a store
				3672	release followed by
				3673	load acquire from
completing out of
				3675	order.)
				3676
				3677	2. *Following
				3678	instructions same as
				3679	corresponding load
				3680	atomic acquire,
except must generate
				3682	all instructions even
				3683	for OpenCL.*
				3684	load atomic seq_cst - workgroup - local *Same as corresponding
				3685	load atomic acquire,
except must generate
				3687	all instructions even
				3688	for OpenCL.*
				3689	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				3690	- system - generic vmcnt(0)
				3691
				3692	- Could be split into
				3693	separate s_waitcnt
				3694	vmcnt(0)
				3695	and s_waitcnt
				3696	lgkmcnt(0) to allow
				3697	them to be
				3698	independently moved
				3699	according to the
				3700	following rules.
- s_waitcnt lgkmcnt(0)
				3702	must happen after
				3703	preceding
local/generic load
				3705	atomic/store
				3706	atomic/atomicrmw
				3707	with memory
				3708	ordering of seq_cst
				3709	and with equal or
				3710	wider sync scope.
				3711	(Note that seq_cst
				3712	fences have their
				3713	own s_waitcnt
				3714	lgkmcnt(0) and so do
				3715	not need to be
				3716	considered.)
- s_waitcnt vmcnt(0)
				3718	must happen after
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3719	preceding
				3720	global/generic load
				3721	atomic/store
				3722	atomic/atomicrmw
				3723	with memory
				3724	ordering of seq_cst
				3725	and with equal or
				3726	wider sync scope.
				3727	(Note that seq_cst
				3728	fences have their
				3729	own s_waitcnt
				3730	vmcnt(0) and so do
				3731	not need to be
				3732	considered.)
				3733	- Ensures any
				3734	preceding
sequentially
				3736	consistent global
				3737	memory instructions
				3738	have completed
				3739	before executing
				3740	this sequentially
				3741	consistent
				3742	instruction. This
				3743	prevents reordering
				3744	a seq_cst store
				3745	followed by a
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3746	seq_cst load. (Note
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3747	that seq_cst is
				3748	stronger than
				3749	acquire/release as
				3750	the reordering of
				3751	load acquire
				3752	followed by a store
				3753	release is
				3754	prevented by the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3755	waitcnt of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3756	the release, but
				3757	there is nothing
				3758	preventing a store
				3759	release followed by
				3760	load acquire from
completing out of
				3762	order.)
				3763
				3764	2. *Following
				3765	instructions same as
				3766	corresponding load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3767	atomic acquire,
except must generate
				3769	all instructions even
				3770	for OpenCL.*
store atomic seq_cst      - singlethread - global   *Same as corresponding
                          - wavefront    - local    store atomic release,
                          - workgroup    - generic  except must generate
                                                    all instructions even
                                                    for OpenCL.*
store atomic seq_cst      - agent        - global   *Same as corresponding
                          - system       - generic  store atomic release,
                                                    except must generate
                                                    all instructions even
                                                    for OpenCL.*
atomicrmw    seq_cst      - singlethread - global   *Same as corresponding
                          - wavefront    - local    atomicrmw acq_rel,
                          - workgroup    - generic  except must generate
                                                    all instructions even
                                                    for OpenCL.*
atomicrmw    seq_cst      - agent        - global   *Same as corresponding
                          - system       - generic  atomicrmw acq_rel,
                                                    except must generate
                                                    all instructions even
                                                    for OpenCL.*
fence        seq_cst      - singlethread none       *Same as corresponding
                          - wavefront               fence acq_rel,
                          - workgroup               except must generate
                          - agent                   all instructions even
                          - system                  for OpenCL.*
				3796	============ ============ ============== ========== ===============================
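
As an illustration of how rows of the table above translate into machine code,
the following sketch shows the sequence for an ``atomicrmw acq_rel`` at agent
scope on the generic address space, followed by a ``fence acq_rel`` at agent
scope. It assumes a target with flat instructions (GFX7 or later) and a
non-OpenCL language, so no waits are omitted. The register numbers and the
choice of ``flat_atomic_add`` are hypothetical and only serve the example.

.. code-block:: nasm

  ; atomicrmw acq_rel, agent scope, generic address space
  s_waitcnt vmcnt(0) & lgkmcnt(0)      ; release: prior global and local accesses have completed
  flat_atomic_add v1, v[2:3], v4 glc   ; the atomicrmw; glc returns the pre-op value in v1
  s_waitcnt vmcnt(0) & lgkmcnt(0)      ; the atomicrmw completes before the cache invalidate
  buffer_wbinvl1_vol                   ; acquire: following loads do not see stale global data

  ; fence acq_rel, agent scope
  s_waitcnt vmcnt(0) & lgkmcnt(0)      ; preceding memory operations have completed
  buffer_wbinvl1_vol                   ; invalidate before any following global/generic access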
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3797
The memory order also adds the single thread optimization constraints defined in
table
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3801
.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
   :name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table

   ============ ==============================================================
   LLVM Memory  Optimization Constraints
   Ordering
   ============ ==============================================================
   unordered    none
   monotonic    none
   acquire      - If a load atomic/atomicrmw then no following load/load
                  atomic/store/store atomic/atomicrmw/fence instruction can
                  be moved before the acquire.
                - If a fence then same as load atomic, plus no preceding
                  associated fence-paired-atomic can be moved after the fence.
   release      - If a store atomic/atomicrmw then no preceding load/load
                  atomic/store/store atomic/atomicrmw/fence instruction can
                  be moved after the release.
                - If a fence then same as store atomic, plus no following
                  associated fence-paired-atomic can be moved before the
                  fence.
   acq_rel      Same constraints as both acquire and release.
   seq_cst      - If a load atomic then same constraints as acquire, plus no
                  preceding sequentially consistent load atomic/store
                  atomic/atomicrmw/fence instruction can be moved after the
                  seq_cst.
                - If a store atomic then the same constraints as release, plus
                  no following sequentially consistent load atomic/store
                  atomic/atomicrmw/fence instruction can be moved before the
                  seq_cst.
                - If an atomicrmw/fence then same constraints as acq_rel.
   ============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3833
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3834	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3835	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3836
For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible
runtimes (such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that
supports the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3840
.. table:: AMDGPU Trap Handler for AMDHSA OS
   :name: amdgpu-trap-handler-for-amdhsa-os-table

   =================== =============== =============== =======================
   Usage               Code Sequence   Trap Handler    Description
                                       Inputs
   =================== =============== =============== =======================
   reserved            ``s_trap 0x00``                 Reserved by hardware.
   ``debugtrap(arg)``  ``s_trap 0x01`` ``SGPR0-1``:    Reserved for HSA
                                         ``queue_ptr`` ``debugtrap``
                                       ``VGPR0``:      intrinsic (not
                                         ``arg``       implemented).
   ``llvm.trap``       ``s_trap 0x02`` ``SGPR0-1``:    Causes dispatch to be
                                         ``queue_ptr`` terminated and its
                                                       associated queue put
                                                       into the error state.
   ``llvm.debugtrap``  ``s_trap 0x03``                 - If the debugger is
                                                         not installed,
                                                         behaves as a
                                                         no-operation. The
                                                         trap handler is
                                                         entered and
                                                         immediately returns
                                                         to continue
                                                         execution of the
                                                         wavefront.
                                                       - If the debugger is
                                                         installed, causes
                                                         the debug trap to be
                                                         reported by the
                                                         debugger and the
                                                         wavefront is put in
                                                         the halt state until
                                                         resumed by the
                                                         debugger.
   reserved            ``s_trap 0x04``                 Reserved.
   reserved            ``s_trap 0x05``                 Reserved.
   reserved            ``s_trap 0x06``                 Reserved.
   debugger breakpoint ``s_trap 0x07``                 Reserved for debugger
                                                       breakpoints.
   reserved            ``s_trap 0x08``                 Reserved.
   reserved            ``s_trap 0xfe``                 Reserved.
   reserved            ``s_trap 0xff``                 Reserved.
   =================== =============== =============== =======================
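
The two code sequences that correspond to LLVM intrinsics might look as
follows. This is only a sketch: which SGPR pair initially holds ``queue_ptr``
depends on the kernel's enabled SGPRs, so the use of ``s[4:5]`` here is an
assumption made for illustration.

.. code-block:: nasm

  ; llvm.trap: the trap handler reads queue_ptr from SGPR0-1.
  s_mov_b64 s[0:1], s[4:5]   ; s[4:5] is assumed to hold queue_ptr
  s_trap 0x02

  ; llvm.debugtrap: the table lists no trap handler inputs.
  s_trap 0x03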
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3885
Tim Corringham	af2dfc6	2018-04-04 13:02:09 +0000	[diff] [blame]	3886	AMDPAL
				3887	------
				3888
				3889	This section provides code conventions used when the target triple OS is
				3890	``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
				3891	from the application/runtime to each invocation of a hardware shader. These
				3892	parameters include both generic, application-controlled parameters called
				3893	user data as well as system-generated parameters that are a product of the
				3894	draw or dispatch execution.
				3895
				3896	User Data
				3897	~~~~~~~~~
				3898
				3899	Each hardware stage has a set of 32-bit user data registers which can be
				3900	written from a command buffer and then loaded into SGPRs when waves are launched
				3901	via a subsequent dispatch or draw operation. This is the way most arguments are
				3902	passed from the application/runtime to a hardware shader.
				3903
				3904	Compute User Data
				3905	~~~~~~~~~~~~~~~~~
				3906
Compute shader user data mappings are simpler than those of graphics shaders
and are fixed.
				3909
				3910	Note that there are always 10 available user data entries in registers -
				3911	entries beyond that limit must be fetched from memory (via the spill table
				3912	pointer) by the shader.
				3913
.. table:: PAL Compute Shader User Data Registers
   :name: pal-compute-user-data-registers

   ============= ===================================================
   User Register Description
   ============= ===================================================
   0             Global Internal Table (32-bit pointer)
   1             Per-Shader Internal Table (32-bit pointer)
   2 - 11        Application-Controlled User Data (10 32-bit values)
   12            Spill Table (32-bit pointer)
   13 - 14       Thread Group Count (64-bit pointer)
   15            GDS Range
   ============= ===================================================
				3927
				3928	Graphics User Data
				3929	~~~~~~~~~~~~~~~~~~
				3930
				3931	Graphics pipelines support a much more flexible user data mapping:
				3932
.. table:: PAL Graphics Shader User Data Registers
   :name: pal-graphics-user-data-registers

   ============= ============================================
   User Register Description
   ============= ============================================
   0             Global Internal Table (32-bit pointer)
   +             Per-Shader Internal Table (32-bit pointer)
   + 1-15        Application Controlled User Data
                 (1-15 Contiguous 32-bit Values in Registers)
   +             Spill Table (32-bit pointer)
   +             Draw Index (First Stage Only)
   +             Vertex Offset (First Stage Only)
   +             Instance Offset (First Stage Only)
   ============= ============================================
				3948
The placement of the global internal table remains fixed in the first *user
data SGPR register*. Otherwise all parameters are optional, and can be mapped
to any desired user data SGPR register, with the following restrictions:

* Draw Index, Vertex Offset, and Instance Offset can only be used by the first
  active hardware stage in a graphics pipeline (i.e. where the API vertex
  shader runs).
				3956
				3957	* Application-controlled user data must be mapped into a contiguous range of
				3958	user data registers.
				3959
				3960	* The application-controlled user data range supports compaction remapping, so
				3961	only entries that are actually consumed by the shader must be assigned to
				3962	corresponding registers. Note that in order to support an efficient runtime
				3963	implementation, the remapping must pack registers in the same order as
				3964	entries, with unused entries removed.
				3965
				3966	.. _pal_global_internal_table:
				3967
				3968	Global Internal Table
				3969	~~~~~~~~~~~~~~~~~~~~~
				3970
				3971	The global internal table is a table of shader resource descriptors (SRDs) that
				3972	define how certain engine-wide, runtime-managed resources should be accessed
				3973	from a shader. The majority of these resources have HW-defined formats, and it
				3974	is up to the compiler to write/read data as required by the target hardware.
				3975
				3976	The following table illustrates the required format:
				3977
.. table:: PAL Global Internal Table
   :name: pal-git-table

   ============= ========================================
   Offset        Description
   ============= ========================================
   0-3           Graphics Scratch SRD
   4-7           Compute Scratch SRD
   8-11          ES/GS Ring Output SRD
   12-15         ES/GS Ring Input SRD
   16-19         GS/VS Ring Output #0
   20-23         GS/VS Ring Output #1
   24-27         GS/VS Ring Output #2
   28-31         GS/VS Ring Output #3
   32-35         GS/VS Ring Input SRD
   36-39         Tessellation Factor Buffer SRD
   40-43         Off-Chip LDS Buffer SRD
   44-47         Off-Chip Param Cache Buffer SRD
   48-51         Sample Position Buffer SRD
   52            vaRange::ShadowDescriptorTable High Bits
   ============= ========================================
				3999
				4000	The pointer to the global internal table passed to the shader as user data
				4001	is a 32-bit pointer. The top 32 bits should be assumed to be the same as
				4002	the top 32 bits of the pipeline, so the shader may use the program
				4003	counter's top 32 bits.
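
One way a compute shader might form the full 64-bit table address and fetch a
descriptor is sketched below. The sketch rests on several assumptions that this
section does not state: that user data register 0 has been loaded into ``s0``,
that the table offsets are in dwords (so the Compute Scratch SRD at offsets 4-7
starts at byte 16), and that the target encodes scalar load offsets in bytes
(GFX8 and later).

.. code-block:: nasm

  s_getpc_b64 s[2:3]                    ; s3 = top 32 bits of the program counter
  s_mov_b32 s2, s0                      ; s0 (assumed): 32-bit global internal table pointer
  s_load_dwordx4 s[4:7], s[2:3], 0x10   ; Compute Scratch SRD, assuming dword offsets 4-7
  s_waitcnt lgkmcnt(0)                  ; wait for the scalar load before using s[4:7]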
				4004
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	4005	Unspecified OS
				4006	--------------
				4007
				4008	This section provides code conventions used when the target triple OS is
				4009	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4010
				4011	Trap Handler ABI
				4012	~~~~~~~~~~~~~~~~
				4013
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the
runtime does not install a trap handler. The ``llvm.trap`` and
``llvm.debugtrap`` intrinsics are handled as follows:
				4017
.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
   :name: amdgpu-trap-handler-for-non-amdhsa-os-table

   =============== =============== ===========================================
   Usage           Code Sequence   Description
   =============== =============== ===========================================
   llvm.trap       s_endpgm        Causes wavefront to be terminated.
   llvm.debugtrap  none            Compiler warning given that there is no
                                   trap handler installed.
   =============== =============== ===========================================
				4028
				4029	Source Languages
				4030	================
				4031
				4032	.. _amdgpu-opencl:
				4033
				4034	OpenCL
				4035	------
				4036
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4037	When the language is OpenCL the following differences occur:
				4038
				4039	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	4040	2. The AMDGPU backend appends additional arguments to the kernel's explicit
				4041	arguments for the AMDHSA OS (see
				4042	:ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	4043	3. Additional metadata is generated
Scott Linder	1e8c2c7	2018-06-21 19:38:56 +0000	[diff] [blame]	4044	(see :ref:`amdgpu-amdhsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4045
.. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
   :name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table

   ======== ==== ========= ===========================================
   Position Byte Byte      Description
            Size Alignment
   ======== ==== ========= ===========================================
   1        8    8         OpenCL Global Offset X
   2        8    8         OpenCL Global Offset Y
   3        8    8         OpenCL Global Offset Z
   4        8    8         OpenCL address of printf buffer
   5        8    8         OpenCL address of virtual queue used by
                           enqueue_kernel.
   6        8    8         OpenCL address of AqlWrap struct used by
                           enqueue_kernel.
   ======== ==== ========= ===========================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4062
				4063	.. _amdgpu-hcc:
				4064
				4065	HCC
				4066	---
				4067
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	4068	When the language is HCC the following differences occur:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4069
				4070	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				4071
Scott Linder	1e8c2c7	2018-06-21 19:38:56 +0000	[diff] [blame]	4072	.. _amdgpu-assembler:
				4073
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4074	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4075	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4076
The AMDGPU backend has an LLVM-MC based assembler, which is currently in
development. It supports AMDGCN GFX6-GFX9.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4079
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4080	This section describes general syntax for instructions and operands.
				4081
				4082	Instructions
				4083	~~~~~~~~~~~~
				4084
.. toctree::
   :hidden:

   AMDGPUAsmGFX7
   AMDGPUAsmGFX8
   AMDGPUAsmGFX9
   AMDGPUOperandSyntax
				4092
				4093	An instruction has the following syntax:
				4094
				4095	<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...
				4096
				4097	Note that operands are normally comma-separated while modifiers are space-separated.
				4098
				4099	The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
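
As a small illustration of this layout, the annotated lines below split a few
instructions into opcode, operands and modifiers. The instructions follow the
forms used elsewhere in this document; the register numbers are arbitrary.

.. code-block:: nasm

  ; opcode           operands (comma-separated)   modifiers (space-separated)
  buffer_load_dword  v1, v2, s[4:7], s1           offen offset:4 glc
  ds_add_u32         v2, v4                       offset:16

  ; The same opcode with all optional modifiers omitted.
  buffer_load_dword  v1, off, s[4:7], s1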
				4100
See the detailed instruction syntax descriptions for :doc:`GFX7<AMDGPUAsmGFX7>`,
:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
				4103
				4104	Note that features under development are not included in this description.
				4105
For more information about instructions, their semantics and supported combinations of
operands, refer to one of the instruction set architecture manuals
[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4109
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4110	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4111	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4112
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4113	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4114
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4115	* SGPR registers: s0, ... or s[0], ...
				4116	* VGPR registers: v0, ... or v[0], ...
				4117	* TTMP registers: ttmp0, ... or ttmp[0], ...
				4118	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				4119	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				4120	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				4121	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				4122	* Register index expressions: v[2*2], s[1-1:2-1]
				4123	* 'off' indicates that an operand is not enabled
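
The forms listed above can be combined as in the following sketch. The
instructions and register numbers are chosen only for illustration, and the
comments describe the intended reading of each operand rather than verified
assembler output.

.. code-block:: nasm

  s_mov_b32 s0, s[1]                      ; a single SGPR in both spellings
  v_mov_b32 v[2*2], v1                    ; index expression: v[2*2] names v4
  s_mov_b64 s[1-1:2-1], exec              ; index expressions: s[0:1]; exec is a 64-bit special register
  s_mov_b32 s2, vcc_lo                    ; low half of vcc
  s_mov_b64 [s4, s5], s[2:3]              ; register list, naming the same pair as s[4:5]
  buffer_load_dword v1, off, s[4:7], s1   ; 'off': the address VGPR operand is not enabled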
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4124
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4125	Modifiers
				4126	~~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4127
A detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4129
Instruction Examples
~~~~~~~~~~~~~~~~~~~~

DS
++

.. code-block:: nasm

  ds_add_u32 v2, v4 offset:16
  ds_write_src2_b64 v2 offset0:4 offset1:8
  ds_cmpst_f32 v2, v4, v6
  ds_min_rtn_f64 v[8:9], v2, v[4:5]

For the full list of supported instructions, refer to "LDS/GDS instructions" in the ISA manual.

FLAT
++++

.. code-block:: nasm

  flat_load_dword v1, v[3:4]
  flat_store_dwordx3 v[3:4], v[5:7]
  flat_atomic_swap v1, v[3:4], v5 glc
  flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
  flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc

For the full list of supported instructions, refer to "FLAT instructions" in the ISA manual.

MUBUF
+++++

.. code-block:: nasm

  buffer_load_dword v1, off, s[4:7], s1
  buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
  buffer_store_format_xy v[1:2], off, s[4:7], s1
  buffer_wbinvl1
  buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc

For the full list of supported instructions, refer to "MUBUF Instructions" in the ISA manual.

SMRD/SMEM
+++++++++

.. code-block:: nasm

  s_load_dword s1, s[2:3], 0xfc
  s_load_dwordx8 s[8:15], s[2:3], s4
  s_load_dwordx16 s[88:103], s[2:3], s4
  s_dcache_inv_vol
  s_memtime s[4:5]

For the full list of supported instructions, refer to "Scalar Memory Operations" in the ISA manual.

SOP1
++++

.. code-block:: nasm

  s_mov_b32 s1, s2
  s_mov_b64 s[0:1], 0x80000000
  s_cmov_b32 s1, 200
  s_wqm_b64 s[2:3], s[4:5]
  s_bcnt0_i32_b64 s1, s[2:3]
  s_swappc_b64 s[2:3], s[4:5]
  s_cbranch_join s[4:5]

For the full list of supported instructions, refer to "SOP1 Instructions" in the ISA manual.

SOP2
++++

.. code-block:: nasm

  s_add_u32 s1, s2, s3
  s_and_b64 s[2:3], s[4:5], s[6:7]
  s_cselect_b32 s1, s2, s3
  s_andn2_b32 s2, s4, s6
  s_lshr_b64 s[2:3], s[4:5], s6
  s_ashr_i32 s2, s4, s6
  s_bfm_b64 s[2:3], s4, s6
  s_bfe_i64 s[2:3], s[4:5], s6
  s_cbranch_g_fork s[4:5], s[6:7]

For the full list of supported instructions, refer to "SOP2 Instructions" in the ISA manual.

SOPC
++++

.. code-block:: nasm

  s_cmp_eq_i32 s1, s2
  s_bitcmp1_b32 s1, s2
  s_bitcmp0_b64 s[2:3], s4
  s_setvskip s3, s5

For the full list of supported instructions, refer to "SOPC Instructions" in the ISA manual.

SOPP
++++

.. code-block:: nasm

  s_barrier
  s_nop 2
  s_endpgm
  s_waitcnt 0 ; Wait for all counters to be 0
  s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
  s_sethalt 9
  s_sleep 10
  s_sendmsg 0x1
  s_sendmsg sendmsg(MSG_INTERRUPT)
  s_trap 1

For the full list of supported instructions, refer to "SOPP Instructions" in the ISA manual.

Unless otherwise mentioned, little verification is performed on the operands
of SOPP instructions, so it is up to the programmer to be familiar with the
range of acceptable values.

VALU
++++

For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically selects the optimal encoding based on the
instruction's operands. To force a specific encoding, add a suffix to the
opcode of the instruction:

* _e32 for 32-bit VOP1/VOP2/VOPC
* _e64 for 64-bit VOP3
* _dpp for VOP_DPP
* _sdwa for VOP_SDWA
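
As a small illustration of these suffixes, the same addition can be written
with and without a forced encoding. The choice of ``v_add_f32`` and of the
registers is arbitrary.

.. code-block:: nasm

  v_add_f32 v1, v2, v3        ; encoding chosen by the assembler
  v_add_f32_e32 v1, v2, v3    ; forces the 32-bit VOP2 encoding
  v_add_f32_e64 v1, v2, v3    ; forces the 64-bit VOP3 encoding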

VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For the full list of supported instructions, refer to "Vector ALU instructions" in the ISA manual.

.. TODO
   Remove once we switch to code object v3 by default.

HSA Code Object Directives
~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

major and minor are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

major, minor, and stepping are all integers that describe the instruction
set architecture (ISA) version of the assembly program.

vendor and arch are quoted strings. vendor should always be equal to
"AMD" and arch should always be equal to "AMDGPU".

By default, the assembler will derive the ISA version, vendor, and arch
from the value of the -mcpu option that is passed to the assembler.
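
For illustration, both directives can be given with explicit arguments as
sketched below. The version numbers are placeholders, assumed here to describe
a GFX7 target; when the arguments of ``.hsa_code_object_isa`` are omitted, they
are derived from ``-mcpu`` as described above.

.. code-block:: none

  .hsa_code_object_version 1,0
  .hsa_code_object_isa 7,0,0,"AMD","AMDGPU"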

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
STT_AMDGPU_HSA_KERNEL.

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are used
to specify the amd_kernel_code_t object that will be emitted by the assembler.
The list must be terminated by the .end_amd_kernel_code_t directive. For
any amd_kernel_code_t values that are unspecified, a default value will be
used. The default value for all keys is 0, with the following exceptions:

- kernel_code_version_major defaults to 1.
- machine_kind defaults to 1.
- machine_version_major, machine_version_minor, and
  machine_version_stepping are derived from the value of the -mcpu option
  that is passed to the assembler.
- kernel_code_entry_byte_offset defaults to 256.
- wavefront_size defaults to 6.
- kernarg_segment_alignment, group_segment_alignment, and
  private_segment_alignment default to 4. Note that alignments are specified
  as a power of two, so a value of n means an alignment of 2^n.
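
As a sketch of the power-of-two convention, the fragment below requests a
64-byte kernarg segment alignment and leaves every other key at its default.
The value 6 is only an example.

.. code-block:: none

  .amd_kernel_code_t
     // 2^6 = 64-byte alignment
     kernarg_segment_alignment = 6
  .end_amd_kernel_code_t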

The .amd_kernel_code_t directive must be placed immediately after the
function label and before any instructions.

For a full list of amd_kernel_code_t keys, refer to the AMDGPU ABI document,
the comments in lib/Target/AMDGPU/AmdKernelCodeT.h and
test/CodeGen/AMDGPU/hsa.s.

Here is an example of a minimal amd_kernel_code_t specification:

.. code-block:: none

  .hsa_code_object_version 1,0
  .hsa_code_object_isa

  .hsatext
  .globl hello_world
  .p2align 8
  .amdgpu_hsa_kernel hello_world

  hello_world:

    .amd_kernel_code_t
      enable_sgpr_kernarg_segment_ptr = 1
      is_ptr64 = 1
      compute_pgm_rsrc1_vgprs = 0
      compute_pgm_rsrc1_sgprs = 0
      compute_pgm_rsrc2_user_sgpr = 2
      kernarg_segment_byte_size = 8
      wavefront_sgpr_count = 2
      workitem_vgpr_count = 3
    .end_amd_kernel_code_t

    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size hello_world, .Lfunc_end0-hello_world

Predefined Symbols (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU assembler defines and updates some symbols automatically. These
symbols do not affect code generation.

.amdgcn.gfx_generation_number
+++++++++++++++++++++++++++++

Set to the GFX generation number of the target being assembled for. For
example, when assembling for a "GFX9" target this will be set to the integer
value "9". The possible GFX generation numbers are presented in
:ref:`amdgpu-processors`.

.amdgcn.next_free_vgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum VGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that VGPR number plus one.

May be used to set the ``.amdhsa_next_free_vgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.

.amdgcn.next_free_sgpr
++++++++++++++++++++++

Set to zero before assembly begins. At each instruction, if the current value
of this symbol is less than or equal to the maximum SGPR number explicitly
referenced within that instruction then the symbol value is updated to equal
that SGPR number plus one.

May be used to set the ``.amdhsa_next_free_sgpr`` directive in
:ref:`amdhsa-kernel-directives-table`.

May be set at any time, e.g. manually set to zero at the start of each kernel.
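
The update rule for both symbols can be traced on a short instruction
sequence. This is a sketch only: it assumes the symbols are reset with the
generic ``.set`` directive, and the comments show the values that follow from
the rule described above.

.. code-block:: none

  .set .amdgcn.next_free_vgpr, 0     // manual reset at the start of a kernel
  .set .amdgcn.next_free_sgpr, 0

  s_load_dwordx2 s[0:1], s[0:1] 0x0  // highest SGPR is s1: next_free_sgpr = 2
  v_mov_b32 v0, 3.14159              // highest VGPR is v0: next_free_vgpr = 1
  v_mov_b32 v1, s0                   // highest VGPR is v1: next_free_vgpr = 2
  v_mov_b32 v2, s1                   // highest VGPR is v2: next_free_vgpr = 3
  flat_store_dword v[1:2], v0        // no higher register: values unchanged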

Code Object Directives (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Directives which begin with ``.amdgcn`` are valid for all ``amdgcn``
architecture processors, and are not OS-specific. Directives which begin with
``.amdhsa`` are specific to ``amdgcn`` architecture processors when the
``amdhsa`` OS is specified. See :ref:`amdgpu-target-triples` and
:ref:`amdgpu-processors`.

.amdgcn_target <target>
+++++++++++++++++++++++

Optional directive which declares the target supported by the containing
assembler source file. Valid values are described in
:ref:`amdgpu-amdhsa-code-object-target-identification`. Used by the assembler
to validate command-line options such as ``-triple``, ``-mcpu``, and those
which specify target features.

.amdhsa_kernel <name>
+++++++++++++++++++++

Creates a correctly aligned AMDHSA kernel descriptor and a symbol,
``<name>.kd``, in the current location of the current section. Only valid when
the OS is ``amdhsa``. ``<name>`` must be a symbol that labels the first
instruction to execute, and does not need to be previously defined.

Marks the beginning of a list of directives used to generate the bytes of a
kernel descriptor, as described in :ref:`amdgpu-amdhsa-kernel-descriptor`.
Directives which may appear in this list are described in
:ref:`amdhsa-kernel-directives-table`. Directives may appear in any order, must
be valid for the target being assembled for, and cannot be repeated. Directives
support the range of values specified by the field they reference in
:ref:`amdgpu-amdhsa-kernel-descriptor`. If a directive is not specified, it is
assumed to have its default value, unless it is marked as "Required", in which
case it is an error to omit the directive. This list of directives is
terminated by an ``.end_amdhsa_kernel`` directive.

.. table:: AMDHSA Kernel Assembler Directives
     :name: amdhsa-kernel-directives-table

     ======================================================== ================ ============ ===================
     Directive                                                Default          Supported On Description
     ======================================================== ================ ============ ===================
     ``.amdhsa_group_segment_fixed_size``                     0                GFX6-GFX9    Controls GROUP_SEGMENT_FIXED_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_private_segment_fixed_size``                   0                GFX6-GFX9    Controls PRIVATE_SEGMENT_FIXED_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_private_segment_buffer``             0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_BUFFER in :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_dispatch_ptr``                       0                GFX6-GFX9    Controls ENABLE_SGPR_DISPATCH_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_queue_ptr``                          0                GFX6-GFX9    Controls ENABLE_SGPR_QUEUE_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_kernarg_segment_ptr``                0                GFX6-GFX9    Controls ENABLE_SGPR_KERNARG_SEGMENT_PTR in :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_dispatch_id``                        0                GFX6-GFX9    Controls ENABLE_SGPR_DISPATCH_ID in :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_flat_scratch_init``                  0                GFX6-GFX9    Controls ENABLE_SGPR_FLAT_SCRATCH_INIT in :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_user_sgpr_private_segment_size``               0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_SIZE in :ref:`amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table`.
     ``.amdhsa_system_sgpr_private_segment_wavefront_offset`` 0                GFX6-GFX9    Controls ENABLE_SGPR_PRIVATE_SEGMENT_WAVEFRONT_OFFSET in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_system_sgpr_workgroup_id_x``                   1                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_X in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_system_sgpr_workgroup_id_y``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_Y in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_system_sgpr_workgroup_id_z``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_ID_Z in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_system_sgpr_workgroup_info``                   0                GFX6-GFX9    Controls ENABLE_SGPR_WORKGROUP_INFO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_system_vgpr_workitem_id``                      0                GFX6-GFX9    Controls ENABLE_VGPR_WORKITEM_ID in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`. Possible values are defined in :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`.
     ``.amdhsa_next_free_vgpr``                               Required         GFX6-GFX9    Maximum VGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WORKITEM_VGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_next_free_sgpr``                               Required         GFX6-GFX9    Maximum SGPR number explicitly referenced, plus one. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_reserve_vcc``                                  1                GFX6-GFX9    Whether the kernel may use the special VCC SGPR. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_reserve_flat_scratch``                         1                GFX7-GFX9    Whether the kernel may use flat instructions to access scratch memory. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_reserve_xnack_mask``                           Target           GFX8-GFX9    Whether the kernel may trigger XNACK replay. Used to calculate GRANULATED_WAVEFRONT_SGPR_COUNT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
                                                              Feature
                                                              Specific
                                                              (+xnack)
     ``.amdhsa_float_round_mode_32``                          0                GFX6-GFX9    Controls FLOAT_ROUND_MODE_32 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_round_mode_16_64``                       0                GFX6-GFX9    Controls FLOAT_ROUND_MODE_16_64 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_32``                         0                GFX6-GFX9    Controls FLOAT_DENORM_MODE_32 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_float_denorm_mode_16_64``                      3                GFX6-GFX9    Controls FLOAT_DENORM_MODE_16_64 in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`. Possible values are defined in :ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
     ``.amdhsa_dx10_clamp``                                   1                GFX6-GFX9    Controls ENABLE_DX10_CLAMP in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_ieee_mode``                                    1                GFX6-GFX9    Controls ENABLE_IEEE_MODE in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_fp16_overflow``                                0                GFX9         Controls FP16_OVFL in :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
     ``.amdhsa_exception_fp_ieee_invalid_op``                 0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_exception_fp_denorm_src``                      0                GFX6-GFX9    Controls ENABLE_EXCEPTION_FP_DENORMAL_SOURCE in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_exception_fp_ieee_div_zero``                   0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_exception_fp_ieee_overflow``                   0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_exception_fp_ieee_underflow``                  0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_exception_fp_ieee_inexact``                    0                GFX6-GFX9    Controls ENABLE_EXCEPTION_IEEE_754_FP_INEXACT in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ``.amdhsa_exception_int_div_zero``                       0                GFX6-GFX9    Controls ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO in :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
     ======================================================== ================ ============ ===================
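
As a sketch of how these directives combine, the following descriptor for a
hypothetical kernel ``my_kernel`` sets both required directives and overrides
a few defaults from the table. The values are placeholders and are not taken
from a real kernel.

.. code-block:: none

  .amdhsa_kernel my_kernel
    .amdhsa_group_segment_fixed_size 1024
    .amdhsa_user_sgpr_kernarg_segment_ptr 1
    .amdhsa_system_sgpr_workgroup_id_y 1
    .amdhsa_next_free_vgpr 8
    .amdhsa_next_free_sgpr 16
    .amdhsa_reserve_vcc 0
    .amdhsa_ieee_mode 0
  .end_amdhsa_kernel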

Example HSA Source Code (-mattr=+code-object-v3)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is an example of a minimal assembly source file, defining one HSA kernel:

.. code-block:: none

  .amdgcn_target "amdgcn-amd-amdhsa--gfx900+xnack" // optional

  .text
  .globl hello_world
  .p2align 8
  .type hello_world,@function
  hello_world:
    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size hello_world, .Lfunc_end0-hello_world

  .rodata
  .p2align 6
  .amdhsa_kernel hello_world
    .amdhsa_user_sgpr_kernarg_segment_ptr 1
    .amdhsa_next_free_vgpr .amdgcn.next_free_vgpr
    .amdhsa_next_free_sgpr .amdgcn.next_free_sgpr
  .end_amdhsa_kernel

Additional Documentation
========================

.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
.. [CLANG-ATTR] `Attributes in Clang <http://clang.llvm.org/docs/AttributeReference.html>`__