=============================
User Guide for AMDGPU Backend
=============================

.. contents::
   :local:

Introduction
============

The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
R600 family up until the current GCN families. It lives in the
``lib/Target/AMDGPU`` directory.

LLVM
====

.. _amdgpu-target-triples:

Target Triples
--------------

Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
specify the target triple:

.. table:: AMDGPU Architectures
   :name: amdgpu-architecture-table

   ============ ==============================================================
   Architecture Description
   ============ ==============================================================
   ``r600``     AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
   ``amdgcn``   AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
   ============ ==============================================================

.. table:: AMDGPU Vendors
   :name: amdgpu-vendor-table

   ============ ==============================================================
   Vendor       Description
   ============ ==============================================================
   ``amd``      Can be used for all AMD GPU usage.
   ``mesa3d``   Can be used if the OS is ``mesa3d``.
   ============ ==============================================================

.. table:: AMDGPU Operating Systems
   :name: amdgpu-os-table

   ============== ============================================================
   OS             Description
   ============== ============================================================
   <empty>        Defaults to the unknown OS.
   ``amdhsa``     Compute kernels executed on HSA [HSA]_ compatible runtimes
                  such as AMD's ROCm [AMD-ROCm]_.
   ``amdpal``     Graphic shaders and compute kernels executed on AMD PAL
                  runtime.
   ``mesa3d``     Graphic shaders and compute kernels executed on Mesa 3D
                  runtime.
   ============== ============================================================

.. table:: AMDGPU Environments
   :name: amdgpu-environment-table

   ============ ==============================================================
   Environment  Description
   ============ ==============================================================
   <empty>      Default.
   ============ ==============================================================

.. _amdgpu-processors:

Processors
----------

Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
names from both the Processor and Alternative Processor columns can be used.

.. table:: AMDGPU Processors
   :name: amdgpu-processor-table

   =========== =============== ============ ===== ========= ======= ==================
   Processor   Alternative     Target       dGPU/ Target    ROCm    Example
               Processor       Triple       APU   Features  Support Products
                               Architecture       Supported
                                                  [Default]
   =========== =============== ============ ===== ========= ======= ==================
   Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
   -----------------------------------------------------------------------------------
   ``r600``                    ``r600``     dGPU
   ``r630``                    ``r600``     dGPU
   ``rs880``                   ``r600``     dGPU
   ``rv670``                   ``r600``     dGPU
   Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
   -----------------------------------------------------------------------------------
   ``rv710``                   ``r600``     dGPU
   ``rv730``                   ``r600``     dGPU
   ``rv770``                   ``r600``     dGPU
   Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
   -----------------------------------------------------------------------------------
   ``cedar``                   ``r600``     dGPU
   ``cypress``                 ``r600``     dGPU
   ``juniper``                 ``r600``     dGPU
   ``redwood``                 ``r600``     dGPU
   ``sumo``                    ``r600``     dGPU
   Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
   -----------------------------------------------------------------------------------
   ``barts``                   ``r600``     dGPU
   ``caicos``                  ``r600``     dGPU
   ``cayman``                  ``r600``     dGPU
   ``turks``                   ``r600``     dGPU
   GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
   -----------------------------------------------------------------------------------
   ``gfx600``  - ``tahiti``    ``amdgcn``   dGPU
   ``gfx601``  - ``hainan``    ``amdgcn``   dGPU
               - ``oland``
               - ``pitcairn``
               - ``verde``
   GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
   -----------------------------------------------------------------------------------
   ``gfx700``  - ``kaveri``    ``amdgcn``   APU                     - A6-7000
                                                                    - A6 Pro-7050B
                                                                    - A8-7100
                                                                    - A8 Pro-7150B
                                                                    - A10-7300
                                                                    - A10 Pro-7350B
                                                                    - FX-7500
                                                                    - A8-7200P
                                                                    - A10-7400P
                                                                    - FX-7600P
   ``gfx701``  - ``hawaii``    ``amdgcn``   dGPU            ROCm    - FirePro W8100
                                                                    - FirePro W9100
                                                                    - FirePro S9150
                                                                    - FirePro S9170
   ``gfx702``                  ``amdgcn``   dGPU            ROCm    - Radeon R9 290
                                                                    - Radeon R9 290x
                                                                    - Radeon R390
                                                                    - Radeon R390x
   ``gfx703``  - ``kabini``    ``amdgcn``   APU                     - E1-2100
               - ``mullins``                                        - E1-2200
                                                                    - E1-2500
                                                                    - E2-3000
                                                                    - E2-3800
                                                                    - A4-5000
                                                                    - A4-5100
                                                                    - A6-5200
                                                                    - A4 Pro-3340B
   ``gfx704``  - ``bonaire``   ``amdgcn``   dGPU                    - Radeon HD 7790
                                                                    - Radeon HD 8770
                                                                    - R7 260
                                                                    - R7 260X
   GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
   -----------------------------------------------------------------------------------
   ``gfx801``  - ``carrizo``   ``amdgcn``   APU   - xnack           - A6-8500P
                                                    [on]            - Pro A6-8500B
                                                                    - A8-8600P
                                                                    - Pro A8-8600B
                                                                    - FX-8800P
                                                                    - Pro A12-8800B
   \                           ``amdgcn``   APU   - xnack   ROCm    - A10-8700P
                                                    [on]            - Pro A10-8700B
                                                                    - A10-8780P
   \                           ``amdgcn``   APU   - xnack           - A10-9600P
                                                    [on]            - A10-9630P
                                                                    - A12-9700P
                                                                    - A12-9730P
                                                                    - FX-9800P
                                                                    - FX-9830P
   \                           ``amdgcn``   APU   - xnack           - E2-9010
                                                    [on]            - A6-9210
                                                                    - A9-9410
   ``gfx802``  - ``iceland``   ``amdgcn``   dGPU  - xnack   ROCm    - FirePro S7150
               - ``tonga``                          [off]           - FirePro S7100
                                                                    - FirePro W7100
                                                                    - Radeon R285
                                                                    - Radeon R9 380
                                                                    - Radeon R9 385
                                                                    - Mobile FirePro
                                                                      M7170
   ``gfx803``  - ``fiji``      ``amdgcn``   dGPU  - xnack   ROCm    - Radeon R9 Nano
                                                    [off]           - Radeon R9 Fury
                                                                    - Radeon R9 FuryX
                                                                    - Radeon Pro Duo
                                                                    - FirePro S9300x2
                                                                    - Radeon Instinct MI8
   \           - ``polaris10`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 470
                                                    [off]           - Radeon RX 480
                                                                    - Radeon Instinct MI6
   \           - ``polaris11`` ``amdgcn``   dGPU  - xnack   ROCm    - Radeon RX 460
                                                    [off]
   ``gfx810``  - ``stoney``    ``amdgcn``   APU   - xnack
                                                    [on]
   GCN GFX9 [AMD-GCN-GFX9]_
   -----------------------------------------------------------------------------------
   ``gfx900``                  ``amdgcn``   dGPU  - xnack   ROCm    - Radeon Vega
                                                    [off]             Frontier Edition
                                                                    - Radeon RX Vega 56
                                                                    - Radeon RX Vega 64
                                                                    - Radeon RX Vega 64
                                                                      Liquid
                                                                    - Radeon Instinct MI25
   ``gfx902``                  ``amdgcn``   APU   - xnack           - Ryzen 3 2200G
                                                    [on]            - Ryzen 5 2400G
   =========== =============== ============ ===== ========= ======= ==================

.. _amdgpu-target-features:

Target Features
---------------

Target features control how code is generated to support certain
processor specific features. Not all target features are supported by
all processors. The runtime must ensure that the features supported by
the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.

The target features supported by each processor, and the default value
used if not specified explicitly, are listed in
:ref:`amdgpu-processor-table`.

Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
target features.

For example:

``-mxnack``
  Enable the ``xnack`` feature.
``-mno-xnack``
  Disable the ``xnack`` feature.

.. table:: AMDGPU Target Features
   :name: amdgpu-target-feature-table

   ============== ==================================================
   Target Feature Description
   ============== ==================================================
   -m[no-]xnack   Enable/disable generating code that has
                  memory clauses that are compatible with
                  having XNACK replay enabled.

                  This is used for demand paging and page
                  migration. If XNACK replay is enabled in
                  the device, then if a page fault occurs
                  the code may execute incorrectly if the
                  ``xnack`` feature is not enabled. Executing
                  code that has the feature enabled on a
                  device that does not have XNACK replay
                  enabled will execute correctly, but may
                  be less performant than code with the
                  feature disabled.
   ============== ==================================================

.. _amdgpu-address-spaces:

Address Spaces
--------------

The AMDGPU backend uses the following address space mappings.

The memory space names used in the table, aside from the region memory space,
are from the OpenCL standard.

The LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).

.. table:: Address Space Mapping
   :name: amdgpu-address-space-mapping-table

   ================== =================
   LLVM Address Space Memory Space
   ================== =================
   0                  Generic (Flat)
   1                  Global
   2                  Region (GDS)
   3                  Local (group/LDS)
   4                  Constant
   5                  Private (Scratch)
   6                  Constant 32-bit
   ================== =================
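
As an illustration only, the mapping above can be written down as a small
Python enumeration. The names ``AMDGPUAddressSpace`` and ``pointer_type`` are
hypothetical helpers, not part of LLVM, and the generated text assumes the
typed-pointer LLVM IR spelling (for example ``float addrspace(3)*``):

.. code-block:: python

   from enum import IntEnum


   class AMDGPUAddressSpace(IntEnum):
       """LLVM address space numbers from the Address Space Mapping table."""
       GENERIC = 0         # Generic (Flat)
       GLOBAL = 1          # Global
       REGION = 2          # Region (GDS)
       LOCAL = 3           # Local (group/LDS)
       CONSTANT = 4        # Constant
       PRIVATE = 5         # Private (Scratch)
       CONSTANT_32BIT = 6  # Constant 32-bit


   def pointer_type(pointee: str, space: int) -> str:
       """Return typed-pointer LLVM IR text for a pointer in the given space.

       Hypothetical helper: address space 0 is written without an explicit
       ``addrspace`` qualifier, any other space is spelled out.
       """
       space = AMDGPUAddressSpace(space)  # raises ValueError for unknown numbers
       if space == AMDGPUAddressSpace.GENERIC:
           return f"{pointee}*"
       return f"{pointee} addrspace({int(space)})*"

Using an ``IntEnum`` keeps the numeric values usable wherever a plain integer
is expected while rejecting numbers that are not in the table.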

.. _amdgpu-memory-scopes:

Memory Scopes
-------------

This section describes the LLVM memory synchronization scopes supported by the
AMDGPU backend memory model when the target triple OS is ``amdhsa`` (see
:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).

The memory model supported is based on the HSA memory model [HSA]_ which is
based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
relation is transitive over the synchronizes-with relation independent of scope,
and synchronizes-with allows the memory scope instances to be inclusive (see
table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).

This is different to the OpenCL [OpenCL]_ memory model which does not have scope
inclusion and requires the memory scopes to exactly match. However, this
is conservatively correct for OpenCL.

.. table:: AMDHSA LLVM Sync Scopes
   :name: amdgpu-amdhsa-llvm-sync-scopes-table

   ================ ==========================================================
   LLVM Sync Scope  Description
   ================ ==========================================================
   none             The default: ``system``.

                    Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system``.
                    - ``agent`` and executed by a thread on the same agent.
                    - ``workgroup`` and executed by a thread in the same
                      workgroup.
                    - ``wavefront`` and executed by a thread in the same
                      wavefront.

   ``agent``        Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system`` or ``agent`` and executed by a thread on the
                      same agent.
                    - ``workgroup`` and executed by a thread in the same
                      workgroup.
                    - ``wavefront`` and executed by a thread in the same
                      wavefront.

   ``workgroup``    Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system``, ``agent`` or ``workgroup`` and executed by a
                      thread in the same workgroup.
                    - ``wavefront`` and executed by a thread in the same
                      wavefront.

   ``wavefront``    Synchronizes with, and participates in modification and
                    seq_cst total orderings with, other operations (except
                    image operations) for all address spaces (except private,
                    or generic that accesses private) provided the other
                    operation's sync scope is:

                    - ``system``, ``agent``, ``workgroup`` or ``wavefront``
                      and executed by a thread in the same wavefront.

   ``singlethread`` Only synchronizes with, and participates in modification
                    and seq_cst total orderings with, other operations (except
                    image operations) running in the same thread for all
                    address spaces (for example, in signal handlers).
   ================ ==========================================================
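
One way to read the table is that two operations can synchronize when both
threads fall inside one instance of the narrower of the two scopes. The sketch
below encodes that reading in Python. It is an interpretation, not the
backend's implementation: ``Thread``, ``SCOPE_ORDER`` and ``may_synchronize``
are hypothetical names, identifying a thread by nested agent, workgroup,
wavefront and lane numbers is an assumption, ``singlethread`` is treated
symmetrically (same thread required), and the exclusions for image operations
and private memory are not modelled:

.. code-block:: python

   from typing import NamedTuple

   # Sync scopes ordered from narrowest to widest.
   SCOPE_ORDER = ("singlethread", "wavefront", "workgroup", "agent", "system")

   # Number of leading Thread fields that must match for two threads to be
   # in the same instance of a scope (0 for system: every thread qualifies).
   SHARED_FIELDS = {
       "system": 0,
       "agent": 1,
       "workgroup": 2,
       "wavefront": 3,
       "singlethread": 4,
   }


   class Thread(NamedTuple):
       """Hypothetical thread identity: each field is nested in the previous."""
       agent: int
       workgroup: int
       wavefront: int
       lane: int


   def may_synchronize(scope_a: str, thread_a: Thread,
                       scope_b: str, thread_b: Thread) -> bool:
       """Apply scope inclusion: the narrower scope must cover both threads."""
       # An empty scope name stands for the default, which is "system".
       scope_a = scope_a or "system"
       scope_b = scope_b or "system"
       narrower = min(scope_a, scope_b, key=SCOPE_ORDER.index)
       shared = SHARED_FIELDS[narrower]
       return thread_a[:shared] == thread_b[:shared]

For example, an ``agent`` scope operation and a ``system`` scope operation
are treated as synchronizing only when both threads run on the same agent,
which matches the ``agent`` row of the table.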

AMDGPU Intrinsics
-----------------

The AMDGPU backend implements the following intrinsics.

This section is WIP.

.. TODO
   List AMDGPU intrinsics

Code Object
===========

The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
can be linked by ``lld`` to produce a standard ELF shared code object which can
be loaded and executed on an AMDGPU target.

Header
------

The AMDGPU backend uses the following ELF header:

.. table:: AMDGPU ELF Header
   :name: amdgpu-elf-header-table

   ========================== ===============================
   Field                      Value
   ========================== ===============================
   ``e_ident[EI_CLASS]``      - ``ELFCLASS32``
                              - ``ELFCLASS64``
   ``e_ident[EI_DATA]``       ``ELFDATA2LSB``
   ``e_ident[EI_OSABI]``      - ``ELFOSABI_NONE``
                              - ``ELFOSABI_AMDGPU_HSA``
                              - ``ELFOSABI_AMDGPU_PAL``
                              - ``ELFOSABI_AMDGPU_MESA3D``
   ``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
                              - ``ELFABIVERSION_AMDGPU_PAL``
                              - ``ELFABIVERSION_AMDGPU_MESA3D``
   ``e_type``                 - ``ET_REL``
                              - ``ET_DYN``
   ``e_machine``              ``EM_AMDGPU``
   ``e_entry``                0
   ``e_flags``                See :ref:`amdgpu-elf-header-e_flags-table`
   ========================== ===============================

..

.. table:: AMDGPU ELF Header Enumeration Values
   :name: amdgpu-elf-header-enumeration-values-table

   =============================== =====
   Name                            Value
   =============================== =====
   ``EM_AMDGPU``                   224
   ``ELFOSABI_NONE``               0
   ``ELFOSABI_AMDGPU_HSA``         64
   ``ELFOSABI_AMDGPU_PAL``         65
   ``ELFOSABI_AMDGPU_MESA3D``      66
   ``ELFABIVERSION_AMDGPU_HSA``    1
   ``ELFABIVERSION_AMDGPU_PAL``    0
   ``ELFABIVERSION_AMDGPU_MESA3D`` 0
   =============================== =====
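
As a rough sketch of how a tool might inspect these identification bytes, the
Python below decodes the first 16 bytes of a code object. The function name
``describe_e_ident`` is hypothetical. The byte offsets (``EI_CLASS`` = 4,
``EI_DATA`` = 5, ``EI_OSABI`` = 7, ``EI_ABIVERSION`` = 8) and the
``ELFCLASS``/``ELFDATA`` encodings are taken from the generic ELF
specification rather than from this document; only the OS ABI values come
from the table above:

.. code-block:: python

   # Offsets into e_ident, as defined by the generic ELF specification.
   EI_CLASS, EI_DATA, EI_OSABI, EI_ABIVERSION = 4, 5, 7, 8

   ELF_CLASSES = {1: "ELFCLASS32", 2: "ELFCLASS64"}
   ELF_DATA = {1: "ELFDATA2LSB"}

   # OS ABI values from the enumeration values table.
   OS_ABIS = {
       0: "ELFOSABI_NONE",
       64: "ELFOSABI_AMDGPU_HSA",
       65: "ELFOSABI_AMDGPU_PAL",
       66: "ELFOSABI_AMDGPU_MESA3D",
   }


   def describe_e_ident(e_ident: bytes) -> dict:
       """Decode the class, data encoding, OS ABI and ABI version bytes."""
       if len(e_ident) < 16 or e_ident[:4] != b"\x7fELF":
           raise ValueError("not an ELF identification block")
       return {
           "class": ELF_CLASSES.get(e_ident[EI_CLASS], "unknown"),
           "data": ELF_DATA.get(e_ident[EI_DATA], "unknown"),
           "osabi": OS_ABIS.get(e_ident[EI_OSABI], "unknown"),
           "abiversion": e_ident[EI_ABIVERSION],
       }

A caller would typically pass the first 16 bytes read from the code object
file and compare the ``osabi`` entry against the OS of the target triple.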

``e_ident[EI_CLASS]``
  The ELF class is:

  * ``ELFCLASS32`` for ``r600`` architecture.

  * ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
    bit applications.

``e_ident[EI_DATA]``
  All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.

``e_ident[EI_OSABI]``
  One of the following AMD GPU architecture specific OS ABIs
  (see :ref:`amdgpu-os-table`):

  * ``ELFOSABI_NONE`` for unknown OS.

  * ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.

  * ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.

  * ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3d`` OS.

``e_ident[EI_ABIVERSION]``
  The ABI version of the AMD GPU architecture specific OS ABI to which the code
  object conforms:

  * ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
    runtime ABI.

  * ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
    3D runtime ABI.

``e_type``
  Can be one of the following values:

  ``ET_REL``
    The type produced by the AMD GPU backend compiler as it is a relocatable
    code object.

  ``ET_DYN``
    The type produced by the linker as it is a shared code object.

  The AMD HSA runtime loader requires an ``ET_DYN`` code object.

``e_machine``
  The value ``EM_AMDGPU`` is used for the machine for all processors supported
  by the ``r600`` and ``amdgcn`` architectures (see
  :ref:`amdgpu-processor-table`). The specific processor is specified in the
  ``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
  :ref:`amdgpu-elf-header-e_flags-table`).

``e_entry``
  The entry point is 0 as the entry points for individual kernels must be
  selected in order to invoke them through AQL packets.

``e_flags``
  The AMDGPU backend uses the following ELF header flags:

  .. table:: AMDGPU ELF Header ``e_flags``
     :name: amdgpu-elf-header-e_flags-table

     ================================= ========== =============================
     Name                              Value      Description
     ================================= ========== =============================
     AMDGPU Processor Flag                        See :ref:`amdgpu-processor-table`.
     -------------------------------------------- -----------------------------
     ``EF_AMDGPU_MACH``                0x000000ff AMDGPU processor selection
                                                  mask for
                                                  ``EF_AMDGPU_MACH_xxx`` values
                                                  defined in
                                                  :ref:`amdgpu-ef-amdgpu-mach-table`.
     ``EF_AMDGPU_XNACK``               0x00000100 Indicates if the ``xnack``
                                                  target feature is
                                                  enabled for all code
                                                  contained in the code object.
                                                  If the processor
                                                  does not support the
                                                  ``xnack`` target
                                                  feature then this must
                                                  be 0.
                                                  See
                                                  :ref:`amdgpu-target-features`.
     ================================= ========== =============================

  .. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
     :name: amdgpu-ef-amdgpu-mach-table

     ================================= ========== =============================
     Name                              Value      Description (see
                                                  :ref:`amdgpu-processor-table`)
     ================================= ========== =============================
     ``EF_AMDGPU_MACH_NONE``           0x000      not specified
     ``EF_AMDGPU_MACH_R600_R600``      0x001      ``r600``
     ``EF_AMDGPU_MACH_R600_R630``      0x002      ``r630``
     ``EF_AMDGPU_MACH_R600_RS880``     0x003      ``rs880``
     ``EF_AMDGPU_MACH_R600_RV670``     0x004      ``rv670``
     ``EF_AMDGPU_MACH_R600_RV710``     0x005      ``rv710``
     ``EF_AMDGPU_MACH_R600_RV730``     0x006      ``rv730``
     ``EF_AMDGPU_MACH_R600_RV770``     0x007      ``rv770``
     ``EF_AMDGPU_MACH_R600_CEDAR``     0x008      ``cedar``
     ``EF_AMDGPU_MACH_R600_CYPRESS``   0x009      ``cypress``
     ``EF_AMDGPU_MACH_R600_JUNIPER``   0x00a      ``juniper``
     ``EF_AMDGPU_MACH_R600_REDWOOD``   0x00b      ``redwood``
     ``EF_AMDGPU_MACH_R600_SUMO``      0x00c      ``sumo``
     ``EF_AMDGPU_MACH_R600_BARTS``     0x00d      ``barts``
     ``EF_AMDGPU_MACH_R600_CAICOS``    0x00e      ``caicos``
     ``EF_AMDGPU_MACH_R600_CAYMAN``    0x00f      ``cayman``
     ``EF_AMDGPU_MACH_R600_TURKS``     0x010      ``turks``
     reserved                          0x011 -    Reserved for ``r600``
                                       0x01f      architecture processors.
     ``EF_AMDGPU_MACH_AMDGCN_GFX600``  0x020      ``gfx600``
     ``EF_AMDGPU_MACH_AMDGCN_GFX601``  0x021      ``gfx601``
     ``EF_AMDGPU_MACH_AMDGCN_GFX700``  0x022      ``gfx700``
     ``EF_AMDGPU_MACH_AMDGCN_GFX701``  0x023      ``gfx701``
     ``EF_AMDGPU_MACH_AMDGCN_GFX702``  0x024      ``gfx702``
     ``EF_AMDGPU_MACH_AMDGCN_GFX703``  0x025      ``gfx703``
     ``EF_AMDGPU_MACH_AMDGCN_GFX704``  0x026      ``gfx704``
     reserved                          0x027      Reserved.
     ``EF_AMDGPU_MACH_AMDGCN_GFX801``  0x028      ``gfx801``
     ``EF_AMDGPU_MACH_AMDGCN_GFX802``  0x029      ``gfx802``
     ``EF_AMDGPU_MACH_AMDGCN_GFX803``  0x02a      ``gfx803``
     ``EF_AMDGPU_MACH_AMDGCN_GFX810``  0x02b      ``gfx810``
     ``EF_AMDGPU_MACH_AMDGCN_GFX900``  0x02c      ``gfx900``
     ``EF_AMDGPU_MACH_AMDGCN_GFX902``  0x02d      ``gfx902``
     reserved                          0x02e      Reserved.
     reserved                          0x02f      Reserved.
     reserved                          0x030      Reserved.
     ================================= ========== =============================
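
The two ``e_flags`` fields can be separated with simple masking. The Python
below is a minimal sketch using the mask and processor values from the tables
above; ``decode_e_flags`` and ``MACH_NAMES`` are hypothetical names, and
reserved or unspecified values are reported as ``None``:

.. code-block:: python

   EF_AMDGPU_MACH = 0x000000FF   # processor selection mask
   EF_AMDGPU_XNACK = 0x00000100  # xnack target feature enabled

   # r600 architecture processors occupy 0x001 through 0x010, in this order.
   _R600_NAMES = [
       "r600", "r630", "rs880", "rv670", "rv710", "rv730", "rv770", "cedar",
       "cypress", "juniper", "redwood", "sumo", "barts", "caicos", "cayman",
       "turks",
   ]

   MACH_NAMES = {value: name for value, name in enumerate(_R600_NAMES, start=1)}
   MACH_NAMES.update({
       0x020: "gfx600", 0x021: "gfx601",
       0x022: "gfx700", 0x023: "gfx701", 0x024: "gfx702",
       0x025: "gfx703", 0x026: "gfx704",
       0x028: "gfx801", 0x029: "gfx802", 0x02A: "gfx803", 0x02B: "gfx810",
       0x02C: "gfx900", 0x02D: "gfx902",
   })


   def decode_e_flags(e_flags: int):
       """Return (processor name or None, xnack enabled) for an e_flags word."""
       mach = e_flags & EF_AMDGPU_MACH
       xnack = bool(e_flags & EF_AMDGPU_XNACK)
       return MACH_NAMES.get(mach), xnack

For instance, an ``e_flags`` value of ``0x12c`` carries the ``gfx900``
processor value in the low byte with the ``EF_AMDGPU_XNACK`` bit set.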
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	554
				555	Sections
				556	--------
				557
				558	An AMDGPU target ELF code object has the standard ELF sections which include:
				559
				560	.. table:: AMDGPU ELF Sections
				561	:name: amdgpu-elf-sections-table
				562
				563	================== ================ =================================
				564	Name Type Attributes
				565	================== ================ =================================
				566	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				567	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				568	``.debug_``\ \* ``SHT_PROGBITS`` none
				569	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				570	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				571	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				572	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				573	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				574	``.note`` ``SHT_NOTE`` none
				575	``.rela``\ name ``SHT_RELA`` none
				576	``.rela.dyn`` ``SHT_RELA`` none
				577	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				578	``.shstrtab`` ``SHT_STRTAB`` none
				579	``.strtab`` ``SHT_STRTAB`` none
				580	``.symtab`` ``SHT_SYMTAB`` none
				581	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				582	================== ================ =================================
				583
				584	These sections have their standard meanings (see [ELF]_) and are only generated
				585	if needed.
				586
				587	``.debug``\ \*
				588	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				589	DWARF produced by the AMDGPU backend.
				590
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	591	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	592	The standard sections used by a dynamic loader.
				593
				594	``.note``
				595	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				596	backend.
				597
				598	``.rela``\ name, ``.rela.dyn``
				599	For relocatable code objects, name is the name of the section that the
				600	relocation records apply. For example, ``.rela.text`` is the section name for
				601	relocation records associated with the ``.text`` section.
				602
				603	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				604	records from each of the relocatable code object's ``.rela``\ name sections.
				605
				606	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				607	the AMDGPU backend.
				608
				609	``.text``
  The executable machine code for the kernels and the functions they call.
  Generated as position independent code. See :ref:`amdgpu-code-conventions`
  for information on conventions used in the ISA generation.
				613
				614	.. _amdgpu-note-records:
				615
				616	Note Records
				617	------------
				618
As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
aligned. In addition, minimal zero byte padding must be generated to ensure the
``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
``.note`` section must be at least 4 to indicate at least 4 byte alignment.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	624
The AMDGPU backend code object uses the following ELF note records in the
``.note`` section. The Description column specifies the layout of the note
record's ``desc`` field. All fields are consecutive bytes. Note records with
variable size strings have a corresponding ``*_size`` field that specifies the
number of bytes, including the terminating null character, in the string. The
string(s) come immediately after the preceding fields.
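
The padding rules above can be illustrated with a small Python sketch that packs one note record. The helper names ``pack_note`` and ``_pad4`` are hypothetical, and little-endian byte order is an assumption made for the example. The header layout (``namesz``, ``descsz``, ``type``) is the standard ELF note layout, and the size fields record the unpadded lengths.

```python
import struct

NT_AMD_AMDGPU_HSA_METADATA = 10  # value from the enumeration table


def _pad4(data: bytes) -> bytes:
    """Append the minimal number of zero bytes to reach a multiple of 4."""
    return data + b"\0" * (-len(data) % 4)


def pack_note(name: str, note_type: int, desc: bytes) -> bytes:
    """Pack one ELF note record (hypothetical helper, little-endian assumed)."""
    name_bytes = name.encode("ascii") + b"\0"  # namesz counts the terminator
    # namesz and descsz hold the unpadded sizes of the two fields.
    header = struct.pack("<III", len(name_bytes), len(desc), note_type)
    return header + _pad4(name_bytes) + _pad4(desc)


if __name__ == "__main__":
    record = pack_note("AMD", NT_AMD_AMDGPU_HSA_METADATA, b"abcde")
    print(len(record), record.hex())
```

Because the 12 byte header and the padded ``name`` are both multiples of 4 bytes, the ``desc`` field always starts on a 4 byte boundary relative to the start of the record.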
				631
				632	Additional note records can be present.
				633
  .. table:: AMDGPU ELF Note Records
     :name: amdgpu-elf-note-records-table

     ===== ============================== ======================================
     Name  Type                           Description
     ===== ============================== ======================================
     "AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
     ===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	642
				643	..
				644
  .. table:: AMDGPU ELF Note Record Enumeration Values
     :name: amdgpu-elf-note-record-enumeration-values-table

     ============================== =====
     Name                           Value
     ============================== =====
     reserved                       0-9
     ``NT_AMD_AMDGPU_HSA_METADATA`` 10
     reserved                       11
     ============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	655
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	656	``NT_AMD_AMDGPU_HSA_METADATA``
				657	Specifies extensible metadata associated with the code objects executed on HSA
				658	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				659	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				660	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				661	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	662
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	663	.. _amdgpu-symbols:
				664
				665	Symbols
				666	-------
				667
				668	Symbols include the following:
				669
  .. table:: AMDGPU ELF Symbols
     :name: amdgpu-elf-symbols-table

     ===================== ============== ============= ==================
     Name                  Type           Section       Description
     ===================== ============== ============= ==================
     link-name             ``STT_OBJECT`` - ``.data``   Global variable
                                          - ``.rodata``
                                          - ``.bss``
     link-name\ ``@kd``    ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
     link-name             ``STT_FUNC``   - ``.text``   Kernel entry point
     ===================== ============== ============= ==================
				682
				683	Global variable
				684	Global variables both used and defined by the compilation unit.
				685
  If the symbol is defined in the compilation unit then it is allocated in the
  appropriate section according to whether it has initialized data or is
  read-only.
				688
  If the symbol is external then its section is ``SHN_UNDEF`` and the loader
  will resolve relocations using the definition provided by another code object
  or explicitly defined by the runtime.
				692
  All global symbols, whether defined in the compilation unit or external, are
  accessed by the machine code indirectly through a GOT entry. This allows them
  to be preemptable. The GOT is only supported when the target triple OS is
  ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	697
				698	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	699	Add description of linked shared object symbols. Seems undefined symbols
				700	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	701
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	702	Kernel descriptor
				703	Every HSA kernel has an associated kernel descriptor. It is the address of the
				704	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				705	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				706	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				707
				708	Kernel entry point
				709	Every HSA kernel also has a symbol for its machine code entry point.
				710
				711	.. _amdgpu-relocation-records:
				712
				713	Relocation Records
				714	------------------
				715
The AMDGPU backend generates ``Elf64_Rela`` relocation records. The supported
relocatable fields are:
				718
				719	``word32``
				720	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				721	alignment. These values use the same byte order as other word values in the
				722	AMD GPU architecture.
				723
				724	``word64``
				725	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				726	alignment. These values use the same byte order as other word values in the
				727	AMD GPU architecture.
				728
The following notation is used to specify relocation calculations:
				730
				731	A
				732	Represents the addend used to compute the value of the relocatable field.
				733
				734	G
  Represents the offset into the global offset table at which the relocation
  entry's symbol will reside during execution.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	737
				738	GOT
				739	Represents the address of the global offset table.
				740
				741	P
  Represents the place (section offset for ``ET_REL`` or address for ``ET_DYN``)
  of the storage unit being relocated (computed using ``r_offset``).
				744
				745	S
  Represents the value of the symbol whose index resides in the relocation
  entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				748
				749	B
				750	Represents the base address of a loaded executable or shared object which is
				751	the difference between the ELF address and the actual load address. Relocations
				752	using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	753
				754	The following relocation types are supported:
				755
  .. table:: AMDGPU ELF Relocation Records
     :name: amdgpu-elf-relocation-records-table

     ========================== ======= ===== ========== ==============================
     Relocation Type            Kind    Value Field      Calculation
     ========================== ======= ===== ========== ==============================
     ``R_AMDGPU_NONE``                  0     none       none
     ``R_AMDGPU_ABS32_LO``      Static, 1     ``word32`` (S + A) & 0xFFFFFFFF
                                Dynamic
     ``R_AMDGPU_ABS32_HI``      Static, 2     ``word32`` (S + A) >> 32
                                Dynamic
     ``R_AMDGPU_ABS64``         Static, 3     ``word64`` S + A
                                Dynamic
     ``R_AMDGPU_REL32``         Static  4     ``word32`` S + A - P
     ``R_AMDGPU_REL64``         Static  5     ``word64`` S + A - P
     ``R_AMDGPU_ABS32``         Static, 6     ``word32`` S + A
                                Dynamic
     ``R_AMDGPU_GOTPCREL``      Static  7     ``word32`` G + GOT + A - P
     ``R_AMDGPU_GOTPCREL32_LO`` Static  8     ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_GOTPCREL32_HI`` Static  9     ``word32`` (G + GOT + A - P) >> 32
     ``R_AMDGPU_REL32_LO``      Static  10    ``word32`` (S + A - P) & 0xFFFFFFFF
     ``R_AMDGPU_REL32_HI``      Static  11    ``word32`` (S + A - P) >> 32
     reserved                           12
     ``R_AMDGPU_RELATIVE64``    Dynamic 13    ``word64`` B + A
     ========================== ======= ===== ========== ==============================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	781
Tony Tye	223f4c7	2018-04-13 01:01:27 +0000	[diff] [blame]	782	``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
				783	the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
				784
				785	There is no current OS loader support for 32 bit programs and so
				786	``R_AMDGPU_ABS32`` is not used.
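
The calculations in the relocation table can be sketched in Python as follows. The function name ``relocate`` and its keyword arguments are hypothetical and simply mirror the notation above. Two behaviours are assumptions of the sketch rather than statements about the backend or any loader: intermediate results wrap to 64 bits (two's complement), and ``word32`` results are truncated to the low 32 bits without overflow checking.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def relocate(rtype, S=0, A=0, P=0, G=0, GOT=0, B=0):
    """Return the value to store in the relocated field (illustrative only)."""

    def wrap64(value):
        # Assumption: arithmetic wraps to 64 bits, as in two's complement.
        return value & MASK64

    calculations = {
        "R_AMDGPU_ABS32_LO": lambda: wrap64(S + A) & MASK32,
        "R_AMDGPU_ABS32_HI": lambda: wrap64(S + A) >> 32,
        "R_AMDGPU_ABS64": lambda: wrap64(S + A),
        "R_AMDGPU_REL32": lambda: wrap64(S + A - P) & MASK32,
        "R_AMDGPU_REL64": lambda: wrap64(S + A - P),
        "R_AMDGPU_ABS32": lambda: wrap64(S + A) & MASK32,
        "R_AMDGPU_GOTPCREL": lambda: wrap64(G + GOT + A - P) & MASK32,
        "R_AMDGPU_GOTPCREL32_LO": lambda: wrap64(G + GOT + A - P) & MASK32,
        "R_AMDGPU_GOTPCREL32_HI": lambda: wrap64(G + GOT + A - P) >> 32,
        "R_AMDGPU_REL32_LO": lambda: wrap64(S + A - P) & MASK32,
        "R_AMDGPU_REL32_HI": lambda: wrap64(S + A - P) >> 32,
        "R_AMDGPU_RELATIVE64": lambda: wrap64(B + A),
    }
    return calculations[rtype]()


if __name__ == "__main__":
    # A backwards PC-relative reference split into low and high halves.
    low = relocate("R_AMDGPU_REL32_LO", S=0x1000, A=4, P=0x2000)
    high = relocate("R_AMDGPU_REL32_HI", S=0x1000, A=4, P=0x2000)
    print(hex(low), hex(high))
```

The ``_LO``/``_HI`` pairs allow a 64-bit value to be materialized with two 32-bit fields, which is why the high half shifts the full 64-bit result right by 32.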
				787
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	788	.. _amdgpu-dwarf:
				789
				790	DWARF
				791	-----
				792
Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
information that maps the code object executable code and data to the source
language constructs. They can be used by tools such as debuggers and profilers.
				796
				797	Address Space Mapping
				798	~~~~~~~~~~~~~~~~~~~~~
				799
				800	The following address space mapping is used:
				801
  .. table:: AMDGPU DWARF Address Space Mapping
     :name: amdgpu-dwarf-address-space-mapping-table

     =================== =================
     DWARF Address Space Memory Space
     =================== =================
     1                   Private (Scratch)
     2                   Local (group/LDS)
     omitted             Global
     omitted             Constant
     omitted             Generic (Flat)
     not supported       Region (GDS)
     =================== =================
				815
				816	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				817	used in the table.
				818
				819	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				820	DWARF address space of the value of the pointer when it is in the private or
				821	local address space. Otherwise the attribute is omitted.
				822
An ``XDEREF`` operation is generated in location list expressions for variables
that are allocated in the private and local address space. Otherwise no
``XDEREF`` is generated.
				826
				827	Register Mapping
				828	~~~~~~~~~~~~~~~~
				829
				830	This section is WIP.
				831
				832	.. TODO
				833	Define DWARF register enumeration.
				834
   If a wavefront state is to be presented, then vector registers should be
   exposed as 64 wide (rather than the per work-item view that LLVM uses),
   either as separate registers or as a single 64x4 byte register. In either
   case, use a new LANE op (akin to XDEREF) to select the current lane in a
   location expression. This would also allow scalar register spilling to
   vector register lanes to be expressed (currently no debug information is
   generated for spilling). If the single wide register approach is chosen,
   then use LANE in conjunction with the PIECE operation to select the dword
   part of the register for the current lane. If the separate register
   approach is chosen, then use LANE to select the register.
				845
				846	Source Text
				847	~~~~~~~~~~~
				848
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	849	Source text for online-compiled programs (e.g. those compiled by the OpenCL
				850	runtime) may be embedded into the DWARF v5 line table using the ``clang
				851	-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	852
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	853	For example:
				854
				855	``-gembed-source``
				856	Enable the embedded source DWARF v5 extension.
				857	``-gno-embed-source``
				858	Disable the embedded source DWARF v5 extension.
				859
  .. table:: AMDGPU Debug Options
     :name: amdgpu-debug-options

     ==================== ==================================================
     Debug Flag           Description
     ==================== ==================================================
     -g[no-]embed-source  Enable/disable embedding source text in DWARF
                          debug sections. Useful for environments where
                          source cannot be written to disk, such as
                          when performing online compilation.
     ==================== ==================================================
				871
This option enables one extended content type in the DWARF v5 Line Number
Program Header, which is used to encode embedded source.
				874
  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
     :name: amdgpu-dwarf-extended-content-types

     ============================ ======================
     Content Type                 Form
     ============================ ======================
     ``DW_LNCT_LLVM_source``      ``DW_FORM_line_strp``
     ============================ ======================
				883
				884	The source field will contain the UTF-8 encoded, null-terminated source text
				885	with ``'\n'`` line endings. When the source field is present, consumers can use
				886	the embedded source instead of attempting to discover the source on disk. When
				887	the source field is absent, consumers can access the file to get the source
				888	text.
				889
The above content type appears in the ``file_name_entry_format`` field of the
line table prologue, and its corresponding value appears in the ``file_names``
field. The current encoding of the content type is documented in table
:ref:`amdgpu-dwarf-extended-content-types-encoding`.
				894
  .. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
     :name: amdgpu-dwarf-extended-content-types-encoding

     ============================ ====================
     Content Type                 Value
     ============================ ====================
     ``DW_LNCT_LLVM_source``      0x2001
     ============================ ====================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	903
				904	.. _amdgpu-code-conventions:
				905
				906	Code Conventions
				907	================
				908
				909	This section provides code conventions used for each supported target triple OS
				910	(see :ref:`amdgpu-target-triples`).
				911
				912	AMDHSA
				913	------
				914
				915	This section provides code conventions used when the target triple OS is
				916	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				917
				918	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	919
Tony Tye	01bfd6c	2018-03-27 21:20:46 +0000	[diff] [blame]	920	Code Object Target Identification
				921	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				922
				923	The AMDHSA OS uses the following syntax to specify the code object
				924	target as a single string:
				925
				926	``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
				927
				928	Where:
				929
				930	- ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
				931	are the same as the Target Triple (see
				932	:ref:`amdgpu-target-triples`).
				933
				934	- ``<Processor>`` is the same as the Processor (see
				935	:ref:`amdgpu-processors`).
				936
				937	- ``<Target Features>`` is a list of the enabled Target Features
				938	(see :ref:`amdgpu-target-features`), each prefixed by a plus, that
				939	apply to Processor. The list must be in the same order as listed
				940	in the table :ref:`amdgpu-target-feature-table`. Note that *Target
				941	Features* must be included in the list if they are enabled even if
				942	that is the default for Processor.
				943
				944	For example:
				945
				946	``"amdgcn-amd-amdhsa--gfx902+xnack"``
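
As a sketch of the syntax above, the following Python helpers build and split a code object target string. The names ``format_target_id`` and ``parse_target_id`` are hypothetical. The parser assumes that none of the first four components contains a ``-`` and that no component contains a ``+``. Putting the features in the order required by the target feature table is left to the caller.

```python
def format_target_id(arch, vendor, os_name, environment, processor, features=()):
    """Join the components into a single target string (illustrative helper)."""
    triple = "-".join([arch, vendor, os_name, environment])
    # Each enabled target feature is prefixed by a plus.
    return f"{triple}-{processor}" + "".join(f"+{name}" for name in features)


def parse_target_id(target_id):
    """Split a target string back into its components (illustrative helper)."""
    head, *features = target_id.split("+")
    # At most four splits, so the processor name is kept whole.
    arch, vendor, os_name, environment, processor = head.split("-", 4)
    return {
        "architecture": arch,
        "vendor": vendor,
        "os": os_name,
        "environment": environment,
        "processor": processor,
        "features": features,
    }


if __name__ == "__main__":
    text = format_target_id("amdgcn", "amd", "amdhsa", "", "gfx902", ["xnack"])
    print(text)
    print(parse_target_id(text))
```

An empty ``<Environment>`` produces the two consecutive hyphens seen in the example string.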
				947
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	948	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	949	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	950
The code object metadata specifies extensible metadata associated with the code
objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
information necessary to support the ROCm kernel queries, for example, the
segment sizes needed in a dispatch packet. In addition, a high level language
runtime may require other information to be included. For example, the AMD
OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	960
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	961	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	962	:doc:`YamlIO`).
				963
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	964	.. TODO
				965	Is the string null terminated? It probably should not if YAML allows it to
				966	contain null characters, otherwise it should be.
				967
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	968	The metadata is represented as a single YAML document comprised of the mapping
				969	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				970	referenced tables.
				971
				972	For boolean values, the string values of ``false`` and ``true`` are used for
				973	false and true respectively.
				974
				975	Additional information can be added to the mappings. To avoid conflicts, any
				976	non-AMD key names should be prefixed by "vendor-name.".
				977
  .. table:: AMDHSA Code Object Metadata Mapping
     :name: amdgpu-amdhsa-code-object-metadata-mapping-table

     ========== ============== ========= =======================================
     String Key Value Type     Required? Description
     ========== ============== ========= =======================================
     "Version"  sequence of    Required  - The first integer is the major
                2 integers                 version. Currently 1.
                                         - The second integer is the minor
                                           version. Currently 0.
     "Printf"   sequence of              Each string is encoded information
                strings                  about a printf function call. The
                                         encoded information is organized as
                                         fields separated by colon (':'):

                                         ``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``

                                         where:

                                         ``ID``
                                           A 32 bit integer as a unique id for
                                           each printf function call

                                         ``N``
                                           A 32 bit integer equal to the number
                                           of arguments of printf function call
                                           minus 1

                                         ``S[i]`` (where i = 0, 1, ... , N-1)
                                           32 bit integers for the size in bytes
                                           of the i-th FormatString argument of
                                           the printf function call

                                         FormatString
                                           The format string passed to the
                                           printf function call.
     "Kernels"  sequence of    Required  Sequence of the mappings for each
                mapping                  kernel in the code object. See
                                         :ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
                                         for the definition of the mapping.
     ========== ============== ========= =======================================
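
The "Printf" string layout described in the table can be decoded along the following lines. This is a minimal Python sketch: the function name ``parse_printf_entry`` is hypothetical, and it assumes the format string is stored verbatim with no escaping, which the table does not specify. Because a format string may itself contain colons, the sketch splits off exactly ``N`` size fields and treats the remainder as the format string.

```python
def parse_printf_entry(entry):
    """Decode 'ID:N:S[0]:...:S[N-1]:FormatString' (illustrative helper)."""
    id_text, count_text, rest = entry.split(":", 2)
    count = int(count_text)
    # Split off exactly N size fields; the remainder is the format string.
    parts = rest.split(":", count)
    if len(parts) != count + 1:
        raise ValueError(f"expected {count} size fields in {entry!r}")
    sizes = [int(size) for size in parts[:count]]
    return int(id_text), sizes, parts[count]


if __name__ == "__main__":
    print(parse_printf_entry("1:2:4:8:x=%d y=%ld"))
```

Limiting the number of splits is the design choice that keeps colons inside the format string intact.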
				1019
				1020	..
				1021
  .. table:: AMDHSA Code Object Kernel Metadata Mapping
     :name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table

     ================= ============== ========= ================================
     String Key        Value Type     Required? Description
     ================= ============== ========= ================================
     "Name"            string         Required  Source name of the kernel.
     "SymbolName"      string         Required  Name of the kernel
                                                descriptor ELF symbol.
     "Language"        string                   Source language of the kernel.
                                                Values include:

                                                - "OpenCL C"
                                                - "OpenCL C++"
                                                - "HCC"
                                                - "OpenMP"
     "LanguageVersion" sequence of              - The first integer is the major
                       2 integers                 version.
                                                - The second integer is the
                                                  minor version.
     "Attrs"           mapping                  Mapping of kernel attributes.
                                                See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
                                                for the mapping definition.
     "Args"            sequence of              Sequence of mappings of the
                       mapping                  kernel arguments. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
                                                for the definition of the mapping.
     "CodeProps"       mapping                  Mapping of properties related to
                                                the kernel code. See
                                                :ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
                                                for the mapping definition.
     ================= ============== ========= ================================
				1056
				1057	..
				1058
  .. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
     :name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table

     =================== ============== ========= ==============================
     String Key          Value Type     Required? Description
     =================== ============== ========= ==============================
     "ReqdWorkGroupSize" sequence of              If not 0, 0, 0 then all values
                         3 integers               must be >=1 and the dispatch
                                                  work-group size X, Y, Z must
                                                  correspond to the specified
                                                  values. Defaults to 0, 0, 0.

                                                  Corresponds to the OpenCL
                                                  ``reqd_work_group_size``
                                                  attribute.
     "WorkGroupSizeHint" sequence of              The dispatch work-group size
                         3 integers               X, Y, Z is likely to be the
                                                  specified values.

                                                  Corresponds to the OpenCL
                                                  ``work_group_size_hint``
                                                  attribute.
     "VecTypeHint"       string                   The name of a scalar or vector
                                                  type.

                                                  Corresponds to the OpenCL
                                                  ``vec_type_hint`` attribute.
     "RuntimeHandle"     string                   The external symbol name
                                                  associated with a kernel.
                                                  OpenCL runtime allocates a
                                                  global buffer for the symbol
                                                  and saves the kernel's address
                                                  to it, which is used for
                                                  device side enqueueing. Only
                                                  available for device side
                                                  enqueued kernels.
     =================== ============== ========= ==============================
				1097
				1098	..
				1099
				1100	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				1101	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				1102
				1103	================= ============== ========= ================================
				1104	String Key Value Type Required? Description
				1105	================= ============== ========= ================================
				1106	"Name" string Kernel argument name.
				1107	"TypeName" string Kernel argument type name.
				1108	"Size" integer Required Kernel argument size in bytes.
				1109	"Align" integer Required Kernel argument alignment in
				1110	bytes. Must be a power of two.
				1111	"ValueKind" string Required Kernel argument kind that
				1112	specifies how to set up the
				1113	corresponding argument.
				1114	Values include:
				1115
				1116	"ByValue"
				1117	The argument is copied
				1118	directly into the kernarg.
				1119
				1120	"GlobalBuffer"
				1121	A global address space pointer
				1122	to the buffer data is passed
				1123	in the kernarg.
				1124
				1125	"DynamicSharedPointer"
				1126	A group address space pointer
				1127	to dynamically allocated LDS
				1128	is passed in the kernarg.
				1129
				1130	"Sampler"
				1131	A global address space
				1132	pointer to a S# is passed in
				1133	the kernarg.
				1134
				1135	"Image"
				1136	A global address space
				1137	pointer to a T# is passed in
				1138	the kernarg.
				1139
				1140	"Pipe"
				1141	A global address space pointer
				1142	to an OpenCL pipe is passed in
				1143	the kernarg.
				1144
				1145	"Queue"
				1146	A global address space pointer
				1147	to an OpenCL device enqueue
				1148	queue is passed in the
				1149	kernarg.
				1150
				1151	"HiddenGlobalOffsetX"
				1152	The OpenCL grid dispatch
				1153	global offset for the X
				1154	dimension is passed in the
				1155	kernarg.
				1156
				1157	"HiddenGlobalOffsetY"
				1158	The OpenCL grid dispatch
				1159	global offset for the Y
				1160	dimension is passed in the
				1161	kernarg.
				1162
				1163	"HiddenGlobalOffsetZ"
				1164	The OpenCL grid dispatch
				1165	global offset for the Z
				1166	dimension is passed in the
				1167	kernarg.
				1168
				1169	"HiddenNone"
				1170	An argument that is not used
				1171	by the kernel. Space needs to
				1172	be left for it, but it does
				1173	not need to be set up.
				1174
				1175	"HiddenPrintfBuffer"
				1176	A global address space pointer
				1177	to the runtime printf buffer
				1178	is passed in kernarg.
				1179
				1180	"HiddenDefaultQueue"
				1181	A global address space pointer
				1182	to the OpenCL device enqueue
				1183	queue that should be used by
				1184	the kernel by default is
				1185	passed in the kernarg.
				1186
				1187	"HiddenCompletionAction"
Yaxun Liu	c928f2a	2017-10-30 14:30:28 +0000	[diff] [blame]	1188	A global address space pointer
				1189	to help link enqueued kernels into
				1190	the ancestor tree for determining
				1191	when the parent kernel has finished.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1192
				1193	"ValueType" string Required Kernel argument value type. Only
				1194	present if "ValueKind" is
				1195	"ByValue". For vector data
				1196	types, the value is for the
				1197	element type. Values include:
				1198
				1199	- "Struct"
				1200	- "I8"
				1201	- "U8"
				1202	- "I16"
				1203	- "U16"
				1204	- "F16"
				1205	- "I32"
				1206	- "U32"
				1207	- "F32"
				1208	- "I64"
				1209	- "U64"
				1210	- "F64"
				1211
				1212	.. TODO
				1213	How can it be determined if a
				1214	vector type, and what size
				1215	vector?
				1216	"PointeeAlign" integer Alignment in bytes of pointee
				1217	type for pointer type kernel
				1218	argument. Must be a power
				1219	of 2. Only present if
				1220	"ValueKind" is
				1221	"DynamicSharedPointer".
				1222	"AddrSpaceQual" string Kernel argument address space
				1223	qualifier. Only present if
				1224	"ValueKind" is "GlobalBuffer" or
				1225	"DynamicSharedPointer". Values
				1226	are:
				1227
				1228	- "Private"
				1229	- "Global"
				1230	- "Constant"
				1231	- "Local"
				1232	- "Generic"
				1233	- "Region"
				1234
				1235	.. TODO
				1236	Is GlobalBuffer only Global
				1237	or Constant? Is
				1238	DynamicSharedPointer always
				1239	Local? Can HCC allow Generic?
				1240	How can Private or Region
				1241	ever happen?
				1242	"AccQual" string Kernel argument access
				1243	qualifier. Only present if
				1244	"ValueKind" is "Image" or
				1245	"Pipe". Values
				1246	are:
				1247
				1248	- "ReadOnly"
				1249	- "WriteOnly"
				1250	- "ReadWrite"
				1251
				1252	.. TODO
				1253	Does this apply to
				1254	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1255	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1256	performed by the kernel on the
				1257	kernel argument. Only present if
				1258	"ValueKind" is "GlobalBuffer",
				1259	"Image", or "Pipe". This may be
				1260	more restrictive than indicated
				1261	by "AccQual" to reflect what the
kernel actually does. If not
				1263	present then the runtime must
				1264	assume what is implied by
				1265	"AccQual" and "IsConst". Values
				1266	are:
				1267
				1268	- "ReadOnly"
				1269	- "WriteOnly"
				1270	- "ReadWrite"
				1271
				1272	"IsConst" boolean Indicates if the kernel argument
				1273	is const qualified. Only present
				1274	if "ValueKind" is
				1275	"GlobalBuffer".
				1276
				1277	"IsRestrict" boolean Indicates if the kernel argument
				1278	is restrict qualified. Only
				1279	present if "ValueKind" is
				1280	"GlobalBuffer".
				1281
				1282	"IsVolatile" boolean Indicates if the kernel argument
				1283	is volatile qualified. Only
				1284	present if "ValueKind" is
				1285	"GlobalBuffer".
				1286
				1287	"IsPipe" boolean Indicates if the kernel argument
				1288	is pipe qualified. Only present
				1289	if "ValueKind" is "Pipe".
				1290
				1291	.. TODO
				1292	Can GlobalBuffer be pipe
				1293	qualified?
				1294	================= ============== ========= ================================
				1295
				1296	..
				1297
.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
   :name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table

   ============================ ============== ========= =====================
   String Key                   Value Type     Required? Description
   ============================ ============== ========= =====================
   "KernargSegmentSize"         integer        Required  The size in bytes of
                                                         the kernarg segment
                                                         that holds the values
                                                         of the arguments to
                                                         the kernel.
   "GroupSegmentFixedSize"      integer        Required  The amount of group
                                                         segment memory
                                                         required by a
                                                         work-group in
                                                         bytes. This does not
                                                         include any
                                                         dynamically allocated
                                                         group segment memory
                                                         that may be added
                                                         when the kernel is
                                                         dispatched.
   "PrivateSegmentFixedSize"    integer        Required  The amount of fixed
                                                         private address space
                                                         memory required for a
                                                         work-item in
                                                         bytes. If the kernel
                                                         uses a dynamic call
                                                         stack then additional
                                                         space must be added
                                                         to this value for the
                                                         call stack.
   "KernargSegmentAlign"        integer        Required  The maximum byte
                                                         alignment of
                                                         arguments in the
                                                         kernarg segment. Must
                                                         be a power of 2.
   "WavefrontSize"              integer        Required  Wavefront size. Must
                                                         be a power of 2.
   "NumSGPRs"                   integer        Required  Number of scalar
                                                         registers used by a
                                                         wavefront for
                                                         GFX6-GFX9. This
                                                         includes the special
                                                         SGPRs for VCC, Flat
                                                         Scratch (GFX7-GFX9)
                                                         and XNACK (for
                                                         GFX8-GFX9). It does
                                                         not include the 16
                                                         SGPR added if a trap
                                                         handler is
                                                         enabled. It is not
                                                         rounded up to the
                                                         allocation
                                                         granularity.
   "NumVGPRs"                   integer        Required  Number of vector
                                                         registers used by
                                                         each work-item for
                                                         GFX6-GFX9
   "MaxFlatWorkGroupSize"       integer        Required  Maximum flat
                                                         work-group size
                                                         supported by the
                                                         kernel in work-items.
                                                         Must be >=1 and
                                                         consistent with
                                                         ReqdWorkGroupSize if
                                                         not 0, 0, 0.
   "NumSpilledSGPRs"            integer                  Number of stores from
                                                         a scalar register to
                                                         a register allocator
                                                         created spill
                                                         location.
   "NumSpilledVGPRs"            integer                  Number of stores from
                                                         a vector register to
                                                         a register allocator
                                                         created spill
                                                         location.
   ============================ ============== ========= =====================
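
The constraints stated in the table above can be checked mechanically. The
following is a minimal sketch, assuming the kernel code properties have already
been parsed into a Python dictionary keyed by the string keys listed above. The
helper names ``is_power_of_two`` and ``check_code_props`` are hypothetical and
are not part of any LLVM or ROCm API.

```python
# Keys marked "Required" in the table above.
REQUIRED_KEYS = (
    "KernargSegmentSize",
    "GroupSegmentFixedSize",
    "PrivateSegmentFixedSize",
    "KernargSegmentAlign",
    "WavefrontSize",
    "NumSGPRs",
    "NumVGPRs",
    "MaxFlatWorkGroupSize",
)

# Keys whose description says "Must be a power of 2".
POWER_OF_TWO_KEYS = ("KernargSegmentAlign", "WavefrontSize")


def is_power_of_two(value):
    """True for 1, 2, 4, 8, ... and False for 0 and negative values."""
    return value > 0 and (value & (value - 1)) == 0


def check_code_props(props):
    """Return a list of problems found in a parsed mapping (hypothetical helper)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in props:
            problems.append("missing required key: " + key)
    for key in POWER_OF_TWO_KEYS:
        if key in props and not is_power_of_two(props[key]):
            problems.append(key + " must be a power of 2")
    if props.get("MaxFlatWorkGroupSize", 1) < 1:
        problems.append("MaxFlatWorkGroupSize must be >= 1")
    return problems
```

An empty result means that no violation of the listed constraints was found. It
does not check consistency with ``ReqdWorkGroupSize``, which lives in a
different part of the metadata.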
				1376
				1377	..
				1378
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1379	Kernel Dispatch
				1380	~~~~~~~~~~~~~~~
				1381
				1382	The HSA architected queuing language (AQL) defines a user space memory interface
				1383	that can be used to control the dispatch of kernels, in an agent independent
				1384	way. An agent can have zero or more AQL queues created for it using the ROCm
				1385	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1386	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1387	mechanics and packet layouts.
				1388
				1389	The packet processor of a kernel agent is responsible for detecting and
				1390	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1391	packet processor is implemented by the hardware command processor (CP),
				1392	asynchronous dispatch controller (ADC) and shader processor input controller
				1393	(SPI).
				1394
				1395	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1396	mode driver to initialize and register the AQL queue with CP.
				1397
				1398	To dispatch a kernel the following actions are performed. This can occur in the
				1399	CPU host program, or from an HSA kernel executing on a GPU.
				1400
1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
   executed is obtained.
2. A pointer to the kernel descriptor (see
   :ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
   obtained. It must be for a kernel that is contained in a code object that
   was loaded by the ROCm runtime on the kernel agent with which the AQL queue
   is associated.
3. Space is allocated for the kernel arguments using the ROCm runtime allocator
   for a memory region with the kernarg property for the kernel agent that will
   execute the kernel. It must be at least 16 byte aligned.
4. Kernel argument values are assigned to the kernel argument memory
   allocation. The layout is defined in the HSA Programmer's Language Reference
   [HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
   memory in the same way constant memory is accessed. (Note that the HSA
   specification allows an implementation to copy the kernel argument contents
   to another location that is accessed by the kernel.)
5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
   API uses 64 bit atomic operations to reserve space in the AQL queue for the
   packet. The packet must be set up, and the final write must use an atomic
   store release to set the packet kind to ensure the packet contents are
   visible to the kernel agent. AQL defines a doorbell signal mechanism to
   notify the kernel agent that the AQL queue has been updated. These rules,
   and the layout of the AQL queue and kernel dispatch packet, are defined in
   the *HSA System Architecture Specification* [HSA]_.
6. A kernel dispatch packet includes information about the actual dispatch,
   such as grid and work-group size, together with information from the code
   object about the kernel, such as segment sizes. The ROCm runtime queries on
   the kernel symbol can be used to obtain the code object values which are
   recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
7. CP executes micro-code and is responsible for detecting and setting up the
   GPU to execute the wavefronts of a kernel dispatch.
8. CP ensures that when a wavefront starts executing the kernel machine
   code, the scalar general purpose registers (SGPR) and vector general purpose
   registers (VGPR) are set up as required by the machine code. The required
   setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
   register state is defined in
   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
9. The prolog of the kernel machine code (see
   :ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
   before continuing executing the machine code that corresponds to the kernel.
10. When the kernel dispatch has completed execution, CP signals the completion
    signal specified in the kernel dispatch packet if not 0.
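
The write ordering described in steps 5 and 6 can be sketched as follows. This
is an illustration only: the field order of the 64 byte packet and the packet
type values are assumptions taken from the kernel dispatch packet in the HSA
specification, which remains the authoritative source. Python cannot express an
atomic store release, so the sketch only shows that the packet body is filled
in first and the packet kind is written last. The function names are
hypothetical, and a real header would also carry fence scope bits that are
omitted here.

```python
import struct

# Assumed little-endian layout of a kernel dispatch packet:
# header, setup, workgroup_size_x/y/z, reserved (6 x 16 bit),
# grid_size_x/y/z, private_segment_size, group_segment_size (5 x 32 bit),
# kernel_object, kernarg_address, reserved, completion_signal (4 x 64 bit).
PACKET_FORMAT = "<HHHHHHIIIIIQQQQ"
PACKET_SIZE = struct.calcsize(PACKET_FORMAT)

PACKET_TYPE_INVALID = 1          # assumed HSA enumeration value
PACKET_TYPE_KERNEL_DISPATCH = 2  # assumed HSA enumeration value


def build_dispatch_packet(grid, workgroup, private_size, group_size,
                          kernel_object, kernarg_address, completion_signal=0):
    """Fill in the packet body while the header still marks it as invalid."""
    dimensions = 3  # setup field: number of grid dimensions in the low bits
    return bytearray(struct.pack(
        PACKET_FORMAT,
        PACKET_TYPE_INVALID, dimensions,
        workgroup[0], workgroup[1], workgroup[2], 0,
        grid[0], grid[1], grid[2], private_size, group_size,
        kernel_object, kernarg_address, 0, completion_signal))


def publish(packet, packet_type=PACKET_TYPE_KERNEL_DISPATCH):
    """Write the packet kind last; a real producer uses an atomic store release."""
    struct.pack_into("<H", packet, 0, packet_type)
    return packet
```

After ``publish`` a real producer would ring the queue doorbell so that the
packet processor notices the new packet.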
				1443
				1444	.. _amdgpu-amdhsa-memory-spaces:
				1445
				1446	Memory Spaces
				1447	~~~~~~~~~~~~~
				1448
				1449	The memory space properties are:
				1450
.. table:: AMDHSA Memory Spaces
   :name: amdgpu-amdhsa-memory-spaces-table

   ================= =========== ======== ======= ==================
   Memory Space Name HSA Segment Hardware Address NULL Value
                     Name        Name     Size
   ================= =========== ======== ======= ==================
   Private           private     scratch  32      0x00000000
   Local             group       LDS      32      0xFFFFFFFF
   Global            global      global   64      0x0000000000000000
   Constant          constant    *same as 64      0x0000000000000000
                                 global*
   Generic           flat        flat     64      0x0000000000000000
   Region            N/A         GDS      32      *not implemented
                                                  for AMDHSA*
   ================= =========== ======== ======= ==================
				1467
				1468	The global and constant memory spaces both use global virtual addresses, which
				1469	are the same virtual address space used by the CPU. However, some virtual
				1470	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1471	and some by both.
				1472
				1473	Using the constant memory space indicates that the data will not change during
				1474	the execution of the kernel. This allows scalar read instructions to be
				1475	used. The vector and scalar L1 caches are invalidated of volatile data before
				1476	each kernel dispatch execution to allow constant memory to change values between
				1477	kernel dispatches.
				1478
				1479	The local memory space uses the hardware Local Data Store (LDS) which is
				1480	automatically allocated when the hardware creates work-groups of wavefronts, and
				1481	freed when all the wavefronts of a work-group have terminated. The data store
				1482	(DS) instructions can be used to access it.
				1483
				1484	The private memory space uses the hardware scratch memory support. If the kernel
				1485	uses scratch, then the hardware allocates memory that is accessed using
				1486	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1487	address to physical address is:
				1488
				1489	``wavefront-scratch-base +
				1490	(private-address * wavefront-size * 4) +
				1491	(wavefront-lane-id * 4)``
				1492
There are different ways that the wavefront scratch base address is determined
by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
memory can be accessed in an interleaved manner using buffer instructions with
the scratch buffer descriptor and per wavefront scratch offset, by the scratch
instructions, or by flat instructions. If each lane of a wavefront accesses the
same private address, the interleaving results in adjacent dwords being accessed
and hence requires fewer cache lines to be fetched. Multi-dword access is not
supported except by flat and scratch instructions in GFX9.
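
The interleaving can be illustrated by evaluating the mapping for neighbouring
lanes. The sketch below applies the expression quoted above term by term. The
function name, the default wavefront size of 64 and the base address are
assumptions chosen for illustration.

```python
def scratch_physical_address(wavefront_scratch_base, private_address,
                             wavefront_lane_id, wavefront_size=64):
    """Evaluate the private-to-physical mapping exactly as written above."""
    return (wavefront_scratch_base
            + private_address * wavefront_size * 4
            + wavefront_lane_id * 4)


# Hypothetical scratch base; four lanes access the same private address.
base = 0x1000
addresses = [scratch_physical_address(base, 2, lane) for lane in range(4)]
```

With these inputs the four lanes land on consecutive dwords, which is why a
wavefront-wide access to one private address touches few cache lines.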
				1501
The generic address space uses the hardware flat address support available in
GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
local apertures), that are outside the range of addressable global memory, to
map from a flat address to a private or local address.
				1506
FLAT instructions can take a flat address and access global, private (scratch)
and group (LDS) memory depending on whether the address is within one of the
aperture ranges. Flat access to scratch requires hardware aperture setup and
setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
(see :ref:`amdgpu-amdhsa-m0`).
				1513
To convert between a segment address and a flat address the base address of the
apertures can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
GFX9 the aperture base addresses are directly available as inline constant
registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
address mode the aperture sizes are 2^32 bytes and the base is aligned to 2^32
which makes it easier to convert from flat to segment or segment to flat.
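
Because each aperture is 2^32 bytes in size and its base is aligned to 2^32, the
conversion reduces to manipulating the upper and lower 32 bits of the address.
The following is a minimal sketch of that arithmetic. The function names and
the aperture base used in the usage note are hypothetical, and the sketch
deliberately ignores NULL pointers, whose representation differs between memory
spaces.

```python
APERTURE_SIZE = 1 << 32  # aperture size and base alignment in 64 bit address mode


def is_in_aperture(aperture_base, flat_address):
    """A flat address belongs to an aperture when the upper 32 bits match."""
    return (flat_address >> 32) == (aperture_base >> 32)


def segment_to_flat(aperture_base, segment_address):
    """Place a 32 bit segment address in the low half of the aperture."""
    if aperture_base % APERTURE_SIZE != 0:
        raise ValueError("aperture base must be aligned to 2^32")
    if not 0 <= segment_address < APERTURE_SIZE:
        raise ValueError("segment address must fit in 32 bits")
    return aperture_base | segment_address


def flat_to_segment(aperture_base, flat_address):
    """Recover the 32 bit segment address by dropping the upper 32 bits."""
    if not is_in_aperture(aperture_base, flat_address):
        raise ValueError("flat address is outside the aperture")
    return flat_address & (APERTURE_SIZE - 1)
```

For example, with a made-up aperture base of ``0x0001000000000000``, the
segment address ``0x1234`` maps to the flat address ``0x0001000000001234`` and
back again.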
				1522
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1523	Image and Samplers
				1524	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1525
Image and sampler handles created by the ROCm runtime are 64 bit addresses of a
hardware 32 byte V# and 48 byte S# object respectively. In order to support the
HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
enumeration values for the queries that are not trivially deducible from the S#
representation.
				1531
				1532	HSA Signals
				1533	~~~~~~~~~~~
				1534
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1535	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1536	structure allocated in memory accessible from both the CPU and GPU. The
				1537	structure is defined by the ROCm runtime and subject to change between releases
				1538	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1539
				1540	.. _amdgpu-amdhsa-hsa-aql-queue:
				1541
				1542	HSA AQL Queue
				1543	~~~~~~~~~~~~~
				1544
The HSA AQL queue structure is defined by the ROCm runtime and subject to change
between releases (see [AMD-ROCm-github]_). For some processors it contains
fields needed to implement certain language features such as the flat address
aperture bases. It also contains fields used by CP, for example to manage the
allocation of scratch memory.
				1550
				1551	.. _amdgpu-amdhsa-kernel-descriptor:
				1552
				1553	Kernel Descriptor
				1554	~~~~~~~~~~~~~~~~~
				1555
				1556	A kernel descriptor consists of the information needed by CP to initiate the
				1557	execution of a kernel, including the entry point address of the machine code
				1558	that implements the kernel.
				1559
				1560	Kernel Descriptor for GFX6-GFX9
				1561	+++++++++++++++++++++++++++++++
				1562
CP microcode requires the kernel descriptor to be allocated on 64 byte alignment.
				1564
.. table:: Kernel Descriptor for GFX6-GFX9
   :name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table

   ======= ======= =============================== ============================
   Bits    Size    Field Name                      Description
   ======= ======= =============================== ============================
   31:0    4 bytes GroupSegmentFixedSize           The amount of fixed local
                                                   address space memory
                                                   required for a work-group
                                                   in bytes. This does not
                                                   include any dynamically
                                                   allocated local address
                                                   space memory that may be
                                                   added when the kernel is
                                                   dispatched.
   63:32   4 bytes PrivateSegmentFixedSize         The amount of fixed
                                                   private address space
                                                   memory required for a
                                                   work-item in bytes. If
                                                   is_dynamic_callstack is 1
                                                   then additional space must
                                                   be added to this value for
                                                   the call stack.
   127:64  8 bytes                                 Reserved, must be 0.
   191:128 8 bytes KernelCodeEntryByteOffset       Byte offset (possibly
                                                   negative) from base
                                                   address of kernel
                                                   descriptor to kernel's
                                                   entry point instruction
                                                   which must be 256 byte
                                                   aligned.
   383:192 24                                      Reserved, must be 0.
           bytes
   415:384 4 bytes ComputePgmRsrc1                 Compute Shader (CS)
                                                   program settings used by
                                                   CP to set up
                                                   ``COMPUTE_PGM_RSRC1``
                                                   configuration
                                                   register. See
                                                   :ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
   447:416 4 bytes ComputePgmRsrc2                 Compute Shader (CS)
                                                   program settings used by
                                                   CP to set up
                                                   ``COMPUTE_PGM_RSRC2``
                                                   configuration
                                                   register. See
                                                   :ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
   448     1 bit   EnableSGPRPrivateSegmentBuffer  Enable the setup of the
                                                   SGPR user data registers
                                                   (see
                                                   :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                   The total number of SGPR
                                                   user data registers
                                                   requested must not exceed
                                                   16 and match value in
                                                   ``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
                                                   Any requests beyond 16
                                                   will be ignored.
   449     1 bit   EnableSGPRDispatchPtr           *see above*
   450     1 bit   EnableSGPRQueuePtr              *see above*
   451     1 bit   EnableSGPRKernargSegmentPtr     *see above*
   452     1 bit   EnableSGPRDispatchID            *see above*
   453     1 bit   EnableSGPRFlatScratchInit       *see above*
   454     1 bit   EnableSGPRPrivateSegmentSize    *see above*
   455     1 bit   EnableSGPRGridWorkgroupCountX   Not implemented in CP and
                                                   should always be 0.
   456     1 bit   EnableSGPRGridWorkgroupCountY   Not implemented in CP and
                                                   should always be 0.
   457     1 bit   EnableSGPRGridWorkgroupCountZ   Not implemented in CP and
                                                   should always be 0.
   463:458 6 bits                                  Reserved, must be 0.
   511:464 6                                       Reserved, must be 0.
           bytes
   512     **Total size 64 bytes.**
   ======= ====================================================================
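
The bit ranges in the table above translate directly into byte offsets. The
following is a minimal sketch that packs a 64 byte descriptor with Python's
``struct`` module, assuming little-endian byte order so that bit 448 is the
least significant bit of the byte at offset 56. The function name, the
dictionary of flag names and the argument values used in the usage note are
hypothetical. The two resource words are passed in already encoded.

```python
import struct

# Two 32 bit sizes, 8 reserved bytes, a signed 64 bit entry offset,
# 24 reserved bytes, two 32 bit resource words, 16 bits holding the
# enable flags (bits 463:448), and 6 reserved bytes.
KERNEL_DESCRIPTOR_FORMAT = "<II8xq24xIIH6x"

# Position of each EnableSGPR* bit relative to bit 448.
ENABLE_SGPR_BITS = {
    "PrivateSegmentBuffer": 0,
    "DispatchPtr": 1,
    "QueuePtr": 2,
    "KernargSegmentPtr": 3,
    "DispatchID": 4,
    "FlatScratchInit": 5,
    "PrivateSegmentSize": 6,
}


def pack_kernel_descriptor(group_segment_fixed_size, private_segment_fixed_size,
                           kernel_code_entry_byte_offset, compute_pgm_rsrc1,
                           compute_pgm_rsrc2, enabled=()):
    """Pack the fields from the table; reserved ranges are written as zero."""
    flags = 0
    for name in enabled:
        flags |= 1 << ENABLE_SGPR_BITS[name]
    return struct.pack(KERNEL_DESCRIPTOR_FORMAT,
                       group_segment_fixed_size, private_segment_fixed_size,
                       kernel_code_entry_byte_offset,
                       compute_pgm_rsrc1, compute_pgm_rsrc2, flags)
```

For instance, ``pack_kernel_descriptor(0, 0, -256, 0, 0, ("DispatchPtr",
"KernargSegmentPtr"))`` yields 64 bytes in which the entry offset is stored as
a negative signed value and the flag bytes encode bits 449 and 451.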
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1641
				1642	..
				1643
				1644	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1645	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1646
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1647	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1648	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1649	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1650	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1651	used by each work-item,
				1652	granularity is device
				1653	specific:
				1654
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1655	GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1656	- max_vgpr 1..256
- roundup((max_vgpr + 1)
/ 4) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1659
				1660	Used by CP to set up
				1661	``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1662	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1663	used by a wavefront,
				1664	granularity is device
				1665	specific:
				1666
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1667	GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1668	- max_sgpr 1..112
- roundup((max_sgpr + 1)
/ 8) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1671	GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1672	- max_sgpr 1..112
- roundup((max_sgpr + 1)
/ 16) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1675
				1676	Includes the special SGPRs
				1677	for VCC, Flat Scratch (for
				1678	GFX7 onwards) and XNACK
				1679	(for GFX8 onwards). It does
				1680	not include the 16 SGPR
				1681	added if a trap handler is
				1682	enabled.
				1683
				1684	Used by CP to set up
				1685	``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1686	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1687
				1688	Start executing wavefront
				1689	at the specified priority.
				1690
				1691	CP is responsible for
				1692	filling in
				1693	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1694	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
with specified rounding
mode for single (32 bit)
precision floating point
operations.
				1700
				1701	Floating point rounding
				1702	mode values are defined in
				1703	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1704
				1705	Used by CP to set up
				1706	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1707	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
with specified rounding
mode for half/double (16
and 64 bit) precision
floating point operations.
				1713
				1714	Floating point rounding
				1715	mode values are defined in
				1716	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1717
				1718	Used by CP to set up
				1719	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1720	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
with specified denorm mode
for single (32 bit)
precision floating point
operations.
				1726
				1727	Floating point denorm mode
				1728	values are defined in
				1729	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1730
				1731	Used by CP to set up
				1732	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1733	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
with specified denorm mode
for half/double (16
and 64 bit) precision
floating point operations.
				1739
				1740	Floating point denorm mode
				1741	values are defined in
				1742	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1743
				1744	Used by CP to set up
				1745	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1746	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1747
				1748	Start executing wavefront
				1749	in privilege trap handler
				1750	mode.
				1751
				1752	CP is responsible for
				1753	filling in
				1754	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1755	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1756	with DX10 clamp mode
				1757	enabled. Used by the vector
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1758	ALU to force DX10 style
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1759	treatment of NaN's (when
				1760	set, clamp NaN to zero,
				1761	otherwise pass NaN
				1762	through).
				1763
				1764	Used by CP to set up
				1765	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1766	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1767
				1768	Start executing wavefront
				1769	in single step mode.
				1770
				1771	CP is responsible for
				1772	filling in
				1773	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1774	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1775	with IEEE mode
				1776	enabled. Floating point
				1777	opcodes that support
				1778	exception flag gathering
				1779	will quiet and propagate
				1780	signaling-NaN inputs per
				1781	IEEE 754-2008. Min_dx10 and
				1782	max_dx10 become IEEE
				1783	754-2008 compliant due to
				1784	signaling-NaN propagation
				1785	and quieting.
				1786
				1787	Used by CP to set up
				1788	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1789	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1790
				1791	Only one work-group allowed
				1792	to execute on a compute
				1793	unit.
				1794
				1795	CP is responsible for
				1796	filling in
				1797	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1798	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1799
				1800	Flag that can be used to
				1801	control debugging code.
				1802
				1803	CP is responsible for
				1804	filling in
				1805	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1806	26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1807	Reserved, must be 0.
				1808	GFX9
				1809	Wavefront starts execution
				1810	with specified fp16 overflow
				1811	mode.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1812
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1813	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1814	+/-INF values.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1815	- If 1, fp16 overflow that is the
result of a +/-INF input value
				1817	or divide by 0 produces a +/-INF,
				1818	otherwise clamps computed
				1819	overflow to +/-MAX_FP16 as
				1820	appropriate.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1821
				1822	Used by CP to set up
				1823	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1824	31:27 5 bits Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1825	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1826	======= ===================================================================================================================
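
The bit positions in the table above can be turned into a small packing helper.
The following is a minimal sketch, not an implementation taken from LLVM or the
ROCm runtime: the names ``RSRC1_FIELDS``, ``MUST_BE_ZERO``, ``pack_rsrc1`` and
``unpack_rsrc1`` are hypothetical, and the granulated register counts are
supplied already encoded rather than derived from register usage.

```python
# (low bit, width in bits) of each field, taken from the table above.
RSRC1_FIELDS = {
    "GRANULATED_WORKITEM_VGPR_COUNT": (0, 6),
    "GRANULATED_WAVEFRONT_SGPR_COUNT": (6, 4),
    "PRIORITY": (10, 2),
    "FLOAT_ROUND_MODE_32": (12, 2),
    "FLOAT_ROUND_MODE_16_64": (14, 2),
    "FLOAT_DENORM_MODE_32": (16, 2),
    "FLOAT_DENORM_MODE_16_64": (18, 2),
    "PRIV": (20, 1),
    "ENABLE_DX10_CLAMP": (21, 1),
    "DEBUG_MODE": (22, 1),
    "ENABLE_IEEE_MODE": (23, 1),
    "BULKY": (24, 1),
    "CDBG_USER": (25, 1),
    "FP16_OVFL": (26, 1),
}

# Fields the table describes as "Must be 0" because CP fills them in.
MUST_BE_ZERO = ("PRIORITY", "PRIV", "DEBUG_MODE", "BULKY", "CDBG_USER")


def pack_rsrc1(**fields):
    """Combine named field values into one 32 bit word; unnamed fields stay 0."""
    word = 0
    for name, value in fields.items():
        low, width = RSRC1_FIELDS[name]
        if not 0 <= value < (1 << width):
            raise ValueError(name + " does not fit in its bit field")
        if name in MUST_BE_ZERO and value != 0:
            raise ValueError(name + " must be 0")
        word |= value << low
    return word


def unpack_rsrc1(word):
    """Split a 32 bit word back into its named fields."""
    return {name: (word >> low) & ((1 << width) - 1)
            for name, (low, width) in RSRC1_FIELDS.items()}
```

Bits 31:27 are reserved and are never set by ``pack_rsrc1``, so they remain 0 as
the table requires.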
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1827
				1828	..
				1829
				1830	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1831	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1832
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1833	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1834	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1835	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1836	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1837	_WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1838	system register (see
				1839	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1840
				1841	Used by CP to set up
				1842	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1843	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1844	user data registers
				1845	requested. This number must
				1846	match the number of user
				1847	data registers enabled.
				1848
				1849	Used by CP to set up
				1850	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1851	6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1852	TRAP instruction which
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1853	requires a trap handler to
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1854	be enabled.
				1855
				1856	CP sets
				1857	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
				1858	if the runtime has
				1859	installed a trap handler
				1860	regardless of the setting
				1861	of this field.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1862	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1863	system SGPR register for
				1864	the work-group id in the X
				1865	dimension (see
				1866	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1867
				1868	Used by CP to set up
				1869	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1870	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1871	system SGPR register for
				1872	the work-group id in the Y
				1873	dimension (see
				1874	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1875
				1876	Used by CP to set up
				1877	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
     9       1 bit   ENABLE_SGPR_WORKGROUP_ID_Z      Enable the setup of the
                                                     system SGPR register for
                                                     the work-group id in the Z
                                                     dimension (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
     10      1 bit   ENABLE_SGPR_WORKGROUP_INFO      Enable the setup of the
                                                     system SGPR register for
                                                     work-group information (see
                                                     :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
     12:11   2 bits  ENABLE_VGPR_WORKITEM_ID         Enable the setup of the
                                                     VGPR system registers used
                                                     for the work-item ID.
                                                     :ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
                                                     defines the values.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
     13      1 bit   ENABLE_EXCEPTION_ADDRESS_WATCH  Must be 0.

                                                     Wavefront starts execution
                                                     with address watch
                                                     exceptions enabled which
                                                     are generated when L1 has
                                                     witnessed a thread access
                                                     an *address of
                                                     interest*.

                                                     CP is responsible for
                                                     filling in the address
                                                     watch bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     14      1 bit   ENABLE_EXCEPTION_MEMORY         Must be 0.

                                                     Wavefront starts execution
                                                     with memory violation
                                                     exceptions
                                                     enabled which are generated
                                                     when a memory violation has
                                                     occurred for this wavefront from
                                                     L1 or LDS
                                                     (write-to-read-only-memory,
                                                     mis-aligned atomic, LDS
                                                     address out of range,
                                                     illegal address, etc.).

                                                     CP sets the memory
                                                     violation bit in
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
                                                     according to what the
                                                     runtime requests.
     23:15   9 bits  GRANULATED_LDS_SIZE             Must be 0.

                                                     CP uses the rounded value
                                                     from the dispatch packet,
                                                     not this value, as the
                                                     dispatch may contain
                                                     dynamically allocated group
                                                     segment memory. CP writes
                                                     directly to
                                                     ``COMPUTE_PGM_RSRC2.LDS_SIZE``.

                                                     Amount of group segment
                                                     (LDS) to allocate for each
                                                     work-group. Granularity is
                                                     device specific:

                                                     GFX6:
                                                       roundup(lds-size / (64 * 4))
                                                     GFX7-GFX9:
                                                       roundup(lds-size / (128 * 4))

     24      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    Wavefront starts execution
                     _INVALID_OPERATION              with specified exceptions
                                                     enabled.

                                                     Used by CP to set up
                                                     ``COMPUTE_PGM_RSRC2.EXCP_EN``
                                                     (set from bits 0..6).

                                                     IEEE 754 FP Invalid
                                                     Operation
     25      1 bit   ENABLE_EXCEPTION_FP_DENORMAL    FP Denormal: one or more
                     _SOURCE                         input operands is a
                                                     denormal number
     26      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Division by
                     _DIVISION_BY_ZERO               Zero
     27      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Overflow
                     _OVERFLOW
     28      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Underflow
                     _UNDERFLOW
     29      1 bit   ENABLE_EXCEPTION_IEEE_754_FP    IEEE 754 FP Inexact
                     _INEXACT
     30      1 bit   ENABLE_EXCEPTION_INT_DIVIDE_BY  Integer Division by Zero
                     _ZERO                           (rcp_iflag_f32 instruction
                                                     only)
     31      1 bit                                   Reserved, must be 0.
     32      Total size 4 bytes.
     ======= ===================================================================================================================
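
The bit positions in the rows above can be exercised with a small helper. The following is a minimal sketch, not part of the backend: the ``FIELDS`` dictionary and the function names are hypothetical, the field positions are copied from the table rows for bits 9 to 30, and ``roundup`` is assumed to mean rounding up to the next whole granule.

```python
# Hypothetical helper: (low bit, width) for the compute_pgm_rsrc2 fields
# listed in the table rows for bits 9..30.
FIELDS = {
    "ENABLE_SGPR_WORKGROUP_ID_Z": (9, 1),
    "ENABLE_SGPR_WORKGROUP_INFO": (10, 1),
    "ENABLE_VGPR_WORKITEM_ID": (11, 2),
    "ENABLE_EXCEPTION_ADDRESS_WATCH": (13, 1),
    "ENABLE_EXCEPTION_MEMORY": (14, 1),
    "GRANULATED_LDS_SIZE": (15, 9),
    "ENABLE_EXCEPTION_IEEE_754_FP_INVALID_OPERATION": (24, 1),
    "ENABLE_EXCEPTION_FP_DENORMAL_SOURCE": (25, 1),
    "ENABLE_EXCEPTION_IEEE_754_FP_DIVISION_BY_ZERO": (26, 1),
    "ENABLE_EXCEPTION_IEEE_754_FP_OVERFLOW": (27, 1),
    "ENABLE_EXCEPTION_IEEE_754_FP_UNDERFLOW": (28, 1),
    "ENABLE_EXCEPTION_IEEE_754_FP_INEXACT": (29, 1),
    "ENABLE_EXCEPTION_INT_DIVIDE_BY_ZERO": (30, 1),
}


def get_field(word, name):
    """Extract the named field from a 32 bit compute_pgm_rsrc2 value."""
    low, width = FIELDS[name]
    return (word >> low) & ((1 << width) - 1)


def set_field(word, name, value):
    """Return a copy of word with the named field replaced by value."""
    low, width = FIELDS[name]
    mask = (1 << width) - 1
    if value & ~mask:
        raise ValueError(f"{value} does not fit in {width} bit(s)")
    return (word & ~(mask << low)) | (value << low)


def granulated_lds_size(lds_size_bytes, gfx_major):
    """Granule count for an LDS allocation (assumes roundup means ceiling)."""
    if gfx_major not in (6, 7, 8, 9):
        raise ValueError("only GFX6-GFX9 are described by the table")
    granule = 64 * 4 if gfx_major == 6 else 128 * 4
    return (lds_size_bytes + granule - 1) // granule
```

Note that the table requires ``GRANULATED_LDS_SIZE`` to be 0 in the kernel descriptor; the granule calculation only illustrates how the device specific granularity applies to an LDS size.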

..

  .. table:: Floating Point Rounding Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN      0     Round Ties To Even
     AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY  1     Round Toward +infinity
     AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2     Round Toward -infinity
     AMDGPU_FLOAT_ROUND_MODE_ZERO           3     Round Toward 0
     ====================================== ===== ==============================

..

  .. table:: Floating Point Denorm Mode Enumeration Values
     :name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table

     ====================================== ===== ==============================
     Enumeration Name                       Value Description
     ====================================== ===== ==============================
     AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0     Flush Source and Destination
                                                  Denorms
     AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST     1     Flush Output Denorms
     AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC     2     Flush Source Denorms
     AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE    3     No Flush
     ====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2013
				2014	..
				2015
				2016	.. table:: System VGPR Work-Item ID Enumeration Values
				2017	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				2018
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2019	======================================== ===== ============================
				2020	Enumeration Name Value Description
				2021	======================================== ===== ============================
				2022	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
				2023	ID.
				2024	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				2025	dimensions ID.
				2026	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				2027	dimensions ID.
				2028	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				2029	======================================== ===== ============================

.. _amdgpu-amdhsa-initial-kernel-execution-state:

Initial Kernel Execution State
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section defines the register state that will be set up by the packet
processor prior to the start of execution of every wavefront. This is limited by
the constraints of the hardware controllers of CP/ADC/SPI.

The order of the SGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_sgpr_*``
bit fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers
used for enabled registers are dense starting at SGPR0: the first enabled
register is SGPR0, the next enabled register is SGPR1 etc.; disabled registers
do not have an SGPR number.

The initial SGPRs comprise up to 16 User SGPRs that are set by CP and apply to
all wavefronts of the grid. It is possible to specify more than 16 User SGPRs
using the ``enable_sgpr_*`` bit fields, in which case only the first 16 are
actually initialized. These are then immediately followed by the System SGPRs
that are set up by ADC/SPI and can have different values for each wavefront of
the grid dispatch.
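
The dense numbering rule can be illustrated with a short sketch. This is only one reading of the text, under stated assumptions: the register names and the ``assign_sgprs`` function are hypothetical, the sizes and order are taken from the SGPR Register Set Up Order table, the User SGPRs are assumed to be those listed before the Work-Group Ids, and a User SGPR that does not fit in the first 16 is assumed to receive no number.

```python
# Hypothetical names; sizes and order follow the SGPR set-up order table.
USER_SGPRS = [
    ("private_segment_buffer", 4),
    ("dispatch_ptr", 2),
    ("queue_ptr", 2),
    ("kernarg_segment_ptr", 2),
    ("dispatch_id", 2),
    ("flat_scratch_init", 2),
    ("private_segment_size", 1),
    ("grid_workgroup_count_x", 1),
    ("grid_workgroup_count_y", 1),
    ("grid_workgroup_count_z", 1),
]
SYSTEM_SGPRS = [
    ("workgroup_id_x", 1),
    ("workgroup_id_y", 1),
    ("workgroup_id_z", 1),
    ("workgroup_info", 1),
    ("private_segment_wavefront_offset", 1),
]
MAX_USER_SGPRS = 16


def assign_sgprs(enabled):
    """Map each enabled register to (first SGPR number, SGPR count)."""
    layout = {}
    next_sgpr = 0
    for name, count in USER_SGPRS:
        if name not in enabled:
            continue  # disabled registers do not get an SGPR number
        if next_sgpr + count > MAX_USER_SGPRS:
            continue  # assumption: beyond the first 16, not initialized
        layout[name] = (next_sgpr, count)
        next_sgpr += count
    # System SGPRs immediately follow the initialized User SGPRs.
    for name, count in SYSTEM_SGPRS:
        if name in enabled:
            layout[name] = (next_sgpr, count)
            next_sgpr += count
    return layout
```

For example, enabling only the dispatch pointer, the kernarg segment pointer and the work-group id in X yields SGPR0-1, SGPR2-3 and SGPR4 respectively.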

SGPR register initial state is defined in
:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.

  .. table:: SGPR Register Set Up Order
     :name: amdgpu-amdhsa-sgpr-register-set-up-order-table

     ========== ========================== ====== ==============================
     SGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     SGPRs
     ========== ========================== ====== ==============================
     First      Private Segment Buffer     4      V# that can be used, together
                (enable_sgpr_private              with Scratch Wavefront Offset
                _segment_buffer)                  as an offset, to access the
                                                  private memory space using a
                                                  segment address.

                                                  CP uses the value provided by
                                                  the runtime.
     then       Dispatch Ptr               2      64 bit address of AQL dispatch
                (enable_sgpr_dispatch_ptr)        packet for kernel dispatch
                                                  actually executing.
     then       Queue Ptr                  2      64 bit address of amd_queue_t
                (enable_sgpr_queue_ptr)           object for AQL queue on which
                                                  the dispatch packet was
                                                  queued.
     then       Kernarg Segment Ptr        2      64 bit address of Kernarg
                (enable_sgpr_kernarg              segment. This is directly
                _segment_ptr)                     copied from the
                                                  kernarg_address in the kernel
                                                  dispatch packet.

                                                  Having CP load it once avoids
                                                  loading it at the beginning of
                                                  every wavefront.
     then       Dispatch Id                2      64 bit Dispatch ID of the
                (enable_sgpr_dispatch_id)         dispatch packet being
                                                  executed.
     then       Flat Scratch Init          2      This is 2 SGPRs:
                (enable_sgpr_flat_scratch
                _init)                            GFX6
                                                    Not supported.
                                                  GFX7-GFX8
                                                    The first SGPR is a 32 bit
                                                    byte offset from
                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
                                                    to per SPI base of memory
                                                    for scratch for the queue
                                                    executing the kernel
                                                    dispatch. CP obtains this
                                                    from the runtime. (The
                                                    Scratch Segment Buffer base
                                                    address is
                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
                                                    plus this offset.) The value
                                                    of Scratch Wavefront Offset must
                                                    be added to this offset by
                                                    the kernel machine code,
                                                    right shifted by 8, and
                                                    moved to the FLAT_SCRATCH_HI
                                                    SGPR register.
                                                    FLAT_SCRATCH_HI corresponds
                                                    to SGPRn-4 on GFX7, and
                                                    SGPRn-6 on GFX8 (where SGPRn
                                                    is the highest numbered SGPR
                                                    allocated to the wavefront).
                                                    FLAT_SCRATCH_HI is
                                                    multiplied by 256 (as it is
                                                    in units of 256 bytes) and
                                                    added to
                                                    ``SH_HIDDEN_PRIVATE_BASE_VIMID``
                                                    to calculate the per wavefront
                                                    FLAT SCRATCH BASE in flat
                                                    memory instructions that
                                                    access the scratch
                                                    aperture.

                                                    The second SGPR is 32 bit
                                                    byte size of a single
                                                    work-item's scratch memory
                                                    usage. CP obtains this from
                                                    the runtime, and it is
                                                    always a multiple of DWORD.
                                                    CP checks that the value in
                                                    the kernel dispatch packet
                                                    Private Segment Byte Size is
                                                    not larger, and requests the
                                                    runtime to increase the
                                                    queue's scratch size if
                                                    necessary. The kernel code
                                                    must move it to
                                                    FLAT_SCRATCH_LO which is
                                                    SGPRn-3 on GFX7 and SGPRn-5
                                                    on GFX8. FLAT_SCRATCH_LO is
                                                    used as the FLAT SCRATCH
                                                    SIZE in flat memory
                                                    instructions. Having CP load
                                                    it once avoids loading it at
                                                    the beginning of every
                                                    wavefront.
                                                  GFX9
                                                    This is the
                                                    64 bit base address of the
                                                    per SPI scratch backing
                                                    memory managed by SPI for
                                                    the queue executing the
                                                    kernel dispatch. CP obtains
                                                    this from the runtime (and
                                                    divides it if there are
                                                    multiple Shader Arrays each
                                                    with its own SPI). The value
                                                    of Scratch Wavefront Offset must
                                                    be added by the kernel
                                                    machine code and the result
                                                    moved to the FLAT_SCRATCH
                                                    SGPR which is SGPRn-6 and
                                                    SGPRn-5. It is used as the
                                                    FLAT SCRATCH BASE in flat
                                                    memory instructions.
     then       Private Segment Size       1      The 32 bit byte size of a
                (enable_sgpr_private              single work-item's scratch
                _segment_size)                    memory allocation. This is the
                                                  value from the kernel
                                                  dispatch packet Private
                                                  Segment Byte Size rounded up
                                                  by CP to a multiple of
                                                  DWORD.

                                                  Having CP load it once avoids
                                                  loading it at the beginning of
                                                  every wavefront.

                                                  This is not used for
                                                  GFX7-GFX8 since it is the same
                                                  value as the second SGPR of
                                                  Flat Scratch Init. However, it
                                                  may be needed for GFX9 which
                                                  changes the meaning of the
                                                  Flat Scratch Init value.
     then       Grid Work-Group Count X    1      32 bit count of the number of
                (enable_sgpr_grid                 work-groups in the X dimension
                _workgroup_count_X)               for the grid being
                                                  executed. Computed from the
                                                  fields in the kernel dispatch
                                                  packet as ((grid_size.x +
                                                  workgroup_size.x - 1) /
                                                  workgroup_size.x).
     then       Grid Work-Group Count Y    1      32 bit count of the number of
                (enable_sgpr_grid                 work-groups in the Y dimension
                _workgroup_count_Y &&             for the grid being
                less than 16 previous             executed. Computed from the
                SGPRs)                            fields in the kernel dispatch
                                                  packet as ((grid_size.y +
                                                  workgroup_size.y - 1) /
                                                  workgroup_size.y).

                                                  Only initialized if <16
                                                  previous SGPRs initialized.
     then       Grid Work-Group Count Z    1      32 bit count of the number of
                (enable_sgpr_grid                 work-groups in the Z dimension
                _workgroup_count_Z &&             for the grid being
                less than 16 previous             executed. Computed from the
                SGPRs)                            fields in the kernel dispatch
                                                  packet as ((grid_size.z +
                                                  workgroup_size.z - 1) /
                                                  workgroup_size.z).

                                                  Only initialized if <16
                                                  previous SGPRs initialized.
     then       Work-Group Id X            1      32 bit work-group id in X
                (enable_sgpr_workgroup_id         dimension of grid for
                _X)                               wavefront.
     then       Work-Group Id Y            1      32 bit work-group id in Y
                (enable_sgpr_workgroup_id         dimension of grid for
                _Y)                               wavefront.
     then       Work-Group Id Z            1      32 bit work-group id in Z
                (enable_sgpr_workgroup_id         dimension of grid for
                _Z)                               wavefront.
     then       Work-Group Info            1      {first_wavefront, 14'b0000,
                (enable_sgpr_workgroup            ordered_append_term[10:0],
                _info)                            threadgroup_size_in_wavefronts[5:0]}
     then       Scratch Wavefront Offset   1      32 bit byte offset from base
                (enable_sgpr_private              of scratch base of queue
                _segment_wavefront_offset)        executing the kernel
                                                  dispatch. Must be used as an
                                                  offset with Private
                                                  segment address when using
                                                  Scratch Segment Buffer. It
                                                  must be used to set up FLAT
                                                  SCRATCH for flat addressing
                                                  (see
                                                  :ref:`amdgpu-amdhsa-flat-scratch`).
     ========== ========================== ====== ==============================
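
Two of the table entries are plain arithmetic and can be sketched directly. The function names below are hypothetical. The Work-Group Info decoding assumes the braces list fields from the most significant bit down, as in a Verilog style concatenation, which places ``first_wavefront`` in bit 31, ``ordered_append_term`` in bits 16:6 and ``threadgroup_size_in_wavefronts`` in bits 5:0.

```python
def grid_workgroup_count(grid_size, workgroup_size):
    """Number of work-groups in one dimension, rounding up partial groups."""
    return (grid_size + workgroup_size - 1) // workgroup_size


def decode_workgroup_info(value):
    """Split a 32 bit Work-Group Info SGPR value into its fields.

    Assumed layout (1 + 14 + 11 + 6 = 32 bits, most significant first):
    first_wavefront, 14 zero bits, ordered_append_term[10:0],
    threadgroup_size_in_wavefronts[5:0].
    """
    return {
        "first_wavefront": (value >> 31) & 0x1,
        "ordered_append_term": (value >> 6) & 0x7FF,
        "threadgroup_size_in_wavefronts": value & 0x3F,
    }
```

A grid of 1000 work-items with a work-group size of 256 therefore needs 4 work-groups in that dimension, the last one partially filled.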

The order of the VGPR registers is defined, but the compiler can specify which
ones are actually set up in the kernel descriptor using the ``enable_vgpr*``
bit fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers
used for enabled registers are dense starting at VGPR0: the first enabled
register is VGPR0, the next enabled register is VGPR1 etc.; disabled registers
do not have a VGPR number.

VGPR register initial state is defined in
:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.

  .. table:: VGPR Register Set Up Order
     :name: amdgpu-amdhsa-vgpr-register-set-up-order-table

     ========== ========================== ====== ==============================
     VGPR Order Name                       Number Description
                (kernel descriptor enable  of
                field)                     VGPRs
     ========== ========================== ====== ==============================
     First      Work-Item Id X             1      32 bit work item id in X
                (Always initialized)              dimension of work-group for
                                                  wavefront lane.
     then       Work-Item Id Y             1      32 bit work item id in Y
                (enable_vgpr_workitem_id          dimension of work-group for
                > 0)                              wavefront lane.
     then       Work-Item Id Z             1      32 bit work item id in Z
                (enable_vgpr_workitem_id          dimension of work-group for
                > 1)                              wavefront lane.
     ========== ========================== ====== ==============================

The setting of registers is done by GPU CP/ADC/SPI hardware as follows:

1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
   registers.
2. Work-group Id registers X, Y, Z are set by ADC which supports any
   combination including none.
3. Scratch Wavefront Offset is set by SPI on a per wavefront basis which is why
   its value cannot be included with the flat scratch init value which is per
   queue.
4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
   or (X, Y, Z).
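
The VGPR case from step 4 is small enough to show in full. The sketch below uses a hypothetical function name and maps the ``enable_vgpr_workitem_id`` value to the work-item id registers that SPI sets up, following the VGPR Register Set Up Order table; the value 3 is rejected because the enumeration marks it as undefined.

```python
def workitem_id_vgprs(enable_vgpr_workitem_id):
    """Return {id name: VGPR name} for the given enable_vgpr_workitem_id."""
    if enable_vgpr_workitem_id not in (0, 1, 2):
        raise ValueError("enable_vgpr_workitem_id must be 0, 1 or 2")
    names = ["workitem_id_x", "workitem_id_y", "workitem_id_z"]
    # X is always initialized; Y needs a value > 0 and Z needs a value > 1.
    enabled = names[: enable_vgpr_workitem_id + 1]
    return {name: f"VGPR{index}" for index, name in enumerate(enabled)}
```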

The Flat Scratch register pair consists of adjacent SGPRs so that they can be
moved as a 64 bit value to the hardware required SGPRn-3 and SGPRn-4
respectively.

The global segment can be accessed either using buffer instructions (GFX6 which
has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
instructions (GFX9).

If buffer operations are used then the compiler can generate a V# with the
following properties:

* base address of 0
* no swizzle
* ATC: 1 if IOMMU present (such as APU)
* ptr64: 1
* MTYPE set to support memory coherence that matches the runtime (such as CC for
  APU and NC for dGPU).

.. _amdgpu-amdhsa-kernel-prolog:

Kernel Prolog
~~~~~~~~~~~~~

.. _amdgpu-amdhsa-m0:

M0
++

GFX6-GFX8
  The M0 register must be initialized with a value of at least the total LDS
  size if the kernel may access LDS via DS or flat operations. The total LDS
  size is available in the dispatch packet. For M0, it is also possible to use
  the maximum possible value of LDS for the given target (0x7FFF for GFX6 and
  0xFFFF for GFX7-GFX8).
GFX9
  The M0 register is not used for range checking LDS accesses and so does not
  need to be initialized in the prolog.
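
The M0 rule can be summarized as a small decision function. This is a minimal sketch with a hypothetical name and signature, not backend code: it returns the value a prolog could load into M0, or ``None`` when no initialization is needed.

```python
def m0_prolog_value(gfx_major, total_lds_size=None):
    """Pick an M0 value for the kernel prolog, or None if M0 is not needed.

    total_lds_size is the value from the dispatch packet, when known;
    otherwise the target maximum is used.
    """
    if gfx_major not in (6, 7, 8, 9):
        raise ValueError("only GFX6-GFX9 are described here")
    if gfx_major == 9:
        return None  # M0 is not used for LDS range checking on GFX9
    max_value = 0x7FFF if gfx_major == 6 else 0xFFFF
    if total_lds_size is None:
        return max_value
    if total_lds_size > max_value:
        raise ValueError("LDS size exceeds the maximum M0 value for the target")
    return total_lds_size  # any value >= the total LDS size is acceptable
```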

.. _amdgpu-amdhsa-flat-scratch:

Flat Scratch
++++++++++++

If the kernel may use flat operations to access scratch memory, the prolog code
must set up the FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI
which are in SGPRn-4/SGPRn-3). Initialization uses the Flat Scratch Init and
Scratch Wavefront Offset SGPR registers (see
:ref:`amdgpu-amdhsa-initial-kernel-execution-state`):

GFX6
  Flat scratch is not supported.

GFX7-GFX8
  1. The low word of Flat Scratch Init is a 32 bit byte offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
     being managed by SPI for the queue executing the kernel dispatch. This is
     the same value used in the Scratch Segment Buffer V# base address. The
     prolog must add the value of Scratch Wavefront Offset to get the
     wavefront's byte scratch backing memory offset from
     ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since FLAT_SCRATCH_HI is in units of 256
     bytes, the offset must be right shifted by 8 before moving into
     FLAT_SCRATCH_HI.
  2. The second word of Flat Scratch Init is the 32 bit byte size of a single
     work-item's scratch memory usage. This is directly loaded from the kernel
     dispatch packet Private Segment Byte Size and rounded up to a multiple of
     DWORD. Having CP load it once avoids loading it at the beginning of every
     wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT
     SCRATCH SIZE.

GFX9
  The Flat Scratch Init is the 64 bit address of the base of scratch backing
  memory being managed by SPI for the queue executing the kernel dispatch. The
  prolog must add the value of Scratch Wavefront Offset and move the result to
  the FLAT_SCRATCH pair for use as the flat scratch base in flat memory
  instructions.
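
The prolog arithmetic can be modelled on plain integers. The sketch below is an illustration under stated assumptions, not the code the backend emits: the function names are hypothetical, the register roles follow the Flat Scratch Init row of the SGPR set-up table (shifted offset in FLAT_SCRATCH_HI and size in FLAT_SCRATCH_LO for GFX7-GFX8), and additions are assumed to wrap at the register width.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def flat_scratch_gfx7_gfx8(init_first, init_second, wavefront_offset):
    """Return (FLAT_SCRATCH_LO, FLAT_SCRATCH_HI) for GFX7-GFX8.

    init_first:  first SGPR of Flat Scratch Init (byte offset of the
                 queue's scratch from SH_HIDDEN_PRIVATE_BASE_VIMID).
    init_second: second SGPR (byte size of one work-item's scratch).
    """
    wave_byte_offset = (init_first + wavefront_offset) & MASK32
    flat_scratch_hi = wave_byte_offset >> 8  # units of 256 bytes
    flat_scratch_lo = init_second            # used as FLAT SCRATCH SIZE
    return flat_scratch_lo, flat_scratch_hi


def flat_scratch_gfx9(init_first, init_second, wavefront_offset):
    """Return the FLAT_SCRATCH pair (low word, high word) for GFX9.

    The two Flat Scratch Init SGPRs form a 64 bit base address to which
    the Scratch Wavefront Offset is added.
    """
    base = ((init_second << 32) | init_first) + wavefront_offset
    base &= MASK64
    return base & MASK32, base >> 32
```

On GFX9 the carry out of the low word has to reach the high word, which is why the sketch adds on the combined 64 bit value rather than on the low SGPR alone.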

.. _amdgpu-amdhsa-memory-model:

Memory Model
~~~~~~~~~~~~

This section describes the mapping of the LLVM memory model onto AMDGPU machine
code (see :ref:`memmodel`). The implementation is WIP.

.. TODO
   Update when implementation complete.

The AMDGPU backend supports the memory synchronization scopes specified in
:ref:`amdgpu-memory-scopes`.

The code sequences used to implement the memory model are defined in table
:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.

The sequences specify the order of instructions that a single thread must
execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
to other memory instructions executed by the same thread. This allows them to be
moved earlier or later which can allow them to be combined with other instances
of the same instruction, or hoisted/sunk out of loops to improve
performance. Only the instructions related to the memory model are given;
additional ``s_waitcnt`` instructions are required to ensure registers are
defined before being used. These may be able to be combined with the memory
model ``s_waitcnt`` instructions as described above.

The AMDGPU backend supports the following memory models:

  HSA Memory Model [HSA]_
    The HSA memory model uses a single happens-before relation for all address
    spaces (see :ref:`amdgpu-address-spaces`).
  OpenCL Memory Model [OpenCL]_
    The OpenCL memory model has separate happens-before relations for the
    global and local address spaces. Only a fence specifying both global and
    local address space, and seq_cst instructions join the relationships. Since
    the LLVM ``fence`` instruction does not allow an address space to be
    specified the OpenCL fence has to conservatively assume both local and
    global address space was specified. However, optimizations can often be
    done to eliminate the additional ``s_waitcnt`` instructions when there are
    no intervening memory instructions which access the corresponding address
    space. The code sequences in the table indicate what can be omitted for the
    OpenCL memory model. The target triple environment is used to determine if
    the source language is OpenCL (see :ref:`amdgpu-opencl`).

``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
operations.

``buffer/global/flat_load/store/atomic`` instructions to global memory are
termed vector memory operations.

For GFX6-GFX9:

* Each agent has multiple compute units (CU).
* Each CU has multiple SIMDs that execute wavefronts.
* The wavefronts for a single work-group are executed in the same CU but may be
  executed by different SIMDs.
* Each CU has a single LDS memory shared by the wavefronts of the work-groups
  executing on it.
* All LDS operations of a CU are performed as wavefront wide operations in a
  global order and involve no caching. Completion is reported to a wavefront in
  execution order.
* The LDS memory has multiple request queues shared by the SIMDs of a
  CU. Therefore, the LDS operations performed by different wavefronts of a
  work-group can be reordered relative to each other, which can result in
  reordering the visibility of vector memory operations with respect to LDS
  operations of other wavefronts in the same work-group. A
  ``s_waitcnt lgkmcnt(0)`` is required to ensure synchronization between LDS
  operations and vector memory operations between wavefronts of a work-group,
  but not between operations performed by the same wavefront.
* The vector memory operations are performed as wavefront wide operations and
  completion is reported to a wavefront in execution order. The exception is
  that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
  vector memory order if they access LDS memory, and out of LDS operation order
  if they access global memory.
* The vector memory operations access a single vector L1 cache shared by all
  SIMDs of a CU. Therefore, no special action is required for coherence between
  the lanes of a single wavefront, or for coherence between wavefronts in the
  same work-group. A ``buffer_wbinvl1_vol`` is required for coherence between
  wavefronts executing in different work-groups as they may be executing on
  different CUs.
* The scalar memory operations access a scalar L1 cache shared by all wavefronts
  on a group of CUs. The scalar and vector L1 caches are not coherent. However,
  scalar operations are used in a restricted way so do not impact the memory
  model. See :ref:`amdgpu-amdhsa-memory-spaces`.
* The vector and scalar memory operations use an L2 cache shared by all CUs on
  the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
  addresses.
* Each CU has a separate request queue per channel. Therefore, the vector and
  scalar memory operations performed by wavefronts executing in different
  work-groups (which may be executing on different CUs) of an agent can be
  reordered relative to each other. A ``s_waitcnt vmcnt(0)`` is required to
  ensure synchronization between vector memory operations of different CUs. It
  ensures a previous vector memory operation has completed before executing a
  subsequent vector memory or LDS operation and so can be used to meet the
  requirements of acquire and release.
* The L2 cache can be kept coherent with other agents on some targets, or ranges
  of virtual addresses can be set up to bypass it to ensure system coherence.

Private address space uses ``buffer_load/store`` using the scratch V#
(GFX6-GFX8), or ``scratch_load/store`` (GFX9). Since only a single thread is
accessing the memory, atomic memory orderings are not meaningful and all
accesses are treated as non-atomic.

Constant address space uses ``buffer/global_load`` instructions (or equivalent
scalar memory instructions). Since the constant address space contents do not
change during the execution of a kernel dispatch it is not legal to perform
stores, and atomic memory orderings are not meaningful and all accesses are
treated as non-atomic.

A memory synchronization scope wider than work-group is not meaningful for the
group (LDS) address space and is treated as work-group.

The memory model does not support the region address space which is treated as
non-atomic.

Acquire memory ordering is not meaningful on store atomic instructions and is
treated as non-atomic.

Release memory ordering is not meaningful on load atomic instructions and is
treated as non-atomic.

Acquire-release memory ordering is not meaningful on load or store atomic
instructions and is treated as acquire and release respectively.
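
The ordering and scope adjustments above amount to a small normalization step. The following sketch is illustrative only: the function names, the string spellings of the orderings, the instruction kinds and the scope names (listed from narrowest to widest) are assumptions made for the example, not identifiers taken from the backend.

```python
# Assumed scope spellings, ordered from narrowest to widest.
SCOPES = ["singlethread", "wavefront", "workgroup", "agent", "system"]


def effective_ordering(kind, ordering):
    """Apply the rules for orderings that are not meaningful.

    kind is "load", "store" or "rmw"; ordering is an atomic ordering name.
    """
    if kind == "load":
        if ordering == "release":
            return "non-atomic"
        if ordering == "acq_rel":
            return "acquire"
    elif kind == "store":
        if ordering == "acquire":
            return "non-atomic"
        if ordering == "acq_rel":
            return "release"
    return ordering


def effective_scope(address_space, scope):
    """Clamp scopes wider than work-group for the group (LDS) address space."""
    if address_space == "group":
        if SCOPES.index(scope) > SCOPES.index("workgroup"):
            return "workgroup"
    return scope
```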
				2486
The AMDGPU backend only uses scalar memory operations to access memory that is
proven not to change during the execution of the kernel dispatch. This includes
the constant address space and the global address space for program scope const
variables. Therefore the kernel machine code does not have to maintain the
scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
and vector L1 caches are invalidated between kernel dispatches by CP since
constant address space data may change between kernel dispatch executions. See
:ref:`amdgpu-amdhsa-memory-spaces`.

The one exception is if scalar writes are used to spill SGPR registers. In this
case the AMDGPU backend ensures the memory location used to spill is never
accessed by vector memory operations at the same time. If scalar writes are used
then an ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
return, since the locations may be used for vector memory instructions by a
future wavefront that uses the same scratch area, or by a function call that
creates a frame at the same address, respectively. There is no need for an
``s_dcache_inv`` as all scalar writes are write-before-read in the same thread.
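
As a rough illustration of the write-back rule above, the following Python
sketch walks a list of assembly lines and places an ``s_dcache_wb`` in front of
every program end or return. It is a toy model and not the backend's actual
pass. The function name ``insert_dcache_writeback`` and the
``EXIT_MNEMONICS`` tuple are hypothetical, and treating every ``s_setpc_b64``
as a function return is a simplifying assumption made for this example.

```python
# Hypothetical set of mnemonics that end a kernel or return from a function.
# Treating every s_setpc_b64 as a return is an assumption of this sketch.
EXIT_MNEMONICS = ("s_endpgm", "s_setpc_b64")


def insert_dcache_writeback(instrs, uses_scalar_spill_writes):
    """Return a copy of instrs with s_dcache_wb placed before each exit.

    Nothing is inserted when scalar writes were not used for SGPR spilling,
    because the scalar cache then holds no dirty lines to write back.
    """
    if not uses_scalar_spill_writes:
        return list(instrs)

    out = []
    for ins in instrs:
        fields = ins.split()
        if fields and fields[0] in EXIT_MNEMONICS:
            # Write back dirty scalar cache lines before leaving, so a later
            # user of the same scratch locations sees the spilled data.
            out.append("s_dcache_wb")
        out.append(ins)
    return out
```

No ``s_dcache_inv`` is modelled, matching the statement that scalar spill
writes are always write-before-read within the same thread.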
				2504
Scratch backing memory (which is used for the private address space)
is accessed with MTYPE NC_NV (non-coherent non-volatile). Since the private
address space is only accessed by a single thread, and is always
write-before-read, there is never a need to invalidate these entries from the L1
cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
volatile cache lines.

On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
to invalidate the L2 cache. This also causes it to be treated as
non-volatile and so it is not invalidated by ``*_vol``. On APU it is accessed as CC
(cache coherent) and so the L2 cache will be coherent with the CPU and other
agents.
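
The code sequence table in this section can be read as a lookup keyed on the
instruction, memory ordering, synchronization scope, and address space. As a
minimal sketch of that reading, the Python function below encodes only the
rows for a ``load atomic acquire`` on the global address space. The function
name and the use of plain strings for machine instructions are hypothetical
choices made for this example. The sequences themselves are copied from the
table.

```python
# Scopes for which the acquire load needs no extra cache maintenance.
NARROW_SCOPES = ("singlethread", "wavefront", "workgroup")
# Scopes for which the load must bypass L1 and stale L1 lines must be dropped.
WIDE_SCOPES = ("agent", "system")


def load_acquire_global_sequence(scope):
    """Return the machine code sequence for load atomic acquire (global)."""
    if scope in NARROW_SCOPES:
        return ["buffer/global/flat_load"]
    if scope in WIDE_SCOPES:
        return [
            "buffer/global/flat_load glc=1",
            # Ensures the load has completed before invalidating the cache.
            "s_waitcnt vmcnt(0)",
            # Ensures that following loads will not see stale global data.
            "buffer_wbinvl1_vol",
        ]
    raise ValueError("unknown synchronization scope: " + scope)
```

The split into two groups mirrors the table: up to workgroup scope the plain
load suffices, while agent and system scope add the wait and the volatile L1
invalidate.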
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2517
  .. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
     :name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table

     ============ ============ ============== ========== ===============================
     LLVM Instr   LLVM Memory  LLVM Memory    AMDGPU     AMDGPU Machine Code
                  Ordering     Sync Scope     Address
                                              Space
     ============ ============ ============== ========== ===============================
     Non-Atomic
     -----------------------------------------------------------------------------------
     load         none         none           - global   - !volatile & !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_load
                                              - constant
                                                         - volatile & !nontemporal

                                                           1. buffer/global/flat_load
                                                              glc=1

                                                         - nontemporal

                                                           1. buffer/global/flat_load
                                                              glc=1 slc=1

     load         none         none           - local    1. ds_load
     store        none         none           - global   - !nontemporal
                                              - generic
                                              - private    1. buffer/global/flat_store
                                              - constant
                                                         - nontemporal

                                                           1. buffer/global/flat_store
                                                              glc=1 slc=1

     store        none         none           - local    1. ds_store
     Unordered Atomic
     -----------------------------------------------------------------------------------
     load atomic  unordered    any            any        Same as non-atomic.
     store atomic unordered    any            any        Same as non-atomic.
     atomicrmw    unordered    any            any        *Same as monotonic
                                                         atomic*.
     Monotonic Atomic
     -----------------------------------------------------------------------------------
     load atomic  monotonic    - singlethread - global   1. buffer/global/flat_load
                               - wavefront    - generic
                               - workgroup
     load atomic  monotonic    - singlethread - local    1. ds_load
                               - wavefront
                               - workgroup
     load atomic  monotonic    - agent        - global   1. buffer/global/flat_load
                               - system       - generic     glc=1
     store atomic monotonic    - singlethread - global   1. buffer/global/flat_store
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     store atomic monotonic    - singlethread - local    1. ds_store
                               - wavefront
                               - workgroup
     atomicrmw    monotonic    - singlethread - global   1. buffer/global/flat_atomic
                               - wavefront    - generic
                               - workgroup
                               - agent
                               - system
     atomicrmw    monotonic    - singlethread - local    1. ds_atomic
                               - wavefront
                               - workgroup
				2585	Acquire Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2586	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2587	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2588	- wavefront - local
				2589	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2590	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				2591	load atomic acquire - workgroup - local 1. ds_load
				2592	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2593
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2594	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2595	- Must happen before
				2596	any following
				2597	global/generic
				2598	load/load
				2599	atomic/store/store
				2600	atomic/atomicrmw.
				2601	- Ensures any
				2602	following global
				2603	data read is no
				2604	older than the load
				2605	atomic value being
				2606	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2607	load atomic acquire - workgroup - generic 1. flat_load
				2608	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2609
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2610	- If OpenCL, omit.
				2611	- Must happen before
				2612	any following
				2613	global/generic
				2614	load/load
				2615	atomic/store/store
				2616	atomic/atomicrmw.
				2617	- Ensures any
				2618	following global
				2619	data read is no
				2620	older than the load
				2621	atomic value being
				2622	acquired.
				2623	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2624	- system glc=1
				2625	2. s_waitcnt vmcnt(0)
				2626
				2627	- Must happen before
				2628	following
				2629	buffer_wbinvl1_vol.
				2630	- Ensures the load
				2631	has completed
				2632	before invalidating
				2633	the cache.
				2634
				2635	3. buffer_wbinvl1_vol
				2636
				2637	- Must happen before
				2638	any following
				2639	global/generic
				2640	load/load
				2641	atomic/atomicrmw.
				2642	- Ensures that
				2643	following
				2644	loads will not see
				2645	stale global data.
				2646
				2647	load atomic acquire - agent - generic 1. flat_load glc=1
				2648	- system 2. s_waitcnt vmcnt(0) &
				2649	lgkmcnt(0)
				2650
                                                           - If OpenCL, omit
				2652	lgkmcnt(0).
				2653	- Must happen before
				2654	following
				2655	buffer_wbinvl1_vol.
				2656	- Ensures the flat_load
				2657	has completed
				2658	before invalidating
				2659	the cache.
				2660
				2661	3. buffer_wbinvl1_vol
				2662
				2663	- Must happen before
				2664	any following
				2665	global/generic
				2666	load/load
				2667	atomic/atomicrmw.
				2668	- Ensures that
				2669	following loads
				2670	will not see stale
				2671	global data.
				2672
				2673	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2674	- wavefront - local
				2675	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2676	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				2677	atomicrmw acquire - workgroup - local 1. ds_atomic
                                                         2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2679
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2680	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2681	- Must happen before
				2682	any following
				2683	global/generic
				2684	load/load
				2685	atomic/store/store
				2686	atomic/atomicrmw.
				2687	- Ensures any
				2688	following global
				2689	data read is no
				2690	older than the
				2691	atomicrmw value
				2692	being acquired.
				2693
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2694	atomicrmw acquire - workgroup - generic 1. flat_atomic
                                                         2. s_waitcnt lgkmcnt(0)
				2696
				2697	- If OpenCL, omit.
				2698	- Must happen before
				2699	any following
				2700	global/generic
				2701	load/load
				2702	atomic/store/store
				2703	atomic/atomicrmw.
				2704	- Ensures any
				2705	following global
				2706	data read is no
				2707	older than the
				2708	atomicrmw value
				2709	being acquired.
				2710
				2711	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2712	- system 2. s_waitcnt vmcnt(0)
				2713
				2714	- Must happen before
				2715	following
				2716	buffer_wbinvl1_vol.
				2717	- Ensures the
				2718	atomicrmw has
				2719	completed before
				2720	invalidating the
				2721	cache.
				2722
				2723	3. buffer_wbinvl1_vol
				2724
				2725	- Must happen before
				2726	any following
				2727	global/generic
				2728	load/load
				2729	atomic/atomicrmw.
				2730	- Ensures that
				2731	following loads
				2732	will not see stale
				2733	global data.
				2734
				2735	atomicrmw acquire - agent - generic 1. flat_atomic
				2736	- system 2. s_waitcnt vmcnt(0) &
				2737	lgkmcnt(0)
				2738
				2739	- If OpenCL, omit
				2740	lgkmcnt(0).
				2741	- Must happen before
				2742	following
				2743	buffer_wbinvl1_vol.
				2744	- Ensures the
				2745	atomicrmw has
				2746	completed before
				2747	invalidating the
				2748	cache.
				2749
				2750	3. buffer_wbinvl1_vol
				2751
				2752	- Must happen before
				2753	any following
				2754	global/generic
				2755	load/load
				2756	atomic/atomicrmw.
				2757	- Ensures that
				2758	following loads
				2759	will not see stale
				2760	global data.
				2761
				2762	fence acquire - singlethread none none
				2763	- wavefront
				2764	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2765
				2766	- If OpenCL and
				2767	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2768	not generic, omit.
				2769	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2770	currently has no
				2771	address space on
                                                             the fence, it must
                                                             conservatively
                                                             always be generated. If
				2775	fence had an
				2776	address space then
				2777	set to address
				2778	space of OpenCL
				2779	fence flag, or to
				2780	generic if both
				2781	local and global
				2782	flags are
				2783	specified.
				2784	- Must happen after
				2785	any preceding
				2786	local/generic load
				2787	atomic/atomicrmw
				2788	with an equal or
				2789	wider sync scope
				2790	and memory ordering
				2791	stronger than
				2792	unordered (this is
				2793	termed the
				2794	fence-paired-atomic).
				2795	- Must happen before
				2796	any following
				2797	global/generic
				2798	load/load
				2799	atomic/store/store
				2800	atomic/atomicrmw.
				2801	- Ensures any
				2802	following global
				2803	data read is no
				2804	older than the
				2805	value read by the
				2806	fence-paired-atomic.
				2807
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2808	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				2809	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2810
				2811	- If OpenCL and
				2812	address space is
				2813	not generic, omit
				2814	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2815	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2816	currently has no
				2817	address space on
                                                             the fence, it must
                                                             conservatively
                                                             always be generated
				2821	(see comment for
				2822	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2823	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2824	separate s_waitcnt
				2825	vmcnt(0) and
				2826	s_waitcnt
				2827	lgkmcnt(0) to allow
				2828	them to be
				2829	independently moved
				2830	according to the
				2831	following rules.
				2832	- s_waitcnt vmcnt(0)
				2833	must happen after
				2834	any preceding
				2835	global/generic load
				2836	atomic/atomicrmw
				2837	with an equal or
				2838	wider sync scope
				2839	and memory ordering
				2840	stronger than
				2841	unordered (this is
				2842	termed the
				2843	fence-paired-atomic).
				2844	- s_waitcnt lgkmcnt(0)
				2845	must happen after
				2846	any preceding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2847	local/generic load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2848	atomic/atomicrmw
				2849	with an equal or
				2850	wider sync scope
				2851	and memory ordering
				2852	stronger than
				2853	unordered (this is
				2854	termed the
				2855	fence-paired-atomic).
				2856	- Must happen before
				2857	the following
				2858	buffer_wbinvl1_vol.
				2859	- Ensures that the
                                                             fence-paired-atomic
				2861	has completed
				2862	before invalidating
				2863	the
				2864	cache. Therefore
				2865	any following
				2866	locations read must
				2867	be no older than
				2868	the value read by
				2869	the
				2870	fence-paired-atomic.
				2871
				2872	2. buffer_wbinvl1_vol
				2873
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2874	- Must happen before any
				2875	following global/generic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2876	load/load
				2877	atomic/store/store
				2878	atomic/atomicrmw.
				2879	- Ensures that
				2880	following loads
				2881	will not see stale
				2882	global data.
				2883
				2884	Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2885	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2886	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2887	- wavefront - local
				2888	- generic
				2889	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2890
				2891	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2892	- Must happen after
				2893	any preceding
				2894	local/generic
				2895	load/store/load
				2896	atomic/store
				2897	atomic/atomicrmw.
				2898	- Must happen before
				2899	the following
				2900	store.
				2901	- Ensures that all
				2902	memory operations
				2903	to local have
				2904	completed before
				2905	performing the
				2906	store that is being
				2907	released.
				2908
				2909	2. buffer/global/flat_store
				2910	store atomic release - workgroup - local 1. ds_store
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2911	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2912
				2913	- If OpenCL, omit.
				2914	- Must happen after
				2915	any preceding
				2916	local/generic
				2917	load/store/load
				2918	atomic/store
				2919	atomic/atomicrmw.
				2920	- Must happen before
				2921	the following
				2922	store.
				2923	- Ensures that all
				2924	memory operations
				2925	to local have
				2926	completed before
				2927	performing the
				2928	store that is being
				2929	released.
				2930
				2931	2. flat_store
				2932	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2933	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2934
				2935	- If OpenCL, omit
				2936	lgkmcnt(0).
				2937	- Could be split into
				2938	separate s_waitcnt
				2939	vmcnt(0) and
				2940	s_waitcnt
				2941	lgkmcnt(0) to allow
				2942	them to be
				2943	independently moved
				2944	according to the
				2945	following rules.
				2946	- s_waitcnt vmcnt(0)
				2947	must happen after
				2948	any preceding
				2949	global/generic
				2950	load/store/load
				2951	atomic/store
				2952	atomic/atomicrmw.
				2953	- s_waitcnt lgkmcnt(0)
				2954	must happen after
				2955	any preceding
				2956	local/generic
				2957	load/store/load
				2958	atomic/store
				2959	atomic/atomicrmw.
				2960	- Must happen before
				2961	the following
				2962	store.
				2963	- Ensures that all
				2964	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2965	to memory have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2966	completed before
				2967	performing the
				2968	store that is being
				2969	released.
				2970
				2971	2. buffer/global/ds/flat_store
				2972	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2973	- wavefront - local
				2974	- generic
				2975	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2976
				2977	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2978	- Must happen after
				2979	any preceding
				2980	local/generic
				2981	load/store/load
				2982	atomic/store
				2983	atomic/atomicrmw.
				2984	- Must happen before
				2985	the following
				2986	atomicrmw.
				2987	- Ensures that all
				2988	memory operations
				2989	to local have
				2990	completed before
				2991	performing the
				2992	atomicrmw that is
				2993	being released.
				2994
				2995	2. buffer/global/flat_atomic
				2996	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2997	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2998
				2999	- If OpenCL, omit.
				3000	- Must happen after
				3001	any preceding
				3002	local/generic
				3003	load/store/load
				3004	atomic/store
				3005	atomic/atomicrmw.
				3006	- Must happen before
				3007	the following
				3008	atomicrmw.
				3009	- Ensures that all
				3010	memory operations
				3011	to local have
				3012	completed before
				3013	performing the
				3014	atomicrmw that is
				3015	being released.
				3016
				3017	2. flat_atomic
				3018	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				3019	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3020
				3021	- If OpenCL, omit
				3022	lgkmcnt(0).
				3023	- Could be split into
				3024	separate s_waitcnt
				3025	vmcnt(0) and
				3026	s_waitcnt
				3027	lgkmcnt(0) to allow
				3028	them to be
				3029	independently moved
				3030	according to the
				3031	following rules.
				3032	- s_waitcnt vmcnt(0)
				3033	must happen after
				3034	any preceding
				3035	global/generic
				3036	load/store/load
				3037	atomic/store
				3038	atomic/atomicrmw.
				3039	- s_waitcnt lgkmcnt(0)
				3040	must happen after
				3041	any preceding
				3042	local/generic
				3043	load/store/load
				3044	atomic/store
				3045	atomic/atomicrmw.
				3046	- Must happen before
				3047	the following
				3048	atomicrmw.
				3049	- Ensures that all
				3050	memory operations
				3051	to global and local
				3052	have completed
				3053	before performing
				3054	the atomicrmw that
				3055	is being released.
				3056
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3057	2. buffer/global/ds/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3058	fence release - singlethread none none
				3059	- wavefront
				3060	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				3061
				3062	- If OpenCL and
				3063	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3064	not generic, omit.
				3065	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3066	currently has no
				3067	address space on
                                                             the fence, it must
                                                             conservatively
                                                             always be generated. If
				3071	fence had an
				3072	address space then
				3073	set to address
				3074	space of OpenCL
				3075	fence flag, or to
				3076	generic if both
				3077	local and global
				3078	flags are
				3079	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3080	- Must happen after
				3081	any preceding
				3082	local/generic
				3083	load/load
				3084	atomic/store/store
				3085	atomic/atomicrmw.
				3086	- Must happen before
				3087	any following store
				3088	atomic/atomicrmw
				3089	with an equal or
				3090	wider sync scope
				3091	and memory ordering
				3092	stronger than
				3093	unordered (this is
				3094	termed the
				3095	fence-paired-atomic).
				3096	- Ensures that all
				3097	memory operations
				3098	to local have
				3099	completed before
				3100	performing the
				3101	following
				3102	fence-paired-atomic.
				3103
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3104	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				3105	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3106
				3107	- If OpenCL and
				3108	address space is
				3109	not generic, omit
				3110	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3111	- If OpenCL and
				3112	address space is
				3113	local, omit
				3114	vmcnt(0).
				3115	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3116	currently has no
				3117	address space on
                                                             the fence, it must
                                                             conservatively
                                                             always be generated. If
				3121	fence had an
				3122	address space then
				3123	set to address
				3124	space of OpenCL
				3125	fence flag, or to
				3126	generic if both
				3127	local and global
				3128	flags are
				3129	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3130	- Could be split into
				3131	separate s_waitcnt
				3132	vmcnt(0) and
				3133	s_waitcnt
				3134	lgkmcnt(0) to allow
				3135	them to be
				3136	independently moved
				3137	according to the
				3138	following rules.
				3139	- s_waitcnt vmcnt(0)
				3140	must happen after
				3141	any preceding
				3142	global/generic
				3143	load/store/load
				3144	atomic/store
				3145	atomic/atomicrmw.
				3146	- s_waitcnt lgkmcnt(0)
				3147	must happen after
				3148	any preceding
				3149	local/generic
				3150	load/store/load
				3151	atomic/store
				3152	atomic/atomicrmw.
				3153	- Must happen before
				3154	any following store
				3155	atomic/atomicrmw
				3156	with an equal or
				3157	wider sync scope
				3158	and memory ordering
				3159	stronger than
				3160	unordered (this is
				3161	termed the
				3162	fence-paired-atomic).
				3163	- Ensures that all
				3164	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3165	have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3166	completed before
				3167	performing the
				3168	following
				3169	fence-paired-atomic.
				3170
				3171	Acquire-Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3172	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3173	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3174	- wavefront - local
				3175	- generic
				3176	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3177
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3178	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3179	- Must happen after
				3180	any preceding
				3181	local/generic
				3182	load/store/load
				3183	atomic/store
				3184	atomic/atomicrmw.
				3185	- Must happen before
				3186	the following
				3187	atomicrmw.
				3188	- Ensures that all
				3189	memory operations
				3190	to local have
				3191	completed before
				3192	performing the
				3193	atomicrmw that is
				3194	being released.
				3195
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3196	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3197	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3198	2. s_waitcnt lgkmcnt(0)
				3199
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3200	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3201	- Must happen before
				3202	any following
				3203	global/generic
				3204	load/load
				3205	atomic/store/store
				3206	atomic/atomicrmw.
				3207	- Ensures any
				3208	following global
				3209	data read is no
                                                             older than the
                                                             atomicrmw value
                                                             being acquired.
				3213
				3214	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3215
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3216	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3217	- Must happen after
				3218	any preceding
				3219	local/generic
				3220	load/store/load
				3221	atomic/store
				3222	atomic/atomicrmw.
				3223	- Must happen before
				3224	the following
				3225	atomicrmw.
				3226	- Ensures that all
				3227	memory operations
				3228	to local have
				3229	completed before
				3230	performing the
				3231	atomicrmw that is
				3232	being released.
				3233
				3234	2. flat_atomic
				3235	3. s_waitcnt lgkmcnt(0)
				3236
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3237	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3238	- Must happen before
				3239	any following
				3240	global/generic
				3241	load/load
				3242	atomic/store/store
				3243	atomic/atomicrmw.
				3244	- Ensures any
				3245	following global
				3246	data read is no
                                                             older than the
                                                             atomicrmw value
                                                             being acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3250
				3251	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3252	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3253
				3254	- If OpenCL, omit
				3255	lgkmcnt(0).
				3256	- Could be split into
				3257	separate s_waitcnt
				3258	vmcnt(0) and
				3259	s_waitcnt
				3260	lgkmcnt(0) to allow
				3261	them to be
				3262	independently moved
				3263	according to the
				3264	following rules.
				3265	- s_waitcnt vmcnt(0)
				3266	must happen after
				3267	any preceding
				3268	global/generic
				3269	load/store/load
				3270	atomic/store
				3271	atomic/atomicrmw.
				3272	- s_waitcnt lgkmcnt(0)
				3273	must happen after
				3274	any preceding
				3275	local/generic
				3276	load/store/load
				3277	atomic/store
				3278	atomic/atomicrmw.
				3279	- Must happen before
				3280	the following
				3281	atomicrmw.
				3282	- Ensures that all
				3283	memory operations
				3284	to global have
				3285	completed before
				3286	performing the
				3287	atomicrmw that is
				3288	being released.
				3289
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3290	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3291	3. s_waitcnt vmcnt(0)
				3292
				3293	- Must happen before
				3294	following
				3295	buffer_wbinvl1_vol.
				3296	- Ensures the
				3297	atomicrmw has
				3298	completed before
				3299	invalidating the
				3300	cache.
				3301
				3302	4. buffer_wbinvl1_vol
				3303
				3304	- Must happen before
				3305	any following
				3306	global/generic
				3307	load/load
				3308	atomic/atomicrmw.
				3309	- Ensures that
				3310	following loads
				3311	will not see stale
				3312	global data.
				3313
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3314	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3315	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3316
				3317	- If OpenCL, omit
				3318	lgkmcnt(0).
				3319	- Could be split into
				3320	separate s_waitcnt
				3321	vmcnt(0) and
				3322	s_waitcnt
				3323	lgkmcnt(0) to allow
				3324	them to be
				3325	independently moved
				3326	according to the
				3327	following rules.
				3328	- s_waitcnt vmcnt(0)
				3329	must happen after
				3330	any preceding
				3331	global/generic
				3332	load/store/load
				3333	atomic/store
				3334	atomic/atomicrmw.
				3335	- s_waitcnt lgkmcnt(0)
				3336	must happen after
				3337	any preceding
				3338	local/generic
				3339	load/store/load
				3340	atomic/store
				3341	atomic/atomicrmw.
				3342	- Must happen before
				3343	the following
				3344	atomicrmw.
				3345	- Ensures that all
				3346	memory operations
				3347	to global have
				3348	completed before
				3349	performing the
				3350	atomicrmw that is
				3351	being released.
				3352
				3353	2. flat_atomic
				3354	3. s_waitcnt vmcnt(0) &
				3355	lgkmcnt(0)
				3356
				3357	- If OpenCL, omit
				3358	lgkmcnt(0).
				3359	- Must happen before
				3360	following
				3361	buffer_wbinvl1_vol.
				3362	- Ensures the
				3363	atomicrmw has
				3364	completed before
				3365	invalidating the
				3366	cache.
				3367
				3368	4. buffer_wbinvl1_vol
				3369
				3370	- Must happen before
				3371	any following
				3372	global/generic
				3373	load/load
				3374	atomic/atomicrmw.
				3375	- Ensures that
				3376	following loads
				3377	will not see stale
				3378	global data.
				3379
				3380	fence acq_rel - singlethread none none
				3381	- wavefront
				3382	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3383
				3384	- If OpenCL and
				3385	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3386	not generic, omit.
				3387	- However,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3388	since LLVM
				3389	currently has no
				3390	address space on
                                                             the fence, it must
                                                             conservatively
                                                             always be generated
				3394	(see comment for
				3395	previous fence).
				3396	- Must happen after
				3397	any preceding
				3398	local/generic
				3399	load/load
				3400	atomic/store/store
				3401	atomic/atomicrmw.
				3402	- Must happen before
				3403	any following
				3404	global/generic
				3405	load/load
				3406	atomic/store/store
				3407	atomic/atomicrmw.
				3408	- Ensures that all
				3409	memory operations
				3410	to local have
				3411	completed before
				3412	performing any
				3413	following global
				3414	memory operations.
				3415	- Ensures that the
				3416	preceding
				3417	local/generic load
				3418	atomic/atomicrmw
				3419	with an equal or
				3420	wider sync scope
				3421	and memory ordering
				3422	stronger than
				3423	unordered (this is
				3424	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3425	acquire-fence-paired-atomic
				3426	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3427	before following
				3428	global memory
				3429	operations. This
				3430	satisfies the
				3431	requirements of
				3432	acquire.
				3433	- Ensures that all
				3434	previous memory
				3435	operations have
				3436	completed before a
				3437	following
				3438	local/generic store
				3439	atomic/atomicrmw
				3440	with an equal or
				3441	wider sync scope
				3442	and memory ordering
				3443	stronger than
				3444	unordered (this is
				3445	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3446	release-fence-paired-atomic
				3447	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3448	requirements of
				3449	release.
				3450
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3451	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				3452	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3453
				3454	- If OpenCL and
				3455	address space is
				3456	not generic, omit
				3457	lgkmcnt(0).
- However, since LLVM
currently has no
address space on
the fence, it must
conservatively
always be generated
(see comment for
previous fence).
				3466	- Could be split into
				3467	separate s_waitcnt
				3468	vmcnt(0) and
				3469	s_waitcnt
				3470	lgkmcnt(0) to allow
				3471	them to be
				3472	independently moved
				3473	according to the
				3474	following rules.
				3475	- s_waitcnt vmcnt(0)
				3476	must happen after
				3477	any preceding
				3478	global/generic
				3479	load/store/load
				3480	atomic/store
				3481	atomic/atomicrmw.
				3482	- s_waitcnt lgkmcnt(0)
				3483	must happen after
				3484	any preceding
				3485	local/generic
				3486	load/store/load
				3487	atomic/store
				3488	atomic/atomicrmw.
				3489	- Must happen before
				3490	the following
				3491	buffer_wbinvl1_vol.
				3492	- Ensures that the
				3493	preceding
				3494	global/local/generic
				3495	load
				3496	atomic/atomicrmw
				3497	with an equal or
				3498	wider sync scope
				3499	and memory ordering
				3500	stronger than
				3501	unordered (this is
				3502	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3503	acquire-fence-paired-atomic
				3504	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3505	before invalidating
				3506	the cache. This
				3507	satisfies the
				3508	requirements of
				3509	acquire.
				3510	- Ensures that all
				3511	previous memory
				3512	operations have
				3513	completed before a
				3514	following
				3515	global/local/generic
				3516	store
				3517	atomic/atomicrmw
				3518	with an equal or
				3519	wider sync scope
				3520	and memory ordering
				3521	stronger than
				3522	unordered (this is
				3523	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3524	release-fence-paired-atomic
				3525	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3526	requirements of
				3527	release.
				3528
				3529	2. buffer_wbinvl1_vol
				3530
				3531	- Must happen before
				3532	any following
				3533	global/generic
				3534	load/load
				3535	atomic/store/store
				3536	atomic/atomicrmw.
				3537	- Ensures that
				3538	following loads
				3539	will not see stale
				3540	global data. This
				3541	satisfies the
				3542	requirements of
				3543	acquire.
				3544
				3545	Sequential Consistent Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3546	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3547	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3548	- wavefront - local load atomic acquire,
- generic except must generate
				3550	all instructions even
				3551	for OpenCL.*
				3552	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3553	- generic
				3554	- Must
				3555	happen after
				3556	preceding
				3557	global/generic load
				3558	atomic/store
				3559	atomic/atomicrmw
				3560	with memory
				3561	ordering of seq_cst
				3562	and with equal or
				3563	wider sync scope.
				3564	(Note that seq_cst
				3565	fences have their
				3566	own s_waitcnt
				3567	lgkmcnt(0) and so do
				3568	not need to be
				3569	considered.)
				3570	- Ensures any
				3571	preceding
sequentially
				3573	consistent local
				3574	memory instructions
				3575	have completed
				3576	before executing
				3577	this sequentially
				3578	consistent
				3579	instruction. This
				3580	prevents reordering
				3581	a seq_cst store
				3582	followed by a
				3583	seq_cst load. (Note
				3584	that seq_cst is
				3585	stronger than
				3586	acquire/release as
				3587	the reordering of
				3588	load acquire
				3589	followed by a store
				3590	release is
				3591	prevented by the
				3592	waitcnt of
				3593	the release, but
				3594	there is nothing
				3595	preventing a store
				3596	release followed by
				3597	load acquire from
completing out of
				3599	order.)
				3600
				3601	2. *Following
				3602	instructions same as
				3603	corresponding load
				3604	atomic acquire,
except must generate
				3606	all instructions even
				3607	for OpenCL.*
				3608	load atomic seq_cst - workgroup - local *Same as corresponding
				3609	load atomic acquire,
except must generate
				3611	all instructions even
				3612	for OpenCL.*
				3613	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				3614	- system - generic vmcnt(0)
				3615
				3616	- Could be split into
				3617	separate s_waitcnt
				3618	vmcnt(0)
				3619	and s_waitcnt
				3620	lgkmcnt(0) to allow
				3621	them to be
				3622	independently moved
				3623	according to the
				3624	following rules.
- s_waitcnt lgkmcnt(0)
				3626	must happen after
				3627	preceding
				3628	global/generic load
				3629	atomic/store
				3630	atomic/atomicrmw
				3631	with memory
				3632	ordering of seq_cst
				3633	and with equal or
				3634	wider sync scope.
				3635	(Note that seq_cst
				3636	fences have their
				3637	own s_waitcnt
				3638	lgkmcnt(0) and so do
				3639	not need to be
				3640	considered.)
- s_waitcnt vmcnt(0)
				3642	must happen after
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3643	preceding
				3644	global/generic load
				3645	atomic/store
				3646	atomic/atomicrmw
				3647	with memory
				3648	ordering of seq_cst
				3649	and with equal or
				3650	wider sync scope.
				3651	(Note that seq_cst
				3652	fences have their
				3653	own s_waitcnt
				3654	vmcnt(0) and so do
				3655	not need to be
				3656	considered.)
				3657	- Ensures any
				3658	preceding
sequentially
				3660	consistent global
				3661	memory instructions
				3662	have completed
				3663	before executing
				3664	this sequentially
				3665	consistent
				3666	instruction. This
				3667	prevents reordering
				3668	a seq_cst store
				3669	followed by a
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3670	seq_cst load. (Note
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3671	that seq_cst is
				3672	stronger than
				3673	acquire/release as
				3674	the reordering of
				3675	load acquire
				3676	followed by a store
				3677	release is
				3678	prevented by the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3679	waitcnt of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3680	the release, but
				3681	there is nothing
				3682	preventing a store
				3683	release followed by
				3684	load acquire from
completing out of
				3686	order.)
				3687
				3688	2. *Following
				3689	instructions same as
				3690	corresponding load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3691	atomic acquire,
except must generate
				3693	all instructions even
				3694	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3695	store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3696	- wavefront - local store atomic release,
- workgroup - generic except must generate
				3698	all instructions even
				3699	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3700	store atomic seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3701	- system - generic store atomic release,
except must generate
				3703	all instructions even
				3704	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3705	atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3706	- wavefront - local atomicrmw acq_rel,
- workgroup - generic except must generate
				3708	all instructions even
				3709	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3710	atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3711	- system - generic atomicrmw acq_rel,
except must generate
				3713	all instructions even
				3714	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3715	fence seq_cst - singlethread none *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3716	- wavefront fence acq_rel,
- workgroup except must generate
				3718	- agent all instructions even
				3719	- system for OpenCL.*
				3720	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3721
The memory order also adds the single thread optimization constraints defined in
table
:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3725
				3726	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3727	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3728
				3729	============ ==============================================================
				3730	LLVM Memory Optimization Constraints
				3731	Ordering
				3732	============ ==============================================================
				3733	unordered none
				3734	monotonic none
				3735	acquire - If a load atomic/atomicrmw then no following load/load
atomic/store/store atomic/atomicrmw/fence instruction can
				3737	be moved before the acquire.
				3738	- If a fence then same as load atomic, plus no preceding
				3739	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3740	release - If a store atomic/atomicrmw then no preceding load/load
atomic/store/store atomic/atomicrmw/fence instruction can
				3742	be moved after the release.
				3743	- If a fence then same as store atomic, plus no following
				3744	associated fence-paired-atomic can be moved before the
				3745	fence.
				3746	acq_rel Same constraints as both acquire and release.
				3747	seq_cst - If a load atomic then same constraints as acquire, plus no
				3748	preceding sequentially consistent load atomic/store
				3749	atomic/atomicrmw/fence instruction can be moved after the
				3750	seq_cst.
				3751	- If a store atomic then the same constraints as release, plus
				3752	no following sequentially consistent load atomic/store
				3753	atomic/atomicrmw/fence instruction can be moved before the
				3754	seq_cst.
				3755	- If an atomicrmw/fence then same constraints as acq_rel.
				3756	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3757
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3758	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3759	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3760
For code objects generated by the AMDGPU backend for HSA [HSA]_ compatible runtimes
				3762	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3763	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3764
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3765	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3766	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3767
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3768	=================== =============== =============== =======================
				3769	Usage Code Sequence Trap Handler Description
				3770	Inputs
				3771	=================== =============== =============== =======================
				3772	reserved ``s_trap 0x00`` Reserved by hardware.
				3773	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3774	``queue_ptr`` ``debugtrap``
				3775	``VGPR0``: intrinsic (not
				3776	``arg`` implemented).
				3777	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3778	``queue_ptr`` terminated and its
				3779	associated queue put
				3780	into the error state.
				3781	``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
				3782	``queue_ptr`` installed handled
				3783	same as ``llvm.trap``.
				3784	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
				3785	breakpoints.
				3786	debugger ``s_trap 0x08`` Reserved for debugger.
				3787	debugger ``s_trap 0xfe`` Reserved for debugger.
				3788	debugger ``s_trap 0xff`` Reserved for debugger.
				3789	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3790
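For example, an ``llvm.trap`` could be issued with a sequence along the
following lines. This is a sketch only: it assumes that the ``queue_ptr`` is
held in ``s[4:5]``, which in practice depends on how the kernel's SGPRs were
set up:

.. code-block:: nasm

  s_mov_b64 s[0:1], s[4:5]  ; Pass queue_ptr in SGPR0-1 (s[4:5] is an assumed location).
  s_trap 2                  ; llvm.trap: terminate the dispatch and put its queue into the error state.
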
Tim Corringham	af2dfc6	2018-04-04 13:02:09 +0000	[diff] [blame]	3791	AMDPAL
				3792	------
				3793
				3794	This section provides code conventions used when the target triple OS is
				3795	``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
				3796	from the application/runtime to each invocation of a hardware shader. These
				3797	parameters include both generic, application-controlled parameters called
				3798	user data as well as system-generated parameters that are a product of the
				3799	draw or dispatch execution.
				3800
				3801	User Data
				3802	~~~~~~~~~
				3803
				3804	Each hardware stage has a set of 32-bit user data registers which can be
				3805	written from a command buffer and then loaded into SGPRs when waves are launched
				3806	via a subsequent dispatch or draw operation. This is the way most arguments are
				3807	passed from the application/runtime to a hardware shader.
				3808
				3809	Compute User Data
				3810	~~~~~~~~~~~~~~~~~
				3811
Compute shader user data mappings are simpler than those of graphics shaders,
and are fixed.
				3814
				3815	Note that there are always 10 available user data entries in registers -
				3816	entries beyond that limit must be fetched from memory (via the spill table
				3817	pointer) by the shader.
				3818
				3819	.. table:: PAL Compute Shader User Data Registers
				3820	:name: pal-compute-user-data-registers
				3821
				3822	============= ================================
				3823	User Register Description
				3824	============= ================================
				3825	0 Global Internal Table (32-bit pointer)
				3826	1 Per-Shader Internal Table (32-bit pointer)
				3827	2 - 11 Application-Controlled User Data (10 32-bit values)
				3828	12 Spill Table (32-bit pointer)
				3829	13 - 14 Thread Group Count (64-bit pointer)
				3830	15 GDS Range
				3831	============= ================================
				3832
				3833	Graphics User Data
				3834	~~~~~~~~~~~~~~~~~~
				3835
				3836	Graphics pipelines support a much more flexible user data mapping:
				3837
				3838	.. table:: PAL Graphics Shader User Data Registers
				3839	:name: pal-graphics-user-data-registers
				3840
				3841	============= ================================
				3842	User Register Description
				3843	============= ================================
				3844	0 Global Internal Table (32-bit pointer)
				3845	+ Per-Shader Internal Table (32-bit pointer)
				3846	+ 1-15 Application Controlled User Data
				3847	(1-15 Contiguous 32-bit Values in Registers)
				3848	+ Spill Table (32-bit pointer)
				3849	+ Draw Index (First Stage Only)
				3850	+ Vertex Offset (First Stage Only)
				3851	+ Instance Offset (First Stage Only)
				3852	============= ================================
				3853
				3854	The placement of the global internal table remains fixed in the first *user
				3855	data SGPR register*. Otherwise all parameters are optional, and can be mapped
to any desired user data SGPR register, with the following restrictions:
				3857
				3858	* Draw Index, Vertex Offset, and Instance Offset can only be used by the first
  active hardware stage in a graphics pipeline (i.e. where the API vertex
				3860	shader runs).
				3861
				3862	* Application-controlled user data must be mapped into a contiguous range of
				3863	user data registers.
				3864
				3865	* The application-controlled user data range supports compaction remapping, so
				3866	only entries that are actually consumed by the shader must be assigned to
				3867	corresponding registers. Note that in order to support an efficient runtime
				3868	implementation, the remapping must pack registers in the same order as
				3869	entries, with unused entries removed.
				3870
				3871	.. _pal_global_internal_table:
				3872
				3873	Global Internal Table
				3874	~~~~~~~~~~~~~~~~~~~~~
				3875
				3876	The global internal table is a table of shader resource descriptors (SRDs) that
				3877	define how certain engine-wide, runtime-managed resources should be accessed
				3878	from a shader. The majority of these resources have HW-defined formats, and it
				3879	is up to the compiler to write/read data as required by the target hardware.
				3880
				3881	The following table illustrates the required format:
				3882
				3883	.. table:: PAL Global Internal Table
				3884	:name: pal-git-table
				3885
				3886	============= ================================
				3887	Offset Description
				3888	============= ================================
				3889	0-3 Graphics Scratch SRD
				3890	4-7 Compute Scratch SRD
				3891	8-11 ES/GS Ring Output SRD
				3892	12-15 ES/GS Ring Input SRD
				3893	16-19 GS/VS Ring Output #0
				3894	20-23 GS/VS Ring Output #1
				3895	24-27 GS/VS Ring Output #2
				3896	28-31 GS/VS Ring Output #3
				3897	32-35 GS/VS Ring Input SRD
				3898	36-39 Tessellation Factor Buffer SRD
				3899	40-43 Off-Chip LDS Buffer SRD
				3900	44-47 Off-Chip Param Cache Buffer SRD
				3901	48-51 Sample Position Buffer SRD
				3902	52 vaRange::ShadowDescriptorTable High Bits
				3903	============= ================================
				3904
				3905	The pointer to the global internal table passed to the shader as user data
				3906	is a 32-bit pointer. The top 32 bits should be assumed to be the same as
				3907	the top 32 bits of the pipeline, so the shader may use the program
				3908	counter's top 32 bits.
				3909
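A shader could form the full 64-bit table address and load one of the SRDs
roughly as sketched below. The register choices are hypothetical: the sketch
assumes that user data register 0 arrives in ``s0``, that the offsets in the
table above are in dwords, and that the target uses byte offsets for scalar
loads (as on GFX8 and GFX9):

.. code-block:: nasm

  s_getpc_b64 s[2:3]                   ; s3 now holds the top 32 bits of the program counter.
  s_mov_b32 s2, s0                     ; Low 32 bits: table pointer from user data (assumed to be in s0).
  s_load_dwordx4 s[4:7], s[2:3], 0x10  ; Compute Scratch SRD: table offsets 4-7, assumed byte offset 16.
  s_waitcnt lgkmcnt(0)                 ; Wait for the scalar load before using s[4:7].
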
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3910	Unspecified OS
				3911	--------------
				3912
				3913	This section provides code conventions used when the target triple OS is
				3914	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3915
				3916	Trap Handler ABI
				3917	~~~~~~~~~~~~~~~~
				3918
For code objects generated by the AMDGPU backend for a non-amdhsa OS, the runtime
does not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
intrinsics are handled as follows:
				3922
				3923	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3924	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3925
				3926	=============== =============== ===========================================
				3927	Usage Code Sequence Description
				3928	=============== =============== ===========================================
				3929	llvm.trap s_endpgm Causes wavefront to be terminated.
				3930	llvm.debugtrap none Compiler warning given that there is no
				3931	trap handler installed.
				3932	=============== =============== ===========================================
				3933
				3934	Source Languages
				3935	================
				3936
				3937	.. _amdgpu-opencl:
				3938
				3939	OpenCL
				3940	------
				3941
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3942	When the language is OpenCL the following differences occur:
				3943
				3944	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3945	2. The AMDGPU backend appends additional arguments to the kernel's explicit
				3946	arguments for the AMDHSA OS (see
				3947	:ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3948	3. Additional metadata is generated
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3949	(see :ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3950
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3951	.. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
				3952	:name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
				3953
				3954	======== ==== ========= ===========================================
				3955	Position Byte Byte Description
				3956	Size Alignment
				3957	======== ==== ========= ===========================================
Tony Tye	88441a3	2018-03-23 18:58:47 +0000	[diff] [blame]	3958	1 8 8 OpenCL Global Offset X
				3959	2 8 8 OpenCL Global Offset Y
				3960	3 8 8 OpenCL Global Offset Z
				3961	4 8 8 OpenCL address of printf buffer
				3962	5 8 8 OpenCL address of virtual queue used by
				3963	enqueue_kernel.
				3964	6 8 8 OpenCL address of AqlWrap struct used by
				3965	enqueue_kernel.
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3966	======== ==== ========= ===========================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3967
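As an illustration of the layout in the table above, consider a hypothetical
kernel whose explicit arguments occupy exactly 8 bytes (for example a single
global pointer). Assuming the implicit arguments start immediately after them,
that the kernarg segment pointer is in ``s[4:5]``, and that scalar load offsets
are in bytes (as on GFX8 and GFX9), the global offsets could be loaded as
follows:

.. code-block:: nasm

  s_load_dwordx2 s[0:1], s[4:5], 0x8   ; Position 1: OpenCL Global Offset X (bytes 8-15).
  s_load_dwordx2 s[2:3], s[4:5], 0x10  ; Position 2: OpenCL Global Offset Y (bytes 16-23).
  s_load_dwordx2 s[6:7], s[4:5], 0x18  ; Position 3: OpenCL Global Offset Z (bytes 24-31).
  s_waitcnt lgkmcnt(0)                 ; Wait for the scalar loads to complete.
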
				3968	.. _amdgpu-hcc:
				3969
				3970	HCC
				3971	---
				3972
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3973	When the language is HCC the following differences occur:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3974
				3975	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3976
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3977	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3978	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3979
The AMDGPU backend has an LLVM-MC based assembler, which is currently in development.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	3981	It supports AMDGCN GFX6-GFX9.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3982
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	3983	This section describes general syntax for instructions and operands.
				3984
				3985	Instructions
				3986	~~~~~~~~~~~~
				3987
				3988	.. toctree::
				3989	:hidden:
				3990
				3991	AMDGPUAsmGFX7
				3992	AMDGPUAsmGFX8
				3993	AMDGPUAsmGFX9
				3994	AMDGPUOperandSyntax
				3995
				3996	An instruction has the following syntax:
				3997
				3998	<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...
				3999
				4000	Note that operands are normally comma-separated while modifiers are space-separated.
				4001
				4002	The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
				4003
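The syntax rules above can be seen in the following annotated instructions,
which are taken from the examples later in this section. The column layout is
only for illustration; the assembler does not require it:

.. code-block:: nasm

  ; opcode              operands (comma-separated)  modifiers (space-separated)
  buffer_store_dwordx4  v[1:4], v2, ttmp[4:7], s1   offen offset:4 glc tfe
  ds_add_u32            v2, v4                      offset:16
  s_endpgm                                          ; No operands and no modifiers.
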
				4004	See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
				4005	:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
				4006
				4007	Note that features under development are not included in this description.
				4008
				4009	For more information about instructions, their semantics and supported combinations of
operands, refer to one of the instruction set architecture manuals
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4011	[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4012
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4013	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4014	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4015
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4016	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4017
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4018	* SGPR registers: s0, ... or s[0], ...
				4019	* VGPR registers: v0, ... or v[0], ...
				4020	* TTMP registers: ttmp0, ... or ttmp[0], ...
				4021	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				4022	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				4023	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				4024	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				4025	* Register index expressions: v[2*2], s[1-1:2-1]
				4026	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4027
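The following sketch shows several of these forms in use. The instructions were
chosen only to illustrate the operand syntax and are not a meaningful program:

.. code-block:: nasm

  s_mov_b32 s1, s2                       ; Plain register names.
  s_mov_b32 s[1], s[2]                   ; The same instruction using the bracket syntax.
  s_mov_b64 s[2:3], exec                 ; A register pair and a special register.
  s_mov_b64 [s2, s3], exec               ; The same pair written as a register list.
  v_mov_b32 v[2*2], v1                   ; Index expression: refers to v4.
  s_mov_b64 s[1-1:2-1], vcc              ; Index expressions: refers to s[0:1].
  buffer_load_dword v1, off, s[4:7], s1  ; 'off' marks the address operand as not enabled.
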
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4028	Modifiers
				4029	~~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4030
A detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4032
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4033	Instruction Examples
				4034	~~~~~~~~~~~~~~~~~~~~
				4035
				4036	DS
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4037	++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4038
				4039	.. code-block:: nasm
				4040
				4041	ds_add_u32 v2, v4 offset:16
				4042	ds_write_src2_b64 v2 offset0:4 offset1:8
				4043	ds_cmpst_f32 v2, v4, v6
				4044	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				4045
				4046
For the full list of supported instructions, refer to "LDS/GDS instructions" in the ISA Manual.
				4048
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4049	FLAT
				4050	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4051
				4052	.. code-block:: nasm
				4053
				4054	flat_load_dword v1, v[3:4]
				4055	flat_store_dwordx3 v[3:4], v[5:7]
				4056	flat_atomic_swap v1, v[3:4], v5 glc
				4057	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				4058	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				4059
For the full list of supported instructions, refer to "FLAT instructions" in the ISA Manual.
				4061
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4062	MUBUF
				4063	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4064
				4065	.. code-block:: nasm
				4066
				4067	buffer_load_dword v1, off, s[4:7], s1
				4068	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				4069	buffer_store_format_xy v[1:2], off, s[4:7], s1
				4070	buffer_wbinvl1
				4071	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				4072
For the full list of supported instructions, refer to "MUBUF Instructions" in the ISA Manual.
				4074
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4075	SMRD/SMEM
				4076	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4077
				4078	.. code-block:: nasm
				4079
				4080	s_load_dword s1, s[2:3], 0xfc
				4081	s_load_dwordx8 s[8:15], s[2:3], s4
				4082	s_load_dwordx16 s[88:103], s[2:3], s4
				4083	s_dcache_inv_vol
				4084	s_memtime s[4:5]
				4085
For the full list of supported instructions, refer to "Scalar Memory Operations" in the ISA Manual.
				4087
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4088	SOP1
				4089	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4090
				4091	.. code-block:: nasm
				4092
				4093	s_mov_b32 s1, s2
				4094	s_mov_b64 s[0:1], 0x80000000
				4095	s_cmov_b32 s1, 200
				4096	s_wqm_b64 s[2:3], s[4:5]
				4097	s_bcnt0_i32_b64 s1, s[2:3]
				4098	s_swappc_b64 s[2:3], s[4:5]
				4099	s_cbranch_join s[4:5]
				4100
For the full list of supported instructions, refer to "SOP1 Instructions" in the ISA Manual.
				4102
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4103	SOP2
				4104	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4105
				4106	.. code-block:: nasm
				4107
				4108	s_add_u32 s1, s2, s3
				4109	s_and_b64 s[2:3], s[4:5], s[6:7]
				4110	s_cselect_b32 s1, s2, s3
				4111	s_andn2_b32 s2, s4, s6
				4112	s_lshr_b64 s[2:3], s[4:5], s6
				4113	s_ashr_i32 s2, s4, s6
				4114	s_bfm_b64 s[2:3], s4, s6
				4115	s_bfe_i64 s[2:3], s[4:5], s6
				4116	s_cbranch_g_fork s[4:5], s[6:7]
				4117
For the full list of supported instructions, refer to "SOP2 Instructions" in the ISA Manual.
				4119
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4120	SOPC
				4121	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4122
				4123	.. code-block:: nasm
				4124
				4125	s_cmp_eq_i32 s1, s2
				4126	s_bitcmp1_b32 s1, s2
				4127	s_bitcmp0_b64 s[2:3], s4
				4128	s_setvskip s3, s5
				4129
For the full list of supported instructions, refer to "SOPC Instructions" in the ISA Manual.
				4131
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4132	SOPP
				4133	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4134
				4135	.. code-block:: nasm
				4136
				4137	s_barrier
				4138	s_nop 2
				4139	s_endpgm
				4140	s_waitcnt 0 ; Wait for all counters to be 0
				4141	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
  s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be at most 1.
				4143	s_sethalt 9
				4144	s_sleep 10
				4145	s_sendmsg 0x1
				4146	s_sendmsg sendmsg(MSG_INTERRUPT)
				4147	s_trap 1
				4148
For the full list of supported instructions, refer to "SOPP Instructions" in the ISA Manual.
				4150
				4151	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	4152	of SOPP Instructions, so it is up to the programmer to be familiar with the
range of acceptable values.
				4154
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4155	VALU
				4156	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4157
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4158	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
the assembler automatically uses the optimal encoding based on the operands.
To force a specific encoding, add a suffix to the opcode of the instruction:
				4161
				4162	* _e32 for 32-bit VOP1/VOP2/VOPC
				4163	* _e64 for 64-bit VOP3
				4164	* _dpp for VOP_DPP
				4165	* _sdwa for VOP_SDWA
				4166
VOP1/VOP2/VOP3/VOPC examples:

.. code-block:: nasm

  v_mov_b32 v1, v2
  v_mov_b32_e32 v1, v2
  v_nop
  v_cvt_f64_i32_e32 v[1:2], v2
  v_floor_f32_e32 v1, v2
  v_bfrev_b32_e32 v1, v2
  v_add_f32_e32 v1, v2, v3
  v_mul_i32_i24_e64 v1, v2, 3
  v_mul_i32_i24_e32 v1, -3, v3
  v_mul_i32_i24_e32 v1, -100, v3
  v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
  v_max_f16_e32 v1, v2, v3

VOP_DPP examples:

.. code-block:: nasm

  v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
  v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_mov_b32 v0, v0 wave_shl:1
  v_mov_b32 v0, v0 row_mirror
  v_mov_b32 v0, v0 row_bcast:31
  v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_add_f32 v0, v0, |v0| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
  v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0

VOP_SDWA examples:

.. code-block:: nasm

  v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
  v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
  v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_fract_f32 v0, |v0| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
  v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0

For the full list of supported instructions, refer to "Vector ALU instructions" in the ISA Manual.

HSA Code Object Directives
~~~~~~~~~~~~~~~~~~~~~~~~~~

The AMDGPU ABI defines auxiliary data in the output code object. In assembly
source, this data can be specified with assembler directives.

.hsa_code_object_version major, minor
+++++++++++++++++++++++++++++++++++++

``major`` and ``minor`` are integers that specify the version of the HSA code
object that will be generated by the assembler.

.hsa_code_object_isa [major, minor, stepping, vendor, arch]
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

``major``, ``minor``, and ``stepping`` are all integers that describe the
instruction set architecture (ISA) version of the assembly program.

``vendor`` and ``arch`` are quoted strings. ``vendor`` should always be equal to
"AMD" and ``arch`` should always be equal to "AMDGPU".

By default, the assembler derives the ISA version, vendor, and arch
from the value of the ``-mcpu`` option that is passed to the assembler.
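
A minimal sketch of the two forms is shown below. They are alternatives and are
not meant to appear together in one source file. The version triple ``8,0,3``
is only an assumed example value and should be replaced with the values for the
actual target; the vendor and arch strings are the fixed values described above.

.. code-block:: nasm

  ; Implicit form: ISA version, vendor, and arch are derived from -mcpu.
  .hsa_code_object_isa

  ; Explicit form: all five operands given (8,0,3 is an illustrative triple).
  .hsa_code_object_isa 8,0,3,"AMD","AMDGPU"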

.amdgpu_hsa_kernel (name)
+++++++++++++++++++++++++

This directive specifies that the symbol with the given name is a kernel entry
point (label) and that the object should contain a corresponding symbol of type
``STT_AMDGPU_HSA_KERNEL``.

.amd_kernel_code_t
++++++++++++++++++

This directive marks the beginning of a list of key / value pairs that are used
to specify the ``amd_kernel_code_t`` object that will be emitted by the assembler.
The list must be terminated by the ``.end_amd_kernel_code_t`` directive. A
default value is used for any ``amd_kernel_code_t`` value that is left
unspecified. The default value for all keys is 0, with the following exceptions:

- ``kernel_code_version_major`` defaults to 1.
- ``machine_kind`` defaults to 1.
- ``machine_version_major``, ``machine_version_minor``, and
  ``machine_version_stepping`` are derived from the value of the ``-mcpu``
  option that is passed to the assembler.
- ``kernel_code_entry_byte_offset`` defaults to 256.
- ``wavefront_size`` defaults to 6.
- ``kernarg_segment_alignment``, ``group_segment_alignment``, and
  ``private_segment_alignment`` default to 4. Note that alignments are specified
  as a power of two, so a value of n means an alignment of 2^n.
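
As an illustration of the power-of-two convention, the fragment below overrides
the alignment defaults. The values are arbitrary examples chosen for this
sketch, not recommendations; all other keys are omitted and so take their
defaults.

.. code-block:: nasm

  .amd_kernel_code_t
    kernarg_segment_alignment = 5   ; alignment of 2^5 = 32
    group_segment_alignment = 4     ; alignment of 2^4 = 16 (same as the default)
    private_segment_alignment = 6   ; alignment of 2^6 = 64
  .end_amd_kernel_code_t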

The ``.amd_kernel_code_t`` directive must be placed immediately after the
function label and before any instructions.

For a full list of ``amd_kernel_code_t`` keys, refer to the AMDGPU ABI document,
the comments in ``lib/Target/AMDGPU/AmdKernelCodeT.h``, and
``test/CodeGen/AMDGPU/hsa.s``.

Here is an example of a minimal ``amd_kernel_code_t`` specification:

.. code-block:: none

  .hsa_code_object_version 1,0
  .hsa_code_object_isa

  .hsatext
  .globl hello_world
  .p2align 8
  .amdgpu_hsa_kernel hello_world

  hello_world:

    .amd_kernel_code_t
      enable_sgpr_kernarg_segment_ptr = 1
      is_ptr64 = 1
      compute_pgm_rsrc1_vgprs = 0
      compute_pgm_rsrc1_sgprs = 0
      compute_pgm_rsrc2_user_sgpr = 2
      kernarg_segment_byte_size = 8
      wavefront_sgpr_count = 2
      workitem_vgpr_count = 3
    .end_amd_kernel_code_t

    s_load_dwordx2 s[0:1], s[0:1] 0x0
    v_mov_b32 v0, 3.14159
    s_waitcnt lgkmcnt(0)
    v_mov_b32 v1, s0
    v_mov_b32 v2, s1
    flat_store_dword v[1:2], v0
    s_endpgm
  .Lfunc_end0:
    .size hello_world, .Lfunc_end0-hello_world

Additional Documentation
========================

.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`__
.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__