Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: d6255c8a15eaa60d155872d2f2062a2b637bf82e [file] [log] [blame]

Eugene Zelenko	3507b04	2018-03-21 17:09:35 +0000	[diff] [blame]	1	=============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	26	.. table:: AMDGPU Architectures
				27	:name: amdgpu-architecture-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	28
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	29	============ ==============================================================
				30	Architecture Description
				31	============ ==============================================================
				32	``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
				33	``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
				34	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	35
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	36	.. table:: AMDGPU Vendors
				37	:name: amdgpu-vendor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	38
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	39	============ ==============================================================
				40	Vendor Description
				41	============ ==============================================================
				42	``amd`` Can be used for all AMD GPU usage.
				43	``mesa3d`` Can be used if the OS is ``mesa3d``.
				44	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	45
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	46	.. table:: AMDGPU Operating Systems
				47	:name: amdgpu-os-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	48
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	49	============== ============================================================
				50	OS Description
				51	============== ============================================================
				52	<empty> Defaults to the unknown OS.
				53	``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
				54	such as AMD's ROCm [AMD-ROCm]_.
				55	``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
				56	runtime.
				57	``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
				58	runtime.
				59	============== ============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	60
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	61	.. table:: AMDGPU Environments
				62	:name: amdgpu-environment-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	63
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	64	============ ==============================================================
				65	Environment Description
				66	============ ==============================================================
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	67	<empty> Default.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	68	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	69
				70	.. _amdgpu-processors:
				71
				72	Processors
				73	----------
				74
				75	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				76	names from both the Processor and Alternative Processor can be used.
				77
				78	.. table:: AMDGPU Processors
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	79	:name: amdgpu-processor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	80
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	81	=========== =============== ============ ===== ========= ======= ==================
				82	Processor Alternative Target dGPU/ Target ROCm Example
				83	Processor Triple APU Features Support Products
				84	Architecture Supported
				85	[Default]
				86	=========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	87	Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	88	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	89	``r600`` ``r600`` dGPU
				90	``r630`` ``r600`` dGPU
				91	``rs880`` ``r600`` dGPU
				92	``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	93	Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	94	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	95	``rv710`` ``r600`` dGPU
				96	``rv730`` ``r600`` dGPU
				97	``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	98	Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	99	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	100	``cedar`` ``r600`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	101	``cypress`` ``r600`` dGPU
				102	``juniper`` ``r600`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	103	``redwood`` ``r600`` dGPU
				104	``sumo`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	105	Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	106	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	107	``barts`` ``r600`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	108	``caicos`` ``r600`` dGPU
				109	``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	110	``turks`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	111	GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	112	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	113	``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	114	``gfx601`` - ``hainan`` ``amdgcn`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	115	- ``oland``
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	116	- ``pitcairn``
				117	- ``verde``
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	118	GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	119	-----------------------------------------------------------------------------------
				120	``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
				121	- A6 Pro-7050B
				122	- A8-7100
				123	- A8 Pro-7150B
				124	- A10-7300
				125	- A10 Pro-7350B
				126	- FX-7500
				127	- A8-7200P
				128	- A10-7400P
				129	- FX-7600P
				130	``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
				131	- FirePro W9100
				132	- FirePro S9150
				133	- FirePro S9170
				134	``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
				135	- Radeon R9 290x
				136	- Radeon R390
				137	- Radeon R390x
				138	``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
				139	- ``mullins`` - E1-2200
				140	- E1-2500
				141	- E2-3000
				142	- E2-3800
				143	- A4-5000
				144	- A4-5100
				145	- A6-5200
				146	- A4 Pro-3340B
				147	``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
				148	- Radeon HD 8770
				149	- R7 260
				150	- R7 260X
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	151	GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	152	-----------------------------------------------------------------------------------
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	153	``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
				154	[on] - Pro A6-8500B
				155	- A8-8600P
				156	- Pro A8-8600B
				157	- FX-8800P
				158	- Pro A12-8800B
				159	\ ``amdgcn`` APU - xnack ROCm - A10-8700P
				160	[on] - Pro A10-8700B
				161	- A10-8780P
				162	\ ``amdgcn`` APU - xnack - A10-9600P
				163	[on] - A10-9630P
				164	- A12-9700P
				165	- A12-9730P
				166	- FX-9800P
				167	- FX-9830P
				168	\ ``amdgcn`` APU - xnack - E2-9010
				169	[on] - A6-9210
				170	- A9-9410
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	171	``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
				172	- ``tonga`` [off] - FirePro S7100
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	173	- FirePro W7100
				174	- Radeon R285
				175	- Radeon R9 380
				176	- Radeon R9 385
				177	- Mobile FirePro
				178	M7170
				179	``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
				180	[off] - Radeon R9 Fury
				181	- Radeon R9 FuryX
				182	- Radeon Pro Duo
				183	- FirePro S9300x2
				184	- Radeon Instinct MI8
				185	\ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
				186	[off] - Radeon RX 480
				187	- Radeon Instinct MI6
				188	\ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
				189	[off]
				190	``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
				191	[on]
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	192	GCN GFX9 [AMD-GCN-GFX9]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	193	-----------------------------------------------------------------------------------
				194	``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
				195	[off] Frontier Edition
				196	- Radeon RX Vega 56
				197	- Radeon RX Vega 64
				198	- Radeon RX Vega 64
				199	Liquid
				200	- Radeon Instinct MI25
				201	``gfx902`` ``amdgcn`` APU - xnack TBA
				202	[on]
				203	.. TODO
				204	Add product
				205	names.
				206	=========== =============== ============ ===== ========= ======= ==================
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	207
				208	.. _amdgpu-target-features:
				209
				210	Target Features
				211	---------------
				212
				213	Target features control how code is generated to support certain
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	214	processor specific features. Not all target features are supported by
				215	all processors. The runtime must ensure that the features supported by
				216	the device used to execute the code match the features enabled when
				217	generating the code. A mismatch of features may result in incorrect
				218	execution, or a reduction in performance.
				219
				220	The target features supported by each processor, and the default value
				221	used if not specified explicitly, is listed in
				222	:ref:`amdgpu-processor-table`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	223
				224	Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
				225	target features.
				226
				227	For example:
				228
				229	``-mxnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	230	Enable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	231	``-mno-xnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	232	Disable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	233
				234	.. table:: AMDGPU Target Features
				235	:name: amdgpu-target-feature-table
				236
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	237	============== ==================================================
				238	Target Feature Description
				239	============== ==================================================
				240	-m[no-]xnack Enable/disable generating code that has
				241	memory clauses that are compatible with
				242	having XNACK replay enabled.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	243
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	244	This is used for demand paging and page
				245	migration. If XNACK replay is enabled in
				246	the device, then if a page fault occurs
				247	the code may execute incorrectly if the
				248	``xnack`` feature is not enabled. Executing
				249	code that has the feature enabled on a
				250	device that does not have XNACK replay
				251	enabled will execute correctly, but may
				252	be less performant than code with the
				253	feature disabled.
				254	============== ==================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	255
				256	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	257
				258	Address Spaces
				259	--------------
				260
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	261	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	262
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	263	The memory space names used in the table, aside from the region memory space, is
				264	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	265
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	266	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	267
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	268	.. table:: Address Space Mapping
				269	:name: amdgpu-address-space-mapping-table
				270
Yaxun Liu	0124b54	2018-02-13 18:00:25 +0000	[diff] [blame]	271	================== =================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	272	LLVM Address Space Memory Space
Yaxun Liu	0124b54	2018-02-13 18:00:25 +0000	[diff] [blame]	273	================== =================
				274	0 Generic (Flat)
				275	1 Global
				276	2 Region (GDS)
				277	3 Local (group/LDS)
				278	4 Constant
				279	5 Private (Scratch)
				280	6 Constant 32-bit
				281	================== =================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	282
				283	.. _amdgpu-memory-scopes:
				284
				285	Memory Scopes
				286	-------------
				287
				288	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				289	backend memory model when the target triple OS is ``amdhsa`` (see
				290	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				291
				292	The memory model supported is based on the HSA memory model [HSA]_ which is
				293	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				294	relation is transitive over the synchonizes-with relation independent of scope,
				295	and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	296	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	297
				298	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				299	inclusion and requires the memory scopes to exactly match. However, this
				300	is conservatively correct for OpenCL.
				301
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	302	.. table:: AMDHSA LLVM Sync Scopes
				303	:name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	304
				305	================ ==========================================================
				306	LLVM Sync Scope Description
				307	================ ==========================================================
				308	none The default: ``system``.
				309
				310	Synchronizes with, and participates in modification and
				311	seq_cst total orderings with, other operations (except
				312	image operations) for all address spaces (except private,
				313	or generic that accesses private) provided the other
				314	operation's sync scope is:
				315
				316	- ``system``.
				317	- ``agent`` and executed by a thread on the same agent.
				318	- ``workgroup`` and executed by a thread in the same
				319	workgroup.
				320	- ``wavefront`` and executed by a thread in the same
				321	wavefront.
				322
				323	``agent`` Synchronizes with, and participates in modification and
				324	seq_cst total orderings with, other operations (except
				325	image operations) for all address spaces (except private,
				326	or generic that accesses private) provided the other
				327	operation's sync scope is:
				328
				329	- ``system`` or ``agent`` and executed by a thread on the
				330	same agent.
				331	- ``workgroup`` and executed by a thread in the same
				332	workgroup.
				333	- ``wavefront`` and executed by a thread in the same
				334	wavefront.
				335
				336	``workgroup`` Synchronizes with, and participates in modification and
				337	seq_cst total orderings with, other operations (except
				338	image operations) for all address spaces (except private,
				339	or generic that accesses private) provided the other
				340	operation's sync scope is:
				341
				342	- ``system``, ``agent`` or ``workgroup`` and executed by a
				343	thread in the same workgroup.
				344	- ``wavefront`` and executed by a thread in the same
				345	wavefront.
				346
				347	``wavefront`` Synchronizes with, and participates in modification and
				348	seq_cst total orderings with, other operations (except
				349	image operations) for all address spaces (except private,
				350	or generic that accesses private) provided the other
				351	operation's sync scope is:
				352
				353	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				354	and executed by a thread in the same wavefront.
				355
				356	``singlethread`` Only synchronizes with, and participates in modification
				357	and seq_cst total orderings with, other operations (except
				358	image operations) running in the same thread for all
				359	address spaces (for example, in signal handlers).
				360	================ ==========================================================
				361
				362	AMDGPU Intrinsics
				363	-----------------
				364
				365	The AMDGPU backend implements the following intrinsics.
				366
				367	This section is WIP.
				368
				369	.. TODO
				370	List AMDGPU intrinsics
				371
				372	Code Object
				373	===========
				374
				375	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				376	can be linked by ``lld`` to produce a standard ELF shared code object which can
				377	be loaded and executed on an AMDGPU target.
				378
				379	Header
				380	------
				381
				382	The AMDGPU backend uses the following ELF header:
				383
				384	.. table:: AMDGPU ELF Header
				385	:name: amdgpu-elf-header-table
				386
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	387	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	388	Field Value
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	389	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	390	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				391	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	392	``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
				393	- ``ELFOSABI_AMDGPU_HSA``
				394	- ``ELFOSABI_AMDGPU_PAL``
				395	- ``ELFOSABI_AMDGPU_MESA3D``
				396	``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
				397	- ``ELFABIVERSION_AMDGPU_PAL``
				398	- ``ELFABIVERSION_AMDGPU_MESA3D``
				399	``e_type`` - ``ET_REL``
				400	- ``ET_DYN``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	401	``e_machine`` ``EM_AMDGPU``
				402	``e_entry`` 0
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	403	``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	404	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	405
				406	..
				407
				408	.. table:: AMDGPU ELF Header Enumeration Values
				409	:name: amdgpu-elf-header-enumeration-values-table
				410
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	411	=============================== =====
				412	Name Value
				413	=============================== =====
				414	``EM_AMDGPU`` 224
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	415	``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	416	``ELFOSABI_AMDGPU_HSA`` 64
				417	``ELFOSABI_AMDGPU_PAL`` 65
				418	``ELFOSABI_AMDGPU_MESA3D`` 66
				419	``ELFABIVERSION_AMDGPU_HSA`` 1
				420	``ELFABIVERSION_AMDGPU_PAL`` 0
				421	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				422	=============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	423
				424	``e_ident[EI_CLASS]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	425	The ELF class is:
				426
				427	* ``ELFCLASS32`` for ``r600`` architecture.
				428
				429	* ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
				430	bit applications.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	431
				432	``e_ident[EI_DATA]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	433	All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	434
				435	``e_ident[EI_OSABI]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	436	One of the following AMD GPU architecture specific OS ABIs
				437	(see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	438
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	439	* ``ELFOSABI_NONE`` for unknown OS.
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	440
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	441	* ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	442
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	443	* ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
				444
				445	* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	446
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	447	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	448	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				449	object conforms:
				450
				451	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				452	runtime ABI.
				453
				454	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				455	runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	456
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	457	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	458	3D runtime ABI.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	459
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	460	``e_type``
				461	Can be one of the following values:
				462
				463
				464	``ET_REL``
				465	The type produced by the AMD GPU backend compiler as it is relocatable code
				466	object.
				467
				468	``ET_DYN``
				469	The type produced by the linker as it is a shared code object.
				470
				471	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				472
				473	``e_machine``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	474	The value ``EM_AMDGPU`` is used for the machine for all processors supported
				475	by the ``r600`` and ``amdgcn`` architectures (see
				476	:ref:`amdgpu-processor-table`). The specific processor is specified in the
				477	``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
				478	:ref:`amdgpu-elf-header-e_flags-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	479
				480	``e_entry``
				481	The entry point is 0 as the entry points for individual kernels must be
				482	selected in order to invoke them through AQL packets.
				483
				484	``e_flags``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	485	The AMDGPU backend uses the following ELF header flags:
				486
				487	.. table:: AMDGPU ELF Header ``e_flags``
				488	:name: amdgpu-elf-header-e_flags-table
				489
				490	================================= ========== =============================
				491	Name Value Description
				492	================================= ========== =============================
				493	AMDGPU Processor Flag See :ref:`amdgpu-processor-table`.
				494	-------------------------------------------- -----------------------------
				495	``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
				496	mask for
				497	``EF_AMDGPU_MACH_xxx`` values
				498	defined in
				499	:ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	500	``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
				501	target feature is
				502	enabled for all code
				503	contained in the code object.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	504	If the processor
				505	does not support the
				506	``xnack`` target
				507	feature then must
				508	be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	509	See
				510	:ref:`amdgpu-target-features`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	511	================================= ========== =============================
				512
				513	.. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
				514	:name: amdgpu-ef-amdgpu-mach-table
				515
				516	================================= ========== =============================
				517	Name Value Description (see
				518	:ref:`amdgpu-processor-table`)
				519	================================= ========== =============================
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	520	``EF_AMDGPU_MACH_NONE`` 0x000 not specified
				521	``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
				522	``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
				523	``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
				524	``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
				525	``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
				526	``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
				527	``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
				528	``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
				529	``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
				530	``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
				531	``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
				532	``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
				533	``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
				534	``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
				535	``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
				536	``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
				537	reserved 0x011 - Reserved for ``r600``
				538	0x01f architecture processors.
				539	``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
				540	``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
				541	``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
				542	``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
				543	``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
				544	``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
				545	``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
				546	reserved 0x027 Reserved.
				547	``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
				548	``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
				549	``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
				550	``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
				551	``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
				552	``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
				553	reserved 0x02e Reserved.
				554	reserved 0x02f Reserved.
				555	reserved 0x030 Reserved.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	556	================================= ========== =============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	557
				558	Sections
				559	--------
				560
				561	An AMDGPU target ELF code object has the standard ELF sections which include:
				562
				563	.. table:: AMDGPU ELF Sections
				564	:name: amdgpu-elf-sections-table
				565
				566	================== ================ =================================
				567	Name Type Attributes
				568	================== ================ =================================
				569	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				570	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				571	``.debug_``\ \* ``SHT_PROGBITS`` none
				572	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				573	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				574	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				575	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				576	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				577	``.note`` ``SHT_NOTE`` none
				578	``.rela``\ name ``SHT_RELA`` none
				579	``.rela.dyn`` ``SHT_RELA`` none
				580	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				581	``.shstrtab`` ``SHT_STRTAB`` none
				582	``.strtab`` ``SHT_STRTAB`` none
				583	``.symtab`` ``SHT_SYMTAB`` none
				584	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				585	================== ================ =================================
				586
				587	These sections have their standard meanings (see [ELF]_) and are only generated
				588	if needed.
				589
				590	``.debug``\ \*
				591	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				592	DWARF produced by the AMDGPU backend.
				593
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	594	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	595	The standard sections used by a dynamic loader.
				596
				597	``.note``
				598	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				599	backend.
				600
				601	``.rela``\ name, ``.rela.dyn``
				602	For relocatable code objects, name is the name of the section that the
				603	relocation records apply. For example, ``.rela.text`` is the section name for
				604	relocation records associated with the ``.text`` section.
				605
				606	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				607	records from each of the relocatable code object's ``.rela``\ name sections.
				608
				609	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				610	the AMDGPU backend.
				611
				612	``.text``
				613	The executable machine code for the kernels and functions they call. Generated
				614	as position independent code. See :ref:`amdgpu-code-conventions` for
				615	information on conventions used in the isa generation.
				616
				617	.. _amdgpu-note-records:
				618
				619	Note Records
				620	------------
				621
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	622	As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
				623	be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
				624	aligned. In addition, minimal zero byte padding must be generated to ensure the
				625	``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
				626	``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	627
				628	The AMDGPU backend code object uses the following ELF note records in the
				629	``.note`` section. The Description column specifies the layout of the note
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	630	record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	631	variable size strings have a corresponding ``*_size`` field that specifies the
				632	number of bytes, including the terminating null character, in the string. The
				633	string(s) come immediately after the preceding fields.
				634
				635	Additional note records can be present.
				636
				637	.. table:: AMDGPU ELF Note Records
				638	:name: amdgpu-elf-note-records-table
				639
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	640	===== ============================== ======================================
				641	Name Type Description
				642	===== ============================== ======================================
				643	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	644	===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	645
				646	..
				647
				648	.. table:: AMDGPU ELF Note Record Enumeration Values
				649	:name: amdgpu-elf-note-record-enumeration-values-table
				650
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	651	============================== =====
				652	Name Value
				653	============================== =====
				654	reserved 0-9
				655	``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	656	reserved 11
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	657	============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	658
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	659	``NT_AMD_AMDGPU_HSA_METADATA``
				660	Specifies extensible metadata associated with the code objects executed on HSA
				661	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				662	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				663	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				664	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	665
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	666	.. _amdgpu-symbols:
				667
				668	Symbols
				669	-------
				670
				671	Symbols include the following:
				672
				673	.. table:: AMDGPU ELF Symbols
				674	:name: amdgpu-elf-symbols-table
				675
				676	===================== ============== ============= ==================
				677	Name Type Section Description
				678	===================== ============== ============= ==================
				679	link-name ``STT_OBJECT`` - ``.data`` Global variable
				680	- ``.rodata``
				681	- ``.bss``
				682	link-name\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
				683	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				684	===================== ============== ============= ==================
				685
				686	Global variable
				687	Global variables both used and defined by the compilation unit.
				688
				689	If the symbol is defined in the compilation unit then it is allocated in the
				690	appropriate section according to if it has initialized data or is readonly.
				691
				692	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				693	will resolve relocations using the definition provided by another code object
				694	or explicitly defined by the runtime.
				695
				696	All global symbols, whether defined in the compilation unit or external, are
				697	accessed by the machine code indirectly through a GOT table entry. This
				698	allows them to be preemptable. The GOT table is only supported when the target
				699	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	700
				701	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	702	Add description of linked shared object symbols. Seems undefined symbols
				703	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	704
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	705	Kernel descriptor
				706	Every HSA kernel has an associated kernel descriptor. It is the address of the
				707	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				708	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				709	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				710
				711	Kernel entry point
				712	Every HSA kernel also has a symbol for its machine code entry point.
				713
				714	.. _amdgpu-relocation-records:
				715
				716	Relocation Records
				717	------------------
				718
				719	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				720	relocatable fields are:
				721
				722	``word32``
				723	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				724	alignment. These values use the same byte order as other word values in the
				725	AMD GPU architecture.
				726
				727	``word64``
				728	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				729	alignment. These values use the same byte order as other word values in the
				730	AMD GPU architecture.
				731
				732	Following notations are used for specifying relocation calculations:
				733
				734	A
				735	Represents the addend used to compute the value of the relocatable field.
				736
				737	G
				738	Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	739	entry's symbol will reside during execution.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	740
				741	GOT
				742	Represents the address of the global offset table.
				743
				744	P
				745	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				746	of the storage unit being relocated (computed using ``r_offset``).
				747
				748	S
				749	Represents the value of the symbol whose index resides in the relocation
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	750	entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				751
				752	B
				753	Represents the base address of a loaded executable or shared object which is
				754	the difference between the ELF address and the actual load address. Relocations
				755	using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	756
				757	The following relocation types are supported:
				758
				759	.. table:: AMDGPU ELF Relocation Records
				760	:name: amdgpu-elf-relocation-records-table
				761
Tony Tye	db6c993	2018-01-30 23:59:43 +0000	[diff] [blame]	762	========================== ======= ===== ========== ==============================
				763	Relocation Type Kind Value Field Calculation
				764	========================== ======= ===== ========== ==============================
				765	``R_AMDGPU_NONE`` 0 none none
				766	``R_AMDGPU_ABS32_LO`` Dynamic 1 ``word32`` (S + A) & 0xFFFFFFFF
				767	``R_AMDGPU_ABS32_HI`` Dynamic 2 ``word32`` (S + A) >> 32
				768	``R_AMDGPU_ABS64`` Dynamic 3 ``word64`` S + A
				769	``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
				770	``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
				771	``R_AMDGPU_ABS32`` Static 6 ``word32`` S + A
				772	``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
				773	``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				774	``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
				775	``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				776	``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
				777	reserved 12
				778	``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
				779	========================== ======= ===== ========== ==============================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	780
				781	.. _amdgpu-dwarf:
				782
				783	DWARF
				784	-----
				785
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	786	Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	787	information that maps the code object executable code and data to the source
				788	language constructs. It can be used by tools such as debuggers and profilers.
				789
				790	Address Space Mapping
				791	~~~~~~~~~~~~~~~~~~~~~
				792
				793	The following address space mapping is used:
				794
				795	.. table:: AMDGPU DWARF Address Space Mapping
				796	:name: amdgpu-dwarf-address-space-mapping-table
				797
				798	=================== =================
				799	DWARF Address Space Memory Space
				800	=================== =================
				801	1 Private (Scratch)
				802	2 Local (group/LDS)
				803	omitted Global
				804	omitted Constant
				805	omitted Generic (Flat)
				806	not supported Region (GDS)
				807	=================== =================
				808
				809	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				810	used in the table.
				811
				812	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				813	DWARF address space of the value of the pointer when it is in the private or
				814	local address space. Otherwise the attribute is omitted.
				815
				816	An ``XDEREF`` operation is generated in location list expressions for variables
				817	that are allocated in the private and local address space. Otherwise no
				818	``XDREF`` is omitted.
				819
				820	Register Mapping
				821	~~~~~~~~~~~~~~~~
				822
				823	This section is WIP.
				824
				825	.. TODO
				826	Define DWARF register enumeration.
				827
				828	If want to present a wavefront state then should expose vector registers as
				829	64 wide (rather than per work-item view that LLVM uses). Either as separate
				830	registers, or a 64x4 byte single register. In either case use a new LANE op
				831	(akin to XDREF) to select the current lane usage in a location
				832	expression. This would also allow scalar register spilling to vector register
				833	lanes to be expressed (currently no debug information is being generated for
				834	spilling). If choose a wide single register approach then use LANE in
				835	conjunction with PIECE operation to select the dword part of the register for
				836	the current lane. If the separate register approach then use LANE to select
				837	the register.
				838
				839	Source Text
				840	~~~~~~~~~~~
				841
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	842	Source text for online-compiled programs (e.g. those compiled by the OpenCL
				843	runtime) may be embedded into the DWARF v5 line table using the ``clang
				844	-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	845
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	846	For example:
				847
				848	``-gembed-source``
				849	Enable the embedded source DWARF v5 extension.
				850	``-gno-embed-source``
				851	Disable the embedded source DWARF v5 extension.
				852
				853	.. table:: AMDGPU Debug Options
				854	:name: amdgpu-debug-options
				855
				856	==================== ==================================================
				857	Debug Flag Description
				858	==================== ==================================================
				859	-g[no-]embed-source Enable/disable embedding source text in DWARF
				860	debug sections. Useful for environments where
				861	source cannot be written to disk, such as
				862	when performing online compilation.
				863	==================== ==================================================
				864
				865	This option enables one extended content types in the DWARF v5 Line Number
				866	Program Header, which is used to encode embedded source.
				867
				868	.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
				869	:name: amdgpu-dwarf-extended-content-types
				870
				871	============================ ======================
				872	Content Type Form
				873	============================ ======================
				874	``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
				875	============================ ======================
				876
				877	The source field will contain the UTF-8 encoded, null-terminated source text
				878	with ``'\n'`` line endings. When the source field is present, consumers can use
				879	the embedded source instead of attempting to discover the source on disk. When
				880	the source field is absent, consumers can access the file to get the source
				881	text.
				882
				883	The above content type appears in the ``file_name_entry_format`` field of the
				884	line table prologue, and its corresponding value appear in the ``file_names``
				885	field. The current encoding of the content type is documented in table
				886	:ref:`amdgpu-dwarf-extended-content-types-encoding`
				887
				888	.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
				889	:name: amdgpu-dwarf-extended-content-types-encoding
				890
				891	============================ ====================
				892	Content Type Value
				893	============================ ====================
				894	``DW_LNCT_LLVM_source`` 0x2001
				895	============================ ====================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	896
				897	.. _amdgpu-code-conventions:
				898
				899	Code Conventions
				900	================
				901
				902	This section provides code conventions used for each supported target triple OS
				903	(see :ref:`amdgpu-target-triples`).
				904
				905	AMDHSA
				906	------
				907
				908	This section provides code conventions used when the target triple OS is
				909	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				910
				911	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	912
Tony Tye	01bfd6c	2018-03-27 21:20:46 +0000	[diff] [blame]	913	Code Object Target Identification
				914	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				915
				916	The AMDHSA OS uses the following syntax to specify the code object
				917	target as a single string:
				918
				919	``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
				920
				921	Where:
				922
				923	- ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
				924	are the same as the Target Triple (see
				925	:ref:`amdgpu-target-triples`).
				926
				927	- ``<Processor>`` is the same as the Processor (see
				928	:ref:`amdgpu-processors`).
				929
				930	- ``<Target Features>`` is a list of the enabled Target Features
				931	(see :ref:`amdgpu-target-features`), each prefixed by a plus, that
				932	apply to Processor. The list must be in the same order as listed
				933	in the table :ref:`amdgpu-target-feature-table`. Note that *Target
				934	Features* must be included in the list if they are enabled even if
				935	that is the default for Processor.
				936
				937	For example:
				938
				939	``"amdgcn-amd-amdhsa--gfx902+xnack"``
				940
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	941	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	942	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	943
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	944	The code object metadata specifies extensible metadata associated with the code
				945	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
				946	[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
				947	(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
				948	``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
				949	information necessary to support the ROCM kernel queries. For example, the
				950	segment sizes needed in a dispatch packet. In addition, a high level language
				951	runtime may require other information to be included. For example, the AMD
				952	OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	953
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	954	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	955	:doc:`YamlIO`).
				956
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	957	.. TODO
				958	Is the string null terminated? It probably should not if YAML allows it to
				959	contain null characters, otherwise it should be.
				960
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	961	The metadata is represented as a single YAML document comprised of the mapping
				962	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				963	referenced tables.
				964
				965	For boolean values, the string values of ``false`` and ``true`` are used for
				966	false and true respectively.
				967
				968	Additional information can be added to the mappings. To avoid conflicts, any
				969	non-AMD key names should be prefixed by "vendor-name.".
				970
				971	.. table:: AMDHSA Code Object Metadata Mapping
				972	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				973
				974	========== ============== ========= =======================================
				975	String Key Value Type Required? Description
				976	========== ============== ========= =======================================
				977	"Version" sequence of Required - The first integer is the major
				978	2 integers version. Currently 1.
				979	- The second integer is the minor
				980	version. Currently 0.
				981	"Printf" sequence of Each string is encoded information
				982	strings about a printf function call. The
				983	encoded information is organized as
				984	fields separated by colon (':'):
				985
				986	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				987
				988	where:
				989
				990	``ID``
				991	A 32 bit integer as a unique id for
				992	each printf function call
				993
				994	``N``
				995	A 32 bit integer equal to the number
				996	of arguments of printf function call
				997	minus 1
				998
				999	``S[i]`` (where i = 0, 1, ... , N-1)
				1000	32 bit integers for the size in bytes
				1001	of the i-th FormatString argument of
				1002	the printf function call
				1003
				1004	FormatString
				1005	The format string passed to the
				1006	printf function call.
				1007	"Kernels" sequence of Required Sequence of the mappings for each
				1008	mapping kernel in the code object. See
				1009	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				1010	for the definition of the mapping.
				1011	========== ============== ========= =======================================
				1012
				1013	..
				1014
				1015	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				1016	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				1017
				1018	================= ============== ========= ================================
				1019	String Key Value Type Required? Description
				1020	================= ============== ========= ================================
				1021	"Name" string Required Source name of the kernel.
				1022	"SymbolName" string Required Name of the kernel
				1023	descriptor ELF symbol.
				1024	"Language" string Source language of the kernel.
				1025	Values include:
				1026
				1027	- "OpenCL C"
				1028	- "OpenCL C++"
				1029	- "HCC"
				1030	- "OpenMP"
				1031
				1032	"LanguageVersion" sequence of - The first integer is the major
				1033	2 integers version.
				1034	- The second integer is the
				1035	minor version.
				1036	"Attrs" mapping Mapping of kernel attributes.
				1037	See
				1038	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				1039	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1040	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1041	mapping kernel arguments. See
				1042	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				1043	for the definition of the mapping.
				1044	"CodeProps" mapping Mapping of properties related to
				1045	the kernel code. See
				1046	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				1047	for the mapping definition.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1048	================= ============== ========= ================================
				1049
				1050	..
				1051
				1052	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				1053	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				1054
				1055	=================== ============== ========= ==============================
				1056	String Key Value Type Required? Description
				1057	=================== ============== ========= ==============================
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1058	"ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
				1059	3 integers must be >=1 and the dispatch
				1060	work-group size X, Y, Z must
				1061	correspond to the specified
				1062	values. Defaults to 0, 0, 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1063
				1064	Corresponds to the OpenCL
				1065	``reqd_work_group_size``
				1066	attribute.
				1067	"WorkGroupSizeHint" sequence of The dispatch work-group size
				1068	3 integers X, Y, Z is likely to be the
				1069	specified values.
				1070
				1071	Corresponds to the OpenCL
				1072	``work_group_size_hint``
				1073	attribute.
				1074	"VecTypeHint" string The name of a scalar or vector
				1075	type.
				1076
				1077	Corresponds to the OpenCL
				1078	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	1079
				1080	"RuntimeHandle" string The external symbol name
				1081	associated with a kernel.
				1082	OpenCL runtime allocates a
				1083	global buffer for the symbol
				1084	and saves the kernel's address
				1085	to it, which is used for
				1086	device side enqueueing. Only
				1087	available for device side
				1088	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1089	=================== ============== ========= ==============================
				1090
				1091	..
				1092
				1093	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				1094	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				1095
				1096	================= ============== ========= ================================
				1097	String Key Value Type Required? Description
				1098	================= ============== ========= ================================
				1099	"Name" string Kernel argument name.
				1100	"TypeName" string Kernel argument type name.
				1101	"Size" integer Required Kernel argument size in bytes.
				1102	"Align" integer Required Kernel argument alignment in
				1103	bytes. Must be a power of two.
				1104	"ValueKind" string Required Kernel argument kind that
				1105	specifies how to set up the
				1106	corresponding argument.
				1107	Values include:
				1108
				1109	"ByValue"
				1110	The argument is copied
				1111	directly into the kernarg.
				1112
				1113	"GlobalBuffer"
				1114	A global address space pointer
				1115	to the buffer data is passed
				1116	in the kernarg.
				1117
				1118	"DynamicSharedPointer"
				1119	A group address space pointer
				1120	to dynamically allocated LDS
				1121	is passed in the kernarg.
				1122
				1123	"Sampler"
				1124	A global address space
				1125	pointer to a S# is passed in
				1126	the kernarg.
				1127
				1128	"Image"
				1129	A global address space
				1130	pointer to a T# is passed in
				1131	the kernarg.
				1132
				1133	"Pipe"
				1134	A global address space pointer
				1135	to an OpenCL pipe is passed in
				1136	the kernarg.
				1137
				1138	"Queue"
				1139	A global address space pointer
				1140	to an OpenCL device enqueue
				1141	queue is passed in the
				1142	kernarg.
				1143
				1144	"HiddenGlobalOffsetX"
				1145	The OpenCL grid dispatch
				1146	global offset for the X
				1147	dimension is passed in the
				1148	kernarg.
				1149
				1150	"HiddenGlobalOffsetY"
				1151	The OpenCL grid dispatch
				1152	global offset for the Y
				1153	dimension is passed in the
				1154	kernarg.
				1155
				1156	"HiddenGlobalOffsetZ"
				1157	The OpenCL grid dispatch
				1158	global offset for the Z
				1159	dimension is passed in the
				1160	kernarg.
				1161
				1162	"HiddenNone"
				1163	An argument that is not used
				1164	by the kernel. Space needs to
				1165	be left for it, but it does
				1166	not need to be set up.
				1167
				1168	"HiddenPrintfBuffer"
				1169	A global address space pointer
				1170	to the runtime printf buffer
				1171	is passed in kernarg.
				1172
				1173	"HiddenDefaultQueue"
				1174	A global address space pointer
				1175	to the OpenCL device enqueue
				1176	queue that should be used by
				1177	the kernel by default is
				1178	passed in the kernarg.
				1179
				1180	"HiddenCompletionAction"
Yaxun Liu	c928f2a	2017-10-30 14:30:28 +0000	[diff] [blame]	1181	A global address space pointer
				1182	to help link enqueued kernels into
				1183	the ancestor tree for determining
				1184	when the parent kernel has finished.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1185
				1186	"ValueType" string Required Kernel argument value type. Only
				1187	present if "ValueKind" is
				1188	"ByValue". For vector data
				1189	types, the value is for the
				1190	element type. Values include:
				1191
				1192	- "Struct"
				1193	- "I8"
				1194	- "U8"
				1195	- "I16"
				1196	- "U16"
				1197	- "F16"
				1198	- "I32"
				1199	- "U32"
				1200	- "F32"
				1201	- "I64"
				1202	- "U64"
				1203	- "F64"
				1204
				1205	.. TODO
				1206	How can it be determined if a
				1207	vector type, and what size
				1208	vector?
				1209	"PointeeAlign" integer Alignment in bytes of pointee
				1210	type for pointer type kernel
				1211	argument. Must be a power
				1212	of 2. Only present if
				1213	"ValueKind" is
				1214	"DynamicSharedPointer".
				1215	"AddrSpaceQual" string Kernel argument address space
				1216	qualifier. Only present if
				1217	"ValueKind" is "GlobalBuffer" or
				1218	"DynamicSharedPointer". Values
				1219	are:
				1220
				1221	- "Private"
				1222	- "Global"
				1223	- "Constant"
				1224	- "Local"
				1225	- "Generic"
				1226	- "Region"
				1227
				1228	.. TODO
				1229	Is GlobalBuffer only Global
				1230	or Constant? Is
				1231	DynamicSharedPointer always
				1232	Local? Can HCC allow Generic?
				1233	How can Private or Region
				1234	ever happen?
				1235	"AccQual" string Kernel argument access
				1236	qualifier. Only present if
				1237	"ValueKind" is "Image" or
				1238	"Pipe". Values
				1239	are:
				1240
				1241	- "ReadOnly"
				1242	- "WriteOnly"
				1243	- "ReadWrite"
				1244
				1245	.. TODO
				1246	Does this apply to
				1247	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1248	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1249	performed by the kernel on the
				1250	kernel argument. Only present if
				1251	"ValueKind" is "GlobalBuffer",
				1252	"Image", or "Pipe". This may be
				1253	more restrictive than indicated
				1254	by "AccQual" to reflect what the
				1255	kernel actual does. If not
				1256	present then the runtime must
				1257	assume what is implied by
				1258	"AccQual" and "IsConst". Values
				1259	are:
				1260
				1261	- "ReadOnly"
				1262	- "WriteOnly"
				1263	- "ReadWrite"
				1264
				1265	"IsConst" boolean Indicates if the kernel argument
				1266	is const qualified. Only present
				1267	if "ValueKind" is
				1268	"GlobalBuffer".
				1269
				1270	"IsRestrict" boolean Indicates if the kernel argument
				1271	is restrict qualified. Only
				1272	present if "ValueKind" is
				1273	"GlobalBuffer".
				1274
				1275	"IsVolatile" boolean Indicates if the kernel argument
				1276	is volatile qualified. Only
				1277	present if "ValueKind" is
				1278	"GlobalBuffer".
				1279
				1280	"IsPipe" boolean Indicates if the kernel argument
				1281	is pipe qualified. Only present
				1282	if "ValueKind" is "Pipe".
				1283
				1284	.. TODO
				1285	Can GlobalBuffer be pipe
				1286	qualified?
				1287	================= ============== ========= ================================
				1288
				1289	..
				1290
				1291	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1292	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1293
				1294	============================ ============== ========= =====================
				1295	String Key Value Type Required? Description
				1296	============================ ============== ========= =====================
				1297	"KernargSegmentSize" integer Required The size in bytes of
				1298	the kernarg segment
				1299	that holds the values
				1300	of the arguments to
				1301	the kernel.
				1302	"GroupSegmentFixedSize" integer Required The amount of group
				1303	segment memory
				1304	required by a
				1305	work-group in
				1306	bytes. This does not
				1307	include any
				1308	dynamically allocated
				1309	group segment memory
				1310	that may be added
				1311	when the kernel is
				1312	dispatched.
				1313	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1314	private address space
				1315	memory required for a
				1316	work-item in
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1317	bytes. If the kernel
				1318	uses a dynamic call
				1319	stack then additional
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1320	space must be added
				1321	to this value for the
				1322	call stack.
				1323	"KernargSegmentAlign" integer Required The maximum byte
				1324	alignment of
				1325	arguments in the
				1326	kernarg segment. Must
				1327	be a power of 2.
				1328	"WavefrontSize" integer Required Wavefront size. Must
				1329	be a power of 2.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1330	"NumSGPRs" integer Required Number of scalar
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1331	registers used by a
				1332	wavefront for
				1333	GFX6-GFX9. This
				1334	includes the special
				1335	SGPRs for VCC, Flat
				1336	Scratch (GFX7-GFX9)
				1337	and XNACK (for
				1338	GFX8-GFX9). It does
				1339	not include the 16
				1340	SGPR added if a trap
				1341	handler is
				1342	enabled. It is not
				1343	rounded up to the
				1344	allocation
				1345	granularity.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1346	"NumVGPRs" integer Required Number of vector
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1347	registers used by
				1348	each work-item for
				1349	GFX6-GFX9
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1350	"MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1351	work-group size
				1352	supported by the
				1353	kernel in work-items.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1354	Must be >=1 and
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1355	consistent with
				1356	ReqdWorkGroupSize if
				1357	not 0, 0, 0.
Konstantin Zhuravlyov	06ae4ec	2017-11-28 17:51:08 +0000	[diff] [blame]	1358	"NumSpilledSGPRs" integer Number of stores from
				1359	a scalar register to
				1360	a register allocator
				1361	created spill
				1362	location.
				1363	"NumSpilledVGPRs" integer Number of stores from
				1364	a vector register to
				1365	a register allocator
				1366	created spill
				1367	location.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1368	============================ ============== ========= =====================
				1369
				1370	..
				1371
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1372	Kernel Dispatch
				1373	~~~~~~~~~~~~~~~
				1374
				1375	The HSA architected queuing language (AQL) defines a user space memory interface
				1376	that can be used to control the dispatch of kernels, in an agent independent
				1377	way. An agent can have zero or more AQL queues created for it using the ROCm
				1378	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1379	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1380	mechanics and packet layouts.
				1381
				1382	The packet processor of a kernel agent is responsible for detecting and
				1383	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1384	packet processor is implemented by the hardware command processor (CP),
				1385	asynchronous dispatch controller (ADC) and shader processor input controller
				1386	(SPI).
				1387
				1388	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1389	mode driver to initialize and register the AQL queue with CP.
				1390
				1391	To dispatch a kernel the following actions are performed. This can occur in the
				1392	CPU host program, or from an HSA kernel executing on a GPU.
				1393
				1394	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1395	executed is obtained.
				1396	2. A pointer to the kernel descriptor (see
				1397	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1398	obtained. It must be for a kernel that is contained in a code object that that
				1399	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1400	associated.
				1401	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1402	for a memory region with the kernarg property for the kernel agent that will
				1403	execute the kernel. It must be at least 16 byte aligned.
				1404	4. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	1405	allocation. The layout is defined in the HSA Programmer's Language Reference
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1406	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1407	memory in the same way constant memory is accessed. (Note that the HSA
				1408	specification allows an implementation to copy the kernel argument contents to
				1409	another location that is accessed by the kernel.)
				1410	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1411	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1412	packet. The packet must be set up, and the final write must use an atomic
				1413	store release to set the packet kind to ensure the packet contents are
				1414	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1415	notify the kernel agent that the AQL queue has been updated. These rules, and
				1416	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1417	System Architecture Specification* [HSA]_.
				1418	6. A kernel dispatch packet includes information about the actual dispatch,
				1419	such as grid and work-group size, together with information from the code
				1420	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1421	the kernel symbol can be used to obtain the code object values which are
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1422	recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1423	7. CP executes micro-code and is responsible for detecting and setting up the
				1424	GPU to execute the wavefronts of a kernel dispatch.
				1425	8. CP ensures that when the a wavefront starts executing the kernel machine
				1426	code, the scalar general purpose registers (SGPR) and vector general purpose
				1427	registers (VGPR) are set up as required by the machine code. The required
				1428	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1429	register state is defined in
				1430	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1431	9. The prolog of the kernel machine code (see
				1432	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1433	before continuing executing the machine code that corresponds to the kernel.
				1434	10. When the kernel dispatch has completed execution, CP signals the completion
				1435	signal specified in the kernel dispatch packet if not 0.
				1436
				1437	.. _amdgpu-amdhsa-memory-spaces:
				1438
				1439	Memory Spaces
				1440	~~~~~~~~~~~~~
				1441
				1442	The memory space properties are:
				1443
				1444	.. table:: AMDHSA Memory Spaces
				1445	:name: amdgpu-amdhsa-memory-spaces-table
				1446
				1447	================= =========== ======== ======= ==================
				1448	Memory Space Name HSA Segment Hardware Address NULL Value
				1449	Name Name Size
				1450	================= =========== ======== ======= ==================
				1451	Private private scratch 32 0x00000000
				1452	Local group LDS 32 0xFFFFFFFF
				1453	Global global global 64 0x0000000000000000
				1454	Constant constant *same as 64 0x0000000000000000
				1455	global*
				1456	Generic flat flat 64 0x0000000000000000
				1457	Region N/A GDS 32 *not implemented
				1458	for AMDHSA*
				1459	================= =========== ======== ======= ==================
				1460
				1461	The global and constant memory spaces both use global virtual addresses, which
				1462	are the same virtual address space used by the CPU. However, some virtual
				1463	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1464	and some by both.
				1465
				1466	Using the constant memory space indicates that the data will not change during
				1467	the execution of the kernel. This allows scalar read instructions to be
				1468	used. The vector and scalar L1 caches are invalidated of volatile data before
				1469	each kernel dispatch execution to allow constant memory to change values between
				1470	kernel dispatches.
				1471
				1472	The local memory space uses the hardware Local Data Store (LDS) which is
				1473	automatically allocated when the hardware creates work-groups of wavefronts, and
				1474	freed when all the wavefronts of a work-group have terminated. The data store
				1475	(DS) instructions can be used to access it.
				1476
				1477	The private memory space uses the hardware scratch memory support. If the kernel
				1478	uses scratch, then the hardware allocates memory that is accessed using
				1479	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1480	address to physical address is:
				1481
				1482	``wavefront-scratch-base +
				1483	(private-address * wavefront-size * 4) +
				1484	(wavefront-lane-id * 4)``
				1485
				1486	There are different ways that the wavefront scratch base address is determined
				1487	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1488	memory can be accessed in an interleaved manner using buffer instruction with
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1489	the scratch buffer descriptor and per wavefront scratch offset, by the scratch
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1490	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1491	same private address, the interleaving results in adjacent dwords being accessed
				1492	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1493	supported except by flat and scratch instructions in GFX9.
				1494
				1495	The generic address space uses the hardware flat address support available in
				1496	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1497	local appertures), that are outside the range of addressible global memory, to
				1498	map from a flat address to a private or local address.
				1499
				1500	FLAT instructions can take a flat address and access global, private (scratch)
				1501	and group (LDS) memory depending in if the address is within one of the
				1502	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1503	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1504	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1505	(see :ref:`amdgpu-amdhsa-m0`).
				1506
				1507	To convert between a segment address and a flat address the base address of the
				1508	appertures address can be used. For GFX7-GFX8 these are available in the
				1509	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1510	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1511	GFX9 the appature base addresses are directly available as inline constant
				1512	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1513	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1514	which makes it easier to convert from flat to segment or segment to flat.
				1515
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1516	Image and Samplers
				1517	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1518
				1519	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1520	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1521	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1522	enumeration values for the queries that are not trivially deducible from the S#
				1523	representation.
				1524
				1525	HSA Signals
				1526	~~~~~~~~~~~
				1527
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1528	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1529	structure allocated in memory accessible from both the CPU and GPU. The
				1530	structure is defined by the ROCm runtime and subject to change between releases
				1531	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1532
				1533	.. _amdgpu-amdhsa-hsa-aql-queue:
				1534
				1535	HSA AQL Queue
				1536	~~~~~~~~~~~~~
				1537
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1538	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1539	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1540	fields needed to implement certain language features such as the flat address
				1541	aperture bases. It also contains fields used by CP such as managing the
				1542	allocation of scratch memory.
				1543
				1544	.. _amdgpu-amdhsa-kernel-descriptor:
				1545
				1546	Kernel Descriptor
				1547	~~~~~~~~~~~~~~~~~
				1548
				1549	A kernel descriptor consists of the information needed by CP to initiate the
				1550	execution of a kernel, including the entry point address of the machine code
				1551	that implements the kernel.
				1552
				1553	Kernel Descriptor for GFX6-GFX9
				1554	+++++++++++++++++++++++++++++++
				1555
				1556	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1557
				1558	.. table:: Kernel Descriptor for GFX6-GFX9
				1559	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1560
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1561	======= ======= =============================== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1562	Bits Size Field Name Description
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1563	======= ======= =============================== ============================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1564	31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1565	address space memory
				1566	required for a work-group
				1567	in bytes. This does not
				1568	include any dynamically
				1569	allocated local address
				1570	space memory that may be
				1571	added when the kernel is
				1572	dispatched.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1573	63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1574	private address space
				1575	memory required for a
				1576	work-item in bytes. If
				1577	is_dynamic_callstack is 1
				1578	then additional space must
				1579	be added to this value for
				1580	the call stack.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1581	127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1582	191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1583	negative) from base
				1584	address of kernel
				1585	descriptor to kernel's
				1586	entry point instruction
				1587	which must be 256 byte
				1588	aligned.
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1589	383:192 24 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1590	bytes
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1591	415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1592	program settings used by
				1593	CP to set up
				1594	``COMPUTE_PGM_RSRC1``
				1595	configuration
				1596	register. See
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1597	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1598	447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1599	program settings used by
				1600	CP to set up
				1601	``COMPUTE_PGM_RSRC2``
				1602	configuration
				1603	register. See
				1604	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1605	448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
				1606	SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1607	(see
				1608	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1609
				1610	The total number of SGPR
				1611	user data registers
				1612	requested must not exceed
				1613	16 and match value in
				1614	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1615	Any requests beyond 16
				1616	will be ignored.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1617	449 1 bit EnableSGPRDispatchPtr see above
				1618	450 1 bit EnableSGPRQueuePtr see above
				1619	451 1 bit EnableSGPRKernargSegmentPtr see above
				1620	452 1 bit EnableSGPRDispatchID see above
				1621	453 1 bit EnableSGPRFlatScratchInit see above
				1622	454 1 bit EnableSGPRPrivateSegmentSize see above
				1623	455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
				1624	should always be 0.
				1625	456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
				1626	should always be 0.
				1627	457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
				1628	should always be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	1629	463:458 6 bits Reserved, must be 0.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1630	511:464 6 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1631	bytes
				1632	512 Total size 64 bytes.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1633	======= ====================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1634
				1635	..
				1636
				1637	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1638	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1639
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1640	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1641	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1642	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1643	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1644	used by each work-item,
				1645	granularity is device
				1646	specific:
				1647
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1648	GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1649	- max_vgpr 1..256
				1650	- roundup((max_vgpg + 1)
				1651	/ 4) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1652
				1653	Used by CP to set up
				1654	``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1655	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1656	used by a wavefront,
				1657	granularity is device
				1658	specific:
				1659
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1660	GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1661	- max_sgpr 1..112
				1662	- roundup((max_sgpg + 1)
				1663	/ 8) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1664	GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1665	- max_sgpr 1..112
				1666	- roundup((max_sgpg + 1)
				1667	/ 16) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1668
				1669	Includes the special SGPRs
				1670	for VCC, Flat Scratch (for
				1671	GFX7 onwards) and XNACK
				1672	(for GFX8 onwards). It does
				1673	not include the 16 SGPR
				1674	added if a trap handler is
				1675	enabled.
				1676
				1677	Used by CP to set up
				1678	``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1679	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1680
				1681	Start executing wavefront
				1682	at the specified priority.
				1683
				1684	CP is responsible for
				1685	filling in
				1686	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1687	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1688	with specified rounding
				1689	mode for single (32
				1690	bit) floating point
				1691	precision floating point
				1692	operations.
				1693
				1694	Floating point rounding
				1695	mode values are defined in
				1696	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1697
				1698	Used by CP to set up
				1699	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1700	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1701	with specified rounding
				1702	denorm mode for half/double (16
				1703	and 64 bit) floating point
				1704	precision floating point
				1705	operations.
				1706
				1707	Floating point rounding
				1708	mode values are defined in
				1709	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1710
				1711	Used by CP to set up
				1712	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1713	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1714	with specified denorm mode
				1715	for single (32
				1716	bit) floating point
				1717	precision floating point
				1718	operations.
				1719
				1720	Floating point denorm mode
				1721	values are defined in
				1722	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1723
				1724	Used by CP to set up
				1725	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1726	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1727	with specified denorm mode
				1728	for half/double (16
				1729	and 64 bit) floating point
				1730	precision floating point
				1731	operations.
				1732
				1733	Floating point denorm mode
				1734	values are defined in
				1735	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1736
				1737	Used by CP to set up
				1738	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1739	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1740
				1741	Start executing wavefront
				1742	in privilege trap handler
				1743	mode.
				1744
				1745	CP is responsible for
				1746	filling in
				1747	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1748	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1749	with DX10 clamp mode
				1750	enabled. Used by the vector
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1751	ALU to force DX10 style
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1752	treatment of NaN's (when
				1753	set, clamp NaN to zero,
				1754	otherwise pass NaN
				1755	through).
				1756
				1757	Used by CP to set up
				1758	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1759	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1760
				1761	Start executing wavefront
				1762	in single step mode.
				1763
				1764	CP is responsible for
				1765	filling in
				1766	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1767	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1768	with IEEE mode
				1769	enabled. Floating point
				1770	opcodes that support
				1771	exception flag gathering
				1772	will quiet and propagate
				1773	signaling-NaN inputs per
				1774	IEEE 754-2008. Min_dx10 and
				1775	max_dx10 become IEEE
				1776	754-2008 compliant due to
				1777	signaling-NaN propagation
				1778	and quieting.
				1779
				1780	Used by CP to set up
				1781	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1782	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1783
				1784	Only one work-group allowed
				1785	to execute on a compute
				1786	unit.
				1787
				1788	CP is responsible for
				1789	filling in
				1790	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1791	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1792
				1793	Flag that can be used to
				1794	control debugging code.
				1795
				1796	CP is responsible for
				1797	filling in
				1798	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1799	26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1800	Reserved, must be 0.
				1801	GFX9
				1802	Wavefront starts execution
				1803	with specified fp16 overflow
				1804	mode.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1805
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1806	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1807	+/-INF values.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1808	- If 1, fp16 overflow that is the
				1809	result of an +/-INF input value
				1810	or divide by 0 produces a +/-INF,
				1811	otherwise clamps computed
				1812	overflow to +/-MAX_FP16 as
				1813	appropriate.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1814
				1815	Used by CP to set up
				1816	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1817	31:27 5 bits Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1818	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1819	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1820
				1821	..
				1822
				1823	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1824	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1825
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1826	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1827	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1828	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1829	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1830	_WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1831	system register (see
				1832	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1833
				1834	Used by CP to set up
				1835	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1836	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1837	user data registers
				1838	requested. This number must
				1839	match the number of user
				1840	data registers enabled.
				1841
				1842	Used by CP to set up
				1843	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1844	6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1845	TRAP instruction which
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1846	requires a trap handler to
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1847	be enabled.
				1848
				1849	CP sets
				1850	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
				1851	if the runtime has
				1852	installed a trap handler
				1853	regardless of the setting
				1854	of this field.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1855	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1856	system SGPR register for
				1857	the work-group id in the X
				1858	dimension (see
				1859	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1860
				1861	Used by CP to set up
				1862	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1863	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1864	system SGPR register for
				1865	the work-group id in the Y
				1866	dimension (see
				1867	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1868
				1869	Used by CP to set up
				1870	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1871	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1872	system SGPR register for
				1873	the work-group id in the Z
				1874	dimension (see
				1875	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1876
				1877	Used by CP to set up
				1878	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1879	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1880	system SGPR register for
				1881	work-group information (see
				1882	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1883
				1884	Used by CP to set up
				1885	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1886	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1887	VGPR system registers used
				1888	for the work-item ID.
				1889	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1890	defines the values.
				1891
				1892	Used by CP to set up
				1893	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1894	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1895
				1896	Wavefront starts execution
				1897	with address watch
				1898	exceptions enabled which
				1899	are generated when L1 has
				1900	witnessed a thread access
				1901	an *address of
				1902	interest*.
				1903
				1904	CP is responsible for
				1905	filling in the address
				1906	watch bit in
				1907	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1908	according to what the
				1909	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1910	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1911
				1912	Wavefront starts execution
				1913	with memory violation
				1914	exceptions exceptions
				1915	enabled which are generated
				1916	when a memory violation has
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1917	occurred for this wavefront from
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1918	L1 or LDS
				1919	(write-to-read-only-memory,
				1920	mis-aligned atomic, LDS
				1921	address out of range,
				1922	illegal address, etc.).
				1923
				1924	CP sets the memory
				1925	violation bit in
				1926	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1927	according to what the
				1928	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1929	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1930
				1931	CP uses the rounded value
				1932	from the dispatch packet,
				1933	not this value, as the
				1934	dispatch may contain
				1935	dynamically allocated group
				1936	segment memory. CP writes
				1937	directly to
				1938	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1939
				1940	Amount of group segment
				1941	(LDS) to allocate for each
				1942	work-group. Granularity is
				1943	device specific:
				1944
				1945	GFX6:
				1946	roundup(lds-size / (64 * 4))
				1947	GFX7-GFX9:
				1948	roundup(lds-size / (128 * 4))
				1949
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1950	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				1951	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1952	enabled.
				1953
				1954	Used by CP to set up
				1955	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1956	(set from bits 0..6).
				1957
				1958	IEEE 754 FP Invalid
				1959	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1960	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				1961	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1962	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1963	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				1964	_DIVISION_BY_ZERO Zero
				1965	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
				1966	_OVERFLOW
				1967	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				1968	_UNDERFLOW
				1969	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				1970	_INEXACT
				1971	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				1972	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1973	only)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1974	31 1 bit Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1975	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1976	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1977
				1978	..
				1979
				1980	.. table:: Floating Point Rounding Mode Enumeration Values
				1981	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1982
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1983	====================================== ===== ==============================
				1984	Enumeration Name Value Description
				1985	====================================== ===== ==============================
				1986	AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				1987	AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				1988	AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				1989	AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
				1990	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1991
				1992	..
				1993
				1994	.. table:: Floating Point Denorm Mode Enumeration Values
				1995	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				1996
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1997	====================================== ===== ==============================
				1998	Enumeration Name Value Description
				1999	====================================== ===== ==============================
				2000	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
				2001	Denorms
				2002	AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				2003	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				2004	AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
				2005	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2006
				2007	..
				2008
				2009	.. table:: System VGPR Work-Item ID Enumeration Values
				2010	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				2011
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2012	======================================== ===== ============================
				2013	Enumeration Name Value Description
				2014	======================================== ===== ============================
				2015	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
				2016	ID.
				2017	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				2018	dimensions ID.
				2019	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				2020	dimensions ID.
				2021	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				2022	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2023
				2024	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				2025
				2026	Initial Kernel Execution State
				2027	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				2028
				2029	This section defines the register state that will be set up by the packet
				2030	processor prior to the start of execution of every wavefront. This is limited by
				2031	the constraints of the hardware controllers of CP/ADC/SPI.
				2032
				2033	The order of the SGPR registers is defined, but the compiler can specify which
				2034	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				2035	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2036	for enabled registers are dense starting at SGPR0: the first enabled register is
				2037	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				2038	an SGPR number.
				2039
				2040	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2041	all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2042	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				2043	initialized. These are then immediately followed by the System SGPRs that are
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2044	set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2045	dispatch.
				2046
				2047	SGPR register initial state is defined in
				2048	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				2049
				2050	.. table:: SGPR Register Set Up Order
				2051	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				2052
				2053	========== ========================== ====== ==============================
				2054	SGPR Order Name Number Description
				2055	(kernel descriptor enable of
				2056	field) SGPRs
				2057	========== ========================== ====== ==============================
				2058	First Private Segment Buffer 4 V# that can be used, together
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2059	(enable_sgpr_private with Scratch Wavefront Offset
				2060	_segment_buffer) as an offset, to access the
				2061	private memory space using a
				2062	segment address.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2063
				2064	CP uses the value provided by
				2065	the runtime.
				2066	then Dispatch Ptr 2 64 bit address of AQL dispatch
				2067	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				2068	actually executing.
				2069	then Queue Ptr 2 64 bit address of amd_queue_t
				2070	(enable_sgpr_queue_ptr) object for AQL queue on which
				2071	the dispatch packet was
				2072	queued.
				2073	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				2074	(enable_sgpr_kernarg segment. This is directly
				2075	_segment_ptr) copied from the
				2076	kernarg_address in the kernel
				2077	dispatch packet.
				2078
				2079	Having CP load it once avoids
				2080	loading it at the beginning of
				2081	every wavefront.
				2082	then Dispatch Id 2 64 bit Dispatch ID of the
				2083	(enable_sgpr_dispatch_id) dispatch packet being
				2084	executed.
				2085	then Flat Scratch Init 2 This is 2 SGPRs:
				2086	(enable_sgpr_flat_scratch
				2087	_init) GFX6
				2088	Not supported.
				2089	GFX7-GFX8
				2090	The first SGPR is a 32 bit
				2091	byte offset from
				2092	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2093	to per SPI base of memory
				2094	for scratch for the queue
				2095	executing the kernel
				2096	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2097	from the runtime. (The
				2098	Scratch Segment Buffer base
				2099	address is
				2100	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2101	plus this offset.) The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2102	of Scratch Wavefront Offset must
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2103	be added to this offset by
				2104	the kernel machine code,
				2105	right shifted by 8, and
				2106	moved to the FLAT_SCRATCH_HI
				2107	SGPR register.
				2108	FLAT_SCRATCH_HI corresponds
				2109	to SGPRn-4 on GFX7, and
				2110	SGPRn-6 on GFX8 (where SGPRn
				2111	is the highest numbered SGPR
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2112	allocated to the wavefront).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2113	FLAT_SCRATCH_HI is
				2114	multiplied by 256 (as it is
				2115	in units of 256 bytes) and
				2116	added to
				2117	``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2118	to calculate the per wavefront
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2119	FLAT SCRATCH BASE in flat
				2120	memory instructions that
				2121	access the scratch
				2122	apperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2123
				2124	The second SGPR is 32 bit
				2125	byte size of a single
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2126	work-item's scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2127	usage. CP obtains this from
				2128	the runtime, and it is
				2129	always a multiple of DWORD.
				2130	CP checks that the value in
				2131	the kernel dispatch packet
				2132	Private Segment Byte Size is
				2133	not larger, and requests the
				2134	runtime to increase the
				2135	queue's scratch size if
				2136	necessary. The kernel code
				2137	must move it to
				2138	FLAT_SCRATCH_LO which is
				2139	SGPRn-3 on GFX7 and SGPRn-5
				2140	on GFX8. FLAT_SCRATCH_LO is
				2141	used as the FLAT SCRATCH
				2142	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2143	instructions. Having CP load
				2144	it once avoids loading it at
				2145	the beginning of every
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2146	wavefront.
				2147	GFX9
				2148	This is the
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2149	64 bit base address of the
				2150	per SPI scratch backing
				2151	memory managed by SPI for
				2152	the queue executing the
				2153	kernel dispatch. CP obtains
				2154	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2155	divides it if there are
				2156	multiple Shader Arrays each
				2157	with its own SPI). The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2158	of Scratch Wavefront Offset must
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2159	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2160	machine code and the result
				2161	moved to the FLAT_SCRATCH
				2162	SGPR which is SGPRn-6 and
				2163	SGPRn-5. It is used as the
				2164	FLAT SCRATCH BASE in flat
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2165	memory instructions.
				2166	then Private Segment Size 1 The 32 bit byte size of a
				2167	(enable_sgpr_private single
				2168	work-item's
				2169	scratch_segment_size) memory
				2170	allocation. This is the
				2171	value from the kernel
				2172	dispatch packet Private
				2173	Segment Byte Size rounded up
				2174	by CP to a multiple of
				2175	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2176
				2177	Having CP load it once avoids
				2178	loading it at the beginning of
				2179	every wavefront.
				2180
				2181	This is not used for
				2182	GFX7-GFX8 since it is the same
				2183	value as the second SGPR of
				2184	Flat Scratch Init. However, it
				2185	may be needed for GFX9 which
				2186	changes the meaning of the
				2187	Flat Scratch Init value.
				2188	then Grid Work-Group Count X 1 32 bit count of the number of
				2189	(enable_sgpr_grid work-groups in the X dimension
				2190	_workgroup_count_X) for the grid being
				2191	executed. Computed from the
				2192	fields in the kernel dispatch
				2193	packet as ((grid_size.x +
				2194	workgroup_size.x - 1) /
				2195	workgroup_size.x).
				2196	then Grid Work-Group Count Y 1 32 bit count of the number of
				2197	(enable_sgpr_grid work-groups in the Y dimension
				2198	_workgroup_count_Y && for the grid being
				2199	less than 16 previous executed. Computed from the
				2200	SGPRs) fields in the kernel dispatch
				2201	packet as ((grid_size.y +
				2202	workgroup_size.y - 1) /
				2203	workgroupSize.y).
				2204
				2205	Only initialized if <16
				2206	previous SGPRs initialized.
				2207	then Grid Work-Group Count Z 1 32 bit count of the number of
				2208	(enable_sgpr_grid work-groups in the Z dimension
				2209	_workgroup_count_Z && for the grid being
				2210	less than 16 previous executed. Computed from the
				2211	SGPRs) fields in the kernel dispatch
				2212	packet as ((grid_size.z +
				2213	workgroup_size.z - 1) /
				2214	workgroupSize.z).
				2215
				2216	Only initialized if <16
				2217	previous SGPRs initialized.
				2218	then Work-Group Id X 1 32 bit work-group id in X
				2219	(enable_sgpr_workgroup_id dimension of grid for
				2220	_X) wavefront.
				2221	then Work-Group Id Y 1 32 bit work-group id in Y
				2222	(enable_sgpr_workgroup_id dimension of grid for
				2223	_Y) wavefront.
				2224	then Work-Group Id Z 1 32 bit work-group id in Z
				2225	(enable_sgpr_workgroup_id dimension of grid for
				2226	_Z) wavefront.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2227	then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2228	(enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2229	_info) threadgroup_size_in_wavefronts[5:0]}
				2230	then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2231	(enable_sgpr_private of scratch base of queue
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2232	_segment_wavefront_offset) executing the kernel
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2233	dispatch. Must be used as an
				2234	offset with Private
				2235	segment address when using
				2236	Scratch Segment Buffer. It
				2237	must be used to set up FLAT
				2238	SCRATCH for flat addressing
				2239	(see
				2240	:ref:`amdgpu-amdhsa-flat-scratch`).
				2241	========== ========================== ====== ==============================
				2242
				2243	The order of the VGPR registers is defined, but the compiler can specify which
				2244	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2245	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2246	for enabled registers are dense starting at VGPR0: the first enabled register is
				2247	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2248	VGPR number.
				2249
				2250	VGPR register initial state is defined in
				2251	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2252
				2253	.. table:: VGPR Register Set Up Order
				2254	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2255
				2256	========== ========================== ====== ==============================
				2257	VGPR Order Name Number Description
				2258	(kernel descriptor enable of
				2259	field) VGPRs
				2260	========== ========================== ====== ==============================
				2261	First Work-Item Id X 1 32 bit work item id in X
				2262	(Always initialized) dimension of work-group for
				2263	wavefront lane.
				2264	then Work-Item Id Y 1 32 bit work item id in Y
				2265	(enable_vgpr_workitem_id dimension of work-group for
				2266	> 0) wavefront lane.
				2267	then Work-Item Id Z 1 32 bit work item id in Z
				2268	(enable_vgpr_workitem_id dimension of work-group for
				2269	> 1) wavefront lane.
				2270	========== ========================== ====== ==============================
				2271
Hiroshi Inoue	bcadfee	2018-04-12 05:53:20 +0000	[diff] [blame^]	2272	The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2273
				2274	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2275	registers.
				2276	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2277	combination including none.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2278	3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
				2279	its value cannot included with the flat scratch init value which is per queue.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2280	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2281	or (X, Y, Z).
				2282
				2283	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2284	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2285
				2286	The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2287	has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2288	instructions (GFX9).
				2289
				2290	If buffer operations are used then the compiler can generate a V# with the
				2291	following properties:
				2292
				2293	* base address of 0
				2294	* no swizzle
				2295	* ATC: 1 if IOMMU present (such as APU)
				2296	* ptr64: 1
				2297	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2298	APU and NC for dGPU).
				2299
				2300	.. _amdgpu-amdhsa-kernel-prolog:
				2301
				2302	Kernel Prolog
				2303	~~~~~~~~~~~~~
				2304
				2305	.. _amdgpu-amdhsa-m0:
				2306
				2307	M0
				2308	++
				2309
				2310	GFX6-GFX8
				2311	The M0 register must be initialized with a value at least the total LDS size
				2312	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2313	available in dispatch packet. For M0, it is also possible to use maximum
				2314	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2315	GFX7-GFX8).
				2316	GFX9
				2317	The M0 register is not used for range checking LDS accesses and so does not
				2318	need to be initialized in the prolog.
				2319
				2320	.. _amdgpu-amdhsa-flat-scratch:
				2321
				2322	Flat Scratch
				2323	++++++++++++
				2324
				2325	If the kernel may use flat operations to access scratch memory, the prolog code
				2326	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2327	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2328	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2329
				2330	GFX6
				2331	Flat scratch is not supported.
				2332
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2333	GFX7-GFX8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2334	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2335	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2336	being managed by SPI for the queue executing the kernel dispatch. This is
				2337	the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2338	prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2339	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2340	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2341	by 8 before moving into FLAT_SCRATCH_LO.
				2342	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2343	work-items scratch memory usage. This is directly loaded from the kernel
				2344	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2345	DWORD. Having CP load it once avoids loading it at the beginning of every
				2346	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2347	SIZE.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2348
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2349	GFX9
				2350	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2351	memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2352	prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2353	pair for use as the flat scratch base in flat memory instructions.
				2354
				2355	.. _amdgpu-amdhsa-memory-model:
				2356
				2357	Memory Model
				2358	~~~~~~~~~~~~
				2359
				2360	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2361	(see :ref:`memmodel`). The implementation is WIP.
				2362
				2363	.. TODO
				2364	Update when implementation complete.
				2365
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2366	The AMDGPU backend supports the memory synchronization scopes specified in
				2367	:ref:`amdgpu-memory-scopes`.
				2368
				2369	The code sequences used to implement the memory model are defined in table
				2370	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2371
				2372	The sequences specify the order of instructions that a single thread must
				2373	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2374	to other memory instructions executed by the same thread. This allows them to be
				2375	moved earlier or later which can allow them to be combined with other instances
				2376	of the same instruction, or hoisted/sunk out of loops to improve
				2377	performance. Only the instructions related to the memory model are given;
				2378	additional ``s_waitcnt`` instructions are required to ensure registers are
				2379	defined before being used. These may be able to be combined with the memory
				2380	model ``s_waitcnt`` instructions as described above.
				2381
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2382	The AMDGPU backend supports the following memory models:
				2383
				2384	HSA Memory Model [HSA]_
				2385	The HSA memory model uses a single happens-before relation for all address
				2386	spaces (see :ref:`amdgpu-address-spaces`).
				2387	OpenCL Memory Model [OpenCL]_
				2388	The OpenCL memory model which has separate happens-before relations for the
				2389	global and local address spaces. Only a fence specifying both global and
				2390	local address space, and seq_cst instructions join the relationships. Since
				2391	the LLVM ``memfence`` instruction does not allow an address space to be
				2392	specified the OpenCL fence has to convervatively assume both local and
				2393	global address space was specified. However, optimizations can often be
				2394	done to eliminate the additional ``s_waitcnt`` instructions when there are
				2395	no intervening memory instructions which access the corresponding address
				2396	space. The code sequences in the table indicate what can be omitted for the
				2397	OpenCL memory. The target triple environment is used to determine if the
				2398	source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2399
				2400	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2401	operations.
				2402
				2403	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2404	termed vector memory operations.
				2405
				2406	For GFX6-GFX9:
				2407
				2408	* Each agent has multiple compute units (CU).
				2409	* Each CU has multiple SIMDs that execute wavefronts.
				2410	* The wavefronts for a single work-group are executed in the same CU but may be
				2411	executed by different SIMDs.
				2412	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2413	executing on it.
				2414	* All LDS operations of a CU are performed as wavefront wide operations in a
				2415	global order and involve no caching. Completion is reported to a wavefront in
				2416	execution order.
				2417	* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2418	CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2419	can be reordered relative to each other, which can result in reordering the
				2420	visibility of vector memory operations with respect to LDS operations of other
				2421	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2422	ensure synchronization between LDS operations and vector memory operations
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2423	between wavefronts of a work-group, but not between operations performed by the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2424	same wavefront.
				2425	* The vector memory operations are performed as wavefront wide operations and
				2426	completion is reported to a wavefront in execution order. The exception is
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2427	that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2428	vector memory order if they access LDS memory, and out of LDS operation order
				2429	if they access global memory.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2430	* The vector memory operations access a single vector L1 cache shared by all
				2431	SIMDs a CU. Therefore, no special action is required for coherence between the
				2432	lanes of a single wavefront, or for coherence between wavefronts in the same
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2433	work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2434	executing in different work-groups as they may be executing on different CUs.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2435	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2436	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2437	scalar operations are used in a restricted way so do not impact the memory
				2438	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2439	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2440	the same agent.
				2441	* The L2 cache has independent channels to service disjoint ranges of virtual
				2442	addresses.
				2443	* Each CU has a separate request queue per channel. Therefore, the vector and
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2444	scalar memory operations performed by wavefronts executing in different work-groups
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2445	(which may be executing on different CUs) of an agent can be reordered
				2446	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2447	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2448	previous vector memory operation has completed before executing a subsequent
				2449	vector memory or LDS operation and so can be used to meet the requirements of
				2450	acquire and release.
				2451	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2452	of virtual addresses can be set up to bypass it to ensure system coherence.
				2453
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2454	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2455	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2456	memory, atomic memory orderings are not meaningful and all accesses are treated
				2457	as non-atomic.
				2458
				2459	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2460	scalar memory instructions). Since the constant address space contents do not
				2461	change during the execution of a kernel dispatch it is not legal to perform
				2462	stores, and atomic memory orderings are not meaningful and all access are
				2463	treated as non-atomic.
				2464
				2465	A memory synchronization scope wider than work-group is not meaningful for the
				2466	group (LDS) address space and is treated as work-group.
				2467
				2468	The memory model does not support the region address space which is treated as
				2469	non-atomic.
				2470
				2471	Acquire memory ordering is not meaningful on store atomic instructions and is
				2472	treated as non-atomic.
				2473
				2474	Release memory ordering is not meaningful on load atomic instructions and is
				2475	treated a non-atomic.
				2476
				2477	Acquire-release memory ordering is not meaningful on load or store atomic
				2478	instructions and is treated as acquire and release respectively.
				2479
				2480	AMDGPU backend only uses scalar memory operations to access memory that is
				2481	proven to not change during the execution of the kernel dispatch. This includes
				2482	constant address space and global address space for program scope const
				2483	variables. Therefore the kernel machine code does not have to maintain the
				2484	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2485	and vector L1 caches are invalidated between kernel dispatches by CP since
				2486	constant address space data may change between kernel dispatch executions. See
				2487	:ref:`amdgpu-amdhsa-memory-spaces`.
				2488
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2489	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2490	case the AMDGPU backend ensures the memory location used to spill is never
				2491	accessed by vector memory operations at the same time. If scalar writes are used
				2492	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2493	return since the locations may be used for vector memory instructions by a
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2494	future wavefront that uses the same scratch area, or a function call that creates a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2495	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2496	as all scalar writes are write-before-read in the same thread.
				2497
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2498	Scratch backing memory (which is used for the private address space)
				2499	is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
				2500	address space is only accessed by a single thread, and is always
				2501	write-before-read, there is never a need to invalidate these entries from the L1
				2502	cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
				2503	volatile cache lines.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2504
				2505	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2506	to invalidate the L2 cache. This also causes it to be treated as
				2507	non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
				2508	(cache coherent) and so the L2 cache will coherent with the CPU and other
				2509	agents.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2510
				2511	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2512	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2513
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2514	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2515	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2516	Ordering Sync Scope Address
				2517	Space
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2518	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2519	Non-Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2520	-----------------------------------------------------------------------------------
				2521	load none none - global - !volatile & !nontemporal
				2522	- generic
				2523	- private 1. buffer/global/flat_load
				2524	- constant
				2525	- volatile & !nontemporal
				2526
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2527	1. buffer/global/flat_load
				2528	glc=1
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2529
				2530	- nontemporal
				2531
				2532	1. buffer/global/flat_load
				2533	glc=1 slc=1
				2534
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2535	load none none - local 1. ds_load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2536	store none none - global - !nontemporal
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2537	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2538	- private 1. buffer/global/flat_store
				2539	- constant
				2540	- nontemporal
				2541
				2542	1. buffer/global/flat_stote
				2543	glc=1 slc=1
				2544
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2545	store none none - local 1. ds_store
				2546	Unordered Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2547	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2548	load atomic unordered any any Same as non-atomic.
				2549	store atomic unordered any any Same as non-atomic.
				2550	atomicrmw unordered any any *Same as monotonic
				2551	atomic*.
				2552	Monotonic Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2553	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2554	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2555	- wavefront - generic
				2556	- workgroup
				2557	load atomic monotonic - singlethread - local 1. ds_load
				2558	- wavefront
				2559	- workgroup
				2560	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2561	- system - generic glc=1
				2562	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2563	- wavefront - generic
				2564	- workgroup
				2565	- agent
				2566	- system
				2567	store atomic monotonic - singlethread - local 1. ds_store
				2568	- wavefront
				2569	- workgroup
				2570	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2571	- wavefront - generic
				2572	- workgroup
				2573	- agent
				2574	- system
				2575	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2576	- wavefront
				2577	- workgroup
				2578	Acquire Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2579	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2580	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2581	- wavefront - local
				2582	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2583	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				2584	load atomic acquire - workgroup - local 1. ds_load
				2585	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2586
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2587	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2588	- Must happen before
				2589	any following
				2590	global/generic
				2591	load/load
				2592	atomic/store/store
				2593	atomic/atomicrmw.
				2594	- Ensures any
				2595	following global
				2596	data read is no
				2597	older than the load
				2598	atomic value being
				2599	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2600	load atomic acquire - workgroup - generic 1. flat_load
				2601	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2602
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2603	- If OpenCL, omit.
				2604	- Must happen before
				2605	any following
				2606	global/generic
				2607	load/load
				2608	atomic/store/store
				2609	atomic/atomicrmw.
				2610	- Ensures any
				2611	following global
				2612	data read is no
				2613	older than the load
				2614	atomic value being
				2615	acquired.
				2616	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2617	- system glc=1
				2618	2. s_waitcnt vmcnt(0)
				2619
				2620	- Must happen before
				2621	following
				2622	buffer_wbinvl1_vol.
				2623	- Ensures the load
				2624	has completed
				2625	before invalidating
				2626	the cache.
				2627
				2628	3. buffer_wbinvl1_vol
				2629
				2630	- Must happen before
				2631	any following
				2632	global/generic
				2633	load/load
				2634	atomic/atomicrmw.
				2635	- Ensures that
				2636	following
				2637	loads will not see
				2638	stale global data.
				2639
				2640	load atomic acquire - agent - generic 1. flat_load glc=1
				2641	- system 2. s_waitcnt vmcnt(0) &
				2642	lgkmcnt(0)
				2643
				2644	- If OpenCL omit
				2645	lgkmcnt(0).
				2646	- Must happen before
				2647	following
				2648	buffer_wbinvl1_vol.
				2649	- Ensures the flat_load
				2650	has completed
				2651	before invalidating
				2652	the cache.
				2653
				2654	3. buffer_wbinvl1_vol
				2655
				2656	- Must happen before
				2657	any following
				2658	global/generic
				2659	load/load
				2660	atomic/atomicrmw.
				2661	- Ensures that
				2662	following loads
				2663	will not see stale
				2664	global data.
				2665
				2666	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2667	- wavefront - local
				2668	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2669	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				2670	atomicrmw acquire - workgroup - local 1. ds_atomic
				2671	2. waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2672
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2673	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2674	- Must happen before
				2675	any following
				2676	global/generic
				2677	load/load
				2678	atomic/store/store
				2679	atomic/atomicrmw.
				2680	- Ensures any
				2681	following global
				2682	data read is no
				2683	older than the
				2684	atomicrmw value
				2685	being acquired.
				2686
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2687	atomicrmw acquire - workgroup - generic 1. flat_atomic
				2688	2. waitcnt lgkmcnt(0)
				2689
				2690	- If OpenCL, omit.
				2691	- Must happen before
				2692	any following
				2693	global/generic
				2694	load/load
				2695	atomic/store/store
				2696	atomic/atomicrmw.
				2697	- Ensures any
				2698	following global
				2699	data read is no
				2700	older than the
				2701	atomicrmw value
				2702	being acquired.
				2703
				2704	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2705	- system 2. s_waitcnt vmcnt(0)
				2706
				2707	- Must happen before
				2708	following
				2709	buffer_wbinvl1_vol.
				2710	- Ensures the
				2711	atomicrmw has
				2712	completed before
				2713	invalidating the
				2714	cache.
				2715
				2716	3. buffer_wbinvl1_vol
				2717
				2718	- Must happen before
				2719	any following
				2720	global/generic
				2721	load/load
				2722	atomic/atomicrmw.
				2723	- Ensures that
				2724	following loads
				2725	will not see stale
				2726	global data.
				2727
				2728	atomicrmw acquire - agent - generic 1. flat_atomic
				2729	- system 2. s_waitcnt vmcnt(0) &
				2730	lgkmcnt(0)
				2731
				2732	- If OpenCL, omit
				2733	lgkmcnt(0).
				2734	- Must happen before
				2735	following
				2736	buffer_wbinvl1_vol.
				2737	- Ensures the
				2738	atomicrmw has
				2739	completed before
				2740	invalidating the
				2741	cache.
				2742
				2743	3. buffer_wbinvl1_vol
				2744
				2745	- Must happen before
				2746	any following
				2747	global/generic
				2748	load/load
				2749	atomic/atomicrmw.
				2750	- Ensures that
				2751	following loads
				2752	will not see stale
				2753	global data.
				2754
				2755	fence acquire - singlethread none none
				2756	- wavefront
				2757	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2758
				2759	- If OpenCL and
				2760	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2761	not generic, omit.
				2762	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2763	currently has no
				2764	address space on
				2765	the fence need to
				2766	conservatively
				2767	always generate. If
				2768	fence had an
				2769	address space then
				2770	set to address
				2771	space of OpenCL
				2772	fence flag, or to
				2773	generic if both
				2774	local and global
				2775	flags are
				2776	specified.
				2777	- Must happen after
				2778	any preceding
				2779	local/generic load
				2780	atomic/atomicrmw
				2781	with an equal or
				2782	wider sync scope
				2783	and memory ordering
				2784	stronger than
				2785	unordered (this is
				2786	termed the
				2787	fence-paired-atomic).
				2788	- Must happen before
				2789	any following
				2790	global/generic
				2791	load/load
				2792	atomic/store/store
				2793	atomic/atomicrmw.
				2794	- Ensures any
				2795	following global
				2796	data read is no
				2797	older than the
				2798	value read by the
				2799	fence-paired-atomic.
				2800
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2801	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				2802	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2803
				2804	- If OpenCL and
				2805	address space is
				2806	not generic, omit
				2807	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2808	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2809	currently has no
				2810	address space on
				2811	the fence need to
				2812	conservatively
				2813	always generate
				2814	(see comment for
				2815	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2816	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2817	separate s_waitcnt
				2818	vmcnt(0) and
				2819	s_waitcnt
				2820	lgkmcnt(0) to allow
				2821	them to be
				2822	independently moved
				2823	according to the
				2824	following rules.
				2825	- s_waitcnt vmcnt(0)
				2826	must happen after
				2827	any preceding
				2828	global/generic load
				2829	atomic/atomicrmw
				2830	with an equal or
				2831	wider sync scope
				2832	and memory ordering
				2833	stronger than
				2834	unordered (this is
				2835	termed the
				2836	fence-paired-atomic).
				2837	- s_waitcnt lgkmcnt(0)
				2838	must happen after
				2839	any preceding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2840	local/generic load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2841	atomic/atomicrmw
				2842	with an equal or
				2843	wider sync scope
				2844	and memory ordering
				2845	stronger than
				2846	unordered (this is
				2847	termed the
				2848	fence-paired-atomic).
				2849	- Must happen before
				2850	the following
				2851	buffer_wbinvl1_vol.
				2852	- Ensures that the
				2853	fence-paired atomic
				2854	has completed
				2855	before invalidating
				2856	the
				2857	cache. Therefore
				2858	any following
				2859	locations read must
				2860	be no older than
				2861	the value read by
				2862	the
				2863	fence-paired-atomic.
				2864
				2865	2. buffer_wbinvl1_vol
				2866
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2867	- Must happen before any
				2868	following global/generic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2869	load/load
				2870	atomic/store/store
				2871	atomic/atomicrmw.
				2872	- Ensures that
				2873	following loads
				2874	will not see stale
				2875	global data.
				2876
				2877	Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2878	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2879	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2880	- wavefront - local
				2881	- generic
				2882	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2883
				2884	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2885	- Must happen after
				2886	any preceding
				2887	local/generic
				2888	load/store/load
				2889	atomic/store
				2890	atomic/atomicrmw.
				2891	- Must happen before
				2892	the following
				2893	store.
				2894	- Ensures that all
				2895	memory operations
				2896	to local have
				2897	completed before
				2898	performing the
				2899	store that is being
				2900	released.
				2901
				2902	2. buffer/global/flat_store
				2903	store atomic release - workgroup - local 1. ds_store
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2904	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2905
				2906	- If OpenCL, omit.
				2907	- Must happen after
				2908	any preceding
				2909	local/generic
				2910	load/store/load
				2911	atomic/store
				2912	atomic/atomicrmw.
				2913	- Must happen before
				2914	the following
				2915	store.
				2916	- Ensures that all
				2917	memory operations
				2918	to local have
				2919	completed before
				2920	performing the
				2921	store that is being
				2922	released.
				2923
				2924	2. flat_store
				2925	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2926	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2927
				2928	- If OpenCL, omit
				2929	lgkmcnt(0).
				2930	- Could be split into
				2931	separate s_waitcnt
				2932	vmcnt(0) and
				2933	s_waitcnt
				2934	lgkmcnt(0) to allow
				2935	them to be
				2936	independently moved
				2937	according to the
				2938	following rules.
				2939	- s_waitcnt vmcnt(0)
				2940	must happen after
				2941	any preceding
				2942	global/generic
				2943	load/store/load
				2944	atomic/store
				2945	atomic/atomicrmw.
				2946	- s_waitcnt lgkmcnt(0)
				2947	must happen after
				2948	any preceding
				2949	local/generic
				2950	load/store/load
				2951	atomic/store
				2952	atomic/atomicrmw.
				2953	- Must happen before
				2954	the following
				2955	store.
				2956	- Ensures that all
				2957	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2958	to memory have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2959	completed before
				2960	performing the
				2961	store that is being
				2962	released.
				2963
				2964	2. buffer/global/ds/flat_store
				2965	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2966	- wavefront - local
				2967	- generic
				2968	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2969
				2970	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2971	- Must happen after
				2972	any preceding
				2973	local/generic
				2974	load/store/load
				2975	atomic/store
				2976	atomic/atomicrmw.
				2977	- Must happen before
				2978	the following
				2979	atomicrmw.
				2980	- Ensures that all
				2981	memory operations
				2982	to local have
				2983	completed before
				2984	performing the
				2985	atomicrmw that is
				2986	being released.
				2987
				2988	2. buffer/global/flat_atomic
				2989	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2990	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2991
				2992	- If OpenCL, omit.
				2993	- Must happen after
				2994	any preceding
				2995	local/generic
				2996	load/store/load
				2997	atomic/store
				2998	atomic/atomicrmw.
				2999	- Must happen before
				3000	the following
				3001	atomicrmw.
				3002	- Ensures that all
				3003	memory operations
				3004	to local have
				3005	completed before
				3006	performing the
				3007	atomicrmw that is
				3008	being released.
				3009
				3010	2. flat_atomic
				3011	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				3012	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3013
				3014	- If OpenCL, omit
				3015	lgkmcnt(0).
				3016	- Could be split into
				3017	separate s_waitcnt
				3018	vmcnt(0) and
				3019	s_waitcnt
				3020	lgkmcnt(0) to allow
				3021	them to be
				3022	independently moved
				3023	according to the
				3024	following rules.
				3025	- s_waitcnt vmcnt(0)
				3026	must happen after
				3027	any preceding
				3028	global/generic
				3029	load/store/load
				3030	atomic/store
				3031	atomic/atomicrmw.
				3032	- s_waitcnt lgkmcnt(0)
				3033	must happen after
				3034	any preceding
				3035	local/generic
				3036	load/store/load
				3037	atomic/store
				3038	atomic/atomicrmw.
				3039	- Must happen before
				3040	the following
				3041	atomicrmw.
				3042	- Ensures that all
				3043	memory operations
				3044	to global and local
				3045	have completed
				3046	before performing
				3047	the atomicrmw that
				3048	is being released.
				3049
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3050	2. buffer/global/ds/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3051	fence release - singlethread none none
				3052	- wavefront
				3053	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				3054
				3055	- If OpenCL and
				3056	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3057	not generic, omit.
				3058	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3059	currently has no
				3060	address space on
				3061	the fence need to
				3062	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3063	always generate. If
				3064	fence had an
				3065	address space then
				3066	set to address
				3067	space of OpenCL
				3068	fence flag, or to
				3069	generic if both
				3070	local and global
				3071	flags are
				3072	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3073	- Must happen after
				3074	any preceding
				3075	local/generic
				3076	load/load
				3077	atomic/store/store
				3078	atomic/atomicrmw.
				3079	- Must happen before
				3080	any following store
				3081	atomic/atomicrmw
				3082	with an equal or
				3083	wider sync scope
				3084	and memory ordering
				3085	stronger than
				3086	unordered (this is
				3087	termed the
				3088	fence-paired-atomic).
				3089	- Ensures that all
				3090	memory operations
				3091	to local have
				3092	completed before
				3093	performing the
				3094	following
				3095	fence-paired-atomic.
				3096
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3097	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				3098	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3099
				3100	- If OpenCL and
				3101	address space is
				3102	not generic, omit
				3103	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3104	- If OpenCL and
				3105	address space is
				3106	local, omit
				3107	vmcnt(0).
				3108	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3109	currently has no
				3110	address space on
				3111	the fence need to
				3112	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3113	always generate. If
				3114	fence had an
				3115	address space then
				3116	set to address
				3117	space of OpenCL
				3118	fence flag, or to
				3119	generic if both
				3120	local and global
				3121	flags are
				3122	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3123	- Could be split into
				3124	separate s_waitcnt
				3125	vmcnt(0) and
				3126	s_waitcnt
				3127	lgkmcnt(0) to allow
				3128	them to be
				3129	independently moved
				3130	according to the
				3131	following rules.
				3132	- s_waitcnt vmcnt(0)
				3133	must happen after
				3134	any preceding
				3135	global/generic
				3136	load/store/load
				3137	atomic/store
				3138	atomic/atomicrmw.
				3139	- s_waitcnt lgkmcnt(0)
				3140	must happen after
				3141	any preceding
				3142	local/generic
				3143	load/store/load
				3144	atomic/store
				3145	atomic/atomicrmw.
				3146	- Must happen before
				3147	any following store
				3148	atomic/atomicrmw
				3149	with an equal or
				3150	wider sync scope
				3151	and memory ordering
				3152	stronger than
				3153	unordered (this is
				3154	termed the
				3155	fence-paired-atomic).
				3156	- Ensures that all
				3157	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3158	have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3159	completed before
				3160	performing the
				3161	following
				3162	fence-paired-atomic.
				3163
				3164	Acquire-Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3165	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3166	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3167	- wavefront - local
				3168	- generic
				3169	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3170
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3171	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3172	- Must happen after
				3173	any preceding
				3174	local/generic
				3175	load/store/load
				3176	atomic/store
				3177	atomic/atomicrmw.
				3178	- Must happen before
				3179	the following
				3180	atomicrmw.
				3181	- Ensures that all
				3182	memory operations
				3183	to local have
				3184	completed before
				3185	performing the
				3186	atomicrmw that is
				3187	being released.
				3188
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3189	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3190	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3191	2. s_waitcnt lgkmcnt(0)
				3192
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3193	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3194	- Must happen before
				3195	any following
				3196	global/generic
				3197	load/load
				3198	atomic/store/store
				3199	atomic/atomicrmw.
				3200	- Ensures any
				3201	following global
				3202	data read is no
				3203	older than the load
				3204	atomic value being
				3205	acquired.
				3206
				3207	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3208
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3209	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3210	- Must happen after
				3211	any preceding
				3212	local/generic
				3213	load/store/load
				3214	atomic/store
				3215	atomic/atomicrmw.
				3216	- Must happen before
				3217	the following
				3218	atomicrmw.
				3219	- Ensures that all
				3220	memory operations
				3221	to local have
				3222	completed before
				3223	performing the
				3224	atomicrmw that is
				3225	being released.
				3226
				3227	2. flat_atomic
				3228	3. s_waitcnt lgkmcnt(0)
				3229
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3230	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3231	- Must happen before
				3232	any following
				3233	global/generic
				3234	load/load
				3235	atomic/store/store
				3236	atomic/atomicrmw.
				3237	- Ensures any
				3238	following global
				3239	data read is no
				3240	older than the load
				3241	atomic value being
				3242	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3243
				3244	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3245	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3246
				3247	- If OpenCL, omit
				3248	lgkmcnt(0).
				3249	- Could be split into
				3250	separate s_waitcnt
				3251	vmcnt(0) and
				3252	s_waitcnt
				3253	lgkmcnt(0) to allow
				3254	them to be
				3255	independently moved
				3256	according to the
				3257	following rules.
				3258	- s_waitcnt vmcnt(0)
				3259	must happen after
				3260	any preceding
				3261	global/generic
				3262	load/store/load
				3263	atomic/store
				3264	atomic/atomicrmw.
				3265	- s_waitcnt lgkmcnt(0)
				3266	must happen after
				3267	any preceding
				3268	local/generic
				3269	load/store/load
				3270	atomic/store
				3271	atomic/atomicrmw.
				3272	- Must happen before
				3273	the following
				3274	atomicrmw.
				3275	- Ensures that all
				3276	memory operations
				3277	to global have
				3278	completed before
				3279	performing the
				3280	atomicrmw that is
				3281	being released.
				3282
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3283	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3284	3. s_waitcnt vmcnt(0)
				3285
				3286	- Must happen before
				3287	following
				3288	buffer_wbinvl1_vol.
				3289	- Ensures the
				3290	atomicrmw has
				3291	completed before
				3292	invalidating the
				3293	cache.
				3294
				3295	4. buffer_wbinvl1_vol
				3296
				3297	- Must happen before
				3298	any following
				3299	global/generic
				3300	load/load
				3301	atomic/atomicrmw.
				3302	- Ensures that
				3303	following loads
				3304	will not see stale
				3305	global data.
				3306
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3307	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3308	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3309
				3310	- If OpenCL, omit
				3311	lgkmcnt(0).
				3312	- Could be split into
				3313	separate s_waitcnt
				3314	vmcnt(0) and
				3315	s_waitcnt
				3316	lgkmcnt(0) to allow
				3317	them to be
				3318	independently moved
				3319	according to the
				3320	following rules.
				3321	- s_waitcnt vmcnt(0)
				3322	must happen after
				3323	any preceding
				3324	global/generic
				3325	load/store/load
				3326	atomic/store
				3327	atomic/atomicrmw.
				3328	- s_waitcnt lgkmcnt(0)
				3329	must happen after
				3330	any preceding
				3331	local/generic
				3332	load/store/load
				3333	atomic/store
				3334	atomic/atomicrmw.
				3335	- Must happen before
				3336	the following
				3337	atomicrmw.
				3338	- Ensures that all
				3339	memory operations
				3340	to global have
				3341	completed before
				3342	performing the
				3343	atomicrmw that is
				3344	being released.
				3345
				3346	2. flat_atomic
				3347	3. s_waitcnt vmcnt(0) &
				3348	lgkmcnt(0)
				3349
				3350	- If OpenCL, omit
				3351	lgkmcnt(0).
				3352	- Must happen before
				3353	following
				3354	buffer_wbinvl1_vol.
				3355	- Ensures the
				3356	atomicrmw has
				3357	completed before
				3358	invalidating the
				3359	cache.
				3360
				3361	4. buffer_wbinvl1_vol
				3362
				3363	- Must happen before
				3364	any following
				3365	global/generic
				3366	load/load
				3367	atomic/atomicrmw.
				3368	- Ensures that
				3369	following loads
				3370	will not see stale
				3371	global data.
				3372
				3373	fence acq_rel - singlethread none none
				3374	- wavefront
				3375	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3376
				3377	- If OpenCL and
				3378	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3379	not generic, omit.
				3380	- However,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3381	since LLVM
				3382	currently has no
				3383	address space on
				3384	the fence need to
				3385	conservatively
				3386	always generate
				3387	(see comment for
				3388	previous fence).
				3389	- Must happen after
				3390	any preceding
				3391	local/generic
				3392	load/load
				3393	atomic/store/store
				3394	atomic/atomicrmw.
				3395	- Must happen before
				3396	any following
				3397	global/generic
				3398	load/load
				3399	atomic/store/store
				3400	atomic/atomicrmw.
				3401	- Ensures that all
				3402	memory operations
				3403	to local have
				3404	completed before
				3405	performing any
				3406	following global
				3407	memory operations.
				3408	- Ensures that the
				3409	preceding
				3410	local/generic load
				3411	atomic/atomicrmw
				3412	with an equal or
				3413	wider sync scope
				3414	and memory ordering
				3415	stronger than
				3416	unordered (this is
				3417	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3418	acquire-fence-paired-atomic
				3419	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3420	before following
				3421	global memory
				3422	operations. This
				3423	satisfies the
				3424	requirements of
				3425	acquire.
				3426	- Ensures that all
				3427	previous memory
				3428	operations have
				3429	completed before a
				3430	following
				3431	local/generic store
				3432	atomic/atomicrmw
				3433	with an equal or
				3434	wider sync scope
				3435	and memory ordering
				3436	stronger than
				3437	unordered (this is
				3438	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3439	release-fence-paired-atomic
				3440	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3441	requirements of
				3442	release.
				3443
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3444	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				3445	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3446
				3447	- If OpenCL and
				3448	address space is
				3449	not generic, omit
				3450	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3451	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3452	currently has no
				3453	address space on
				3454	the fence need to
				3455	conservatively
				3456	always generate
				3457	(see comment for
				3458	previous fence).
				3459	- Could be split into
				3460	separate s_waitcnt
				3461	vmcnt(0) and
				3462	s_waitcnt
				3463	lgkmcnt(0) to allow
				3464	them to be
				3465	independently moved
				3466	according to the
				3467	following rules.
				3468	- s_waitcnt vmcnt(0)
				3469	must happen after
				3470	any preceding
				3471	global/generic
				3472	load/store/load
				3473	atomic/store
				3474	atomic/atomicrmw.
				3475	- s_waitcnt lgkmcnt(0)
				3476	must happen after
				3477	any preceding
				3478	local/generic
				3479	load/store/load
				3480	atomic/store
				3481	atomic/atomicrmw.
				3482	- Must happen before
				3483	the following
				3484	buffer_wbinvl1_vol.
				3485	- Ensures that the
				3486	preceding
				3487	global/local/generic
				3488	load
				3489	atomic/atomicrmw
				3490	with an equal or
				3491	wider sync scope
				3492	and memory ordering
				3493	stronger than
				3494	unordered (this is
				3495	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3496	acquire-fence-paired-atomic
				3497	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3498	before invalidating
				3499	the cache. This
				3500	satisfies the
				3501	requirements of
				3502	acquire.
				3503	- Ensures that all
				3504	previous memory
				3505	operations have
				3506	completed before a
				3507	following
				3508	global/local/generic
				3509	store
				3510	atomic/atomicrmw
				3511	with an equal or
				3512	wider sync scope
				3513	and memory ordering
				3514	stronger than
				3515	unordered (this is
				3516	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3517	release-fence-paired-atomic
				3518	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3519	requirements of
				3520	release.
				3521
				3522	2. buffer_wbinvl1_vol
				3523
				3524	- Must happen before
				3525	any following
				3526	global/generic
				3527	load/load
				3528	atomic/store/store
				3529	atomic/atomicrmw.
				3530	- Ensures that
				3531	following loads
				3532	will not see stale
				3533	global data. This
				3534	satisfies the
				3535	requirements of
				3536	acquire.
				3537
				3538	Sequential Consistent Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3539	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3540	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3541	- wavefront - local load atomic acquire,
				3542	- generic except must generated
				3543	all instructions even
				3544	for OpenCL.*
				3545	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3546	- generic
				3547	- Must
				3548	happen after
				3549	preceding
				3550	global/generic load
				3551	atomic/store
				3552	atomic/atomicrmw
				3553	with memory
				3554	ordering of seq_cst
				3555	and with equal or
				3556	wider sync scope.
				3557	(Note that seq_cst
				3558	fences have their
				3559	own s_waitcnt
				3560	lgkmcnt(0) and so do
				3561	not need to be
				3562	considered.)
				3563	- Ensures any
				3564	preceding
				3565	sequential
				3566	consistent local
				3567	memory instructions
				3568	have completed
				3569	before executing
				3570	this sequentially
				3571	consistent
				3572	instruction. This
				3573	prevents reordering
				3574	a seq_cst store
				3575	followed by a
				3576	seq_cst load. (Note
				3577	that seq_cst is
				3578	stronger than
				3579	acquire/release as
				3580	the reordering of
				3581	load acquire
				3582	followed by a store
				3583	release is
				3584	prevented by the
				3585	waitcnt of
				3586	the release, but
				3587	there is nothing
				3588	preventing a store
				3589	release followed by
				3590	load acquire from
				3591	competing out of
				3592	order.)
				3593
				3594	2. *Following
				3595	instructions same as
				3596	corresponding load
				3597	atomic acquire,
				3598	except must generated
				3599	all instructions even
				3600	for OpenCL.*
				3601	load atomic seq_cst - workgroup - local *Same as corresponding
				3602	load atomic acquire,
				3603	except must generated
				3604	all instructions even
				3605	for OpenCL.*
				3606	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				3607	- system - generic vmcnt(0)
				3608
				3609	- Could be split into
				3610	separate s_waitcnt
				3611	vmcnt(0)
				3612	and s_waitcnt
				3613	lgkmcnt(0) to allow
				3614	them to be
				3615	independently moved
				3616	according to the
				3617	following rules.
				3618	- waitcnt lgkmcnt(0)
				3619	must happen after
				3620	preceding
				3621	global/generic load
				3622	atomic/store
				3623	atomic/atomicrmw
				3624	with memory
				3625	ordering of seq_cst
				3626	and with equal or
				3627	wider sync scope.
				3628	(Note that seq_cst
				3629	fences have their
				3630	own s_waitcnt
				3631	lgkmcnt(0) and so do
				3632	not need to be
				3633	considered.)
				3634	- waitcnt vmcnt(0)
				3635	must happen after
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3636	preceding
				3637	global/generic load
				3638	atomic/store
				3639	atomic/atomicrmw
				3640	with memory
				3641	ordering of seq_cst
				3642	and with equal or
				3643	wider sync scope.
				3644	(Note that seq_cst
				3645	fences have their
				3646	own s_waitcnt
				3647	vmcnt(0) and so do
				3648	not need to be
				3649	considered.)
				3650	- Ensures any
				3651	preceding
				3652	sequential
				3653	consistent global
				3654	memory instructions
				3655	have completed
				3656	before executing
				3657	this sequentially
				3658	consistent
				3659	instruction. This
				3660	prevents reordering
				3661	a seq_cst store
				3662	followed by a
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3663	seq_cst load. (Note
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3664	that seq_cst is
				3665	stronger than
				3666	acquire/release as
				3667	the reordering of
				3668	load acquire
				3669	followed by a store
				3670	release is
				3671	prevented by the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3672	waitcnt of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3673	the release, but
				3674	there is nothing
				3675	preventing a store
				3676	release followed by
				3677	load acquire from
				3678	competing out of
				3679	order.)
				3680
				3681	2. *Following
				3682	instructions same as
				3683	corresponding load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3684	atomic acquire,
				3685	except must generated
				3686	all instructions even
				3687	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3688	store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3689	- wavefront - local store atomic release,
				3690	- workgroup - generic except must generated
				3691	all instructions even
				3692	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3693	store atomic seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3694	- system - generic store atomic release,
				3695	except must generated
				3696	all instructions even
				3697	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3698	atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3699	- wavefront - local atomicrmw acq_rel,
				3700	- workgroup - generic except must generated
				3701	all instructions even
				3702	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3703	atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3704	- system - generic atomicrmw acq_rel,
				3705	except must generated
				3706	all instructions even
				3707	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3708	fence seq_cst - singlethread none *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3709	- wavefront fence acq_rel,
				3710	- workgroup except must generated
				3711	- agent all instructions even
				3712	- system for OpenCL.*
				3713	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3714
				3715	The memory order also adds the single thread optimization constrains defined in
				3716	table
				3717	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3718
				3719	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3720	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3721
				3722	============ ==============================================================
				3723	LLVM Memory Optimization Constraints
				3724	Ordering
				3725	============ ==============================================================
				3726	unordered none
				3727	monotonic none
				3728	acquire - If a load atomic/atomicrmw then no following load/load
				3729	atomic/store/ store atomic/atomicrmw/fence instruction can
				3730	be moved before the acquire.
				3731	- If a fence then same as load atomic, plus no preceding
				3732	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3733	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3734	atomic/store/ store atomic/atomicrmw/fence instruction can
				3735	be moved after the release.
				3736	- If a fence then same as store atomic, plus no following
				3737	associated fence-paired-atomic can be moved before the
				3738	fence.
				3739	acq_rel Same constraints as both acquire and release.
				3740	seq_cst - If a load atomic then same constraints as acquire, plus no
				3741	preceding sequentially consistent load atomic/store
				3742	atomic/atomicrmw/fence instruction can be moved after the
				3743	seq_cst.
				3744	- If a store atomic then the same constraints as release, plus
				3745	no following sequentially consistent load atomic/store
				3746	atomic/atomicrmw/fence instruction can be moved before the
				3747	seq_cst.
				3748	- If an atomicrmw/fence then same constraints as acq_rel.
				3749	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3750
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3751	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3752	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3753
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3754	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3755	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3756	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3757
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3758	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3759	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3760
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3761	=================== =============== =============== =======================
				3762	Usage Code Sequence Trap Handler Description
				3763	Inputs
				3764	=================== =============== =============== =======================
				3765	reserved ``s_trap 0x00`` Reserved by hardware.
				3766	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3767	``queue_ptr`` ``debugtrap``
				3768	``VGPR0``: intrinsic (not
				3769	``arg`` implemented).
				3770	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3771	``queue_ptr`` terminated and its
				3772	associated queue put
				3773	into the error state.
				3774	``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
				3775	``queue_ptr`` installed handled
				3776	same as ``llvm.trap``.
				3777	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
				3778	breakpoints.
				3779	debugger ``s_trap 0x08`` Reserved for debugger.
				3780	debugger ``s_trap 0xfe`` Reserved for debugger.
				3781	debugger ``s_trap 0xff`` Reserved for debugger.
				3782	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3783
Tim Corringham	af2dfc6	2018-04-04 13:02:09 +0000	[diff] [blame]	3784	AMDPAL
				3785	------
				3786
				3787	This section provides code conventions used when the target triple OS is
				3788	``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
				3789	from the application/runtime to each invocation of a hardware shader. These
				3790	parameters include both generic, application-controlled parameters called
				3791	user data as well as system-generated parameters that are a product of the
				3792	draw or dispatch execution.
				3793
				3794	User Data
				3795	~~~~~~~~~
				3796
				3797	Each hardware stage has a set of 32-bit user data registers which can be
				3798	written from a command buffer and then loaded into SGPRs when waves are launched
				3799	via a subsequent dispatch or draw operation. This is the way most arguments are
				3800	passed from the application/runtime to a hardware shader.
				3801
				3802	Compute User Data
				3803	~~~~~~~~~~~~~~~~~
				3804
				3805	Compute shader user data mappings are simpler than graphics shaders, and have a
				3806	fixed mapping.
				3807
				3808	Note that there are always 10 available user data entries in registers -
				3809	entries beyond that limit must be fetched from memory (via the spill table
				3810	pointer) by the shader.
				3811
				3812	.. table:: PAL Compute Shader User Data Registers
				3813	:name: pal-compute-user-data-registers
				3814
				3815	============= ================================
				3816	User Register Description
				3817	============= ================================
				3818	0 Global Internal Table (32-bit pointer)
				3819	1 Per-Shader Internal Table (32-bit pointer)
				3820	2 - 11 Application-Controlled User Data (10 32-bit values)
				3821	12 Spill Table (32-bit pointer)
				3822	13 - 14 Thread Group Count (64-bit pointer)
				3823	15 GDS Range
				3824	============= ================================
				3825
				3826	Graphics User Data
				3827	~~~~~~~~~~~~~~~~~~
				3828
				3829	Graphics pipelines support a much more flexible user data mapping:
				3830
				3831	.. table:: PAL Graphics Shader User Data Registers
				3832	:name: pal-graphics-user-data-registers
				3833
				3834	============= ================================
				3835	User Register Description
				3836	============= ================================
				3837	0 Global Internal Table (32-bit pointer)
				3838	+ Per-Shader Internal Table (32-bit pointer)
				3839	+ 1-15 Application Controlled User Data
				3840	(1-15 Contiguous 32-bit Values in Registers)
				3841	+ Spill Table (32-bit pointer)
				3842	+ Draw Index (First Stage Only)
				3843	+ Vertex Offset (First Stage Only)
				3844	+ Instance Offset (First Stage Only)
				3845	============= ================================
				3846
				3847	The placement of the global internal table remains fixed in the first *user
				3848	data SGPR register*. Otherwise all parameters are optional, and can be mapped
				3849	to any desired user data SGPR register, with the following regstrictions:
				3850
				3851	* Draw Index, Vertex Offset, and Instance Offset can only be used by the first
				3852	activehardware stage in a graphics pipeline (i.e. where the API vertex
				3853	shader runs).
				3854
				3855	* Application-controlled user data must be mapped into a contiguous range of
				3856	user data registers.
				3857
				3858	* The application-controlled user data range supports compaction remapping, so
				3859	only entries that are actually consumed by the shader must be assigned to
				3860	corresponding registers. Note that in order to support an efficient runtime
				3861	implementation, the remapping must pack registers in the same order as
				3862	entries, with unused entries removed.
				3863
				3864	.. _pal_global_internal_table:
				3865
				3866	Global Internal Table
				3867	~~~~~~~~~~~~~~~~~~~~~
				3868
				3869	The global internal table is a table of shader resource descriptors (SRDs) that
				3870	define how certain engine-wide, runtime-managed resources should be accessed
				3871	from a shader. The majority of these resources have HW-defined formats, and it
				3872	is up to the compiler to write/read data as required by the target hardware.
				3873
				3874	The following table illustrates the required format:
				3875
				3876	.. table:: PAL Global Internal Table
				3877	:name: pal-git-table
				3878
				3879	============= ================================
				3880	Offset Description
				3881	============= ================================
				3882	0-3 Graphics Scratch SRD
				3883	4-7 Compute Scratch SRD
				3884	8-11 ES/GS Ring Output SRD
				3885	12-15 ES/GS Ring Input SRD
				3886	16-19 GS/VS Ring Output #0
				3887	20-23 GS/VS Ring Output #1
				3888	24-27 GS/VS Ring Output #2
				3889	28-31 GS/VS Ring Output #3
				3890	32-35 GS/VS Ring Input SRD
				3891	36-39 Tessellation Factor Buffer SRD
				3892	40-43 Off-Chip LDS Buffer SRD
				3893	44-47 Off-Chip Param Cache Buffer SRD
				3894	48-51 Sample Position Buffer SRD
				3895	52 vaRange::ShadowDescriptorTable High Bits
				3896	============= ================================
				3897
				3898	The pointer to the global internal table passed to the shader as user data
				3899	is a 32-bit pointer. The top 32 bits should be assumed to be the same as
				3900	the top 32 bits of the pipeline, so the shader may use the program
				3901	counter's top 32 bits.
				3902
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3903	Unspecified OS
				3904	--------------
				3905
				3906	This section provides code conventions used when the target triple OS is
				3907	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3908
				3909	Trap Handler ABI
				3910	~~~~~~~~~~~~~~~~
				3911
				3912	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3913	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3914	instructions are handled as follows:
				3915
				3916	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3917	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3918
				3919	=============== =============== ===========================================
				3920	Usage Code Sequence Description
				3921	=============== =============== ===========================================
				3922	llvm.trap s_endpgm Causes wavefront to be terminated.
				3923	llvm.debugtrap none Compiler warning given that there is no
				3924	trap handler installed.
				3925	=============== =============== ===========================================
				3926
				3927	Source Languages
				3928	================
				3929
				3930	.. _amdgpu-opencl:
				3931
				3932	OpenCL
				3933	------
				3934
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3935	When the language is OpenCL the following differences occur:
				3936
				3937	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3938	2. The AMDGPU backend appends additional arguments to the kernel's explicit
				3939	arguments for the AMDHSA OS (see
				3940	:ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3941	3. Additional metadata is generated
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3942	(see :ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3943
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3944	.. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
				3945	:name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
				3946
				3947	======== ==== ========= ===========================================
				3948	Position Byte Byte Description
				3949	Size Alignment
				3950	======== ==== ========= ===========================================
Tony Tye	88441a3	2018-03-23 18:58:47 +0000	[diff] [blame]	3951	1 8 8 OpenCL Global Offset X
				3952	2 8 8 OpenCL Global Offset Y
				3953	3 8 8 OpenCL Global Offset Z
				3954	4 8 8 OpenCL address of printf buffer
				3955	5 8 8 OpenCL address of virtual queue used by
				3956	enqueue_kernel.
				3957	6 8 8 OpenCL address of AqlWrap struct used by
				3958	enqueue_kernel.
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3959	======== ==== ========= ===========================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3960
				3961	.. _amdgpu-hcc:
				3962
				3963	HCC
				3964	---
				3965
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3966	When the language is HCC the following differences occur:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3967
				3968	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3969
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3970	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3971	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3972
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3973	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	3974	It supports AMDGCN GFX6-GFX9.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3975
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	3976	This section describes general syntax for instructions and operands.
				3977
				3978	Instructions
				3979	~~~~~~~~~~~~
				3980
				3981	.. toctree::
				3982	:hidden:
				3983
				3984	AMDGPUAsmGFX7
				3985	AMDGPUAsmGFX8
				3986	AMDGPUAsmGFX9
				3987	AMDGPUOperandSyntax
				3988
				3989	An instruction has the following syntax:
				3990
				3991	<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...
				3992
				3993	Note that operands are normally comma-separated while modifiers are space-separated.
				3994
				3995	The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
				3996
				3997	See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
				3998	:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
				3999
				4000	Note that features under development are not included in this description.
				4001
				4002	For more information about instructions, their semantics and supported combinations of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4003	operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4004	[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4005
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4006	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4007	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4008
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4009	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4010
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4011	* SGPR registers: s0, ... or s[0], ...
				4012	* VGPR registers: v0, ... or v[0], ...
				4013	* TTMP registers: ttmp0, ... or ttmp[0], ...
				4014	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				4015	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				4016	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				4017	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				4018	* Register index expressions: v[2*2], s[1-1:2-1]
				4019	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4020
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4021	Modifiers
				4022	~~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4023
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4024	Detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4025
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4026	Instruction Examples
				4027	~~~~~~~~~~~~~~~~~~~~
				4028
				4029	DS
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4030	++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4031
				4032	.. code-block:: nasm
				4033
				4034	ds_add_u32 v2, v4 offset:16
				4035	ds_write_src2_b64 v2 offset0:4 offset1:8
				4036	ds_cmpst_f32 v2, v4, v6
				4037	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				4038
				4039
				4040	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				4041
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4042	FLAT
				4043	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4044
				4045	.. code-block:: nasm
				4046
				4047	flat_load_dword v1, v[3:4]
				4048	flat_store_dwordx3 v[3:4], v[5:7]
				4049	flat_atomic_swap v1, v[3:4], v5 glc
				4050	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				4051	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				4052
				4053	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				4054
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4055	MUBUF
				4056	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4057
				4058	.. code-block:: nasm
				4059
				4060	buffer_load_dword v1, off, s[4:7], s1
				4061	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				4062	buffer_store_format_xy v[1:2], off, s[4:7], s1
				4063	buffer_wbinvl1
				4064	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				4065
				4066	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				4067
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4068	SMRD/SMEM
				4069	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4070
				4071	.. code-block:: nasm
				4072
				4073	s_load_dword s1, s[2:3], 0xfc
				4074	s_load_dwordx8 s[8:15], s[2:3], s4
				4075	s_load_dwordx16 s[88:103], s[2:3], s4
				4076	s_dcache_inv_vol
				4077	s_memtime s[4:5]
				4078
				4079	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				4080
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4081	SOP1
				4082	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4083
				4084	.. code-block:: nasm
				4085
				4086	s_mov_b32 s1, s2
				4087	s_mov_b64 s[0:1], 0x80000000
				4088	s_cmov_b32 s1, 200
				4089	s_wqm_b64 s[2:3], s[4:5]
				4090	s_bcnt0_i32_b64 s1, s[2:3]
				4091	s_swappc_b64 s[2:3], s[4:5]
				4092	s_cbranch_join s[4:5]
				4093
				4094	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				4095
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4096	SOP2
				4097	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4098
				4099	.. code-block:: nasm
				4100
				4101	s_add_u32 s1, s2, s3
				4102	s_and_b64 s[2:3], s[4:5], s[6:7]
				4103	s_cselect_b32 s1, s2, s3
				4104	s_andn2_b32 s2, s4, s6
				4105	s_lshr_b64 s[2:3], s[4:5], s6
				4106	s_ashr_i32 s2, s4, s6
				4107	s_bfm_b64 s[2:3], s4, s6
				4108	s_bfe_i64 s[2:3], s[4:5], s6
				4109	s_cbranch_g_fork s[4:5], s[6:7]
				4110
				4111	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				4112
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4113	SOPC
				4114	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4115
				4116	.. code-block:: nasm
				4117
				4118	s_cmp_eq_i32 s1, s2
				4119	s_bitcmp1_b32 s1, s2
				4120	s_bitcmp0_b64 s[2:3], s4
				4121	s_setvskip s3, s5
				4122
				4123	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				4124
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4125	SOPP
				4126	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4127
				4128	.. code-block:: nasm
				4129
				4130	s_barrier
				4131	s_nop 2
				4132	s_endpgm
				4133	s_waitcnt 0 ; Wait for all counters to be 0
				4134	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				4135	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				4136	s_sethalt 9
				4137	s_sleep 10
				4138	s_sendmsg 0x1
				4139	s_sendmsg sendmsg(MSG_INTERRUPT)
				4140	s_trap 1
				4141
				4142	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				4143
				4144	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	4145	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4146	range or acceptable values.
				4147
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4148	VALU
				4149	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4150
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4151	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				4152	the assembler will automatically use optimal encoding based on its operands.
				4153	To force specific encoding, one can add a suffix to the opcode of the instruction:
				4154
				4155	* _e32 for 32-bit VOP1/VOP2/VOPC
				4156	* _e64 for 64-bit VOP3
				4157	* _dpp for VOP_DPP
				4158	* _sdwa for VOP_SDWA
				4159
				4160	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4161
				4162	.. code-block:: nasm
				4163
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4164	v_mov_b32 v1, v2
				4165	v_mov_b32_e32 v1, v2
				4166	v_nop
				4167	v_cvt_f64_i32_e32 v[1:2], v2
				4168	v_floor_f32_e32 v1, v2
				4169	v_bfrev_b32_e32 v1, v2
				4170	v_add_f32_e32 v1, v2, v3
				4171	v_mul_i32_i24_e64 v1, v2, 3
				4172	v_mul_i32_i24_e32 v1, -3, v3
				4173	v_mul_i32_i24_e32 v1, -100, v3
				4174	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				4175	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4176
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4177	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4178
				4179	.. code-block:: nasm
				4180
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4181	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				4182	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4183	v_mov_b32 v0, v0 wave_shl:1
				4184	v_mov_b32 v0, v0 row_mirror
				4185	v_mov_b32 v0, v0 row_bcast:31
				4186	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4187	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4188	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4189
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4190	VOP_SDWA examples:
				4191
				4192	.. code-block:: nasm
				4193
				4194	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				4195	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				4196	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				4197	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				4198	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				4199
				4200	For full list of supported instructions, refer to "Vector ALU instructions".
				4201
				4202	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4203	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4204
				4205	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				4206	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4207
				4208	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4209	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4210
				4211	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4212	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4213
				4214	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4215	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				4216
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4217
				4218	major, minor, and stepping are all integers that describe the instruction
				4219	set architecture (ISA) version of the assembly program.
				4220
				4221	vendor and arch are quoted strings. vendor should always be equal to
				4222	"AMD" and arch should always be equal to "AMDGPU".
				4223
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4224	By default, the assembler will derive the ISA version, vendor, and arch
				4225	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4226
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4227	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4228	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4229
				4230	This directives specifies that the symbol with given name is a kernel entry point
				4231	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4232
				4233	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4234	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4235
				4236	This directive marks the beginning of a list of key / value pairs that are used
				4237	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				4238	The list must be terminated by the .end_amd_kernel_code_t directive. For
				4239	any amd_kernel_code_t values that are unspecified a default value will be
				4240	used. The default value for all keys is 0, with the following exceptions:
				4241
				4242	- kernel_code_version_major defaults to 1.
				4243	- machine_kind defaults to 1.
				4244	- machine_version_major, machine_version_minor, and
				4245	machine_version_stepping are derived from the value of the -mcpu option
				4246	that is passed to the assembler.
				4247	- kernel_code_entry_byte_offset defaults to 256.
				4248	- wavefront_size defaults to 6.
				4249	- kernarg_segment_alignment, group_segment_alignment, and
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	4250	private_segment_alignment default to 4. Note that alignments are specified
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4251	as a power of two, so a value of n means an alignment of 2^ n.
				4252
				4253	The .amd_kernel_code_t directive must be placed immediately after the
				4254	function label and before any instructions.
				4255
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4256	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				4257	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4258
				4259	Here is an example of a minimal amd_kernel_code_t specification:
				4260
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	4261	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4262
				4263	.hsa_code_object_version 1,0
				4264	.hsa_code_object_isa
				4265
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4266	.hsatext
				4267	.globl hello_world
				4268	.p2align 8
				4269	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4270
				4271	hello_world:
				4272
				4273	.amd_kernel_code_t
				4274	enable_sgpr_kernarg_segment_ptr = 1
				4275	is_ptr64 = 1
				4276	compute_pgm_rsrc1_vgprs = 0
				4277	compute_pgm_rsrc1_sgprs = 0
				4278	compute_pgm_rsrc2_user_sgpr = 2
				4279	kernarg_segment_byte_size = 8
				4280	wavefront_sgpr_count = 2
				4281	workitem_vgpr_count = 3
				4282	.end_amd_kernel_code_t
				4283
				4284	s_load_dwordx2 s[0:1], s[0:1] 0x0
				4285	v_mov_b32 v0, 3.14159
				4286	s_waitcnt lgkmcnt(0)
				4287	v_mov_b32 v1, s0
				4288	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4289	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4290	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	4291	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4292	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4293
				4294	Additional Documentation
				4295	========================
				4296
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4297	.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				4298	.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				4299	.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				4300	.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				4301	.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				4302	.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				4303	.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
				4304	.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4305	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				4306	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				4307	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				4308	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				4309	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	4310	.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4311	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				4312	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__