Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: 7a264af62f40cbfde335918ae8ff53eb60c95197 [file] [log] [blame]

Eugene Zelenko	3507b04	2018-03-21 17:09:35 +0000	[diff] [blame]	1	=============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	26	.. table:: AMDGPU Architectures
				27	:name: amdgpu-architecture-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	28
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	29	============ ==============================================================
				30	Architecture Description
				31	============ ==============================================================
				32	``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
				33	``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
				34	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	35
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	36	.. table:: AMDGPU Vendors
				37	:name: amdgpu-vendor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	38
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	39	============ ==============================================================
				40	Vendor Description
				41	============ ==============================================================
				42	``amd`` Can be used for all AMD GPU usage.
				43	``mesa3d`` Can be used if the OS is ``mesa3d``.
				44	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	45
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	46	.. table:: AMDGPU Operating Systems
				47	:name: amdgpu-os-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	48
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	49	============== ============================================================
				50	OS Description
				51	============== ============================================================
				52	<empty> Defaults to the unknown OS.
				53	``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
				54	such as AMD's ROCm [AMD-ROCm]_.
				55	``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
				56	runtime.
				57	``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
				58	runtime.
				59	============== ============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	60
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	61	.. table:: AMDGPU Environments
				62	:name: amdgpu-environment-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	63
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	64	============ ==============================================================
				65	Environment Description
				66	============ ==============================================================
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	67	<empty> Default.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	68	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	69
				70	.. _amdgpu-processors:
				71
				72	Processors
				73	----------
				74
				75	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				76	names from both the Processor and Alternative Processor can be used.
				77
				78	.. table:: AMDGPU Processors
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	79	:name: amdgpu-processor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	80
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	81	=========== =============== ============ ===== ========= ======= ==================
				82	Processor Alternative Target dGPU/ Target ROCm Example
				83	Processor Triple APU Features Support Products
				84	Architecture Supported
				85	[Default]
				86	=========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	87	Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	88	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	89	``r600`` ``r600`` dGPU
				90	``r630`` ``r600`` dGPU
				91	``rs880`` ``r600`` dGPU
				92	``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	93	Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	94	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	95	``rv710`` ``r600`` dGPU
				96	``rv730`` ``r600`` dGPU
				97	``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	98	Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	99	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	100	``cedar`` ``r600`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	101	``cypress`` ``r600`` dGPU
				102	``juniper`` ``r600`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	103	``redwood`` ``r600`` dGPU
				104	``sumo`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	105	Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	106	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	107	``barts`` ``r600`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	108	``caicos`` ``r600`` dGPU
				109	``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	110	``turks`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	111	GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	112	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	113	``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	114	``gfx601`` - ``hainan`` ``amdgcn`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	115	- ``oland``
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	116	- ``pitcairn``
				117	- ``verde``
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	118	GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	119	-----------------------------------------------------------------------------------
				120	``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
				121	- A6 Pro-7050B
				122	- A8-7100
				123	- A8 Pro-7150B
				124	- A10-7300
				125	- A10 Pro-7350B
				126	- FX-7500
				127	- A8-7200P
				128	- A10-7400P
				129	- FX-7600P
				130	``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
				131	- FirePro W9100
				132	- FirePro S9150
				133	- FirePro S9170
				134	``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
				135	- Radeon R9 290x
				136	- Radeon R390
				137	- Radeon R390x
				138	``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
				139	- ``mullins`` - E1-2200
				140	- E1-2500
				141	- E2-3000
				142	- E2-3800
				143	- A4-5000
				144	- A4-5100
				145	- A6-5200
				146	- A4 Pro-3340B
				147	``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
				148	- Radeon HD 8770
				149	- R7 260
				150	- R7 260X
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	151	GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	152	-----------------------------------------------------------------------------------
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	153	``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
				154	[on] - Pro A6-8500B
				155	- A8-8600P
				156	- Pro A8-8600B
				157	- FX-8800P
				158	- Pro A12-8800B
				159	\ ``amdgcn`` APU - xnack ROCm - A10-8700P
				160	[on] - Pro A10-8700B
				161	- A10-8780P
				162	\ ``amdgcn`` APU - xnack - A10-9600P
				163	[on] - A10-9630P
				164	- A12-9700P
				165	- A12-9730P
				166	- FX-9800P
				167	- FX-9830P
				168	\ ``amdgcn`` APU - xnack - E2-9010
				169	[on] - A6-9210
				170	- A9-9410
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	171	``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
				172	- ``tonga`` [off] - FirePro S7100
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	173	- FirePro W7100
				174	- Radeon R285
				175	- Radeon R9 380
				176	- Radeon R9 385
				177	- Mobile FirePro
				178	M7170
				179	``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
				180	[off] - Radeon R9 Fury
				181	- Radeon R9 FuryX
				182	- Radeon Pro Duo
				183	- FirePro S9300x2
				184	- Radeon Instinct MI8
				185	\ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
				186	[off] - Radeon RX 480
				187	- Radeon Instinct MI6
				188	\ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
				189	[off]
				190	``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
				191	[on]
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	192	GCN GFX9 [AMD-GCN-GFX9]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	193	-----------------------------------------------------------------------------------
				194	``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
				195	[off] Frontier Edition
				196	- Radeon RX Vega 56
				197	- Radeon RX Vega 64
				198	- Radeon RX Vega 64
				199	Liquid
				200	- Radeon Instinct MI25
Tony Tye	b6efb90	2018-04-14 01:58:10 +0000	[diff] [blame]	201	``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G
				202	[on] - Ryzen 5 2400G
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	203	``gfx904`` ``amdgcn`` dGPU - xnack TBA
				204	[off]
				205	.. TODO
				206	Add product
				207	names.
				208	``gfx906`` ``amdgcn`` dGPU - xnack TBA
				209	[off]
				210	.. TODO
				211	Add product
				212	names.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	213	=========== =============== ============ ===== ========= ======= ==================
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	214
				215	.. _amdgpu-target-features:
				216
				217	Target Features
				218	---------------
				219
				220	Target features control how code is generated to support certain
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	221	processor specific features. Not all target features are supported by
				222	all processors. The runtime must ensure that the features supported by
				223	the device used to execute the code match the features enabled when
				224	generating the code. A mismatch of features may result in incorrect
				225	execution, or a reduction in performance.
				226
				227	The target features supported by each processor, and the default value
				228	used if not specified explicitly, is listed in
				229	:ref:`amdgpu-processor-table`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	230
				231	Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
				232	target features.
				233
				234	For example:
				235
				236	``-mxnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	237	Enable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	238	``-mno-xnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	239	Disable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	240
				241	.. table:: AMDGPU Target Features
				242	:name: amdgpu-target-feature-table
				243
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	244	============== ==================================================
				245	Target Feature Description
				246	============== ==================================================
				247	-m[no-]xnack Enable/disable generating code that has
				248	memory clauses that are compatible with
				249	having XNACK replay enabled.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	250
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	251	This is used for demand paging and page
				252	migration. If XNACK replay is enabled in
				253	the device, then if a page fault occurs
				254	the code may execute incorrectly if the
				255	``xnack`` feature is not enabled. Executing
				256	code that has the feature enabled on a
				257	device that does not have XNACK replay
				258	enabled will execute correctly, but may
				259	be less performant than code with the
				260	feature disabled.
				261	============== ==================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	262
				263	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	264
				265	Address Spaces
				266	--------------
				267
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	268	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	269
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	270	The memory space names used in the table, aside from the region memory space, is
				271	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	272
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	273	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	274
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	275	.. table:: Address Space Mapping
				276	:name: amdgpu-address-space-mapping-table
				277
Yaxun Liu	0124b54	2018-02-13 18:00:25 +0000	[diff] [blame]	278	================== =================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	279	LLVM Address Space Memory Space
Yaxun Liu	0124b54	2018-02-13 18:00:25 +0000	[diff] [blame]	280	================== =================
				281	0 Generic (Flat)
				282	1 Global
				283	2 Region (GDS)
				284	3 Local (group/LDS)
				285	4 Constant
				286	5 Private (Scratch)
				287	6 Constant 32-bit
				288	================== =================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	289
				290	.. _amdgpu-memory-scopes:
				291
				292	Memory Scopes
				293	-------------
				294
				295	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				296	backend memory model when the target triple OS is ``amdhsa`` (see
				297	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				298
				299	The memory model supported is based on the HSA memory model [HSA]_ which is
				300	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				301	relation is transitive over the synchonizes-with relation independent of scope,
				302	and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	303	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	304
				305	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				306	inclusion and requires the memory scopes to exactly match. However, this
				307	is conservatively correct for OpenCL.
				308
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	309	.. table:: AMDHSA LLVM Sync Scopes
				310	:name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	311
				312	================ ==========================================================
				313	LLVM Sync Scope Description
				314	================ ==========================================================
				315	none The default: ``system``.
				316
				317	Synchronizes with, and participates in modification and
				318	seq_cst total orderings with, other operations (except
				319	image operations) for all address spaces (except private,
				320	or generic that accesses private) provided the other
				321	operation's sync scope is:
				322
				323	- ``system``.
				324	- ``agent`` and executed by a thread on the same agent.
				325	- ``workgroup`` and executed by a thread in the same
				326	workgroup.
				327	- ``wavefront`` and executed by a thread in the same
				328	wavefront.
				329
				330	``agent`` Synchronizes with, and participates in modification and
				331	seq_cst total orderings with, other operations (except
				332	image operations) for all address spaces (except private,
				333	or generic that accesses private) provided the other
				334	operation's sync scope is:
				335
				336	- ``system`` or ``agent`` and executed by a thread on the
				337	same agent.
				338	- ``workgroup`` and executed by a thread in the same
				339	workgroup.
				340	- ``wavefront`` and executed by a thread in the same
				341	wavefront.
				342
				343	``workgroup`` Synchronizes with, and participates in modification and
				344	seq_cst total orderings with, other operations (except
				345	image operations) for all address spaces (except private,
				346	or generic that accesses private) provided the other
				347	operation's sync scope is:
				348
				349	- ``system``, ``agent`` or ``workgroup`` and executed by a
				350	thread in the same workgroup.
				351	- ``wavefront`` and executed by a thread in the same
				352	wavefront.
				353
				354	``wavefront`` Synchronizes with, and participates in modification and
				355	seq_cst total orderings with, other operations (except
				356	image operations) for all address spaces (except private,
				357	or generic that accesses private) provided the other
				358	operation's sync scope is:
				359
				360	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				361	and executed by a thread in the same wavefront.
				362
				363	``singlethread`` Only synchronizes with, and participates in modification
				364	and seq_cst total orderings with, other operations (except
				365	image operations) running in the same thread for all
				366	address spaces (for example, in signal handlers).
				367	================ ==========================================================
				368
				369	AMDGPU Intrinsics
				370	-----------------
				371
				372	The AMDGPU backend implements the following intrinsics.
				373
				374	This section is WIP.
				375
				376	.. TODO
				377	List AMDGPU intrinsics
				378
				379	Code Object
				380	===========
				381
				382	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				383	can be linked by ``lld`` to produce a standard ELF shared code object which can
				384	be loaded and executed on an AMDGPU target.
				385
				386	Header
				387	------
				388
				389	The AMDGPU backend uses the following ELF header:
				390
				391	.. table:: AMDGPU ELF Header
				392	:name: amdgpu-elf-header-table
				393
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	394	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	395	Field Value
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	396	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	397	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				398	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	399	``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
				400	- ``ELFOSABI_AMDGPU_HSA``
				401	- ``ELFOSABI_AMDGPU_PAL``
				402	- ``ELFOSABI_AMDGPU_MESA3D``
				403	``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
				404	- ``ELFABIVERSION_AMDGPU_PAL``
				405	- ``ELFABIVERSION_AMDGPU_MESA3D``
				406	``e_type`` - ``ET_REL``
				407	- ``ET_DYN``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	408	``e_machine`` ``EM_AMDGPU``
				409	``e_entry`` 0
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	410	``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	411	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	412
				413	..
				414
				415	.. table:: AMDGPU ELF Header Enumeration Values
				416	:name: amdgpu-elf-header-enumeration-values-table
				417
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	418	=============================== =====
				419	Name Value
				420	=============================== =====
				421	``EM_AMDGPU`` 224
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	422	``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	423	``ELFOSABI_AMDGPU_HSA`` 64
				424	``ELFOSABI_AMDGPU_PAL`` 65
				425	``ELFOSABI_AMDGPU_MESA3D`` 66
				426	``ELFABIVERSION_AMDGPU_HSA`` 1
				427	``ELFABIVERSION_AMDGPU_PAL`` 0
				428	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				429	=============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	430
				431	``e_ident[EI_CLASS]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	432	The ELF class is:
				433
				434	* ``ELFCLASS32`` for ``r600`` architecture.
				435
				436	* ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
				437	bit applications.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	438
				439	``e_ident[EI_DATA]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	440	All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	441
				442	``e_ident[EI_OSABI]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	443	One of the following AMD GPU architecture specific OS ABIs
				444	(see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	445
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	446	* ``ELFOSABI_NONE`` for unknown OS.
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	447
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	448	* ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	449
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	450	* ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
				451
				452	* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	453
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	454	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	455	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				456	object conforms:
				457
				458	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				459	runtime ABI.
				460
				461	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				462	runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	463
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	464	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	465	3D runtime ABI.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	466
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	467	``e_type``
				468	Can be one of the following values:
				469
				470
				471	``ET_REL``
				472	The type produced by the AMD GPU backend compiler as it is relocatable code
				473	object.
				474
				475	``ET_DYN``
				476	The type produced by the linker as it is a shared code object.
				477
				478	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				479
				480	``e_machine``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	481	The value ``EM_AMDGPU`` is used for the machine for all processors supported
				482	by the ``r600`` and ``amdgcn`` architectures (see
				483	:ref:`amdgpu-processor-table`). The specific processor is specified in the
				484	``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
				485	:ref:`amdgpu-elf-header-e_flags-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	486
				487	``e_entry``
				488	The entry point is 0 as the entry points for individual kernels must be
				489	selected in order to invoke them through AQL packets.
				490
				491	``e_flags``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	492	The AMDGPU backend uses the following ELF header flags:
				493
				494	.. table:: AMDGPU ELF Header ``e_flags``
				495	:name: amdgpu-elf-header-e_flags-table
				496
				497	================================= ========== =============================
				498	Name Value Description
				499	================================= ========== =============================
				500	AMDGPU Processor Flag See :ref:`amdgpu-processor-table`.
				501	-------------------------------------------- -----------------------------
				502	``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
				503	mask for
				504	``EF_AMDGPU_MACH_xxx`` values
				505	defined in
				506	:ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	507	``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
				508	target feature is
				509	enabled for all code
				510	contained in the code object.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	511	If the processor
				512	does not support the
				513	``xnack`` target
				514	feature then must
				515	be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	516	See
				517	:ref:`amdgpu-target-features`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	518	================================= ========== =============================
				519
				520	.. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
				521	:name: amdgpu-ef-amdgpu-mach-table
				522
				523	================================= ========== =============================
				524	Name Value Description (see
				525	:ref:`amdgpu-processor-table`)
				526	================================= ========== =============================
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	527	``EF_AMDGPU_MACH_NONE`` 0x000 not specified
				528	``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
				529	``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
				530	``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
				531	``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
				532	``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
				533	``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
				534	``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
				535	``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
				536	``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
				537	``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
				538	``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
				539	``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
				540	``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
				541	``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
				542	``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
				543	``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
				544	reserved 0x011 - Reserved for ``r600``
				545	0x01f architecture processors.
				546	``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
				547	``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
				548	``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
				549	``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
				550	``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
				551	``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
				552	``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
				553	reserved 0x027 Reserved.
				554	``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
				555	``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
				556	``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
				557	``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
				558	``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
				559	``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	560	``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
				561	``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	562	reserved 0x030 Reserved.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	563	================================= ========== =============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	564
				565	Sections
				566	--------
				567
				568	An AMDGPU target ELF code object has the standard ELF sections which include:
				569
				570	.. table:: AMDGPU ELF Sections
				571	:name: amdgpu-elf-sections-table
				572
				573	================== ================ =================================
				574	Name Type Attributes
				575	================== ================ =================================
				576	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				577	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				578	``.debug_``\ \* ``SHT_PROGBITS`` none
				579	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				580	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				581	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				582	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				583	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				584	``.note`` ``SHT_NOTE`` none
				585	``.rela``\ name ``SHT_RELA`` none
				586	``.rela.dyn`` ``SHT_RELA`` none
				587	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				588	``.shstrtab`` ``SHT_STRTAB`` none
				589	``.strtab`` ``SHT_STRTAB`` none
				590	``.symtab`` ``SHT_SYMTAB`` none
				591	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				592	================== ================ =================================
				593
				594	These sections have their standard meanings (see [ELF]_) and are only generated
				595	if needed.
				596
				597	``.debug``\ \*
				598	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				599	DWARF produced by the AMDGPU backend.
				600
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	601	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	602	The standard sections used by a dynamic loader.
				603
				604	``.note``
				605	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				606	backend.
				607
				608	``.rela``\ name, ``.rela.dyn``
				609	For relocatable code objects, name is the name of the section that the
				610	relocation records apply. For example, ``.rela.text`` is the section name for
				611	relocation records associated with the ``.text`` section.
				612
				613	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				614	records from each of the relocatable code object's ``.rela``\ name sections.
				615
				616	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				617	the AMDGPU backend.
				618
				619	``.text``
				620	The executable machine code for the kernels and functions they call. Generated
				621	as position independent code. See :ref:`amdgpu-code-conventions` for
				622	information on conventions used in the isa generation.
				623
				624	.. _amdgpu-note-records:
				625
				626	Note Records
				627	------------
				628
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	629	As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
				630	be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
				631	aligned. In addition, minimal zero byte padding must be generated to ensure the
				632	``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
				633	``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	634
				635	The AMDGPU backend code object uses the following ELF note records in the
				636	``.note`` section. The Description column specifies the layout of the note
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	637	record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	638	variable size strings have a corresponding ``*_size`` field that specifies the
				639	number of bytes, including the terminating null character, in the string. The
				640	string(s) come immediately after the preceding fields.
				641
				642	Additional note records can be present.
				643
				644	.. table:: AMDGPU ELF Note Records
				645	:name: amdgpu-elf-note-records-table
				646
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	647	===== ============================== ======================================
				648	Name Type Description
				649	===== ============================== ======================================
				650	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	651	===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	652
				653	..
				654
				655	.. table:: AMDGPU ELF Note Record Enumeration Values
				656	:name: amdgpu-elf-note-record-enumeration-values-table
				657
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	658	============================== =====
				659	Name Value
				660	============================== =====
				661	reserved 0-9
				662	``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	663	reserved 11
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	664	============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	665
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	666	``NT_AMD_AMDGPU_HSA_METADATA``
				667	Specifies extensible metadata associated with the code objects executed on HSA
				668	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				669	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				670	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				671	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	672
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	673	.. _amdgpu-symbols:
				674
				675	Symbols
				676	-------
				677
				678	Symbols include the following:
				679
				680	.. table:: AMDGPU ELF Symbols
				681	:name: amdgpu-elf-symbols-table
				682
				683	===================== ============== ============= ==================
				684	Name Type Section Description
				685	===================== ============== ============= ==================
				686	link-name ``STT_OBJECT`` - ``.data`` Global variable
				687	- ``.rodata``
				688	- ``.bss``
				689	link-name\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
				690	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				691	===================== ============== ============= ==================
				692
				693	Global variable
				694	Global variables both used and defined by the compilation unit.
				695
				696	If the symbol is defined in the compilation unit then it is allocated in the
				697	appropriate section according to if it has initialized data or is readonly.
				698
				699	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				700	will resolve relocations using the definition provided by another code object
				701	or explicitly defined by the runtime.
				702
				703	All global symbols, whether defined in the compilation unit or external, are
				704	accessed by the machine code indirectly through a GOT table entry. This
				705	allows them to be preemptable. The GOT table is only supported when the target
				706	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	707
				708	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	709	Add description of linked shared object symbols. Seems undefined symbols
				710	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	711
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	712	Kernel descriptor
				713	Every HSA kernel has an associated kernel descriptor. It is the address of the
				714	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				715	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				716	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				717
				718	Kernel entry point
				719	Every HSA kernel also has a symbol for its machine code entry point.
				720
				721	.. _amdgpu-relocation-records:
				722
				723	Relocation Records
				724	------------------
				725
				726	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				727	relocatable fields are:
				728
				729	``word32``
				730	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				731	alignment. These values use the same byte order as other word values in the
				732	AMD GPU architecture.
				733
				734	``word64``
				735	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				736	alignment. These values use the same byte order as other word values in the
				737	AMD GPU architecture.
				738
				739	Following notations are used for specifying relocation calculations:
				740
				741	A
				742	Represents the addend used to compute the value of the relocatable field.
				743
				744	G
				745	Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	746	entry's symbol will reside during execution.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	747
				748	GOT
				749	Represents the address of the global offset table.
				750
				751	P
				752	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				753	of the storage unit being relocated (computed using ``r_offset``).
				754
				755	S
				756	Represents the value of the symbol whose index resides in the relocation
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	757	entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				758
				759	B
				760	Represents the base address of a loaded executable or shared object which is
				761	the difference between the ELF address and the actual load address. Relocations
				762	using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	763
				764	The following relocation types are supported:
				765
				766	.. table:: AMDGPU ELF Relocation Records
				767	:name: amdgpu-elf-relocation-records-table
				768
Tony Tye	db6c993	2018-01-30 23:59:43 +0000	[diff] [blame]	769	========================== ======= ===== ========== ==============================
				770	Relocation Type Kind Value Field Calculation
				771	========================== ======= ===== ========== ==============================
				772	``R_AMDGPU_NONE`` 0 none none
Tony Tye	223f4c7	2018-04-13 01:01:27 +0000	[diff] [blame]	773	``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
				774	Dynamic
				775	``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
				776	Dynamic
				777	``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	778	Dynamic
Tony Tye	db6c993	2018-01-30 23:59:43 +0000	[diff] [blame]	779	``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
				780	``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
Tony Tye	223f4c7	2018-04-13 01:01:27 +0000	[diff] [blame]	781	``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
				782	Dynamic
Tony Tye	db6c993	2018-01-30 23:59:43 +0000	[diff] [blame]	783	``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
				784	``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				785	``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
				786	``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				787	``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
				788	reserved 12
				789	``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
				790	========================== ======= ===== ========== ==============================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	791
Tony Tye	223f4c7	2018-04-13 01:01:27 +0000	[diff] [blame]	792	``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
				793	the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
				794
				795	There is no current OS loader support for 32 bit programs and so
				796	``R_AMDGPU_ABS32`` is not used.
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	797
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	798	.. _amdgpu-dwarf:
				799
				800	DWARF
				801	-----
				802
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	803	Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	804	information that maps the code object executable code and data to the source
				805	language constructs. It can be used by tools such as debuggers and profilers.
				806
				807	Address Space Mapping
				808	~~~~~~~~~~~~~~~~~~~~~
				809
				810	The following address space mapping is used:
				811
				812	.. table:: AMDGPU DWARF Address Space Mapping
				813	:name: amdgpu-dwarf-address-space-mapping-table
				814
				815	=================== =================
				816	DWARF Address Space Memory Space
				817	=================== =================
				818	1 Private (Scratch)
				819	2 Local (group/LDS)
				820	omitted Global
				821	omitted Constant
				822	omitted Generic (Flat)
				823	not supported Region (GDS)
				824	=================== =================
				825
				826	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				827	used in the table.
				828
				829	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				830	DWARF address space of the value of the pointer when it is in the private or
				831	local address space. Otherwise the attribute is omitted.
				832
				833	An ``XDEREF`` operation is generated in location list expressions for variables
				834	that are allocated in the private and local address space. Otherwise no
				835	``XDREF`` is omitted.
				836
				837	Register Mapping
				838	~~~~~~~~~~~~~~~~
				839
				840	This section is WIP.
				841
				842	.. TODO
				843	Define DWARF register enumeration.
				844
				845	If want to present a wavefront state then should expose vector registers as
				846	64 wide (rather than per work-item view that LLVM uses). Either as separate
				847	registers, or a 64x4 byte single register. In either case use a new LANE op
				848	(akin to XDREF) to select the current lane usage in a location
				849	expression. This would also allow scalar register spilling to vector register
				850	lanes to be expressed (currently no debug information is being generated for
				851	spilling). If choose a wide single register approach then use LANE in
				852	conjunction with PIECE operation to select the dword part of the register for
				853	the current lane. If the separate register approach then use LANE to select
				854	the register.
				855
				856	Source Text
				857	~~~~~~~~~~~
				858
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	859	Source text for online-compiled programs (e.g. those compiled by the OpenCL
				860	runtime) may be embedded into the DWARF v5 line table using the ``clang
				861	-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	862
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	863	For example:
				864
				865	``-gembed-source``
				866	Enable the embedded source DWARF v5 extension.
				867	``-gno-embed-source``
				868	Disable the embedded source DWARF v5 extension.
				869
				870	.. table:: AMDGPU Debug Options
				871	:name: amdgpu-debug-options
				872
				873	==================== ==================================================
				874	Debug Flag Description
				875	==================== ==================================================
				876	-g[no-]embed-source Enable/disable embedding source text in DWARF
				877	debug sections. Useful for environments where
				878	source cannot be written to disk, such as
				879	when performing online compilation.
				880	==================== ==================================================
				881
				882	This option enables one extended content types in the DWARF v5 Line Number
				883	Program Header, which is used to encode embedded source.
				884
				885	.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
				886	:name: amdgpu-dwarf-extended-content-types
				887
				888	============================ ======================
				889	Content Type Form
				890	============================ ======================
				891	``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
				892	============================ ======================
				893
				894	The source field will contain the UTF-8 encoded, null-terminated source text
				895	with ``'\n'`` line endings. When the source field is present, consumers can use
				896	the embedded source instead of attempting to discover the source on disk. When
				897	the source field is absent, consumers can access the file to get the source
				898	text.
				899
				900	The above content type appears in the ``file_name_entry_format`` field of the
				901	line table prologue, and its corresponding value appear in the ``file_names``
				902	field. The current encoding of the content type is documented in table
				903	:ref:`amdgpu-dwarf-extended-content-types-encoding`
				904
				905	.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
				906	:name: amdgpu-dwarf-extended-content-types-encoding
				907
				908	============================ ====================
				909	Content Type Value
				910	============================ ====================
				911	``DW_LNCT_LLVM_source`` 0x2001
				912	============================ ====================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	913
				914	.. _amdgpu-code-conventions:
				915
				916	Code Conventions
				917	================
				918
				919	This section provides code conventions used for each supported target triple OS
				920	(see :ref:`amdgpu-target-triples`).
				921
				922	AMDHSA
				923	------
				924
				925	This section provides code conventions used when the target triple OS is
				926	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				927
				928	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	929
Tony Tye	01bfd6c	2018-03-27 21:20:46 +0000	[diff] [blame]	930	Code Object Target Identification
				931	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				932
				933	The AMDHSA OS uses the following syntax to specify the code object
				934	target as a single string:
				935
				936	``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
				937
				938	Where:
				939
				940	- ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
				941	are the same as the Target Triple (see
				942	:ref:`amdgpu-target-triples`).
				943
				944	- ``<Processor>`` is the same as the Processor (see
				945	:ref:`amdgpu-processors`).
				946
				947	- ``<Target Features>`` is a list of the enabled Target Features
				948	(see :ref:`amdgpu-target-features`), each prefixed by a plus, that
				949	apply to Processor. The list must be in the same order as listed
				950	in the table :ref:`amdgpu-target-feature-table`. Note that *Target
				951	Features* must be included in the list if they are enabled even if
				952	that is the default for Processor.
				953
				954	For example:
				955
				956	``"amdgcn-amd-amdhsa--gfx902+xnack"``
				957
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	958	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	959	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	960
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	961	The code object metadata specifies extensible metadata associated with the code
				962	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
				963	[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
				964	(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
				965	``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
				966	information necessary to support the ROCM kernel queries. For example, the
				967	segment sizes needed in a dispatch packet. In addition, a high level language
				968	runtime may require other information to be included. For example, the AMD
				969	OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	970
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	971	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	972	:doc:`YamlIO`).
				973
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	974	.. TODO
				975	Is the string null terminated? It probably should not if YAML allows it to
				976	contain null characters, otherwise it should be.
				977
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	978	The metadata is represented as a single YAML document comprised of the mapping
				979	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				980	referenced tables.
				981
				982	For boolean values, the string values of ``false`` and ``true`` are used for
				983	false and true respectively.
				984
				985	Additional information can be added to the mappings. To avoid conflicts, any
				986	non-AMD key names should be prefixed by "vendor-name.".
				987
				988	.. table:: AMDHSA Code Object Metadata Mapping
				989	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				990
				991	========== ============== ========= =======================================
				992	String Key Value Type Required? Description
				993	========== ============== ========= =======================================
				994	"Version" sequence of Required - The first integer is the major
				995	2 integers version. Currently 1.
				996	- The second integer is the minor
				997	version. Currently 0.
				998	"Printf" sequence of Each string is encoded information
				999	strings about a printf function call. The
				1000	encoded information is organized as
				1001	fields separated by colon (':'):
				1002
				1003	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				1004
				1005	where:
				1006
				1007	``ID``
				1008	A 32 bit integer as a unique id for
				1009	each printf function call
				1010
				1011	``N``
				1012	A 32 bit integer equal to the number
				1013	of arguments of printf function call
				1014	minus 1
				1015
				1016	``S[i]`` (where i = 0, 1, ... , N-1)
				1017	32 bit integers for the size in bytes
				1018	of the i-th FormatString argument of
				1019	the printf function call
				1020
				1021	FormatString
				1022	The format string passed to the
				1023	printf function call.
				1024	"Kernels" sequence of Required Sequence of the mappings for each
				1025	mapping kernel in the code object. See
				1026	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				1027	for the definition of the mapping.
				1028	========== ============== ========= =======================================
				1029
				1030	..
				1031
				1032	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				1033	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				1034
				1035	================= ============== ========= ================================
				1036	String Key Value Type Required? Description
				1037	================= ============== ========= ================================
				1038	"Name" string Required Source name of the kernel.
				1039	"SymbolName" string Required Name of the kernel
				1040	descriptor ELF symbol.
				1041	"Language" string Source language of the kernel.
				1042	Values include:
				1043
				1044	- "OpenCL C"
				1045	- "OpenCL C++"
				1046	- "HCC"
				1047	- "OpenMP"
				1048
				1049	"LanguageVersion" sequence of - The first integer is the major
				1050	2 integers version.
				1051	- The second integer is the
				1052	minor version.
				1053	"Attrs" mapping Mapping of kernel attributes.
				1054	See
				1055	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				1056	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1057	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1058	mapping kernel arguments. See
				1059	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				1060	for the definition of the mapping.
				1061	"CodeProps" mapping Mapping of properties related to
				1062	the kernel code. See
				1063	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				1064	for the mapping definition.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1065	================= ============== ========= ================================
				1066
				1067	..
				1068
				1069	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				1070	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				1071
				1072	=================== ============== ========= ==============================
				1073	String Key Value Type Required? Description
				1074	=================== ============== ========= ==============================
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1075	"ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
				1076	3 integers must be >=1 and the dispatch
				1077	work-group size X, Y, Z must
				1078	correspond to the specified
				1079	values. Defaults to 0, 0, 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1080
				1081	Corresponds to the OpenCL
				1082	``reqd_work_group_size``
				1083	attribute.
				1084	"WorkGroupSizeHint" sequence of The dispatch work-group size
				1085	3 integers X, Y, Z is likely to be the
				1086	specified values.
				1087
				1088	Corresponds to the OpenCL
				1089	``work_group_size_hint``
				1090	attribute.
				1091	"VecTypeHint" string The name of a scalar or vector
				1092	type.
				1093
				1094	Corresponds to the OpenCL
				1095	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	1096
				1097	"RuntimeHandle" string The external symbol name
				1098	associated with a kernel.
				1099	OpenCL runtime allocates a
				1100	global buffer for the symbol
				1101	and saves the kernel's address
				1102	to it, which is used for
				1103	device side enqueueing. Only
				1104	available for device side
				1105	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1106	=================== ============== ========= ==============================
				1107
				1108	..
				1109
				1110	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				1111	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				1112
				1113	================= ============== ========= ================================
				1114	String Key Value Type Required? Description
				1115	================= ============== ========= ================================
				1116	"Name" string Kernel argument name.
				1117	"TypeName" string Kernel argument type name.
				1118	"Size" integer Required Kernel argument size in bytes.
				1119	"Align" integer Required Kernel argument alignment in
				1120	bytes. Must be a power of two.
				1121	"ValueKind" string Required Kernel argument kind that
				1122	specifies how to set up the
				1123	corresponding argument.
				1124	Values include:
				1125
				1126	"ByValue"
				1127	The argument is copied
				1128	directly into the kernarg.
				1129
				1130	"GlobalBuffer"
				1131	A global address space pointer
				1132	to the buffer data is passed
				1133	in the kernarg.
				1134
				1135	"DynamicSharedPointer"
				1136	A group address space pointer
				1137	to dynamically allocated LDS
				1138	is passed in the kernarg.
				1139
				1140	"Sampler"
				1141	A global address space
				1142	pointer to a S# is passed in
				1143	the kernarg.
				1144
				1145	"Image"
				1146	A global address space
				1147	pointer to a T# is passed in
				1148	the kernarg.
				1149
				1150	"Pipe"
				1151	A global address space pointer
				1152	to an OpenCL pipe is passed in
				1153	the kernarg.
				1154
				1155	"Queue"
				1156	A global address space pointer
				1157	to an OpenCL device enqueue
				1158	queue is passed in the
				1159	kernarg.
				1160
				1161	"HiddenGlobalOffsetX"
				1162	The OpenCL grid dispatch
				1163	global offset for the X
				1164	dimension is passed in the
				1165	kernarg.
				1166
				1167	"HiddenGlobalOffsetY"
				1168	The OpenCL grid dispatch
				1169	global offset for the Y
				1170	dimension is passed in the
				1171	kernarg.
				1172
				1173	"HiddenGlobalOffsetZ"
				1174	The OpenCL grid dispatch
				1175	global offset for the Z
				1176	dimension is passed in the
				1177	kernarg.
				1178
				1179	"HiddenNone"
				1180	An argument that is not used
				1181	by the kernel. Space needs to
				1182	be left for it, but it does
				1183	not need to be set up.
				1184
				1185	"HiddenPrintfBuffer"
				1186	A global address space pointer
				1187	to the runtime printf buffer
				1188	is passed in kernarg.
				1189
				1190	"HiddenDefaultQueue"
				1191	A global address space pointer
				1192	to the OpenCL device enqueue
				1193	queue that should be used by
				1194	the kernel by default is
				1195	passed in the kernarg.
				1196
				1197	"HiddenCompletionAction"
Yaxun Liu	c928f2a	2017-10-30 14:30:28 +0000	[diff] [blame]	1198	A global address space pointer
				1199	to help link enqueued kernels into
				1200	the ancestor tree for determining
				1201	when the parent kernel has finished.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1202
				1203	"ValueType" string Required Kernel argument value type. Only
				1204	present if "ValueKind" is
				1205	"ByValue". For vector data
				1206	types, the value is for the
				1207	element type. Values include:
				1208
				1209	- "Struct"
				1210	- "I8"
				1211	- "U8"
				1212	- "I16"
				1213	- "U16"
				1214	- "F16"
				1215	- "I32"
				1216	- "U32"
				1217	- "F32"
				1218	- "I64"
				1219	- "U64"
				1220	- "F64"
				1221
				1222	.. TODO
				1223	How can it be determined if a
				1224	vector type, and what size
				1225	vector?
				1226	"PointeeAlign" integer Alignment in bytes of pointee
				1227	type for pointer type kernel
				1228	argument. Must be a power
				1229	of 2. Only present if
				1230	"ValueKind" is
				1231	"DynamicSharedPointer".
				1232	"AddrSpaceQual" string Kernel argument address space
				1233	qualifier. Only present if
				1234	"ValueKind" is "GlobalBuffer" or
				1235	"DynamicSharedPointer". Values
				1236	are:
				1237
				1238	- "Private"
				1239	- "Global"
				1240	- "Constant"
				1241	- "Local"
				1242	- "Generic"
				1243	- "Region"
				1244
				1245	.. TODO
				1246	Is GlobalBuffer only Global
				1247	or Constant? Is
				1248	DynamicSharedPointer always
				1249	Local? Can HCC allow Generic?
				1250	How can Private or Region
				1251	ever happen?
				1252	"AccQual" string Kernel argument access
				1253	qualifier. Only present if
				1254	"ValueKind" is "Image" or
				1255	"Pipe". Values
				1256	are:
				1257
				1258	- "ReadOnly"
				1259	- "WriteOnly"
				1260	- "ReadWrite"
				1261
				1262	.. TODO
				1263	Does this apply to
				1264	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1265	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1266	performed by the kernel on the
				1267	kernel argument. Only present if
				1268	"ValueKind" is "GlobalBuffer",
				1269	"Image", or "Pipe". This may be
				1270	more restrictive than indicated
				1271	by "AccQual" to reflect what the
				1272	kernel actual does. If not
				1273	present then the runtime must
				1274	assume what is implied by
				1275	"AccQual" and "IsConst". Values
				1276	are:
				1277
				1278	- "ReadOnly"
				1279	- "WriteOnly"
				1280	- "ReadWrite"
				1281
				1282	"IsConst" boolean Indicates if the kernel argument
				1283	is const qualified. Only present
				1284	if "ValueKind" is
				1285	"GlobalBuffer".
				1286
				1287	"IsRestrict" boolean Indicates if the kernel argument
				1288	is restrict qualified. Only
				1289	present if "ValueKind" is
				1290	"GlobalBuffer".
				1291
				1292	"IsVolatile" boolean Indicates if the kernel argument
				1293	is volatile qualified. Only
				1294	present if "ValueKind" is
				1295	"GlobalBuffer".
				1296
				1297	"IsPipe" boolean Indicates if the kernel argument
				1298	is pipe qualified. Only present
				1299	if "ValueKind" is "Pipe".
				1300
				1301	.. TODO
				1302	Can GlobalBuffer be pipe
				1303	qualified?
				1304	================= ============== ========= ================================
				1305
				1306	..
				1307
				1308	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1309	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1310
				1311	============================ ============== ========= =====================
				1312	String Key Value Type Required? Description
				1313	============================ ============== ========= =====================
				1314	"KernargSegmentSize" integer Required The size in bytes of
				1315	the kernarg segment
				1316	that holds the values
				1317	of the arguments to
				1318	the kernel.
				1319	"GroupSegmentFixedSize" integer Required The amount of group
				1320	segment memory
				1321	required by a
				1322	work-group in
				1323	bytes. This does not
				1324	include any
				1325	dynamically allocated
				1326	group segment memory
				1327	that may be added
				1328	when the kernel is
				1329	dispatched.
				1330	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1331	private address space
				1332	memory required for a
				1333	work-item in
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1334	bytes. If the kernel
				1335	uses a dynamic call
				1336	stack then additional
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1337	space must be added
				1338	to this value for the
				1339	call stack.
				1340	"KernargSegmentAlign" integer Required The maximum byte
				1341	alignment of
				1342	arguments in the
				1343	kernarg segment. Must
				1344	be a power of 2.
				1345	"WavefrontSize" integer Required Wavefront size. Must
				1346	be a power of 2.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1347	"NumSGPRs" integer Required Number of scalar
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1348	registers used by a
				1349	wavefront for
				1350	GFX6-GFX9. This
				1351	includes the special
				1352	SGPRs for VCC, Flat
				1353	Scratch (GFX7-GFX9)
				1354	and XNACK (for
				1355	GFX8-GFX9). It does
				1356	not include the 16
				1357	SGPR added if a trap
				1358	handler is
				1359	enabled. It is not
				1360	rounded up to the
				1361	allocation
				1362	granularity.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1363	"NumVGPRs" integer Required Number of vector
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1364	registers used by
				1365	each work-item for
				1366	GFX6-GFX9
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1367	"MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1368	work-group size
				1369	supported by the
				1370	kernel in work-items.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1371	Must be >=1 and
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1372	consistent with
				1373	ReqdWorkGroupSize if
				1374	not 0, 0, 0.
Konstantin Zhuravlyov	06ae4ec	2017-11-28 17:51:08 +0000	[diff] [blame]	1375	"NumSpilledSGPRs" integer Number of stores from
				1376	a scalar register to
				1377	a register allocator
				1378	created spill
				1379	location.
				1380	"NumSpilledVGPRs" integer Number of stores from
				1381	a vector register to
				1382	a register allocator
				1383	created spill
				1384	location.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1385	============================ ============== ========= =====================
				1386
				1387	..
				1388
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1389	Kernel Dispatch
				1390	~~~~~~~~~~~~~~~
				1391
				1392	The HSA architected queuing language (AQL) defines a user space memory interface
				1393	that can be used to control the dispatch of kernels, in an agent independent
				1394	way. An agent can have zero or more AQL queues created for it using the ROCm
				1395	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1396	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1397	mechanics and packet layouts.
				1398
				1399	The packet processor of a kernel agent is responsible for detecting and
				1400	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1401	packet processor is implemented by the hardware command processor (CP),
				1402	asynchronous dispatch controller (ADC) and shader processor input controller
				1403	(SPI).
				1404
				1405	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1406	mode driver to initialize and register the AQL queue with CP.
				1407
				1408	To dispatch a kernel the following actions are performed. This can occur in the
				1409	CPU host program, or from an HSA kernel executing on a GPU.
				1410
				1411	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1412	executed is obtained.
				1413	2. A pointer to the kernel descriptor (see
				1414	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1415	obtained. It must be for a kernel that is contained in a code object that that
				1416	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1417	associated.
				1418	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1419	for a memory region with the kernarg property for the kernel agent that will
				1420	execute the kernel. It must be at least 16 byte aligned.
				1421	4. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	1422	allocation. The layout is defined in the HSA Programmer's Language Reference
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1423	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1424	memory in the same way constant memory is accessed. (Note that the HSA
				1425	specification allows an implementation to copy the kernel argument contents to
				1426	another location that is accessed by the kernel.)
				1427	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1428	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1429	packet. The packet must be set up, and the final write must use an atomic
				1430	store release to set the packet kind to ensure the packet contents are
				1431	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1432	notify the kernel agent that the AQL queue has been updated. These rules, and
				1433	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1434	System Architecture Specification* [HSA]_.
				1435	6. A kernel dispatch packet includes information about the actual dispatch,
				1436	such as grid and work-group size, together with information from the code
				1437	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1438	the kernel symbol can be used to obtain the code object values which are
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1439	recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1440	7. CP executes micro-code and is responsible for detecting and setting up the
				1441	GPU to execute the wavefronts of a kernel dispatch.
				1442	8. CP ensures that when the a wavefront starts executing the kernel machine
				1443	code, the scalar general purpose registers (SGPR) and vector general purpose
				1444	registers (VGPR) are set up as required by the machine code. The required
				1445	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1446	register state is defined in
				1447	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1448	9. The prolog of the kernel machine code (see
				1449	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1450	before continuing executing the machine code that corresponds to the kernel.
				1451	10. When the kernel dispatch has completed execution, CP signals the completion
				1452	signal specified in the kernel dispatch packet if not 0.
				1453
				1454	.. _amdgpu-amdhsa-memory-spaces:
				1455
				1456	Memory Spaces
				1457	~~~~~~~~~~~~~
				1458
				1459	The memory space properties are:
				1460
				1461	.. table:: AMDHSA Memory Spaces
				1462	:name: amdgpu-amdhsa-memory-spaces-table
				1463
				1464	================= =========== ======== ======= ==================
				1465	Memory Space Name HSA Segment Hardware Address NULL Value
				1466	Name Name Size
				1467	================= =========== ======== ======= ==================
				1468	Private private scratch 32 0x00000000
				1469	Local group LDS 32 0xFFFFFFFF
				1470	Global global global 64 0x0000000000000000
				1471	Constant constant *same as 64 0x0000000000000000
				1472	global*
				1473	Generic flat flat 64 0x0000000000000000
				1474	Region N/A GDS 32 *not implemented
				1475	for AMDHSA*
				1476	================= =========== ======== ======= ==================
				1477
				1478	The global and constant memory spaces both use global virtual addresses, which
				1479	are the same virtual address space used by the CPU. However, some virtual
				1480	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1481	and some by both.
				1482
				1483	Using the constant memory space indicates that the data will not change during
				1484	the execution of the kernel. This allows scalar read instructions to be
				1485	used. The vector and scalar L1 caches are invalidated of volatile data before
				1486	each kernel dispatch execution to allow constant memory to change values between
				1487	kernel dispatches.
				1488
				1489	The local memory space uses the hardware Local Data Store (LDS) which is
				1490	automatically allocated when the hardware creates work-groups of wavefronts, and
				1491	freed when all the wavefronts of a work-group have terminated. The data store
				1492	(DS) instructions can be used to access it.
				1493
				1494	The private memory space uses the hardware scratch memory support. If the kernel
				1495	uses scratch, then the hardware allocates memory that is accessed using
				1496	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1497	address to physical address is:
				1498
				1499	``wavefront-scratch-base +
				1500	(private-address * wavefront-size * 4) +
				1501	(wavefront-lane-id * 4)``
				1502
				1503	There are different ways that the wavefront scratch base address is determined
				1504	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1505	memory can be accessed in an interleaved manner using buffer instruction with
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1506	the scratch buffer descriptor and per wavefront scratch offset, by the scratch
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1507	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1508	same private address, the interleaving results in adjacent dwords being accessed
				1509	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1510	supported except by flat and scratch instructions in GFX9.
				1511
				1512	The generic address space uses the hardware flat address support available in
				1513	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1514	local appertures), that are outside the range of addressible global memory, to
				1515	map from a flat address to a private or local address.
				1516
				1517	FLAT instructions can take a flat address and access global, private (scratch)
				1518	and group (LDS) memory depending in if the address is within one of the
				1519	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1520	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1521	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1522	(see :ref:`amdgpu-amdhsa-m0`).
				1523
				1524	To convert between a segment address and a flat address the base address of the
				1525	appertures address can be used. For GFX7-GFX8 these are available in the
				1526	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1527	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1528	GFX9 the appature base addresses are directly available as inline constant
				1529	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1530	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1531	which makes it easier to convert from flat to segment or segment to flat.
				1532
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1533	Image and Samplers
				1534	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1535
				1536	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1537	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1538	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1539	enumeration values for the queries that are not trivially deducible from the S#
				1540	representation.
				1541
				1542	HSA Signals
				1543	~~~~~~~~~~~
				1544
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1545	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1546	structure allocated in memory accessible from both the CPU and GPU. The
				1547	structure is defined by the ROCm runtime and subject to change between releases
				1548	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1549
				1550	.. _amdgpu-amdhsa-hsa-aql-queue:
				1551
				1552	HSA AQL Queue
				1553	~~~~~~~~~~~~~
				1554
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1555	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1556	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1557	fields needed to implement certain language features such as the flat address
				1558	aperture bases. It also contains fields used by CP such as managing the
				1559	allocation of scratch memory.
				1560
				1561	.. _amdgpu-amdhsa-kernel-descriptor:
				1562
				1563	Kernel Descriptor
				1564	~~~~~~~~~~~~~~~~~
				1565
				1566	A kernel descriptor consists of the information needed by CP to initiate the
				1567	execution of a kernel, including the entry point address of the machine code
				1568	that implements the kernel.
				1569
				1570	Kernel Descriptor for GFX6-GFX9
				1571	+++++++++++++++++++++++++++++++
				1572
				1573	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1574
				1575	.. table:: Kernel Descriptor for GFX6-GFX9
				1576	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1577
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1578	======= ======= =============================== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1579	Bits Size Field Name Description
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1580	======= ======= =============================== ============================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1581	31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1582	address space memory
				1583	required for a work-group
				1584	in bytes. This does not
				1585	include any dynamically
				1586	allocated local address
				1587	space memory that may be
				1588	added when the kernel is
				1589	dispatched.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1590	63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1591	private address space
				1592	memory required for a
				1593	work-item in bytes. If
				1594	is_dynamic_callstack is 1
				1595	then additional space must
				1596	be added to this value for
				1597	the call stack.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1598	127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1599	191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1600	negative) from base
				1601	address of kernel
				1602	descriptor to kernel's
				1603	entry point instruction
				1604	which must be 256 byte
				1605	aligned.
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1606	383:192 24 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1607	bytes
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1608	415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1609	program settings used by
				1610	CP to set up
				1611	``COMPUTE_PGM_RSRC1``
				1612	configuration
				1613	register. See
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1614	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1615	447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1616	program settings used by
				1617	CP to set up
				1618	``COMPUTE_PGM_RSRC2``
				1619	configuration
				1620	register. See
				1621	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1622	448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
				1623	SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1624	(see
				1625	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1626
				1627	The total number of SGPR
				1628	user data registers
				1629	requested must not exceed
				1630	16 and match value in
				1631	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1632	Any requests beyond 16
				1633	will be ignored.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1634	449 1 bit EnableSGPRDispatchPtr see above
				1635	450 1 bit EnableSGPRQueuePtr see above
				1636	451 1 bit EnableSGPRKernargSegmentPtr see above
				1637	452 1 bit EnableSGPRDispatchID see above
				1638	453 1 bit EnableSGPRFlatScratchInit see above
				1639	454 1 bit EnableSGPRPrivateSegmentSize see above
				1640	455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
				1641	should always be 0.
				1642	456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
				1643	should always be 0.
				1644	457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
				1645	should always be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	1646	463:458 6 bits Reserved, must be 0.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1647	511:464 6 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1648	bytes
				1649	512 Total size 64 bytes.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1650	======= ====================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1651
				1652	..
				1653
				1654	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1655	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1656
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1657	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1658	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1659	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1660	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1661	used by each work-item,
				1662	granularity is device
				1663	specific:
				1664
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1665	GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1666	- max_vgpr 1..256
				1667	- roundup((max_vgpg + 1)
				1668	/ 4) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1669
				1670	Used by CP to set up
				1671	``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1672	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1673	used by a wavefront,
				1674	granularity is device
				1675	specific:
				1676
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1677	GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1678	- max_sgpr 1..112
				1679	- roundup((max_sgpg + 1)
				1680	/ 8) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1681	GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1682	- max_sgpr 1..112
				1683	- roundup((max_sgpg + 1)
				1684	/ 16) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1685
				1686	Includes the special SGPRs
				1687	for VCC, Flat Scratch (for
				1688	GFX7 onwards) and XNACK
				1689	(for GFX8 onwards). It does
				1690	not include the 16 SGPR
				1691	added if a trap handler is
				1692	enabled.
				1693
				1694	Used by CP to set up
				1695	``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1696	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1697
				1698	Start executing wavefront
				1699	at the specified priority.
				1700
				1701	CP is responsible for
				1702	filling in
				1703	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1704	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1705	with specified rounding
				1706	mode for single (32
				1707	bit) floating point
				1708	precision floating point
				1709	operations.
				1710
				1711	Floating point rounding
				1712	mode values are defined in
				1713	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1714
				1715	Used by CP to set up
				1716	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1717	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1718	with specified rounding
				1719	denorm mode for half/double (16
				1720	and 64 bit) floating point
				1721	precision floating point
				1722	operations.
				1723
				1724	Floating point rounding
				1725	mode values are defined in
				1726	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1727
				1728	Used by CP to set up
				1729	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1730	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1731	with specified denorm mode
				1732	for single (32
				1733	bit) floating point
				1734	precision floating point
				1735	operations.
				1736
				1737	Floating point denorm mode
				1738	values are defined in
				1739	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1740
				1741	Used by CP to set up
				1742	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1743	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1744	with specified denorm mode
				1745	for half/double (16
				1746	and 64 bit) floating point
				1747	precision floating point
				1748	operations.
				1749
				1750	Floating point denorm mode
				1751	values are defined in
				1752	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1753
				1754	Used by CP to set up
				1755	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1756	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1757
				1758	Start executing wavefront
				1759	in privilege trap handler
				1760	mode.
				1761
				1762	CP is responsible for
				1763	filling in
				1764	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1765	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1766	with DX10 clamp mode
				1767	enabled. Used by the vector
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1768	ALU to force DX10 style
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1769	treatment of NaN's (when
				1770	set, clamp NaN to zero,
				1771	otherwise pass NaN
				1772	through).
				1773
				1774	Used by CP to set up
				1775	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1776	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1777
				1778	Start executing wavefront
				1779	in single step mode.
				1780
				1781	CP is responsible for
				1782	filling in
				1783	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1784	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1785	with IEEE mode
				1786	enabled. Floating point
				1787	opcodes that support
				1788	exception flag gathering
				1789	will quiet and propagate
				1790	signaling-NaN inputs per
				1791	IEEE 754-2008. Min_dx10 and
				1792	max_dx10 become IEEE
				1793	754-2008 compliant due to
				1794	signaling-NaN propagation
				1795	and quieting.
				1796
				1797	Used by CP to set up
				1798	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1799	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1800
				1801	Only one work-group allowed
				1802	to execute on a compute
				1803	unit.
				1804
				1805	CP is responsible for
				1806	filling in
				1807	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1808	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1809
				1810	Flag that can be used to
				1811	control debugging code.
				1812
				1813	CP is responsible for
				1814	filling in
				1815	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1816	26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1817	Reserved, must be 0.
				1818	GFX9
				1819	Wavefront starts execution
				1820	with specified fp16 overflow
				1821	mode.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1822
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1823	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1824	+/-INF values.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1825	- If 1, fp16 overflow that is the
				1826	result of an +/-INF input value
				1827	or divide by 0 produces a +/-INF,
				1828	otherwise clamps computed
				1829	overflow to +/-MAX_FP16 as
				1830	appropriate.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1831
				1832	Used by CP to set up
				1833	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1834	31:27 5 bits Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1835	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1836	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1837
				1838	..
				1839
				1840	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1841	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1842
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1843	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1844	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1845	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1846	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1847	_WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1848	system register (see
				1849	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1850
				1851	Used by CP to set up
				1852	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1853	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1854	user data registers
				1855	requested. This number must
				1856	match the number of user
				1857	data registers enabled.
				1858
				1859	Used by CP to set up
				1860	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	2ca6b1f	2018-05-29 19:09:13 +0000	[diff] [blame^]	1861	6 1 bit ENABLE_TRAP_HANDLER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1862
Konstantin Zhuravlyov	2ca6b1f	2018-05-29 19:09:13 +0000	[diff] [blame^]	1863	This bit represents
				1864	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
				1865	which is set by the CP if
				1866	the runtime has installed a
				1867	trap handler.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1868	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1869	system SGPR register for
				1870	the work-group id in the X
				1871	dimension (see
				1872	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1873
				1874	Used by CP to set up
				1875	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1876	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1877	system SGPR register for
				1878	the work-group id in the Y
				1879	dimension (see
				1880	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1881
				1882	Used by CP to set up
				1883	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1884	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1885	system SGPR register for
				1886	the work-group id in the Z
				1887	dimension (see
				1888	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1889
				1890	Used by CP to set up
				1891	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1892	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1893	system SGPR register for
				1894	work-group information (see
				1895	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1896
				1897	Used by CP to set up
				1898	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1899	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1900	VGPR system registers used
				1901	for the work-item ID.
				1902	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1903	defines the values.
				1904
				1905	Used by CP to set up
				1906	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1907	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1908
				1909	Wavefront starts execution
				1910	with address watch
				1911	exceptions enabled which
				1912	are generated when L1 has
				1913	witnessed a thread access
				1914	an *address of
				1915	interest*.
				1916
				1917	CP is responsible for
				1918	filling in the address
				1919	watch bit in
				1920	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1921	according to what the
				1922	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1923	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1924
				1925	Wavefront starts execution
				1926	with memory violation
				1927	exceptions exceptions
				1928	enabled which are generated
				1929	when a memory violation has
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1930	occurred for this wavefront from
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1931	L1 or LDS
				1932	(write-to-read-only-memory,
				1933	mis-aligned atomic, LDS
				1934	address out of range,
				1935	illegal address, etc.).
				1936
				1937	CP sets the memory
				1938	violation bit in
				1939	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1940	according to what the
				1941	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1942	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1943
				1944	CP uses the rounded value
				1945	from the dispatch packet,
				1946	not this value, as the
				1947	dispatch may contain
				1948	dynamically allocated group
				1949	segment memory. CP writes
				1950	directly to
				1951	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1952
				1953	Amount of group segment
				1954	(LDS) to allocate for each
				1955	work-group. Granularity is
				1956	device specific:
				1957
				1958	GFX6:
				1959	roundup(lds-size / (64 * 4))
				1960	GFX7-GFX9:
				1961	roundup(lds-size / (128 * 4))
				1962
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1963	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				1964	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1965	enabled.
				1966
				1967	Used by CP to set up
				1968	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1969	(set from bits 0..6).
				1970
				1971	IEEE 754 FP Invalid
				1972	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1973	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				1974	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1975	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1976	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				1977	_DIVISION_BY_ZERO Zero
				1978	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
				1979	_OVERFLOW
				1980	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				1981	_UNDERFLOW
				1982	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				1983	_INEXACT
				1984	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				1985	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1986	only)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1987	31 1 bit Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1988	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1989	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1990
				1991	..
				1992
				1993	.. table:: Floating Point Rounding Mode Enumeration Values
				1994	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1995
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1996	====================================== ===== ==============================
				1997	Enumeration Name Value Description
				1998	====================================== ===== ==============================
				1999	AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				2000	AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				2001	AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				2002	AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
				2003	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2004
				2005	..
				2006
				2007	.. table:: Floating Point Denorm Mode Enumeration Values
				2008	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				2009
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2010	====================================== ===== ==============================
				2011	Enumeration Name Value Description
				2012	====================================== ===== ==============================
				2013	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
				2014	Denorms
				2015	AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				2016	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				2017	AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
				2018	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2019
				2020	..
				2021
				2022	.. table:: System VGPR Work-Item ID Enumeration Values
				2023	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				2024
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2025	======================================== ===== ============================
				2026	Enumeration Name Value Description
				2027	======================================== ===== ============================
				2028	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
				2029	ID.
				2030	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				2031	dimensions ID.
				2032	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				2033	dimensions ID.
				2034	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				2035	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2036
				2037	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				2038
				2039	Initial Kernel Execution State
				2040	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				2041
				2042	This section defines the register state that will be set up by the packet
				2043	processor prior to the start of execution of every wavefront. This is limited by
				2044	the constraints of the hardware controllers of CP/ADC/SPI.
				2045
				2046	The order of the SGPR registers is defined, but the compiler can specify which
				2047	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				2048	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2049	for enabled registers are dense starting at SGPR0: the first enabled register is
				2050	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				2051	an SGPR number.
				2052
				2053	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2054	all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2055	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				2056	initialized. These are then immediately followed by the System SGPRs that are
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2057	set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2058	dispatch.
				2059
				2060	SGPR register initial state is defined in
				2061	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				2062
				2063	.. table:: SGPR Register Set Up Order
				2064	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				2065
				2066	========== ========================== ====== ==============================
				2067	SGPR Order Name Number Description
				2068	(kernel descriptor enable of
				2069	field) SGPRs
				2070	========== ========================== ====== ==============================
				2071	First Private Segment Buffer 4 V# that can be used, together
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2072	(enable_sgpr_private with Scratch Wavefront Offset
				2073	_segment_buffer) as an offset, to access the
				2074	private memory space using a
				2075	segment address.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2076
				2077	CP uses the value provided by
				2078	the runtime.
				2079	then Dispatch Ptr 2 64 bit address of AQL dispatch
				2080	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				2081	actually executing.
				2082	then Queue Ptr 2 64 bit address of amd_queue_t
				2083	(enable_sgpr_queue_ptr) object for AQL queue on which
				2084	the dispatch packet was
				2085	queued.
				2086	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				2087	(enable_sgpr_kernarg segment. This is directly
				2088	_segment_ptr) copied from the
				2089	kernarg_address in the kernel
				2090	dispatch packet.
				2091
				2092	Having CP load it once avoids
				2093	loading it at the beginning of
				2094	every wavefront.
				2095	then Dispatch Id 2 64 bit Dispatch ID of the
				2096	(enable_sgpr_dispatch_id) dispatch packet being
				2097	executed.
				2098	then Flat Scratch Init 2 This is 2 SGPRs:
				2099	(enable_sgpr_flat_scratch
				2100	_init) GFX6
				2101	Not supported.
				2102	GFX7-GFX8
				2103	The first SGPR is a 32 bit
				2104	byte offset from
				2105	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2106	to per SPI base of memory
				2107	for scratch for the queue
				2108	executing the kernel
				2109	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2110	from the runtime. (The
				2111	Scratch Segment Buffer base
				2112	address is
				2113	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2114	plus this offset.) The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2115	of Scratch Wavefront Offset must
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2116	be added to this offset by
				2117	the kernel machine code,
				2118	right shifted by 8, and
				2119	moved to the FLAT_SCRATCH_HI
				2120	SGPR register.
				2121	FLAT_SCRATCH_HI corresponds
				2122	to SGPRn-4 on GFX7, and
				2123	SGPRn-6 on GFX8 (where SGPRn
				2124	is the highest numbered SGPR
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2125	allocated to the wavefront).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2126	FLAT_SCRATCH_HI is
				2127	multiplied by 256 (as it is
				2128	in units of 256 bytes) and
				2129	added to
				2130	``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2131	to calculate the per wavefront
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2132	FLAT SCRATCH BASE in flat
				2133	memory instructions that
				2134	access the scratch
				2135	apperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2136
				2137	The second SGPR is 32 bit
				2138	byte size of a single
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2139	work-item's scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2140	usage. CP obtains this from
				2141	the runtime, and it is
				2142	always a multiple of DWORD.
				2143	CP checks that the value in
				2144	the kernel dispatch packet
				2145	Private Segment Byte Size is
				2146	not larger, and requests the
				2147	runtime to increase the
				2148	queue's scratch size if
				2149	necessary. The kernel code
				2150	must move it to
				2151	FLAT_SCRATCH_LO which is
				2152	SGPRn-3 on GFX7 and SGPRn-5
				2153	on GFX8. FLAT_SCRATCH_LO is
				2154	used as the FLAT SCRATCH
				2155	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2156	instructions. Having CP load
				2157	it once avoids loading it at
				2158	the beginning of every
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2159	wavefront.
				2160	GFX9
				2161	This is the
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2162	64 bit base address of the
				2163	per SPI scratch backing
				2164	memory managed by SPI for
				2165	the queue executing the
				2166	kernel dispatch. CP obtains
				2167	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2168	divides it if there are
				2169	multiple Shader Arrays each
				2170	with its own SPI). The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2171	of Scratch Wavefront Offset must
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2172	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2173	machine code and the result
				2174	moved to the FLAT_SCRATCH
				2175	SGPR which is SGPRn-6 and
				2176	SGPRn-5. It is used as the
				2177	FLAT SCRATCH BASE in flat
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2178	memory instructions.
				2179	then Private Segment Size 1 The 32 bit byte size of a
				2180	(enable_sgpr_private single
				2181	work-item's
				2182	scratch_segment_size) memory
				2183	allocation. This is the
				2184	value from the kernel
				2185	dispatch packet Private
				2186	Segment Byte Size rounded up
				2187	by CP to a multiple of
				2188	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2189
				2190	Having CP load it once avoids
				2191	loading it at the beginning of
				2192	every wavefront.
				2193
				2194	This is not used for
				2195	GFX7-GFX8 since it is the same
				2196	value as the second SGPR of
				2197	Flat Scratch Init. However, it
				2198	may be needed for GFX9 which
				2199	changes the meaning of the
				2200	Flat Scratch Init value.
				2201	then Grid Work-Group Count X 1 32 bit count of the number of
				2202	(enable_sgpr_grid work-groups in the X dimension
				2203	_workgroup_count_X) for the grid being
				2204	executed. Computed from the
				2205	fields in the kernel dispatch
				2206	packet as ((grid_size.x +
				2207	workgroup_size.x - 1) /
				2208	workgroup_size.x).
				2209	then Grid Work-Group Count Y 1 32 bit count of the number of
				2210	(enable_sgpr_grid work-groups in the Y dimension
				2211	_workgroup_count_Y && for the grid being
				2212	less than 16 previous executed. Computed from the
				2213	SGPRs) fields in the kernel dispatch
				2214	packet as ((grid_size.y +
				2215	workgroup_size.y - 1) /
				2216	workgroupSize.y).
				2217
				2218	Only initialized if <16
				2219	previous SGPRs initialized.
				2220	then Grid Work-Group Count Z 1 32 bit count of the number of
				2221	(enable_sgpr_grid work-groups in the Z dimension
				2222	_workgroup_count_Z && for the grid being
				2223	less than 16 previous executed. Computed from the
				2224	SGPRs) fields in the kernel dispatch
				2225	packet as ((grid_size.z +
				2226	workgroup_size.z - 1) /
				2227	workgroupSize.z).
				2228
				2229	Only initialized if <16
				2230	previous SGPRs initialized.
				2231	then Work-Group Id X 1 32 bit work-group id in X
				2232	(enable_sgpr_workgroup_id dimension of grid for
				2233	_X) wavefront.
				2234	then Work-Group Id Y 1 32 bit work-group id in Y
				2235	(enable_sgpr_workgroup_id dimension of grid for
				2236	_Y) wavefront.
				2237	then Work-Group Id Z 1 32 bit work-group id in Z
				2238	(enable_sgpr_workgroup_id dimension of grid for
				2239	_Z) wavefront.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2240	then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2241	(enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2242	_info) threadgroup_size_in_wavefronts[5:0]}
				2243	then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2244	(enable_sgpr_private of scratch base of queue
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2245	_segment_wavefront_offset) executing the kernel
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2246	dispatch. Must be used as an
				2247	offset with Private
				2248	segment address when using
				2249	Scratch Segment Buffer. It
				2250	must be used to set up FLAT
				2251	SCRATCH for flat addressing
				2252	(see
				2253	:ref:`amdgpu-amdhsa-flat-scratch`).
				2254	========== ========================== ====== ==============================
				2255
				2256	The order of the VGPR registers is defined, but the compiler can specify which
				2257	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2258	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2259	for enabled registers are dense starting at VGPR0: the first enabled register is
				2260	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2261	VGPR number.
				2262
				2263	VGPR register initial state is defined in
				2264	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2265
				2266	.. table:: VGPR Register Set Up Order
				2267	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2268
				2269	========== ========================== ====== ==============================
				2270	VGPR Order Name Number Description
				2271	(kernel descriptor enable of
				2272	field) VGPRs
				2273	========== ========================== ====== ==============================
				2274	First Work-Item Id X 1 32 bit work item id in X
				2275	(Always initialized) dimension of work-group for
				2276	wavefront lane.
				2277	then Work-Item Id Y 1 32 bit work item id in Y
				2278	(enable_vgpr_workitem_id dimension of work-group for
				2279	> 0) wavefront lane.
				2280	then Work-Item Id Z 1 32 bit work item id in Z
				2281	(enable_vgpr_workitem_id dimension of work-group for
				2282	> 1) wavefront lane.
				2283	========== ========================== ====== ==============================
				2284
Hiroshi Inoue	bcadfee	2018-04-12 05:53:20 +0000	[diff] [blame]	2285	The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2286
				2287	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2288	registers.
				2289	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2290	combination including none.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2291	3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
				2292	its value cannot included with the flat scratch init value which is per queue.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2293	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2294	or (X, Y, Z).
				2295
				2296	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2297	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2298
				2299	The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2300	has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2301	instructions (GFX9).
				2302
				2303	If buffer operations are used then the compiler can generate a V# with the
				2304	following properties:
				2305
				2306	* base address of 0
				2307	* no swizzle
				2308	* ATC: 1 if IOMMU present (such as APU)
				2309	* ptr64: 1
				2310	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2311	APU and NC for dGPU).
				2312
				2313	.. _amdgpu-amdhsa-kernel-prolog:
				2314
				2315	Kernel Prolog
				2316	~~~~~~~~~~~~~
				2317
				2318	.. _amdgpu-amdhsa-m0:
				2319
				2320	M0
				2321	++
				2322
				2323	GFX6-GFX8
				2324	The M0 register must be initialized with a value at least the total LDS size
				2325	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2326	available in dispatch packet. For M0, it is also possible to use maximum
				2327	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2328	GFX7-GFX8).
				2329	GFX9
				2330	The M0 register is not used for range checking LDS accesses and so does not
				2331	need to be initialized in the prolog.
				2332
				2333	.. _amdgpu-amdhsa-flat-scratch:
				2334
				2335	Flat Scratch
				2336	++++++++++++
				2337
				2338	If the kernel may use flat operations to access scratch memory, the prolog code
				2339	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2340	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2341	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2342
				2343	GFX6
				2344	Flat scratch is not supported.
				2345
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2346	GFX7-GFX8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2347	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2348	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2349	being managed by SPI for the queue executing the kernel dispatch. This is
				2350	the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2351	prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2352	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2353	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2354	by 8 before moving into FLAT_SCRATCH_LO.
				2355	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2356	work-items scratch memory usage. This is directly loaded from the kernel
				2357	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2358	DWORD. Having CP load it once avoids loading it at the beginning of every
				2359	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2360	SIZE.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2361
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2362	GFX9
				2363	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2364	memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2365	prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2366	pair for use as the flat scratch base in flat memory instructions.
				2367
				2368	.. _amdgpu-amdhsa-memory-model:
				2369
				2370	Memory Model
				2371	~~~~~~~~~~~~
				2372
				2373	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2374	(see :ref:`memmodel`). The implementation is WIP.
				2375
				2376	.. TODO
				2377	Update when implementation complete.
				2378
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2379	The AMDGPU backend supports the memory synchronization scopes specified in
				2380	:ref:`amdgpu-memory-scopes`.
				2381
				2382	The code sequences used to implement the memory model are defined in table
				2383	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2384
				2385	The sequences specify the order of instructions that a single thread must
				2386	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2387	to other memory instructions executed by the same thread. This allows them to be
				2388	moved earlier or later which can allow them to be combined with other instances
				2389	of the same instruction, or hoisted/sunk out of loops to improve
				2390	performance. Only the instructions related to the memory model are given;
				2391	additional ``s_waitcnt`` instructions are required to ensure registers are
				2392	defined before being used. These may be able to be combined with the memory
				2393	model ``s_waitcnt`` instructions as described above.
				2394
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2395	The AMDGPU backend supports the following memory models:
				2396
				2397	HSA Memory Model [HSA]_
				2398	The HSA memory model uses a single happens-before relation for all address
				2399	spaces (see :ref:`amdgpu-address-spaces`).
				2400	OpenCL Memory Model [OpenCL]_
				2401	The OpenCL memory model which has separate happens-before relations for the
				2402	global and local address spaces. Only a fence specifying both global and
				2403	local address space, and seq_cst instructions join the relationships. Since
				2404	the LLVM ``memfence`` instruction does not allow an address space to be
				2405	specified the OpenCL fence has to convervatively assume both local and
				2406	global address space was specified. However, optimizations can often be
				2407	done to eliminate the additional ``s_waitcnt`` instructions when there are
				2408	no intervening memory instructions which access the corresponding address
				2409	space. The code sequences in the table indicate what can be omitted for the
				2410	OpenCL memory. The target triple environment is used to determine if the
				2411	source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2412
				2413	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2414	operations.
				2415
				2416	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2417	termed vector memory operations.
				2418
				2419	For GFX6-GFX9:
				2420
				2421	* Each agent has multiple compute units (CU).
				2422	* Each CU has multiple SIMDs that execute wavefronts.
				2423	* The wavefronts for a single work-group are executed in the same CU but may be
				2424	executed by different SIMDs.
				2425	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2426	executing on it.
				2427	* All LDS operations of a CU are performed as wavefront wide operations in a
				2428	global order and involve no caching. Completion is reported to a wavefront in
				2429	execution order.
				2430	* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2431	CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2432	can be reordered relative to each other, which can result in reordering the
				2433	visibility of vector memory operations with respect to LDS operations of other
				2434	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2435	ensure synchronization between LDS operations and vector memory operations
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2436	between wavefronts of a work-group, but not between operations performed by the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2437	same wavefront.
				2438	* The vector memory operations are performed as wavefront wide operations and
				2439	completion is reported to a wavefront in execution order. The exception is
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2440	that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2441	vector memory order if they access LDS memory, and out of LDS operation order
				2442	if they access global memory.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2443	* The vector memory operations access a single vector L1 cache shared by all
				2444	SIMDs a CU. Therefore, no special action is required for coherence between the
				2445	lanes of a single wavefront, or for coherence between wavefronts in the same
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2446	work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2447	executing in different work-groups as they may be executing on different CUs.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2448	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2449	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2450	scalar operations are used in a restricted way so do not impact the memory
				2451	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2452	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2453	the same agent.
				2454	* The L2 cache has independent channels to service disjoint ranges of virtual
				2455	addresses.
				2456	* Each CU has a separate request queue per channel. Therefore, the vector and
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2457	scalar memory operations performed by wavefronts executing in different work-groups
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2458	(which may be executing on different CUs) of an agent can be reordered
				2459	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2460	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2461	previous vector memory operation has completed before executing a subsequent
				2462	vector memory or LDS operation and so can be used to meet the requirements of
				2463	acquire and release.
				2464	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2465	of virtual addresses can be set up to bypass it to ensure system coherence.
				2466
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2467	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2468	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2469	memory, atomic memory orderings are not meaningful and all accesses are treated
				2470	as non-atomic.
				2471
				2472	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2473	scalar memory instructions). Since the constant address space contents do not
				2474	change during the execution of a kernel dispatch it is not legal to perform
				2475	stores, and atomic memory orderings are not meaningful and all access are
				2476	treated as non-atomic.
				2477
				2478	A memory synchronization scope wider than work-group is not meaningful for the
				2479	group (LDS) address space and is treated as work-group.
				2480
				2481	The memory model does not support the region address space which is treated as
				2482	non-atomic.
				2483
				2484	Acquire memory ordering is not meaningful on store atomic instructions and is
				2485	treated as non-atomic.
				2486
				2487	Release memory ordering is not meaningful on load atomic instructions and is
				2488	treated a non-atomic.
				2489
				2490	Acquire-release memory ordering is not meaningful on load or store atomic
				2491	instructions and is treated as acquire and release respectively.
				2492
				2493	AMDGPU backend only uses scalar memory operations to access memory that is
				2494	proven to not change during the execution of the kernel dispatch. This includes
				2495	constant address space and global address space for program scope const
				2496	variables. Therefore the kernel machine code does not have to maintain the
				2497	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2498	and vector L1 caches are invalidated between kernel dispatches by CP since
				2499	constant address space data may change between kernel dispatch executions. See
				2500	:ref:`amdgpu-amdhsa-memory-spaces`.
				2501
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2502	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2503	case the AMDGPU backend ensures the memory location used to spill is never
				2504	accessed by vector memory operations at the same time. If scalar writes are used
				2505	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2506	return since the locations may be used for vector memory instructions by a
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2507	future wavefront that uses the same scratch area, or a function call that creates a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2508	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2509	as all scalar writes are write-before-read in the same thread.
				2510
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2511	Scratch backing memory (which is used for the private address space)
				2512	is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
				2513	address space is only accessed by a single thread, and is always
				2514	write-before-read, there is never a need to invalidate these entries from the L1
				2515	cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
				2516	volatile cache lines.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2517
				2518	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2519	to invalidate the L2 cache. This also causes it to be treated as
				2520	non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
				2521	(cache coherent) and so the L2 cache will coherent with the CPU and other
				2522	agents.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2523
				2524	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2525	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2526
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2527	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2528	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2529	Ordering Sync Scope Address
				2530	Space
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2531	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2532	Non-Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2533	-----------------------------------------------------------------------------------
				2534	load none none - global - !volatile & !nontemporal
				2535	- generic
				2536	- private 1. buffer/global/flat_load
				2537	- constant
				2538	- volatile & !nontemporal
				2539
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2540	1. buffer/global/flat_load
				2541	glc=1
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2542
				2543	- nontemporal
				2544
				2545	1. buffer/global/flat_load
				2546	glc=1 slc=1
				2547
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2548	load none none - local 1. ds_load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2549	store none none - global - !nontemporal
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2550	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2551	- private 1. buffer/global/flat_store
				2552	- constant
				2553	- nontemporal
				2554
				2555	1. buffer/global/flat_stote
				2556	glc=1 slc=1
				2557
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2558	store none none - local 1. ds_store
				2559	Unordered Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2560	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2561	load atomic unordered any any Same as non-atomic.
				2562	store atomic unordered any any Same as non-atomic.
				2563	atomicrmw unordered any any *Same as monotonic
				2564	atomic*.
				2565	Monotonic Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2566	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2567	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2568	- wavefront - generic
				2569	- workgroup
				2570	load atomic monotonic - singlethread - local 1. ds_load
				2571	- wavefront
				2572	- workgroup
				2573	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2574	- system - generic glc=1
				2575	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2576	- wavefront - generic
				2577	- workgroup
				2578	- agent
				2579	- system
				2580	store atomic monotonic - singlethread - local 1. ds_store
				2581	- wavefront
				2582	- workgroup
				2583	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2584	- wavefront - generic
				2585	- workgroup
				2586	- agent
				2587	- system
				2588	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2589	- wavefront
				2590	- workgroup
				2591	Acquire Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2592	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2593	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2594	- wavefront - local
				2595	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2596	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				2597	load atomic acquire - workgroup - local 1. ds_load
				2598	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2599
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2600	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2601	- Must happen before
				2602	any following
				2603	global/generic
				2604	load/load
				2605	atomic/store/store
				2606	atomic/atomicrmw.
				2607	- Ensures any
				2608	following global
				2609	data read is no
				2610	older than the load
				2611	atomic value being
				2612	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2613	load atomic acquire - workgroup - generic 1. flat_load
				2614	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2615
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2616	- If OpenCL, omit.
				2617	- Must happen before
				2618	any following
				2619	global/generic
				2620	load/load
				2621	atomic/store/store
				2622	atomic/atomicrmw.
				2623	- Ensures any
				2624	following global
				2625	data read is no
				2626	older than the load
				2627	atomic value being
				2628	acquired.
				2629	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2630	- system glc=1
				2631	2. s_waitcnt vmcnt(0)
				2632
				2633	- Must happen before
				2634	following
				2635	buffer_wbinvl1_vol.
				2636	- Ensures the load
				2637	has completed
				2638	before invalidating
				2639	the cache.
				2640
				2641	3. buffer_wbinvl1_vol
				2642
				2643	- Must happen before
				2644	any following
				2645	global/generic
				2646	load/load
				2647	atomic/atomicrmw.
				2648	- Ensures that
				2649	following
				2650	loads will not see
				2651	stale global data.
				2652
				2653	load atomic acquire - agent - generic 1. flat_load glc=1
				2654	- system 2. s_waitcnt vmcnt(0) &
				2655	lgkmcnt(0)
				2656
				2657	- If OpenCL omit
				2658	lgkmcnt(0).
				2659	- Must happen before
				2660	following
				2661	buffer_wbinvl1_vol.
				2662	- Ensures the flat_load
				2663	has completed
				2664	before invalidating
				2665	the cache.
				2666
				2667	3. buffer_wbinvl1_vol
				2668
				2669	- Must happen before
				2670	any following
				2671	global/generic
				2672	load/load
				2673	atomic/atomicrmw.
				2674	- Ensures that
				2675	following loads
				2676	will not see stale
				2677	global data.
				2678
				2679	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2680	- wavefront - local
				2681	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2682	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				2683	atomicrmw acquire - workgroup - local 1. ds_atomic
				2684	2. waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2685
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2686	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2687	- Must happen before
				2688	any following
				2689	global/generic
				2690	load/load
				2691	atomic/store/store
				2692	atomic/atomicrmw.
				2693	- Ensures any
				2694	following global
				2695	data read is no
				2696	older than the
				2697	atomicrmw value
				2698	being acquired.
				2699
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2700	atomicrmw acquire - workgroup - generic 1. flat_atomic
				2701	2. waitcnt lgkmcnt(0)
				2702
				2703	- If OpenCL, omit.
				2704	- Must happen before
				2705	any following
				2706	global/generic
				2707	load/load
				2708	atomic/store/store
				2709	atomic/atomicrmw.
				2710	- Ensures any
				2711	following global
				2712	data read is no
				2713	older than the
				2714	atomicrmw value
				2715	being acquired.
				2716
				2717	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2718	- system 2. s_waitcnt vmcnt(0)
				2719
				2720	- Must happen before
				2721	following
				2722	buffer_wbinvl1_vol.
				2723	- Ensures the
				2724	atomicrmw has
				2725	completed before
				2726	invalidating the
				2727	cache.
				2728
				2729	3. buffer_wbinvl1_vol
				2730
				2731	- Must happen before
				2732	any following
				2733	global/generic
				2734	load/load
				2735	atomic/atomicrmw.
				2736	- Ensures that
				2737	following loads
				2738	will not see stale
				2739	global data.
				2740
				2741	atomicrmw acquire - agent - generic 1. flat_atomic
				2742	- system 2. s_waitcnt vmcnt(0) &
				2743	lgkmcnt(0)
				2744
				2745	- If OpenCL, omit
				2746	lgkmcnt(0).
				2747	- Must happen before
				2748	following
				2749	buffer_wbinvl1_vol.
				2750	- Ensures the
				2751	atomicrmw has
				2752	completed before
				2753	invalidating the
				2754	cache.
				2755
				2756	3. buffer_wbinvl1_vol
				2757
				2758	- Must happen before
				2759	any following
				2760	global/generic
				2761	load/load
				2762	atomic/atomicrmw.
				2763	- Ensures that
				2764	following loads
				2765	will not see stale
				2766	global data.
				2767
				2768	fence acquire - singlethread none none
				2769	- wavefront
				2770	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2771
				2772	- If OpenCL and
				2773	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2774	not generic, omit.
				2775	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2776	currently has no
				2777	address space on
				2778	the fence need to
				2779	conservatively
				2780	always generate. If
				2781	fence had an
				2782	address space then
				2783	set to address
				2784	space of OpenCL
				2785	fence flag, or to
				2786	generic if both
				2787	local and global
				2788	flags are
				2789	specified.
				2790	- Must happen after
				2791	any preceding
				2792	local/generic load
				2793	atomic/atomicrmw
				2794	with an equal or
				2795	wider sync scope
				2796	and memory ordering
				2797	stronger than
				2798	unordered (this is
				2799	termed the
				2800	fence-paired-atomic).
				2801	- Must happen before
				2802	any following
				2803	global/generic
				2804	load/load
				2805	atomic/store/store
				2806	atomic/atomicrmw.
				2807	- Ensures any
				2808	following global
				2809	data read is no
				2810	older than the
				2811	value read by the
				2812	fence-paired-atomic.
				2813
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2814	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				2815	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2816
				2817	- If OpenCL and
				2818	address space is
				2819	not generic, omit
				2820	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2821	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2822	currently has no
				2823	address space on
				2824	the fence need to
				2825	conservatively
				2826	always generate
				2827	(see comment for
				2828	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2829	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2830	separate s_waitcnt
				2831	vmcnt(0) and
				2832	s_waitcnt
				2833	lgkmcnt(0) to allow
				2834	them to be
				2835	independently moved
				2836	according to the
				2837	following rules.
				2838	- s_waitcnt vmcnt(0)
				2839	must happen after
				2840	any preceding
				2841	global/generic load
				2842	atomic/atomicrmw
				2843	with an equal or
				2844	wider sync scope
				2845	and memory ordering
				2846	stronger than
				2847	unordered (this is
				2848	termed the
				2849	fence-paired-atomic).
				2850	- s_waitcnt lgkmcnt(0)
				2851	must happen after
				2852	any preceding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2853	local/generic load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2854	atomic/atomicrmw
				2855	with an equal or
				2856	wider sync scope
				2857	and memory ordering
				2858	stronger than
				2859	unordered (this is
				2860	termed the
				2861	fence-paired-atomic).
				2862	- Must happen before
				2863	the following
				2864	buffer_wbinvl1_vol.
				2865	- Ensures that the
				2866	fence-paired atomic
				2867	has completed
				2868	before invalidating
				2869	the
				2870	cache. Therefore
				2871	any following
				2872	locations read must
				2873	be no older than
				2874	the value read by
				2875	the
				2876	fence-paired-atomic.
				2877
				2878	2. buffer_wbinvl1_vol
				2879
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2880	- Must happen before any
				2881	following global/generic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2882	load/load
				2883	atomic/store/store
				2884	atomic/atomicrmw.
				2885	- Ensures that
				2886	following loads
				2887	will not see stale
				2888	global data.
				2889
				2890	Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2891	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2892	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2893	- wavefront - local
				2894	- generic
				2895	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2896
				2897	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2898	- Must happen after
				2899	any preceding
				2900	local/generic
				2901	load/store/load
				2902	atomic/store
				2903	atomic/atomicrmw.
				2904	- Must happen before
				2905	the following
				2906	store.
				2907	- Ensures that all
				2908	memory operations
				2909	to local have
				2910	completed before
				2911	performing the
				2912	store that is being
				2913	released.
				2914
				2915	2. buffer/global/flat_store
				2916	store atomic release - workgroup - local 1. ds_store
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2917	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2918
				2919	- If OpenCL, omit.
				2920	- Must happen after
				2921	any preceding
				2922	local/generic
				2923	load/store/load
				2924	atomic/store
				2925	atomic/atomicrmw.
				2926	- Must happen before
				2927	the following
				2928	store.
				2929	- Ensures that all
				2930	memory operations
				2931	to local have
				2932	completed before
				2933	performing the
				2934	store that is being
				2935	released.
				2936
				2937	2. flat_store
				2938	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2939	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2940
				2941	- If OpenCL, omit
				2942	lgkmcnt(0).
				2943	- Could be split into
				2944	separate s_waitcnt
				2945	vmcnt(0) and
				2946	s_waitcnt
				2947	lgkmcnt(0) to allow
				2948	them to be
				2949	independently moved
				2950	according to the
				2951	following rules.
				2952	- s_waitcnt vmcnt(0)
				2953	must happen after
				2954	any preceding
				2955	global/generic
				2956	load/store/load
				2957	atomic/store
				2958	atomic/atomicrmw.
				2959	- s_waitcnt lgkmcnt(0)
				2960	must happen after
				2961	any preceding
				2962	local/generic
				2963	load/store/load
				2964	atomic/store
				2965	atomic/atomicrmw.
				2966	- Must happen before
				2967	the following
				2968	store.
				2969	- Ensures that all
				2970	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2971	to memory have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2972	completed before
				2973	performing the
				2974	store that is being
				2975	released.
				2976
				2977	2. buffer/global/ds/flat_store
				2978	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2979	- wavefront - local
				2980	- generic
				2981	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2982
				2983	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2984	- Must happen after
				2985	any preceding
				2986	local/generic
				2987	load/store/load
				2988	atomic/store
				2989	atomic/atomicrmw.
				2990	- Must happen before
				2991	the following
				2992	atomicrmw.
				2993	- Ensures that all
				2994	memory operations
				2995	to local have
				2996	completed before
				2997	performing the
				2998	atomicrmw that is
				2999	being released.
				3000
				3001	2. buffer/global/flat_atomic
				3002	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3003	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3004
				3005	- If OpenCL, omit.
				3006	- Must happen after
				3007	any preceding
				3008	local/generic
				3009	load/store/load
				3010	atomic/store
				3011	atomic/atomicrmw.
				3012	- Must happen before
				3013	the following
				3014	atomicrmw.
				3015	- Ensures that all
				3016	memory operations
				3017	to local have
				3018	completed before
				3019	performing the
				3020	atomicrmw that is
				3021	being released.
				3022
				3023	2. flat_atomic
				3024	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				3025	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3026
				3027	- If OpenCL, omit
				3028	lgkmcnt(0).
				3029	- Could be split into
				3030	separate s_waitcnt
				3031	vmcnt(0) and
				3032	s_waitcnt
				3033	lgkmcnt(0) to allow
				3034	them to be
				3035	independently moved
				3036	according to the
				3037	following rules.
				3038	- s_waitcnt vmcnt(0)
				3039	must happen after
				3040	any preceding
				3041	global/generic
				3042	load/store/load
				3043	atomic/store
				3044	atomic/atomicrmw.
				3045	- s_waitcnt lgkmcnt(0)
				3046	must happen after
				3047	any preceding
				3048	local/generic
				3049	load/store/load
				3050	atomic/store
				3051	atomic/atomicrmw.
				3052	- Must happen before
				3053	the following
				3054	atomicrmw.
				3055	- Ensures that all
				3056	memory operations
				3057	to global and local
				3058	have completed
				3059	before performing
				3060	the atomicrmw that
				3061	is being released.
				3062
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3063	2. buffer/global/ds/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3064	fence release - singlethread none none
				3065	- wavefront
				3066	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				3067
				3068	- If OpenCL and
				3069	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3070	not generic, omit.
				3071	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3072	currently has no
				3073	address space on
				3074	the fence need to
				3075	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3076	always generate. If
				3077	fence had an
				3078	address space then
				3079	set to address
				3080	space of OpenCL
				3081	fence flag, or to
				3082	generic if both
				3083	local and global
				3084	flags are
				3085	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3086	- Must happen after
				3087	any preceding
				3088	local/generic
				3089	load/load
				3090	atomic/store/store
				3091	atomic/atomicrmw.
				3092	- Must happen before
				3093	any following store
				3094	atomic/atomicrmw
				3095	with an equal or
				3096	wider sync scope
				3097	and memory ordering
				3098	stronger than
				3099	unordered (this is
				3100	termed the
				3101	fence-paired-atomic).
				3102	- Ensures that all
				3103	memory operations
				3104	to local have
				3105	completed before
				3106	performing the
				3107	following
				3108	fence-paired-atomic.
				3109
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3110	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				3111	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3112
				3113	- If OpenCL and
				3114	address space is
				3115	not generic, omit
				3116	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3117	- If OpenCL and
				3118	address space is
				3119	local, omit
				3120	vmcnt(0).
				3121	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3122	currently has no
				3123	address space on
				3124	the fence need to
				3125	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3126	always generate. If
				3127	fence had an
				3128	address space then
				3129	set to address
				3130	space of OpenCL
				3131	fence flag, or to
				3132	generic if both
				3133	local and global
				3134	flags are
				3135	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3136	- Could be split into
				3137	separate s_waitcnt
				3138	vmcnt(0) and
				3139	s_waitcnt
				3140	lgkmcnt(0) to allow
				3141	them to be
				3142	independently moved
				3143	according to the
				3144	following rules.
				3145	- s_waitcnt vmcnt(0)
				3146	must happen after
				3147	any preceding
				3148	global/generic
				3149	load/store/load
				3150	atomic/store
				3151	atomic/atomicrmw.
				3152	- s_waitcnt lgkmcnt(0)
				3153	must happen after
				3154	any preceding
				3155	local/generic
				3156	load/store/load
				3157	atomic/store
				3158	atomic/atomicrmw.
				3159	- Must happen before
				3160	any following store
				3161	atomic/atomicrmw
				3162	with an equal or
				3163	wider sync scope
				3164	and memory ordering
				3165	stronger than
				3166	unordered (this is
				3167	termed the
				3168	fence-paired-atomic).
				3169	- Ensures that all
				3170	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3171	have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3172	completed before
				3173	performing the
				3174	following
				3175	fence-paired-atomic.
				3176
				3177	Acquire-Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3178	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3179	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3180	- wavefront - local
				3181	- generic
				3182	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3183
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3184	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3185	- Must happen after
				3186	any preceding
				3187	local/generic
				3188	load/store/load
				3189	atomic/store
				3190	atomic/atomicrmw.
				3191	- Must happen before
				3192	the following
				3193	atomicrmw.
				3194	- Ensures that all
				3195	memory operations
				3196	to local have
				3197	completed before
				3198	performing the
				3199	atomicrmw that is
				3200	being released.
				3201
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3202	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3203	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3204	2. s_waitcnt lgkmcnt(0)
				3205
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3206	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3207	- Must happen before
				3208	any following
				3209	global/generic
				3210	load/load
				3211	atomic/store/store
				3212	atomic/atomicrmw.
				3213	- Ensures any
				3214	following global
				3215	data read is no
				3216	older than the load
				3217	atomic value being
				3218	acquired.
				3219
				3220	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3221
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3222	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3223	- Must happen after
				3224	any preceding
				3225	local/generic
				3226	load/store/load
				3227	atomic/store
				3228	atomic/atomicrmw.
				3229	- Must happen before
				3230	the following
				3231	atomicrmw.
				3232	- Ensures that all
				3233	memory operations
				3234	to local have
				3235	completed before
				3236	performing the
				3237	atomicrmw that is
				3238	being released.
				3239
				3240	2. flat_atomic
				3241	3. s_waitcnt lgkmcnt(0)
				3242
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3243	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3244	- Must happen before
				3245	any following
				3246	global/generic
				3247	load/load
				3248	atomic/store/store
				3249	atomic/atomicrmw.
				3250	- Ensures any
				3251	following global
				3252	data read is no
				3253	older than the load
				3254	atomic value being
				3255	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3256
				3257	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3258	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3259
				3260	- If OpenCL, omit
				3261	lgkmcnt(0).
				3262	- Could be split into
				3263	separate s_waitcnt
				3264	vmcnt(0) and
				3265	s_waitcnt
				3266	lgkmcnt(0) to allow
				3267	them to be
				3268	independently moved
				3269	according to the
				3270	following rules.
				3271	- s_waitcnt vmcnt(0)
				3272	must happen after
				3273	any preceding
				3274	global/generic
				3275	load/store/load
				3276	atomic/store
				3277	atomic/atomicrmw.
				3278	- s_waitcnt lgkmcnt(0)
				3279	must happen after
				3280	any preceding
				3281	local/generic
				3282	load/store/load
				3283	atomic/store
				3284	atomic/atomicrmw.
				3285	- Must happen before
				3286	the following
				3287	atomicrmw.
				3288	- Ensures that all
				3289	memory operations
				3290	to global have
				3291	completed before
				3292	performing the
				3293	atomicrmw that is
				3294	being released.
				3295
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3296	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3297	3. s_waitcnt vmcnt(0)
				3298
				3299	- Must happen before
				3300	following
				3301	buffer_wbinvl1_vol.
				3302	- Ensures the
				3303	atomicrmw has
				3304	completed before
				3305	invalidating the
				3306	cache.
				3307
				3308	4. buffer_wbinvl1_vol
				3309
				3310	- Must happen before
				3311	any following
				3312	global/generic
				3313	load/load
				3314	atomic/atomicrmw.
				3315	- Ensures that
				3316	following loads
				3317	will not see stale
				3318	global data.
				3319
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3320	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3321	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3322
				3323	- If OpenCL, omit
				3324	lgkmcnt(0).
				3325	- Could be split into
				3326	separate s_waitcnt
				3327	vmcnt(0) and
				3328	s_waitcnt
				3329	lgkmcnt(0) to allow
				3330	them to be
				3331	independently moved
				3332	according to the
				3333	following rules.
				3334	- s_waitcnt vmcnt(0)
				3335	must happen after
				3336	any preceding
				3337	global/generic
				3338	load/store/load
				3339	atomic/store
				3340	atomic/atomicrmw.
				3341	- s_waitcnt lgkmcnt(0)
				3342	must happen after
				3343	any preceding
				3344	local/generic
				3345	load/store/load
				3346	atomic/store
				3347	atomic/atomicrmw.
				3348	- Must happen before
				3349	the following
				3350	atomicrmw.
				3351	- Ensures that all
				3352	memory operations
				3353	to global have
				3354	completed before
				3355	performing the
				3356	atomicrmw that is
				3357	being released.
				3358
				3359	2. flat_atomic
				3360	3. s_waitcnt vmcnt(0) &
				3361	lgkmcnt(0)
				3362
				3363	- If OpenCL, omit
				3364	lgkmcnt(0).
				3365	- Must happen before
				3366	following
				3367	buffer_wbinvl1_vol.
				3368	- Ensures the
				3369	atomicrmw has
				3370	completed before
				3371	invalidating the
				3372	cache.
				3373
				3374	4. buffer_wbinvl1_vol
				3375
				3376	- Must happen before
				3377	any following
				3378	global/generic
				3379	load/load
				3380	atomic/atomicrmw.
				3381	- Ensures that
				3382	following loads
				3383	will not see stale
				3384	global data.
				3385
				3386	fence acq_rel - singlethread none none
				3387	- wavefront
				3388	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3389
				3390	- If OpenCL and
				3391	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3392	not generic, omit.
				3393	- However,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3394	since LLVM
				3395	currently has no
				3396	address space on
				3397	the fence need to
				3398	conservatively
				3399	always generate
				3400	(see comment for
				3401	previous fence).
				3402	- Must happen after
				3403	any preceding
				3404	local/generic
				3405	load/load
				3406	atomic/store/store
				3407	atomic/atomicrmw.
				3408	- Must happen before
				3409	any following
				3410	global/generic
				3411	load/load
				3412	atomic/store/store
				3413	atomic/atomicrmw.
				3414	- Ensures that all
				3415	memory operations
				3416	to local have
				3417	completed before
				3418	performing any
				3419	following global
				3420	memory operations.
				3421	- Ensures that the
				3422	preceding
				3423	local/generic load
				3424	atomic/atomicrmw
				3425	with an equal or
				3426	wider sync scope
				3427	and memory ordering
				3428	stronger than
				3429	unordered (this is
				3430	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3431	acquire-fence-paired-atomic
				3432	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3433	before following
				3434	global memory
				3435	operations. This
				3436	satisfies the
				3437	requirements of
				3438	acquire.
				3439	- Ensures that all
				3440	previous memory
				3441	operations have
				3442	completed before a
				3443	following
				3444	local/generic store
				3445	atomic/atomicrmw
				3446	with an equal or
				3447	wider sync scope
				3448	and memory ordering
				3449	stronger than
				3450	unordered (this is
				3451	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3452	release-fence-paired-atomic
				3453	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3454	requirements of
				3455	release.
				3456
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3457	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				3458	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3459
				3460	- If OpenCL and
				3461	address space is
				3462	not generic, omit
				3463	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3464	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3465	currently has no
				3466	address space on
				3467	the fence need to
				3468	conservatively
				3469	always generate
				3470	(see comment for
				3471	previous fence).
				3472	- Could be split into
				3473	separate s_waitcnt
				3474	vmcnt(0) and
				3475	s_waitcnt
				3476	lgkmcnt(0) to allow
				3477	them to be
				3478	independently moved
				3479	according to the
				3480	following rules.
				3481	- s_waitcnt vmcnt(0)
				3482	must happen after
				3483	any preceding
				3484	global/generic
				3485	load/store/load
				3486	atomic/store
				3487	atomic/atomicrmw.
				3488	- s_waitcnt lgkmcnt(0)
				3489	must happen after
				3490	any preceding
				3491	local/generic
				3492	load/store/load
				3493	atomic/store
				3494	atomic/atomicrmw.
				3495	- Must happen before
				3496	the following
				3497	buffer_wbinvl1_vol.
				3498	- Ensures that the
				3499	preceding
				3500	global/local/generic
				3501	load
				3502	atomic/atomicrmw
				3503	with an equal or
				3504	wider sync scope
				3505	and memory ordering
				3506	stronger than
				3507	unordered (this is
				3508	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3509	acquire-fence-paired-atomic
				3510	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3511	before invalidating
				3512	the cache. This
				3513	satisfies the
				3514	requirements of
				3515	acquire.
				3516	- Ensures that all
				3517	previous memory
				3518	operations have
				3519	completed before a
				3520	following
				3521	global/local/generic
				3522	store
				3523	atomic/atomicrmw
				3524	with an equal or
				3525	wider sync scope
				3526	and memory ordering
				3527	stronger than
				3528	unordered (this is
				3529	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3530	release-fence-paired-atomic
				3531	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3532	requirements of
				3533	release.
				3534
				3535	2. buffer_wbinvl1_vol
				3536
				3537	- Must happen before
				3538	any following
				3539	global/generic
				3540	load/load
				3541	atomic/store/store
				3542	atomic/atomicrmw.
				3543	- Ensures that
				3544	following loads
				3545	will not see stale
				3546	global data. This
				3547	satisfies the
				3548	requirements of
				3549	acquire.
				3550
				3551	Sequential Consistent Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3552	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3553	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3554	- wavefront - local load atomic acquire,
				3555	- generic except must generated
				3556	all instructions even
				3557	for OpenCL.*
				3558	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3559	- generic
				3560	- Must
				3561	happen after
				3562	preceding
				3563	global/generic load
				3564	atomic/store
				3565	atomic/atomicrmw
				3566	with memory
				3567	ordering of seq_cst
				3568	and with equal or
				3569	wider sync scope.
				3570	(Note that seq_cst
				3571	fences have their
				3572	own s_waitcnt
				3573	lgkmcnt(0) and so do
				3574	not need to be
				3575	considered.)
				3576	- Ensures any
				3577	preceding
				3578	sequential
				3579	consistent local
				3580	memory instructions
				3581	have completed
				3582	before executing
				3583	this sequentially
				3584	consistent
				3585	instruction. This
				3586	prevents reordering
				3587	a seq_cst store
				3588	followed by a
				3589	seq_cst load. (Note
				3590	that seq_cst is
				3591	stronger than
				3592	acquire/release as
				3593	the reordering of
				3594	load acquire
				3595	followed by a store
				3596	release is
				3597	prevented by the
				3598	waitcnt of
				3599	the release, but
				3600	there is nothing
				3601	preventing a store
				3602	release followed by
				3603	load acquire from
				3604	competing out of
				3605	order.)
				3606
				3607	2. *Following
				3608	instructions same as
				3609	corresponding load
				3610	atomic acquire,
				3611	except must generated
				3612	all instructions even
				3613	for OpenCL.*
				3614	load atomic seq_cst - workgroup - local *Same as corresponding
				3615	load atomic acquire,
				3616	except must generated
				3617	all instructions even
				3618	for OpenCL.*
				3619	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				3620	- system - generic vmcnt(0)
				3621
				3622	- Could be split into
				3623	separate s_waitcnt
				3624	vmcnt(0)
				3625	and s_waitcnt
				3626	lgkmcnt(0) to allow
				3627	them to be
				3628	independently moved
				3629	according to the
				3630	following rules.
				3631	- waitcnt lgkmcnt(0)
				3632	must happen after
				3633	preceding
				3634	global/generic load
				3635	atomic/store
				3636	atomic/atomicrmw
				3637	with memory
				3638	ordering of seq_cst
				3639	and with equal or
				3640	wider sync scope.
				3641	(Note that seq_cst
				3642	fences have their
				3643	own s_waitcnt
				3644	lgkmcnt(0) and so do
				3645	not need to be
				3646	considered.)
				3647	- waitcnt vmcnt(0)
				3648	must happen after
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3649	preceding
				3650	global/generic load
				3651	atomic/store
				3652	atomic/atomicrmw
				3653	with memory
				3654	ordering of seq_cst
				3655	and with equal or
				3656	wider sync scope.
				3657	(Note that seq_cst
				3658	fences have their
				3659	own s_waitcnt
				3660	vmcnt(0) and so do
				3661	not need to be
				3662	considered.)
				3663	- Ensures any
				3664	preceding
				3665	sequential
				3666	consistent global
				3667	memory instructions
				3668	have completed
				3669	before executing
				3670	this sequentially
				3671	consistent
				3672	instruction. This
				3673	prevents reordering
				3674	a seq_cst store
				3675	followed by a
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3676	seq_cst load. (Note
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3677	that seq_cst is
				3678	stronger than
				3679	acquire/release as
				3680	the reordering of
				3681	load acquire
				3682	followed by a store
				3683	release is
				3684	prevented by the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3685	waitcnt of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3686	the release, but
				3687	there is nothing
				3688	preventing a store
				3689	release followed by
				3690	load acquire from
				3691	competing out of
				3692	order.)
				3693
				3694	2. *Following
				3695	instructions same as
				3696	corresponding load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3697	atomic acquire,
				3698	except must generated
				3699	all instructions even
				3700	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3701	store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3702	- wavefront - local store atomic release,
				3703	- workgroup - generic except must generated
				3704	all instructions even
				3705	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3706	store atomic seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3707	- system - generic store atomic release,
				3708	except must generated
				3709	all instructions even
				3710	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3711	atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3712	- wavefront - local atomicrmw acq_rel,
				3713	- workgroup - generic except must generated
				3714	all instructions even
				3715	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3716	atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3717	- system - generic atomicrmw acq_rel,
				3718	except must generated
				3719	all instructions even
				3720	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3721	fence seq_cst - singlethread none *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3722	- wavefront fence acq_rel,
				3723	- workgroup except must generated
				3724	- agent all instructions even
				3725	- system for OpenCL.*
				3726	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3727
				3728	The memory order also adds the single thread optimization constrains defined in
				3729	table
				3730	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3731
				3732	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3733	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3734
				3735	============ ==============================================================
				3736	LLVM Memory Optimization Constraints
				3737	Ordering
				3738	============ ==============================================================
				3739	unordered none
				3740	monotonic none
				3741	acquire - If a load atomic/atomicrmw then no following load/load
				3742	atomic/store/ store atomic/atomicrmw/fence instruction can
				3743	be moved before the acquire.
				3744	- If a fence then same as load atomic, plus no preceding
				3745	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3746	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3747	atomic/store/ store atomic/atomicrmw/fence instruction can
				3748	be moved after the release.
				3749	- If a fence then same as store atomic, plus no following
				3750	associated fence-paired-atomic can be moved before the
				3751	fence.
				3752	acq_rel Same constraints as both acquire and release.
				3753	seq_cst - If a load atomic then same constraints as acquire, plus no
				3754	preceding sequentially consistent load atomic/store
				3755	atomic/atomicrmw/fence instruction can be moved after the
				3756	seq_cst.
				3757	- If a store atomic then the same constraints as release, plus
				3758	no following sequentially consistent load atomic/store
				3759	atomic/atomicrmw/fence instruction can be moved before the
				3760	seq_cst.
				3761	- If an atomicrmw/fence then same constraints as acq_rel.
				3762	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3763
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3764	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3765	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3766
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3767	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3768	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3769	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3770
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3771	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3772	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3773
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3774	=================== =============== =============== =======================
				3775	Usage Code Sequence Trap Handler Description
				3776	Inputs
				3777	=================== =============== =============== =======================
				3778	reserved ``s_trap 0x00`` Reserved by hardware.
				3779	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3780	``queue_ptr`` ``debugtrap``
				3781	``VGPR0``: intrinsic (not
				3782	``arg`` implemented).
				3783	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3784	``queue_ptr`` terminated and its
				3785	associated queue put
				3786	into the error state.
Tony Tye	43259df	2018-05-16 16:19:34 +0000	[diff] [blame]	3787	``llvm.debugtrap`` ``s_trap 0x03`` - If debugger not
				3788	installed then
				3789	behaves as a
				3790	no-operation. The
				3791	trap handler is
				3792	entered and
				3793	immediately returns
				3794	to continue
				3795	execution of the
				3796	wavefront.
				3797	- If the debugger is
				3798	installed, causes
				3799	the debug trap to be
				3800	reported by the
				3801	debugger and the
				3802	wavefront is put in
				3803	the halt state until
				3804	resumed by the
				3805	debugger.
				3806	reserved ``s_trap 0x04`` Reserved.
				3807	reserved ``s_trap 0x05`` Reserved.
				3808	reserved ``s_trap 0x06`` Reserved.
				3809	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3810	breakpoints.
Tony Tye	43259df	2018-05-16 16:19:34 +0000	[diff] [blame]	3811	reserved ``s_trap 0x08`` Reserved.
				3812	reserved ``s_trap 0xfe`` Reserved.
				3813	reserved ``s_trap 0xff`` Reserved.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3814	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3815
Tim Corringham	af2dfc6	2018-04-04 13:02:09 +0000	[diff] [blame]	3816	AMDPAL
				3817	------
				3818
				3819	This section provides code conventions used when the target triple OS is
				3820	``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
				3821	from the application/runtime to each invocation of a hardware shader. These
				3822	parameters include both generic, application-controlled parameters called
				3823	user data as well as system-generated parameters that are a product of the
				3824	draw or dispatch execution.
				3825
				3826	User Data
				3827	~~~~~~~~~
				3828
				3829	Each hardware stage has a set of 32-bit user data registers which can be
				3830	written from a command buffer and then loaded into SGPRs when waves are launched
				3831	via a subsequent dispatch or draw operation. This is the way most arguments are
				3832	passed from the application/runtime to a hardware shader.
				3833
				3834	Compute User Data
				3835	~~~~~~~~~~~~~~~~~
				3836
				3837	Compute shader user data mappings are simpler than graphics shaders, and have a
				3838	fixed mapping.
				3839
				3840	Note that there are always 10 available user data entries in registers -
				3841	entries beyond that limit must be fetched from memory (via the spill table
				3842	pointer) by the shader.
				3843
				3844	.. table:: PAL Compute Shader User Data Registers
				3845	:name: pal-compute-user-data-registers
				3846
				3847	============= ================================
				3848	User Register Description
				3849	============= ================================
				3850	0 Global Internal Table (32-bit pointer)
				3851	1 Per-Shader Internal Table (32-bit pointer)
				3852	2 - 11 Application-Controlled User Data (10 32-bit values)
				3853	12 Spill Table (32-bit pointer)
				3854	13 - 14 Thread Group Count (64-bit pointer)
				3855	15 GDS Range
				3856	============= ================================
				3857
				3858	Graphics User Data
				3859	~~~~~~~~~~~~~~~~~~
				3860
				3861	Graphics pipelines support a much more flexible user data mapping:
				3862
				3863	.. table:: PAL Graphics Shader User Data Registers
				3864	:name: pal-graphics-user-data-registers
				3865
				3866	============= ================================
				3867	User Register Description
				3868	============= ================================
				3869	0 Global Internal Table (32-bit pointer)
				3870	+ Per-Shader Internal Table (32-bit pointer)
				3871	+ 1-15 Application Controlled User Data
				3872	(1-15 Contiguous 32-bit Values in Registers)
				3873	+ Spill Table (32-bit pointer)
				3874	+ Draw Index (First Stage Only)
				3875	+ Vertex Offset (First Stage Only)
				3876	+ Instance Offset (First Stage Only)
				3877	============= ================================
				3878
				3879	The placement of the global internal table remains fixed in the first *user
				3880	data SGPR register*. Otherwise all parameters are optional, and can be mapped
				3881	to any desired user data SGPR register, with the following regstrictions:
				3882
				3883	* Draw Index, Vertex Offset, and Instance Offset can only be used by the first
				3884	activehardware stage in a graphics pipeline (i.e. where the API vertex
				3885	shader runs).
				3886
				3887	* Application-controlled user data must be mapped into a contiguous range of
				3888	user data registers.
				3889
				3890	* The application-controlled user data range supports compaction remapping, so
				3891	only entries that are actually consumed by the shader must be assigned to
				3892	corresponding registers. Note that in order to support an efficient runtime
				3893	implementation, the remapping must pack registers in the same order as
				3894	entries, with unused entries removed.
				3895
				3896	.. _pal_global_internal_table:
				3897
				3898	Global Internal Table
				3899	~~~~~~~~~~~~~~~~~~~~~
				3900
				3901	The global internal table is a table of shader resource descriptors (SRDs) that
				3902	define how certain engine-wide, runtime-managed resources should be accessed
				3903	from a shader. The majority of these resources have HW-defined formats, and it
				3904	is up to the compiler to write/read data as required by the target hardware.
				3905
				3906	The following table illustrates the required format:
				3907
				3908	.. table:: PAL Global Internal Table
				3909	:name: pal-git-table
				3910
				3911	============= ================================
				3912	Offset Description
				3913	============= ================================
				3914	0-3 Graphics Scratch SRD
				3915	4-7 Compute Scratch SRD
				3916	8-11 ES/GS Ring Output SRD
				3917	12-15 ES/GS Ring Input SRD
				3918	16-19 GS/VS Ring Output #0
				3919	20-23 GS/VS Ring Output #1
				3920	24-27 GS/VS Ring Output #2
				3921	28-31 GS/VS Ring Output #3
				3922	32-35 GS/VS Ring Input SRD
				3923	36-39 Tessellation Factor Buffer SRD
				3924	40-43 Off-Chip LDS Buffer SRD
				3925	44-47 Off-Chip Param Cache Buffer SRD
				3926	48-51 Sample Position Buffer SRD
				3927	52 vaRange::ShadowDescriptorTable High Bits
				3928	============= ================================
				3929
				3930	The pointer to the global internal table passed to the shader as user data
				3931	is a 32-bit pointer. The top 32 bits should be assumed to be the same as
				3932	the top 32 bits of the pipeline, so the shader may use the program
				3933	counter's top 32 bits.
				3934
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3935	Unspecified OS
				3936	--------------
				3937
				3938	This section provides code conventions used when the target triple OS is
				3939	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3940
				3941	Trap Handler ABI
				3942	~~~~~~~~~~~~~~~~
				3943
				3944	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3945	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3946	instructions are handled as follows:
				3947
				3948	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3949	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3950
				3951	=============== =============== ===========================================
				3952	Usage Code Sequence Description
				3953	=============== =============== ===========================================
				3954	llvm.trap s_endpgm Causes wavefront to be terminated.
				3955	llvm.debugtrap none Compiler warning given that there is no
				3956	trap handler installed.
				3957	=============== =============== ===========================================
				3958
				3959	Source Languages
				3960	================
				3961
				3962	.. _amdgpu-opencl:
				3963
				3964	OpenCL
				3965	------
				3966
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3967	When the language is OpenCL the following differences occur:
				3968
				3969	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3970	2. The AMDGPU backend appends additional arguments to the kernel's explicit
				3971	arguments for the AMDHSA OS (see
				3972	:ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3973	3. Additional metadata is generated
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3974	(see :ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3975
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3976	.. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
				3977	:name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
				3978
				3979	======== ==== ========= ===========================================
				3980	Position Byte Byte Description
				3981	Size Alignment
				3982	======== ==== ========= ===========================================
Tony Tye	88441a3	2018-03-23 18:58:47 +0000	[diff] [blame]	3983	1 8 8 OpenCL Global Offset X
				3984	2 8 8 OpenCL Global Offset Y
				3985	3 8 8 OpenCL Global Offset Z
				3986	4 8 8 OpenCL address of printf buffer
				3987	5 8 8 OpenCL address of virtual queue used by
				3988	enqueue_kernel.
				3989	6 8 8 OpenCL address of AqlWrap struct used by
				3990	enqueue_kernel.
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3991	======== ==== ========= ===========================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3992
				3993	.. _amdgpu-hcc:
				3994
				3995	HCC
				3996	---
				3997
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3998	When the language is HCC the following differences occur:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3999
				4000	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				4001
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4002	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4003	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4004
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4005	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	4006	It supports AMDGCN GFX6-GFX9.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4007
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4008	This section describes general syntax for instructions and operands.
				4009
				4010	Instructions
				4011	~~~~~~~~~~~~
				4012
				4013	.. toctree::
				4014	:hidden:
				4015
				4016	AMDGPUAsmGFX7
				4017	AMDGPUAsmGFX8
				4018	AMDGPUAsmGFX9
				4019	AMDGPUOperandSyntax
				4020
				4021	An instruction has the following syntax:
				4022
				4023	<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...
				4024
				4025	Note that operands are normally comma-separated while modifiers are space-separated.
				4026
				4027	The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
				4028
				4029	See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
				4030	:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
				4031
				4032	Note that features under development are not included in this description.
				4033
				4034	For more information about instructions, their semantics and supported combinations of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4035	operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4036	[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4037
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4038	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4039	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4040
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4041	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4042
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4043	* SGPR registers: s0, ... or s[0], ...
				4044	* VGPR registers: v0, ... or v[0], ...
				4045	* TTMP registers: ttmp0, ... or ttmp[0], ...
				4046	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				4047	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				4048	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				4049	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				4050	* Register index expressions: v[2*2], s[1-1:2-1]
				4051	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4052
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4053	Modifiers
				4054	~~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4055
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4056	Detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4057
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4058	Instruction Examples
				4059	~~~~~~~~~~~~~~~~~~~~
				4060
				4061	DS
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4062	++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4063
				4064	.. code-block:: nasm
				4065
				4066	ds_add_u32 v2, v4 offset:16
				4067	ds_write_src2_b64 v2 offset0:4 offset1:8
				4068	ds_cmpst_f32 v2, v4, v6
				4069	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				4070
				4071
				4072	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				4073
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4074	FLAT
				4075	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4076
				4077	.. code-block:: nasm
				4078
				4079	flat_load_dword v1, v[3:4]
				4080	flat_store_dwordx3 v[3:4], v[5:7]
				4081	flat_atomic_swap v1, v[3:4], v5 glc
				4082	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				4083	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				4084
				4085	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				4086
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4087	MUBUF
				4088	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4089
				4090	.. code-block:: nasm
				4091
				4092	buffer_load_dword v1, off, s[4:7], s1
				4093	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				4094	buffer_store_format_xy v[1:2], off, s[4:7], s1
				4095	buffer_wbinvl1
				4096	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				4097
				4098	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				4099
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4100	SMRD/SMEM
				4101	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4102
				4103	.. code-block:: nasm
				4104
				4105	s_load_dword s1, s[2:3], 0xfc
				4106	s_load_dwordx8 s[8:15], s[2:3], s4
				4107	s_load_dwordx16 s[88:103], s[2:3], s4
				4108	s_dcache_inv_vol
				4109	s_memtime s[4:5]
				4110
				4111	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				4112
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4113	SOP1
				4114	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4115
				4116	.. code-block:: nasm
				4117
				4118	s_mov_b32 s1, s2
				4119	s_mov_b64 s[0:1], 0x80000000
				4120	s_cmov_b32 s1, 200
				4121	s_wqm_b64 s[2:3], s[4:5]
				4122	s_bcnt0_i32_b64 s1, s[2:3]
				4123	s_swappc_b64 s[2:3], s[4:5]
				4124	s_cbranch_join s[4:5]
				4125
				4126	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				4127
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4128	SOP2
				4129	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4130
				4131	.. code-block:: nasm
				4132
				4133	s_add_u32 s1, s2, s3
				4134	s_and_b64 s[2:3], s[4:5], s[6:7]
				4135	s_cselect_b32 s1, s2, s3
				4136	s_andn2_b32 s2, s4, s6
				4137	s_lshr_b64 s[2:3], s[4:5], s6
				4138	s_ashr_i32 s2, s4, s6
				4139	s_bfm_b64 s[2:3], s4, s6
				4140	s_bfe_i64 s[2:3], s[4:5], s6
				4141	s_cbranch_g_fork s[4:5], s[6:7]
				4142
				4143	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				4144
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4145	SOPC
				4146	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4147
				4148	.. code-block:: nasm
				4149
				4150	s_cmp_eq_i32 s1, s2
				4151	s_bitcmp1_b32 s1, s2
				4152	s_bitcmp0_b64 s[2:3], s4
				4153	s_setvskip s3, s5
				4154
				4155	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				4156
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4157	SOPP
				4158	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4159
				4160	.. code-block:: nasm
				4161
				4162	s_barrier
				4163	s_nop 2
				4164	s_endpgm
				4165	s_waitcnt 0 ; Wait for all counters to be 0
				4166	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				4167	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				4168	s_sethalt 9
				4169	s_sleep 10
				4170	s_sendmsg 0x1
				4171	s_sendmsg sendmsg(MSG_INTERRUPT)
				4172	s_trap 1
				4173
				4174	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				4175
				4176	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	4177	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4178	range or acceptable values.
				4179
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4180	VALU
				4181	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4182
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4183	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				4184	the assembler will automatically use optimal encoding based on its operands.
				4185	To force specific encoding, one can add a suffix to the opcode of the instruction:
				4186
				4187	* _e32 for 32-bit VOP1/VOP2/VOPC
				4188	* _e64 for 64-bit VOP3
				4189	* _dpp for VOP_DPP
				4190	* _sdwa for VOP_SDWA
				4191
				4192	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4193
				4194	.. code-block:: nasm
				4195
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4196	v_mov_b32 v1, v2
				4197	v_mov_b32_e32 v1, v2
				4198	v_nop
				4199	v_cvt_f64_i32_e32 v[1:2], v2
				4200	v_floor_f32_e32 v1, v2
				4201	v_bfrev_b32_e32 v1, v2
				4202	v_add_f32_e32 v1, v2, v3
				4203	v_mul_i32_i24_e64 v1, v2, 3
				4204	v_mul_i32_i24_e32 v1, -3, v3
				4205	v_mul_i32_i24_e32 v1, -100, v3
				4206	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				4207	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4208
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4209	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4210
				4211	.. code-block:: nasm
				4212
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4213	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				4214	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4215	v_mov_b32 v0, v0 wave_shl:1
				4216	v_mov_b32 v0, v0 row_mirror
				4217	v_mov_b32 v0, v0 row_bcast:31
				4218	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4219	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4220	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4221
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4222	VOP_SDWA examples:
				4223
				4224	.. code-block:: nasm
				4225
				4226	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				4227	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				4228	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				4229	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				4230	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				4231
				4232	For full list of supported instructions, refer to "Vector ALU instructions".
				4233
				4234	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4235	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4236
				4237	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				4238	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4239
				4240	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4241	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4242
				4243	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4244	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4245
				4246	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4247	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				4248
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4249
				4250	major, minor, and stepping are all integers that describe the instruction
				4251	set architecture (ISA) version of the assembly program.
				4252
				4253	vendor and arch are quoted strings. vendor should always be equal to
				4254	"AMD" and arch should always be equal to "AMDGPU".
				4255
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4256	By default, the assembler will derive the ISA version, vendor, and arch
				4257	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4258
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4259	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4260	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4261
				4262	This directives specifies that the symbol with given name is a kernel entry point
				4263	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4264
				4265	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4266	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4267
				4268	This directive marks the beginning of a list of key / value pairs that are used
				4269	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				4270	The list must be terminated by the .end_amd_kernel_code_t directive. For
				4271	any amd_kernel_code_t values that are unspecified a default value will be
				4272	used. The default value for all keys is 0, with the following exceptions:
				4273
				4274	- kernel_code_version_major defaults to 1.
				4275	- machine_kind defaults to 1.
				4276	- machine_version_major, machine_version_minor, and
				4277	machine_version_stepping are derived from the value of the -mcpu option
				4278	that is passed to the assembler.
				4279	- kernel_code_entry_byte_offset defaults to 256.
				4280	- wavefront_size defaults to 6.
				4281	- kernarg_segment_alignment, group_segment_alignment, and
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	4282	private_segment_alignment default to 4. Note that alignments are specified
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4283	as a power of two, so a value of n means an alignment of 2^ n.
				4284
				4285	The .amd_kernel_code_t directive must be placed immediately after the
				4286	function label and before any instructions.
				4287
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4288	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				4289	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4290
				4291	Here is an example of a minimal amd_kernel_code_t specification:
				4292
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	4293	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4294
				4295	.hsa_code_object_version 1,0
				4296	.hsa_code_object_isa
				4297
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4298	.hsatext
				4299	.globl hello_world
				4300	.p2align 8
				4301	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4302
				4303	hello_world:
				4304
				4305	.amd_kernel_code_t
				4306	enable_sgpr_kernarg_segment_ptr = 1
				4307	is_ptr64 = 1
				4308	compute_pgm_rsrc1_vgprs = 0
				4309	compute_pgm_rsrc1_sgprs = 0
				4310	compute_pgm_rsrc2_user_sgpr = 2
				4311	kernarg_segment_byte_size = 8
				4312	wavefront_sgpr_count = 2
				4313	workitem_vgpr_count = 3
				4314	.end_amd_kernel_code_t
				4315
				4316	s_load_dwordx2 s[0:1], s[0:1] 0x0
				4317	v_mov_b32 v0, 3.14159
				4318	s_waitcnt lgkmcnt(0)
				4319	v_mov_b32 v1, s0
				4320	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4321	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4322	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	4323	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4324	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4325
				4326	Additional Documentation
				4327	========================
				4328
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4329	.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				4330	.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				4331	.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				4332	.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				4333	.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				4334	.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				4335	.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
				4336	.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4337	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				4338	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				4339	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				4340	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				4341	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	4342	.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4343	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				4344	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__