Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: 773e0742fa9378364c77aefb4ac78d906633c728 [file] [log] [blame]

Eugene Zelenko	3507b04	2018-03-21 17:09:35 +0000	[diff] [blame]	1	=============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	26	.. table:: AMDGPU Architectures
				27	:name: amdgpu-architecture-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	28
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	29	============ ==============================================================
				30	Architecture Description
				31	============ ==============================================================
				32	``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
				33	``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
				34	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	35
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	36	.. table:: AMDGPU Vendors
				37	:name: amdgpu-vendor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	38
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	39	============ ==============================================================
				40	Vendor Description
				41	============ ==============================================================
				42	``amd`` Can be used for all AMD GPU usage.
				43	``mesa3d`` Can be used if the OS is ``mesa3d``.
				44	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	45
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	46	.. table:: AMDGPU Operating Systems
				47	:name: amdgpu-os-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	48
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	49	============== ============================================================
				50	OS Description
				51	============== ============================================================
				52	<empty> Defaults to the unknown OS.
				53	``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
				54	such as AMD's ROCm [AMD-ROCm]_.
				55	``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
				56	runtime.
				57	``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
				58	runtime.
				59	============== ============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	60
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	61	.. table:: AMDGPU Environments
				62	:name: amdgpu-environment-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	63
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	64	============ ==============================================================
				65	Environment Description
				66	============ ==============================================================
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	67	<empty> Default.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	68	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	69
				70	.. _amdgpu-processors:
				71
				72	Processors
				73	----------
				74
				75	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				76	names from both the Processor and Alternative Processor can be used.
				77
				78	.. table:: AMDGPU Processors
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	79	:name: amdgpu-processor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	80
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	81	=========== =============== ============ ===== ========= ======= ==================
				82	Processor Alternative Target dGPU/ Target ROCm Example
				83	Processor Triple APU Features Support Products
				84	Architecture Supported
				85	[Default]
				86	=========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	87	Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	88	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	89	``r600`` ``r600`` dGPU
				90	``r630`` ``r600`` dGPU
				91	``rs880`` ``r600`` dGPU
				92	``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	93	Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	94	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	95	``rv710`` ``r600`` dGPU
				96	``rv730`` ``r600`` dGPU
				97	``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	98	Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	99	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	100	``cedar`` ``r600`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	101	``cypress`` ``r600`` dGPU
				102	``juniper`` ``r600`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	103	``redwood`` ``r600`` dGPU
				104	``sumo`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	105	Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	106	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	107	``barts`` ``r600`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	108	``caicos`` ``r600`` dGPU
				109	``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	110	``turks`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	111	GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	112	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	113	``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	114	``gfx601`` - ``hainan`` ``amdgcn`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	115	- ``oland``
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	116	- ``pitcairn``
				117	- ``verde``
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	118	GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	119	-----------------------------------------------------------------------------------
				120	``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
				121	- A6 Pro-7050B
				122	- A8-7100
				123	- A8 Pro-7150B
				124	- A10-7300
				125	- A10 Pro-7350B
				126	- FX-7500
				127	- A8-7200P
				128	- A10-7400P
				129	- FX-7600P
				130	``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
				131	- FirePro W9100
				132	- FirePro S9150
				133	- FirePro S9170
				134	``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
				135	- Radeon R9 290x
				136	- Radeon R390
				137	- Radeon R390x
				138	``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
				139	- ``mullins`` - E1-2200
				140	- E1-2500
				141	- E2-3000
				142	- E2-3800
				143	- A4-5000
				144	- A4-5100
				145	- A6-5200
				146	- A4 Pro-3340B
				147	``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
				148	- Radeon HD 8770
				149	- R7 260
				150	- R7 260X
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	151	GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	152	-----------------------------------------------------------------------------------
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	153	``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
				154	[on] - Pro A6-8500B
				155	- A8-8600P
				156	- Pro A8-8600B
				157	- FX-8800P
				158	- Pro A12-8800B
				159	\ ``amdgcn`` APU - xnack ROCm - A10-8700P
				160	[on] - Pro A10-8700B
				161	- A10-8780P
				162	\ ``amdgcn`` APU - xnack - A10-9600P
				163	[on] - A10-9630P
				164	- A12-9700P
				165	- A12-9730P
				166	- FX-9800P
				167	- FX-9830P
				168	\ ``amdgcn`` APU - xnack - E2-9010
				169	[on] - A6-9210
				170	- A9-9410
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	171	``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
				172	- ``tonga`` [off] - FirePro S7100
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	173	- FirePro W7100
				174	- Radeon R285
				175	- Radeon R9 380
				176	- Radeon R9 385
				177	- Mobile FirePro
				178	M7170
				179	``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
				180	[off] - Radeon R9 Fury
				181	- Radeon R9 FuryX
				182	- Radeon Pro Duo
				183	- FirePro S9300x2
				184	- Radeon Instinct MI8
				185	\ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
				186	[off] - Radeon RX 480
				187	- Radeon Instinct MI6
				188	\ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
				189	[off]
				190	``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
				191	[on]
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	192	GCN GFX9 [AMD-GCN-GFX9]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	193	-----------------------------------------------------------------------------------
				194	``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
				195	[off] Frontier Edition
				196	- Radeon RX Vega 56
				197	- Radeon RX Vega 64
				198	- Radeon RX Vega 64
				199	Liquid
				200	- Radeon Instinct MI25
Tony Tye	b6efb90	2018-04-14 01:58:10 +0000	[diff] [blame]	201	``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G
				202	[on] - Ryzen 5 2400G
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	203	``gfx904`` ``amdgcn`` dGPU - xnack TBA
				204	[off]
				205	.. TODO
				206	Add product
				207	names.
				208	``gfx906`` ``amdgcn`` dGPU - xnack TBA
				209	[off]
				210	.. TODO
				211	Add product
				212	names.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	213	=========== =============== ============ ===== ========= ======= ==================
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	214
				215	.. _amdgpu-target-features:
				216
				217	Target Features
				218	---------------
				219
				220	Target features control how code is generated to support certain
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	221	processor specific features. Not all target features are supported by
				222	all processors. The runtime must ensure that the features supported by
				223	the device used to execute the code match the features enabled when
				224	generating the code. A mismatch of features may result in incorrect
				225	execution, or a reduction in performance.
				226
				227	The target features supported by each processor, and the default value
				228	used if not specified explicitly, is listed in
				229	:ref:`amdgpu-processor-table`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	230
				231	Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
				232	target features.
				233
				234	For example:
				235
				236	``-mxnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	237	Enable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	238	``-mno-xnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	239	Disable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	240
				241	.. table:: AMDGPU Target Features
				242	:name: amdgpu-target-feature-table
				243
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	244	============== ==================================================
				245	Target Feature Description
				246	============== ==================================================
				247	-m[no-]xnack Enable/disable generating code that has
				248	memory clauses that are compatible with
				249	having XNACK replay enabled.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	250
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	251	This is used for demand paging and page
				252	migration. If XNACK replay is enabled in
				253	the device, then if a page fault occurs
				254	the code may execute incorrectly if the
				255	``xnack`` feature is not enabled. Executing
				256	code that has the feature enabled on a
				257	device that does not have XNACK replay
				258	enabled will execute correctly, but may
				259	be less performant than code with the
				260	feature disabled.
				261	============== ==================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	262
				263	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	264
				265	Address Spaces
				266	--------------
				267
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	268	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	269
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	270	The memory space names used in the table, aside from the region memory space, is
				271	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	272
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	273	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	274
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	275	.. table:: Address Space Mapping
				276	:name: amdgpu-address-space-mapping-table
				277
Yaxun Liu	0124b54	2018-02-13 18:00:25 +0000	[diff] [blame]	278	================== =================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	279	LLVM Address Space Memory Space
Yaxun Liu	0124b54	2018-02-13 18:00:25 +0000	[diff] [blame]	280	================== =================
				281	0 Generic (Flat)
				282	1 Global
				283	2 Region (GDS)
				284	3 Local (group/LDS)
				285	4 Constant
				286	5 Private (Scratch)
				287	6 Constant 32-bit
				288	================== =================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	289
				290	.. _amdgpu-memory-scopes:
				291
				292	Memory Scopes
				293	-------------
				294
				295	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				296	backend memory model when the target triple OS is ``amdhsa`` (see
				297	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				298
				299	The memory model supported is based on the HSA memory model [HSA]_ which is
				300	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				301	relation is transitive over the synchonizes-with relation independent of scope,
				302	and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	303	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	304
				305	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				306	inclusion and requires the memory scopes to exactly match. However, this
				307	is conservatively correct for OpenCL.
				308
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	309	.. table:: AMDHSA LLVM Sync Scopes
				310	:name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	311
				312	================ ==========================================================
				313	LLVM Sync Scope Description
				314	================ ==========================================================
				315	none The default: ``system``.
				316
				317	Synchronizes with, and participates in modification and
				318	seq_cst total orderings with, other operations (except
				319	image operations) for all address spaces (except private,
				320	or generic that accesses private) provided the other
				321	operation's sync scope is:
				322
				323	- ``system``.
				324	- ``agent`` and executed by a thread on the same agent.
				325	- ``workgroup`` and executed by a thread in the same
				326	workgroup.
				327	- ``wavefront`` and executed by a thread in the same
				328	wavefront.
				329
				330	``agent`` Synchronizes with, and participates in modification and
				331	seq_cst total orderings with, other operations (except
				332	image operations) for all address spaces (except private,
				333	or generic that accesses private) provided the other
				334	operation's sync scope is:
				335
				336	- ``system`` or ``agent`` and executed by a thread on the
				337	same agent.
				338	- ``workgroup`` and executed by a thread in the same
				339	workgroup.
				340	- ``wavefront`` and executed by a thread in the same
				341	wavefront.
				342
				343	``workgroup`` Synchronizes with, and participates in modification and
				344	seq_cst total orderings with, other operations (except
				345	image operations) for all address spaces (except private,
				346	or generic that accesses private) provided the other
				347	operation's sync scope is:
				348
				349	- ``system``, ``agent`` or ``workgroup`` and executed by a
				350	thread in the same workgroup.
				351	- ``wavefront`` and executed by a thread in the same
				352	wavefront.
				353
				354	``wavefront`` Synchronizes with, and participates in modification and
				355	seq_cst total orderings with, other operations (except
				356	image operations) for all address spaces (except private,
				357	or generic that accesses private) provided the other
				358	operation's sync scope is:
				359
				360	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				361	and executed by a thread in the same wavefront.
				362
				363	``singlethread`` Only synchronizes with, and participates in modification
				364	and seq_cst total orderings with, other operations (except
				365	image operations) running in the same thread for all
				366	address spaces (for example, in signal handlers).
				367	================ ==========================================================
				368
				369	AMDGPU Intrinsics
				370	-----------------
				371
				372	The AMDGPU backend implements the following intrinsics.
				373
				374	This section is WIP.
				375
				376	.. TODO
				377	List AMDGPU intrinsics
				378
				379	Code Object
				380	===========
				381
				382	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				383	can be linked by ``lld`` to produce a standard ELF shared code object which can
				384	be loaded and executed on an AMDGPU target.
				385
				386	Header
				387	------
				388
				389	The AMDGPU backend uses the following ELF header:
				390
				391	.. table:: AMDGPU ELF Header
				392	:name: amdgpu-elf-header-table
				393
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	394	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	395	Field Value
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	396	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	397	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				398	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	399	``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
				400	- ``ELFOSABI_AMDGPU_HSA``
				401	- ``ELFOSABI_AMDGPU_PAL``
				402	- ``ELFOSABI_AMDGPU_MESA3D``
				403	``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
				404	- ``ELFABIVERSION_AMDGPU_PAL``
				405	- ``ELFABIVERSION_AMDGPU_MESA3D``
				406	``e_type`` - ``ET_REL``
				407	- ``ET_DYN``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	408	``e_machine`` ``EM_AMDGPU``
				409	``e_entry`` 0
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	410	``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	411	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	412
				413	..
				414
				415	.. table:: AMDGPU ELF Header Enumeration Values
				416	:name: amdgpu-elf-header-enumeration-values-table
				417
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	418	=============================== =====
				419	Name Value
				420	=============================== =====
				421	``EM_AMDGPU`` 224
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	422	``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	423	``ELFOSABI_AMDGPU_HSA`` 64
				424	``ELFOSABI_AMDGPU_PAL`` 65
				425	``ELFOSABI_AMDGPU_MESA3D`` 66
				426	``ELFABIVERSION_AMDGPU_HSA`` 1
				427	``ELFABIVERSION_AMDGPU_PAL`` 0
				428	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				429	=============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	430
				431	``e_ident[EI_CLASS]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	432	The ELF class is:
				433
				434	* ``ELFCLASS32`` for ``r600`` architecture.
				435
				436	* ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
				437	bit applications.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	438
				439	``e_ident[EI_DATA]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	440	All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	441
				442	``e_ident[EI_OSABI]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	443	One of the following AMD GPU architecture specific OS ABIs
				444	(see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	445
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	446	* ``ELFOSABI_NONE`` for unknown OS.
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	447
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	448	* ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	449
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	450	* ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
				451
				452	* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	453
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	454	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	455	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				456	object conforms:
				457
				458	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				459	runtime ABI.
				460
				461	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				462	runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	463
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	464	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	465	3D runtime ABI.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	466
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	467	``e_type``
				468	Can be one of the following values:
				469
				470
				471	``ET_REL``
				472	The type produced by the AMD GPU backend compiler as it is relocatable code
				473	object.
				474
				475	``ET_DYN``
				476	The type produced by the linker as it is a shared code object.
				477
				478	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				479
				480	``e_machine``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	481	The value ``EM_AMDGPU`` is used for the machine for all processors supported
				482	by the ``r600`` and ``amdgcn`` architectures (see
				483	:ref:`amdgpu-processor-table`). The specific processor is specified in the
				484	``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
				485	:ref:`amdgpu-elf-header-e_flags-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	486
				487	``e_entry``
				488	The entry point is 0 as the entry points for individual kernels must be
				489	selected in order to invoke them through AQL packets.
				490
				491	``e_flags``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	492	The AMDGPU backend uses the following ELF header flags:
				493
				494	.. table:: AMDGPU ELF Header ``e_flags``
				495	:name: amdgpu-elf-header-e_flags-table
				496
				497	================================= ========== =============================
				498	Name Value Description
				499	================================= ========== =============================
				500	AMDGPU Processor Flag See :ref:`amdgpu-processor-table`.
				501	-------------------------------------------- -----------------------------
				502	``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
				503	mask for
				504	``EF_AMDGPU_MACH_xxx`` values
				505	defined in
				506	:ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	507	``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
				508	target feature is
				509	enabled for all code
				510	contained in the code object.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	511	If the processor
				512	does not support the
				513	``xnack`` target
				514	feature then must
				515	be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	516	See
				517	:ref:`amdgpu-target-features`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	518	================================= ========== =============================
				519
				520	.. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
				521	:name: amdgpu-ef-amdgpu-mach-table
				522
				523	================================= ========== =============================
				524	Name Value Description (see
				525	:ref:`amdgpu-processor-table`)
				526	================================= ========== =============================
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	527	``EF_AMDGPU_MACH_NONE`` 0x000 not specified
				528	``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
				529	``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
				530	``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
				531	``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
				532	``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
				533	``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
				534	``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
				535	``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
				536	``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
				537	``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
				538	``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
				539	``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
				540	``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
				541	``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
				542	``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
				543	``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
				544	reserved 0x011 - Reserved for ``r600``
				545	0x01f architecture processors.
				546	``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
				547	``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
				548	``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
				549	``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
				550	``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
				551	``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
				552	``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
				553	reserved 0x027 Reserved.
				554	``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
				555	``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
				556	``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
				557	``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
				558	``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
				559	``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	560	``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
				561	``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	562	reserved 0x030 Reserved.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	563	================================= ========== =============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	564
				565	Sections
				566	--------
				567
				568	An AMDGPU target ELF code object has the standard ELF sections which include:
				569
				570	.. table:: AMDGPU ELF Sections
				571	:name: amdgpu-elf-sections-table
				572
				573	================== ================ =================================
				574	Name Type Attributes
				575	================== ================ =================================
				576	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				577	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				578	``.debug_``\ \* ``SHT_PROGBITS`` none
				579	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				580	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				581	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				582	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				583	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				584	``.note`` ``SHT_NOTE`` none
				585	``.rela``\ name ``SHT_RELA`` none
				586	``.rela.dyn`` ``SHT_RELA`` none
				587	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				588	``.shstrtab`` ``SHT_STRTAB`` none
				589	``.strtab`` ``SHT_STRTAB`` none
				590	``.symtab`` ``SHT_SYMTAB`` none
				591	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				592	================== ================ =================================
				593
				594	These sections have their standard meanings (see [ELF]_) and are only generated
				595	if needed.
				596
				597	``.debug``\ \*
				598	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				599	DWARF produced by the AMDGPU backend.
				600
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	601	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	602	The standard sections used by a dynamic loader.
				603
				604	``.note``
				605	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				606	backend.
				607
				608	``.rela``\ name, ``.rela.dyn``
				609	For relocatable code objects, name is the name of the section that the
				610	relocation records apply. For example, ``.rela.text`` is the section name for
				611	relocation records associated with the ``.text`` section.
				612
				613	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				614	records from each of the relocatable code object's ``.rela``\ name sections.
				615
				616	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				617	the AMDGPU backend.
				618
				619	``.text``
				620	The executable machine code for the kernels and functions they call. Generated
				621	as position independent code. See :ref:`amdgpu-code-conventions` for
				622	information on conventions used in the isa generation.
				623
				624	.. _amdgpu-note-records:
				625
				626	Note Records
				627	------------
				628
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	629	As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
				630	be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
				631	aligned. In addition, minimal zero byte padding must be generated to ensure the
				632	``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
				633	``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	634
				635	The AMDGPU backend code object uses the following ELF note records in the
				636	``.note`` section. The Description column specifies the layout of the note
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	637	record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	638	variable size strings have a corresponding ``*_size`` field that specifies the
				639	number of bytes, including the terminating null character, in the string. The
				640	string(s) come immediately after the preceding fields.
				641
				642	Additional note records can be present.
				643
				644	.. table:: AMDGPU ELF Note Records
				645	:name: amdgpu-elf-note-records-table
				646
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	647	===== ============================== ======================================
				648	Name Type Description
				649	===== ============================== ======================================
				650	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	651	===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	652
				653	..
				654
				655	.. table:: AMDGPU ELF Note Record Enumeration Values
				656	:name: amdgpu-elf-note-record-enumeration-values-table
				657
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	658	============================== =====
				659	Name Value
				660	============================== =====
				661	reserved 0-9
				662	``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	663	reserved 11
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	664	============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	665
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	666	``NT_AMD_AMDGPU_HSA_METADATA``
				667	Specifies extensible metadata associated with the code objects executed on HSA
				668	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				669	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				670	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				671	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	672
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	673	.. _amdgpu-symbols:
				674
				675	Symbols
				676	-------
				677
				678	Symbols include the following:
				679
				680	.. table:: AMDGPU ELF Symbols
				681	:name: amdgpu-elf-symbols-table
				682
				683	===================== ============== ============= ==================
				684	Name Type Section Description
				685	===================== ============== ============= ==================
				686	link-name ``STT_OBJECT`` - ``.data`` Global variable
				687	- ``.rodata``
				688	- ``.bss``
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	689	link-name\ ``.kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	690	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				691	===================== ============== ============= ==================
				692
				693	Global variable
				694	Global variables both used and defined by the compilation unit.
				695
				696	If the symbol is defined in the compilation unit then it is allocated in the
				697	appropriate section according to if it has initialized data or is readonly.
				698
				699	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				700	will resolve relocations using the definition provided by another code object
				701	or explicitly defined by the runtime.
				702
				703	All global symbols, whether defined in the compilation unit or external, are
				704	accessed by the machine code indirectly through a GOT table entry. This
				705	allows them to be preemptable. The GOT table is only supported when the target
				706	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	707
				708	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	709	Add description of linked shared object symbols. Seems undefined symbols
				710	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	711
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	712	Kernel descriptor
				713	Every HSA kernel has an associated kernel descriptor. It is the address of the
				714	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				715	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				716	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				717
				718	Kernel entry point
				719	Every HSA kernel also has a symbol for its machine code entry point.
				720
				721	.. _amdgpu-relocation-records:
				722
				723	Relocation Records
				724	------------------
				725
				726	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				727	relocatable fields are:
				728
				729	``word32``
				730	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				731	alignment. These values use the same byte order as other word values in the
				732	AMD GPU architecture.
				733
				734	``word64``
				735	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				736	alignment. These values use the same byte order as other word values in the
				737	AMD GPU architecture.
				738
				739	Following notations are used for specifying relocation calculations:
				740
				741	A
				742	Represents the addend used to compute the value of the relocatable field.
				743
				744	G
				745	Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	746	entry's symbol will reside during execution.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	747
				748	GOT
				749	Represents the address of the global offset table.
				750
				751	P
				752	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				753	of the storage unit being relocated (computed using ``r_offset``).
				754
				755	S
				756	Represents the value of the symbol whose index resides in the relocation
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	757	entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				758
				759	B
				760	Represents the base address of a loaded executable or shared object which is
				761	the difference between the ELF address and the actual load address. Relocations
				762	using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	763
				764	The following relocation types are supported:
				765
				766	.. table:: AMDGPU ELF Relocation Records
				767	:name: amdgpu-elf-relocation-records-table
				768
Tony Tye	db6c993	2018-01-30 23:59:43 +0000	[diff] [blame]	769	========================== ======= ===== ========== ==============================
				770	Relocation Type Kind Value Field Calculation
				771	========================== ======= ===== ========== ==============================
				772	``R_AMDGPU_NONE`` 0 none none
Tony Tye	223f4c7	2018-04-13 01:01:27 +0000	[diff] [blame]	773	``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
				774	Dynamic
				775	``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
				776	Dynamic
				777	``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	778	Dynamic
Tony Tye	db6c993	2018-01-30 23:59:43 +0000	[diff] [blame]	779	``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
				780	``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
Tony Tye	223f4c7	2018-04-13 01:01:27 +0000	[diff] [blame]	781	``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
				782	Dynamic
Tony Tye	db6c993	2018-01-30 23:59:43 +0000	[diff] [blame]	783	``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
				784	``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				785	``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
				786	``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				787	``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
				788	reserved 12
				789	``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
				790	========================== ======= ===== ========== ==============================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	791
Tony Tye	223f4c7	2018-04-13 01:01:27 +0000	[diff] [blame]	792	``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
				793	the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
				794
				795	There is no current OS loader support for 32 bit programs and so
				796	``R_AMDGPU_ABS32`` is not used.
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	797
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	798	.. _amdgpu-dwarf:
				799
				800	DWARF
				801	-----
				802
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	803	Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	804	information that maps the code object executable code and data to the source
				805	language constructs. It can be used by tools such as debuggers and profilers.
				806
				807	Address Space Mapping
				808	~~~~~~~~~~~~~~~~~~~~~
				809
				810	The following address space mapping is used:
				811
				812	.. table:: AMDGPU DWARF Address Space Mapping
				813	:name: amdgpu-dwarf-address-space-mapping-table
				814
				815	=================== =================
				816	DWARF Address Space Memory Space
				817	=================== =================
				818	1 Private (Scratch)
				819	2 Local (group/LDS)
				820	omitted Global
				821	omitted Constant
				822	omitted Generic (Flat)
				823	not supported Region (GDS)
				824	=================== =================
				825
				826	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				827	used in the table.
				828
				829	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				830	DWARF address space of the value of the pointer when it is in the private or
				831	local address space. Otherwise the attribute is omitted.
				832
				833	An ``XDEREF`` operation is generated in location list expressions for variables
				834	that are allocated in the private and local address space. Otherwise no
				835	``XDREF`` is omitted.
				836
				837	Register Mapping
				838	~~~~~~~~~~~~~~~~
				839
				840	This section is WIP.
				841
				842	.. TODO
				843	Define DWARF register enumeration.
				844
				845	If want to present a wavefront state then should expose vector registers as
				846	64 wide (rather than per work-item view that LLVM uses). Either as separate
				847	registers, or a 64x4 byte single register. In either case use a new LANE op
				848	(akin to XDREF) to select the current lane usage in a location
				849	expression. This would also allow scalar register spilling to vector register
				850	lanes to be expressed (currently no debug information is being generated for
				851	spilling). If choose a wide single register approach then use LANE in
				852	conjunction with PIECE operation to select the dword part of the register for
				853	the current lane. If the separate register approach then use LANE to select
				854	the register.
				855
				856	Source Text
				857	~~~~~~~~~~~
				858
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	859	Source text for online-compiled programs (e.g. those compiled by the OpenCL
				860	runtime) may be embedded into the DWARF v5 line table using the ``clang
				861	-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	862
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	863	For example:
				864
				865	``-gembed-source``
				866	Enable the embedded source DWARF v5 extension.
				867	``-gno-embed-source``
				868	Disable the embedded source DWARF v5 extension.
				869
				870	.. table:: AMDGPU Debug Options
				871	:name: amdgpu-debug-options
				872
				873	==================== ==================================================
				874	Debug Flag Description
				875	==================== ==================================================
				876	-g[no-]embed-source Enable/disable embedding source text in DWARF
				877	debug sections. Useful for environments where
				878	source cannot be written to disk, such as
				879	when performing online compilation.
				880	==================== ==================================================
				881
				882	This option enables one extended content types in the DWARF v5 Line Number
				883	Program Header, which is used to encode embedded source.
				884
				885	.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
				886	:name: amdgpu-dwarf-extended-content-types
				887
				888	============================ ======================
				889	Content Type Form
				890	============================ ======================
				891	``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
				892	============================ ======================
				893
				894	The source field will contain the UTF-8 encoded, null-terminated source text
				895	with ``'\n'`` line endings. When the source field is present, consumers can use
				896	the embedded source instead of attempting to discover the source on disk. When
				897	the source field is absent, consumers can access the file to get the source
				898	text.
				899
				900	The above content type appears in the ``file_name_entry_format`` field of the
				901	line table prologue, and its corresponding value appear in the ``file_names``
				902	field. The current encoding of the content type is documented in table
				903	:ref:`amdgpu-dwarf-extended-content-types-encoding`
				904
				905	.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
				906	:name: amdgpu-dwarf-extended-content-types-encoding
				907
				908	============================ ====================
				909	Content Type Value
				910	============================ ====================
				911	``DW_LNCT_LLVM_source`` 0x2001
				912	============================ ====================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	913
				914	.. _amdgpu-code-conventions:
				915
				916	Code Conventions
				917	================
				918
				919	This section provides code conventions used for each supported target triple OS
				920	(see :ref:`amdgpu-target-triples`).
				921
				922	AMDHSA
				923	------
				924
				925	This section provides code conventions used when the target triple OS is
				926	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				927
				928	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	929
Tony Tye	01bfd6c	2018-03-27 21:20:46 +0000	[diff] [blame]	930	Code Object Target Identification
				931	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				932
				933	The AMDHSA OS uses the following syntax to specify the code object
				934	target as a single string:
				935
				936	``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
				937
				938	Where:
				939
				940	- ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
				941	are the same as the Target Triple (see
				942	:ref:`amdgpu-target-triples`).
				943
				944	- ``<Processor>`` is the same as the Processor (see
				945	:ref:`amdgpu-processors`).
				946
				947	- ``<Target Features>`` is a list of the enabled Target Features
				948	(see :ref:`amdgpu-target-features`), each prefixed by a plus, that
				949	apply to Processor. The list must be in the same order as listed
				950	in the table :ref:`amdgpu-target-feature-table`. Note that *Target
				951	Features* must be included in the list if they are enabled even if
				952	that is the default for Processor.
				953
				954	For example:
				955
				956	``"amdgcn-amd-amdhsa--gfx902+xnack"``
				957
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	958	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	959	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	960
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	961	The code object metadata specifies extensible metadata associated with the code
				962	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
				963	[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
				964	(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
				965	``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
				966	information necessary to support the ROCM kernel queries. For example, the
				967	segment sizes needed in a dispatch packet. In addition, a high level language
				968	runtime may require other information to be included. For example, the AMD
				969	OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	970
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	971	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	972	:doc:`YamlIO`).
				973
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	974	.. TODO
				975	Is the string null terminated? It probably should not if YAML allows it to
				976	contain null characters, otherwise it should be.
				977
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	978	The metadata is represented as a single YAML document comprised of the mapping
				979	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				980	referenced tables.
				981
				982	For boolean values, the string values of ``false`` and ``true`` are used for
				983	false and true respectively.
				984
				985	Additional information can be added to the mappings. To avoid conflicts, any
				986	non-AMD key names should be prefixed by "vendor-name.".
				987
				988	.. table:: AMDHSA Code Object Metadata Mapping
				989	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				990
				991	========== ============== ========= =======================================
				992	String Key Value Type Required? Description
				993	========== ============== ========= =======================================
				994	"Version" sequence of Required - The first integer is the major
				995	2 integers version. Currently 1.
				996	- The second integer is the minor
				997	version. Currently 0.
				998	"Printf" sequence of Each string is encoded information
				999	strings about a printf function call. The
				1000	encoded information is organized as
				1001	fields separated by colon (':'):
				1002
				1003	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				1004
				1005	where:
				1006
				1007	``ID``
				1008	A 32 bit integer as a unique id for
				1009	each printf function call
				1010
				1011	``N``
				1012	A 32 bit integer equal to the number
				1013	of arguments of printf function call
				1014	minus 1
				1015
				1016	``S[i]`` (where i = 0, 1, ... , N-1)
				1017	32 bit integers for the size in bytes
				1018	of the i-th FormatString argument of
				1019	the printf function call
				1020
				1021	FormatString
				1022	The format string passed to the
				1023	printf function call.
				1024	"Kernels" sequence of Required Sequence of the mappings for each
				1025	mapping kernel in the code object. See
				1026	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				1027	for the definition of the mapping.
				1028	========== ============== ========= =======================================
				1029
				1030	..
				1031
				1032	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				1033	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				1034
				1035	================= ============== ========= ================================
				1036	String Key Value Type Required? Description
				1037	================= ============== ========= ================================
				1038	"Name" string Required Source name of the kernel.
				1039	"SymbolName" string Required Name of the kernel
				1040	descriptor ELF symbol.
				1041	"Language" string Source language of the kernel.
				1042	Values include:
				1043
				1044	- "OpenCL C"
				1045	- "OpenCL C++"
				1046	- "HCC"
				1047	- "OpenMP"
				1048
				1049	"LanguageVersion" sequence of - The first integer is the major
				1050	2 integers version.
				1051	- The second integer is the
				1052	minor version.
				1053	"Attrs" mapping Mapping of kernel attributes.
				1054	See
				1055	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				1056	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1057	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1058	mapping kernel arguments. See
				1059	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				1060	for the definition of the mapping.
				1061	"CodeProps" mapping Mapping of properties related to
				1062	the kernel code. See
				1063	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				1064	for the mapping definition.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1065	================= ============== ========= ================================
				1066
				1067	..
				1068
				1069	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				1070	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				1071
				1072	=================== ============== ========= ==============================
				1073	String Key Value Type Required? Description
				1074	=================== ============== ========= ==============================
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1075	"ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
				1076	3 integers must be >=1 and the dispatch
				1077	work-group size X, Y, Z must
				1078	correspond to the specified
				1079	values. Defaults to 0, 0, 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1080
				1081	Corresponds to the OpenCL
				1082	``reqd_work_group_size``
				1083	attribute.
				1084	"WorkGroupSizeHint" sequence of The dispatch work-group size
				1085	3 integers X, Y, Z is likely to be the
				1086	specified values.
				1087
				1088	Corresponds to the OpenCL
				1089	``work_group_size_hint``
				1090	attribute.
				1091	"VecTypeHint" string The name of a scalar or vector
				1092	type.
				1093
				1094	Corresponds to the OpenCL
				1095	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	1096
				1097	"RuntimeHandle" string The external symbol name
				1098	associated with a kernel.
				1099	OpenCL runtime allocates a
				1100	global buffer for the symbol
				1101	and saves the kernel's address
				1102	to it, which is used for
				1103	device side enqueueing. Only
				1104	available for device side
				1105	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1106	=================== ============== ========= ==============================
				1107
				1108	..
				1109
				1110	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				1111	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				1112
				1113	================= ============== ========= ================================
				1114	String Key Value Type Required? Description
				1115	================= ============== ========= ================================
				1116	"Name" string Kernel argument name.
				1117	"TypeName" string Kernel argument type name.
				1118	"Size" integer Required Kernel argument size in bytes.
				1119	"Align" integer Required Kernel argument alignment in
				1120	bytes. Must be a power of two.
				1121	"ValueKind" string Required Kernel argument kind that
				1122	specifies how to set up the
				1123	corresponding argument.
				1124	Values include:
				1125
				1126	"ByValue"
				1127	The argument is copied
				1128	directly into the kernarg.
				1129
				1130	"GlobalBuffer"
				1131	A global address space pointer
				1132	to the buffer data is passed
				1133	in the kernarg.
				1134
				1135	"DynamicSharedPointer"
				1136	A group address space pointer
				1137	to dynamically allocated LDS
				1138	is passed in the kernarg.
				1139
				1140	"Sampler"
				1141	A global address space
				1142	pointer to a S# is passed in
				1143	the kernarg.
				1144
				1145	"Image"
				1146	A global address space
				1147	pointer to a T# is passed in
				1148	the kernarg.
				1149
				1150	"Pipe"
				1151	A global address space pointer
				1152	to an OpenCL pipe is passed in
				1153	the kernarg.
				1154
				1155	"Queue"
				1156	A global address space pointer
				1157	to an OpenCL device enqueue
				1158	queue is passed in the
				1159	kernarg.
				1160
				1161	"HiddenGlobalOffsetX"
				1162	The OpenCL grid dispatch
				1163	global offset for the X
				1164	dimension is passed in the
				1165	kernarg.
				1166
				1167	"HiddenGlobalOffsetY"
				1168	The OpenCL grid dispatch
				1169	global offset for the Y
				1170	dimension is passed in the
				1171	kernarg.
				1172
				1173	"HiddenGlobalOffsetZ"
				1174	The OpenCL grid dispatch
				1175	global offset for the Z
				1176	dimension is passed in the
				1177	kernarg.
				1178
				1179	"HiddenNone"
				1180	An argument that is not used
				1181	by the kernel. Space needs to
				1182	be left for it, but it does
				1183	not need to be set up.
				1184
				1185	"HiddenPrintfBuffer"
				1186	A global address space pointer
				1187	to the runtime printf buffer
				1188	is passed in kernarg.
				1189
				1190	"HiddenDefaultQueue"
				1191	A global address space pointer
				1192	to the OpenCL device enqueue
				1193	queue that should be used by
				1194	the kernel by default is
				1195	passed in the kernarg.
				1196
				1197	"HiddenCompletionAction"
Yaxun Liu	c928f2a	2017-10-30 14:30:28 +0000	[diff] [blame]	1198	A global address space pointer
				1199	to help link enqueued kernels into
				1200	the ancestor tree for determining
				1201	when the parent kernel has finished.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1202
				1203	"ValueType" string Required Kernel argument value type. Only
				1204	present if "ValueKind" is
				1205	"ByValue". For vector data
				1206	types, the value is for the
				1207	element type. Values include:
				1208
				1209	- "Struct"
				1210	- "I8"
				1211	- "U8"
				1212	- "I16"
				1213	- "U16"
				1214	- "F16"
				1215	- "I32"
				1216	- "U32"
				1217	- "F32"
				1218	- "I64"
				1219	- "U64"
				1220	- "F64"
				1221
				1222	.. TODO
				1223	How can it be determined if a
				1224	vector type, and what size
				1225	vector?
				1226	"PointeeAlign" integer Alignment in bytes of pointee
				1227	type for pointer type kernel
				1228	argument. Must be a power
				1229	of 2. Only present if
				1230	"ValueKind" is
				1231	"DynamicSharedPointer".
				1232	"AddrSpaceQual" string Kernel argument address space
				1233	qualifier. Only present if
				1234	"ValueKind" is "GlobalBuffer" or
				1235	"DynamicSharedPointer". Values
				1236	are:
				1237
				1238	- "Private"
				1239	- "Global"
				1240	- "Constant"
				1241	- "Local"
				1242	- "Generic"
				1243	- "Region"
				1244
				1245	.. TODO
				1246	Is GlobalBuffer only Global
				1247	or Constant? Is
				1248	DynamicSharedPointer always
				1249	Local? Can HCC allow Generic?
				1250	How can Private or Region
				1251	ever happen?
				1252	"AccQual" string Kernel argument access
				1253	qualifier. Only present if
				1254	"ValueKind" is "Image" or
				1255	"Pipe". Values
				1256	are:
				1257
				1258	- "ReadOnly"
				1259	- "WriteOnly"
				1260	- "ReadWrite"
				1261
				1262	.. TODO
				1263	Does this apply to
				1264	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1265	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1266	performed by the kernel on the
				1267	kernel argument. Only present if
				1268	"ValueKind" is "GlobalBuffer",
				1269	"Image", or "Pipe". This may be
				1270	more restrictive than indicated
				1271	by "AccQual" to reflect what the
				1272	kernel actual does. If not
				1273	present then the runtime must
				1274	assume what is implied by
				1275	"AccQual" and "IsConst". Values
				1276	are:
				1277
				1278	- "ReadOnly"
				1279	- "WriteOnly"
				1280	- "ReadWrite"
				1281
				1282	"IsConst" boolean Indicates if the kernel argument
				1283	is const qualified. Only present
				1284	if "ValueKind" is
				1285	"GlobalBuffer".
				1286
				1287	"IsRestrict" boolean Indicates if the kernel argument
				1288	is restrict qualified. Only
				1289	present if "ValueKind" is
				1290	"GlobalBuffer".
				1291
				1292	"IsVolatile" boolean Indicates if the kernel argument
				1293	is volatile qualified. Only
				1294	present if "ValueKind" is
				1295	"GlobalBuffer".
				1296
				1297	"IsPipe" boolean Indicates if the kernel argument
				1298	is pipe qualified. Only present
				1299	if "ValueKind" is "Pipe".
				1300
				1301	.. TODO
				1302	Can GlobalBuffer be pipe
				1303	qualified?
				1304	================= ============== ========= ================================
				1305
				1306	..
				1307
				1308	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1309	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1310
				1311	============================ ============== ========= =====================
				1312	String Key Value Type Required? Description
				1313	============================ ============== ========= =====================
				1314	"KernargSegmentSize" integer Required The size in bytes of
				1315	the kernarg segment
				1316	that holds the values
				1317	of the arguments to
				1318	the kernel.
				1319	"GroupSegmentFixedSize" integer Required The amount of group
				1320	segment memory
				1321	required by a
				1322	work-group in
				1323	bytes. This does not
				1324	include any
				1325	dynamically allocated
				1326	group segment memory
				1327	that may be added
				1328	when the kernel is
				1329	dispatched.
				1330	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1331	private address space
				1332	memory required for a
				1333	work-item in
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1334	bytes. If the kernel
				1335	uses a dynamic call
				1336	stack then additional
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1337	space must be added
				1338	to this value for the
				1339	call stack.
				1340	"KernargSegmentAlign" integer Required The maximum byte
				1341	alignment of
				1342	arguments in the
				1343	kernarg segment. Must
				1344	be a power of 2.
				1345	"WavefrontSize" integer Required Wavefront size. Must
				1346	be a power of 2.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1347	"NumSGPRs" integer Required Number of scalar
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1348	registers used by a
				1349	wavefront for
				1350	GFX6-GFX9. This
				1351	includes the special
				1352	SGPRs for VCC, Flat
				1353	Scratch (GFX7-GFX9)
				1354	and XNACK (for
				1355	GFX8-GFX9). It does
				1356	not include the 16
				1357	SGPR added if a trap
				1358	handler is
				1359	enabled. It is not
				1360	rounded up to the
				1361	allocation
				1362	granularity.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1363	"NumVGPRs" integer Required Number of vector
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1364	registers used by
				1365	each work-item for
				1366	GFX6-GFX9
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1367	"MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1368	work-group size
				1369	supported by the
				1370	kernel in work-items.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1371	Must be >=1 and
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1372	consistent with
				1373	ReqdWorkGroupSize if
				1374	not 0, 0, 0.
Konstantin Zhuravlyov	06ae4ec	2017-11-28 17:51:08 +0000	[diff] [blame]	1375	"NumSpilledSGPRs" integer Number of stores from
				1376	a scalar register to
				1377	a register allocator
				1378	created spill
				1379	location.
				1380	"NumSpilledVGPRs" integer Number of stores from
				1381	a vector register to
				1382	a register allocator
				1383	created spill
				1384	location.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1385	============================ ============== ========= =====================
				1386
				1387	..
				1388
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1389	Kernel Dispatch
				1390	~~~~~~~~~~~~~~~
				1391
				1392	The HSA architected queuing language (AQL) defines a user space memory interface
				1393	that can be used to control the dispatch of kernels, in an agent independent
				1394	way. An agent can have zero or more AQL queues created for it using the ROCm
				1395	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1396	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1397	mechanics and packet layouts.
				1398
				1399	The packet processor of a kernel agent is responsible for detecting and
				1400	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1401	packet processor is implemented by the hardware command processor (CP),
				1402	asynchronous dispatch controller (ADC) and shader processor input controller
				1403	(SPI).
				1404
				1405	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1406	mode driver to initialize and register the AQL queue with CP.
				1407
				1408	To dispatch a kernel the following actions are performed. This can occur in the
				1409	CPU host program, or from an HSA kernel executing on a GPU.
				1410
				1411	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1412	executed is obtained.
				1413	2. A pointer to the kernel descriptor (see
				1414	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1415	obtained. It must be for a kernel that is contained in a code object that that
				1416	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1417	associated.
				1418	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1419	for a memory region with the kernarg property for the kernel agent that will
				1420	execute the kernel. It must be at least 16 byte aligned.
				1421	4. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	1422	allocation. The layout is defined in the HSA Programmer's Language Reference
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1423	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1424	memory in the same way constant memory is accessed. (Note that the HSA
				1425	specification allows an implementation to copy the kernel argument contents to
				1426	another location that is accessed by the kernel.)
				1427	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1428	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1429	packet. The packet must be set up, and the final write must use an atomic
				1430	store release to set the packet kind to ensure the packet contents are
				1431	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1432	notify the kernel agent that the AQL queue has been updated. These rules, and
				1433	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1434	System Architecture Specification* [HSA]_.
				1435	6. A kernel dispatch packet includes information about the actual dispatch,
				1436	such as grid and work-group size, together with information from the code
				1437	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1438	the kernel symbol can be used to obtain the code object values which are
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1439	recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1440	7. CP executes micro-code and is responsible for detecting and setting up the
				1441	GPU to execute the wavefronts of a kernel dispatch.
				1442	8. CP ensures that when the a wavefront starts executing the kernel machine
				1443	code, the scalar general purpose registers (SGPR) and vector general purpose
				1444	registers (VGPR) are set up as required by the machine code. The required
				1445	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1446	register state is defined in
				1447	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1448	9. The prolog of the kernel machine code (see
				1449	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1450	before continuing executing the machine code that corresponds to the kernel.
				1451	10. When the kernel dispatch has completed execution, CP signals the completion
				1452	signal specified in the kernel dispatch packet if not 0.
				1453
				1454	.. _amdgpu-amdhsa-memory-spaces:
				1455
				1456	Memory Spaces
				1457	~~~~~~~~~~~~~
				1458
				1459	The memory space properties are:
				1460
				1461	.. table:: AMDHSA Memory Spaces
				1462	:name: amdgpu-amdhsa-memory-spaces-table
				1463
				1464	================= =========== ======== ======= ==================
				1465	Memory Space Name HSA Segment Hardware Address NULL Value
				1466	Name Name Size
				1467	================= =========== ======== ======= ==================
				1468	Private private scratch 32 0x00000000
				1469	Local group LDS 32 0xFFFFFFFF
				1470	Global global global 64 0x0000000000000000
				1471	Constant constant *same as 64 0x0000000000000000
				1472	global*
				1473	Generic flat flat 64 0x0000000000000000
				1474	Region N/A GDS 32 *not implemented
				1475	for AMDHSA*
				1476	================= =========== ======== ======= ==================
				1477
				1478	The global and constant memory spaces both use global virtual addresses, which
				1479	are the same virtual address space used by the CPU. However, some virtual
				1480	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1481	and some by both.
				1482
				1483	Using the constant memory space indicates that the data will not change during
				1484	the execution of the kernel. This allows scalar read instructions to be
				1485	used. The vector and scalar L1 caches are invalidated of volatile data before
				1486	each kernel dispatch execution to allow constant memory to change values between
				1487	kernel dispatches.
				1488
				1489	The local memory space uses the hardware Local Data Store (LDS) which is
				1490	automatically allocated when the hardware creates work-groups of wavefronts, and
				1491	freed when all the wavefronts of a work-group have terminated. The data store
				1492	(DS) instructions can be used to access it.
				1493
				1494	The private memory space uses the hardware scratch memory support. If the kernel
				1495	uses scratch, then the hardware allocates memory that is accessed using
				1496	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1497	address to physical address is:
				1498
				1499	``wavefront-scratch-base +
				1500	(private-address * wavefront-size * 4) +
				1501	(wavefront-lane-id * 4)``
				1502
				1503	There are different ways that the wavefront scratch base address is determined
				1504	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1505	memory can be accessed in an interleaved manner using buffer instruction with
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1506	the scratch buffer descriptor and per wavefront scratch offset, by the scratch
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1507	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1508	same private address, the interleaving results in adjacent dwords being accessed
				1509	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1510	supported except by flat and scratch instructions in GFX9.
				1511
				1512	The generic address space uses the hardware flat address support available in
				1513	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1514	local appertures), that are outside the range of addressible global memory, to
				1515	map from a flat address to a private or local address.
				1516
				1517	FLAT instructions can take a flat address and access global, private (scratch)
				1518	and group (LDS) memory depending in if the address is within one of the
				1519	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1520	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1521	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1522	(see :ref:`amdgpu-amdhsa-m0`).
				1523
				1524	To convert between a segment address and a flat address the base address of the
				1525	appertures address can be used. For GFX7-GFX8 these are available in the
				1526	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1527	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1528	GFX9 the appature base addresses are directly available as inline constant
				1529	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1530	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1531	which makes it easier to convert from flat to segment or segment to flat.
				1532
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1533	Image and Samplers
				1534	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1535
				1536	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1537	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1538	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1539	enumeration values for the queries that are not trivially deducible from the S#
				1540	representation.
				1541
				1542	HSA Signals
				1543	~~~~~~~~~~~
				1544
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1545	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1546	structure allocated in memory accessible from both the CPU and GPU. The
				1547	structure is defined by the ROCm runtime and subject to change between releases
				1548	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1549
				1550	.. _amdgpu-amdhsa-hsa-aql-queue:
				1551
				1552	HSA AQL Queue
				1553	~~~~~~~~~~~~~
				1554
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1555	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1556	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1557	fields needed to implement certain language features such as the flat address
				1558	aperture bases. It also contains fields used by CP such as managing the
				1559	allocation of scratch memory.
				1560
				1561	.. _amdgpu-amdhsa-kernel-descriptor:
				1562
				1563	Kernel Descriptor
				1564	~~~~~~~~~~~~~~~~~
				1565
				1566	A kernel descriptor consists of the information needed by CP to initiate the
				1567	execution of a kernel, including the entry point address of the machine code
				1568	that implements the kernel.
				1569
				1570	Kernel Descriptor for GFX6-GFX9
				1571	+++++++++++++++++++++++++++++++
				1572
				1573	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1574
				1575	.. table:: Kernel Descriptor for GFX6-GFX9
				1576	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1577
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1578	======= ======= =============================== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1579	Bits Size Field Name Description
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1580	======= ======= =============================== ============================
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1581	31:0 4 bytes GROUP_SEGMENT_FIXED_SIZE The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1582	address space memory
				1583	required for a work-group
				1584	in bytes. This does not
				1585	include any dynamically
				1586	allocated local address
				1587	space memory that may be
				1588	added when the kernel is
				1589	dispatched.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1590	63:32 4 bytes PRIVATE_SEGMENT_FIXED_SIZE The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1591	private address space
				1592	memory required for a
				1593	work-item in bytes. If
				1594	is_dynamic_callstack is 1
				1595	then additional space must
				1596	be added to this value for
				1597	the call stack.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1598	127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1599	191:128 8 bytes KERNEL_CODE_ENTRY_BYTE_OFFSET Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1600	negative) from base
				1601	address of kernel
				1602	descriptor to kernel's
				1603	entry point instruction
				1604	which must be 256 byte
				1605	aligned.
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1606	383:192 24 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1607	bytes
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1608	415:384 4 bytes COMPUTE_PGM_RSRC1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1609	program settings used by
				1610	CP to set up
				1611	``COMPUTE_PGM_RSRC1``
				1612	configuration
				1613	register. See
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1614	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1615	447:416 4 bytes COMPUTE_PGM_RSRC2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1616	program settings used by
				1617	CP to set up
				1618	``COMPUTE_PGM_RSRC2``
				1619	configuration
				1620	register. See
				1621	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1622	448 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
				1623	_BUFFER SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1624	(see
				1625	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1626
				1627	The total number of SGPR
				1628	user data registers
				1629	requested must not exceed
				1630	16 and match value in
				1631	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1632	Any requests beyond 16
				1633	will be ignored.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	1634	449 1 bit ENABLE_SGPR_DISPATCH_PTR see above
				1635	450 1 bit ENABLE_SGPR_QUEUE_PTR see above
				1636	451 1 bit ENABLE_SGPR_KERNARG_SEGMENT_PTR see above
				1637	452 1 bit ENABLE_SGPR_DISPATCH_ID see above
				1638	453 1 bit ENABLE_SGPR_FLAT_SCRATCH_INIT see above
				1639	454 1 bit ENABLE_SGPR_PRIVATE_SEGMENT see above
				1640	_SIZE
				1641	455 1 bit ENABLE_SGPR_GRID_WORKGROUP Not implemented in CP and
				1642	_COUNT_X should always be 0.
				1643	456 1 bit ENABLE_SGPR_GRID_WORKGROUP Not implemented in CP and
				1644	_COUNT_Y should always be 0.
				1645	457 1 bit ENABLE_SGPR_GRID_WORKGROUP Not implemented in CP and
				1646	_COUNT_Z should always be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	1647	463:458 6 bits Reserved, must be 0.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1648	511:464 6 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1649	bytes
				1650	512 Total size 64 bytes.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1651	======= ====================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1652
				1653	..
				1654
				1655	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1656	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1657
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1658	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1659	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1660	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1661	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1662	used by each work-item,
				1663	granularity is device
				1664	specific:
				1665
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1666	GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1667	- max_vgpr 1..256
				1668	- roundup((max_vgpg + 1)
				1669	/ 4) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1670
				1671	Used by CP to set up
				1672	``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1673	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1674	used by a wavefront,
				1675	granularity is device
				1676	specific:
				1677
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1678	GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1679	- max_sgpr 1..112
				1680	- roundup((max_sgpg + 1)
				1681	/ 8) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1682	GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1683	- max_sgpr 1..112
				1684	- roundup((max_sgpg + 1)
				1685	/ 16) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1686
				1687	Includes the special SGPRs
				1688	for VCC, Flat Scratch (for
				1689	GFX7 onwards) and XNACK
				1690	(for GFX8 onwards). It does
				1691	not include the 16 SGPR
				1692	added if a trap handler is
				1693	enabled.
				1694
				1695	Used by CP to set up
				1696	``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1697	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1698
				1699	Start executing wavefront
				1700	at the specified priority.
				1701
				1702	CP is responsible for
				1703	filling in
				1704	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1705	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1706	with specified rounding
				1707	mode for single (32
				1708	bit) floating point
				1709	precision floating point
				1710	operations.
				1711
				1712	Floating point rounding
				1713	mode values are defined in
				1714	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1715
				1716	Used by CP to set up
				1717	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1718	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1719	with specified rounding
				1720	denorm mode for half/double (16
				1721	and 64 bit) floating point
				1722	precision floating point
				1723	operations.
				1724
				1725	Floating point rounding
				1726	mode values are defined in
				1727	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1728
				1729	Used by CP to set up
				1730	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1731	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1732	with specified denorm mode
				1733	for single (32
				1734	bit) floating point
				1735	precision floating point
				1736	operations.
				1737
				1738	Floating point denorm mode
				1739	values are defined in
				1740	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1741
				1742	Used by CP to set up
				1743	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1744	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1745	with specified denorm mode
				1746	for half/double (16
				1747	and 64 bit) floating point
				1748	precision floating point
				1749	operations.
				1750
				1751	Floating point denorm mode
				1752	values are defined in
				1753	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1754
				1755	Used by CP to set up
				1756	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1757	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1758
				1759	Start executing wavefront
				1760	in privilege trap handler
				1761	mode.
				1762
				1763	CP is responsible for
				1764	filling in
				1765	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1766	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1767	with DX10 clamp mode
				1768	enabled. Used by the vector
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1769	ALU to force DX10 style
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1770	treatment of NaN's (when
				1771	set, clamp NaN to zero,
				1772	otherwise pass NaN
				1773	through).
				1774
				1775	Used by CP to set up
				1776	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1777	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1778
				1779	Start executing wavefront
				1780	in single step mode.
				1781
				1782	CP is responsible for
				1783	filling in
				1784	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1785	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1786	with IEEE mode
				1787	enabled. Floating point
				1788	opcodes that support
				1789	exception flag gathering
				1790	will quiet and propagate
				1791	signaling-NaN inputs per
				1792	IEEE 754-2008. Min_dx10 and
				1793	max_dx10 become IEEE
				1794	754-2008 compliant due to
				1795	signaling-NaN propagation
				1796	and quieting.
				1797
				1798	Used by CP to set up
				1799	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1800	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1801
				1802	Only one work-group allowed
				1803	to execute on a compute
				1804	unit.
				1805
				1806	CP is responsible for
				1807	filling in
				1808	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1809	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1810
				1811	Flag that can be used to
				1812	control debugging code.
				1813
				1814	CP is responsible for
				1815	filling in
				1816	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1817	26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1818	Reserved, must be 0.
				1819	GFX9
				1820	Wavefront starts execution
				1821	with specified fp16 overflow
				1822	mode.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1823
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1824	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1825	+/-INF values.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1826	- If 1, fp16 overflow that is the
				1827	result of an +/-INF input value
				1828	or divide by 0 produces a +/-INF,
				1829	otherwise clamps computed
				1830	overflow to +/-MAX_FP16 as
				1831	appropriate.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1832
				1833	Used by CP to set up
				1834	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1835	31:27 5 bits Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1836	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1837	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1838
				1839	..
				1840
				1841	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1842	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1843
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1844	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1845	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1846	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1847	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1848	_WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1849	system register (see
				1850	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1851
				1852	Used by CP to set up
				1853	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1854	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1855	user data registers
				1856	requested. This number must
				1857	match the number of user
				1858	data registers enabled.
				1859
				1860	Used by CP to set up
				1861	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	2ca6b1f	2018-05-29 19:09:13 +0000	[diff] [blame]	1862	6 1 bit ENABLE_TRAP_HANDLER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1863
Konstantin Zhuravlyov	2ca6b1f	2018-05-29 19:09:13 +0000	[diff] [blame]	1864	This bit represents
				1865	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``,
				1866	which is set by the CP if
				1867	the runtime has installed a
				1868	trap handler.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1869	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1870	system SGPR register for
				1871	the work-group id in the X
				1872	dimension (see
				1873	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1874
				1875	Used by CP to set up
				1876	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1877	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1878	system SGPR register for
				1879	the work-group id in the Y
				1880	dimension (see
				1881	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1882
				1883	Used by CP to set up
				1884	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1885	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1886	system SGPR register for
				1887	the work-group id in the Z
				1888	dimension (see
				1889	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1890
				1891	Used by CP to set up
				1892	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1893	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1894	system SGPR register for
				1895	work-group information (see
				1896	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1897
				1898	Used by CP to set up
				1899	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1900	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1901	VGPR system registers used
				1902	for the work-item ID.
				1903	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1904	defines the values.
				1905
				1906	Used by CP to set up
				1907	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1908	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1909
				1910	Wavefront starts execution
				1911	with address watch
				1912	exceptions enabled which
				1913	are generated when L1 has
				1914	witnessed a thread access
				1915	an *address of
				1916	interest*.
				1917
				1918	CP is responsible for
				1919	filling in the address
				1920	watch bit in
				1921	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1922	according to what the
				1923	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1924	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1925
				1926	Wavefront starts execution
				1927	with memory violation
				1928	exceptions exceptions
				1929	enabled which are generated
				1930	when a memory violation has
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1931	occurred for this wavefront from
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1932	L1 or LDS
				1933	(write-to-read-only-memory,
				1934	mis-aligned atomic, LDS
				1935	address out of range,
				1936	illegal address, etc.).
				1937
				1938	CP sets the memory
				1939	violation bit in
				1940	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1941	according to what the
				1942	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1943	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1944
				1945	CP uses the rounded value
				1946	from the dispatch packet,
				1947	not this value, as the
				1948	dispatch may contain
				1949	dynamically allocated group
				1950	segment memory. CP writes
				1951	directly to
				1952	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1953
				1954	Amount of group segment
				1955	(LDS) to allocate for each
				1956	work-group. Granularity is
				1957	device specific:
				1958
				1959	GFX6:
				1960	roundup(lds-size / (64 * 4))
				1961	GFX7-GFX9:
				1962	roundup(lds-size / (128 * 4))
				1963
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1964	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				1965	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1966	enabled.
				1967
				1968	Used by CP to set up
				1969	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1970	(set from bits 0..6).
				1971
				1972	IEEE 754 FP Invalid
				1973	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1974	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				1975	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1976	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1977	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				1978	_DIVISION_BY_ZERO Zero
				1979	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
				1980	_OVERFLOW
				1981	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				1982	_UNDERFLOW
				1983	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				1984	_INEXACT
				1985	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				1986	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1987	only)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1988	31 1 bit Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1989	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1990	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1991
				1992	..
				1993
				1994	.. table:: Floating Point Rounding Mode Enumeration Values
				1995	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1996
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1997	====================================== ===== ==============================
				1998	Enumeration Name Value Description
				1999	====================================== ===== ==============================
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2000	FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				2001	FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				2002	FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				2003	FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2004	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2005
				2006	..
				2007
				2008	.. table:: Floating Point Denorm Mode Enumeration Values
				2009	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				2010
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2011	====================================== ===== ==============================
				2012	Enumeration Name Value Description
				2013	====================================== ===== ==============================
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2014	FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2015	Denorms
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2016	FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				2017	FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				2018	FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2019	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2020
				2021	..
				2022
				2023	.. table:: System VGPR Work-Item ID Enumeration Values
				2024	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				2025
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2026	======================================== ===== ============================
				2027	Enumeration Name Value Description
				2028	======================================== ===== ============================
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2029	SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2030	ID.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2031	SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2032	dimensions ID.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2033	SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2034	dimensions ID.
Konstantin Zhuravlyov	00f2cb1	2018-06-12 18:02:46 +0000	[diff] [blame]	2035	SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2036	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2037
				2038	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				2039
				2040	Initial Kernel Execution State
				2041	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				2042
				2043	This section defines the register state that will be set up by the packet
				2044	processor prior to the start of execution of every wavefront. This is limited by
				2045	the constraints of the hardware controllers of CP/ADC/SPI.
				2046
				2047	The order of the SGPR registers is defined, but the compiler can specify which
				2048	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				2049	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2050	for enabled registers are dense starting at SGPR0: the first enabled register is
				2051	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				2052	an SGPR number.
				2053
				2054	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2055	all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2056	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				2057	initialized. These are then immediately followed by the System SGPRs that are
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2058	set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2059	dispatch.
				2060
				2061	SGPR register initial state is defined in
				2062	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				2063
				2064	.. table:: SGPR Register Set Up Order
				2065	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				2066
				2067	========== ========================== ====== ==============================
				2068	SGPR Order Name Number Description
				2069	(kernel descriptor enable of
				2070	field) SGPRs
				2071	========== ========================== ====== ==============================
				2072	First Private Segment Buffer 4 V# that can be used, together
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2073	(enable_sgpr_private with Scratch Wavefront Offset
				2074	_segment_buffer) as an offset, to access the
				2075	private memory space using a
				2076	segment address.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2077
				2078	CP uses the value provided by
				2079	the runtime.
				2080	then Dispatch Ptr 2 64 bit address of AQL dispatch
				2081	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				2082	actually executing.
				2083	then Queue Ptr 2 64 bit address of amd_queue_t
				2084	(enable_sgpr_queue_ptr) object for AQL queue on which
				2085	the dispatch packet was
				2086	queued.
				2087	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				2088	(enable_sgpr_kernarg segment. This is directly
				2089	_segment_ptr) copied from the
				2090	kernarg_address in the kernel
				2091	dispatch packet.
				2092
				2093	Having CP load it once avoids
				2094	loading it at the beginning of
				2095	every wavefront.
				2096	then Dispatch Id 2 64 bit Dispatch ID of the
				2097	(enable_sgpr_dispatch_id) dispatch packet being
				2098	executed.
				2099	then Flat Scratch Init 2 This is 2 SGPRs:
				2100	(enable_sgpr_flat_scratch
				2101	_init) GFX6
				2102	Not supported.
				2103	GFX7-GFX8
				2104	The first SGPR is a 32 bit
				2105	byte offset from
				2106	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2107	to per SPI base of memory
				2108	for scratch for the queue
				2109	executing the kernel
				2110	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2111	from the runtime. (The
				2112	Scratch Segment Buffer base
				2113	address is
				2114	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2115	plus this offset.) The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2116	of Scratch Wavefront Offset must
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2117	be added to this offset by
				2118	the kernel machine code,
				2119	right shifted by 8, and
				2120	moved to the FLAT_SCRATCH_HI
				2121	SGPR register.
				2122	FLAT_SCRATCH_HI corresponds
				2123	to SGPRn-4 on GFX7, and
				2124	SGPRn-6 on GFX8 (where SGPRn
				2125	is the highest numbered SGPR
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2126	allocated to the wavefront).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2127	FLAT_SCRATCH_HI is
				2128	multiplied by 256 (as it is
				2129	in units of 256 bytes) and
				2130	added to
				2131	``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2132	to calculate the per wavefront
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2133	FLAT SCRATCH BASE in flat
				2134	memory instructions that
				2135	access the scratch
				2136	apperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2137
				2138	The second SGPR is 32 bit
				2139	byte size of a single
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2140	work-item's scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2141	usage. CP obtains this from
				2142	the runtime, and it is
				2143	always a multiple of DWORD.
				2144	CP checks that the value in
				2145	the kernel dispatch packet
				2146	Private Segment Byte Size is
				2147	not larger, and requests the
				2148	runtime to increase the
				2149	queue's scratch size if
				2150	necessary. The kernel code
				2151	must move it to
				2152	FLAT_SCRATCH_LO which is
				2153	SGPRn-3 on GFX7 and SGPRn-5
				2154	on GFX8. FLAT_SCRATCH_LO is
				2155	used as the FLAT SCRATCH
				2156	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2157	instructions. Having CP load
				2158	it once avoids loading it at
				2159	the beginning of every
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2160	wavefront.
				2161	GFX9
				2162	This is the
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2163	64 bit base address of the
				2164	per SPI scratch backing
				2165	memory managed by SPI for
				2166	the queue executing the
				2167	kernel dispatch. CP obtains
				2168	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2169	divides it if there are
				2170	multiple Shader Arrays each
				2171	with its own SPI). The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2172	of Scratch Wavefront Offset must
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2173	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2174	machine code and the result
				2175	moved to the FLAT_SCRATCH
				2176	SGPR which is SGPRn-6 and
				2177	SGPRn-5. It is used as the
				2178	FLAT SCRATCH BASE in flat
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2179	memory instructions.
				2180	then Private Segment Size 1 The 32 bit byte size of a
				2181	(enable_sgpr_private single
				2182	work-item's
				2183	scratch_segment_size) memory
				2184	allocation. This is the
				2185	value from the kernel
				2186	dispatch packet Private
				2187	Segment Byte Size rounded up
				2188	by CP to a multiple of
				2189	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2190
				2191	Having CP load it once avoids
				2192	loading it at the beginning of
				2193	every wavefront.
				2194
				2195	This is not used for
				2196	GFX7-GFX8 since it is the same
				2197	value as the second SGPR of
				2198	Flat Scratch Init. However, it
				2199	may be needed for GFX9 which
				2200	changes the meaning of the
				2201	Flat Scratch Init value.
				2202	then Grid Work-Group Count X 1 32 bit count of the number of
				2203	(enable_sgpr_grid work-groups in the X dimension
				2204	_workgroup_count_X) for the grid being
				2205	executed. Computed from the
				2206	fields in the kernel dispatch
				2207	packet as ((grid_size.x +
				2208	workgroup_size.x - 1) /
				2209	workgroup_size.x).
				2210	then Grid Work-Group Count Y 1 32 bit count of the number of
				2211	(enable_sgpr_grid work-groups in the Y dimension
				2212	_workgroup_count_Y && for the grid being
				2213	less than 16 previous executed. Computed from the
				2214	SGPRs) fields in the kernel dispatch
				2215	packet as ((grid_size.y +
				2216	workgroup_size.y - 1) /
				2217	workgroupSize.y).
				2218
				2219	Only initialized if <16
				2220	previous SGPRs initialized.
				2221	then Grid Work-Group Count Z 1 32 bit count of the number of
				2222	(enable_sgpr_grid work-groups in the Z dimension
				2223	_workgroup_count_Z && for the grid being
				2224	less than 16 previous executed. Computed from the
				2225	SGPRs) fields in the kernel dispatch
				2226	packet as ((grid_size.z +
				2227	workgroup_size.z - 1) /
				2228	workgroupSize.z).
				2229
				2230	Only initialized if <16
				2231	previous SGPRs initialized.
				2232	then Work-Group Id X 1 32 bit work-group id in X
				2233	(enable_sgpr_workgroup_id dimension of grid for
				2234	_X) wavefront.
				2235	then Work-Group Id Y 1 32 bit work-group id in Y
				2236	(enable_sgpr_workgroup_id dimension of grid for
				2237	_Y) wavefront.
				2238	then Work-Group Id Z 1 32 bit work-group id in Z
				2239	(enable_sgpr_workgroup_id dimension of grid for
				2240	_Z) wavefront.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2241	then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2242	(enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2243	_info) threadgroup_size_in_wavefronts[5:0]}
				2244	then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2245	(enable_sgpr_private of scratch base of queue
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2246	_segment_wavefront_offset) executing the kernel
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2247	dispatch. Must be used as an
				2248	offset with Private
				2249	segment address when using
				2250	Scratch Segment Buffer. It
				2251	must be used to set up FLAT
				2252	SCRATCH for flat addressing
				2253	(see
				2254	:ref:`amdgpu-amdhsa-flat-scratch`).
				2255	========== ========================== ====== ==============================
				2256
				2257	The order of the VGPR registers is defined, but the compiler can specify which
				2258	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2259	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2260	for enabled registers are dense starting at VGPR0: the first enabled register is
				2261	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2262	VGPR number.
				2263
				2264	VGPR register initial state is defined in
				2265	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2266
				2267	.. table:: VGPR Register Set Up Order
				2268	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2269
				2270	========== ========================== ====== ==============================
				2271	VGPR Order Name Number Description
				2272	(kernel descriptor enable of
				2273	field) VGPRs
				2274	========== ========================== ====== ==============================
				2275	First Work-Item Id X 1 32 bit work item id in X
				2276	(Always initialized) dimension of work-group for
				2277	wavefront lane.
				2278	then Work-Item Id Y 1 32 bit work item id in Y
				2279	(enable_vgpr_workitem_id dimension of work-group for
				2280	> 0) wavefront lane.
				2281	then Work-Item Id Z 1 32 bit work item id in Z
				2282	(enable_vgpr_workitem_id dimension of work-group for
				2283	> 1) wavefront lane.
				2284	========== ========================== ====== ==============================
				2285
Hiroshi Inoue	bcadfee	2018-04-12 05:53:20 +0000	[diff] [blame]	2286	The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2287
				2288	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2289	registers.
				2290	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2291	combination including none.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2292	3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
				2293	its value cannot included with the flat scratch init value which is per queue.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2294	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2295	or (X, Y, Z).
				2296
				2297	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2298	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2299
				2300	The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2301	has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2302	instructions (GFX9).
				2303
				2304	If buffer operations are used then the compiler can generate a V# with the
				2305	following properties:
				2306
				2307	* base address of 0
				2308	* no swizzle
				2309	* ATC: 1 if IOMMU present (such as APU)
				2310	* ptr64: 1
				2311	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2312	APU and NC for dGPU).
				2313
				2314	.. _amdgpu-amdhsa-kernel-prolog:
				2315
				2316	Kernel Prolog
				2317	~~~~~~~~~~~~~
				2318
				2319	.. _amdgpu-amdhsa-m0:
				2320
				2321	M0
				2322	++
				2323
				2324	GFX6-GFX8
				2325	The M0 register must be initialized with a value at least the total LDS size
				2326	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2327	available in dispatch packet. For M0, it is also possible to use maximum
				2328	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2329	GFX7-GFX8).
				2330	GFX9
				2331	The M0 register is not used for range checking LDS accesses and so does not
				2332	need to be initialized in the prolog.
				2333
				2334	.. _amdgpu-amdhsa-flat-scratch:
				2335
				2336	Flat Scratch
				2337	++++++++++++
				2338
				2339	If the kernel may use flat operations to access scratch memory, the prolog code
				2340	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2341	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2342	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2343
				2344	GFX6
				2345	Flat scratch is not supported.
				2346
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2347	GFX7-GFX8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2348	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2349	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2350	being managed by SPI for the queue executing the kernel dispatch. This is
				2351	the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2352	prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2353	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2354	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2355	by 8 before moving into FLAT_SCRATCH_LO.
				2356	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2357	work-items scratch memory usage. This is directly loaded from the kernel
				2358	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2359	DWORD. Having CP load it once avoids loading it at the beginning of every
				2360	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2361	SIZE.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2362
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2363	GFX9
				2364	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2365	memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2366	prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2367	pair for use as the flat scratch base in flat memory instructions.
				2368
				2369	.. _amdgpu-amdhsa-memory-model:
				2370
				2371	Memory Model
				2372	~~~~~~~~~~~~
				2373
				2374	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2375	(see :ref:`memmodel`). The implementation is WIP.
				2376
				2377	.. TODO
				2378	Update when implementation complete.
				2379
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2380	The AMDGPU backend supports the memory synchronization scopes specified in
				2381	:ref:`amdgpu-memory-scopes`.
				2382
				2383	The code sequences used to implement the memory model are defined in table
				2384	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2385
				2386	The sequences specify the order of instructions that a single thread must
				2387	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2388	to other memory instructions executed by the same thread. This allows them to be
				2389	moved earlier or later which can allow them to be combined with other instances
				2390	of the same instruction, or hoisted/sunk out of loops to improve
				2391	performance. Only the instructions related to the memory model are given;
				2392	additional ``s_waitcnt`` instructions are required to ensure registers are
				2393	defined before being used. These may be able to be combined with the memory
				2394	model ``s_waitcnt`` instructions as described above.
				2395
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2396	The AMDGPU backend supports the following memory models:
				2397
				2398	HSA Memory Model [HSA]_
				2399	The HSA memory model uses a single happens-before relation for all address
				2400	spaces (see :ref:`amdgpu-address-spaces`).
				2401	OpenCL Memory Model [OpenCL]_
				2402	The OpenCL memory model which has separate happens-before relations for the
				2403	global and local address spaces. Only a fence specifying both global and
				2404	local address space, and seq_cst instructions join the relationships. Since
				2405	the LLVM ``memfence`` instruction does not allow an address space to be
				2406	specified the OpenCL fence has to convervatively assume both local and
				2407	global address space was specified. However, optimizations can often be
				2408	done to eliminate the additional ``s_waitcnt`` instructions when there are
				2409	no intervening memory instructions which access the corresponding address
				2410	space. The code sequences in the table indicate what can be omitted for the
				2411	OpenCL memory. The target triple environment is used to determine if the
				2412	source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2413
				2414	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2415	operations.
				2416
				2417	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2418	termed vector memory operations.
				2419
				2420	For GFX6-GFX9:
				2421
				2422	* Each agent has multiple compute units (CU).
				2423	* Each CU has multiple SIMDs that execute wavefronts.
				2424	* The wavefronts for a single work-group are executed in the same CU but may be
				2425	executed by different SIMDs.
				2426	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2427	executing on it.
				2428	* All LDS operations of a CU are performed as wavefront wide operations in a
				2429	global order and involve no caching. Completion is reported to a wavefront in
				2430	execution order.
				2431	* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2432	CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2433	can be reordered relative to each other, which can result in reordering the
				2434	visibility of vector memory operations with respect to LDS operations of other
				2435	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2436	ensure synchronization between LDS operations and vector memory operations
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2437	between wavefronts of a work-group, but not between operations performed by the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2438	same wavefront.
				2439	* The vector memory operations are performed as wavefront wide operations and
				2440	completion is reported to a wavefront in execution order. The exception is
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2441	that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2442	vector memory order if they access LDS memory, and out of LDS operation order
				2443	if they access global memory.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2444	* The vector memory operations access a single vector L1 cache shared by all
				2445	SIMDs a CU. Therefore, no special action is required for coherence between the
				2446	lanes of a single wavefront, or for coherence between wavefronts in the same
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2447	work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2448	executing in different work-groups as they may be executing on different CUs.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2449	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2450	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2451	scalar operations are used in a restricted way so do not impact the memory
				2452	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2453	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2454	the same agent.
				2455	* The L2 cache has independent channels to service disjoint ranges of virtual
				2456	addresses.
				2457	* Each CU has a separate request queue per channel. Therefore, the vector and
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2458	scalar memory operations performed by wavefronts executing in different work-groups
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2459	(which may be executing on different CUs) of an agent can be reordered
				2460	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2461	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2462	previous vector memory operation has completed before executing a subsequent
				2463	vector memory or LDS operation and so can be used to meet the requirements of
				2464	acquire and release.
				2465	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2466	of virtual addresses can be set up to bypass it to ensure system coherence.
				2467
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2468	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2469	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2470	memory, atomic memory orderings are not meaningful and all accesses are treated
				2471	as non-atomic.
				2472
				2473	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2474	scalar memory instructions). Since the constant address space contents do not
				2475	change during the execution of a kernel dispatch it is not legal to perform
				2476	stores, and atomic memory orderings are not meaningful and all access are
				2477	treated as non-atomic.
				2478
				2479	A memory synchronization scope wider than work-group is not meaningful for the
				2480	group (LDS) address space and is treated as work-group.
				2481
				2482	The memory model does not support the region address space which is treated as
				2483	non-atomic.
				2484
				2485	Acquire memory ordering is not meaningful on store atomic instructions and is
				2486	treated as non-atomic.
				2487
				2488	Release memory ordering is not meaningful on load atomic instructions and is
				2489	treated a non-atomic.
				2490
				2491	Acquire-release memory ordering is not meaningful on load or store atomic
				2492	instructions and is treated as acquire and release respectively.
				2493
				2494	AMDGPU backend only uses scalar memory operations to access memory that is
				2495	proven to not change during the execution of the kernel dispatch. This includes
				2496	constant address space and global address space for program scope const
				2497	variables. Therefore the kernel machine code does not have to maintain the
				2498	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2499	and vector L1 caches are invalidated between kernel dispatches by CP since
				2500	constant address space data may change between kernel dispatch executions. See
				2501	:ref:`amdgpu-amdhsa-memory-spaces`.
				2502
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2503	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2504	case the AMDGPU backend ensures the memory location used to spill is never
				2505	accessed by vector memory operations at the same time. If scalar writes are used
				2506	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2507	return since the locations may be used for vector memory instructions by a
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2508	future wavefront that uses the same scratch area, or a function call that creates a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2509	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2510	as all scalar writes are write-before-read in the same thread.
				2511
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2512	Scratch backing memory (which is used for the private address space)
				2513	is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
				2514	address space is only accessed by a single thread, and is always
				2515	write-before-read, there is never a need to invalidate these entries from the L1
				2516	cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
				2517	volatile cache lines.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2518
				2519	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2520	to invalidate the L2 cache. This also causes it to be treated as
				2521	non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
				2522	(cache coherent) and so the L2 cache will coherent with the CPU and other
				2523	agents.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2524
				2525	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2526	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2527
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2528	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2529	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2530	Ordering Sync Scope Address
				2531	Space
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2532	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2533	Non-Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2534	-----------------------------------------------------------------------------------
				2535	load none none - global - !volatile & !nontemporal
				2536	- generic
				2537	- private 1. buffer/global/flat_load
				2538	- constant
				2539	- volatile & !nontemporal
				2540
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2541	1. buffer/global/flat_load
				2542	glc=1
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2543
				2544	- nontemporal
				2545
				2546	1. buffer/global/flat_load
				2547	glc=1 slc=1
				2548
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2549	load none none - local 1. ds_load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2550	store none none - global - !nontemporal
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2551	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2552	- private 1. buffer/global/flat_store
				2553	- constant
				2554	- nontemporal
				2555
				2556	1. buffer/global/flat_stote
				2557	glc=1 slc=1
				2558
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2559	store none none - local 1. ds_store
				2560	Unordered Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2561	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2562	load atomic unordered any any Same as non-atomic.
				2563	store atomic unordered any any Same as non-atomic.
				2564	atomicrmw unordered any any *Same as monotonic
				2565	atomic*.
				2566	Monotonic Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2567	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2568	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2569	- wavefront - generic
				2570	- workgroup
				2571	load atomic monotonic - singlethread - local 1. ds_load
				2572	- wavefront
				2573	- workgroup
				2574	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2575	- system - generic glc=1
				2576	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2577	- wavefront - generic
				2578	- workgroup
				2579	- agent
				2580	- system
				2581	store atomic monotonic - singlethread - local 1. ds_store
				2582	- wavefront
				2583	- workgroup
				2584	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2585	- wavefront - generic
				2586	- workgroup
				2587	- agent
				2588	- system
				2589	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2590	- wavefront
				2591	- workgroup
				2592	Acquire Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2593	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2594	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2595	- wavefront - local
				2596	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2597	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				2598	load atomic acquire - workgroup - local 1. ds_load
				2599	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2600
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2601	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2602	- Must happen before
				2603	any following
				2604	global/generic
				2605	load/load
				2606	atomic/store/store
				2607	atomic/atomicrmw.
				2608	- Ensures any
				2609	following global
				2610	data read is no
				2611	older than the load
				2612	atomic value being
				2613	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2614	load atomic acquire - workgroup - generic 1. flat_load
				2615	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2616
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2617	- If OpenCL, omit.
				2618	- Must happen before
				2619	any following
				2620	global/generic
				2621	load/load
				2622	atomic/store/store
				2623	atomic/atomicrmw.
				2624	- Ensures any
				2625	following global
				2626	data read is no
				2627	older than the load
				2628	atomic value being
				2629	acquired.
				2630	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2631	- system glc=1
				2632	2. s_waitcnt vmcnt(0)
				2633
				2634	- Must happen before
				2635	following
				2636	buffer_wbinvl1_vol.
				2637	- Ensures the load
				2638	has completed
				2639	before invalidating
				2640	the cache.
				2641
				2642	3. buffer_wbinvl1_vol
				2643
				2644	- Must happen before
				2645	any following
				2646	global/generic
				2647	load/load
				2648	atomic/atomicrmw.
				2649	- Ensures that
				2650	following
				2651	loads will not see
				2652	stale global data.
				2653
				2654	load atomic acquire - agent - generic 1. flat_load glc=1
				2655	- system 2. s_waitcnt vmcnt(0) &
				2656	lgkmcnt(0)
				2657
				2658	- If OpenCL omit
				2659	lgkmcnt(0).
				2660	- Must happen before
				2661	following
				2662	buffer_wbinvl1_vol.
				2663	- Ensures the flat_load
				2664	has completed
				2665	before invalidating
				2666	the cache.
				2667
				2668	3. buffer_wbinvl1_vol
				2669
				2670	- Must happen before
				2671	any following
				2672	global/generic
				2673	load/load
				2674	atomic/atomicrmw.
				2675	- Ensures that
				2676	following loads
				2677	will not see stale
				2678	global data.
				2679
				2680	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2681	- wavefront - local
				2682	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2683	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				2684	atomicrmw acquire - workgroup - local 1. ds_atomic
				2685	2. waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2686
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2687	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2688	- Must happen before
				2689	any following
				2690	global/generic
				2691	load/load
				2692	atomic/store/store
				2693	atomic/atomicrmw.
				2694	- Ensures any
				2695	following global
				2696	data read is no
				2697	older than the
				2698	atomicrmw value
				2699	being acquired.
				2700
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2701	atomicrmw acquire - workgroup - generic 1. flat_atomic
				2702	2. waitcnt lgkmcnt(0)
				2703
				2704	- If OpenCL, omit.
				2705	- Must happen before
				2706	any following
				2707	global/generic
				2708	load/load
				2709	atomic/store/store
				2710	atomic/atomicrmw.
				2711	- Ensures any
				2712	following global
				2713	data read is no
				2714	older than the
				2715	atomicrmw value
				2716	being acquired.
				2717
				2718	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2719	- system 2. s_waitcnt vmcnt(0)
				2720
				2721	- Must happen before
				2722	following
				2723	buffer_wbinvl1_vol.
				2724	- Ensures the
				2725	atomicrmw has
				2726	completed before
				2727	invalidating the
				2728	cache.
				2729
				2730	3. buffer_wbinvl1_vol
				2731
				2732	- Must happen before
				2733	any following
				2734	global/generic
				2735	load/load
				2736	atomic/atomicrmw.
				2737	- Ensures that
				2738	following loads
				2739	will not see stale
				2740	global data.
				2741
				2742	atomicrmw acquire - agent - generic 1. flat_atomic
				2743	- system 2. s_waitcnt vmcnt(0) &
				2744	lgkmcnt(0)
				2745
				2746	- If OpenCL, omit
				2747	lgkmcnt(0).
				2748	- Must happen before
				2749	following
				2750	buffer_wbinvl1_vol.
				2751	- Ensures the
				2752	atomicrmw has
				2753	completed before
				2754	invalidating the
				2755	cache.
				2756
				2757	3. buffer_wbinvl1_vol
				2758
				2759	- Must happen before
				2760	any following
				2761	global/generic
				2762	load/load
				2763	atomic/atomicrmw.
				2764	- Ensures that
				2765	following loads
				2766	will not see stale
				2767	global data.
				2768
				2769	fence acquire - singlethread none none
				2770	- wavefront
				2771	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2772
				2773	- If OpenCL and
				2774	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2775	not generic, omit.
				2776	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2777	currently has no
				2778	address space on
				2779	the fence need to
				2780	conservatively
				2781	always generate. If
				2782	fence had an
				2783	address space then
				2784	set to address
				2785	space of OpenCL
				2786	fence flag, or to
				2787	generic if both
				2788	local and global
				2789	flags are
				2790	specified.
				2791	- Must happen after
				2792	any preceding
				2793	local/generic load
				2794	atomic/atomicrmw
				2795	with an equal or
				2796	wider sync scope
				2797	and memory ordering
				2798	stronger than
				2799	unordered (this is
				2800	termed the
				2801	fence-paired-atomic).
				2802	- Must happen before
				2803	any following
				2804	global/generic
				2805	load/load
				2806	atomic/store/store
				2807	atomic/atomicrmw.
				2808	- Ensures any
				2809	following global
				2810	data read is no
				2811	older than the
				2812	value read by the
				2813	fence-paired-atomic.
				2814
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2815	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				2816	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2817
				2818	- If OpenCL and
				2819	address space is
				2820	not generic, omit
				2821	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2822	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2823	currently has no
				2824	address space on
				2825	the fence need to
				2826	conservatively
				2827	always generate
				2828	(see comment for
				2829	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2830	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2831	separate s_waitcnt
				2832	vmcnt(0) and
				2833	s_waitcnt
				2834	lgkmcnt(0) to allow
				2835	them to be
				2836	independently moved
				2837	according to the
				2838	following rules.
				2839	- s_waitcnt vmcnt(0)
				2840	must happen after
				2841	any preceding
				2842	global/generic load
				2843	atomic/atomicrmw
				2844	with an equal or
				2845	wider sync scope
				2846	and memory ordering
				2847	stronger than
				2848	unordered (this is
				2849	termed the
				2850	fence-paired-atomic).
				2851	- s_waitcnt lgkmcnt(0)
				2852	must happen after
				2853	any preceding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2854	local/generic load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2855	atomic/atomicrmw
				2856	with an equal or
				2857	wider sync scope
				2858	and memory ordering
				2859	stronger than
				2860	unordered (this is
				2861	termed the
				2862	fence-paired-atomic).
				2863	- Must happen before
				2864	the following
				2865	buffer_wbinvl1_vol.
				2866	- Ensures that the
				2867	fence-paired atomic
				2868	has completed
				2869	before invalidating
				2870	the
				2871	cache. Therefore
				2872	any following
				2873	locations read must
				2874	be no older than
				2875	the value read by
				2876	the
				2877	fence-paired-atomic.
				2878
				2879	2. buffer_wbinvl1_vol
				2880
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2881	- Must happen before any
				2882	following global/generic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2883	load/load
				2884	atomic/store/store
				2885	atomic/atomicrmw.
				2886	- Ensures that
				2887	following loads
				2888	will not see stale
				2889	global data.
				2890
				2891	Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2892	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2893	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2894	- wavefront - local
				2895	- generic
				2896	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2897
				2898	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2899	- Must happen after
				2900	any preceding
				2901	local/generic
				2902	load/store/load
				2903	atomic/store
				2904	atomic/atomicrmw.
				2905	- Must happen before
				2906	the following
				2907	store.
				2908	- Ensures that all
				2909	memory operations
				2910	to local have
				2911	completed before
				2912	performing the
				2913	store that is being
				2914	released.
				2915
				2916	2. buffer/global/flat_store
				2917	store atomic release - workgroup - local 1. ds_store
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2918	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2919
				2920	- If OpenCL, omit.
				2921	- Must happen after
				2922	any preceding
				2923	local/generic
				2924	load/store/load
				2925	atomic/store
				2926	atomic/atomicrmw.
				2927	- Must happen before
				2928	the following
				2929	store.
				2930	- Ensures that all
				2931	memory operations
				2932	to local have
				2933	completed before
				2934	performing the
				2935	store that is being
				2936	released.
				2937
				2938	2. flat_store
				2939	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2940	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2941
				2942	- If OpenCL, omit
				2943	lgkmcnt(0).
				2944	- Could be split into
				2945	separate s_waitcnt
				2946	vmcnt(0) and
				2947	s_waitcnt
				2948	lgkmcnt(0) to allow
				2949	them to be
				2950	independently moved
				2951	according to the
				2952	following rules.
				2953	- s_waitcnt vmcnt(0)
				2954	must happen after
				2955	any preceding
				2956	global/generic
				2957	load/store/load
				2958	atomic/store
				2959	atomic/atomicrmw.
				2960	- s_waitcnt lgkmcnt(0)
				2961	must happen after
				2962	any preceding
				2963	local/generic
				2964	load/store/load
				2965	atomic/store
				2966	atomic/atomicrmw.
				2967	- Must happen before
				2968	the following
				2969	store.
				2970	- Ensures that all
				2971	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2972	to memory have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2973	completed before
				2974	performing the
				2975	store that is being
				2976	released.
				2977
				2978	2. buffer/global/ds/flat_store
				2979	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2980	- wavefront - local
				2981	- generic
				2982	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2983
				2984	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2985	- Must happen after
				2986	any preceding
				2987	local/generic
				2988	load/store/load
				2989	atomic/store
				2990	atomic/atomicrmw.
				2991	- Must happen before
				2992	the following
				2993	atomicrmw.
				2994	- Ensures that all
				2995	memory operations
				2996	to local have
				2997	completed before
				2998	performing the
				2999	atomicrmw that is
				3000	being released.
				3001
				3002	2. buffer/global/flat_atomic
				3003	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3004	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3005
				3006	- If OpenCL, omit.
				3007	- Must happen after
				3008	any preceding
				3009	local/generic
				3010	load/store/load
				3011	atomic/store
				3012	atomic/atomicrmw.
				3013	- Must happen before
				3014	the following
				3015	atomicrmw.
				3016	- Ensures that all
				3017	memory operations
				3018	to local have
				3019	completed before
				3020	performing the
				3021	atomicrmw that is
				3022	being released.
				3023
				3024	2. flat_atomic
				3025	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				3026	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3027
				3028	- If OpenCL, omit
				3029	lgkmcnt(0).
				3030	- Could be split into
				3031	separate s_waitcnt
				3032	vmcnt(0) and
				3033	s_waitcnt
				3034	lgkmcnt(0) to allow
				3035	them to be
				3036	independently moved
				3037	according to the
				3038	following rules.
				3039	- s_waitcnt vmcnt(0)
				3040	must happen after
				3041	any preceding
				3042	global/generic
				3043	load/store/load
				3044	atomic/store
				3045	atomic/atomicrmw.
				3046	- s_waitcnt lgkmcnt(0)
				3047	must happen after
				3048	any preceding
				3049	local/generic
				3050	load/store/load
				3051	atomic/store
				3052	atomic/atomicrmw.
				3053	- Must happen before
				3054	the following
				3055	atomicrmw.
				3056	- Ensures that all
				3057	memory operations
				3058	to global and local
				3059	have completed
				3060	before performing
				3061	the atomicrmw that
				3062	is being released.
				3063
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3064	2. buffer/global/ds/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3065	fence release - singlethread none none
				3066	- wavefront
				3067	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				3068
				3069	- If OpenCL and
				3070	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3071	not generic, omit.
				3072	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3073	currently has no
				3074	address space on
				3075	the fence need to
				3076	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3077	always generate. If
				3078	fence had an
				3079	address space then
				3080	set to address
				3081	space of OpenCL
				3082	fence flag, or to
				3083	generic if both
				3084	local and global
				3085	flags are
				3086	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3087	- Must happen after
				3088	any preceding
				3089	local/generic
				3090	load/load
				3091	atomic/store/store
				3092	atomic/atomicrmw.
				3093	- Must happen before
				3094	any following store
				3095	atomic/atomicrmw
				3096	with an equal or
				3097	wider sync scope
				3098	and memory ordering
				3099	stronger than
				3100	unordered (this is
				3101	termed the
				3102	fence-paired-atomic).
				3103	- Ensures that all
				3104	memory operations
				3105	to local have
				3106	completed before
				3107	performing the
				3108	following
				3109	fence-paired-atomic.
				3110
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3111	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				3112	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3113
				3114	- If OpenCL and
				3115	address space is
				3116	not generic, omit
				3117	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3118	- If OpenCL and
				3119	address space is
				3120	local, omit
				3121	vmcnt(0).
				3122	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3123	currently has no
				3124	address space on
				3125	the fence need to
				3126	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3127	always generate. If
				3128	fence had an
				3129	address space then
				3130	set to address
				3131	space of OpenCL
				3132	fence flag, or to
				3133	generic if both
				3134	local and global
				3135	flags are
				3136	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3137	- Could be split into
				3138	separate s_waitcnt
				3139	vmcnt(0) and
				3140	s_waitcnt
				3141	lgkmcnt(0) to allow
				3142	them to be
				3143	independently moved
				3144	according to the
				3145	following rules.
				3146	- s_waitcnt vmcnt(0)
				3147	must happen after
				3148	any preceding
				3149	global/generic
				3150	load/store/load
				3151	atomic/store
				3152	atomic/atomicrmw.
				3153	- s_waitcnt lgkmcnt(0)
				3154	must happen after
				3155	any preceding
				3156	local/generic
				3157	load/store/load
				3158	atomic/store
				3159	atomic/atomicrmw.
				3160	- Must happen before
				3161	any following store
				3162	atomic/atomicrmw
				3163	with an equal or
				3164	wider sync scope
				3165	and memory ordering
				3166	stronger than
				3167	unordered (this is
				3168	termed the
				3169	fence-paired-atomic).
				3170	- Ensures that all
				3171	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3172	have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3173	completed before
				3174	performing the
				3175	following
				3176	fence-paired-atomic.
				3177
				3178	Acquire-Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3179	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3180	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3181	- wavefront - local
				3182	- generic
				3183	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3184
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3185	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3186	- Must happen after
				3187	any preceding
				3188	local/generic
				3189	load/store/load
				3190	atomic/store
				3191	atomic/atomicrmw.
				3192	- Must happen before
				3193	the following
				3194	atomicrmw.
				3195	- Ensures that all
				3196	memory operations
				3197	to local have
				3198	completed before
				3199	performing the
				3200	atomicrmw that is
				3201	being released.
				3202
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3203	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3204	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3205	2. s_waitcnt lgkmcnt(0)
				3206
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3207	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3208	- Must happen before
				3209	any following
				3210	global/generic
				3211	load/load
				3212	atomic/store/store
				3213	atomic/atomicrmw.
				3214	- Ensures any
				3215	following global
				3216	data read is no
				3217	older than the load
				3218	atomic value being
				3219	acquired.
				3220
				3221	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3222
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3223	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3224	- Must happen after
				3225	any preceding
				3226	local/generic
				3227	load/store/load
				3228	atomic/store
				3229	atomic/atomicrmw.
				3230	- Must happen before
				3231	the following
				3232	atomicrmw.
				3233	- Ensures that all
				3234	memory operations
				3235	to local have
				3236	completed before
				3237	performing the
				3238	atomicrmw that is
				3239	being released.
				3240
				3241	2. flat_atomic
				3242	3. s_waitcnt lgkmcnt(0)
				3243
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3244	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3245	- Must happen before
				3246	any following
				3247	global/generic
				3248	load/load
				3249	atomic/store/store
				3250	atomic/atomicrmw.
				3251	- Ensures any
				3252	following global
				3253	data read is no
				3254	older than the load
				3255	atomic value being
				3256	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3257
				3258	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3259	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3260
				3261	- If OpenCL, omit
				3262	lgkmcnt(0).
				3263	- Could be split into
				3264	separate s_waitcnt
				3265	vmcnt(0) and
				3266	s_waitcnt
				3267	lgkmcnt(0) to allow
				3268	them to be
				3269	independently moved
				3270	according to the
				3271	following rules.
				3272	- s_waitcnt vmcnt(0)
				3273	must happen after
				3274	any preceding
				3275	global/generic
				3276	load/store/load
				3277	atomic/store
				3278	atomic/atomicrmw.
				3279	- s_waitcnt lgkmcnt(0)
				3280	must happen after
				3281	any preceding
				3282	local/generic
				3283	load/store/load
				3284	atomic/store
				3285	atomic/atomicrmw.
				3286	- Must happen before
				3287	the following
				3288	atomicrmw.
				3289	- Ensures that all
				3290	memory operations
				3291	to global have
				3292	completed before
				3293	performing the
				3294	atomicrmw that is
				3295	being released.
				3296
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3297	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3298	3. s_waitcnt vmcnt(0)
				3299
				3300	- Must happen before
				3301	following
				3302	buffer_wbinvl1_vol.
				3303	- Ensures the
				3304	atomicrmw has
				3305	completed before
				3306	invalidating the
				3307	cache.
				3308
				3309	4. buffer_wbinvl1_vol
				3310
				3311	- Must happen before
				3312	any following
				3313	global/generic
				3314	load/load
				3315	atomic/atomicrmw.
				3316	- Ensures that
				3317	following loads
				3318	will not see stale
				3319	global data.
				3320
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3321	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3322	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3323
				3324	- If OpenCL, omit
				3325	lgkmcnt(0).
				3326	- Could be split into
				3327	separate s_waitcnt
				3328	vmcnt(0) and
				3329	s_waitcnt
				3330	lgkmcnt(0) to allow
				3331	them to be
				3332	independently moved
				3333	according to the
				3334	following rules.
				3335	- s_waitcnt vmcnt(0)
				3336	must happen after
				3337	any preceding
				3338	global/generic
				3339	load/store/load
				3340	atomic/store
				3341	atomic/atomicrmw.
				3342	- s_waitcnt lgkmcnt(0)
				3343	must happen after
				3344	any preceding
				3345	local/generic
				3346	load/store/load
				3347	atomic/store
				3348	atomic/atomicrmw.
				3349	- Must happen before
				3350	the following
				3351	atomicrmw.
				3352	- Ensures that all
				3353	memory operations
				3354	to global have
				3355	completed before
				3356	performing the
				3357	atomicrmw that is
				3358	being released.
				3359
				3360	2. flat_atomic
				3361	3. s_waitcnt vmcnt(0) &
				3362	lgkmcnt(0)
				3363
				3364	- If OpenCL, omit
				3365	lgkmcnt(0).
				3366	- Must happen before
				3367	following
				3368	buffer_wbinvl1_vol.
				3369	- Ensures the
				3370	atomicrmw has
				3371	completed before
				3372	invalidating the
				3373	cache.
				3374
				3375	4. buffer_wbinvl1_vol
				3376
				3377	- Must happen before
				3378	any following
				3379	global/generic
				3380	load/load
				3381	atomic/atomicrmw.
				3382	- Ensures that
				3383	following loads
				3384	will not see stale
				3385	global data.
				3386
				3387	fence acq_rel - singlethread none none
				3388	- wavefront
				3389	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3390
				3391	- If OpenCL and
				3392	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3393	not generic, omit.
				3394	- However,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3395	since LLVM
				3396	currently has no
				3397	address space on
				3398	the fence need to
				3399	conservatively
				3400	always generate
				3401	(see comment for
				3402	previous fence).
				3403	- Must happen after
				3404	any preceding
				3405	local/generic
				3406	load/load
				3407	atomic/store/store
				3408	atomic/atomicrmw.
				3409	- Must happen before
				3410	any following
				3411	global/generic
				3412	load/load
				3413	atomic/store/store
				3414	atomic/atomicrmw.
				3415	- Ensures that all
				3416	memory operations
				3417	to local have
				3418	completed before
				3419	performing any
				3420	following global
				3421	memory operations.
				3422	- Ensures that the
				3423	preceding
				3424	local/generic load
				3425	atomic/atomicrmw
				3426	with an equal or
				3427	wider sync scope
				3428	and memory ordering
				3429	stronger than
				3430	unordered (this is
				3431	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3432	acquire-fence-paired-atomic
				3433	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3434	before following
				3435	global memory
				3436	operations. This
				3437	satisfies the
				3438	requirements of
				3439	acquire.
				3440	- Ensures that all
				3441	previous memory
				3442	operations have
				3443	completed before a
				3444	following
				3445	local/generic store
				3446	atomic/atomicrmw
				3447	with an equal or
				3448	wider sync scope
				3449	and memory ordering
				3450	stronger than
				3451	unordered (this is
				3452	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3453	release-fence-paired-atomic
				3454	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3455	requirements of
				3456	release.
				3457
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3458	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				3459	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3460
				3461	- If OpenCL and
				3462	address space is
				3463	not generic, omit
				3464	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3465	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3466	currently has no
				3467	address space on
				3468	the fence need to
				3469	conservatively
				3470	always generate
				3471	(see comment for
				3472	previous fence).
				3473	- Could be split into
				3474	separate s_waitcnt
				3475	vmcnt(0) and
				3476	s_waitcnt
				3477	lgkmcnt(0) to allow
				3478	them to be
				3479	independently moved
				3480	according to the
				3481	following rules.
				3482	- s_waitcnt vmcnt(0)
				3483	must happen after
				3484	any preceding
				3485	global/generic
				3486	load/store/load
				3487	atomic/store
				3488	atomic/atomicrmw.
				3489	- s_waitcnt lgkmcnt(0)
				3490	must happen after
				3491	any preceding
				3492	local/generic
				3493	load/store/load
				3494	atomic/store
				3495	atomic/atomicrmw.
				3496	- Must happen before
				3497	the following
				3498	buffer_wbinvl1_vol.
				3499	- Ensures that the
				3500	preceding
				3501	global/local/generic
				3502	load
				3503	atomic/atomicrmw
				3504	with an equal or
				3505	wider sync scope
				3506	and memory ordering
				3507	stronger than
				3508	unordered (this is
				3509	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3510	acquire-fence-paired-atomic
				3511	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3512	before invalidating
				3513	the cache. This
				3514	satisfies the
				3515	requirements of
				3516	acquire.
				3517	- Ensures that all
				3518	previous memory
				3519	operations have
				3520	completed before a
				3521	following
				3522	global/local/generic
				3523	store
				3524	atomic/atomicrmw
				3525	with an equal or
				3526	wider sync scope
				3527	and memory ordering
				3528	stronger than
				3529	unordered (this is
				3530	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3531	release-fence-paired-atomic
				3532	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3533	requirements of
				3534	release.
				3535
				3536	2. buffer_wbinvl1_vol
				3537
				3538	- Must happen before
				3539	any following
				3540	global/generic
				3541	load/load
				3542	atomic/store/store
				3543	atomic/atomicrmw.
				3544	- Ensures that
				3545	following loads
				3546	will not see stale
				3547	global data. This
				3548	satisfies the
				3549	requirements of
				3550	acquire.
				3551
				3552	Sequential Consistent Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3553	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3554	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3555	- wavefront - local load atomic acquire,
				3556	- generic except must generated
				3557	all instructions even
				3558	for OpenCL.*
				3559	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3560	- generic
				3561	- Must
				3562	happen after
				3563	preceding
				3564	global/generic load
				3565	atomic/store
				3566	atomic/atomicrmw
				3567	with memory
				3568	ordering of seq_cst
				3569	and with equal or
				3570	wider sync scope.
				3571	(Note that seq_cst
				3572	fences have their
				3573	own s_waitcnt
				3574	lgkmcnt(0) and so do
				3575	not need to be
				3576	considered.)
				3577	- Ensures any
				3578	preceding
				3579	sequential
				3580	consistent local
				3581	memory instructions
				3582	have completed
				3583	before executing
				3584	this sequentially
				3585	consistent
				3586	instruction. This
				3587	prevents reordering
				3588	a seq_cst store
				3589	followed by a
				3590	seq_cst load. (Note
				3591	that seq_cst is
				3592	stronger than
				3593	acquire/release as
				3594	the reordering of
				3595	load acquire
				3596	followed by a store
				3597	release is
				3598	prevented by the
				3599	waitcnt of
				3600	the release, but
				3601	there is nothing
				3602	preventing a store
				3603	release followed by
				3604	load acquire from
				3605	competing out of
				3606	order.)
				3607
				3608	2. *Following
				3609	instructions same as
				3610	corresponding load
				3611	atomic acquire,
				3612	except must generated
				3613	all instructions even
				3614	for OpenCL.*
				3615	load atomic seq_cst - workgroup - local *Same as corresponding
				3616	load atomic acquire,
				3617	except must generated
				3618	all instructions even
				3619	for OpenCL.*
				3620	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				3621	- system - generic vmcnt(0)
				3622
				3623	- Could be split into
				3624	separate s_waitcnt
				3625	vmcnt(0)
				3626	and s_waitcnt
				3627	lgkmcnt(0) to allow
				3628	them to be
				3629	independently moved
				3630	according to the
				3631	following rules.
				3632	- waitcnt lgkmcnt(0)
				3633	must happen after
				3634	preceding
				3635	global/generic load
				3636	atomic/store
				3637	atomic/atomicrmw
				3638	with memory
				3639	ordering of seq_cst
				3640	and with equal or
				3641	wider sync scope.
				3642	(Note that seq_cst
				3643	fences have their
				3644	own s_waitcnt
				3645	lgkmcnt(0) and so do
				3646	not need to be
				3647	considered.)
				3648	- waitcnt vmcnt(0)
				3649	must happen after
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3650	preceding
				3651	global/generic load
				3652	atomic/store
				3653	atomic/atomicrmw
				3654	with memory
				3655	ordering of seq_cst
				3656	and with equal or
				3657	wider sync scope.
				3658	(Note that seq_cst
				3659	fences have their
				3660	own s_waitcnt
				3661	vmcnt(0) and so do
				3662	not need to be
				3663	considered.)
				3664	- Ensures any
				3665	preceding
				3666	sequential
				3667	consistent global
				3668	memory instructions
				3669	have completed
				3670	before executing
				3671	this sequentially
				3672	consistent
				3673	instruction. This
				3674	prevents reordering
				3675	a seq_cst store
				3676	followed by a
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3677	seq_cst load. (Note
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3678	that seq_cst is
				3679	stronger than
				3680	acquire/release as
				3681	the reordering of
				3682	load acquire
				3683	followed by a store
				3684	release is
				3685	prevented by the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3686	waitcnt of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3687	the release, but
				3688	there is nothing
				3689	preventing a store
				3690	release followed by
				3691	load acquire from
				3692	competing out of
				3693	order.)
				3694
				3695	2. *Following
				3696	instructions same as
				3697	corresponding load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3698	atomic acquire,
				3699	except must generated
				3700	all instructions even
				3701	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3702	store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3703	- wavefront - local store atomic release,
				3704	- workgroup - generic except must generated
				3705	all instructions even
				3706	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3707	store atomic seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3708	- system - generic store atomic release,
				3709	except must generated
				3710	all instructions even
				3711	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3712	atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3713	- wavefront - local atomicrmw acq_rel,
				3714	- workgroup - generic except must generated
				3715	all instructions even
				3716	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3717	atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3718	- system - generic atomicrmw acq_rel,
				3719	except must generated
				3720	all instructions even
				3721	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3722	fence seq_cst - singlethread none *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3723	- wavefront fence acq_rel,
				3724	- workgroup except must generated
				3725	- agent all instructions even
				3726	- system for OpenCL.*
				3727	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3728
				3729	The memory order also adds the single thread optimization constrains defined in
				3730	table
				3731	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3732
				3733	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3734	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3735
				3736	============ ==============================================================
				3737	LLVM Memory Optimization Constraints
				3738	Ordering
				3739	============ ==============================================================
				3740	unordered none
				3741	monotonic none
				3742	acquire - If a load atomic/atomicrmw then no following load/load
				3743	atomic/store/ store atomic/atomicrmw/fence instruction can
				3744	be moved before the acquire.
				3745	- If a fence then same as load atomic, plus no preceding
				3746	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3747	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3748	atomic/store/ store atomic/atomicrmw/fence instruction can
				3749	be moved after the release.
				3750	- If a fence then same as store atomic, plus no following
				3751	associated fence-paired-atomic can be moved before the
				3752	fence.
				3753	acq_rel Same constraints as both acquire and release.
				3754	seq_cst - If a load atomic then same constraints as acquire, plus no
				3755	preceding sequentially consistent load atomic/store
				3756	atomic/atomicrmw/fence instruction can be moved after the
				3757	seq_cst.
				3758	- If a store atomic then the same constraints as release, plus
				3759	no following sequentially consistent load atomic/store
				3760	atomic/atomicrmw/fence instruction can be moved before the
				3761	seq_cst.
				3762	- If an atomicrmw/fence then same constraints as acq_rel.
				3763	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3764
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3765	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3766	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3767
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3768	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3769	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3770	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3771
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3772	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3773	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3774
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3775	=================== =============== =============== =======================
				3776	Usage Code Sequence Trap Handler Description
				3777	Inputs
				3778	=================== =============== =============== =======================
				3779	reserved ``s_trap 0x00`` Reserved by hardware.
				3780	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3781	``queue_ptr`` ``debugtrap``
				3782	``VGPR0``: intrinsic (not
				3783	``arg`` implemented).
				3784	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3785	``queue_ptr`` terminated and its
				3786	associated queue put
				3787	into the error state.
Tony Tye	43259df	2018-05-16 16:19:34 +0000	[diff] [blame]	3788	``llvm.debugtrap`` ``s_trap 0x03`` - If debugger not
				3789	installed then
				3790	behaves as a
				3791	no-operation. The
				3792	trap handler is
				3793	entered and
				3794	immediately returns
				3795	to continue
				3796	execution of the
				3797	wavefront.
				3798	- If the debugger is
				3799	installed, causes
				3800	the debug trap to be
				3801	reported by the
				3802	debugger and the
				3803	wavefront is put in
				3804	the halt state until
				3805	resumed by the
				3806	debugger.
				3807	reserved ``s_trap 0x04`` Reserved.
				3808	reserved ``s_trap 0x05`` Reserved.
				3809	reserved ``s_trap 0x06`` Reserved.
				3810	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3811	breakpoints.
Tony Tye	43259df	2018-05-16 16:19:34 +0000	[diff] [blame]	3812	reserved ``s_trap 0x08`` Reserved.
				3813	reserved ``s_trap 0xfe`` Reserved.
				3814	reserved ``s_trap 0xff`` Reserved.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3815	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3816
Tim Corringham	af2dfc6	2018-04-04 13:02:09 +0000	[diff] [blame]	3817	AMDPAL
				3818	------
				3819
				3820	This section provides code conventions used when the target triple OS is
				3821	``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
				3822	from the application/runtime to each invocation of a hardware shader. These
				3823	parameters include both generic, application-controlled parameters called
				3824	user data as well as system-generated parameters that are a product of the
				3825	draw or dispatch execution.
				3826
				3827	User Data
				3828	~~~~~~~~~
				3829
				3830	Each hardware stage has a set of 32-bit user data registers which can be
				3831	written from a command buffer and then loaded into SGPRs when waves are launched
				3832	via a subsequent dispatch or draw operation. This is the way most arguments are
				3833	passed from the application/runtime to a hardware shader.
				3834
				3835	Compute User Data
				3836	~~~~~~~~~~~~~~~~~
				3837
				3838	Compute shader user data mappings are simpler than graphics shaders, and have a
				3839	fixed mapping.
				3840
				3841	Note that there are always 10 available user data entries in registers -
				3842	entries beyond that limit must be fetched from memory (via the spill table
				3843	pointer) by the shader.
				3844
				3845	.. table:: PAL Compute Shader User Data Registers
				3846	:name: pal-compute-user-data-registers
				3847
				3848	============= ================================
				3849	User Register Description
				3850	============= ================================
				3851	0 Global Internal Table (32-bit pointer)
				3852	1 Per-Shader Internal Table (32-bit pointer)
				3853	2 - 11 Application-Controlled User Data (10 32-bit values)
				3854	12 Spill Table (32-bit pointer)
				3855	13 - 14 Thread Group Count (64-bit pointer)
				3856	15 GDS Range
				3857	============= ================================
				3858
				3859	Graphics User Data
				3860	~~~~~~~~~~~~~~~~~~
				3861
				3862	Graphics pipelines support a much more flexible user data mapping:
				3863
				3864	.. table:: PAL Graphics Shader User Data Registers
				3865	:name: pal-graphics-user-data-registers
				3866
				3867	============= ================================
				3868	User Register Description
				3869	============= ================================
				3870	0 Global Internal Table (32-bit pointer)
				3871	+ Per-Shader Internal Table (32-bit pointer)
				3872	+ 1-15 Application Controlled User Data
				3873	(1-15 Contiguous 32-bit Values in Registers)
				3874	+ Spill Table (32-bit pointer)
				3875	+ Draw Index (First Stage Only)
				3876	+ Vertex Offset (First Stage Only)
				3877	+ Instance Offset (First Stage Only)
				3878	============= ================================
				3879
				3880	The placement of the global internal table remains fixed in the first *user
				3881	data SGPR register*. Otherwise all parameters are optional, and can be mapped
				3882	to any desired user data SGPR register, with the following regstrictions:
				3883
				3884	* Draw Index, Vertex Offset, and Instance Offset can only be used by the first
				3885	activehardware stage in a graphics pipeline (i.e. where the API vertex
				3886	shader runs).
				3887
				3888	* Application-controlled user data must be mapped into a contiguous range of
				3889	user data registers.
				3890
				3891	* The application-controlled user data range supports compaction remapping, so
				3892	only entries that are actually consumed by the shader must be assigned to
				3893	corresponding registers. Note that in order to support an efficient runtime
				3894	implementation, the remapping must pack registers in the same order as
				3895	entries, with unused entries removed.
				3896
				3897	.. _pal_global_internal_table:
				3898
				3899	Global Internal Table
				3900	~~~~~~~~~~~~~~~~~~~~~
				3901
				3902	The global internal table is a table of shader resource descriptors (SRDs) that
				3903	define how certain engine-wide, runtime-managed resources should be accessed
				3904	from a shader. The majority of these resources have HW-defined formats, and it
				3905	is up to the compiler to write/read data as required by the target hardware.
				3906
				3907	The following table illustrates the required format:
				3908
				3909	.. table:: PAL Global Internal Table
				3910	:name: pal-git-table
				3911
				3912	============= ================================
				3913	Offset Description
				3914	============= ================================
				3915	0-3 Graphics Scratch SRD
				3916	4-7 Compute Scratch SRD
				3917	8-11 ES/GS Ring Output SRD
				3918	12-15 ES/GS Ring Input SRD
				3919	16-19 GS/VS Ring Output #0
				3920	20-23 GS/VS Ring Output #1
				3921	24-27 GS/VS Ring Output #2
				3922	28-31 GS/VS Ring Output #3
				3923	32-35 GS/VS Ring Input SRD
				3924	36-39 Tessellation Factor Buffer SRD
				3925	40-43 Off-Chip LDS Buffer SRD
				3926	44-47 Off-Chip Param Cache Buffer SRD
				3927	48-51 Sample Position Buffer SRD
				3928	52 vaRange::ShadowDescriptorTable High Bits
				3929	============= ================================
				3930
				3931	The pointer to the global internal table passed to the shader as user data
				3932	is a 32-bit pointer. The top 32 bits should be assumed to be the same as
				3933	the top 32 bits of the pipeline, so the shader may use the program
				3934	counter's top 32 bits.
				3935
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3936	Unspecified OS
				3937	--------------
				3938
				3939	This section provides code conventions used when the target triple OS is
				3940	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3941
				3942	Trap Handler ABI
				3943	~~~~~~~~~~~~~~~~
				3944
				3945	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3946	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3947	instructions are handled as follows:
				3948
				3949	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3950	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3951
				3952	=============== =============== ===========================================
				3953	Usage Code Sequence Description
				3954	=============== =============== ===========================================
				3955	llvm.trap s_endpgm Causes wavefront to be terminated.
				3956	llvm.debugtrap none Compiler warning given that there is no
				3957	trap handler installed.
				3958	=============== =============== ===========================================
				3959
				3960	Source Languages
				3961	================
				3962
				3963	.. _amdgpu-opencl:
				3964
				3965	OpenCL
				3966	------
				3967
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3968	When the language is OpenCL the following differences occur:
				3969
				3970	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3971	2. The AMDGPU backend appends additional arguments to the kernel's explicit
				3972	arguments for the AMDHSA OS (see
				3973	:ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3974	3. Additional metadata is generated
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3975	(see :ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3976
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3977	.. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
				3978	:name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
				3979
				3980	======== ==== ========= ===========================================
				3981	Position Byte Byte Description
				3982	Size Alignment
				3983	======== ==== ========= ===========================================
Tony Tye	88441a3	2018-03-23 18:58:47 +0000	[diff] [blame]	3984	1 8 8 OpenCL Global Offset X
				3985	2 8 8 OpenCL Global Offset Y
				3986	3 8 8 OpenCL Global Offset Z
				3987	4 8 8 OpenCL address of printf buffer
				3988	5 8 8 OpenCL address of virtual queue used by
				3989	enqueue_kernel.
				3990	6 8 8 OpenCL address of AqlWrap struct used by
				3991	enqueue_kernel.
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3992	======== ==== ========= ===========================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3993
				3994	.. _amdgpu-hcc:
				3995
				3996	HCC
				3997	---
				3998
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3999	When the language is HCC the following differences occur:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4000
				4001	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				4002
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4003	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4004	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4005
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4006	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	4007	It supports AMDGCN GFX6-GFX9.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4008
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4009	This section describes general syntax for instructions and operands.
				4010
				4011	Instructions
				4012	~~~~~~~~~~~~
				4013
				4014	.. toctree::
				4015	:hidden:
				4016
				4017	AMDGPUAsmGFX7
				4018	AMDGPUAsmGFX8
				4019	AMDGPUAsmGFX9
				4020	AMDGPUOperandSyntax
				4021
				4022	An instruction has the following syntax:
				4023
				4024	<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...
				4025
				4026	Note that operands are normally comma-separated while modifiers are space-separated.
				4027
				4028	The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
				4029
				4030	See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
				4031	:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
				4032
				4033	Note that features under development are not included in this description.
				4034
				4035	For more information about instructions, their semantics and supported combinations of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4036	operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4037	[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4038
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4039	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4040	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4041
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4042	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4043
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4044	* SGPR registers: s0, ... or s[0], ...
				4045	* VGPR registers: v0, ... or v[0], ...
				4046	* TTMP registers: ttmp0, ... or ttmp[0], ...
				4047	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				4048	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				4049	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				4050	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				4051	* Register index expressions: v[2*2], s[1-1:2-1]
				4052	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4053
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4054	Modifiers
				4055	~~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4056
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4057	Detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4058
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4059	Instruction Examples
				4060	~~~~~~~~~~~~~~~~~~~~
				4061
				4062	DS
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4063	++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4064
				4065	.. code-block:: nasm
				4066
				4067	ds_add_u32 v2, v4 offset:16
				4068	ds_write_src2_b64 v2 offset0:4 offset1:8
				4069	ds_cmpst_f32 v2, v4, v6
				4070	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				4071
				4072
				4073	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				4074
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4075	FLAT
				4076	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4077
				4078	.. code-block:: nasm
				4079
				4080	flat_load_dword v1, v[3:4]
				4081	flat_store_dwordx3 v[3:4], v[5:7]
				4082	flat_atomic_swap v1, v[3:4], v5 glc
				4083	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				4084	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				4085
				4086	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				4087
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4088	MUBUF
				4089	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4090
				4091	.. code-block:: nasm
				4092
				4093	buffer_load_dword v1, off, s[4:7], s1
				4094	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				4095	buffer_store_format_xy v[1:2], off, s[4:7], s1
				4096	buffer_wbinvl1
				4097	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				4098
				4099	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				4100
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4101	SMRD/SMEM
				4102	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4103
				4104	.. code-block:: nasm
				4105
				4106	s_load_dword s1, s[2:3], 0xfc
				4107	s_load_dwordx8 s[8:15], s[2:3], s4
				4108	s_load_dwordx16 s[88:103], s[2:3], s4
				4109	s_dcache_inv_vol
				4110	s_memtime s[4:5]
				4111
				4112	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				4113
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4114	SOP1
				4115	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4116
				4117	.. code-block:: nasm
				4118
				4119	s_mov_b32 s1, s2
				4120	s_mov_b64 s[0:1], 0x80000000
				4121	s_cmov_b32 s1, 200
				4122	s_wqm_b64 s[2:3], s[4:5]
				4123	s_bcnt0_i32_b64 s1, s[2:3]
				4124	s_swappc_b64 s[2:3], s[4:5]
				4125	s_cbranch_join s[4:5]
				4126
				4127	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				4128
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4129	SOP2
				4130	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4131
				4132	.. code-block:: nasm
				4133
				4134	s_add_u32 s1, s2, s3
				4135	s_and_b64 s[2:3], s[4:5], s[6:7]
				4136	s_cselect_b32 s1, s2, s3
				4137	s_andn2_b32 s2, s4, s6
				4138	s_lshr_b64 s[2:3], s[4:5], s6
				4139	s_ashr_i32 s2, s4, s6
				4140	s_bfm_b64 s[2:3], s4, s6
				4141	s_bfe_i64 s[2:3], s[4:5], s6
				4142	s_cbranch_g_fork s[4:5], s[6:7]
				4143
				4144	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				4145
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4146	SOPC
				4147	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4148
				4149	.. code-block:: nasm
				4150
				4151	s_cmp_eq_i32 s1, s2
				4152	s_bitcmp1_b32 s1, s2
				4153	s_bitcmp0_b64 s[2:3], s4
				4154	s_setvskip s3, s5
				4155
				4156	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				4157
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4158	SOPP
				4159	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4160
				4161	.. code-block:: nasm
				4162
				4163	s_barrier
				4164	s_nop 2
				4165	s_endpgm
				4166	s_waitcnt 0 ; Wait for all counters to be 0
				4167	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				4168	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				4169	s_sethalt 9
				4170	s_sleep 10
				4171	s_sendmsg 0x1
				4172	s_sendmsg sendmsg(MSG_INTERRUPT)
				4173	s_trap 1
				4174
				4175	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				4176
				4177	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	4178	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4179	range or acceptable values.
				4180
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4181	VALU
				4182	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4183
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4184	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				4185	the assembler will automatically use optimal encoding based on its operands.
				4186	To force specific encoding, one can add a suffix to the opcode of the instruction:
				4187
				4188	* _e32 for 32-bit VOP1/VOP2/VOPC
				4189	* _e64 for 64-bit VOP3
				4190	* _dpp for VOP_DPP
				4191	* _sdwa for VOP_SDWA
				4192
				4193	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4194
				4195	.. code-block:: nasm
				4196
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4197	v_mov_b32 v1, v2
				4198	v_mov_b32_e32 v1, v2
				4199	v_nop
				4200	v_cvt_f64_i32_e32 v[1:2], v2
				4201	v_floor_f32_e32 v1, v2
				4202	v_bfrev_b32_e32 v1, v2
				4203	v_add_f32_e32 v1, v2, v3
				4204	v_mul_i32_i24_e64 v1, v2, 3
				4205	v_mul_i32_i24_e32 v1, -3, v3
				4206	v_mul_i32_i24_e32 v1, -100, v3
				4207	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				4208	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4209
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4210	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4211
				4212	.. code-block:: nasm
				4213
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4214	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				4215	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4216	v_mov_b32 v0, v0 wave_shl:1
				4217	v_mov_b32 v0, v0 row_mirror
				4218	v_mov_b32 v0, v0 row_bcast:31
				4219	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4220	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4221	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4222
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4223	VOP_SDWA examples:
				4224
				4225	.. code-block:: nasm
				4226
				4227	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				4228	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				4229	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				4230	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				4231	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				4232
				4233	For full list of supported instructions, refer to "Vector ALU instructions".
				4234
				4235	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4236	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4237
				4238	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				4239	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4240
				4241	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4242	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4243
				4244	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4245	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4246
				4247	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4248	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				4249
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4250
				4251	major, minor, and stepping are all integers that describe the instruction
				4252	set architecture (ISA) version of the assembly program.
				4253
				4254	vendor and arch are quoted strings. vendor should always be equal to
				4255	"AMD" and arch should always be equal to "AMDGPU".
				4256
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4257	By default, the assembler will derive the ISA version, vendor, and arch
				4258	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4259
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4260	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4261	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4262
				4263	This directives specifies that the symbol with given name is a kernel entry point
				4264	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4265
				4266	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4267	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4268
				4269	This directive marks the beginning of a list of key / value pairs that are used
				4270	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				4271	The list must be terminated by the .end_amd_kernel_code_t directive. For
				4272	any amd_kernel_code_t values that are unspecified a default value will be
				4273	used. The default value for all keys is 0, with the following exceptions:
				4274
				4275	- kernel_code_version_major defaults to 1.
				4276	- machine_kind defaults to 1.
				4277	- machine_version_major, machine_version_minor, and
				4278	machine_version_stepping are derived from the value of the -mcpu option
				4279	that is passed to the assembler.
				4280	- kernel_code_entry_byte_offset defaults to 256.
				4281	- wavefront_size defaults to 6.
				4282	- kernarg_segment_alignment, group_segment_alignment, and
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	4283	private_segment_alignment default to 4. Note that alignments are specified
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4284	as a power of two, so a value of n means an alignment of 2^ n.
				4285
				4286	The .amd_kernel_code_t directive must be placed immediately after the
				4287	function label and before any instructions.
				4288
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4289	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				4290	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4291
				4292	Here is an example of a minimal amd_kernel_code_t specification:
				4293
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	4294	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4295
				4296	.hsa_code_object_version 1,0
				4297	.hsa_code_object_isa
				4298
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4299	.hsatext
				4300	.globl hello_world
				4301	.p2align 8
				4302	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4303
				4304	hello_world:
				4305
				4306	.amd_kernel_code_t
				4307	enable_sgpr_kernarg_segment_ptr = 1
				4308	is_ptr64 = 1
				4309	compute_pgm_rsrc1_vgprs = 0
				4310	compute_pgm_rsrc1_sgprs = 0
				4311	compute_pgm_rsrc2_user_sgpr = 2
				4312	kernarg_segment_byte_size = 8
				4313	wavefront_sgpr_count = 2
				4314	workitem_vgpr_count = 3
				4315	.end_amd_kernel_code_t
				4316
				4317	s_load_dwordx2 s[0:1], s[0:1] 0x0
				4318	v_mov_b32 v0, 3.14159
				4319	s_waitcnt lgkmcnt(0)
				4320	v_mov_b32 v1, s0
				4321	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4322	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4323	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	4324	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4325	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4326
				4327	Additional Documentation
				4328	========================
				4329
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4330	.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				4331	.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				4332	.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				4333	.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				4334	.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				4335	.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				4336	.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
				4337	.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4338	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				4339	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				4340	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				4341	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				4342	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	4343	.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4344	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				4345	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__