Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: 7d358dfdce4e17f90aaeba8abbcf889bc4737036 [file] [log] [blame]

Eugene Zelenko	3507b04	2018-03-21 17:09:35 +0000	[diff] [blame]	1	=============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	26	.. table:: AMDGPU Architectures
				27	:name: amdgpu-architecture-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	28
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	29	============ ==============================================================
				30	Architecture Description
				31	============ ==============================================================
				32	``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
				33	``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
				34	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	35
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	36	.. table:: AMDGPU Vendors
				37	:name: amdgpu-vendor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	38
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	39	============ ==============================================================
				40	Vendor Description
				41	============ ==============================================================
				42	``amd`` Can be used for all AMD GPU usage.
				43	``mesa3d`` Can be used if the OS is ``mesa3d``.
				44	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	45
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	46	.. table:: AMDGPU Operating Systems
				47	:name: amdgpu-os-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	48
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	49	============== ============================================================
				50	OS Description
				51	============== ============================================================
				52	<empty> Defaults to the unknown OS.
				53	``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
				54	such as AMD's ROCm [AMD-ROCm]_.
				55	``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
				56	runtime.
				57	``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
				58	runtime.
				59	============== ============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	60
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	61	.. table:: AMDGPU Environments
				62	:name: amdgpu-environment-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	63
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	64	============ ==============================================================
				65	Environment Description
				66	============ ==============================================================
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	67	<empty> Default.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	68	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	69
				70	.. _amdgpu-processors:
				71
				72	Processors
				73	----------
				74
				75	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				76	names from both the Processor and Alternative Processor can be used.
				77
				78	.. table:: AMDGPU Processors
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	79	:name: amdgpu-processor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	80
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	81	=========== =============== ============ ===== ========= ======= ==================
				82	Processor Alternative Target dGPU/ Target ROCm Example
				83	Processor Triple APU Features Support Products
				84	Architecture Supported
				85	[Default]
				86	=========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	87	Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	88	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	89	``r600`` ``r600`` dGPU
				90	``r630`` ``r600`` dGPU
				91	``rs880`` ``r600`` dGPU
				92	``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	93	Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	94	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	95	``rv710`` ``r600`` dGPU
				96	``rv730`` ``r600`` dGPU
				97	``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	98	Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	99	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	100	``cedar`` ``r600`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	101	``cypress`` ``r600`` dGPU
				102	``juniper`` ``r600`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	103	``redwood`` ``r600`` dGPU
				104	``sumo`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	105	Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	106	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	107	``barts`` ``r600`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	108	``caicos`` ``r600`` dGPU
				109	``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	110	``turks`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	111	GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	112	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	113	``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	114	``gfx601`` - ``hainan`` ``amdgcn`` dGPU
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	115	- ``oland``
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	116	- ``pitcairn``
				117	- ``verde``
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	118	GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	119	-----------------------------------------------------------------------------------
				120	``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
				121	- A6 Pro-7050B
				122	- A8-7100
				123	- A8 Pro-7150B
				124	- A10-7300
				125	- A10 Pro-7350B
				126	- FX-7500
				127	- A8-7200P
				128	- A10-7400P
				129	- FX-7600P
				130	``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
				131	- FirePro W9100
				132	- FirePro S9150
				133	- FirePro S9170
				134	``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
				135	- Radeon R9 290x
				136	- Radeon R390
				137	- Radeon R390x
				138	``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
				139	- ``mullins`` - E1-2200
				140	- E1-2500
				141	- E2-3000
				142	- E2-3800
				143	- A4-5000
				144	- A4-5100
				145	- A6-5200
				146	- A4 Pro-3340B
				147	``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
				148	- Radeon HD 8770
				149	- R7 260
				150	- R7 260X
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	151	GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	152	-----------------------------------------------------------------------------------
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	153	``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
				154	[on] - Pro A6-8500B
				155	- A8-8600P
				156	- Pro A8-8600B
				157	- FX-8800P
				158	- Pro A12-8800B
				159	\ ``amdgcn`` APU - xnack ROCm - A10-8700P
				160	[on] - Pro A10-8700B
				161	- A10-8780P
				162	\ ``amdgcn`` APU - xnack - A10-9600P
				163	[on] - A10-9630P
				164	- A12-9700P
				165	- A12-9730P
				166	- FX-9800P
				167	- FX-9830P
				168	\ ``amdgcn`` APU - xnack - E2-9010
				169	[on] - A6-9210
				170	- A9-9410
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	171	``gfx802`` - ``iceland`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
				172	- ``tonga`` [off] - FirePro S7100
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	173	- FirePro W7100
				174	- Radeon R285
				175	- Radeon R9 380
				176	- Radeon R9 385
				177	- Mobile FirePro
				178	M7170
				179	``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
				180	[off] - Radeon R9 Fury
				181	- Radeon R9 FuryX
				182	- Radeon Pro Duo
				183	- FirePro S9300x2
				184	- Radeon Instinct MI8
				185	\ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
				186	[off] - Radeon RX 480
				187	- Radeon Instinct MI6
				188	\ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
				189	[off]
				190	``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
				191	[on]
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	192	GCN GFX9 [AMD-GCN-GFX9]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	193	-----------------------------------------------------------------------------------
				194	``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
				195	[off] Frontier Edition
				196	- Radeon RX Vega 56
				197	- Radeon RX Vega 64
				198	- Radeon RX Vega 64
				199	Liquid
				200	- Radeon Instinct MI25
Tony Tye	b6efb90	2018-04-14 01:58:10 +0000	[diff] [blame]	201	``gfx902`` ``amdgcn`` APU - xnack - Ryzen 3 2200G
				202	[on] - Ryzen 5 2400G
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	203	``gfx904`` ``amdgcn`` dGPU - xnack TBA
				204	[off]
				205	.. TODO
				206	Add product
				207	names.
				208	``gfx906`` ``amdgcn`` dGPU - xnack TBA
				209	[off]
				210	.. TODO
				211	Add product
				212	names.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	213	=========== =============== ============ ===== ========= ======= ==================
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	214
				215	.. _amdgpu-target-features:
				216
				217	Target Features
				218	---------------
				219
				220	Target features control how code is generated to support certain
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	221	processor specific features. Not all target features are supported by
				222	all processors. The runtime must ensure that the features supported by
				223	the device used to execute the code match the features enabled when
				224	generating the code. A mismatch of features may result in incorrect
				225	execution, or a reduction in performance.
				226
				227	The target features supported by each processor, and the default value
				228	used if not specified explicitly, is listed in
				229	:ref:`amdgpu-processor-table`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	230
				231	Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
				232	target features.
				233
				234	For example:
				235
				236	``-mxnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	237	Enable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	238	``-mno-xnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	239	Disable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	240
				241	.. table:: AMDGPU Target Features
				242	:name: amdgpu-target-feature-table
				243
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	244	============== ==================================================
				245	Target Feature Description
				246	============== ==================================================
				247	-m[no-]xnack Enable/disable generating code that has
				248	memory clauses that are compatible with
				249	having XNACK replay enabled.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	250
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	251	This is used for demand paging and page
				252	migration. If XNACK replay is enabled in
				253	the device, then if a page fault occurs
				254	the code may execute incorrectly if the
				255	``xnack`` feature is not enabled. Executing
				256	code that has the feature enabled on a
				257	device that does not have XNACK replay
				258	enabled will execute correctly, but may
				259	be less performant than code with the
				260	feature disabled.
				261	============== ==================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	262
				263	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	264
				265	Address Spaces
				266	--------------
				267
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	268	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	269
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	270	The memory space names used in the table, aside from the region memory space, is
				271	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	272
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	273	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	274
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	275	.. table:: Address Space Mapping
				276	:name: amdgpu-address-space-mapping-table
				277
Yaxun Liu	0124b54	2018-02-13 18:00:25 +0000	[diff] [blame]	278	================== =================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	279	LLVM Address Space Memory Space
Yaxun Liu	0124b54	2018-02-13 18:00:25 +0000	[diff] [blame]	280	================== =================
				281	0 Generic (Flat)
				282	1 Global
				283	2 Region (GDS)
				284	3 Local (group/LDS)
				285	4 Constant
				286	5 Private (Scratch)
				287	6 Constant 32-bit
				288	================== =================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	289
				290	.. _amdgpu-memory-scopes:
				291
				292	Memory Scopes
				293	-------------
				294
				295	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				296	backend memory model when the target triple OS is ``amdhsa`` (see
				297	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				298
				299	The memory model supported is based on the HSA memory model [HSA]_ which is
				300	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				301	relation is transitive over the synchonizes-with relation independent of scope,
				302	and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	303	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	304
				305	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				306	inclusion and requires the memory scopes to exactly match. However, this
				307	is conservatively correct for OpenCL.
				308
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	309	.. table:: AMDHSA LLVM Sync Scopes
				310	:name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	311
				312	================ ==========================================================
				313	LLVM Sync Scope Description
				314	================ ==========================================================
				315	none The default: ``system``.
				316
				317	Synchronizes with, and participates in modification and
				318	seq_cst total orderings with, other operations (except
				319	image operations) for all address spaces (except private,
				320	or generic that accesses private) provided the other
				321	operation's sync scope is:
				322
				323	- ``system``.
				324	- ``agent`` and executed by a thread on the same agent.
				325	- ``workgroup`` and executed by a thread in the same
				326	workgroup.
				327	- ``wavefront`` and executed by a thread in the same
				328	wavefront.
				329
				330	``agent`` Synchronizes with, and participates in modification and
				331	seq_cst total orderings with, other operations (except
				332	image operations) for all address spaces (except private,
				333	or generic that accesses private) provided the other
				334	operation's sync scope is:
				335
				336	- ``system`` or ``agent`` and executed by a thread on the
				337	same agent.
				338	- ``workgroup`` and executed by a thread in the same
				339	workgroup.
				340	- ``wavefront`` and executed by a thread in the same
				341	wavefront.
				342
				343	``workgroup`` Synchronizes with, and participates in modification and
				344	seq_cst total orderings with, other operations (except
				345	image operations) for all address spaces (except private,
				346	or generic that accesses private) provided the other
				347	operation's sync scope is:
				348
				349	- ``system``, ``agent`` or ``workgroup`` and executed by a
				350	thread in the same workgroup.
				351	- ``wavefront`` and executed by a thread in the same
				352	wavefront.
				353
				354	``wavefront`` Synchronizes with, and participates in modification and
				355	seq_cst total orderings with, other operations (except
				356	image operations) for all address spaces (except private,
				357	or generic that accesses private) provided the other
				358	operation's sync scope is:
				359
				360	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				361	and executed by a thread in the same wavefront.
				362
				363	``singlethread`` Only synchronizes with, and participates in modification
				364	and seq_cst total orderings with, other operations (except
				365	image operations) running in the same thread for all
				366	address spaces (for example, in signal handlers).
				367	================ ==========================================================
				368
				369	AMDGPU Intrinsics
				370	-----------------
				371
				372	The AMDGPU backend implements the following intrinsics.
				373
				374	This section is WIP.
				375
				376	.. TODO
				377	List AMDGPU intrinsics
				378
				379	Code Object
				380	===========
				381
				382	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				383	can be linked by ``lld`` to produce a standard ELF shared code object which can
				384	be loaded and executed on an AMDGPU target.
				385
				386	Header
				387	------
				388
				389	The AMDGPU backend uses the following ELF header:
				390
				391	.. table:: AMDGPU ELF Header
				392	:name: amdgpu-elf-header-table
				393
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	394	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	395	Field Value
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	396	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	397	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				398	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	399	``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
				400	- ``ELFOSABI_AMDGPU_HSA``
				401	- ``ELFOSABI_AMDGPU_PAL``
				402	- ``ELFOSABI_AMDGPU_MESA3D``
				403	``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
				404	- ``ELFABIVERSION_AMDGPU_PAL``
				405	- ``ELFABIVERSION_AMDGPU_MESA3D``
				406	``e_type`` - ``ET_REL``
				407	- ``ET_DYN``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	408	``e_machine`` ``EM_AMDGPU``
				409	``e_entry`` 0
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	410	``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	411	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	412
				413	..
				414
				415	.. table:: AMDGPU ELF Header Enumeration Values
				416	:name: amdgpu-elf-header-enumeration-values-table
				417
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	418	=============================== =====
				419	Name Value
				420	=============================== =====
				421	``EM_AMDGPU`` 224
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	422	``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	423	``ELFOSABI_AMDGPU_HSA`` 64
				424	``ELFOSABI_AMDGPU_PAL`` 65
				425	``ELFOSABI_AMDGPU_MESA3D`` 66
				426	``ELFABIVERSION_AMDGPU_HSA`` 1
				427	``ELFABIVERSION_AMDGPU_PAL`` 0
				428	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				429	=============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	430
				431	``e_ident[EI_CLASS]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	432	The ELF class is:
				433
				434	* ``ELFCLASS32`` for ``r600`` architecture.
				435
				436	* ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
				437	bit applications.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	438
				439	``e_ident[EI_DATA]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	440	All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	441
				442	``e_ident[EI_OSABI]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	443	One of the following AMD GPU architecture specific OS ABIs
				444	(see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	445
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	446	* ``ELFOSABI_NONE`` for unknown OS.
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	447
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	448	* ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	449
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	450	* ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
				451
				452	* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	453
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	454	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	455	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				456	object conforms:
				457
				458	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				459	runtime ABI.
				460
				461	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				462	runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	463
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	464	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	465	3D runtime ABI.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	466
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	467	``e_type``
				468	Can be one of the following values:
				469
				470
				471	``ET_REL``
				472	The type produced by the AMD GPU backend compiler as it is relocatable code
				473	object.
				474
				475	``ET_DYN``
				476	The type produced by the linker as it is a shared code object.
				477
				478	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				479
				480	``e_machine``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	481	The value ``EM_AMDGPU`` is used for the machine for all processors supported
				482	by the ``r600`` and ``amdgcn`` architectures (see
				483	:ref:`amdgpu-processor-table`). The specific processor is specified in the
				484	``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
				485	:ref:`amdgpu-elf-header-e_flags-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	486
				487	``e_entry``
				488	The entry point is 0 as the entry points for individual kernels must be
				489	selected in order to invoke them through AQL packets.
				490
				491	``e_flags``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	492	The AMDGPU backend uses the following ELF header flags:
				493
				494	.. table:: AMDGPU ELF Header ``e_flags``
				495	:name: amdgpu-elf-header-e_flags-table
				496
				497	================================= ========== =============================
				498	Name Value Description
				499	================================= ========== =============================
				500	AMDGPU Processor Flag See :ref:`amdgpu-processor-table`.
				501	-------------------------------------------- -----------------------------
				502	``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
				503	mask for
				504	``EF_AMDGPU_MACH_xxx`` values
				505	defined in
				506	:ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	507	``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
				508	target feature is
				509	enabled for all code
				510	contained in the code object.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	511	If the processor
				512	does not support the
				513	``xnack`` target
				514	feature then must
				515	be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	516	See
				517	:ref:`amdgpu-target-features`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	518	================================= ========== =============================
				519
				520	.. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
				521	:name: amdgpu-ef-amdgpu-mach-table
				522
				523	================================= ========== =============================
				524	Name Value Description (see
				525	:ref:`amdgpu-processor-table`)
				526	================================= ========== =============================
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	527	``EF_AMDGPU_MACH_NONE`` 0x000 not specified
				528	``EF_AMDGPU_MACH_R600_R600`` 0x001 ``r600``
				529	``EF_AMDGPU_MACH_R600_R630`` 0x002 ``r630``
				530	``EF_AMDGPU_MACH_R600_RS880`` 0x003 ``rs880``
				531	``EF_AMDGPU_MACH_R600_RV670`` 0x004 ``rv670``
				532	``EF_AMDGPU_MACH_R600_RV710`` 0x005 ``rv710``
				533	``EF_AMDGPU_MACH_R600_RV730`` 0x006 ``rv730``
				534	``EF_AMDGPU_MACH_R600_RV770`` 0x007 ``rv770``
				535	``EF_AMDGPU_MACH_R600_CEDAR`` 0x008 ``cedar``
				536	``EF_AMDGPU_MACH_R600_CYPRESS`` 0x009 ``cypress``
				537	``EF_AMDGPU_MACH_R600_JUNIPER`` 0x00a ``juniper``
				538	``EF_AMDGPU_MACH_R600_REDWOOD`` 0x00b ``redwood``
				539	``EF_AMDGPU_MACH_R600_SUMO`` 0x00c ``sumo``
				540	``EF_AMDGPU_MACH_R600_BARTS`` 0x00d ``barts``
				541	``EF_AMDGPU_MACH_R600_CAICOS`` 0x00e ``caicos``
				542	``EF_AMDGPU_MACH_R600_CAYMAN`` 0x00f ``cayman``
				543	``EF_AMDGPU_MACH_R600_TURKS`` 0x010 ``turks``
				544	reserved 0x011 - Reserved for ``r600``
				545	0x01f architecture processors.
				546	``EF_AMDGPU_MACH_AMDGCN_GFX600`` 0x020 ``gfx600``
				547	``EF_AMDGPU_MACH_AMDGCN_GFX601`` 0x021 ``gfx601``
				548	``EF_AMDGPU_MACH_AMDGCN_GFX700`` 0x022 ``gfx700``
				549	``EF_AMDGPU_MACH_AMDGCN_GFX701`` 0x023 ``gfx701``
				550	``EF_AMDGPU_MACH_AMDGCN_GFX702`` 0x024 ``gfx702``
				551	``EF_AMDGPU_MACH_AMDGCN_GFX703`` 0x025 ``gfx703``
				552	``EF_AMDGPU_MACH_AMDGCN_GFX704`` 0x026 ``gfx704``
				553	reserved 0x027 Reserved.
				554	``EF_AMDGPU_MACH_AMDGCN_GFX801`` 0x028 ``gfx801``
				555	``EF_AMDGPU_MACH_AMDGCN_GFX802`` 0x029 ``gfx802``
				556	``EF_AMDGPU_MACH_AMDGCN_GFX803`` 0x02a ``gfx803``
				557	``EF_AMDGPU_MACH_AMDGCN_GFX810`` 0x02b ``gfx810``
				558	``EF_AMDGPU_MACH_AMDGCN_GFX900`` 0x02c ``gfx900``
				559	``EF_AMDGPU_MACH_AMDGCN_GFX902`` 0x02d ``gfx902``
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	560	``EF_AMDGPU_MACH_AMDGCN_GFX904`` 0x02e ``gfx904``
				561	``EF_AMDGPU_MACH_AMDGCN_GFX906`` 0x02f ``gfx906``
Konstantin Zhuravlyov	9122a63	2018-02-16 22:33:59 +0000	[diff] [blame]	562	reserved 0x030 Reserved.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	563	================================= ========== =============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	564
				565	Sections
				566	--------
				567
				568	An AMDGPU target ELF code object has the standard ELF sections which include:
				569
				570	.. table:: AMDGPU ELF Sections
				571	:name: amdgpu-elf-sections-table
				572
				573	================== ================ =================================
				574	Name Type Attributes
				575	================== ================ =================================
				576	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				577	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				578	``.debug_``\ \* ``SHT_PROGBITS`` none
				579	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				580	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				581	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				582	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				583	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				584	``.note`` ``SHT_NOTE`` none
				585	``.rela``\ name ``SHT_RELA`` none
				586	``.rela.dyn`` ``SHT_RELA`` none
				587	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				588	``.shstrtab`` ``SHT_STRTAB`` none
				589	``.strtab`` ``SHT_STRTAB`` none
				590	``.symtab`` ``SHT_SYMTAB`` none
				591	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				592	================== ================ =================================
				593
				594	These sections have their standard meanings (see [ELF]_) and are only generated
				595	if needed.
				596
				597	``.debug``\ \*
				598	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				599	DWARF produced by the AMDGPU backend.
				600
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	601	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	602	The standard sections used by a dynamic loader.
				603
				604	``.note``
				605	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				606	backend.
				607
				608	``.rela``\ name, ``.rela.dyn``
				609	For relocatable code objects, name is the name of the section that the
				610	relocation records apply. For example, ``.rela.text`` is the section name for
				611	relocation records associated with the ``.text`` section.
				612
				613	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				614	records from each of the relocatable code object's ``.rela``\ name sections.
				615
				616	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				617	the AMDGPU backend.
				618
				619	``.text``
				620	The executable machine code for the kernels and functions they call. Generated
				621	as position independent code. See :ref:`amdgpu-code-conventions` for
				622	information on conventions used in the isa generation.
				623
				624	.. _amdgpu-note-records:
				625
				626	Note Records
				627	------------
				628
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	629	As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
				630	be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
				631	aligned. In addition, minimal zero byte padding must be generated to ensure the
				632	``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
				633	``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	634
				635	The AMDGPU backend code object uses the following ELF note records in the
				636	``.note`` section. The Description column specifies the layout of the note
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	637	record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	638	variable size strings have a corresponding ``*_size`` field that specifies the
				639	number of bytes, including the terminating null character, in the string. The
				640	string(s) come immediately after the preceding fields.
				641
				642	Additional note records can be present.
				643
				644	.. table:: AMDGPU ELF Note Records
				645	:name: amdgpu-elf-note-records-table
				646
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	647	===== ============================== ======================================
				648	Name Type Description
				649	===== ============================== ======================================
				650	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	651	===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	652
				653	..
				654
				655	.. table:: AMDGPU ELF Note Record Enumeration Values
				656	:name: amdgpu-elf-note-record-enumeration-values-table
				657
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	658	============================== =====
				659	Name Value
				660	============================== =====
				661	reserved 0-9
				662	``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	663	reserved 11
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	664	============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	665
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	666	``NT_AMD_AMDGPU_HSA_METADATA``
				667	Specifies extensible metadata associated with the code objects executed on HSA
				668	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				669	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				670	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				671	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	672
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	673	.. _amdgpu-symbols:
				674
				675	Symbols
				676	-------
				677
				678	Symbols include the following:
				679
				680	.. table:: AMDGPU ELF Symbols
				681	:name: amdgpu-elf-symbols-table
				682
				683	===================== ============== ============= ==================
				684	Name Type Section Description
				685	===================== ============== ============= ==================
				686	link-name ``STT_OBJECT`` - ``.data`` Global variable
				687	- ``.rodata``
				688	- ``.bss``
				689	link-name\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
				690	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				691	===================== ============== ============= ==================
				692
				693	Global variable
				694	Global variables both used and defined by the compilation unit.
				695
				696	If the symbol is defined in the compilation unit then it is allocated in the
				697	appropriate section according to if it has initialized data or is readonly.
				698
				699	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				700	will resolve relocations using the definition provided by another code object
				701	or explicitly defined by the runtime.
				702
				703	All global symbols, whether defined in the compilation unit or external, are
				704	accessed by the machine code indirectly through a GOT table entry. This
				705	allows them to be preemptable. The GOT table is only supported when the target
				706	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	707
				708	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	709	Add description of linked shared object symbols. Seems undefined symbols
				710	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	711
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	712	Kernel descriptor
				713	Every HSA kernel has an associated kernel descriptor. It is the address of the
				714	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				715	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				716	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				717
				718	Kernel entry point
				719	Every HSA kernel also has a symbol for its machine code entry point.
				720
				721	.. _amdgpu-relocation-records:
				722
				723	Relocation Records
				724	------------------
				725
				726	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				727	relocatable fields are:
				728
				729	``word32``
				730	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				731	alignment. These values use the same byte order as other word values in the
				732	AMD GPU architecture.
				733
				734	``word64``
				735	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				736	alignment. These values use the same byte order as other word values in the
				737	AMD GPU architecture.
				738
				739	Following notations are used for specifying relocation calculations:
				740
				741	A
				742	Represents the addend used to compute the value of the relocatable field.
				743
				744	G
				745	Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	746	entry's symbol will reside during execution.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	747
				748	GOT
				749	Represents the address of the global offset table.
				750
				751	P
				752	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				753	of the storage unit being relocated (computed using ``r_offset``).
				754
				755	S
				756	Represents the value of the symbol whose index resides in the relocation
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	757	entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				758
				759	B
				760	Represents the base address of a loaded executable or shared object which is
				761	the difference between the ELF address and the actual load address. Relocations
				762	using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	763
				764	The following relocation types are supported:
				765
				766	.. table:: AMDGPU ELF Relocation Records
				767	:name: amdgpu-elf-relocation-records-table
				768
Tony Tye	db6c993	2018-01-30 23:59:43 +0000	[diff] [blame]	769	========================== ======= ===== ========== ==============================
				770	Relocation Type Kind Value Field Calculation
				771	========================== ======= ===== ========== ==============================
				772	``R_AMDGPU_NONE`` 0 none none
Tony Tye	223f4c7	2018-04-13 01:01:27 +0000	[diff] [blame]	773	``R_AMDGPU_ABS32_LO`` Static, 1 ``word32`` (S + A) & 0xFFFFFFFF
				774	Dynamic
				775	``R_AMDGPU_ABS32_HI`` Static, 2 ``word32`` (S + A) >> 32
				776	Dynamic
				777	``R_AMDGPU_ABS64`` Static, 3 ``word64`` S + A
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	778	Dynamic
Tony Tye	db6c993	2018-01-30 23:59:43 +0000	[diff] [blame]	779	``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
				780	``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
Tony Tye	223f4c7	2018-04-13 01:01:27 +0000	[diff] [blame]	781	``R_AMDGPU_ABS32`` Static, 6 ``word32`` S + A
				782	Dynamic
Tony Tye	db6c993	2018-01-30 23:59:43 +0000	[diff] [blame]	783	``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
				784	``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				785	``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
				786	``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				787	``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
				788	reserved 12
				789	``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
				790	========================== ======= ===== ========== ==============================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	791
Tony Tye	223f4c7	2018-04-13 01:01:27 +0000	[diff] [blame]	792	``R_AMDGPU_ABS32_LO`` and ``R_AMDGPU_ABS32_HI`` are only supported by
				793	the ``mesa3d`` OS, which does not support ``R_AMDGPU_ABS64``.
				794
				795	There is no current OS loader support for 32 bit programs and so
				796	``R_AMDGPU_ABS32`` is not used.
Matt Arsenault	0084adc	2018-04-30 19:08:16 +0000	[diff] [blame]	797
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	798	.. _amdgpu-dwarf:
				799
				800	DWARF
				801	-----
				802
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	803	Standard DWARF [DWARF]_ Version 5 sections can be generated. These contain
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	804	information that maps the code object executable code and data to the source
				805	language constructs. It can be used by tools such as debuggers and profilers.
				806
				807	Address Space Mapping
				808	~~~~~~~~~~~~~~~~~~~~~
				809
				810	The following address space mapping is used:
				811
				812	.. table:: AMDGPU DWARF Address Space Mapping
				813	:name: amdgpu-dwarf-address-space-mapping-table
				814
				815	=================== =================
				816	DWARF Address Space Memory Space
				817	=================== =================
				818	1 Private (Scratch)
				819	2 Local (group/LDS)
				820	omitted Global
				821	omitted Constant
				822	omitted Generic (Flat)
				823	not supported Region (GDS)
				824	=================== =================
				825
				826	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				827	used in the table.
				828
				829	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				830	DWARF address space of the value of the pointer when it is in the private or
				831	local address space. Otherwise the attribute is omitted.
				832
				833	An ``XDEREF`` operation is generated in location list expressions for variables
				834	that are allocated in the private and local address space. Otherwise no
				835	``XDREF`` is omitted.
				836
				837	Register Mapping
				838	~~~~~~~~~~~~~~~~
				839
				840	This section is WIP.
				841
				842	.. TODO
				843	Define DWARF register enumeration.
				844
				845	If want to present a wavefront state then should expose vector registers as
				846	64 wide (rather than per work-item view that LLVM uses). Either as separate
				847	registers, or a 64x4 byte single register. In either case use a new LANE op
				848	(akin to XDREF) to select the current lane usage in a location
				849	expression. This would also allow scalar register spilling to vector register
				850	lanes to be expressed (currently no debug information is being generated for
				851	spilling). If choose a wide single register approach then use LANE in
				852	conjunction with PIECE operation to select the dword part of the register for
				853	the current lane. If the separate register approach then use LANE to select
				854	the register.
				855
				856	Source Text
				857	~~~~~~~~~~~
				858
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	859	Source text for online-compiled programs (e.g. those compiled by the OpenCL
				860	runtime) may be embedded into the DWARF v5 line table using the ``clang
				861	-gembed-source`` option, described in table :ref:`amdgpu-debug-options`.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	862
Scott Linder	16c7bda	2018-02-23 23:01:06 +0000	[diff] [blame]	863	For example:
				864
				865	``-gembed-source``
				866	Enable the embedded source DWARF v5 extension.
				867	``-gno-embed-source``
				868	Disable the embedded source DWARF v5 extension.
				869
				870	.. table:: AMDGPU Debug Options
				871	:name: amdgpu-debug-options
				872
				873	==================== ==================================================
				874	Debug Flag Description
				875	==================== ==================================================
				876	-g[no-]embed-source Enable/disable embedding source text in DWARF
				877	debug sections. Useful for environments where
				878	source cannot be written to disk, such as
				879	when performing online compilation.
				880	==================== ==================================================
				881
				882	This option enables one extended content types in the DWARF v5 Line Number
				883	Program Header, which is used to encode embedded source.
				884
				885	.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types
				886	:name: amdgpu-dwarf-extended-content-types
				887
				888	============================ ======================
				889	Content Type Form
				890	============================ ======================
				891	``DW_LNCT_LLVM_source`` ``DW_FORM_line_strp``
				892	============================ ======================
				893
				894	The source field will contain the UTF-8 encoded, null-terminated source text
				895	with ``'\n'`` line endings. When the source field is present, consumers can use
				896	the embedded source instead of attempting to discover the source on disk. When
				897	the source field is absent, consumers can access the file to get the source
				898	text.
				899
				900	The above content type appears in the ``file_name_entry_format`` field of the
				901	line table prologue, and its corresponding value appear in the ``file_names``
				902	field. The current encoding of the content type is documented in table
				903	:ref:`amdgpu-dwarf-extended-content-types-encoding`
				904
				905	.. table:: AMDGPU DWARF Line Number Program Header Extended Content Types Encoding
				906	:name: amdgpu-dwarf-extended-content-types-encoding
				907
				908	============================ ====================
				909	Content Type Value
				910	============================ ====================
				911	``DW_LNCT_LLVM_source`` 0x2001
				912	============================ ====================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	913
				914	.. _amdgpu-code-conventions:
				915
				916	Code Conventions
				917	================
				918
				919	This section provides code conventions used for each supported target triple OS
				920	(see :ref:`amdgpu-target-triples`).
				921
				922	AMDHSA
				923	------
				924
				925	This section provides code conventions used when the target triple OS is
				926	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				927
				928	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	929
Tony Tye	01bfd6c	2018-03-27 21:20:46 +0000	[diff] [blame]	930	Code Object Target Identification
				931	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				932
				933	The AMDHSA OS uses the following syntax to specify the code object
				934	target as a single string:
				935
				936	``<Architecture>-<Vendor>-<OS>-<Environment>-<Processor><Target Features>``
				937
				938	Where:
				939
				940	- ``<Architecture>``, ``<Vendor>``, ``<OS>`` and ``<Environment>``
				941	are the same as the Target Triple (see
				942	:ref:`amdgpu-target-triples`).
				943
				944	- ``<Processor>`` is the same as the Processor (see
				945	:ref:`amdgpu-processors`).
				946
				947	- ``<Target Features>`` is a list of the enabled Target Features
				948	(see :ref:`amdgpu-target-features`), each prefixed by a plus, that
				949	apply to Processor. The list must be in the same order as listed
				950	in the table :ref:`amdgpu-target-feature-table`. Note that *Target
				951	Features* must be included in the list if they are enabled even if
				952	that is the default for Processor.
				953
				954	For example:
				955
				956	``"amdgcn-amd-amdhsa--gfx902+xnack"``
				957
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	958	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	959	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	960
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	961	The code object metadata specifies extensible metadata associated with the code
				962	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
				963	[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
				964	(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
				965	``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
				966	information necessary to support the ROCM kernel queries. For example, the
				967	segment sizes needed in a dispatch packet. In addition, a high level language
				968	runtime may require other information to be included. For example, the AMD
				969	OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	970
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	971	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	972	:doc:`YamlIO`).
				973
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	974	.. TODO
				975	Is the string null terminated? It probably should not if YAML allows it to
				976	contain null characters, otherwise it should be.
				977
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	978	The metadata is represented as a single YAML document comprised of the mapping
				979	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				980	referenced tables.
				981
				982	For boolean values, the string values of ``false`` and ``true`` are used for
				983	false and true respectively.
				984
				985	Additional information can be added to the mappings. To avoid conflicts, any
				986	non-AMD key names should be prefixed by "vendor-name.".
				987
				988	.. table:: AMDHSA Code Object Metadata Mapping
				989	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				990
				991	========== ============== ========= =======================================
				992	String Key Value Type Required? Description
				993	========== ============== ========= =======================================
				994	"Version" sequence of Required - The first integer is the major
				995	2 integers version. Currently 1.
				996	- The second integer is the minor
				997	version. Currently 0.
				998	"Printf" sequence of Each string is encoded information
				999	strings about a printf function call. The
				1000	encoded information is organized as
				1001	fields separated by colon (':'):
				1002
				1003	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				1004
				1005	where:
				1006
				1007	``ID``
				1008	A 32 bit integer as a unique id for
				1009	each printf function call
				1010
				1011	``N``
				1012	A 32 bit integer equal to the number
				1013	of arguments of printf function call
				1014	minus 1
				1015
				1016	``S[i]`` (where i = 0, 1, ... , N-1)
				1017	32 bit integers for the size in bytes
				1018	of the i-th FormatString argument of
				1019	the printf function call
				1020
				1021	FormatString
				1022	The format string passed to the
				1023	printf function call.
				1024	"Kernels" sequence of Required Sequence of the mappings for each
				1025	mapping kernel in the code object. See
				1026	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				1027	for the definition of the mapping.
				1028	========== ============== ========= =======================================
				1029
				1030	..
				1031
				1032	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				1033	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				1034
				1035	================= ============== ========= ================================
				1036	String Key Value Type Required? Description
				1037	================= ============== ========= ================================
				1038	"Name" string Required Source name of the kernel.
				1039	"SymbolName" string Required Name of the kernel
				1040	descriptor ELF symbol.
				1041	"Language" string Source language of the kernel.
				1042	Values include:
				1043
				1044	- "OpenCL C"
				1045	- "OpenCL C++"
				1046	- "HCC"
				1047	- "OpenMP"
				1048
				1049	"LanguageVersion" sequence of - The first integer is the major
				1050	2 integers version.
				1051	- The second integer is the
				1052	minor version.
				1053	"Attrs" mapping Mapping of kernel attributes.
				1054	See
				1055	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				1056	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1057	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1058	mapping kernel arguments. See
				1059	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				1060	for the definition of the mapping.
				1061	"CodeProps" mapping Mapping of properties related to
				1062	the kernel code. See
				1063	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				1064	for the mapping definition.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1065	================= ============== ========= ================================
				1066
				1067	..
				1068
				1069	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				1070	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				1071
				1072	=================== ============== ========= ==============================
				1073	String Key Value Type Required? Description
				1074	=================== ============== ========= ==============================
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1075	"ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
				1076	3 integers must be >=1 and the dispatch
				1077	work-group size X, Y, Z must
				1078	correspond to the specified
				1079	values. Defaults to 0, 0, 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1080
				1081	Corresponds to the OpenCL
				1082	``reqd_work_group_size``
				1083	attribute.
				1084	"WorkGroupSizeHint" sequence of The dispatch work-group size
				1085	3 integers X, Y, Z is likely to be the
				1086	specified values.
				1087
				1088	Corresponds to the OpenCL
				1089	``work_group_size_hint``
				1090	attribute.
				1091	"VecTypeHint" string The name of a scalar or vector
				1092	type.
				1093
				1094	Corresponds to the OpenCL
				1095	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	1096
				1097	"RuntimeHandle" string The external symbol name
				1098	associated with a kernel.
				1099	OpenCL runtime allocates a
				1100	global buffer for the symbol
				1101	and saves the kernel's address
				1102	to it, which is used for
				1103	device side enqueueing. Only
				1104	available for device side
				1105	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1106	=================== ============== ========= ==============================
				1107
				1108	..
				1109
				1110	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				1111	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				1112
				1113	================= ============== ========= ================================
				1114	String Key Value Type Required? Description
				1115	================= ============== ========= ================================
				1116	"Name" string Kernel argument name.
				1117	"TypeName" string Kernel argument type name.
				1118	"Size" integer Required Kernel argument size in bytes.
				1119	"Align" integer Required Kernel argument alignment in
				1120	bytes. Must be a power of two.
				1121	"ValueKind" string Required Kernel argument kind that
				1122	specifies how to set up the
				1123	corresponding argument.
				1124	Values include:
				1125
				1126	"ByValue"
				1127	The argument is copied
				1128	directly into the kernarg.
				1129
				1130	"GlobalBuffer"
				1131	A global address space pointer
				1132	to the buffer data is passed
				1133	in the kernarg.
				1134
				1135	"DynamicSharedPointer"
				1136	A group address space pointer
				1137	to dynamically allocated LDS
				1138	is passed in the kernarg.
				1139
				1140	"Sampler"
				1141	A global address space
				1142	pointer to a S# is passed in
				1143	the kernarg.
				1144
				1145	"Image"
				1146	A global address space
				1147	pointer to a T# is passed in
				1148	the kernarg.
				1149
				1150	"Pipe"
				1151	A global address space pointer
				1152	to an OpenCL pipe is passed in
				1153	the kernarg.
				1154
				1155	"Queue"
				1156	A global address space pointer
				1157	to an OpenCL device enqueue
				1158	queue is passed in the
				1159	kernarg.
				1160
				1161	"HiddenGlobalOffsetX"
				1162	The OpenCL grid dispatch
				1163	global offset for the X
				1164	dimension is passed in the
				1165	kernarg.
				1166
				1167	"HiddenGlobalOffsetY"
				1168	The OpenCL grid dispatch
				1169	global offset for the Y
				1170	dimension is passed in the
				1171	kernarg.
				1172
				1173	"HiddenGlobalOffsetZ"
				1174	The OpenCL grid dispatch
				1175	global offset for the Z
				1176	dimension is passed in the
				1177	kernarg.
				1178
				1179	"HiddenNone"
				1180	An argument that is not used
				1181	by the kernel. Space needs to
				1182	be left for it, but it does
				1183	not need to be set up.
				1184
				1185	"HiddenPrintfBuffer"
				1186	A global address space pointer
				1187	to the runtime printf buffer
				1188	is passed in kernarg.
				1189
				1190	"HiddenDefaultQueue"
				1191	A global address space pointer
				1192	to the OpenCL device enqueue
				1193	queue that should be used by
				1194	the kernel by default is
				1195	passed in the kernarg.
				1196
				1197	"HiddenCompletionAction"
Yaxun Liu	c928f2a	2017-10-30 14:30:28 +0000	[diff] [blame]	1198	A global address space pointer
				1199	to help link enqueued kernels into
				1200	the ancestor tree for determining
				1201	when the parent kernel has finished.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1202
				1203	"ValueType" string Required Kernel argument value type. Only
				1204	present if "ValueKind" is
				1205	"ByValue". For vector data
				1206	types, the value is for the
				1207	element type. Values include:
				1208
				1209	- "Struct"
				1210	- "I8"
				1211	- "U8"
				1212	- "I16"
				1213	- "U16"
				1214	- "F16"
				1215	- "I32"
				1216	- "U32"
				1217	- "F32"
				1218	- "I64"
				1219	- "U64"
				1220	- "F64"
				1221
				1222	.. TODO
				1223	How can it be determined if a
				1224	vector type, and what size
				1225	vector?
				1226	"PointeeAlign" integer Alignment in bytes of pointee
				1227	type for pointer type kernel
				1228	argument. Must be a power
				1229	of 2. Only present if
				1230	"ValueKind" is
				1231	"DynamicSharedPointer".
				1232	"AddrSpaceQual" string Kernel argument address space
				1233	qualifier. Only present if
				1234	"ValueKind" is "GlobalBuffer" or
				1235	"DynamicSharedPointer". Values
				1236	are:
				1237
				1238	- "Private"
				1239	- "Global"
				1240	- "Constant"
				1241	- "Local"
				1242	- "Generic"
				1243	- "Region"
				1244
				1245	.. TODO
				1246	Is GlobalBuffer only Global
				1247	or Constant? Is
				1248	DynamicSharedPointer always
				1249	Local? Can HCC allow Generic?
				1250	How can Private or Region
				1251	ever happen?
				1252	"AccQual" string Kernel argument access
				1253	qualifier. Only present if
				1254	"ValueKind" is "Image" or
				1255	"Pipe". Values
				1256	are:
				1257
				1258	- "ReadOnly"
				1259	- "WriteOnly"
				1260	- "ReadWrite"
				1261
				1262	.. TODO
				1263	Does this apply to
				1264	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1265	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1266	performed by the kernel on the
				1267	kernel argument. Only present if
				1268	"ValueKind" is "GlobalBuffer",
				1269	"Image", or "Pipe". This may be
				1270	more restrictive than indicated
				1271	by "AccQual" to reflect what the
				1272	kernel actual does. If not
				1273	present then the runtime must
				1274	assume what is implied by
				1275	"AccQual" and "IsConst". Values
				1276	are:
				1277
				1278	- "ReadOnly"
				1279	- "WriteOnly"
				1280	- "ReadWrite"
				1281
				1282	"IsConst" boolean Indicates if the kernel argument
				1283	is const qualified. Only present
				1284	if "ValueKind" is
				1285	"GlobalBuffer".
				1286
				1287	"IsRestrict" boolean Indicates if the kernel argument
				1288	is restrict qualified. Only
				1289	present if "ValueKind" is
				1290	"GlobalBuffer".
				1291
				1292	"IsVolatile" boolean Indicates if the kernel argument
				1293	is volatile qualified. Only
				1294	present if "ValueKind" is
				1295	"GlobalBuffer".
				1296
				1297	"IsPipe" boolean Indicates if the kernel argument
				1298	is pipe qualified. Only present
				1299	if "ValueKind" is "Pipe".
				1300
				1301	.. TODO
				1302	Can GlobalBuffer be pipe
				1303	qualified?
				1304	================= ============== ========= ================================
				1305
				1306	..
				1307
				1308	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1309	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1310
				1311	============================ ============== ========= =====================
				1312	String Key Value Type Required? Description
				1313	============================ ============== ========= =====================
				1314	"KernargSegmentSize" integer Required The size in bytes of
				1315	the kernarg segment
				1316	that holds the values
				1317	of the arguments to
				1318	the kernel.
				1319	"GroupSegmentFixedSize" integer Required The amount of group
				1320	segment memory
				1321	required by a
				1322	work-group in
				1323	bytes. This does not
				1324	include any
				1325	dynamically allocated
				1326	group segment memory
				1327	that may be added
				1328	when the kernel is
				1329	dispatched.
				1330	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1331	private address space
				1332	memory required for a
				1333	work-item in
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1334	bytes. If the kernel
				1335	uses a dynamic call
				1336	stack then additional
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1337	space must be added
				1338	to this value for the
				1339	call stack.
				1340	"KernargSegmentAlign" integer Required The maximum byte
				1341	alignment of
				1342	arguments in the
				1343	kernarg segment. Must
				1344	be a power of 2.
				1345	"WavefrontSize" integer Required Wavefront size. Must
				1346	be a power of 2.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1347	"NumSGPRs" integer Required Number of scalar
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1348	registers used by a
				1349	wavefront for
				1350	GFX6-GFX9. This
				1351	includes the special
				1352	SGPRs for VCC, Flat
				1353	Scratch (GFX7-GFX9)
				1354	and XNACK (for
				1355	GFX8-GFX9). It does
				1356	not include the 16
				1357	SGPR added if a trap
				1358	handler is
				1359	enabled. It is not
				1360	rounded up to the
				1361	allocation
				1362	granularity.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1363	"NumVGPRs" integer Required Number of vector
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1364	registers used by
				1365	each work-item for
				1366	GFX6-GFX9
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1367	"MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1368	work-group size
				1369	supported by the
				1370	kernel in work-items.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1371	Must be >=1 and
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1372	consistent with
				1373	ReqdWorkGroupSize if
				1374	not 0, 0, 0.
Konstantin Zhuravlyov	06ae4ec	2017-11-28 17:51:08 +0000	[diff] [blame]	1375	"NumSpilledSGPRs" integer Number of stores from
				1376	a scalar register to
				1377	a register allocator
				1378	created spill
				1379	location.
				1380	"NumSpilledVGPRs" integer Number of stores from
				1381	a vector register to
				1382	a register allocator
				1383	created spill
				1384	location.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1385	============================ ============== ========= =====================
				1386
				1387	..
				1388
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1389	Kernel Dispatch
				1390	~~~~~~~~~~~~~~~
				1391
				1392	The HSA architected queuing language (AQL) defines a user space memory interface
				1393	that can be used to control the dispatch of kernels, in an agent independent
				1394	way. An agent can have zero or more AQL queues created for it using the ROCm
				1395	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1396	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1397	mechanics and packet layouts.
				1398
				1399	The packet processor of a kernel agent is responsible for detecting and
				1400	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1401	packet processor is implemented by the hardware command processor (CP),
				1402	asynchronous dispatch controller (ADC) and shader processor input controller
				1403	(SPI).
				1404
				1405	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1406	mode driver to initialize and register the AQL queue with CP.
				1407
				1408	To dispatch a kernel the following actions are performed. This can occur in the
				1409	CPU host program, or from an HSA kernel executing on a GPU.
				1410
				1411	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1412	executed is obtained.
				1413	2. A pointer to the kernel descriptor (see
				1414	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1415	obtained. It must be for a kernel that is contained in a code object that that
				1416	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1417	associated.
				1418	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1419	for a memory region with the kernarg property for the kernel agent that will
				1420	execute the kernel. It must be at least 16 byte aligned.
				1421	4. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	1422	allocation. The layout is defined in the HSA Programmer's Language Reference
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1423	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1424	memory in the same way constant memory is accessed. (Note that the HSA
				1425	specification allows an implementation to copy the kernel argument contents to
				1426	another location that is accessed by the kernel.)
				1427	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1428	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1429	packet. The packet must be set up, and the final write must use an atomic
				1430	store release to set the packet kind to ensure the packet contents are
				1431	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1432	notify the kernel agent that the AQL queue has been updated. These rules, and
				1433	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1434	System Architecture Specification* [HSA]_.
				1435	6. A kernel dispatch packet includes information about the actual dispatch,
				1436	such as grid and work-group size, together with information from the code
				1437	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1438	the kernel symbol can be used to obtain the code object values which are
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1439	recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1440	7. CP executes micro-code and is responsible for detecting and setting up the
				1441	GPU to execute the wavefronts of a kernel dispatch.
				1442	8. CP ensures that when the a wavefront starts executing the kernel machine
				1443	code, the scalar general purpose registers (SGPR) and vector general purpose
				1444	registers (VGPR) are set up as required by the machine code. The required
				1445	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1446	register state is defined in
				1447	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1448	9. The prolog of the kernel machine code (see
				1449	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1450	before continuing executing the machine code that corresponds to the kernel.
				1451	10. When the kernel dispatch has completed execution, CP signals the completion
				1452	signal specified in the kernel dispatch packet if not 0.
				1453
				1454	.. _amdgpu-amdhsa-memory-spaces:
				1455
				1456	Memory Spaces
				1457	~~~~~~~~~~~~~
				1458
				1459	The memory space properties are:
				1460
				1461	.. table:: AMDHSA Memory Spaces
				1462	:name: amdgpu-amdhsa-memory-spaces-table
				1463
				1464	================= =========== ======== ======= ==================
				1465	Memory Space Name HSA Segment Hardware Address NULL Value
				1466	Name Name Size
				1467	================= =========== ======== ======= ==================
				1468	Private private scratch 32 0x00000000
				1469	Local group LDS 32 0xFFFFFFFF
				1470	Global global global 64 0x0000000000000000
				1471	Constant constant *same as 64 0x0000000000000000
				1472	global*
				1473	Generic flat flat 64 0x0000000000000000
				1474	Region N/A GDS 32 *not implemented
				1475	for AMDHSA*
				1476	================= =========== ======== ======= ==================
				1477
				1478	The global and constant memory spaces both use global virtual addresses, which
				1479	are the same virtual address space used by the CPU. However, some virtual
				1480	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1481	and some by both.
				1482
				1483	Using the constant memory space indicates that the data will not change during
				1484	the execution of the kernel. This allows scalar read instructions to be
				1485	used. The vector and scalar L1 caches are invalidated of volatile data before
				1486	each kernel dispatch execution to allow constant memory to change values between
				1487	kernel dispatches.
				1488
				1489	The local memory space uses the hardware Local Data Store (LDS) which is
				1490	automatically allocated when the hardware creates work-groups of wavefronts, and
				1491	freed when all the wavefronts of a work-group have terminated. The data store
				1492	(DS) instructions can be used to access it.
				1493
				1494	The private memory space uses the hardware scratch memory support. If the kernel
				1495	uses scratch, then the hardware allocates memory that is accessed using
				1496	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1497	address to physical address is:
				1498
				1499	``wavefront-scratch-base +
				1500	(private-address * wavefront-size * 4) +
				1501	(wavefront-lane-id * 4)``
				1502
				1503	There are different ways that the wavefront scratch base address is determined
				1504	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1505	memory can be accessed in an interleaved manner using buffer instruction with
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1506	the scratch buffer descriptor and per wavefront scratch offset, by the scratch
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1507	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1508	same private address, the interleaving results in adjacent dwords being accessed
				1509	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1510	supported except by flat and scratch instructions in GFX9.
				1511
				1512	The generic address space uses the hardware flat address support available in
				1513	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1514	local appertures), that are outside the range of addressible global memory, to
				1515	map from a flat address to a private or local address.
				1516
				1517	FLAT instructions can take a flat address and access global, private (scratch)
				1518	and group (LDS) memory depending in if the address is within one of the
				1519	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1520	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1521	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1522	(see :ref:`amdgpu-amdhsa-m0`).
				1523
				1524	To convert between a segment address and a flat address the base address of the
				1525	appertures address can be used. For GFX7-GFX8 these are available in the
				1526	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1527	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1528	GFX9 the appature base addresses are directly available as inline constant
				1529	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1530	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1531	which makes it easier to convert from flat to segment or segment to flat.
				1532
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1533	Image and Samplers
				1534	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1535
				1536	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1537	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1538	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1539	enumeration values for the queries that are not trivially deducible from the S#
				1540	representation.
				1541
				1542	HSA Signals
				1543	~~~~~~~~~~~
				1544
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1545	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1546	structure allocated in memory accessible from both the CPU and GPU. The
				1547	structure is defined by the ROCm runtime and subject to change between releases
				1548	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1549
				1550	.. _amdgpu-amdhsa-hsa-aql-queue:
				1551
				1552	HSA AQL Queue
				1553	~~~~~~~~~~~~~
				1554
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1555	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1556	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1557	fields needed to implement certain language features such as the flat address
				1558	aperture bases. It also contains fields used by CP such as managing the
				1559	allocation of scratch memory.
				1560
				1561	.. _amdgpu-amdhsa-kernel-descriptor:
				1562
				1563	Kernel Descriptor
				1564	~~~~~~~~~~~~~~~~~
				1565
				1566	A kernel descriptor consists of the information needed by CP to initiate the
				1567	execution of a kernel, including the entry point address of the machine code
				1568	that implements the kernel.
				1569
				1570	Kernel Descriptor for GFX6-GFX9
				1571	+++++++++++++++++++++++++++++++
				1572
				1573	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1574
				1575	.. table:: Kernel Descriptor for GFX6-GFX9
				1576	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1577
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1578	======= ======= =============================== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1579	Bits Size Field Name Description
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1580	======= ======= =============================== ============================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1581	31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1582	address space memory
				1583	required for a work-group
				1584	in bytes. This does not
				1585	include any dynamically
				1586	allocated local address
				1587	space memory that may be
				1588	added when the kernel is
				1589	dispatched.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1590	63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1591	private address space
				1592	memory required for a
				1593	work-item in bytes. If
				1594	is_dynamic_callstack is 1
				1595	then additional space must
				1596	be added to this value for
				1597	the call stack.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1598	127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1599	191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1600	negative) from base
				1601	address of kernel
				1602	descriptor to kernel's
				1603	entry point instruction
				1604	which must be 256 byte
				1605	aligned.
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1606	383:192 24 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1607	bytes
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1608	415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1609	program settings used by
				1610	CP to set up
				1611	``COMPUTE_PGM_RSRC1``
				1612	configuration
				1613	register. See
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1614	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1615	447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1616	program settings used by
				1617	CP to set up
				1618	``COMPUTE_PGM_RSRC2``
				1619	configuration
				1620	register. See
				1621	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1622	448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
				1623	SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1624	(see
				1625	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1626
				1627	The total number of SGPR
				1628	user data registers
				1629	requested must not exceed
				1630	16 and match value in
				1631	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1632	Any requests beyond 16
				1633	will be ignored.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1634	449 1 bit EnableSGPRDispatchPtr see above
				1635	450 1 bit EnableSGPRQueuePtr see above
				1636	451 1 bit EnableSGPRKernargSegmentPtr see above
				1637	452 1 bit EnableSGPRDispatchID see above
				1638	453 1 bit EnableSGPRFlatScratchInit see above
				1639	454 1 bit EnableSGPRPrivateSegmentSize see above
				1640	455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
				1641	should always be 0.
				1642	456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
				1643	should always be 0.
				1644	457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
				1645	should always be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	1646	463:458 6 bits Reserved, must be 0.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1647	511:464 6 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1648	bytes
				1649	512 Total size 64 bytes.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1650	======= ====================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1651
				1652	..
				1653
				1654	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1655	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1656
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1657	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1658	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1659	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1660	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1661	used by each work-item,
				1662	granularity is device
				1663	specific:
				1664
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1665	GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1666	- max_vgpr 1..256
				1667	- roundup((max_vgpg + 1)
				1668	/ 4) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1669
				1670	Used by CP to set up
				1671	``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1672	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1673	used by a wavefront,
				1674	granularity is device
				1675	specific:
				1676
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1677	GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1678	- max_sgpr 1..112
				1679	- roundup((max_sgpg + 1)
				1680	/ 8) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1681	GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1682	- max_sgpr 1..112
				1683	- roundup((max_sgpg + 1)
				1684	/ 16) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1685
				1686	Includes the special SGPRs
				1687	for VCC, Flat Scratch (for
				1688	GFX7 onwards) and XNACK
				1689	(for GFX8 onwards). It does
				1690	not include the 16 SGPR
				1691	added if a trap handler is
				1692	enabled.
				1693
				1694	Used by CP to set up
				1695	``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1696	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1697
				1698	Start executing wavefront
				1699	at the specified priority.
				1700
				1701	CP is responsible for
				1702	filling in
				1703	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1704	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1705	with specified rounding
				1706	mode for single (32
				1707	bit) floating point
				1708	precision floating point
				1709	operations.
				1710
				1711	Floating point rounding
				1712	mode values are defined in
				1713	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1714
				1715	Used by CP to set up
				1716	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1717	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1718	with specified rounding
				1719	denorm mode for half/double (16
				1720	and 64 bit) floating point
				1721	precision floating point
				1722	operations.
				1723
				1724	Floating point rounding
				1725	mode values are defined in
				1726	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1727
				1728	Used by CP to set up
				1729	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1730	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1731	with specified denorm mode
				1732	for single (32
				1733	bit) floating point
				1734	precision floating point
				1735	operations.
				1736
				1737	Floating point denorm mode
				1738	values are defined in
				1739	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1740
				1741	Used by CP to set up
				1742	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1743	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1744	with specified denorm mode
				1745	for half/double (16
				1746	and 64 bit) floating point
				1747	precision floating point
				1748	operations.
				1749
				1750	Floating point denorm mode
				1751	values are defined in
				1752	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1753
				1754	Used by CP to set up
				1755	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1756	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1757
				1758	Start executing wavefront
				1759	in privilege trap handler
				1760	mode.
				1761
				1762	CP is responsible for
				1763	filling in
				1764	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1765	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1766	with DX10 clamp mode
				1767	enabled. Used by the vector
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1768	ALU to force DX10 style
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1769	treatment of NaN's (when
				1770	set, clamp NaN to zero,
				1771	otherwise pass NaN
				1772	through).
				1773
				1774	Used by CP to set up
				1775	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1776	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1777
				1778	Start executing wavefront
				1779	in single step mode.
				1780
				1781	CP is responsible for
				1782	filling in
				1783	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1784	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1785	with IEEE mode
				1786	enabled. Floating point
				1787	opcodes that support
				1788	exception flag gathering
				1789	will quiet and propagate
				1790	signaling-NaN inputs per
				1791	IEEE 754-2008. Min_dx10 and
				1792	max_dx10 become IEEE
				1793	754-2008 compliant due to
				1794	signaling-NaN propagation
				1795	and quieting.
				1796
				1797	Used by CP to set up
				1798	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1799	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1800
				1801	Only one work-group allowed
				1802	to execute on a compute
				1803	unit.
				1804
				1805	CP is responsible for
				1806	filling in
				1807	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1808	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1809
				1810	Flag that can be used to
				1811	control debugging code.
				1812
				1813	CP is responsible for
				1814	filling in
				1815	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1816	26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1817	Reserved, must be 0.
				1818	GFX9
				1819	Wavefront starts execution
				1820	with specified fp16 overflow
				1821	mode.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1822
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1823	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1824	+/-INF values.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1825	- If 1, fp16 overflow that is the
				1826	result of an +/-INF input value
				1827	or divide by 0 produces a +/-INF,
				1828	otherwise clamps computed
				1829	overflow to +/-MAX_FP16 as
				1830	appropriate.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1831
				1832	Used by CP to set up
				1833	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1834	31:27 5 bits Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1835	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1836	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1837
				1838	..
				1839
				1840	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1841	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1842
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1843	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1844	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1845	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1846	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1847	_WAVEFRONT_OFFSET SGPR wavefront scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1848	system register (see
				1849	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1850
				1851	Used by CP to set up
				1852	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1853	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1854	user data registers
				1855	requested. This number must
				1856	match the number of user
				1857	data registers enabled.
				1858
				1859	Used by CP to set up
				1860	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1861	6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1862	TRAP instruction which
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1863	requires a trap handler to
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1864	be enabled.
				1865
				1866	CP sets
				1867	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
				1868	if the runtime has
				1869	installed a trap handler
				1870	regardless of the setting
				1871	of this field.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1872	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1873	system SGPR register for
				1874	the work-group id in the X
				1875	dimension (see
				1876	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1877
				1878	Used by CP to set up
				1879	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1880	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1881	system SGPR register for
				1882	the work-group id in the Y
				1883	dimension (see
				1884	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1885
				1886	Used by CP to set up
				1887	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1888	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1889	system SGPR register for
				1890	the work-group id in the Z
				1891	dimension (see
				1892	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1893
				1894	Used by CP to set up
				1895	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1896	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1897	system SGPR register for
				1898	work-group information (see
				1899	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1900
				1901	Used by CP to set up
				1902	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1903	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1904	VGPR system registers used
				1905	for the work-item ID.
				1906	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1907	defines the values.
				1908
				1909	Used by CP to set up
				1910	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1911	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1912
				1913	Wavefront starts execution
				1914	with address watch
				1915	exceptions enabled which
				1916	are generated when L1 has
				1917	witnessed a thread access
				1918	an *address of
				1919	interest*.
				1920
				1921	CP is responsible for
				1922	filling in the address
				1923	watch bit in
				1924	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1925	according to what the
				1926	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1927	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1928
				1929	Wavefront starts execution
				1930	with memory violation
				1931	exceptions exceptions
				1932	enabled which are generated
				1933	when a memory violation has
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	1934	occurred for this wavefront from
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1935	L1 or LDS
				1936	(write-to-read-only-memory,
				1937	mis-aligned atomic, LDS
				1938	address out of range,
				1939	illegal address, etc.).
				1940
				1941	CP sets the memory
				1942	violation bit in
				1943	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1944	according to what the
				1945	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1946	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1947
				1948	CP uses the rounded value
				1949	from the dispatch packet,
				1950	not this value, as the
				1951	dispatch may contain
				1952	dynamically allocated group
				1953	segment memory. CP writes
				1954	directly to
				1955	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1956
				1957	Amount of group segment
				1958	(LDS) to allocate for each
				1959	work-group. Granularity is
				1960	device specific:
				1961
				1962	GFX6:
				1963	roundup(lds-size / (64 * 4))
				1964	GFX7-GFX9:
				1965	roundup(lds-size / (128 * 4))
				1966
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1967	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				1968	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1969	enabled.
				1970
				1971	Used by CP to set up
				1972	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1973	(set from bits 0..6).
				1974
				1975	IEEE 754 FP Invalid
				1976	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1977	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				1978	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1979	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1980	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				1981	_DIVISION_BY_ZERO Zero
				1982	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
				1983	_OVERFLOW
				1984	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				1985	_UNDERFLOW
				1986	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				1987	_INEXACT
				1988	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				1989	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1990	only)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1991	31 1 bit Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1992	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1993	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1994
				1995	..
				1996
				1997	.. table:: Floating Point Rounding Mode Enumeration Values
				1998	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1999
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2000	====================================== ===== ==============================
				2001	Enumeration Name Value Description
				2002	====================================== ===== ==============================
				2003	AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				2004	AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				2005	AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				2006	AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
				2007	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2008
				2009	..
				2010
				2011	.. table:: Floating Point Denorm Mode Enumeration Values
				2012	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				2013
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2014	====================================== ===== ==============================
				2015	Enumeration Name Value Description
				2016	====================================== ===== ==============================
				2017	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
				2018	Denorms
				2019	AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				2020	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				2021	AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
				2022	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2023
				2024	..
				2025
				2026	.. table:: System VGPR Work-Item ID Enumeration Values
				2027	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				2028
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	2029	======================================== ===== ============================
				2030	Enumeration Name Value Description
				2031	======================================== ===== ============================
				2032	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
				2033	ID.
				2034	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				2035	dimensions ID.
				2036	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				2037	dimensions ID.
				2038	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				2039	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2040
				2041	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				2042
				2043	Initial Kernel Execution State
				2044	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				2045
				2046	This section defines the register state that will be set up by the packet
				2047	processor prior to the start of execution of every wavefront. This is limited by
				2048	the constraints of the hardware controllers of CP/ADC/SPI.
				2049
				2050	The order of the SGPR registers is defined, but the compiler can specify which
				2051	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				2052	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2053	for enabled registers are dense starting at SGPR0: the first enabled register is
				2054	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				2055	an SGPR number.
				2056
				2057	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2058	all wavefronts of the grid. It is possible to specify more than 16 User SGPRs using
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2059	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				2060	initialized. These are then immediately followed by the System SGPRs that are
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2061	set up by ADC/SPI and can have different values for each wavefront of the grid
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2062	dispatch.
				2063
				2064	SGPR register initial state is defined in
				2065	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				2066
				2067	.. table:: SGPR Register Set Up Order
				2068	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				2069
				2070	========== ========================== ====== ==============================
				2071	SGPR Order Name Number Description
				2072	(kernel descriptor enable of
				2073	field) SGPRs
				2074	========== ========================== ====== ==============================
				2075	First Private Segment Buffer 4 V# that can be used, together
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2076	(enable_sgpr_private with Scratch Wavefront Offset
				2077	_segment_buffer) as an offset, to access the
				2078	private memory space using a
				2079	segment address.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2080
				2081	CP uses the value provided by
				2082	the runtime.
				2083	then Dispatch Ptr 2 64 bit address of AQL dispatch
				2084	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				2085	actually executing.
				2086	then Queue Ptr 2 64 bit address of amd_queue_t
				2087	(enable_sgpr_queue_ptr) object for AQL queue on which
				2088	the dispatch packet was
				2089	queued.
				2090	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				2091	(enable_sgpr_kernarg segment. This is directly
				2092	_segment_ptr) copied from the
				2093	kernarg_address in the kernel
				2094	dispatch packet.
				2095
				2096	Having CP load it once avoids
				2097	loading it at the beginning of
				2098	every wavefront.
				2099	then Dispatch Id 2 64 bit Dispatch ID of the
				2100	(enable_sgpr_dispatch_id) dispatch packet being
				2101	executed.
				2102	then Flat Scratch Init 2 This is 2 SGPRs:
				2103	(enable_sgpr_flat_scratch
				2104	_init) GFX6
				2105	Not supported.
				2106	GFX7-GFX8
				2107	The first SGPR is a 32 bit
				2108	byte offset from
				2109	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2110	to per SPI base of memory
				2111	for scratch for the queue
				2112	executing the kernel
				2113	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2114	from the runtime. (The
				2115	Scratch Segment Buffer base
				2116	address is
				2117	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2118	plus this offset.) The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2119	of Scratch Wavefront Offset must
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2120	be added to this offset by
				2121	the kernel machine code,
				2122	right shifted by 8, and
				2123	moved to the FLAT_SCRATCH_HI
				2124	SGPR register.
				2125	FLAT_SCRATCH_HI corresponds
				2126	to SGPRn-4 on GFX7, and
				2127	SGPRn-6 on GFX8 (where SGPRn
				2128	is the highest numbered SGPR
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2129	allocated to the wavefront).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2130	FLAT_SCRATCH_HI is
				2131	multiplied by 256 (as it is
				2132	in units of 256 bytes) and
				2133	added to
				2134	``SH_HIDDEN_PRIVATE_BASE_VIMID``
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2135	to calculate the per wavefront
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2136	FLAT SCRATCH BASE in flat
				2137	memory instructions that
				2138	access the scratch
				2139	apperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2140
				2141	The second SGPR is 32 bit
				2142	byte size of a single
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2143	work-item's scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2144	usage. CP obtains this from
				2145	the runtime, and it is
				2146	always a multiple of DWORD.
				2147	CP checks that the value in
				2148	the kernel dispatch packet
				2149	Private Segment Byte Size is
				2150	not larger, and requests the
				2151	runtime to increase the
				2152	queue's scratch size if
				2153	necessary. The kernel code
				2154	must move it to
				2155	FLAT_SCRATCH_LO which is
				2156	SGPRn-3 on GFX7 and SGPRn-5
				2157	on GFX8. FLAT_SCRATCH_LO is
				2158	used as the FLAT SCRATCH
				2159	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2160	instructions. Having CP load
				2161	it once avoids loading it at
				2162	the beginning of every
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2163	wavefront.
				2164	GFX9
				2165	This is the
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2166	64 bit base address of the
				2167	per SPI scratch backing
				2168	memory managed by SPI for
				2169	the queue executing the
				2170	kernel dispatch. CP obtains
				2171	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2172	divides it if there are
				2173	multiple Shader Arrays each
				2174	with its own SPI). The value
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2175	of Scratch Wavefront Offset must
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2176	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2177	machine code and the result
				2178	moved to the FLAT_SCRATCH
				2179	SGPR which is SGPRn-6 and
				2180	SGPRn-5. It is used as the
				2181	FLAT SCRATCH BASE in flat
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2182	memory instructions.
				2183	then Private Segment Size 1 The 32 bit byte size of a
				2184	(enable_sgpr_private single
				2185	work-item's
				2186	scratch_segment_size) memory
				2187	allocation. This is the
				2188	value from the kernel
				2189	dispatch packet Private
				2190	Segment Byte Size rounded up
				2191	by CP to a multiple of
				2192	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2193
				2194	Having CP load it once avoids
				2195	loading it at the beginning of
				2196	every wavefront.
				2197
				2198	This is not used for
				2199	GFX7-GFX8 since it is the same
				2200	value as the second SGPR of
				2201	Flat Scratch Init. However, it
				2202	may be needed for GFX9 which
				2203	changes the meaning of the
				2204	Flat Scratch Init value.
				2205	then Grid Work-Group Count X 1 32 bit count of the number of
				2206	(enable_sgpr_grid work-groups in the X dimension
				2207	_workgroup_count_X) for the grid being
				2208	executed. Computed from the
				2209	fields in the kernel dispatch
				2210	packet as ((grid_size.x +
				2211	workgroup_size.x - 1) /
				2212	workgroup_size.x).
				2213	then Grid Work-Group Count Y 1 32 bit count of the number of
				2214	(enable_sgpr_grid work-groups in the Y dimension
				2215	_workgroup_count_Y && for the grid being
				2216	less than 16 previous executed. Computed from the
				2217	SGPRs) fields in the kernel dispatch
				2218	packet as ((grid_size.y +
				2219	workgroup_size.y - 1) /
				2220	workgroupSize.y).
				2221
				2222	Only initialized if <16
				2223	previous SGPRs initialized.
				2224	then Grid Work-Group Count Z 1 32 bit count of the number of
				2225	(enable_sgpr_grid work-groups in the Z dimension
				2226	_workgroup_count_Z && for the grid being
				2227	less than 16 previous executed. Computed from the
				2228	SGPRs) fields in the kernel dispatch
				2229	packet as ((grid_size.z +
				2230	workgroup_size.z - 1) /
				2231	workgroupSize.z).
				2232
				2233	Only initialized if <16
				2234	previous SGPRs initialized.
				2235	then Work-Group Id X 1 32 bit work-group id in X
				2236	(enable_sgpr_workgroup_id dimension of grid for
				2237	_X) wavefront.
				2238	then Work-Group Id Y 1 32 bit work-group id in Y
				2239	(enable_sgpr_workgroup_id dimension of grid for
				2240	_Y) wavefront.
				2241	then Work-Group Id Z 1 32 bit work-group id in Z
				2242	(enable_sgpr_workgroup_id dimension of grid for
				2243	_Z) wavefront.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2244	then Work-Group Info 1 {first_wavefront, 14'b0000,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2245	(enable_sgpr_workgroup ordered_append_term[10:0],
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2246	_info) threadgroup_size_in_wavefronts[5:0]}
				2247	then Scratch Wavefront Offset 1 32 bit byte offset from base
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2248	(enable_sgpr_private of scratch base of queue
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2249	_segment_wavefront_offset) executing the kernel
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2250	dispatch. Must be used as an
				2251	offset with Private
				2252	segment address when using
				2253	Scratch Segment Buffer. It
				2254	must be used to set up FLAT
				2255	SCRATCH for flat addressing
				2256	(see
				2257	:ref:`amdgpu-amdhsa-flat-scratch`).
				2258	========== ========================== ====== ==============================
				2259
				2260	The order of the VGPR registers is defined, but the compiler can specify which
				2261	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2262	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2263	for enabled registers are dense starting at VGPR0: the first enabled register is
				2264	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2265	VGPR number.
				2266
				2267	VGPR register initial state is defined in
				2268	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2269
				2270	.. table:: VGPR Register Set Up Order
				2271	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2272
				2273	========== ========================== ====== ==============================
				2274	VGPR Order Name Number Description
				2275	(kernel descriptor enable of
				2276	field) VGPRs
				2277	========== ========================== ====== ==============================
				2278	First Work-Item Id X 1 32 bit work item id in X
				2279	(Always initialized) dimension of work-group for
				2280	wavefront lane.
				2281	then Work-Item Id Y 1 32 bit work item id in Y
				2282	(enable_vgpr_workitem_id dimension of work-group for
				2283	> 0) wavefront lane.
				2284	then Work-Item Id Z 1 32 bit work item id in Z
				2285	(enable_vgpr_workitem_id dimension of work-group for
				2286	> 1) wavefront lane.
				2287	========== ========================== ====== ==============================
				2288
Hiroshi Inoue	bcadfee	2018-04-12 05:53:20 +0000	[diff] [blame]	2289	The setting of registers is done by GPU CP/ADC/SPI hardware as follows:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2290
				2291	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2292	registers.
				2293	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2294	combination including none.
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2295	3. Scratch Wavefront Offset is set by SPI in a per wavefront basis which is why
				2296	its value cannot included with the flat scratch init value which is per queue.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2297	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2298	or (X, Y, Z).
				2299
				2300	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2301	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2302
				2303	The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2304	has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2305	instructions (GFX9).
				2306
				2307	If buffer operations are used then the compiler can generate a V# with the
				2308	following properties:
				2309
				2310	* base address of 0
				2311	* no swizzle
				2312	* ATC: 1 if IOMMU present (such as APU)
				2313	* ptr64: 1
				2314	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2315	APU and NC for dGPU).
				2316
				2317	.. _amdgpu-amdhsa-kernel-prolog:
				2318
				2319	Kernel Prolog
				2320	~~~~~~~~~~~~~
				2321
				2322	.. _amdgpu-amdhsa-m0:
				2323
				2324	M0
				2325	++
				2326
				2327	GFX6-GFX8
				2328	The M0 register must be initialized with a value at least the total LDS size
				2329	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2330	available in dispatch packet. For M0, it is also possible to use maximum
				2331	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2332	GFX7-GFX8).
				2333	GFX9
				2334	The M0 register is not used for range checking LDS accesses and so does not
				2335	need to be initialized in the prolog.
				2336
				2337	.. _amdgpu-amdhsa-flat-scratch:
				2338
				2339	Flat Scratch
				2340	++++++++++++
				2341
				2342	If the kernel may use flat operations to access scratch memory, the prolog code
				2343	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2344	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wavefront
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2345	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2346
				2347	GFX6
				2348	Flat scratch is not supported.
				2349
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2350	GFX7-GFX8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2351	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2352	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2353	being managed by SPI for the queue executing the kernel dispatch. This is
				2354	the same value used in the Scratch Segment Buffer V# base address. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2355	prolog must add the value of Scratch Wavefront Offset to get the wavefront's byte
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2356	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2357	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2358	by 8 before moving into FLAT_SCRATCH_LO.
				2359	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2360	work-items scratch memory usage. This is directly loaded from the kernel
				2361	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2362	DWORD. Having CP load it once avoids loading it at the beginning of every
				2363	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2364	SIZE.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2365
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2366	GFX9
				2367	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2368	memory being managed by SPI for the queue executing the kernel dispatch. The
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2369	prolog must add the value of Scratch Wavefront Offset and moved to the FLAT_SCRATCH
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2370	pair for use as the flat scratch base in flat memory instructions.
				2371
				2372	.. _amdgpu-amdhsa-memory-model:
				2373
				2374	Memory Model
				2375	~~~~~~~~~~~~
				2376
				2377	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2378	(see :ref:`memmodel`). The implementation is WIP.
				2379
				2380	.. TODO
				2381	Update when implementation complete.
				2382
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2383	The AMDGPU backend supports the memory synchronization scopes specified in
				2384	:ref:`amdgpu-memory-scopes`.
				2385
				2386	The code sequences used to implement the memory model are defined in table
				2387	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2388
				2389	The sequences specify the order of instructions that a single thread must
				2390	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2391	to other memory instructions executed by the same thread. This allows them to be
				2392	moved earlier or later which can allow them to be combined with other instances
				2393	of the same instruction, or hoisted/sunk out of loops to improve
				2394	performance. Only the instructions related to the memory model are given;
				2395	additional ``s_waitcnt`` instructions are required to ensure registers are
				2396	defined before being used. These may be able to be combined with the memory
				2397	model ``s_waitcnt`` instructions as described above.
				2398
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2399	The AMDGPU backend supports the following memory models:
				2400
				2401	HSA Memory Model [HSA]_
				2402	The HSA memory model uses a single happens-before relation for all address
				2403	spaces (see :ref:`amdgpu-address-spaces`).
				2404	OpenCL Memory Model [OpenCL]_
				2405	The OpenCL memory model which has separate happens-before relations for the
				2406	global and local address spaces. Only a fence specifying both global and
				2407	local address space, and seq_cst instructions join the relationships. Since
				2408	the LLVM ``memfence`` instruction does not allow an address space to be
				2409	specified the OpenCL fence has to convervatively assume both local and
				2410	global address space was specified. However, optimizations can often be
				2411	done to eliminate the additional ``s_waitcnt`` instructions when there are
				2412	no intervening memory instructions which access the corresponding address
				2413	space. The code sequences in the table indicate what can be omitted for the
				2414	OpenCL memory. The target triple environment is used to determine if the
				2415	source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2416
				2417	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2418	operations.
				2419
				2420	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2421	termed vector memory operations.
				2422
				2423	For GFX6-GFX9:
				2424
				2425	* Each agent has multiple compute units (CU).
				2426	* Each CU has multiple SIMDs that execute wavefronts.
				2427	* The wavefronts for a single work-group are executed in the same CU but may be
				2428	executed by different SIMDs.
				2429	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2430	executing on it.
				2431	* All LDS operations of a CU are performed as wavefront wide operations in a
				2432	global order and involve no caching. Completion is reported to a wavefront in
				2433	execution order.
				2434	* The LDS memory has multiple request queues shared by the SIMDs of a
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2435	CU. Therefore, the LDS operations performed by different wavefronts of a work-group
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2436	can be reordered relative to each other, which can result in reordering the
				2437	visibility of vector memory operations with respect to LDS operations of other
				2438	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2439	ensure synchronization between LDS operations and vector memory operations
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2440	between wavefronts of a work-group, but not between operations performed by the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2441	same wavefront.
				2442	* The vector memory operations are performed as wavefront wide operations and
				2443	completion is reported to a wavefront in execution order. The exception is
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2444	that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2445	vector memory order if they access LDS memory, and out of LDS operation order
				2446	if they access global memory.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2447	* The vector memory operations access a single vector L1 cache shared by all
				2448	SIMDs a CU. Therefore, no special action is required for coherence between the
				2449	lanes of a single wavefront, or for coherence between wavefronts in the same
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2450	work-group. A ``buffer_wbinvl1_vol`` is required for coherence between wavefronts
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2451	executing in different work-groups as they may be executing on different CUs.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2452	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2453	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2454	scalar operations are used in a restricted way so do not impact the memory
				2455	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2456	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2457	the same agent.
				2458	* The L2 cache has independent channels to service disjoint ranges of virtual
				2459	addresses.
				2460	* Each CU has a separate request queue per channel. Therefore, the vector and
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2461	scalar memory operations performed by wavefronts executing in different work-groups
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2462	(which may be executing on different CUs) of an agent can be reordered
				2463	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2464	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2465	previous vector memory operation has completed before executing a subsequent
				2466	vector memory or LDS operation and so can be used to meet the requirements of
				2467	acquire and release.
				2468	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2469	of virtual addresses can be set up to bypass it to ensure system coherence.
				2470
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2471	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2472	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2473	memory, atomic memory orderings are not meaningful and all accesses are treated
				2474	as non-atomic.
				2475
				2476	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2477	scalar memory instructions). Since the constant address space contents do not
				2478	change during the execution of a kernel dispatch it is not legal to perform
				2479	stores, and atomic memory orderings are not meaningful and all access are
				2480	treated as non-atomic.
				2481
				2482	A memory synchronization scope wider than work-group is not meaningful for the
				2483	group (LDS) address space and is treated as work-group.
				2484
				2485	The memory model does not support the region address space which is treated as
				2486	non-atomic.
				2487
				2488	Acquire memory ordering is not meaningful on store atomic instructions and is
				2489	treated as non-atomic.
				2490
				2491	Release memory ordering is not meaningful on load atomic instructions and is
				2492	treated a non-atomic.
				2493
				2494	Acquire-release memory ordering is not meaningful on load or store atomic
				2495	instructions and is treated as acquire and release respectively.
				2496
				2497	AMDGPU backend only uses scalar memory operations to access memory that is
				2498	proven to not change during the execution of the kernel dispatch. This includes
				2499	constant address space and global address space for program scope const
				2500	variables. Therefore the kernel machine code does not have to maintain the
				2501	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2502	and vector L1 caches are invalidated between kernel dispatches by CP since
				2503	constant address space data may change between kernel dispatch executions. See
				2504	:ref:`amdgpu-amdhsa-memory-spaces`.
				2505
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2506	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2507	case the AMDGPU backend ensures the memory location used to spill is never
				2508	accessed by vector memory operations at the same time. If scalar writes are used
				2509	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2510	return since the locations may be used for vector memory instructions by a
Tony Tye	5bbcca6	2018-03-08 05:46:01 +0000	[diff] [blame]	2511	future wavefront that uses the same scratch area, or a function call that creates a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2512	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2513	as all scalar writes are write-before-read in the same thread.
				2514
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2515	Scratch backing memory (which is used for the private address space)
				2516	is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
				2517	address space is only accessed by a single thread, and is always
				2518	write-before-read, there is never a need to invalidate these entries from the L1
				2519	cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
				2520	volatile cache lines.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2521
				2522	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2523	to invalidate the L2 cache. This also causes it to be treated as
				2524	non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
				2525	(cache coherent) and so the L2 cache will coherent with the CPU and other
				2526	agents.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2527
				2528	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2529	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2530
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2531	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2532	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2533	Ordering Sync Scope Address
				2534	Space
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2535	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2536	Non-Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2537	-----------------------------------------------------------------------------------
				2538	load none none - global - !volatile & !nontemporal
				2539	- generic
				2540	- private 1. buffer/global/flat_load
				2541	- constant
				2542	- volatile & !nontemporal
				2543
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2544	1. buffer/global/flat_load
				2545	glc=1
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2546
				2547	- nontemporal
				2548
				2549	1. buffer/global/flat_load
				2550	glc=1 slc=1
				2551
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2552	load none none - local 1. ds_load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2553	store none none - global - !nontemporal
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2554	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2555	- private 1. buffer/global/flat_store
				2556	- constant
				2557	- nontemporal
				2558
				2559	1. buffer/global/flat_stote
				2560	glc=1 slc=1
				2561
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2562	store none none - local 1. ds_store
				2563	Unordered Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2564	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2565	load atomic unordered any any Same as non-atomic.
				2566	store atomic unordered any any Same as non-atomic.
				2567	atomicrmw unordered any any *Same as monotonic
				2568	atomic*.
				2569	Monotonic Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2570	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2571	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2572	- wavefront - generic
				2573	- workgroup
				2574	load atomic monotonic - singlethread - local 1. ds_load
				2575	- wavefront
				2576	- workgroup
				2577	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2578	- system - generic glc=1
				2579	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2580	- wavefront - generic
				2581	- workgroup
				2582	- agent
				2583	- system
				2584	store atomic monotonic - singlethread - local 1. ds_store
				2585	- wavefront
				2586	- workgroup
				2587	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2588	- wavefront - generic
				2589	- workgroup
				2590	- agent
				2591	- system
				2592	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2593	- wavefront
				2594	- workgroup
				2595	Acquire Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2596	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2597	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2598	- wavefront - local
				2599	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2600	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				2601	load atomic acquire - workgroup - local 1. ds_load
				2602	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2603
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2604	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2605	- Must happen before
				2606	any following
				2607	global/generic
				2608	load/load
				2609	atomic/store/store
				2610	atomic/atomicrmw.
				2611	- Ensures any
				2612	following global
				2613	data read is no
				2614	older than the load
				2615	atomic value being
				2616	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2617	load atomic acquire - workgroup - generic 1. flat_load
				2618	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2619
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2620	- If OpenCL, omit.
				2621	- Must happen before
				2622	any following
				2623	global/generic
				2624	load/load
				2625	atomic/store/store
				2626	atomic/atomicrmw.
				2627	- Ensures any
				2628	following global
				2629	data read is no
				2630	older than the load
				2631	atomic value being
				2632	acquired.
				2633	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2634	- system glc=1
				2635	2. s_waitcnt vmcnt(0)
				2636
				2637	- Must happen before
				2638	following
				2639	buffer_wbinvl1_vol.
				2640	- Ensures the load
				2641	has completed
				2642	before invalidating
				2643	the cache.
				2644
				2645	3. buffer_wbinvl1_vol
				2646
				2647	- Must happen before
				2648	any following
				2649	global/generic
				2650	load/load
				2651	atomic/atomicrmw.
				2652	- Ensures that
				2653	following
				2654	loads will not see
				2655	stale global data.
				2656
				2657	load atomic acquire - agent - generic 1. flat_load glc=1
				2658	- system 2. s_waitcnt vmcnt(0) &
				2659	lgkmcnt(0)
				2660
				2661	- If OpenCL omit
				2662	lgkmcnt(0).
				2663	- Must happen before
				2664	following
				2665	buffer_wbinvl1_vol.
				2666	- Ensures the flat_load
				2667	has completed
				2668	before invalidating
				2669	the cache.
				2670
				2671	3. buffer_wbinvl1_vol
				2672
				2673	- Must happen before
				2674	any following
				2675	global/generic
				2676	load/load
				2677	atomic/atomicrmw.
				2678	- Ensures that
				2679	following loads
				2680	will not see stale
				2681	global data.
				2682
				2683	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2684	- wavefront - local
				2685	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2686	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				2687	atomicrmw acquire - workgroup - local 1. ds_atomic
				2688	2. waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2689
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2690	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2691	- Must happen before
				2692	any following
				2693	global/generic
				2694	load/load
				2695	atomic/store/store
				2696	atomic/atomicrmw.
				2697	- Ensures any
				2698	following global
				2699	data read is no
				2700	older than the
				2701	atomicrmw value
				2702	being acquired.
				2703
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2704	atomicrmw acquire - workgroup - generic 1. flat_atomic
				2705	2. waitcnt lgkmcnt(0)
				2706
				2707	- If OpenCL, omit.
				2708	- Must happen before
				2709	any following
				2710	global/generic
				2711	load/load
				2712	atomic/store/store
				2713	atomic/atomicrmw.
				2714	- Ensures any
				2715	following global
				2716	data read is no
				2717	older than the
				2718	atomicrmw value
				2719	being acquired.
				2720
				2721	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2722	- system 2. s_waitcnt vmcnt(0)
				2723
				2724	- Must happen before
				2725	following
				2726	buffer_wbinvl1_vol.
				2727	- Ensures the
				2728	atomicrmw has
				2729	completed before
				2730	invalidating the
				2731	cache.
				2732
				2733	3. buffer_wbinvl1_vol
				2734
				2735	- Must happen before
				2736	any following
				2737	global/generic
				2738	load/load
				2739	atomic/atomicrmw.
				2740	- Ensures that
				2741	following loads
				2742	will not see stale
				2743	global data.
				2744
				2745	atomicrmw acquire - agent - generic 1. flat_atomic
				2746	- system 2. s_waitcnt vmcnt(0) &
				2747	lgkmcnt(0)
				2748
				2749	- If OpenCL, omit
				2750	lgkmcnt(0).
				2751	- Must happen before
				2752	following
				2753	buffer_wbinvl1_vol.
				2754	- Ensures the
				2755	atomicrmw has
				2756	completed before
				2757	invalidating the
				2758	cache.
				2759
				2760	3. buffer_wbinvl1_vol
				2761
				2762	- Must happen before
				2763	any following
				2764	global/generic
				2765	load/load
				2766	atomic/atomicrmw.
				2767	- Ensures that
				2768	following loads
				2769	will not see stale
				2770	global data.
				2771
				2772	fence acquire - singlethread none none
				2773	- wavefront
				2774	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2775
				2776	- If OpenCL and
				2777	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2778	not generic, omit.
				2779	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2780	currently has no
				2781	address space on
				2782	the fence need to
				2783	conservatively
				2784	always generate. If
				2785	fence had an
				2786	address space then
				2787	set to address
				2788	space of OpenCL
				2789	fence flag, or to
				2790	generic if both
				2791	local and global
				2792	flags are
				2793	specified.
				2794	- Must happen after
				2795	any preceding
				2796	local/generic load
				2797	atomic/atomicrmw
				2798	with an equal or
				2799	wider sync scope
				2800	and memory ordering
				2801	stronger than
				2802	unordered (this is
				2803	termed the
				2804	fence-paired-atomic).
				2805	- Must happen before
				2806	any following
				2807	global/generic
				2808	load/load
				2809	atomic/store/store
				2810	atomic/atomicrmw.
				2811	- Ensures any
				2812	following global
				2813	data read is no
				2814	older than the
				2815	value read by the
				2816	fence-paired-atomic.
				2817
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2818	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				2819	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2820
				2821	- If OpenCL and
				2822	address space is
				2823	not generic, omit
				2824	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2825	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2826	currently has no
				2827	address space on
				2828	the fence need to
				2829	conservatively
				2830	always generate
				2831	(see comment for
				2832	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2833	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2834	separate s_waitcnt
				2835	vmcnt(0) and
				2836	s_waitcnt
				2837	lgkmcnt(0) to allow
				2838	them to be
				2839	independently moved
				2840	according to the
				2841	following rules.
				2842	- s_waitcnt vmcnt(0)
				2843	must happen after
				2844	any preceding
				2845	global/generic load
				2846	atomic/atomicrmw
				2847	with an equal or
				2848	wider sync scope
				2849	and memory ordering
				2850	stronger than
				2851	unordered (this is
				2852	termed the
				2853	fence-paired-atomic).
				2854	- s_waitcnt lgkmcnt(0)
				2855	must happen after
				2856	any preceding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2857	local/generic load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2858	atomic/atomicrmw
				2859	with an equal or
				2860	wider sync scope
				2861	and memory ordering
				2862	stronger than
				2863	unordered (this is
				2864	termed the
				2865	fence-paired-atomic).
				2866	- Must happen before
				2867	the following
				2868	buffer_wbinvl1_vol.
				2869	- Ensures that the
				2870	fence-paired atomic
				2871	has completed
				2872	before invalidating
				2873	the
				2874	cache. Therefore
				2875	any following
				2876	locations read must
				2877	be no older than
				2878	the value read by
				2879	the
				2880	fence-paired-atomic.
				2881
				2882	2. buffer_wbinvl1_vol
				2883
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2884	- Must happen before any
				2885	following global/generic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2886	load/load
				2887	atomic/store/store
				2888	atomic/atomicrmw.
				2889	- Ensures that
				2890	following loads
				2891	will not see stale
				2892	global data.
				2893
				2894	Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2895	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2896	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2897	- wavefront - local
				2898	- generic
				2899	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2900
				2901	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2902	- Must happen after
				2903	any preceding
				2904	local/generic
				2905	load/store/load
				2906	atomic/store
				2907	atomic/atomicrmw.
				2908	- Must happen before
				2909	the following
				2910	store.
				2911	- Ensures that all
				2912	memory operations
				2913	to local have
				2914	completed before
				2915	performing the
				2916	store that is being
				2917	released.
				2918
				2919	2. buffer/global/flat_store
				2920	store atomic release - workgroup - local 1. ds_store
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2921	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2922
				2923	- If OpenCL, omit.
				2924	- Must happen after
				2925	any preceding
				2926	local/generic
				2927	load/store/load
				2928	atomic/store
				2929	atomic/atomicrmw.
				2930	- Must happen before
				2931	the following
				2932	store.
				2933	- Ensures that all
				2934	memory operations
				2935	to local have
				2936	completed before
				2937	performing the
				2938	store that is being
				2939	released.
				2940
				2941	2. flat_store
				2942	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2943	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2944
				2945	- If OpenCL, omit
				2946	lgkmcnt(0).
				2947	- Could be split into
				2948	separate s_waitcnt
				2949	vmcnt(0) and
				2950	s_waitcnt
				2951	lgkmcnt(0) to allow
				2952	them to be
				2953	independently moved
				2954	according to the
				2955	following rules.
				2956	- s_waitcnt vmcnt(0)
				2957	must happen after
				2958	any preceding
				2959	global/generic
				2960	load/store/load
				2961	atomic/store
				2962	atomic/atomicrmw.
				2963	- s_waitcnt lgkmcnt(0)
				2964	must happen after
				2965	any preceding
				2966	local/generic
				2967	load/store/load
				2968	atomic/store
				2969	atomic/atomicrmw.
				2970	- Must happen before
				2971	the following
				2972	store.
				2973	- Ensures that all
				2974	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2975	to memory have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2976	completed before
				2977	performing the
				2978	store that is being
				2979	released.
				2980
				2981	2. buffer/global/ds/flat_store
				2982	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2983	- wavefront - local
				2984	- generic
				2985	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2986
				2987	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2988	- Must happen after
				2989	any preceding
				2990	local/generic
				2991	load/store/load
				2992	atomic/store
				2993	atomic/atomicrmw.
				2994	- Must happen before
				2995	the following
				2996	atomicrmw.
				2997	- Ensures that all
				2998	memory operations
				2999	to local have
				3000	completed before
				3001	performing the
				3002	atomicrmw that is
				3003	being released.
				3004
				3005	2. buffer/global/flat_atomic
				3006	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3007	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3008
				3009	- If OpenCL, omit.
				3010	- Must happen after
				3011	any preceding
				3012	local/generic
				3013	load/store/load
				3014	atomic/store
				3015	atomic/atomicrmw.
				3016	- Must happen before
				3017	the following
				3018	atomicrmw.
				3019	- Ensures that all
				3020	memory operations
				3021	to local have
				3022	completed before
				3023	performing the
				3024	atomicrmw that is
				3025	being released.
				3026
				3027	2. flat_atomic
				3028	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				3029	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3030
				3031	- If OpenCL, omit
				3032	lgkmcnt(0).
				3033	- Could be split into
				3034	separate s_waitcnt
				3035	vmcnt(0) and
				3036	s_waitcnt
				3037	lgkmcnt(0) to allow
				3038	them to be
				3039	independently moved
				3040	according to the
				3041	following rules.
				3042	- s_waitcnt vmcnt(0)
				3043	must happen after
				3044	any preceding
				3045	global/generic
				3046	load/store/load
				3047	atomic/store
				3048	atomic/atomicrmw.
				3049	- s_waitcnt lgkmcnt(0)
				3050	must happen after
				3051	any preceding
				3052	local/generic
				3053	load/store/load
				3054	atomic/store
				3055	atomic/atomicrmw.
				3056	- Must happen before
				3057	the following
				3058	atomicrmw.
				3059	- Ensures that all
				3060	memory operations
				3061	to global and local
				3062	have completed
				3063	before performing
				3064	the atomicrmw that
				3065	is being released.
				3066
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3067	2. buffer/global/ds/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3068	fence release - singlethread none none
				3069	- wavefront
				3070	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				3071
				3072	- If OpenCL and
				3073	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3074	not generic, omit.
				3075	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3076	currently has no
				3077	address space on
				3078	the fence need to
				3079	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3080	always generate. If
				3081	fence had an
				3082	address space then
				3083	set to address
				3084	space of OpenCL
				3085	fence flag, or to
				3086	generic if both
				3087	local and global
				3088	flags are
				3089	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3090	- Must happen after
				3091	any preceding
				3092	local/generic
				3093	load/load
				3094	atomic/store/store
				3095	atomic/atomicrmw.
				3096	- Must happen before
				3097	any following store
				3098	atomic/atomicrmw
				3099	with an equal or
				3100	wider sync scope
				3101	and memory ordering
				3102	stronger than
				3103	unordered (this is
				3104	termed the
				3105	fence-paired-atomic).
				3106	- Ensures that all
				3107	memory operations
				3108	to local have
				3109	completed before
				3110	performing the
				3111	following
				3112	fence-paired-atomic.
				3113
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3114	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				3115	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3116
				3117	- If OpenCL and
				3118	address space is
				3119	not generic, omit
				3120	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3121	- If OpenCL and
				3122	address space is
				3123	local, omit
				3124	vmcnt(0).
				3125	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3126	currently has no
				3127	address space on
				3128	the fence need to
				3129	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3130	always generate. If
				3131	fence had an
				3132	address space then
				3133	set to address
				3134	space of OpenCL
				3135	fence flag, or to
				3136	generic if both
				3137	local and global
				3138	flags are
				3139	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3140	- Could be split into
				3141	separate s_waitcnt
				3142	vmcnt(0) and
				3143	s_waitcnt
				3144	lgkmcnt(0) to allow
				3145	them to be
				3146	independently moved
				3147	according to the
				3148	following rules.
				3149	- s_waitcnt vmcnt(0)
				3150	must happen after
				3151	any preceding
				3152	global/generic
				3153	load/store/load
				3154	atomic/store
				3155	atomic/atomicrmw.
				3156	- s_waitcnt lgkmcnt(0)
				3157	must happen after
				3158	any preceding
				3159	local/generic
				3160	load/store/load
				3161	atomic/store
				3162	atomic/atomicrmw.
				3163	- Must happen before
				3164	any following store
				3165	atomic/atomicrmw
				3166	with an equal or
				3167	wider sync scope
				3168	and memory ordering
				3169	stronger than
				3170	unordered (this is
				3171	termed the
				3172	fence-paired-atomic).
				3173	- Ensures that all
				3174	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3175	have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3176	completed before
				3177	performing the
				3178	following
				3179	fence-paired-atomic.
				3180
				3181	Acquire-Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3182	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3183	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3184	- wavefront - local
				3185	- generic
				3186	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3187
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3188	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3189	- Must happen after
				3190	any preceding
				3191	local/generic
				3192	load/store/load
				3193	atomic/store
				3194	atomic/atomicrmw.
				3195	- Must happen before
				3196	the following
				3197	atomicrmw.
				3198	- Ensures that all
				3199	memory operations
				3200	to local have
				3201	completed before
				3202	performing the
				3203	atomicrmw that is
				3204	being released.
				3205
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3206	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3207	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3208	2. s_waitcnt lgkmcnt(0)
				3209
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3210	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3211	- Must happen before
				3212	any following
				3213	global/generic
				3214	load/load
				3215	atomic/store/store
				3216	atomic/atomicrmw.
				3217	- Ensures any
				3218	following global
				3219	data read is no
				3220	older than the load
				3221	atomic value being
				3222	acquired.
				3223
				3224	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3225
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3226	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3227	- Must happen after
				3228	any preceding
				3229	local/generic
				3230	load/store/load
				3231	atomic/store
				3232	atomic/atomicrmw.
				3233	- Must happen before
				3234	the following
				3235	atomicrmw.
				3236	- Ensures that all
				3237	memory operations
				3238	to local have
				3239	completed before
				3240	performing the
				3241	atomicrmw that is
				3242	being released.
				3243
				3244	2. flat_atomic
				3245	3. s_waitcnt lgkmcnt(0)
				3246
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3247	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3248	- Must happen before
				3249	any following
				3250	global/generic
				3251	load/load
				3252	atomic/store/store
				3253	atomic/atomicrmw.
				3254	- Ensures any
				3255	following global
				3256	data read is no
				3257	older than the load
				3258	atomic value being
				3259	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3260
				3261	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3262	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3263
				3264	- If OpenCL, omit
				3265	lgkmcnt(0).
				3266	- Could be split into
				3267	separate s_waitcnt
				3268	vmcnt(0) and
				3269	s_waitcnt
				3270	lgkmcnt(0) to allow
				3271	them to be
				3272	independently moved
				3273	according to the
				3274	following rules.
				3275	- s_waitcnt vmcnt(0)
				3276	must happen after
				3277	any preceding
				3278	global/generic
				3279	load/store/load
				3280	atomic/store
				3281	atomic/atomicrmw.
				3282	- s_waitcnt lgkmcnt(0)
				3283	must happen after
				3284	any preceding
				3285	local/generic
				3286	load/store/load
				3287	atomic/store
				3288	atomic/atomicrmw.
				3289	- Must happen before
				3290	the following
				3291	atomicrmw.
				3292	- Ensures that all
				3293	memory operations
				3294	to global have
				3295	completed before
				3296	performing the
				3297	atomicrmw that is
				3298	being released.
				3299
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3300	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3301	3. s_waitcnt vmcnt(0)
				3302
				3303	- Must happen before
				3304	following
				3305	buffer_wbinvl1_vol.
				3306	- Ensures the
				3307	atomicrmw has
				3308	completed before
				3309	invalidating the
				3310	cache.
				3311
				3312	4. buffer_wbinvl1_vol
				3313
				3314	- Must happen before
				3315	any following
				3316	global/generic
				3317	load/load
				3318	atomic/atomicrmw.
				3319	- Ensures that
				3320	following loads
				3321	will not see stale
				3322	global data.
				3323
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3324	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3325	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3326
				3327	- If OpenCL, omit
				3328	lgkmcnt(0).
				3329	- Could be split into
				3330	separate s_waitcnt
				3331	vmcnt(0) and
				3332	s_waitcnt
				3333	lgkmcnt(0) to allow
				3334	them to be
				3335	independently moved
				3336	according to the
				3337	following rules.
				3338	- s_waitcnt vmcnt(0)
				3339	must happen after
				3340	any preceding
				3341	global/generic
				3342	load/store/load
				3343	atomic/store
				3344	atomic/atomicrmw.
				3345	- s_waitcnt lgkmcnt(0)
				3346	must happen after
				3347	any preceding
				3348	local/generic
				3349	load/store/load
				3350	atomic/store
				3351	atomic/atomicrmw.
				3352	- Must happen before
				3353	the following
				3354	atomicrmw.
				3355	- Ensures that all
				3356	memory operations
				3357	to global have
				3358	completed before
				3359	performing the
				3360	atomicrmw that is
				3361	being released.
				3362
				3363	2. flat_atomic
				3364	3. s_waitcnt vmcnt(0) &
				3365	lgkmcnt(0)
				3366
				3367	- If OpenCL, omit
				3368	lgkmcnt(0).
				3369	- Must happen before
				3370	following
				3371	buffer_wbinvl1_vol.
				3372	- Ensures the
				3373	atomicrmw has
				3374	completed before
				3375	invalidating the
				3376	cache.
				3377
				3378	4. buffer_wbinvl1_vol
				3379
				3380	- Must happen before
				3381	any following
				3382	global/generic
				3383	load/load
				3384	atomic/atomicrmw.
				3385	- Ensures that
				3386	following loads
				3387	will not see stale
				3388	global data.
				3389
				3390	fence acq_rel - singlethread none none
				3391	- wavefront
				3392	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3393
				3394	- If OpenCL and
				3395	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3396	not generic, omit.
				3397	- However,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3398	since LLVM
				3399	currently has no
				3400	address space on
				3401	the fence need to
				3402	conservatively
				3403	always generate
				3404	(see comment for
				3405	previous fence).
				3406	- Must happen after
				3407	any preceding
				3408	local/generic
				3409	load/load
				3410	atomic/store/store
				3411	atomic/atomicrmw.
				3412	- Must happen before
				3413	any following
				3414	global/generic
				3415	load/load
				3416	atomic/store/store
				3417	atomic/atomicrmw.
				3418	- Ensures that all
				3419	memory operations
				3420	to local have
				3421	completed before
				3422	performing any
				3423	following global
				3424	memory operations.
				3425	- Ensures that the
				3426	preceding
				3427	local/generic load
				3428	atomic/atomicrmw
				3429	with an equal or
				3430	wider sync scope
				3431	and memory ordering
				3432	stronger than
				3433	unordered (this is
				3434	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3435	acquire-fence-paired-atomic
				3436	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3437	before following
				3438	global memory
				3439	operations. This
				3440	satisfies the
				3441	requirements of
				3442	acquire.
				3443	- Ensures that all
				3444	previous memory
				3445	operations have
				3446	completed before a
				3447	following
				3448	local/generic store
				3449	atomic/atomicrmw
				3450	with an equal or
				3451	wider sync scope
				3452	and memory ordering
				3453	stronger than
				3454	unordered (this is
				3455	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3456	release-fence-paired-atomic
				3457	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3458	requirements of
				3459	release.
				3460
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3461	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				3462	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3463
				3464	- If OpenCL and
				3465	address space is
				3466	not generic, omit
				3467	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3468	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3469	currently has no
				3470	address space on
				3471	the fence need to
				3472	conservatively
				3473	always generate
				3474	(see comment for
				3475	previous fence).
				3476	- Could be split into
				3477	separate s_waitcnt
				3478	vmcnt(0) and
				3479	s_waitcnt
				3480	lgkmcnt(0) to allow
				3481	them to be
				3482	independently moved
				3483	according to the
				3484	following rules.
				3485	- s_waitcnt vmcnt(0)
				3486	must happen after
				3487	any preceding
				3488	global/generic
				3489	load/store/load
				3490	atomic/store
				3491	atomic/atomicrmw.
				3492	- s_waitcnt lgkmcnt(0)
				3493	must happen after
				3494	any preceding
				3495	local/generic
				3496	load/store/load
				3497	atomic/store
				3498	atomic/atomicrmw.
				3499	- Must happen before
				3500	the following
				3501	buffer_wbinvl1_vol.
				3502	- Ensures that the
				3503	preceding
				3504	global/local/generic
				3505	load
				3506	atomic/atomicrmw
				3507	with an equal or
				3508	wider sync scope
				3509	and memory ordering
				3510	stronger than
				3511	unordered (this is
				3512	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3513	acquire-fence-paired-atomic
				3514	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3515	before invalidating
				3516	the cache. This
				3517	satisfies the
				3518	requirements of
				3519	acquire.
				3520	- Ensures that all
				3521	previous memory
				3522	operations have
				3523	completed before a
				3524	following
				3525	global/local/generic
				3526	store
				3527	atomic/atomicrmw
				3528	with an equal or
				3529	wider sync scope
				3530	and memory ordering
				3531	stronger than
				3532	unordered (this is
				3533	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3534	release-fence-paired-atomic
				3535	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3536	requirements of
				3537	release.
				3538
				3539	2. buffer_wbinvl1_vol
				3540
				3541	- Must happen before
				3542	any following
				3543	global/generic
				3544	load/load
				3545	atomic/store/store
				3546	atomic/atomicrmw.
				3547	- Ensures that
				3548	following loads
				3549	will not see stale
				3550	global data. This
				3551	satisfies the
				3552	requirements of
				3553	acquire.
				3554
				3555	Sequential Consistent Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3556	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3557	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3558	- wavefront - local load atomic acquire,
				3559	- generic except must generated
				3560	all instructions even
				3561	for OpenCL.*
				3562	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3563	- generic
				3564	- Must
				3565	happen after
				3566	preceding
				3567	global/generic load
				3568	atomic/store
				3569	atomic/atomicrmw
				3570	with memory
				3571	ordering of seq_cst
				3572	and with equal or
				3573	wider sync scope.
				3574	(Note that seq_cst
				3575	fences have their
				3576	own s_waitcnt
				3577	lgkmcnt(0) and so do
				3578	not need to be
				3579	considered.)
				3580	- Ensures any
				3581	preceding
				3582	sequential
				3583	consistent local
				3584	memory instructions
				3585	have completed
				3586	before executing
				3587	this sequentially
				3588	consistent
				3589	instruction. This
				3590	prevents reordering
				3591	a seq_cst store
				3592	followed by a
				3593	seq_cst load. (Note
				3594	that seq_cst is
				3595	stronger than
				3596	acquire/release as
				3597	the reordering of
				3598	load acquire
				3599	followed by a store
				3600	release is
				3601	prevented by the
				3602	waitcnt of
				3603	the release, but
				3604	there is nothing
				3605	preventing a store
				3606	release followed by
				3607	load acquire from
				3608	competing out of
				3609	order.)
				3610
				3611	2. *Following
				3612	instructions same as
				3613	corresponding load
				3614	atomic acquire,
				3615	except must generated
				3616	all instructions even
				3617	for OpenCL.*
				3618	load atomic seq_cst - workgroup - local *Same as corresponding
				3619	load atomic acquire,
				3620	except must generated
				3621	all instructions even
				3622	for OpenCL.*
				3623	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				3624	- system - generic vmcnt(0)
				3625
				3626	- Could be split into
				3627	separate s_waitcnt
				3628	vmcnt(0)
				3629	and s_waitcnt
				3630	lgkmcnt(0) to allow
				3631	them to be
				3632	independently moved
				3633	according to the
				3634	following rules.
				3635	- waitcnt lgkmcnt(0)
				3636	must happen after
				3637	preceding
				3638	global/generic load
				3639	atomic/store
				3640	atomic/atomicrmw
				3641	with memory
				3642	ordering of seq_cst
				3643	and with equal or
				3644	wider sync scope.
				3645	(Note that seq_cst
				3646	fences have their
				3647	own s_waitcnt
				3648	lgkmcnt(0) and so do
				3649	not need to be
				3650	considered.)
				3651	- waitcnt vmcnt(0)
				3652	must happen after
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3653	preceding
				3654	global/generic load
				3655	atomic/store
				3656	atomic/atomicrmw
				3657	with memory
				3658	ordering of seq_cst
				3659	and with equal or
				3660	wider sync scope.
				3661	(Note that seq_cst
				3662	fences have their
				3663	own s_waitcnt
				3664	vmcnt(0) and so do
				3665	not need to be
				3666	considered.)
				3667	- Ensures any
				3668	preceding
				3669	sequential
				3670	consistent global
				3671	memory instructions
				3672	have completed
				3673	before executing
				3674	this sequentially
				3675	consistent
				3676	instruction. This
				3677	prevents reordering
				3678	a seq_cst store
				3679	followed by a
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3680	seq_cst load. (Note
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3681	that seq_cst is
				3682	stronger than
				3683	acquire/release as
				3684	the reordering of
				3685	load acquire
				3686	followed by a store
				3687	release is
				3688	prevented by the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3689	waitcnt of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3690	the release, but
				3691	there is nothing
				3692	preventing a store
				3693	release followed by
				3694	load acquire from
				3695	competing out of
				3696	order.)
				3697
				3698	2. *Following
				3699	instructions same as
				3700	corresponding load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3701	atomic acquire,
				3702	except must generated
				3703	all instructions even
				3704	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3705	store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3706	- wavefront - local store atomic release,
				3707	- workgroup - generic except must generated
				3708	all instructions even
				3709	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3710	store atomic seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3711	- system - generic store atomic release,
				3712	except must generated
				3713	all instructions even
				3714	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3715	atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3716	- wavefront - local atomicrmw acq_rel,
				3717	- workgroup - generic except must generated
				3718	all instructions even
				3719	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3720	atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3721	- system - generic atomicrmw acq_rel,
				3722	except must generated
				3723	all instructions even
				3724	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3725	fence seq_cst - singlethread none *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3726	- wavefront fence acq_rel,
				3727	- workgroup except must generated
				3728	- agent all instructions even
				3729	- system for OpenCL.*
				3730	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3731
				3732	The memory order also adds the single thread optimization constrains defined in
				3733	table
				3734	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3735
				3736	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3737	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3738
				3739	============ ==============================================================
				3740	LLVM Memory Optimization Constraints
				3741	Ordering
				3742	============ ==============================================================
				3743	unordered none
				3744	monotonic none
				3745	acquire - If a load atomic/atomicrmw then no following load/load
				3746	atomic/store/ store atomic/atomicrmw/fence instruction can
				3747	be moved before the acquire.
				3748	- If a fence then same as load atomic, plus no preceding
				3749	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3750	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3751	atomic/store/ store atomic/atomicrmw/fence instruction can
				3752	be moved after the release.
				3753	- If a fence then same as store atomic, plus no following
				3754	associated fence-paired-atomic can be moved before the
				3755	fence.
				3756	acq_rel Same constraints as both acquire and release.
				3757	seq_cst - If a load atomic then same constraints as acquire, plus no
				3758	preceding sequentially consistent load atomic/store
				3759	atomic/atomicrmw/fence instruction can be moved after the
				3760	seq_cst.
				3761	- If a store atomic then the same constraints as release, plus
				3762	no following sequentially consistent load atomic/store
				3763	atomic/atomicrmw/fence instruction can be moved before the
				3764	seq_cst.
				3765	- If an atomicrmw/fence then same constraints as acq_rel.
				3766	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3767
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3768	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3769	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3770
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3771	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3772	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3773	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3774
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3775	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3776	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3777
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3778	=================== =============== =============== =======================
				3779	Usage Code Sequence Trap Handler Description
				3780	Inputs
				3781	=================== =============== =============== =======================
				3782	reserved ``s_trap 0x00`` Reserved by hardware.
				3783	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3784	``queue_ptr`` ``debugtrap``
				3785	``VGPR0``: intrinsic (not
				3786	``arg`` implemented).
				3787	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3788	``queue_ptr`` terminated and its
				3789	associated queue put
				3790	into the error state.
				3791	``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
				3792	``queue_ptr`` installed handled
				3793	same as ``llvm.trap``.
				3794	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
				3795	breakpoints.
				3796	debugger ``s_trap 0x08`` Reserved for debugger.
				3797	debugger ``s_trap 0xfe`` Reserved for debugger.
				3798	debugger ``s_trap 0xff`` Reserved for debugger.
				3799	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3800
Tim Corringham	af2dfc6	2018-04-04 13:02:09 +0000	[diff] [blame]	3801	AMDPAL
				3802	------
				3803
				3804	This section provides code conventions used when the target triple OS is
				3805	``amdpal`` (see :ref:`amdgpu-target-triples`) for passing runtime parameters
				3806	from the application/runtime to each invocation of a hardware shader. These
				3807	parameters include both generic, application-controlled parameters called
				3808	user data as well as system-generated parameters that are a product of the
				3809	draw or dispatch execution.
				3810
				3811	User Data
				3812	~~~~~~~~~
				3813
				3814	Each hardware stage has a set of 32-bit user data registers which can be
				3815	written from a command buffer and then loaded into SGPRs when waves are launched
				3816	via a subsequent dispatch or draw operation. This is the way most arguments are
				3817	passed from the application/runtime to a hardware shader.
				3818
				3819	Compute User Data
				3820	~~~~~~~~~~~~~~~~~
				3821
				3822	Compute shader user data mappings are simpler than graphics shaders, and have a
				3823	fixed mapping.
				3824
				3825	Note that there are always 10 available user data entries in registers -
				3826	entries beyond that limit must be fetched from memory (via the spill table
				3827	pointer) by the shader.
				3828
				3829	.. table:: PAL Compute Shader User Data Registers
				3830	:name: pal-compute-user-data-registers
				3831
				3832	============= ================================
				3833	User Register Description
				3834	============= ================================
				3835	0 Global Internal Table (32-bit pointer)
				3836	1 Per-Shader Internal Table (32-bit pointer)
				3837	2 - 11 Application-Controlled User Data (10 32-bit values)
				3838	12 Spill Table (32-bit pointer)
				3839	13 - 14 Thread Group Count (64-bit pointer)
				3840	15 GDS Range
				3841	============= ================================
				3842
				3843	Graphics User Data
				3844	~~~~~~~~~~~~~~~~~~
				3845
				3846	Graphics pipelines support a much more flexible user data mapping:
				3847
				3848	.. table:: PAL Graphics Shader User Data Registers
				3849	:name: pal-graphics-user-data-registers
				3850
				3851	============= ================================
				3852	User Register Description
				3853	============= ================================
				3854	0 Global Internal Table (32-bit pointer)
				3855	+ Per-Shader Internal Table (32-bit pointer)
				3856	+ 1-15 Application Controlled User Data
				3857	(1-15 Contiguous 32-bit Values in Registers)
				3858	+ Spill Table (32-bit pointer)
				3859	+ Draw Index (First Stage Only)
				3860	+ Vertex Offset (First Stage Only)
				3861	+ Instance Offset (First Stage Only)
				3862	============= ================================
				3863
				3864	The placement of the global internal table remains fixed in the first *user
				3865	data SGPR register*. Otherwise all parameters are optional, and can be mapped
				3866	to any desired user data SGPR register, with the following regstrictions:
				3867
				3868	* Draw Index, Vertex Offset, and Instance Offset can only be used by the first
				3869	activehardware stage in a graphics pipeline (i.e. where the API vertex
				3870	shader runs).
				3871
				3872	* Application-controlled user data must be mapped into a contiguous range of
				3873	user data registers.
				3874
				3875	* The application-controlled user data range supports compaction remapping, so
				3876	only entries that are actually consumed by the shader must be assigned to
				3877	corresponding registers. Note that in order to support an efficient runtime
				3878	implementation, the remapping must pack registers in the same order as
				3879	entries, with unused entries removed.
				3880
				3881	.. _pal_global_internal_table:
				3882
				3883	Global Internal Table
				3884	~~~~~~~~~~~~~~~~~~~~~
				3885
				3886	The global internal table is a table of shader resource descriptors (SRDs) that
				3887	define how certain engine-wide, runtime-managed resources should be accessed
				3888	from a shader. The majority of these resources have HW-defined formats, and it
				3889	is up to the compiler to write/read data as required by the target hardware.
				3890
				3891	The following table illustrates the required format:
				3892
				3893	.. table:: PAL Global Internal Table
				3894	:name: pal-git-table
				3895
				3896	============= ================================
				3897	Offset Description
				3898	============= ================================
				3899	0-3 Graphics Scratch SRD
				3900	4-7 Compute Scratch SRD
				3901	8-11 ES/GS Ring Output SRD
				3902	12-15 ES/GS Ring Input SRD
				3903	16-19 GS/VS Ring Output #0
				3904	20-23 GS/VS Ring Output #1
				3905	24-27 GS/VS Ring Output #2
				3906	28-31 GS/VS Ring Output #3
				3907	32-35 GS/VS Ring Input SRD
				3908	36-39 Tessellation Factor Buffer SRD
				3909	40-43 Off-Chip LDS Buffer SRD
				3910	44-47 Off-Chip Param Cache Buffer SRD
				3911	48-51 Sample Position Buffer SRD
				3912	52 vaRange::ShadowDescriptorTable High Bits
				3913	============= ================================
				3914
				3915	The pointer to the global internal table passed to the shader as user data
				3916	is a 32-bit pointer. The top 32 bits should be assumed to be the same as
				3917	the top 32 bits of the pipeline, so the shader may use the program
				3918	counter's top 32 bits.
				3919
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3920	Unspecified OS
				3921	--------------
				3922
				3923	This section provides code conventions used when the target triple OS is
				3924	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3925
				3926	Trap Handler ABI
				3927	~~~~~~~~~~~~~~~~
				3928
				3929	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3930	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3931	instructions are handled as follows:
				3932
				3933	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3934	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3935
				3936	=============== =============== ===========================================
				3937	Usage Code Sequence Description
				3938	=============== =============== ===========================================
				3939	llvm.trap s_endpgm Causes wavefront to be terminated.
				3940	llvm.debugtrap none Compiler warning given that there is no
				3941	trap handler installed.
				3942	=============== =============== ===========================================
				3943
				3944	Source Languages
				3945	================
				3946
				3947	.. _amdgpu-opencl:
				3948
				3949	OpenCL
				3950	------
				3951
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3952	When the language is OpenCL the following differences occur:
				3953
				3954	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3955	2. The AMDGPU backend appends additional arguments to the kernel's explicit
				3956	arguments for the AMDHSA OS (see
				3957	:ref:`opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table`).
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3958	3. Additional metadata is generated
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3959	(see :ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3960
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3961	.. table:: OpenCL kernel implicit arguments appended for AMDHSA OS
				3962	:name: opencl-kernel-implicit-arguments-appended-for-amdhsa-os-table
				3963
				3964	======== ==== ========= ===========================================
				3965	Position Byte Byte Description
				3966	Size Alignment
				3967	======== ==== ========= ===========================================
Tony Tye	88441a3	2018-03-23 18:58:47 +0000	[diff] [blame]	3968	1 8 8 OpenCL Global Offset X
				3969	2 8 8 OpenCL Global Offset Y
				3970	3 8 8 OpenCL Global Offset Z
				3971	4 8 8 OpenCL address of printf buffer
				3972	5 8 8 OpenCL address of virtual queue used by
				3973	enqueue_kernel.
				3974	6 8 8 OpenCL address of AqlWrap struct used by
				3975	enqueue_kernel.
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3976	======== ==== ========= ===========================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3977
				3978	.. _amdgpu-hcc:
				3979
				3980	HCC
				3981	---
				3982
Tony Tye	7a893d4	2018-03-23 18:45:18 +0000	[diff] [blame]	3983	When the language is HCC the following differences occur:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3984
				3985	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3986
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3987	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3988	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3989
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3990	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	3991	It supports AMDGCN GFX6-GFX9.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3992
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	3993	This section describes general syntax for instructions and operands.
				3994
				3995	Instructions
				3996	~~~~~~~~~~~~
				3997
				3998	.. toctree::
				3999	:hidden:
				4000
				4001	AMDGPUAsmGFX7
				4002	AMDGPUAsmGFX8
				4003	AMDGPUAsmGFX9
				4004	AMDGPUOperandSyntax
				4005
				4006	An instruction has the following syntax:
				4007
				4008	<opcode> <operand0>, <operand1>,... <modifier0> <modifier1>...
				4009
				4010	Note that operands are normally comma-separated while modifiers are space-separated.
				4011
				4012	The order of operands and modifiers is fixed. Most modifiers are optional and may be omitted.
				4013
				4014	See detailed instruction syntax description for :doc:`GFX7<AMDGPUAsmGFX7>`,
				4015	:doc:`GFX8<AMDGPUAsmGFX8>` and :doc:`GFX9<AMDGPUAsmGFX9>`.
				4016
				4017	Note that features under development are not included in this description.
				4018
				4019	For more information about instructions, their semantics and supported combinations of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4020	operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4021	[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4022
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4023	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4024	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4025
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4026	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4027
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4028	* SGPR registers: s0, ... or s[0], ...
				4029	* VGPR registers: v0, ... or v[0], ...
				4030	* TTMP registers: ttmp0, ... or ttmp[0], ...
				4031	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				4032	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				4033	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				4034	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				4035	* Register index expressions: v[2*2], s[1-1:2-1]
				4036	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4037
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4038	Modifiers
				4039	~~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4040
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4041	Detailed description of modifiers may be found :doc:`here<AMDGPUOperandSyntax>`.
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4042
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4043	Instruction Examples
				4044	~~~~~~~~~~~~~~~~~~~~
				4045
				4046	DS
Dmitry Preobrazhensky	c6d31e6	2018-03-12 15:55:08 +0000	[diff] [blame]	4047	++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4048
				4049	.. code-block:: nasm
				4050
				4051	ds_add_u32 v2, v4 offset:16
				4052	ds_write_src2_b64 v2 offset0:4 offset1:8
				4053	ds_cmpst_f32 v2, v4, v6
				4054	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				4055
				4056
				4057	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				4058
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4059	FLAT
				4060	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4061
				4062	.. code-block:: nasm
				4063
				4064	flat_load_dword v1, v[3:4]
				4065	flat_store_dwordx3 v[3:4], v[5:7]
				4066	flat_atomic_swap v1, v[3:4], v5 glc
				4067	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				4068	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				4069
				4070	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				4071
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4072	MUBUF
				4073	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4074
				4075	.. code-block:: nasm
				4076
				4077	buffer_load_dword v1, off, s[4:7], s1
				4078	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				4079	buffer_store_format_xy v[1:2], off, s[4:7], s1
				4080	buffer_wbinvl1
				4081	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				4082
				4083	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				4084
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4085	SMRD/SMEM
				4086	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4087
				4088	.. code-block:: nasm
				4089
				4090	s_load_dword s1, s[2:3], 0xfc
				4091	s_load_dwordx8 s[8:15], s[2:3], s4
				4092	s_load_dwordx16 s[88:103], s[2:3], s4
				4093	s_dcache_inv_vol
				4094	s_memtime s[4:5]
				4095
				4096	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				4097
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4098	SOP1
				4099	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4100
				4101	.. code-block:: nasm
				4102
				4103	s_mov_b32 s1, s2
				4104	s_mov_b64 s[0:1], 0x80000000
				4105	s_cmov_b32 s1, 200
				4106	s_wqm_b64 s[2:3], s[4:5]
				4107	s_bcnt0_i32_b64 s1, s[2:3]
				4108	s_swappc_b64 s[2:3], s[4:5]
				4109	s_cbranch_join s[4:5]
				4110
				4111	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				4112
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4113	SOP2
				4114	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4115
				4116	.. code-block:: nasm
				4117
				4118	s_add_u32 s1, s2, s3
				4119	s_and_b64 s[2:3], s[4:5], s[6:7]
				4120	s_cselect_b32 s1, s2, s3
				4121	s_andn2_b32 s2, s4, s6
				4122	s_lshr_b64 s[2:3], s[4:5], s6
				4123	s_ashr_i32 s2, s4, s6
				4124	s_bfm_b64 s[2:3], s4, s6
				4125	s_bfe_i64 s[2:3], s[4:5], s6
				4126	s_cbranch_g_fork s[4:5], s[6:7]
				4127
				4128	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				4129
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4130	SOPC
				4131	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4132
				4133	.. code-block:: nasm
				4134
				4135	s_cmp_eq_i32 s1, s2
				4136	s_bitcmp1_b32 s1, s2
				4137	s_bitcmp0_b64 s[2:3], s4
				4138	s_setvskip s3, s5
				4139
				4140	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				4141
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4142	SOPP
				4143	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4144
				4145	.. code-block:: nasm
				4146
				4147	s_barrier
				4148	s_nop 2
				4149	s_endpgm
				4150	s_waitcnt 0 ; Wait for all counters to be 0
				4151	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				4152	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				4153	s_sethalt 9
				4154	s_sleep 10
				4155	s_sendmsg 0x1
				4156	s_sendmsg sendmsg(MSG_INTERRUPT)
				4157	s_trap 1
				4158
				4159	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				4160
				4161	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	4162	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4163	range or acceptable values.
				4164
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4165	VALU
				4166	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4167
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4168	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				4169	the assembler will automatically use optimal encoding based on its operands.
				4170	To force specific encoding, one can add a suffix to the opcode of the instruction:
				4171
				4172	* _e32 for 32-bit VOP1/VOP2/VOPC
				4173	* _e64 for 64-bit VOP3
				4174	* _dpp for VOP_DPP
				4175	* _sdwa for VOP_SDWA
				4176
				4177	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4178
				4179	.. code-block:: nasm
				4180
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4181	v_mov_b32 v1, v2
				4182	v_mov_b32_e32 v1, v2
				4183	v_nop
				4184	v_cvt_f64_i32_e32 v[1:2], v2
				4185	v_floor_f32_e32 v1, v2
				4186	v_bfrev_b32_e32 v1, v2
				4187	v_add_f32_e32 v1, v2, v3
				4188	v_mul_i32_i24_e64 v1, v2, 3
				4189	v_mul_i32_i24_e32 v1, -3, v3
				4190	v_mul_i32_i24_e32 v1, -100, v3
				4191	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				4192	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4193
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4194	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4195
				4196	.. code-block:: nasm
				4197
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4198	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				4199	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4200	v_mov_b32 v0, v0 wave_shl:1
				4201	v_mov_b32 v0, v0 row_mirror
				4202	v_mov_b32 v0, v0 row_bcast:31
				4203	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4204	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4205	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4206
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4207	VOP_SDWA examples:
				4208
				4209	.. code-block:: nasm
				4210
				4211	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				4212	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				4213	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				4214	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				4215	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				4216
				4217	For full list of supported instructions, refer to "Vector ALU instructions".
				4218
				4219	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4220	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4221
				4222	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				4223	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4224
				4225	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4226	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4227
				4228	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4229	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4230
				4231	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4232	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				4233
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4234
				4235	major, minor, and stepping are all integers that describe the instruction
				4236	set architecture (ISA) version of the assembly program.
				4237
				4238	vendor and arch are quoted strings. vendor should always be equal to
				4239	"AMD" and arch should always be equal to "AMDGPU".
				4240
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4241	By default, the assembler will derive the ISA version, vendor, and arch
				4242	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4243
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4244	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4245	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4246
				4247	This directives specifies that the symbol with given name is a kernel entry point
				4248	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4249
				4250	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4251	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4252
				4253	This directive marks the beginning of a list of key / value pairs that are used
				4254	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				4255	The list must be terminated by the .end_amd_kernel_code_t directive. For
				4256	any amd_kernel_code_t values that are unspecified a default value will be
				4257	used. The default value for all keys is 0, with the following exceptions:
				4258
				4259	- kernel_code_version_major defaults to 1.
				4260	- machine_kind defaults to 1.
				4261	- machine_version_major, machine_version_minor, and
				4262	machine_version_stepping are derived from the value of the -mcpu option
				4263	that is passed to the assembler.
				4264	- kernel_code_entry_byte_offset defaults to 256.
				4265	- wavefront_size defaults to 6.
				4266	- kernarg_segment_alignment, group_segment_alignment, and
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	4267	private_segment_alignment default to 4. Note that alignments are specified
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4268	as a power of two, so a value of n means an alignment of 2^ n.
				4269
				4270	The .amd_kernel_code_t directive must be placed immediately after the
				4271	function label and before any instructions.
				4272
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4273	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				4274	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4275
				4276	Here is an example of a minimal amd_kernel_code_t specification:
				4277
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	4278	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4279
				4280	.hsa_code_object_version 1,0
				4281	.hsa_code_object_isa
				4282
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4283	.hsatext
				4284	.globl hello_world
				4285	.p2align 8
				4286	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4287
				4288	hello_world:
				4289
				4290	.amd_kernel_code_t
				4291	enable_sgpr_kernarg_segment_ptr = 1
				4292	is_ptr64 = 1
				4293	compute_pgm_rsrc1_vgprs = 0
				4294	compute_pgm_rsrc1_sgprs = 0
				4295	compute_pgm_rsrc2_user_sgpr = 2
				4296	kernarg_segment_byte_size = 8
				4297	wavefront_sgpr_count = 2
				4298	workitem_vgpr_count = 3
				4299	.end_amd_kernel_code_t
				4300
				4301	s_load_dwordx2 s[0:1], s[0:1] 0x0
				4302	v_mov_b32 v0, 3.14159
				4303	s_waitcnt lgkmcnt(0)
				4304	v_mov_b32 v1, s0
				4305	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4306	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4307	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	4308	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4309	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4310
				4311	Additional Documentation
				4312	========================
				4313
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4314	.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				4315	.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				4316	.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				4317	.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				4318	.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				4319	.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				4320	.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
				4321	.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4322	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				4323	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				4324	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				4325	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				4326	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	4327	.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4328	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				4329	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__