Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: bce77333175fdcbe4335d008a705e8a3cf076449 [file] [log] [blame]

Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1	=============================
				2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	26	.. table:: AMDGPU Architectures
				27	:name: amdgpu-architecture-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	28
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	29	============ ==============================================================
				30	Architecture Description
				31	============ ==============================================================
				32	``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
				33	``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
				34	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	35
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	36	.. table:: AMDGPU Vendors
				37	:name: amdgpu-vendor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	38
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	39	============ ==============================================================
				40	Vendor Description
				41	============ ==============================================================
				42	``amd`` Can be used for all AMD GPU usage.
				43	``mesa3d`` Can be used if the OS is ``mesa3d``.
				44	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	45
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	46	.. table:: AMDGPU Operating Systems
				47	:name: amdgpu-os-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	48
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	49	============== ============================================================
				50	OS Description
				51	============== ============================================================
				52	<empty> Defaults to the unknown OS.
				53	``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
				54	such as AMD's ROCm [AMD-ROCm]_.
				55	``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
				56	runtime.
				57	``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
				58	runtime.
				59	============== ============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	60
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	61	.. table:: AMDGPU Environments
				62	:name: amdgpu-environment-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	63
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	64	============ ==============================================================
				65	Environment Description
				66	============ ==============================================================
				67	<empty> Defaults to ``opencl``.
				68	``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
				69	``amdgizcl`` Same as ``opencl`` except a different address space mapping is
				70	used (see :ref:`amdgpu-address-spaces`).
				71	``amdgiz`` Same as ``opencl`` except a different address space mapping is
				72	used (see :ref:`amdgpu-address-spaces`).
				73	``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
				74	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	75
				76	.. _amdgpu-processors:
				77
				78	Processors
				79	----------
				80
				81	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				82	names from both the Processor and Alternative Processor can be used.
				83
				84	.. table:: AMDGPU Processors
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	85	:name: amdgpu-processor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	86
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	87	=========== =============== ============ ===== ========= ======= ==================
				88	Processor Alternative Target dGPU/ Target ROCm Example
				89	Processor Triple APU Features Support Products
				90	Architecture Supported
				91	[Default]
				92	=========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	93	Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	94	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	95	``r600`` ``r600`` dGPU
				96	``r630`` ``r600`` dGPU
				97	``rs880`` ``r600`` dGPU
				98	``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	99	Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	100	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	101	``rv710`` ``r600`` dGPU
				102	``rv730`` ``r600`` dGPU
				103	``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	104	Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	105	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	106	``cedar`` ``r600`` dGPU
				107	``redwood`` ``r600`` dGPU
				108	``sumo`` ``r600`` dGPU
				109	``juniper`` ``r600`` dGPU
				110	``cypress`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	111	Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	112	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	113	``barts`` ``r600`` dGPU
				114	``turks`` ``r600`` dGPU
				115	``caicos`` ``r600`` dGPU
				116	``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	117	GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	118	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	119	``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
				120	``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU
				121	- ``verde``
				122	- ``oland``
				123	- ``hainan``
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	124	GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	125	-----------------------------------------------------------------------------------
				126	``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
				127	- A6 Pro-7050B
				128	- A8-7100
				129	- A8 Pro-7150B
				130	- A10-7300
				131	- A10 Pro-7350B
				132	- FX-7500
				133	- A8-7200P
				134	- A10-7400P
				135	- FX-7600P
				136	``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
				137	- FirePro W9100
				138	- FirePro S9150
				139	- FirePro S9170
				140	``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
				141	- Radeon R9 290x
				142	- Radeon R390
				143	- Radeon R390x
				144	``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
				145	- ``mullins`` - E1-2200
				146	- E1-2500
				147	- E2-3000
				148	- E2-3800
				149	- A4-5000
				150	- A4-5100
				151	- A6-5200
				152	- A4 Pro-3340B
				153	``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
				154	- Radeon HD 8770
				155	- R7 260
				156	- R7 260X
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	157	GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	158	-----------------------------------------------------------------------------------
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	159	``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
				160	[on] - Pro A6-8500B
				161	- A8-8600P
				162	- Pro A8-8600B
				163	- FX-8800P
				164	- Pro A12-8800B
				165	\ ``amdgcn`` APU - xnack ROCm - A10-8700P
				166	[on] - Pro A10-8700B
				167	- A10-8780P
				168	\ ``amdgcn`` APU - xnack - A10-9600P
				169	[on] - A10-9630P
				170	- A12-9700P
				171	- A12-9730P
				172	- FX-9800P
				173	- FX-9830P
				174	\ ``amdgcn`` APU - xnack - E2-9010
				175	[on] - A6-9210
				176	- A9-9410
				177	``gfx802`` - ``tonga`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
Tony Tye	a697880	2017-12-12 05:47:00 +0000	[diff] [blame]	178	- ``iceland`` [off] - FirePro S7100
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	179	- FirePro W7100
				180	- Radeon R285
				181	- Radeon R9 380
				182	- Radeon R9 385
				183	- Mobile FirePro
				184	M7170
				185	``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
				186	[off] - Radeon R9 Fury
				187	- Radeon R9 FuryX
				188	- Radeon Pro Duo
				189	- FirePro S9300x2
				190	- Radeon Instinct MI8
				191	\ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
				192	[off] - Radeon RX 480
				193	- Radeon Instinct MI6
				194	\ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
				195	[off]
				196	``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
				197	[on]
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	198	GCN GFX9 [AMD-GCN-GFX9]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	199	-----------------------------------------------------------------------------------
				200	``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
				201	[off] Frontier Edition
				202	- Radeon RX Vega 56
				203	- Radeon RX Vega 64
				204	- Radeon RX Vega 64
				205	Liquid
				206	- Radeon Instinct MI25
				207	``gfx902`` ``amdgcn`` APU - xnack TBA
				208	[on]
				209	.. TODO
				210	Add product
				211	names.
				212	=========== =============== ============ ===== ========= ======= ==================
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	213
				214	.. _amdgpu-target-features:
				215
				216	Target Features
				217	---------------
				218
				219	Target features control how code is generated to support certain
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	220	processor specific features. Not all target features are supported by
				221	all processors. The runtime must ensure that the features supported by
				222	the device used to execute the code match the features enabled when
				223	generating the code. A mismatch of features may result in incorrect
				224	execution, or a reduction in performance.
				225
				226	The target features supported by each processor, and the default value
				227	used if not specified explicitly, is listed in
				228	:ref:`amdgpu-processor-table`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	229
				230	Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
				231	target features.
				232
				233	For example:
				234
				235	``-mxnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	236	Enable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	237	``-mno-xnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	238	Disable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	239
				240	.. table:: AMDGPU Target Features
				241	:name: amdgpu-target-feature-table
				242
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	243	============== ==================================================
				244	Target Feature Description
				245	============== ==================================================
				246	-m[no-]xnack Enable/disable generating code that has
				247	memory clauses that are compatible with
				248	having XNACK replay enabled.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	249
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	250	This is used for demand paging and page
				251	migration. If XNACK replay is enabled in
				252	the device, then if a page fault occurs
				253	the code may execute incorrectly if the
				254	``xnack`` feature is not enabled. Executing
				255	code that has the feature enabled on a
				256	device that does not have XNACK replay
				257	enabled will execute correctly, but may
				258	be less performant than code with the
				259	feature disabled.
				260	============== ==================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	261
				262	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	263
				264	Address Spaces
				265	--------------
				266
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	267	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	268
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	269	The memory space names used in the table, aside from the region memory space, is
				270	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	271
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	272	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	273
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	274	.. table:: Address Space Mapping
				275	:name: amdgpu-address-space-mapping-table
				276
				277	================== ================= ================= ================= =================
				278	LLVM Address Space Memory Space
				279	------------------ -----------------------------------------------------------------------
				280	\ Current Default amdgiz/amdgizcl hcc Future Default
				281	================== ================= ================= ================= =================
				282	0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
				283	1 Global Global Global Global
				284	2 Constant Constant Constant Region (GDS)
				285	3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
				286	4 Generic (Flat) Region (GDS) Region (GDS) Constant
				287	5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
				288	================== ================= ================= ================= =================
				289
				290	Current Default
				291	This is the current default address space mapping used for all languages
				292	except hcc. This will shortly be deprecated.
				293
				294	amdgiz/amdgizcl
				295	This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
				296	is specified as the target triple environment value.
				297
				298	hcc
				299	This is the current address space mapping used when ``hcc`` is specified as
				300	the target triple environment value.This will shortly be deprecated.
				301
				302	Future Default
				303	This will shortly be the only address space mapping for all languages using
				304	AMDGPU backend.
				305
				306	.. _amdgpu-memory-scopes:
				307
				308	Memory Scopes
				309	-------------
				310
				311	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				312	backend memory model when the target triple OS is ``amdhsa`` (see
				313	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				314
				315	The memory model supported is based on the HSA memory model [HSA]_ which is
				316	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				317	relation is transitive over the synchonizes-with relation independent of scope,
				318	and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	319	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	320
				321	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				322	inclusion and requires the memory scopes to exactly match. However, this
				323	is conservatively correct for OpenCL.
				324
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	325	.. table:: AMDHSA LLVM Sync Scopes
				326	:name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	327
				328	================ ==========================================================
				329	LLVM Sync Scope Description
				330	================ ==========================================================
				331	none The default: ``system``.
				332
				333	Synchronizes with, and participates in modification and
				334	seq_cst total orderings with, other operations (except
				335	image operations) for all address spaces (except private,
				336	or generic that accesses private) provided the other
				337	operation's sync scope is:
				338
				339	- ``system``.
				340	- ``agent`` and executed by a thread on the same agent.
				341	- ``workgroup`` and executed by a thread in the same
				342	workgroup.
				343	- ``wavefront`` and executed by a thread in the same
				344	wavefront.
				345
				346	``agent`` Synchronizes with, and participates in modification and
				347	seq_cst total orderings with, other operations (except
				348	image operations) for all address spaces (except private,
				349	or generic that accesses private) provided the other
				350	operation's sync scope is:
				351
				352	- ``system`` or ``agent`` and executed by a thread on the
				353	same agent.
				354	- ``workgroup`` and executed by a thread in the same
				355	workgroup.
				356	- ``wavefront`` and executed by a thread in the same
				357	wavefront.
				358
				359	``workgroup`` Synchronizes with, and participates in modification and
				360	seq_cst total orderings with, other operations (except
				361	image operations) for all address spaces (except private,
				362	or generic that accesses private) provided the other
				363	operation's sync scope is:
				364
				365	- ``system``, ``agent`` or ``workgroup`` and executed by a
				366	thread in the same workgroup.
				367	- ``wavefront`` and executed by a thread in the same
				368	wavefront.
				369
				370	``wavefront`` Synchronizes with, and participates in modification and
				371	seq_cst total orderings with, other operations (except
				372	image operations) for all address spaces (except private,
				373	or generic that accesses private) provided the other
				374	operation's sync scope is:
				375
				376	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				377	and executed by a thread in the same wavefront.
				378
				379	``singlethread`` Only synchronizes with, and participates in modification
				380	and seq_cst total orderings with, other operations (except
				381	image operations) running in the same thread for all
				382	address spaces (for example, in signal handlers).
				383	================ ==========================================================
				384
				385	AMDGPU Intrinsics
				386	-----------------
				387
				388	The AMDGPU backend implements the following intrinsics.
				389
				390	This section is WIP.
				391
				392	.. TODO
				393	List AMDGPU intrinsics
				394
				395	Code Object
				396	===========
				397
				398	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				399	can be linked by ``lld`` to produce a standard ELF shared code object which can
				400	be loaded and executed on an AMDGPU target.
				401
				402	Header
				403	------
				404
				405	The AMDGPU backend uses the following ELF header:
				406
				407	.. table:: AMDGPU ELF Header
				408	:name: amdgpu-elf-header-table
				409
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	410	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	411	Field Value
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	412	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	413	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				414	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	415	``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
				416	- ``ELFOSABI_AMDGPU_HSA``
				417	- ``ELFOSABI_AMDGPU_PAL``
				418	- ``ELFOSABI_AMDGPU_MESA3D``
				419	``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
				420	- ``ELFABIVERSION_AMDGPU_PAL``
				421	- ``ELFABIVERSION_AMDGPU_MESA3D``
				422	``e_type`` - ``ET_REL``
				423	- ``ET_DYN``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	424	``e_machine`` ``EM_AMDGPU``
				425	``e_entry`` 0
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	426	``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	427	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	428
				429	..
				430
				431	.. table:: AMDGPU ELF Header Enumeration Values
				432	:name: amdgpu-elf-header-enumeration-values-table
				433
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	434	=============================== =====
				435	Name Value
				436	=============================== =====
				437	``EM_AMDGPU`` 224
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	438	``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	439	``ELFOSABI_AMDGPU_HSA`` 64
				440	``ELFOSABI_AMDGPU_PAL`` 65
				441	``ELFOSABI_AMDGPU_MESA3D`` 66
				442	``ELFABIVERSION_AMDGPU_HSA`` 1
				443	``ELFABIVERSION_AMDGPU_PAL`` 0
				444	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				445	=============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	446
				447	``e_ident[EI_CLASS]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	448	The ELF class is:
				449
				450	* ``ELFCLASS32`` for ``r600`` architecture.
				451
				452	* ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
				453	bit applications.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	454
				455	``e_ident[EI_DATA]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	456	All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	457
				458	``e_ident[EI_OSABI]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	459	One of the following AMD GPU architecture specific OS ABIs
				460	(see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	461
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	462	* ``ELFOSABI_NONE`` for unknown OS.
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	463
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	464	* ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	465
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	466	* ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
				467
				468	* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	469
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	470	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	471	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				472	object conforms:
				473
				474	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				475	runtime ABI.
				476
				477	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				478	runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	479
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	480	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	481	3D runtime ABI.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	482
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	483	``e_type``
				484	Can be one of the following values:
				485
				486
				487	``ET_REL``
				488	The type produced by the AMD GPU backend compiler as it is relocatable code
				489	object.
				490
				491	``ET_DYN``
				492	The type produced by the linker as it is a shared code object.
				493
				494	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				495
				496	``e_machine``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	497	The value ``EM_AMDGPU`` is used for the machine for all processors supported
				498	by the ``r600`` and ``amdgcn`` architectures (see
				499	:ref:`amdgpu-processor-table`). The specific processor is specified in the
				500	``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
				501	:ref:`amdgpu-elf-header-e_flags-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	502
				503	``e_entry``
				504	The entry point is 0 as the entry points for individual kernels must be
				505	selected in order to invoke them through AQL packets.
				506
				507	``e_flags``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	508	The AMDGPU backend uses the following ELF header flags:
				509
				510	.. table:: AMDGPU ELF Header ``e_flags``
				511	:name: amdgpu-elf-header-e_flags-table
				512
				513	================================= ========== =============================
				514	Name Value Description
				515	================================= ========== =============================
				516	AMDGPU Processor Flag See :ref:`amdgpu-processor-table`.
				517	-------------------------------------------- -----------------------------
				518	``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
				519	mask for
				520	``EF_AMDGPU_MACH_xxx`` values
				521	defined in
				522	:ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	523	``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
				524	target feature is
				525	enabled for all code
				526	contained in the code object.
				527	See
				528	:ref:`amdgpu-target-features`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	529	================================= ========== =============================
				530
				531	.. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
				532	:name: amdgpu-ef-amdgpu-mach-table
				533
				534	================================= ========== =============================
				535	Name Value Description (see
				536	:ref:`amdgpu-processor-table`)
				537	================================= ========== =============================
				538	``EF_AMDGPU_MACH_NONE`` 0 not specified
				539	``EF_AMDGPU_MACH_R600_R600`` 1 ``r600``
				540	``EF_AMDGPU_MACH_R600_R630`` 2 ``r630``
				541	``EF_AMDGPU_MACH_R600_RS880`` 3 ``rs880``
				542	``EF_AMDGPU_MACH_R600_RV670`` 4 ``rv670``
				543	``EF_AMDGPU_MACH_R600_RV710`` 5 ``rv710``
				544	``EF_AMDGPU_MACH_R600_RV730`` 6 ``rv730``
				545	``EF_AMDGPU_MACH_R600_RV770`` 7 ``rv770``
				546	``EF_AMDGPU_MACH_R600_CEDAR`` 8 ``cedar``
				547	``EF_AMDGPU_MACH_R600_REDWOOD`` 9 ``redwood``
				548	``EF_AMDGPU_MACH_R600_SUMO`` 10 ``sumo``
				549	``EF_AMDGPU_MACH_R600_JUNIPER`` 11 ``juniper``
				550	``EF_AMDGPU_MACH_R600_CYPRESS`` 12 ``cypress``
				551	``EF_AMDGPU_MACH_R600_BARTS`` 13 ``barts``
				552	``EF_AMDGPU_MACH_R600_TURKS`` 14 ``turks``
				553	``EF_AMDGPU_MACH_R600_CAICOS`` 15 ``caicos``
				554	``EF_AMDGPU_MACH_R600_CAYMAN`` 16 ``cayman``
				555	reserved 17-31 Reserved for ``r600``
				556	architecture processors.
				557	``EF_AMDGPU_MACH_AMDGCN_GFX600`` 32 ``gfx600``
				558	``EF_AMDGPU_MACH_AMDGCN_GFX601`` 33 ``gfx601``
				559	``EF_AMDGPU_MACH_AMDGCN_GFX700`` 34 ``gfx700``
				560	``EF_AMDGPU_MACH_AMDGCN_GFX701`` 35 ``gfx701``
				561	``EF_AMDGPU_MACH_AMDGCN_GFX702`` 36 ``gfx702``
				562	``EF_AMDGPU_MACH_AMDGCN_GFX703`` 37 ``gfx703``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	563	``EF_AMDGPU_MACH_AMDGCN_GFX704`` 38 ``gfx704``
Tony Tye	a697880	2017-12-12 05:47:00 +0000	[diff] [blame]	564	``EF_AMDGPU_MACH_AMDGCN_GFX801`` 39 ``gfx801``
				565	``EF_AMDGPU_MACH_AMDGCN_GFX802`` 40 ``gfx802``
				566	``EF_AMDGPU_MACH_AMDGCN_GFX803`` 41 ``gfx803``
				567	``EF_AMDGPU_MACH_AMDGCN_GFX810`` 42 ``gfx810``
				568	``EF_AMDGPU_MACH_AMDGCN_GFX900`` 43 ``gfx900``
				569	``EF_AMDGPU_MACH_AMDGCN_GFX902`` 44 ``gfx902``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	570	================================= ========== =============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	571
				572	Sections
				573	--------
				574
				575	An AMDGPU target ELF code object has the standard ELF sections which include:
				576
				577	.. table:: AMDGPU ELF Sections
				578	:name: amdgpu-elf-sections-table
				579
				580	================== ================ =================================
				581	Name Type Attributes
				582	================== ================ =================================
				583	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				584	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				585	``.debug_``\ \* ``SHT_PROGBITS`` none
				586	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				587	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				588	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				589	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				590	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				591	``.note`` ``SHT_NOTE`` none
				592	``.rela``\ name ``SHT_RELA`` none
				593	``.rela.dyn`` ``SHT_RELA`` none
				594	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				595	``.shstrtab`` ``SHT_STRTAB`` none
				596	``.strtab`` ``SHT_STRTAB`` none
				597	``.symtab`` ``SHT_SYMTAB`` none
				598	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				599	================== ================ =================================
				600
				601	These sections have their standard meanings (see [ELF]_) and are only generated
				602	if needed.
				603
				604	``.debug``\ \*
				605	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				606	DWARF produced by the AMDGPU backend.
				607
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	608	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	609	The standard sections used by a dynamic loader.
				610
				611	``.note``
				612	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				613	backend.
				614
				615	``.rela``\ name, ``.rela.dyn``
				616	For relocatable code objects, name is the name of the section that the
				617	relocation records apply. For example, ``.rela.text`` is the section name for
				618	relocation records associated with the ``.text`` section.
				619
				620	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				621	records from each of the relocatable code object's ``.rela``\ name sections.
				622
				623	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				624	the AMDGPU backend.
				625
				626	``.text``
				627	The executable machine code for the kernels and functions they call. Generated
				628	as position independent code. See :ref:`amdgpu-code-conventions` for
				629	information on conventions used in the isa generation.
				630
				631	.. _amdgpu-note-records:
				632
				633	Note Records
				634	------------
				635
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	636	As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
				637	be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
				638	aligned. In addition, minimal zero byte padding must be generated to ensure the
				639	``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
				640	``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	641
				642	The AMDGPU backend code object uses the following ELF note records in the
				643	``.note`` section. The Description column specifies the layout of the note
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	644	record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	645	variable size strings have a corresponding ``*_size`` field that specifies the
				646	number of bytes, including the terminating null character, in the string. The
				647	string(s) come immediately after the preceding fields.
				648
				649	Additional note records can be present.
				650
				651	.. table:: AMDGPU ELF Note Records
				652	:name: amdgpu-elf-note-records-table
				653
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	654	===== ============================== ======================================
				655	Name Type Description
				656	===== ============================== ======================================
				657	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	658	===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	659
				660	..
				661
				662	.. table:: AMDGPU ELF Note Record Enumeration Values
				663	:name: amdgpu-elf-note-record-enumeration-values-table
				664
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	665	============================== =====
				666	Name Value
				667	============================== =====
				668	reserved 0-9
				669	``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	670	reserved 11
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	671	============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	672
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	673	``NT_AMD_AMDGPU_HSA_METADATA``
				674	Specifies extensible metadata associated with the code objects executed on HSA
				675	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				676	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				677	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				678	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	679
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	680	.. _amdgpu-symbols:
				681
				682	Symbols
				683	-------
				684
				685	Symbols include the following:
				686
				687	.. table:: AMDGPU ELF Symbols
				688	:name: amdgpu-elf-symbols-table
				689
				690	===================== ============== ============= ==================
				691	Name Type Section Description
				692	===================== ============== ============= ==================
				693	link-name ``STT_OBJECT`` - ``.data`` Global variable
				694	- ``.rodata``
				695	- ``.bss``
				696	link-name\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
				697	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				698	===================== ============== ============= ==================
				699
				700	Global variable
				701	Global variables both used and defined by the compilation unit.
				702
				703	If the symbol is defined in the compilation unit then it is allocated in the
				704	appropriate section according to if it has initialized data or is readonly.
				705
				706	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				707	will resolve relocations using the definition provided by another code object
				708	or explicitly defined by the runtime.
				709
				710	All global symbols, whether defined in the compilation unit or external, are
				711	accessed by the machine code indirectly through a GOT table entry. This
				712	allows them to be preemptable. The GOT table is only supported when the target
				713	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	714
				715	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	716	Add description of linked shared object symbols. Seems undefined symbols
				717	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	718
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	719	Kernel descriptor
				720	Every HSA kernel has an associated kernel descriptor. It is the address of the
				721	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				722	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				723	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				724
				725	Kernel entry point
				726	Every HSA kernel also has a symbol for its machine code entry point.
				727
				728	.. _amdgpu-relocation-records:
				729
				730	Relocation Records
				731	------------------
				732
				733	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				734	relocatable fields are:
				735
				736	``word32``
				737	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				738	alignment. These values use the same byte order as other word values in the
				739	AMD GPU architecture.
				740
				741	``word64``
				742	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				743	alignment. These values use the same byte order as other word values in the
				744	AMD GPU architecture.
				745
				746	Following notations are used for specifying relocation calculations:
				747
				748	A
				749	Represents the addend used to compute the value of the relocatable field.
				750
				751	G
				752	Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	753	entry's symbol will reside during execution.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	754
				755	GOT
				756	Represents the address of the global offset table.
				757
				758	P
				759	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				760	of the storage unit being relocated (computed using ``r_offset``).
				761
				762	S
				763	Represents the value of the symbol whose index resides in the relocation
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	764	entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				765
				766	B
				767	Represents the base address of a loaded executable or shared object which is
				768	the difference between the ELF address and the actual load address. Relocations
				769	using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	770
				771	The following relocation types are supported:
				772
				773	.. table:: AMDGPU ELF Relocation Records
				774	:name: amdgpu-elf-relocation-records-table
				775
				776	========================== ===== ========== ==============================
				777	Relocation Type Value Field Calculation
				778	========================== ===== ========== ==============================
				779	``R_AMDGPU_NONE`` 0 none none
				780	``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
				781	``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
				782	``R_AMDGPU_ABS64`` 3 ``word64`` S + A
				783	``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
				784	``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
				785	``R_AMDGPU_ABS32`` 6 ``word32`` S + A
				786	``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
				787	``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				788	``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
				789	``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				790	``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	791	reserved 12
				792	``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	793	========================== ===== ========== ==============================
				794
				795	.. _amdgpu-dwarf:
				796
				797	DWARF
				798	-----
				799
				800	Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
				801	information that maps the code object executable code and data to the source
				802	language constructs. It can be used by tools such as debuggers and profilers.
				803
				804	Address Space Mapping
				805	~~~~~~~~~~~~~~~~~~~~~
				806
				807	The following address space mapping is used:
				808
				809	.. table:: AMDGPU DWARF Address Space Mapping
				810	:name: amdgpu-dwarf-address-space-mapping-table
				811
				812	=================== =================
				813	DWARF Address Space Memory Space
				814	=================== =================
				815	1 Private (Scratch)
				816	2 Local (group/LDS)
				817	omitted Global
				818	omitted Constant
				819	omitted Generic (Flat)
				820	not supported Region (GDS)
				821	=================== =================
				822
				823	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				824	used in the table.
				825
				826	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				827	DWARF address space of the value of the pointer when it is in the private or
				828	local address space. Otherwise the attribute is omitted.
				829
				830	An ``XDEREF`` operation is generated in location list expressions for variables
				831	that are allocated in the private and local address space. Otherwise no
				832	``XDREF`` is omitted.
				833
				834	Register Mapping
				835	~~~~~~~~~~~~~~~~
				836
				837	This section is WIP.
				838
				839	.. TODO
				840	Define DWARF register enumeration.
				841
				842	If want to present a wavefront state then should expose vector registers as
				843	64 wide (rather than per work-item view that LLVM uses). Either as separate
				844	registers, or a 64x4 byte single register. In either case use a new LANE op
				845	(akin to XDREF) to select the current lane usage in a location
				846	expression. This would also allow scalar register spilling to vector register
				847	lanes to be expressed (currently no debug information is being generated for
				848	spilling). If choose a wide single register approach then use LANE in
				849	conjunction with PIECE operation to select the dword part of the register for
				850	the current lane. If the separate register approach then use LANE to select
				851	the register.
				852
				853	Source Text
				854	~~~~~~~~~~~
				855
				856	This section is WIP.
				857
				858	.. TODO
				859	DWARF extension to include runtime generated source text.
				860
				861	.. _amdgpu-code-conventions:
				862
				863	Code Conventions
				864	================
				865
				866	This section provides code conventions used for each supported target triple OS
				867	(see :ref:`amdgpu-target-triples`).
				868
				869	AMDHSA
				870	------
				871
				872	This section provides code conventions used when the target triple OS is
				873	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				874
				875	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	876
				877	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	878	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	879
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	880	The code object metadata specifies extensible metadata associated with the code
				881	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
				882	[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
				883	(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
				884	``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
				885	information necessary to support the ROCM kernel queries. For example, the
				886	segment sizes needed in a dispatch packet. In addition, a high level language
				887	runtime may require other information to be included. For example, the AMD
				888	OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	889
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	890	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	891	:doc:`YamlIO`).
				892
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	893	.. TODO
				894	Is the string null terminated? It probably should not if YAML allows it to
				895	contain null characters, otherwise it should be.
				896
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	897	The metadata is represented as a single YAML document comprised of the mapping
				898	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				899	referenced tables.
				900
				901	For boolean values, the string values of ``false`` and ``true`` are used for
				902	false and true respectively.
				903
				904	Additional information can be added to the mappings. To avoid conflicts, any
				905	non-AMD key names should be prefixed by "vendor-name.".
				906
				907	.. table:: AMDHSA Code Object Metadata Mapping
				908	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				909
				910	========== ============== ========= =======================================
				911	String Key Value Type Required? Description
				912	========== ============== ========= =======================================
				913	"Version" sequence of Required - The first integer is the major
				914	2 integers version. Currently 1.
				915	- The second integer is the minor
				916	version. Currently 0.
				917	"Printf" sequence of Each string is encoded information
				918	strings about a printf function call. The
				919	encoded information is organized as
				920	fields separated by colon (':'):
				921
				922	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				923
				924	where:
				925
				926	``ID``
				927	A 32 bit integer as a unique id for
				928	each printf function call
				929
				930	``N``
				931	A 32 bit integer equal to the number
				932	of arguments of printf function call
				933	minus 1
				934
				935	``S[i]`` (where i = 0, 1, ... , N-1)
				936	32 bit integers for the size in bytes
				937	of the i-th FormatString argument of
				938	the printf function call
				939
				940	FormatString
				941	The format string passed to the
				942	printf function call.
				943	"Kernels" sequence of Required Sequence of the mappings for each
				944	mapping kernel in the code object. See
				945	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				946	for the definition of the mapping.
				947	========== ============== ========= =======================================
				948
				949	..
				950
				951	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				952	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				953
				954	================= ============== ========= ================================
				955	String Key Value Type Required? Description
				956	================= ============== ========= ================================
				957	"Name" string Required Source name of the kernel.
				958	"SymbolName" string Required Name of the kernel
				959	descriptor ELF symbol.
				960	"Language" string Source language of the kernel.
				961	Values include:
				962
				963	- "OpenCL C"
				964	- "OpenCL C++"
				965	- "HCC"
				966	- "OpenMP"
				967
				968	"LanguageVersion" sequence of - The first integer is the major
				969	2 integers version.
				970	- The second integer is the
				971	minor version.
				972	"Attrs" mapping Mapping of kernel attributes.
				973	See
				974	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				975	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	976	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	977	mapping kernel arguments. See
				978	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				979	for the definition of the mapping.
				980	"CodeProps" mapping Mapping of properties related to
				981	the kernel code. See
				982	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				983	for the mapping definition.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	984	================= ============== ========= ================================
				985
				986	..
				987
				988	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				989	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				990
				991	=================== ============== ========= ==============================
				992	String Key Value Type Required? Description
				993	=================== ============== ========= ==============================
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame^]	994	"ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
				995	3 integers must be >=1 and the dispatch
				996	work-group size X, Y, Z must
				997	correspond to the specified
				998	values. Defaults to 0, 0, 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	999
				1000	Corresponds to the OpenCL
				1001	``reqd_work_group_size``
				1002	attribute.
				1003	"WorkGroupSizeHint" sequence of The dispatch work-group size
				1004	3 integers X, Y, Z is likely to be the
				1005	specified values.
				1006
				1007	Corresponds to the OpenCL
				1008	``work_group_size_hint``
				1009	attribute.
				1010	"VecTypeHint" string The name of a scalar or vector
				1011	type.
				1012
				1013	Corresponds to the OpenCL
				1014	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	1015
				1016	"RuntimeHandle" string The external symbol name
				1017	associated with a kernel.
				1018	OpenCL runtime allocates a
				1019	global buffer for the symbol
				1020	and saves the kernel's address
				1021	to it, which is used for
				1022	device side enqueueing. Only
				1023	available for device side
				1024	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1025	=================== ============== ========= ==============================
				1026
				1027	..
				1028
				1029	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				1030	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				1031
				1032	================= ============== ========= ================================
				1033	String Key Value Type Required? Description
				1034	================= ============== ========= ================================
				1035	"Name" string Kernel argument name.
				1036	"TypeName" string Kernel argument type name.
				1037	"Size" integer Required Kernel argument size in bytes.
				1038	"Align" integer Required Kernel argument alignment in
				1039	bytes. Must be a power of two.
				1040	"ValueKind" string Required Kernel argument kind that
				1041	specifies how to set up the
				1042	corresponding argument.
				1043	Values include:
				1044
				1045	"ByValue"
				1046	The argument is copied
				1047	directly into the kernarg.
				1048
				1049	"GlobalBuffer"
				1050	A global address space pointer
				1051	to the buffer data is passed
				1052	in the kernarg.
				1053
				1054	"DynamicSharedPointer"
				1055	A group address space pointer
				1056	to dynamically allocated LDS
				1057	is passed in the kernarg.
				1058
				1059	"Sampler"
				1060	A global address space
				1061	pointer to a S# is passed in
				1062	the kernarg.
				1063
				1064	"Image"
				1065	A global address space
				1066	pointer to a T# is passed in
				1067	the kernarg.
				1068
				1069	"Pipe"
				1070	A global address space pointer
				1071	to an OpenCL pipe is passed in
				1072	the kernarg.
				1073
				1074	"Queue"
				1075	A global address space pointer
				1076	to an OpenCL device enqueue
				1077	queue is passed in the
				1078	kernarg.
				1079
				1080	"HiddenGlobalOffsetX"
				1081	The OpenCL grid dispatch
				1082	global offset for the X
				1083	dimension is passed in the
				1084	kernarg.
				1085
				1086	"HiddenGlobalOffsetY"
				1087	The OpenCL grid dispatch
				1088	global offset for the Y
				1089	dimension is passed in the
				1090	kernarg.
				1091
				1092	"HiddenGlobalOffsetZ"
				1093	The OpenCL grid dispatch
				1094	global offset for the Z
				1095	dimension is passed in the
				1096	kernarg.
				1097
				1098	"HiddenNone"
				1099	An argument that is not used
				1100	by the kernel. Space needs to
				1101	be left for it, but it does
				1102	not need to be set up.
				1103
				1104	"HiddenPrintfBuffer"
				1105	A global address space pointer
				1106	to the runtime printf buffer
				1107	is passed in kernarg.
				1108
				1109	"HiddenDefaultQueue"
				1110	A global address space pointer
				1111	to the OpenCL device enqueue
				1112	queue that should be used by
				1113	the kernel by default is
				1114	passed in the kernarg.
				1115
				1116	"HiddenCompletionAction"
Yaxun Liu	c928f2a	2017-10-30 14:30:28 +0000	[diff] [blame]	1117	A global address space pointer
				1118	to help link enqueued kernels into
				1119	the ancestor tree for determining
				1120	when the parent kernel has finished.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1121
				1122	"ValueType" string Required Kernel argument value type. Only
				1123	present if "ValueKind" is
				1124	"ByValue". For vector data
				1125	types, the value is for the
				1126	element type. Values include:
				1127
				1128	- "Struct"
				1129	- "I8"
				1130	- "U8"
				1131	- "I16"
				1132	- "U16"
				1133	- "F16"
				1134	- "I32"
				1135	- "U32"
				1136	- "F32"
				1137	- "I64"
				1138	- "U64"
				1139	- "F64"
				1140
				1141	.. TODO
				1142	How can it be determined if a
				1143	vector type, and what size
				1144	vector?
				1145	"PointeeAlign" integer Alignment in bytes of pointee
				1146	type for pointer type kernel
				1147	argument. Must be a power
				1148	of 2. Only present if
				1149	"ValueKind" is
				1150	"DynamicSharedPointer".
				1151	"AddrSpaceQual" string Kernel argument address space
				1152	qualifier. Only present if
				1153	"ValueKind" is "GlobalBuffer" or
				1154	"DynamicSharedPointer". Values
				1155	are:
				1156
				1157	- "Private"
				1158	- "Global"
				1159	- "Constant"
				1160	- "Local"
				1161	- "Generic"
				1162	- "Region"
				1163
				1164	.. TODO
				1165	Is GlobalBuffer only Global
				1166	or Constant? Is
				1167	DynamicSharedPointer always
				1168	Local? Can HCC allow Generic?
				1169	How can Private or Region
				1170	ever happen?
				1171	"AccQual" string Kernel argument access
				1172	qualifier. Only present if
				1173	"ValueKind" is "Image" or
				1174	"Pipe". Values
				1175	are:
				1176
				1177	- "ReadOnly"
				1178	- "WriteOnly"
				1179	- "ReadWrite"
				1180
				1181	.. TODO
				1182	Does this apply to
				1183	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1184	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1185	performed by the kernel on the
				1186	kernel argument. Only present if
				1187	"ValueKind" is "GlobalBuffer",
				1188	"Image", or "Pipe". This may be
				1189	more restrictive than indicated
				1190	by "AccQual" to reflect what the
				1191	kernel actual does. If not
				1192	present then the runtime must
				1193	assume what is implied by
				1194	"AccQual" and "IsConst". Values
				1195	are:
				1196
				1197	- "ReadOnly"
				1198	- "WriteOnly"
				1199	- "ReadWrite"
				1200
				1201	"IsConst" boolean Indicates if the kernel argument
				1202	is const qualified. Only present
				1203	if "ValueKind" is
				1204	"GlobalBuffer".
				1205
				1206	"IsRestrict" boolean Indicates if the kernel argument
				1207	is restrict qualified. Only
				1208	present if "ValueKind" is
				1209	"GlobalBuffer".
				1210
				1211	"IsVolatile" boolean Indicates if the kernel argument
				1212	is volatile qualified. Only
				1213	present if "ValueKind" is
				1214	"GlobalBuffer".
				1215
				1216	"IsPipe" boolean Indicates if the kernel argument
				1217	is pipe qualified. Only present
				1218	if "ValueKind" is "Pipe".
				1219
				1220	.. TODO
				1221	Can GlobalBuffer be pipe
				1222	qualified?
				1223	================= ============== ========= ================================
				1224
				1225	..
				1226
				1227	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1228	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1229
				1230	============================ ============== ========= =====================
				1231	String Key Value Type Required? Description
				1232	============================ ============== ========= =====================
				1233	"KernargSegmentSize" integer Required The size in bytes of
				1234	the kernarg segment
				1235	that holds the values
				1236	of the arguments to
				1237	the kernel.
				1238	"GroupSegmentFixedSize" integer Required The amount of group
				1239	segment memory
				1240	required by a
				1241	work-group in
				1242	bytes. This does not
				1243	include any
				1244	dynamically allocated
				1245	group segment memory
				1246	that may be added
				1247	when the kernel is
				1248	dispatched.
				1249	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1250	private address space
				1251	memory required for a
				1252	work-item in
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1253	bytes. If the kernel
				1254	uses a dynamic call
				1255	stack then additional
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1256	space must be added
				1257	to this value for the
				1258	call stack.
				1259	"KernargSegmentAlign" integer Required The maximum byte
				1260	alignment of
				1261	arguments in the
				1262	kernarg segment. Must
				1263	be a power of 2.
				1264	"WavefrontSize" integer Required Wavefront size. Must
				1265	be a power of 2.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1266	"NumSGPRs" integer Required Number of scalar
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1267	registers used by a
				1268	wavefront for
				1269	GFX6-GFX9. This
				1270	includes the special
				1271	SGPRs for VCC, Flat
				1272	Scratch (GFX7-GFX9)
				1273	and XNACK (for
				1274	GFX8-GFX9). It does
				1275	not include the 16
				1276	SGPR added if a trap
				1277	handler is
				1278	enabled. It is not
				1279	rounded up to the
				1280	allocation
				1281	granularity.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1282	"NumVGPRs" integer Required Number of vector
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1283	registers used by
				1284	each work-item for
				1285	GFX6-GFX9
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1286	"MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1287	work-group size
				1288	supported by the
				1289	kernel in work-items.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1290	Must be >=1 and
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame^]	1291	consistent with
				1292	ReqdWorkGroupSize if
				1293	not 0, 0, 0.
Konstantin Zhuravlyov	06ae4ec	2017-11-28 17:51:08 +0000	[diff] [blame]	1294	"NumSpilledSGPRs" integer Number of stores from
				1295	a scalar register to
				1296	a register allocator
				1297	created spill
				1298	location.
				1299	"NumSpilledVGPRs" integer Number of stores from
				1300	a vector register to
				1301	a register allocator
				1302	created spill
				1303	location.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1304	============================ ============== ========= =====================
				1305
				1306	..
				1307
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1308	Kernel Dispatch
				1309	~~~~~~~~~~~~~~~
				1310
				1311	The HSA architected queuing language (AQL) defines a user space memory interface
				1312	that can be used to control the dispatch of kernels, in an agent independent
				1313	way. An agent can have zero or more AQL queues created for it using the ROCm
				1314	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1315	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1316	mechanics and packet layouts.
				1317
				1318	The packet processor of a kernel agent is responsible for detecting and
				1319	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1320	packet processor is implemented by the hardware command processor (CP),
				1321	asynchronous dispatch controller (ADC) and shader processor input controller
				1322	(SPI).
				1323
				1324	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1325	mode driver to initialize and register the AQL queue with CP.
				1326
				1327	To dispatch a kernel the following actions are performed. This can occur in the
				1328	CPU host program, or from an HSA kernel executing on a GPU.
				1329
				1330	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1331	executed is obtained.
				1332	2. A pointer to the kernel descriptor (see
				1333	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1334	obtained. It must be for a kernel that is contained in a code object that that
				1335	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1336	associated.
				1337	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1338	for a memory region with the kernarg property for the kernel agent that will
				1339	execute the kernel. It must be at least 16 byte aligned.
				1340	4. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	1341	allocation. The layout is defined in the HSA Programmer's Language Reference
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1342	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1343	memory in the same way constant memory is accessed. (Note that the HSA
				1344	specification allows an implementation to copy the kernel argument contents to
				1345	another location that is accessed by the kernel.)
				1346	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1347	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1348	packet. The packet must be set up, and the final write must use an atomic
				1349	store release to set the packet kind to ensure the packet contents are
				1350	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1351	notify the kernel agent that the AQL queue has been updated. These rules, and
				1352	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1353	System Architecture Specification* [HSA]_.
				1354	6. A kernel dispatch packet includes information about the actual dispatch,
				1355	such as grid and work-group size, together with information from the code
				1356	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1357	the kernel symbol can be used to obtain the code object values which are
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1358	recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1359	7. CP executes micro-code and is responsible for detecting and setting up the
				1360	GPU to execute the wavefronts of a kernel dispatch.
				1361	8. CP ensures that when the a wavefront starts executing the kernel machine
				1362	code, the scalar general purpose registers (SGPR) and vector general purpose
				1363	registers (VGPR) are set up as required by the machine code. The required
				1364	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1365	register state is defined in
				1366	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1367	9. The prolog of the kernel machine code (see
				1368	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1369	before continuing executing the machine code that corresponds to the kernel.
				1370	10. When the kernel dispatch has completed execution, CP signals the completion
				1371	signal specified in the kernel dispatch packet if not 0.
				1372
				1373	.. _amdgpu-amdhsa-memory-spaces:
				1374
				1375	Memory Spaces
				1376	~~~~~~~~~~~~~
				1377
				1378	The memory space properties are:
				1379
				1380	.. table:: AMDHSA Memory Spaces
				1381	:name: amdgpu-amdhsa-memory-spaces-table
				1382
				1383	================= =========== ======== ======= ==================
				1384	Memory Space Name HSA Segment Hardware Address NULL Value
				1385	Name Name Size
				1386	================= =========== ======== ======= ==================
				1387	Private private scratch 32 0x00000000
				1388	Local group LDS 32 0xFFFFFFFF
				1389	Global global global 64 0x0000000000000000
				1390	Constant constant *same as 64 0x0000000000000000
				1391	global*
				1392	Generic flat flat 64 0x0000000000000000
				1393	Region N/A GDS 32 *not implemented
				1394	for AMDHSA*
				1395	================= =========== ======== ======= ==================
				1396
				1397	The global and constant memory spaces both use global virtual addresses, which
				1398	are the same virtual address space used by the CPU. However, some virtual
				1399	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1400	and some by both.
				1401
				1402	Using the constant memory space indicates that the data will not change during
				1403	the execution of the kernel. This allows scalar read instructions to be
				1404	used. The vector and scalar L1 caches are invalidated of volatile data before
				1405	each kernel dispatch execution to allow constant memory to change values between
				1406	kernel dispatches.
				1407
				1408	The local memory space uses the hardware Local Data Store (LDS) which is
				1409	automatically allocated when the hardware creates work-groups of wavefronts, and
				1410	freed when all the wavefronts of a work-group have terminated. The data store
				1411	(DS) instructions can be used to access it.
				1412
				1413	The private memory space uses the hardware scratch memory support. If the kernel
				1414	uses scratch, then the hardware allocates memory that is accessed using
				1415	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1416	address to physical address is:
				1417
				1418	``wavefront-scratch-base +
				1419	(private-address * wavefront-size * 4) +
				1420	(wavefront-lane-id * 4)``
				1421
				1422	There are different ways that the wavefront scratch base address is determined
				1423	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1424	memory can be accessed in an interleaved manner using buffer instruction with
				1425	the scratch buffer descriptor and per wave scratch offset, by the scratch
				1426	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1427	same private address, the interleaving results in adjacent dwords being accessed
				1428	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1429	supported except by flat and scratch instructions in GFX9.
				1430
				1431	The generic address space uses the hardware flat address support available in
				1432	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1433	local appertures), that are outside the range of addressible global memory, to
				1434	map from a flat address to a private or local address.
				1435
				1436	FLAT instructions can take a flat address and access global, private (scratch)
				1437	and group (LDS) memory depending in if the address is within one of the
				1438	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1439	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1440	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1441	(see :ref:`amdgpu-amdhsa-m0`).
				1442
				1443	To convert between a segment address and a flat address the base address of the
				1444	appertures address can be used. For GFX7-GFX8 these are available in the
				1445	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1446	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1447	GFX9 the appature base addresses are directly available as inline constant
				1448	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1449	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1450	which makes it easier to convert from flat to segment or segment to flat.
				1451
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1452	Image and Samplers
				1453	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1454
				1455	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1456	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1457	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1458	enumeration values for the queries that are not trivially deducible from the S#
				1459	representation.
				1460
				1461	HSA Signals
				1462	~~~~~~~~~~~
				1463
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1464	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1465	structure allocated in memory accessible from both the CPU and GPU. The
				1466	structure is defined by the ROCm runtime and subject to change between releases
				1467	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1468
				1469	.. _amdgpu-amdhsa-hsa-aql-queue:
				1470
				1471	HSA AQL Queue
				1472	~~~~~~~~~~~~~
				1473
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1474	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1475	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1476	fields needed to implement certain language features such as the flat address
				1477	aperture bases. It also contains fields used by CP such as managing the
				1478	allocation of scratch memory.
				1479
				1480	.. _amdgpu-amdhsa-kernel-descriptor:
				1481
				1482	Kernel Descriptor
				1483	~~~~~~~~~~~~~~~~~
				1484
				1485	A kernel descriptor consists of the information needed by CP to initiate the
				1486	execution of a kernel, including the entry point address of the machine code
				1487	that implements the kernel.
				1488
				1489	Kernel Descriptor for GFX6-GFX9
				1490	+++++++++++++++++++++++++++++++
				1491
				1492	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1493
				1494	.. table:: Kernel Descriptor for GFX6-GFX9
				1495	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1496
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1497	======= ======= =============================== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1498	Bits Size Field Name Description
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1499	======= ======= =============================== ============================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1500	31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1501	address space memory
				1502	required for a work-group
				1503	in bytes. This does not
				1504	include any dynamically
				1505	allocated local address
				1506	space memory that may be
				1507	added when the kernel is
				1508	dispatched.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1509	63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1510	private address space
				1511	memory required for a
				1512	work-item in bytes. If
				1513	is_dynamic_callstack is 1
				1514	then additional space must
				1515	be added to this value for
				1516	the call stack.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1517	127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1518	191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1519	negative) from base
				1520	address of kernel
				1521	descriptor to kernel's
				1522	entry point instruction
				1523	which must be 256 byte
				1524	aligned.
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame^]	1525	383:192 24 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1526	bytes
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1527	415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1528	program settings used by
				1529	CP to set up
				1530	``COMPUTE_PGM_RSRC1``
				1531	configuration
				1532	register. See
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1533	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1534	447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1535	program settings used by
				1536	CP to set up
				1537	``COMPUTE_PGM_RSRC2``
				1538	configuration
				1539	register. See
				1540	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1541	448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
				1542	SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1543	(see
				1544	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1545
				1546	The total number of SGPR
				1547	user data registers
				1548	requested must not exceed
				1549	16 and match value in
				1550	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1551	Any requests beyond 16
				1552	will be ignored.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1553	449 1 bit EnableSGPRDispatchPtr see above
				1554	450 1 bit EnableSGPRQueuePtr see above
				1555	451 1 bit EnableSGPRKernargSegmentPtr see above
				1556	452 1 bit EnableSGPRDispatchID see above
				1557	453 1 bit EnableSGPRFlatScratchInit see above
				1558	454 1 bit EnableSGPRPrivateSegmentSize see above
				1559	455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
				1560	should always be 0.
				1561	456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
				1562	should always be 0.
				1563	457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
				1564	should always be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	1565	463:458 6 bits Reserved, must be 0.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1566	511:464 6 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1567	bytes
				1568	512 Total size 64 bytes.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1569	======= ====================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1570
				1571	..
				1572
				1573	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1574	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1575
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1576	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1577	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1578	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1579	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1580	used by each work-item,
				1581	granularity is device
				1582	specific:
				1583
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1584	GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1585	- max_vgpr 1..256
				1586	- roundup((max_vgpg + 1)
				1587	/ 4) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1588
				1589	Used by CP to set up
				1590	``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1591	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1592	used by a wavefront,
				1593	granularity is device
				1594	specific:
				1595
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1596	GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1597	- max_sgpr 1..112
				1598	- roundup((max_sgpg + 1)
				1599	/ 8) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1600	GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1601	- max_sgpr 1..112
				1602	- roundup((max_sgpg + 1)
				1603	/ 16) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1604
				1605	Includes the special SGPRs
				1606	for VCC, Flat Scratch (for
				1607	GFX7 onwards) and XNACK
				1608	(for GFX8 onwards). It does
				1609	not include the 16 SGPR
				1610	added if a trap handler is
				1611	enabled.
				1612
				1613	Used by CP to set up
				1614	``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1615	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1616
				1617	Start executing wavefront
				1618	at the specified priority.
				1619
				1620	CP is responsible for
				1621	filling in
				1622	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1623	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1624	with specified rounding
				1625	mode for single (32
				1626	bit) floating point
				1627	precision floating point
				1628	operations.
				1629
				1630	Floating point rounding
				1631	mode values are defined in
				1632	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1633
				1634	Used by CP to set up
				1635	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1636	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1637	with specified rounding
				1638	denorm mode for half/double (16
				1639	and 64 bit) floating point
				1640	precision floating point
				1641	operations.
				1642
				1643	Floating point rounding
				1644	mode values are defined in
				1645	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1646
				1647	Used by CP to set up
				1648	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1649	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1650	with specified denorm mode
				1651	for single (32
				1652	bit) floating point
				1653	precision floating point
				1654	operations.
				1655
				1656	Floating point denorm mode
				1657	values are defined in
				1658	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1659
				1660	Used by CP to set up
				1661	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1662	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1663	with specified denorm mode
				1664	for half/double (16
				1665	and 64 bit) floating point
				1666	precision floating point
				1667	operations.
				1668
				1669	Floating point denorm mode
				1670	values are defined in
				1671	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1672
				1673	Used by CP to set up
				1674	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1675	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1676
				1677	Start executing wavefront
				1678	in privilege trap handler
				1679	mode.
				1680
				1681	CP is responsible for
				1682	filling in
				1683	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1684	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1685	with DX10 clamp mode
				1686	enabled. Used by the vector
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1687	ALU to force DX10 style
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1688	treatment of NaN's (when
				1689	set, clamp NaN to zero,
				1690	otherwise pass NaN
				1691	through).
				1692
				1693	Used by CP to set up
				1694	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1695	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1696
				1697	Start executing wavefront
				1698	in single step mode.
				1699
				1700	CP is responsible for
				1701	filling in
				1702	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1703	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1704	with IEEE mode
				1705	enabled. Floating point
				1706	opcodes that support
				1707	exception flag gathering
				1708	will quiet and propagate
				1709	signaling-NaN inputs per
				1710	IEEE 754-2008. Min_dx10 and
				1711	max_dx10 become IEEE
				1712	754-2008 compliant due to
				1713	signaling-NaN propagation
				1714	and quieting.
				1715
				1716	Used by CP to set up
				1717	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1718	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1719
				1720	Only one work-group allowed
				1721	to execute on a compute
				1722	unit.
				1723
				1724	CP is responsible for
				1725	filling in
				1726	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1727	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1728
				1729	Flag that can be used to
				1730	control debugging code.
				1731
				1732	CP is responsible for
				1733	filling in
				1734	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1735	26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1736	Reserved, must be 0.
				1737	GFX9
				1738	Wavefront starts execution
				1739	with specified fp16 overflow
				1740	mode.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1741
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1742	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1743	+/-INF values.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1744	- If 1, fp16 overflow that is the
				1745	result of an +/-INF input value
				1746	or divide by 0 produces a +/-INF,
				1747	otherwise clamps computed
				1748	overflow to +/-MAX_FP16 as
				1749	appropriate.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1750
				1751	Used by CP to set up
				1752	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1753	31:27 5 bits Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1754	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1755	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1756
				1757	..
				1758
				1759	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1760	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1761
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1762	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1763	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1764	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1765	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
				1766	_WAVE_OFFSET SGPR wave scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1767	system register (see
				1768	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1769
				1770	Used by CP to set up
				1771	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1772	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1773	user data registers
				1774	requested. This number must
				1775	match the number of user
				1776	data registers enabled.
				1777
				1778	Used by CP to set up
				1779	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1780	6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1781	TRAP instruction which
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1782	requires a trap handler to
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1783	be enabled.
				1784
				1785	CP sets
				1786	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
				1787	if the runtime has
				1788	installed a trap handler
				1789	regardless of the setting
				1790	of this field.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1791	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1792	system SGPR register for
				1793	the work-group id in the X
				1794	dimension (see
				1795	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1796
				1797	Used by CP to set up
				1798	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1799	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1800	system SGPR register for
				1801	the work-group id in the Y
				1802	dimension (see
				1803	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1804
				1805	Used by CP to set up
				1806	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1807	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1808	system SGPR register for
				1809	the work-group id in the Z
				1810	dimension (see
				1811	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1812
				1813	Used by CP to set up
				1814	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1815	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1816	system SGPR register for
				1817	work-group information (see
				1818	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1819
				1820	Used by CP to set up
				1821	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1822	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1823	VGPR system registers used
				1824	for the work-item ID.
				1825	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1826	defines the values.
				1827
				1828	Used by CP to set up
				1829	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1830	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1831
				1832	Wavefront starts execution
				1833	with address watch
				1834	exceptions enabled which
				1835	are generated when L1 has
				1836	witnessed a thread access
				1837	an *address of
				1838	interest*.
				1839
				1840	CP is responsible for
				1841	filling in the address
				1842	watch bit in
				1843	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1844	according to what the
				1845	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1846	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1847
				1848	Wavefront starts execution
				1849	with memory violation
				1850	exceptions exceptions
				1851	enabled which are generated
				1852	when a memory violation has
				1853	occurred for this wave from
				1854	L1 or LDS
				1855	(write-to-read-only-memory,
				1856	mis-aligned atomic, LDS
				1857	address out of range,
				1858	illegal address, etc.).
				1859
				1860	CP sets the memory
				1861	violation bit in
				1862	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1863	according to what the
				1864	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1865	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1866
				1867	CP uses the rounded value
				1868	from the dispatch packet,
				1869	not this value, as the
				1870	dispatch may contain
				1871	dynamically allocated group
				1872	segment memory. CP writes
				1873	directly to
				1874	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1875
				1876	Amount of group segment
				1877	(LDS) to allocate for each
				1878	work-group. Granularity is
				1879	device specific:
				1880
				1881	GFX6:
				1882	roundup(lds-size / (64 * 4))
				1883	GFX7-GFX9:
				1884	roundup(lds-size / (128 * 4))
				1885
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1886	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				1887	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1888	enabled.
				1889
				1890	Used by CP to set up
				1891	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1892	(set from bits 0..6).
				1893
				1894	IEEE 754 FP Invalid
				1895	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1896	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				1897	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1898	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1899	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				1900	_DIVISION_BY_ZERO Zero
				1901	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
				1902	_OVERFLOW
				1903	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				1904	_UNDERFLOW
				1905	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				1906	_INEXACT
				1907	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				1908	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1909	only)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1910	31 1 bit Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1911	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1912	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1913
				1914	..
				1915
				1916	.. table:: Floating Point Rounding Mode Enumeration Values
				1917	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1918
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1919	====================================== ===== ==============================
				1920	Enumeration Name Value Description
				1921	====================================== ===== ==============================
				1922	AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				1923	AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				1924	AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				1925	AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
				1926	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1927
				1928	..
				1929
				1930	.. table:: Floating Point Denorm Mode Enumeration Values
				1931	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				1932
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1933	====================================== ===== ==============================
				1934	Enumeration Name Value Description
				1935	====================================== ===== ==============================
				1936	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
				1937	Denorms
				1938	AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				1939	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				1940	AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
				1941	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1942
				1943	..
				1944
				1945	.. table:: System VGPR Work-Item ID Enumeration Values
				1946	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				1947
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1948	======================================== ===== ============================
				1949	Enumeration Name Value Description
				1950	======================================== ===== ============================
				1951	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
				1952	ID.
				1953	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				1954	dimensions ID.
				1955	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				1956	dimensions ID.
				1957	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				1958	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1959
				1960	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				1961
				1962	Initial Kernel Execution State
				1963	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1964
				1965	This section defines the register state that will be set up by the packet
				1966	processor prior to the start of execution of every wavefront. This is limited by
				1967	the constraints of the hardware controllers of CP/ADC/SPI.
				1968
				1969	The order of the SGPR registers is defined, but the compiler can specify which
				1970	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				1971	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				1972	for enabled registers are dense starting at SGPR0: the first enabled register is
				1973	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				1974	an SGPR number.
				1975
				1976	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
				1977	all waves of the grid. It is possible to specify more than 16 User SGPRs using
				1978	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				1979	initialized. These are then immediately followed by the System SGPRs that are
				1980	set up by ADC/SPI and can have different values for each wave of the grid
				1981	dispatch.
				1982
				1983	SGPR register initial state is defined in
				1984	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				1985
				1986	.. table:: SGPR Register Set Up Order
				1987	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				1988
				1989	========== ========================== ====== ==============================
				1990	SGPR Order Name Number Description
				1991	(kernel descriptor enable of
				1992	field) SGPRs
				1993	========== ========================== ====== ==============================
				1994	First Private Segment Buffer 4 V# that can be used, together
				1995	(enable_sgpr_private with Scratch Wave Offset as an
				1996	_segment_buffer) offset, to access the private
				1997	memory space using a segment
				1998	address.
				1999
				2000	CP uses the value provided by
				2001	the runtime.
				2002	then Dispatch Ptr 2 64 bit address of AQL dispatch
				2003	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				2004	actually executing.
				2005	then Queue Ptr 2 64 bit address of amd_queue_t
				2006	(enable_sgpr_queue_ptr) object for AQL queue on which
				2007	the dispatch packet was
				2008	queued.
				2009	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				2010	(enable_sgpr_kernarg segment. This is directly
				2011	_segment_ptr) copied from the
				2012	kernarg_address in the kernel
				2013	dispatch packet.
				2014
				2015	Having CP load it once avoids
				2016	loading it at the beginning of
				2017	every wavefront.
				2018	then Dispatch Id 2 64 bit Dispatch ID of the
				2019	(enable_sgpr_dispatch_id) dispatch packet being
				2020	executed.
				2021	then Flat Scratch Init 2 This is 2 SGPRs:
				2022	(enable_sgpr_flat_scratch
				2023	_init) GFX6
				2024	Not supported.
				2025	GFX7-GFX8
				2026	The first SGPR is a 32 bit
				2027	byte offset from
				2028	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2029	to per SPI base of memory
				2030	for scratch for the queue
				2031	executing the kernel
				2032	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2033	from the runtime. (The
				2034	Scratch Segment Buffer base
				2035	address is
				2036	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2037	plus this offset.) The value
				2038	of Scratch Wave Offset must
				2039	be added to this offset by
				2040	the kernel machine code,
				2041	right shifted by 8, and
				2042	moved to the FLAT_SCRATCH_HI
				2043	SGPR register.
				2044	FLAT_SCRATCH_HI corresponds
				2045	to SGPRn-4 on GFX7, and
				2046	SGPRn-6 on GFX8 (where SGPRn
				2047	is the highest numbered SGPR
				2048	allocated to the wave).
				2049	FLAT_SCRATCH_HI is
				2050	multiplied by 256 (as it is
				2051	in units of 256 bytes) and
				2052	added to
				2053	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2054	to calculate the per wave
				2055	FLAT SCRATCH BASE in flat
				2056	memory instructions that
				2057	access the scratch
				2058	apperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2059
				2060	The second SGPR is 32 bit
				2061	byte size of a single
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2062	work-item's scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2063	usage. CP obtains this from
				2064	the runtime, and it is
				2065	always a multiple of DWORD.
				2066	CP checks that the value in
				2067	the kernel dispatch packet
				2068	Private Segment Byte Size is
				2069	not larger, and requests the
				2070	runtime to increase the
				2071	queue's scratch size if
				2072	necessary. The kernel code
				2073	must move it to
				2074	FLAT_SCRATCH_LO which is
				2075	SGPRn-3 on GFX7 and SGPRn-5
				2076	on GFX8. FLAT_SCRATCH_LO is
				2077	used as the FLAT SCRATCH
				2078	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2079	instructions. Having CP load
				2080	it once avoids loading it at
				2081	the beginning of every
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2082	wavefront.
				2083	GFX9
				2084	This is the
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2085	64 bit base address of the
				2086	per SPI scratch backing
				2087	memory managed by SPI for
				2088	the queue executing the
				2089	kernel dispatch. CP obtains
				2090	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2091	divides it if there are
				2092	multiple Shader Arrays each
				2093	with its own SPI). The value
				2094	of Scratch Wave Offset must
				2095	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2096	machine code and the result
				2097	moved to the FLAT_SCRATCH
				2098	SGPR which is SGPRn-6 and
				2099	SGPRn-5. It is used as the
				2100	FLAT SCRATCH BASE in flat
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2101	memory instructions.
				2102	then Private Segment Size 1 The 32 bit byte size of a
				2103	(enable_sgpr_private single
				2104	work-item's
				2105	scratch_segment_size) memory
				2106	allocation. This is the
				2107	value from the kernel
				2108	dispatch packet Private
				2109	Segment Byte Size rounded up
				2110	by CP to a multiple of
				2111	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2112
				2113	Having CP load it once avoids
				2114	loading it at the beginning of
				2115	every wavefront.
				2116
				2117	This is not used for
				2118	GFX7-GFX8 since it is the same
				2119	value as the second SGPR of
				2120	Flat Scratch Init. However, it
				2121	may be needed for GFX9 which
				2122	changes the meaning of the
				2123	Flat Scratch Init value.
				2124	then Grid Work-Group Count X 1 32 bit count of the number of
				2125	(enable_sgpr_grid work-groups in the X dimension
				2126	_workgroup_count_X) for the grid being
				2127	executed. Computed from the
				2128	fields in the kernel dispatch
				2129	packet as ((grid_size.x +
				2130	workgroup_size.x - 1) /
				2131	workgroup_size.x).
				2132	then Grid Work-Group Count Y 1 32 bit count of the number of
				2133	(enable_sgpr_grid work-groups in the Y dimension
				2134	_workgroup_count_Y && for the grid being
				2135	less than 16 previous executed. Computed from the
				2136	SGPRs) fields in the kernel dispatch
				2137	packet as ((grid_size.y +
				2138	workgroup_size.y - 1) /
				2139	workgroupSize.y).
				2140
				2141	Only initialized if <16
				2142	previous SGPRs initialized.
				2143	then Grid Work-Group Count Z 1 32 bit count of the number of
				2144	(enable_sgpr_grid work-groups in the Z dimension
				2145	_workgroup_count_Z && for the grid being
				2146	less than 16 previous executed. Computed from the
				2147	SGPRs) fields in the kernel dispatch
				2148	packet as ((grid_size.z +
				2149	workgroup_size.z - 1) /
				2150	workgroupSize.z).
				2151
				2152	Only initialized if <16
				2153	previous SGPRs initialized.
				2154	then Work-Group Id X 1 32 bit work-group id in X
				2155	(enable_sgpr_workgroup_id dimension of grid for
				2156	_X) wavefront.
				2157	then Work-Group Id Y 1 32 bit work-group id in Y
				2158	(enable_sgpr_workgroup_id dimension of grid for
				2159	_Y) wavefront.
				2160	then Work-Group Id Z 1 32 bit work-group id in Z
				2161	(enable_sgpr_workgroup_id dimension of grid for
				2162	_Z) wavefront.
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2163	then Work-Group Info 1 {first_wave, 14'b0000,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2164	(enable_sgpr_workgroup ordered_append_term[10:0],
				2165	_info) threadgroup_size_in_waves[5:0]}
				2166	then Scratch Wave Offset 1 32 bit byte offset from base
				2167	(enable_sgpr_private of scratch base of queue
				2168	_segment_wave_offset) executing the kernel
				2169	dispatch. Must be used as an
				2170	offset with Private
				2171	segment address when using
				2172	Scratch Segment Buffer. It
				2173	must be used to set up FLAT
				2174	SCRATCH for flat addressing
				2175	(see
				2176	:ref:`amdgpu-amdhsa-flat-scratch`).
				2177	========== ========================== ====== ==============================
				2178
				2179	The order of the VGPR registers is defined, but the compiler can specify which
				2180	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2181	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2182	for enabled registers are dense starting at VGPR0: the first enabled register is
				2183	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2184	VGPR number.
				2185
				2186	VGPR register initial state is defined in
				2187	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2188
				2189	.. table:: VGPR Register Set Up Order
				2190	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2191
				2192	========== ========================== ====== ==============================
				2193	VGPR Order Name Number Description
				2194	(kernel descriptor enable of
				2195	field) VGPRs
				2196	========== ========================== ====== ==============================
				2197	First Work-Item Id X 1 32 bit work item id in X
				2198	(Always initialized) dimension of work-group for
				2199	wavefront lane.
				2200	then Work-Item Id Y 1 32 bit work item id in Y
				2201	(enable_vgpr_workitem_id dimension of work-group for
				2202	> 0) wavefront lane.
				2203	then Work-Item Id Z 1 32 bit work item id in Z
				2204	(enable_vgpr_workitem_id dimension of work-group for
				2205	> 1) wavefront lane.
				2206	========== ========================== ====== ==============================
				2207
				2208	The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
				2209
				2210	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2211	registers.
				2212	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2213	combination including none.
				2214	3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
				2215	cannot included with the flat scratch init value which is per queue.
				2216	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2217	or (X, Y, Z).
				2218
				2219	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2220	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2221
				2222	The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2223	has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2224	instructions (GFX9).
				2225
				2226	If buffer operations are used then the compiler can generate a V# with the
				2227	following properties:
				2228
				2229	* base address of 0
				2230	* no swizzle
				2231	* ATC: 1 if IOMMU present (such as APU)
				2232	* ptr64: 1
				2233	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2234	APU and NC for dGPU).
				2235
				2236	.. _amdgpu-amdhsa-kernel-prolog:
				2237
				2238	Kernel Prolog
				2239	~~~~~~~~~~~~~
				2240
				2241	.. _amdgpu-amdhsa-m0:
				2242
				2243	M0
				2244	++
				2245
				2246	GFX6-GFX8
				2247	The M0 register must be initialized with a value at least the total LDS size
				2248	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2249	available in dispatch packet. For M0, it is also possible to use maximum
				2250	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2251	GFX7-GFX8).
				2252	GFX9
				2253	The M0 register is not used for range checking LDS accesses and so does not
				2254	need to be initialized in the prolog.
				2255
				2256	.. _amdgpu-amdhsa-flat-scratch:
				2257
				2258	Flat Scratch
				2259	++++++++++++
				2260
				2261	If the kernel may use flat operations to access scratch memory, the prolog code
				2262	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
				2263	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
				2264	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2265
				2266	GFX6
				2267	Flat scratch is not supported.
				2268
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2269	GFX7-GFX8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2270	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2271	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2272	being managed by SPI for the queue executing the kernel dispatch. This is
				2273	the same value used in the Scratch Segment Buffer V# base address. The
				2274	prolog must add the value of Scratch Wave Offset to get the wave's byte
				2275	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2276	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2277	by 8 before moving into FLAT_SCRATCH_LO.
				2278	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2279	work-items scratch memory usage. This is directly loaded from the kernel
				2280	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2281	DWORD. Having CP load it once avoids loading it at the beginning of every
				2282	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2283	SIZE.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2284
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2285	GFX9
				2286	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2287	memory being managed by SPI for the queue executing the kernel dispatch. The
				2288	prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
				2289	pair for use as the flat scratch base in flat memory instructions.
				2290
				2291	.. _amdgpu-amdhsa-memory-model:
				2292
				2293	Memory Model
				2294	~~~~~~~~~~~~
				2295
				2296	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2297	(see :ref:`memmodel`). The implementation is WIP.
				2298
				2299	.. TODO
				2300	Update when implementation complete.
				2301
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2302	The AMDGPU backend supports the memory synchronization scopes specified in
				2303	:ref:`amdgpu-memory-scopes`.
				2304
				2305	The code sequences used to implement the memory model are defined in table
				2306	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2307
				2308	The sequences specify the order of instructions that a single thread must
				2309	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2310	to other memory instructions executed by the same thread. This allows them to be
				2311	moved earlier or later which can allow them to be combined with other instances
				2312	of the same instruction, or hoisted/sunk out of loops to improve
				2313	performance. Only the instructions related to the memory model are given;
				2314	additional ``s_waitcnt`` instructions are required to ensure registers are
				2315	defined before being used. These may be able to be combined with the memory
				2316	model ``s_waitcnt`` instructions as described above.
				2317
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2318	The AMDGPU backend supports the following memory models:
				2319
				2320	HSA Memory Model [HSA]_
				2321	The HSA memory model uses a single happens-before relation for all address
				2322	spaces (see :ref:`amdgpu-address-spaces`).
				2323	OpenCL Memory Model [OpenCL]_
				2324	The OpenCL memory model which has separate happens-before relations for the
				2325	global and local address spaces. Only a fence specifying both global and
				2326	local address space, and seq_cst instructions join the relationships. Since
				2327	the LLVM ``memfence`` instruction does not allow an address space to be
				2328	specified the OpenCL fence has to convervatively assume both local and
				2329	global address space was specified. However, optimizations can often be
				2330	done to eliminate the additional ``s_waitcnt`` instructions when there are
				2331	no intervening memory instructions which access the corresponding address
				2332	space. The code sequences in the table indicate what can be omitted for the
				2333	OpenCL memory. The target triple environment is used to determine if the
				2334	source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2335
				2336	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2337	operations.
				2338
				2339	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2340	termed vector memory operations.
				2341
				2342	For GFX6-GFX9:
				2343
				2344	* Each agent has multiple compute units (CU).
				2345	* Each CU has multiple SIMDs that execute wavefronts.
				2346	* The wavefronts for a single work-group are executed in the same CU but may be
				2347	executed by different SIMDs.
				2348	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2349	executing on it.
				2350	* All LDS operations of a CU are performed as wavefront wide operations in a
				2351	global order and involve no caching. Completion is reported to a wavefront in
				2352	execution order.
				2353	* The LDS memory has multiple request queues shared by the SIMDs of a
				2354	CU. Therefore, the LDS operations performed by different waves of a work-group
				2355	can be reordered relative to each other, which can result in reordering the
				2356	visibility of vector memory operations with respect to LDS operations of other
				2357	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2358	ensure synchronization between LDS operations and vector memory operations
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2359	between waves of a work-group, but not between operations performed by the
				2360	same wavefront.
				2361	* The vector memory operations are performed as wavefront wide operations and
				2362	completion is reported to a wavefront in execution order. The exception is
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2363	that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2364	vector memory order if they access LDS memory, and out of LDS operation order
				2365	if they access global memory.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2366	* The vector memory operations access a single vector L1 cache shared by all
				2367	SIMDs a CU. Therefore, no special action is required for coherence between the
				2368	lanes of a single wavefront, or for coherence between wavefronts in the same
				2369	work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
				2370	executing in different work-groups as they may be executing on different CUs.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2371	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2372	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2373	scalar operations are used in a restricted way so do not impact the memory
				2374	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2375	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2376	the same agent.
				2377	* The L2 cache has independent channels to service disjoint ranges of virtual
				2378	addresses.
				2379	* Each CU has a separate request queue per channel. Therefore, the vector and
				2380	scalar memory operations performed by waves executing in different work-groups
				2381	(which may be executing on different CUs) of an agent can be reordered
				2382	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2383	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2384	previous vector memory operation has completed before executing a subsequent
				2385	vector memory or LDS operation and so can be used to meet the requirements of
				2386	acquire and release.
				2387	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2388	of virtual addresses can be set up to bypass it to ensure system coherence.
				2389
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2390	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2391	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2392	memory, atomic memory orderings are not meaningful and all accesses are treated
				2393	as non-atomic.
				2394
				2395	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2396	scalar memory instructions). Since the constant address space contents do not
				2397	change during the execution of a kernel dispatch it is not legal to perform
				2398	stores, and atomic memory orderings are not meaningful and all access are
				2399	treated as non-atomic.
				2400
				2401	A memory synchronization scope wider than work-group is not meaningful for the
				2402	group (LDS) address space and is treated as work-group.
				2403
				2404	The memory model does not support the region address space which is treated as
				2405	non-atomic.
				2406
				2407	Acquire memory ordering is not meaningful on store atomic instructions and is
				2408	treated as non-atomic.
				2409
				2410	Release memory ordering is not meaningful on load atomic instructions and is
				2411	treated a non-atomic.
				2412
				2413	Acquire-release memory ordering is not meaningful on load or store atomic
				2414	instructions and is treated as acquire and release respectively.
				2415
				2416	AMDGPU backend only uses scalar memory operations to access memory that is
				2417	proven to not change during the execution of the kernel dispatch. This includes
				2418	constant address space and global address space for program scope const
				2419	variables. Therefore the kernel machine code does not have to maintain the
				2420	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2421	and vector L1 caches are invalidated between kernel dispatches by CP since
				2422	constant address space data may change between kernel dispatch executions. See
				2423	:ref:`amdgpu-amdhsa-memory-spaces`.
				2424
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2425	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2426	case the AMDGPU backend ensures the memory location used to spill is never
				2427	accessed by vector memory operations at the same time. If scalar writes are used
				2428	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2429	return since the locations may be used for vector memory instructions by a
				2430	future wave that uses the same scratch area, or a function call that creates a
				2431	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2432	as all scalar writes are write-before-read in the same thread.
				2433
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2434	Scratch backing memory (which is used for the private address space)
				2435	is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
				2436	address space is only accessed by a single thread, and is always
				2437	write-before-read, there is never a need to invalidate these entries from the L1
				2438	cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
				2439	volatile cache lines.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2440
				2441	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2442	to invalidate the L2 cache. This also causes it to be treated as
				2443	non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
				2444	(cache coherent) and so the L2 cache will coherent with the CPU and other
				2445	agents.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2446
				2447	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2448	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2449
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2450	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2451	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2452	Ordering Sync Scope Address
				2453	Space
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2454	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2455	Non-Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2456	-----------------------------------------------------------------------------------
				2457	load none none - global - !volatile & !nontemporal
				2458	- generic
				2459	- private 1. buffer/global/flat_load
				2460	- constant
				2461	- volatile & !nontemporal
				2462
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2463	1. buffer/global/flat_load
				2464	glc=1
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2465
				2466	- nontemporal
				2467
				2468	1. buffer/global/flat_load
				2469	glc=1 slc=1
				2470
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2471	load none none - local 1. ds_load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2472	store none none - global - !nontemporal
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2473	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2474	- private 1. buffer/global/flat_store
				2475	- constant
				2476	- nontemporal
				2477
				2478	1. buffer/global/flat_stote
				2479	glc=1 slc=1
				2480
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2481	store none none - local 1. ds_store
				2482	Unordered Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2483	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2484	load atomic unordered any any Same as non-atomic.
				2485	store atomic unordered any any Same as non-atomic.
				2486	atomicrmw unordered any any *Same as monotonic
				2487	atomic*.
				2488	Monotonic Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2489	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2490	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2491	- wavefront - generic
				2492	- workgroup
				2493	load atomic monotonic - singlethread - local 1. ds_load
				2494	- wavefront
				2495	- workgroup
				2496	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2497	- system - generic glc=1
				2498	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2499	- wavefront - generic
				2500	- workgroup
				2501	- agent
				2502	- system
				2503	store atomic monotonic - singlethread - local 1. ds_store
				2504	- wavefront
				2505	- workgroup
				2506	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2507	- wavefront - generic
				2508	- workgroup
				2509	- agent
				2510	- system
				2511	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2512	- wavefront
				2513	- workgroup
				2514	Acquire Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2515	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2516	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2517	- wavefront - local
				2518	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2519	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				2520	load atomic acquire - workgroup - local 1. ds_load
				2521	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2522
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2523	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2524	- Must happen before
				2525	any following
				2526	global/generic
				2527	load/load
				2528	atomic/store/store
				2529	atomic/atomicrmw.
				2530	- Ensures any
				2531	following global
				2532	data read is no
				2533	older than the load
				2534	atomic value being
				2535	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2536	load atomic acquire - workgroup - generic 1. flat_load
				2537	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2538
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2539	- If OpenCL, omit.
				2540	- Must happen before
				2541	any following
				2542	global/generic
				2543	load/load
				2544	atomic/store/store
				2545	atomic/atomicrmw.
				2546	- Ensures any
				2547	following global
				2548	data read is no
				2549	older than the load
				2550	atomic value being
				2551	acquired.
				2552	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2553	- system glc=1
				2554	2. s_waitcnt vmcnt(0)
				2555
				2556	- Must happen before
				2557	following
				2558	buffer_wbinvl1_vol.
				2559	- Ensures the load
				2560	has completed
				2561	before invalidating
				2562	the cache.
				2563
				2564	3. buffer_wbinvl1_vol
				2565
				2566	- Must happen before
				2567	any following
				2568	global/generic
				2569	load/load
				2570	atomic/atomicrmw.
				2571	- Ensures that
				2572	following
				2573	loads will not see
				2574	stale global data.
				2575
				2576	load atomic acquire - agent - generic 1. flat_load glc=1
				2577	- system 2. s_waitcnt vmcnt(0) &
				2578	lgkmcnt(0)
				2579
				2580	- If OpenCL omit
				2581	lgkmcnt(0).
				2582	- Must happen before
				2583	following
				2584	buffer_wbinvl1_vol.
				2585	- Ensures the flat_load
				2586	has completed
				2587	before invalidating
				2588	the cache.
				2589
				2590	3. buffer_wbinvl1_vol
				2591
				2592	- Must happen before
				2593	any following
				2594	global/generic
				2595	load/load
				2596	atomic/atomicrmw.
				2597	- Ensures that
				2598	following loads
				2599	will not see stale
				2600	global data.
				2601
				2602	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2603	- wavefront - local
				2604	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2605	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				2606	atomicrmw acquire - workgroup - local 1. ds_atomic
				2607	2. waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2608
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2609	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2610	- Must happen before
				2611	any following
				2612	global/generic
				2613	load/load
				2614	atomic/store/store
				2615	atomic/atomicrmw.
				2616	- Ensures any
				2617	following global
				2618	data read is no
				2619	older than the
				2620	atomicrmw value
				2621	being acquired.
				2622
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2623	atomicrmw acquire - workgroup - generic 1. flat_atomic
				2624	2. waitcnt lgkmcnt(0)
				2625
				2626	- If OpenCL, omit.
				2627	- Must happen before
				2628	any following
				2629	global/generic
				2630	load/load
				2631	atomic/store/store
				2632	atomic/atomicrmw.
				2633	- Ensures any
				2634	following global
				2635	data read is no
				2636	older than the
				2637	atomicrmw value
				2638	being acquired.
				2639
				2640	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2641	- system 2. s_waitcnt vmcnt(0)
				2642
				2643	- Must happen before
				2644	following
				2645	buffer_wbinvl1_vol.
				2646	- Ensures the
				2647	atomicrmw has
				2648	completed before
				2649	invalidating the
				2650	cache.
				2651
				2652	3. buffer_wbinvl1_vol
				2653
				2654	- Must happen before
				2655	any following
				2656	global/generic
				2657	load/load
				2658	atomic/atomicrmw.
				2659	- Ensures that
				2660	following loads
				2661	will not see stale
				2662	global data.
				2663
				2664	atomicrmw acquire - agent - generic 1. flat_atomic
				2665	- system 2. s_waitcnt vmcnt(0) &
				2666	lgkmcnt(0)
				2667
				2668	- If OpenCL, omit
				2669	lgkmcnt(0).
				2670	- Must happen before
				2671	following
				2672	buffer_wbinvl1_vol.
				2673	- Ensures the
				2674	atomicrmw has
				2675	completed before
				2676	invalidating the
				2677	cache.
				2678
				2679	3. buffer_wbinvl1_vol
				2680
				2681	- Must happen before
				2682	any following
				2683	global/generic
				2684	load/load
				2685	atomic/atomicrmw.
				2686	- Ensures that
				2687	following loads
				2688	will not see stale
				2689	global data.
				2690
				2691	fence acquire - singlethread none none
				2692	- wavefront
				2693	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2694
				2695	- If OpenCL and
				2696	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2697	not generic, omit.
				2698	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2699	currently has no
				2700	address space on
				2701	the fence need to
				2702	conservatively
				2703	always generate. If
				2704	fence had an
				2705	address space then
				2706	set to address
				2707	space of OpenCL
				2708	fence flag, or to
				2709	generic if both
				2710	local and global
				2711	flags are
				2712	specified.
				2713	- Must happen after
				2714	any preceding
				2715	local/generic load
				2716	atomic/atomicrmw
				2717	with an equal or
				2718	wider sync scope
				2719	and memory ordering
				2720	stronger than
				2721	unordered (this is
				2722	termed the
				2723	fence-paired-atomic).
				2724	- Must happen before
				2725	any following
				2726	global/generic
				2727	load/load
				2728	atomic/store/store
				2729	atomic/atomicrmw.
				2730	- Ensures any
				2731	following global
				2732	data read is no
				2733	older than the
				2734	value read by the
				2735	fence-paired-atomic.
				2736
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2737	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				2738	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2739
				2740	- If OpenCL and
				2741	address space is
				2742	not generic, omit
				2743	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2744	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2745	currently has no
				2746	address space on
				2747	the fence need to
				2748	conservatively
				2749	always generate
				2750	(see comment for
				2751	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2752	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2753	separate s_waitcnt
				2754	vmcnt(0) and
				2755	s_waitcnt
				2756	lgkmcnt(0) to allow
				2757	them to be
				2758	independently moved
				2759	according to the
				2760	following rules.
				2761	- s_waitcnt vmcnt(0)
				2762	must happen after
				2763	any preceding
				2764	global/generic load
				2765	atomic/atomicrmw
				2766	with an equal or
				2767	wider sync scope
				2768	and memory ordering
				2769	stronger than
				2770	unordered (this is
				2771	termed the
				2772	fence-paired-atomic).
				2773	- s_waitcnt lgkmcnt(0)
				2774	must happen after
				2775	any preceding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2776	local/generic load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2777	atomic/atomicrmw
				2778	with an equal or
				2779	wider sync scope
				2780	and memory ordering
				2781	stronger than
				2782	unordered (this is
				2783	termed the
				2784	fence-paired-atomic).
				2785	- Must happen before
				2786	the following
				2787	buffer_wbinvl1_vol.
				2788	- Ensures that the
				2789	fence-paired atomic
				2790	has completed
				2791	before invalidating
				2792	the
				2793	cache. Therefore
				2794	any following
				2795	locations read must
				2796	be no older than
				2797	the value read by
				2798	the
				2799	fence-paired-atomic.
				2800
				2801	2. buffer_wbinvl1_vol
				2802
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2803	- Must happen before any
				2804	following global/generic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2805	load/load
				2806	atomic/store/store
				2807	atomic/atomicrmw.
				2808	- Ensures that
				2809	following loads
				2810	will not see stale
				2811	global data.
				2812
				2813	Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2814	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2815	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2816	- wavefront - local
				2817	- generic
				2818	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2819
				2820	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2821	- Must happen after
				2822	any preceding
				2823	local/generic
				2824	load/store/load
				2825	atomic/store
				2826	atomic/atomicrmw.
				2827	- Must happen before
				2828	the following
				2829	store.
				2830	- Ensures that all
				2831	memory operations
				2832	to local have
				2833	completed before
				2834	performing the
				2835	store that is being
				2836	released.
				2837
				2838	2. buffer/global/flat_store
				2839	store atomic release - workgroup - local 1. ds_store
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2840	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2841
				2842	- If OpenCL, omit.
				2843	- Must happen after
				2844	any preceding
				2845	local/generic
				2846	load/store/load
				2847	atomic/store
				2848	atomic/atomicrmw.
				2849	- Must happen before
				2850	the following
				2851	store.
				2852	- Ensures that all
				2853	memory operations
				2854	to local have
				2855	completed before
				2856	performing the
				2857	store that is being
				2858	released.
				2859
				2860	2. flat_store
				2861	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2862	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2863
				2864	- If OpenCL, omit
				2865	lgkmcnt(0).
				2866	- Could be split into
				2867	separate s_waitcnt
				2868	vmcnt(0) and
				2869	s_waitcnt
				2870	lgkmcnt(0) to allow
				2871	them to be
				2872	independently moved
				2873	according to the
				2874	following rules.
				2875	- s_waitcnt vmcnt(0)
				2876	must happen after
				2877	any preceding
				2878	global/generic
				2879	load/store/load
				2880	atomic/store
				2881	atomic/atomicrmw.
				2882	- s_waitcnt lgkmcnt(0)
				2883	must happen after
				2884	any preceding
				2885	local/generic
				2886	load/store/load
				2887	atomic/store
				2888	atomic/atomicrmw.
				2889	- Must happen before
				2890	the following
				2891	store.
				2892	- Ensures that all
				2893	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2894	to memory have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2895	completed before
				2896	performing the
				2897	store that is being
				2898	released.
				2899
				2900	2. buffer/global/ds/flat_store
				2901	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2902	- wavefront - local
				2903	- generic
				2904	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2905
				2906	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2907	- Must happen after
				2908	any preceding
				2909	local/generic
				2910	load/store/load
				2911	atomic/store
				2912	atomic/atomicrmw.
				2913	- Must happen before
				2914	the following
				2915	atomicrmw.
				2916	- Ensures that all
				2917	memory operations
				2918	to local have
				2919	completed before
				2920	performing the
				2921	atomicrmw that is
				2922	being released.
				2923
				2924	2. buffer/global/flat_atomic
				2925	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2926	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2927
				2928	- If OpenCL, omit.
				2929	- Must happen after
				2930	any preceding
				2931	local/generic
				2932	load/store/load
				2933	atomic/store
				2934	atomic/atomicrmw.
				2935	- Must happen before
				2936	the following
				2937	atomicrmw.
				2938	- Ensures that all
				2939	memory operations
				2940	to local have
				2941	completed before
				2942	performing the
				2943	atomicrmw that is
				2944	being released.
				2945
				2946	2. flat_atomic
				2947	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2948	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2949
				2950	- If OpenCL, omit
				2951	lgkmcnt(0).
				2952	- Could be split into
				2953	separate s_waitcnt
				2954	vmcnt(0) and
				2955	s_waitcnt
				2956	lgkmcnt(0) to allow
				2957	them to be
				2958	independently moved
				2959	according to the
				2960	following rules.
				2961	- s_waitcnt vmcnt(0)
				2962	must happen after
				2963	any preceding
				2964	global/generic
				2965	load/store/load
				2966	atomic/store
				2967	atomic/atomicrmw.
				2968	- s_waitcnt lgkmcnt(0)
				2969	must happen after
				2970	any preceding
				2971	local/generic
				2972	load/store/load
				2973	atomic/store
				2974	atomic/atomicrmw.
				2975	- Must happen before
				2976	the following
				2977	atomicrmw.
				2978	- Ensures that all
				2979	memory operations
				2980	to global and local
				2981	have completed
				2982	before performing
				2983	the atomicrmw that
				2984	is being released.
				2985
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2986	2. buffer/global/ds/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2987	fence release - singlethread none none
				2988	- wavefront
				2989	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				2990
				2991	- If OpenCL and
				2992	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2993	not generic, omit.
				2994	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2995	currently has no
				2996	address space on
				2997	the fence need to
				2998	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2999	always generate. If
				3000	fence had an
				3001	address space then
				3002	set to address
				3003	space of OpenCL
				3004	fence flag, or to
				3005	generic if both
				3006	local and global
				3007	flags are
				3008	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3009	- Must happen after
				3010	any preceding
				3011	local/generic
				3012	load/load
				3013	atomic/store/store
				3014	atomic/atomicrmw.
				3015	- Must happen before
				3016	any following store
				3017	atomic/atomicrmw
				3018	with an equal or
				3019	wider sync scope
				3020	and memory ordering
				3021	stronger than
				3022	unordered (this is
				3023	termed the
				3024	fence-paired-atomic).
				3025	- Ensures that all
				3026	memory operations
				3027	to local have
				3028	completed before
				3029	performing the
				3030	following
				3031	fence-paired-atomic.
				3032
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3033	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				3034	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3035
				3036	- If OpenCL and
				3037	address space is
				3038	not generic, omit
				3039	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3040	- If OpenCL and
				3041	address space is
				3042	local, omit
				3043	vmcnt(0).
				3044	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3045	currently has no
				3046	address space on
				3047	the fence need to
				3048	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3049	always generate. If
				3050	fence had an
				3051	address space then
				3052	set to address
				3053	space of OpenCL
				3054	fence flag, or to
				3055	generic if both
				3056	local and global
				3057	flags are
				3058	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3059	- Could be split into
				3060	separate s_waitcnt
				3061	vmcnt(0) and
				3062	s_waitcnt
				3063	lgkmcnt(0) to allow
				3064	them to be
				3065	independently moved
				3066	according to the
				3067	following rules.
				3068	- s_waitcnt vmcnt(0)
				3069	must happen after
				3070	any preceding
				3071	global/generic
				3072	load/store/load
				3073	atomic/store
				3074	atomic/atomicrmw.
				3075	- s_waitcnt lgkmcnt(0)
				3076	must happen after
				3077	any preceding
				3078	local/generic
				3079	load/store/load
				3080	atomic/store
				3081	atomic/atomicrmw.
				3082	- Must happen before
				3083	any following store
				3084	atomic/atomicrmw
				3085	with an equal or
				3086	wider sync scope
				3087	and memory ordering
				3088	stronger than
				3089	unordered (this is
				3090	termed the
				3091	fence-paired-atomic).
				3092	- Ensures that all
				3093	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3094	have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3095	completed before
				3096	performing the
				3097	following
				3098	fence-paired-atomic.
				3099
				3100	Acquire-Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3101	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3102	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3103	- wavefront - local
				3104	- generic
				3105	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3106
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3107	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3108	- Must happen after
				3109	any preceding
				3110	local/generic
				3111	load/store/load
				3112	atomic/store
				3113	atomic/atomicrmw.
				3114	- Must happen before
				3115	the following
				3116	atomicrmw.
				3117	- Ensures that all
				3118	memory operations
				3119	to local have
				3120	completed before
				3121	performing the
				3122	atomicrmw that is
				3123	being released.
				3124
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3125	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3126	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3127	2. s_waitcnt lgkmcnt(0)
				3128
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3129	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3130	- Must happen before
				3131	any following
				3132	global/generic
				3133	load/load
				3134	atomic/store/store
				3135	atomic/atomicrmw.
				3136	- Ensures any
				3137	following global
				3138	data read is no
				3139	older than the load
				3140	atomic value being
				3141	acquired.
				3142
				3143	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3144
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3145	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3146	- Must happen after
				3147	any preceding
				3148	local/generic
				3149	load/store/load
				3150	atomic/store
				3151	atomic/atomicrmw.
				3152	- Must happen before
				3153	the following
				3154	atomicrmw.
				3155	- Ensures that all
				3156	memory operations
				3157	to local have
				3158	completed before
				3159	performing the
				3160	atomicrmw that is
				3161	being released.
				3162
				3163	2. flat_atomic
				3164	3. s_waitcnt lgkmcnt(0)
				3165
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3166	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3167	- Must happen before
				3168	any following
				3169	global/generic
				3170	load/load
				3171	atomic/store/store
				3172	atomic/atomicrmw.
				3173	- Ensures any
				3174	following global
				3175	data read is no
				3176	older than the load
				3177	atomic value being
				3178	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3179
				3180	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3181	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3182
				3183	- If OpenCL, omit
				3184	lgkmcnt(0).
				3185	- Could be split into
				3186	separate s_waitcnt
				3187	vmcnt(0) and
				3188	s_waitcnt
				3189	lgkmcnt(0) to allow
				3190	them to be
				3191	independently moved
				3192	according to the
				3193	following rules.
				3194	- s_waitcnt vmcnt(0)
				3195	must happen after
				3196	any preceding
				3197	global/generic
				3198	load/store/load
				3199	atomic/store
				3200	atomic/atomicrmw.
				3201	- s_waitcnt lgkmcnt(0)
				3202	must happen after
				3203	any preceding
				3204	local/generic
				3205	load/store/load
				3206	atomic/store
				3207	atomic/atomicrmw.
				3208	- Must happen before
				3209	the following
				3210	atomicrmw.
				3211	- Ensures that all
				3212	memory operations
				3213	to global have
				3214	completed before
				3215	performing the
				3216	atomicrmw that is
				3217	being released.
				3218
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3219	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3220	3. s_waitcnt vmcnt(0)
				3221
				3222	- Must happen before
				3223	following
				3224	buffer_wbinvl1_vol.
				3225	- Ensures the
				3226	atomicrmw has
				3227	completed before
				3228	invalidating the
				3229	cache.
				3230
				3231	4. buffer_wbinvl1_vol
				3232
				3233	- Must happen before
				3234	any following
				3235	global/generic
				3236	load/load
				3237	atomic/atomicrmw.
				3238	- Ensures that
				3239	following loads
				3240	will not see stale
				3241	global data.
				3242
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3243	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3244	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3245
				3246	- If OpenCL, omit
				3247	lgkmcnt(0).
				3248	- Could be split into
				3249	separate s_waitcnt
				3250	vmcnt(0) and
				3251	s_waitcnt
				3252	lgkmcnt(0) to allow
				3253	them to be
				3254	independently moved
				3255	according to the
				3256	following rules.
				3257	- s_waitcnt vmcnt(0)
				3258	must happen after
				3259	any preceding
				3260	global/generic
				3261	load/store/load
				3262	atomic/store
				3263	atomic/atomicrmw.
				3264	- s_waitcnt lgkmcnt(0)
				3265	must happen after
				3266	any preceding
				3267	local/generic
				3268	load/store/load
				3269	atomic/store
				3270	atomic/atomicrmw.
				3271	- Must happen before
				3272	the following
				3273	atomicrmw.
				3274	- Ensures that all
				3275	memory operations
				3276	to global have
				3277	completed before
				3278	performing the
				3279	atomicrmw that is
				3280	being released.
				3281
				3282	2. flat_atomic
				3283	3. s_waitcnt vmcnt(0) &
				3284	lgkmcnt(0)
				3285
				3286	- If OpenCL, omit
				3287	lgkmcnt(0).
				3288	- Must happen before
				3289	following
				3290	buffer_wbinvl1_vol.
				3291	- Ensures the
				3292	atomicrmw has
				3293	completed before
				3294	invalidating the
				3295	cache.
				3296
				3297	4. buffer_wbinvl1_vol
				3298
				3299	- Must happen before
				3300	any following
				3301	global/generic
				3302	load/load
				3303	atomic/atomicrmw.
				3304	- Ensures that
				3305	following loads
				3306	will not see stale
				3307	global data.
				3308
				3309	fence acq_rel - singlethread none none
				3310	- wavefront
				3311	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3312
				3313	- If OpenCL and
				3314	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3315	not generic, omit.
				3316	- However,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3317	since LLVM
				3318	currently has no
				3319	address space on
				3320	the fence need to
				3321	conservatively
				3322	always generate
				3323	(see comment for
				3324	previous fence).
				3325	- Must happen after
				3326	any preceding
				3327	local/generic
				3328	load/load
				3329	atomic/store/store
				3330	atomic/atomicrmw.
				3331	- Must happen before
				3332	any following
				3333	global/generic
				3334	load/load
				3335	atomic/store/store
				3336	atomic/atomicrmw.
				3337	- Ensures that all
				3338	memory operations
				3339	to local have
				3340	completed before
				3341	performing any
				3342	following global
				3343	memory operations.
				3344	- Ensures that the
				3345	preceding
				3346	local/generic load
				3347	atomic/atomicrmw
				3348	with an equal or
				3349	wider sync scope
				3350	and memory ordering
				3351	stronger than
				3352	unordered (this is
				3353	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3354	acquire-fence-paired-atomic
				3355	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3356	before following
				3357	global memory
				3358	operations. This
				3359	satisfies the
				3360	requirements of
				3361	acquire.
				3362	- Ensures that all
				3363	previous memory
				3364	operations have
				3365	completed before a
				3366	following
				3367	local/generic store
				3368	atomic/atomicrmw
				3369	with an equal or
				3370	wider sync scope
				3371	and memory ordering
				3372	stronger than
				3373	unordered (this is
				3374	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3375	release-fence-paired-atomic
				3376	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3377	requirements of
				3378	release.
				3379
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3380	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				3381	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3382
				3383	- If OpenCL and
				3384	address space is
				3385	not generic, omit
				3386	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3387	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3388	currently has no
				3389	address space on
				3390	the fence need to
				3391	conservatively
				3392	always generate
				3393	(see comment for
				3394	previous fence).
				3395	- Could be split into
				3396	separate s_waitcnt
				3397	vmcnt(0) and
				3398	s_waitcnt
				3399	lgkmcnt(0) to allow
				3400	them to be
				3401	independently moved
				3402	according to the
				3403	following rules.
				3404	- s_waitcnt vmcnt(0)
				3405	must happen after
				3406	any preceding
				3407	global/generic
				3408	load/store/load
				3409	atomic/store
				3410	atomic/atomicrmw.
				3411	- s_waitcnt lgkmcnt(0)
				3412	must happen after
				3413	any preceding
				3414	local/generic
				3415	load/store/load
				3416	atomic/store
				3417	atomic/atomicrmw.
				3418	- Must happen before
				3419	the following
				3420	buffer_wbinvl1_vol.
				3421	- Ensures that the
				3422	preceding
				3423	global/local/generic
				3424	load
				3425	atomic/atomicrmw
				3426	with an equal or
				3427	wider sync scope
				3428	and memory ordering
				3429	stronger than
				3430	unordered (this is
				3431	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3432	acquire-fence-paired-atomic
				3433	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3434	before invalidating
				3435	the cache. This
				3436	satisfies the
				3437	requirements of
				3438	acquire.
				3439	- Ensures that all
				3440	previous memory
				3441	operations have
				3442	completed before a
				3443	following
				3444	global/local/generic
				3445	store
				3446	atomic/atomicrmw
				3447	with an equal or
				3448	wider sync scope
				3449	and memory ordering
				3450	stronger than
				3451	unordered (this is
				3452	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3453	release-fence-paired-atomic
				3454	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3455	requirements of
				3456	release.
				3457
				3458	2. buffer_wbinvl1_vol
				3459
				3460	- Must happen before
				3461	any following
				3462	global/generic
				3463	load/load
				3464	atomic/store/store
				3465	atomic/atomicrmw.
				3466	- Ensures that
				3467	following loads
				3468	will not see stale
				3469	global data. This
				3470	satisfies the
				3471	requirements of
				3472	acquire.
				3473
				3474	Sequential Consistent Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3475	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3476	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3477	- wavefront - local load atomic acquire,
				3478	- generic except must generated
				3479	all instructions even
				3480	for OpenCL.*
				3481	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3482	- generic
				3483	- Must
				3484	happen after
				3485	preceding
				3486	global/generic load
				3487	atomic/store
				3488	atomic/atomicrmw
				3489	with memory
				3490	ordering of seq_cst
				3491	and with equal or
				3492	wider sync scope.
				3493	(Note that seq_cst
				3494	fences have their
				3495	own s_waitcnt
				3496	lgkmcnt(0) and so do
				3497	not need to be
				3498	considered.)
				3499	- Ensures any
				3500	preceding
				3501	sequential
				3502	consistent local
				3503	memory instructions
				3504	have completed
				3505	before executing
				3506	this sequentially
				3507	consistent
				3508	instruction. This
				3509	prevents reordering
				3510	a seq_cst store
				3511	followed by a
				3512	seq_cst load. (Note
				3513	that seq_cst is
				3514	stronger than
				3515	acquire/release as
				3516	the reordering of
				3517	load acquire
				3518	followed by a store
				3519	release is
				3520	prevented by the
				3521	waitcnt of
				3522	the release, but
				3523	there is nothing
				3524	preventing a store
				3525	release followed by
				3526	load acquire from
				3527	competing out of
				3528	order.)
				3529
				3530	2. *Following
				3531	instructions same as
				3532	corresponding load
				3533	atomic acquire,
				3534	except must generated
				3535	all instructions even
				3536	for OpenCL.*
				3537	load atomic seq_cst - workgroup - local *Same as corresponding
				3538	load atomic acquire,
				3539	except must generated
				3540	all instructions even
				3541	for OpenCL.*
				3542	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				3543	- system - generic vmcnt(0)
				3544
				3545	- Could be split into
				3546	separate s_waitcnt
				3547	vmcnt(0)
				3548	and s_waitcnt
				3549	lgkmcnt(0) to allow
				3550	them to be
				3551	independently moved
				3552	according to the
				3553	following rules.
				3554	- waitcnt lgkmcnt(0)
				3555	must happen after
				3556	preceding
				3557	global/generic load
				3558	atomic/store
				3559	atomic/atomicrmw
				3560	with memory
				3561	ordering of seq_cst
				3562	and with equal or
				3563	wider sync scope.
				3564	(Note that seq_cst
				3565	fences have their
				3566	own s_waitcnt
				3567	lgkmcnt(0) and so do
				3568	not need to be
				3569	considered.)
				3570	- waitcnt vmcnt(0)
				3571	must happen after
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3572	preceding
				3573	global/generic load
				3574	atomic/store
				3575	atomic/atomicrmw
				3576	with memory
				3577	ordering of seq_cst
				3578	and with equal or
				3579	wider sync scope.
				3580	(Note that seq_cst
				3581	fences have their
				3582	own s_waitcnt
				3583	vmcnt(0) and so do
				3584	not need to be
				3585	considered.)
				3586	- Ensures any
				3587	preceding
				3588	sequential
				3589	consistent global
				3590	memory instructions
				3591	have completed
				3592	before executing
				3593	this sequentially
				3594	consistent
				3595	instruction. This
				3596	prevents reordering
				3597	a seq_cst store
				3598	followed by a
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3599	seq_cst load. (Note
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3600	that seq_cst is
				3601	stronger than
				3602	acquire/release as
				3603	the reordering of
				3604	load acquire
				3605	followed by a store
				3606	release is
				3607	prevented by the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3608	waitcnt of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3609	the release, but
				3610	there is nothing
				3611	preventing a store
				3612	release followed by
				3613	load acquire from
				3614	competing out of
				3615	order.)
				3616
				3617	2. *Following
				3618	instructions same as
				3619	corresponding load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3620	atomic acquire,
				3621	except must generated
				3622	all instructions even
				3623	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3624	store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3625	- wavefront - local store atomic release,
				3626	- workgroup - generic except must generated
				3627	all instructions even
				3628	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3629	store atomic seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3630	- system - generic store atomic release,
				3631	except must generated
				3632	all instructions even
				3633	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3634	atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3635	- wavefront - local atomicrmw acq_rel,
				3636	- workgroup - generic except must generated
				3637	all instructions even
				3638	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3639	atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3640	- system - generic atomicrmw acq_rel,
				3641	except must generated
				3642	all instructions even
				3643	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3644	fence seq_cst - singlethread none *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3645	- wavefront fence acq_rel,
				3646	- workgroup except must generated
				3647	- agent all instructions even
				3648	- system for OpenCL.*
				3649	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3650
				3651	The memory order also adds the single thread optimization constrains defined in
				3652	table
				3653	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3654
				3655	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3656	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3657
				3658	============ ==============================================================
				3659	LLVM Memory Optimization Constraints
				3660	Ordering
				3661	============ ==============================================================
				3662	unordered none
				3663	monotonic none
				3664	acquire - If a load atomic/atomicrmw then no following load/load
				3665	atomic/store/ store atomic/atomicrmw/fence instruction can
				3666	be moved before the acquire.
				3667	- If a fence then same as load atomic, plus no preceding
				3668	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3669	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3670	atomic/store/ store atomic/atomicrmw/fence instruction can
				3671	be moved after the release.
				3672	- If a fence then same as store atomic, plus no following
				3673	associated fence-paired-atomic can be moved before the
				3674	fence.
				3675	acq_rel Same constraints as both acquire and release.
				3676	seq_cst - If a load atomic then same constraints as acquire, plus no
				3677	preceding sequentially consistent load atomic/store
				3678	atomic/atomicrmw/fence instruction can be moved after the
				3679	seq_cst.
				3680	- If a store atomic then the same constraints as release, plus
				3681	no following sequentially consistent load atomic/store
				3682	atomic/atomicrmw/fence instruction can be moved before the
				3683	seq_cst.
				3684	- If an atomicrmw/fence then same constraints as acq_rel.
				3685	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3686
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3687	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3688	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3689
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3690	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3691	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3692	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3693
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3694	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3695	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3696
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3697	=================== =============== =============== =======================
				3698	Usage Code Sequence Trap Handler Description
				3699	Inputs
				3700	=================== =============== =============== =======================
				3701	reserved ``s_trap 0x00`` Reserved by hardware.
				3702	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3703	``queue_ptr`` ``debugtrap``
				3704	``VGPR0``: intrinsic (not
				3705	``arg`` implemented).
				3706	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3707	``queue_ptr`` terminated and its
				3708	associated queue put
				3709	into the error state.
				3710	``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
				3711	``queue_ptr`` installed handled
				3712	same as ``llvm.trap``.
				3713	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
				3714	breakpoints.
				3715	debugger ``s_trap 0x08`` Reserved for debugger.
				3716	debugger ``s_trap 0xfe`` Reserved for debugger.
				3717	debugger ``s_trap 0xff`` Reserved for debugger.
				3718	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3719
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3720	Unspecified OS
				3721	--------------
				3722
				3723	This section provides code conventions used when the target triple OS is
				3724	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3725
				3726	Trap Handler ABI
				3727	~~~~~~~~~~~~~~~~
				3728
				3729	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3730	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3731	instructions are handled as follows:
				3732
				3733	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3734	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3735
				3736	=============== =============== ===========================================
				3737	Usage Code Sequence Description
				3738	=============== =============== ===========================================
				3739	llvm.trap s_endpgm Causes wavefront to be terminated.
				3740	llvm.debugtrap none Compiler warning given that there is no
				3741	trap handler installed.
				3742	=============== =============== ===========================================
				3743
				3744	Source Languages
				3745	================
				3746
				3747	.. _amdgpu-opencl:
				3748
				3749	OpenCL
				3750	------
				3751
				3752	When generating code for the OpenCL language the target triple environment
				3753	should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
				3754
				3755	When the language is OpenCL the following differences occur:
				3756
				3757	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3758	2. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3759	3. Additional metadata is generated
				3760	(:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3761
				3762	.. TODO
				3763	Specify what affect this has. Hidden arguments added. Additional metadata
				3764	generated.
				3765
				3766	.. _amdgpu-hcc:
				3767
				3768	HCC
				3769	---
				3770
				3771	When generating code for the OpenCL language the target triple environment
				3772	should be ``hcc`` (see :ref:`amdgpu-target-triples`).
				3773
				3774	When the language is OpenCL the following differences occur:
				3775
				3776	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3777
				3778	.. TODO
				3779	Specify what affect this has.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	3780
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3781	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3782	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3783
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3784	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	3785	It supports AMDGCN GFX6-GFX9.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3786
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3787	This section describes general syntax for instructions and operands. For more
				3788	information about instructions, their semantics and supported combinations of
				3789	operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	3790	[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3791
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3792	An instruction has the following syntax (register operands are normally
				3793	comma-separated while extra operands are space-separated):
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3794
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3795	<opcode> <register_operand0>, ... <extra_operand0> ...
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3796
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3797	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3798	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3799
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3800	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3801
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3802	* SGPR registers: s0, ... or s[0], ...
				3803	* VGPR registers: v0, ... or v[0], ...
				3804	* TTMP registers: ttmp0, ... or ttmp[0], ...
				3805	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				3806	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				3807	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				3808	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				3809	* Register index expressions: v[2*2], s[1-1:2-1]
				3810	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3811
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3812	The following extra operands are supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3813
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3814	* offset, offset0, offset1
				3815	* idxen, offen bits
				3816	* glc, slc, tfe bits
				3817	* waitcnt: integer or combination of counter values
				3818	* VOP3 modifiers:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3819
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3820	- abs (\\| \\|), neg (\-)
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3821
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3822	* DPP modifiers:
				3823
				3824	- row_shl, row_shr, row_ror, row_rol
				3825	- row_mirror, row_half_mirror, row_bcast
				3826	- wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
				3827	- row_mask, bank_mask, bound_ctrl
				3828
				3829	* SDWA modifiers:
				3830
				3831	- dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
				3832	- dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
				3833	- abs, neg, sext
				3834
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3835	Instruction Examples
				3836	~~~~~~~~~~~~~~~~~~~~
				3837
				3838	DS
				3839	~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3840
				3841	.. code-block:: nasm
				3842
				3843	ds_add_u32 v2, v4 offset:16
				3844	ds_write_src2_b64 v2 offset0:4 offset1:8
				3845	ds_cmpst_f32 v2, v4, v6
				3846	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				3847
				3848
				3849	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				3850
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3851	FLAT
				3852	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3853
				3854	.. code-block:: nasm
				3855
				3856	flat_load_dword v1, v[3:4]
				3857	flat_store_dwordx3 v[3:4], v[5:7]
				3858	flat_atomic_swap v1, v[3:4], v5 glc
				3859	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				3860	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				3861
				3862	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				3863
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3864	MUBUF
				3865	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3866
				3867	.. code-block:: nasm
				3868
				3869	buffer_load_dword v1, off, s[4:7], s1
				3870	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				3871	buffer_store_format_xy v[1:2], off, s[4:7], s1
				3872	buffer_wbinvl1
				3873	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				3874
				3875	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				3876
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3877	SMRD/SMEM
				3878	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3879
				3880	.. code-block:: nasm
				3881
				3882	s_load_dword s1, s[2:3], 0xfc
				3883	s_load_dwordx8 s[8:15], s[2:3], s4
				3884	s_load_dwordx16 s[88:103], s[2:3], s4
				3885	s_dcache_inv_vol
				3886	s_memtime s[4:5]
				3887
				3888	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				3889
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3890	SOP1
				3891	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3892
				3893	.. code-block:: nasm
				3894
				3895	s_mov_b32 s1, s2
				3896	s_mov_b64 s[0:1], 0x80000000
				3897	s_cmov_b32 s1, 200
				3898	s_wqm_b64 s[2:3], s[4:5]
				3899	s_bcnt0_i32_b64 s1, s[2:3]
				3900	s_swappc_b64 s[2:3], s[4:5]
				3901	s_cbranch_join s[4:5]
				3902
				3903	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				3904
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3905	SOP2
				3906	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3907
				3908	.. code-block:: nasm
				3909
				3910	s_add_u32 s1, s2, s3
				3911	s_and_b64 s[2:3], s[4:5], s[6:7]
				3912	s_cselect_b32 s1, s2, s3
				3913	s_andn2_b32 s2, s4, s6
				3914	s_lshr_b64 s[2:3], s[4:5], s6
				3915	s_ashr_i32 s2, s4, s6
				3916	s_bfm_b64 s[2:3], s4, s6
				3917	s_bfe_i64 s[2:3], s[4:5], s6
				3918	s_cbranch_g_fork s[4:5], s[6:7]
				3919
				3920	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				3921
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3922	SOPC
				3923	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3924
				3925	.. code-block:: nasm
				3926
				3927	s_cmp_eq_i32 s1, s2
				3928	s_bitcmp1_b32 s1, s2
				3929	s_bitcmp0_b64 s[2:3], s4
				3930	s_setvskip s3, s5
				3931
				3932	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				3933
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3934	SOPP
				3935	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3936
				3937	.. code-block:: nasm
				3938
				3939	s_barrier
				3940	s_nop 2
				3941	s_endpgm
				3942	s_waitcnt 0 ; Wait for all counters to be 0
				3943	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				3944	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				3945	s_sethalt 9
				3946	s_sleep 10
				3947	s_sendmsg 0x1
				3948	s_sendmsg sendmsg(MSG_INTERRUPT)
				3949	s_trap 1
				3950
				3951	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				3952
				3953	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	3954	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3955	range or acceptable values.
				3956
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3957	VALU
				3958	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3959
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3960	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				3961	the assembler will automatically use optimal encoding based on its operands.
				3962	To force specific encoding, one can add a suffix to the opcode of the instruction:
				3963
				3964	* _e32 for 32-bit VOP1/VOP2/VOPC
				3965	* _e64 for 64-bit VOP3
				3966	* _dpp for VOP_DPP
				3967	* _sdwa for VOP_SDWA
				3968
				3969	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3970
				3971	.. code-block:: nasm
				3972
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3973	v_mov_b32 v1, v2
				3974	v_mov_b32_e32 v1, v2
				3975	v_nop
				3976	v_cvt_f64_i32_e32 v[1:2], v2
				3977	v_floor_f32_e32 v1, v2
				3978	v_bfrev_b32_e32 v1, v2
				3979	v_add_f32_e32 v1, v2, v3
				3980	v_mul_i32_i24_e64 v1, v2, 3
				3981	v_mul_i32_i24_e32 v1, -3, v3
				3982	v_mul_i32_i24_e32 v1, -100, v3
				3983	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				3984	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3985
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3986	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3987
				3988	.. code-block:: nasm
				3989
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3990	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				3991	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3992	v_mov_b32 v0, v0 wave_shl:1
				3993	v_mov_b32 v0, v0 row_mirror
				3994	v_mov_b32 v0, v0 row_bcast:31
				3995	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3996	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3997	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3998
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3999	VOP_SDWA examples:
				4000
				4001	.. code-block:: nasm
				4002
				4003	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				4004	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				4005	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				4006	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				4007	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				4008
				4009	For full list of supported instructions, refer to "Vector ALU instructions".
				4010
				4011	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4012	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4013
				4014	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				4015	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4016
				4017	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4018	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4019
				4020	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4021	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4022
				4023	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4024	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				4025
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4026
				4027	major, minor, and stepping are all integers that describe the instruction
				4028	set architecture (ISA) version of the assembly program.
				4029
				4030	vendor and arch are quoted strings. vendor should always be equal to
				4031	"AMD" and arch should always be equal to "AMDGPU".
				4032
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4033	By default, the assembler will derive the ISA version, vendor, and arch
				4034	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4035
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4036	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4037	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4038
				4039	This directives specifies that the symbol with given name is a kernel entry point
				4040	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4041
				4042	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4043	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4044
				4045	This directive marks the beginning of a list of key / value pairs that are used
				4046	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				4047	The list must be terminated by the .end_amd_kernel_code_t directive. For
				4048	any amd_kernel_code_t values that are unspecified a default value will be
				4049	used. The default value for all keys is 0, with the following exceptions:
				4050
				4051	- kernel_code_version_major defaults to 1.
				4052	- machine_kind defaults to 1.
				4053	- machine_version_major, machine_version_minor, and
				4054	machine_version_stepping are derived from the value of the -mcpu option
				4055	that is passed to the assembler.
				4056	- kernel_code_entry_byte_offset defaults to 256.
				4057	- wavefront_size defaults to 6.
				4058	- kernarg_segment_alignment, group_segment_alignment, and
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	4059	private_segment_alignment default to 4. Note that alignments are specified
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4060	as a power of two, so a value of n means an alignment of 2^ n.
				4061
				4062	The .amd_kernel_code_t directive must be placed immediately after the
				4063	function label and before any instructions.
				4064
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4065	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				4066	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4067
				4068	Here is an example of a minimal amd_kernel_code_t specification:
				4069
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	4070	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4071
				4072	.hsa_code_object_version 1,0
				4073	.hsa_code_object_isa
				4074
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4075	.hsatext
				4076	.globl hello_world
				4077	.p2align 8
				4078	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4079
				4080	hello_world:
				4081
				4082	.amd_kernel_code_t
				4083	enable_sgpr_kernarg_segment_ptr = 1
				4084	is_ptr64 = 1
				4085	compute_pgm_rsrc1_vgprs = 0
				4086	compute_pgm_rsrc1_sgprs = 0
				4087	compute_pgm_rsrc2_user_sgpr = 2
				4088	kernarg_segment_byte_size = 8
				4089	wavefront_sgpr_count = 2
				4090	workitem_vgpr_count = 3
				4091	.end_amd_kernel_code_t
				4092
				4093	s_load_dwordx2 s[0:1], s[0:1] 0x0
				4094	v_mov_b32 v0, 3.14159
				4095	s_waitcnt lgkmcnt(0)
				4096	v_mov_b32 v1, s0
				4097	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4098	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4099	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	4100	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4101	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4102
				4103	Additional Documentation
				4104	========================
				4105
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4106	.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				4107	.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				4108	.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				4109	.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				4110	.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				4111	.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				4112	.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
				4113	.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4114	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				4115	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				4116	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				4117	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				4118	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	4119	.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4120	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				4121	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__