Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: 439089348fffb8add0314fe259e5608250a49415 [file] [log] [blame]

Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1	=============================
				2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	26	.. table:: AMDGPU Architectures
				27	:name: amdgpu-architecture-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	28
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	29	============ ==============================================================
				30	Architecture Description
				31	============ ==============================================================
				32	``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
				33	``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
				34	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	35
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	36	.. table:: AMDGPU Vendors
				37	:name: amdgpu-vendor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	38
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	39	============ ==============================================================
				40	Vendor Description
				41	============ ==============================================================
				42	``amd`` Can be used for all AMD GPU usage.
				43	``mesa3d`` Can be used if the OS is ``mesa3d``.
				44	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	45
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	46	.. table:: AMDGPU Operating Systems
				47	:name: amdgpu-os-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	48
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	49	============== ============================================================
				50	OS Description
				51	============== ============================================================
				52	<empty> Defaults to the unknown OS.
				53	``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
				54	such as AMD's ROCm [AMD-ROCm]_.
				55	``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
				56	runtime.
				57	``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
				58	runtime.
				59	============== ============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	60
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	61	.. table:: AMDGPU Environments
				62	:name: amdgpu-environment-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	63
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	64	============ ==============================================================
				65	Environment Description
				66	============ ==============================================================
				67	<empty> Defaults to ``opencl``.
				68	``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
				69	``amdgizcl`` Same as ``opencl`` except a different address space mapping is
				70	used (see :ref:`amdgpu-address-spaces`).
				71	``amdgiz`` Same as ``opencl`` except a different address space mapping is
				72	used (see :ref:`amdgpu-address-spaces`).
				73	``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
				74	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	75
				76	.. _amdgpu-processors:
				77
				78	Processors
				79	----------
				80
				81	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				82	names from both the Processor and Alternative Processor can be used.
				83
				84	.. table:: AMDGPU Processors
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	85	:name: amdgpu-processor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	86
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	87	=========== =============== ============ ===== ========= ======= ==================
				88	Processor Alternative Target dGPU/ Target ROCm Example
				89	Processor Triple APU Features Support Products
				90	Architecture Supported
				91	[Default]
				92	=========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	93	Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	94	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	95	``r600`` ``r600`` dGPU
				96	``r630`` ``r600`` dGPU
				97	``rs880`` ``r600`` dGPU
				98	``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	99	Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	100	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	101	``rv710`` ``r600`` dGPU
				102	``rv730`` ``r600`` dGPU
				103	``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	104	Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	105	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	106	``cedar`` ``r600`` dGPU
				107	``redwood`` ``r600`` dGPU
				108	``sumo`` ``r600`` dGPU
				109	``juniper`` ``r600`` dGPU
				110	``cypress`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	111	Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	112	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	113	``barts`` ``r600`` dGPU
				114	``turks`` ``r600`` dGPU
				115	``caicos`` ``r600`` dGPU
				116	``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	117	GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	118	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	119	``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
				120	``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU
				121	- ``verde``
				122	- ``oland``
				123	- ``hainan``
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	124	GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	125	-----------------------------------------------------------------------------------
				126	``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
				127	- A6 Pro-7050B
				128	- A8-7100
				129	- A8 Pro-7150B
				130	- A10-7300
				131	- A10 Pro-7350B
				132	- FX-7500
				133	- A8-7200P
				134	- A10-7400P
				135	- FX-7600P
				136	``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
				137	- FirePro W9100
				138	- FirePro S9150
				139	- FirePro S9170
				140	``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
				141	- Radeon R9 290x
				142	- Radeon R390
				143	- Radeon R390x
				144	``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
				145	- ``mullins`` - E1-2200
				146	- E1-2500
				147	- E2-3000
				148	- E2-3800
				149	- A4-5000
				150	- A4-5100
				151	- A6-5200
				152	- A4 Pro-3340B
				153	``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
				154	- Radeon HD 8770
				155	- R7 260
				156	- R7 260X
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	157	GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	158	-----------------------------------------------------------------------------------
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	159	``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
				160	[on] - Pro A6-8500B
				161	- A8-8600P
				162	- Pro A8-8600B
				163	- FX-8800P
				164	- Pro A12-8800B
				165	\ ``amdgcn`` APU - xnack ROCm - A10-8700P
				166	[on] - Pro A10-8700B
				167	- A10-8780P
				168	\ ``amdgcn`` APU - xnack - A10-9600P
				169	[on] - A10-9630P
				170	- A12-9700P
				171	- A12-9730P
				172	- FX-9800P
				173	- FX-9830P
				174	\ ``amdgcn`` APU - xnack - E2-9010
				175	[on] - A6-9210
				176	- A9-9410
				177	``gfx802`` - ``tonga`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
Tony Tye	a697880	2017-12-12 05:47:00 +0000	[diff] [blame^]	178	- ``iceland`` [off] - FirePro S7100
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	179	- FirePro W7100
				180	- Radeon R285
				181	- Radeon R9 380
				182	- Radeon R9 385
				183	- Mobile FirePro
				184	M7170
				185	``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
				186	[off] - Radeon R9 Fury
				187	- Radeon R9 FuryX
				188	- Radeon Pro Duo
				189	- FirePro S9300x2
				190	- Radeon Instinct MI8
				191	\ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
				192	[off] - Radeon RX 480
				193	- Radeon Instinct MI6
				194	\ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
				195	[off]
				196	``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
				197	[on]
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	198	GCN GFX9 [AMD-GCN-GFX9]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	199	-----------------------------------------------------------------------------------
				200	``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
				201	[off] Frontier Edition
				202	- Radeon RX Vega 56
				203	- Radeon RX Vega 64
				204	- Radeon RX Vega 64
				205	Liquid
				206	- Radeon Instinct MI25
				207	``gfx902`` ``amdgcn`` APU - xnack TBA
				208	[on]
				209	.. TODO
				210	Add product
				211	names.
				212	=========== =============== ============ ===== ========= ======= ==================
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	213
				214	.. _amdgpu-target-features:
				215
				216	Target Features
				217	---------------
				218
				219	Target features control how code is generated to support certain
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	220	processor specific features. Not all target features are supported by
				221	all processors. The runtime must ensure that the features supported by
				222	the device used to execute the code match the features enabled when
				223	generating the code. A mismatch of features may result in incorrect
				224	execution, or a reduction in performance.
				225
				226	The target features supported by each processor, and the default value
				227	used if not specified explicitly, is listed in
				228	:ref:`amdgpu-processor-table`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	229
				230	Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
				231	target features.
				232
				233	For example:
				234
				235	``-mxnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	236	Enable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	237	``-mno-xnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	238	Disable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	239
				240	.. table:: AMDGPU Target Features
				241	:name: amdgpu-target-feature-table
				242
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	243	============== ==================================================
				244	Target Feature Description
				245	============== ==================================================
				246	-m[no-]xnack Enable/disable generating code that has
				247	memory clauses that are compatible with
				248	having XNACK replay enabled.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	249
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	250	This is used for demand paging and page
				251	migration. If XNACK replay is enabled in
				252	the device, then if a page fault occurs
				253	the code may execute incorrectly if the
				254	``xnack`` feature is not enabled. Executing
				255	code that has the feature enabled on a
				256	device that does not have XNACK replay
				257	enabled will execute correctly, but may
				258	be less performant than code with the
				259	feature disabled.
				260	============== ==================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	261
				262	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	263
				264	Address Spaces
				265	--------------
				266
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	267	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	268
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	269	The memory space names used in the table, aside from the region memory space, is
				270	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	271
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	272	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	273
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	274	.. table:: Address Space Mapping
				275	:name: amdgpu-address-space-mapping-table
				276
				277	================== ================= ================= ================= =================
				278	LLVM Address Space Memory Space
				279	------------------ -----------------------------------------------------------------------
				280	\ Current Default amdgiz/amdgizcl hcc Future Default
				281	================== ================= ================= ================= =================
				282	0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
				283	1 Global Global Global Global
				284	2 Constant Constant Constant Region (GDS)
				285	3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
				286	4 Generic (Flat) Region (GDS) Region (GDS) Constant
				287	5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
				288	================== ================= ================= ================= =================
				289
				290	Current Default
				291	This is the current default address space mapping used for all languages
				292	except hcc. This will shortly be deprecated.
				293
				294	amdgiz/amdgizcl
				295	This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
				296	is specified as the target triple environment value.
				297
				298	hcc
				299	This is the current address space mapping used when ``hcc`` is specified as
				300	the target triple environment value.This will shortly be deprecated.
				301
				302	Future Default
				303	This will shortly be the only address space mapping for all languages using
				304	AMDGPU backend.
				305
				306	.. _amdgpu-memory-scopes:
				307
				308	Memory Scopes
				309	-------------
				310
				311	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				312	backend memory model when the target triple OS is ``amdhsa`` (see
				313	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				314
				315	The memory model supported is based on the HSA memory model [HSA]_ which is
				316	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				317	relation is transitive over the synchonizes-with relation independent of scope,
				318	and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	319	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	320
				321	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				322	inclusion and requires the memory scopes to exactly match. However, this
				323	is conservatively correct for OpenCL.
				324
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	325	.. table:: AMDHSA LLVM Sync Scopes
				326	:name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	327
				328	================ ==========================================================
				329	LLVM Sync Scope Description
				330	================ ==========================================================
				331	none The default: ``system``.
				332
				333	Synchronizes with, and participates in modification and
				334	seq_cst total orderings with, other operations (except
				335	image operations) for all address spaces (except private,
				336	or generic that accesses private) provided the other
				337	operation's sync scope is:
				338
				339	- ``system``.
				340	- ``agent`` and executed by a thread on the same agent.
				341	- ``workgroup`` and executed by a thread in the same
				342	workgroup.
				343	- ``wavefront`` and executed by a thread in the same
				344	wavefront.
				345
				346	``agent`` Synchronizes with, and participates in modification and
				347	seq_cst total orderings with, other operations (except
				348	image operations) for all address spaces (except private,
				349	or generic that accesses private) provided the other
				350	operation's sync scope is:
				351
				352	- ``system`` or ``agent`` and executed by a thread on the
				353	same agent.
				354	- ``workgroup`` and executed by a thread in the same
				355	workgroup.
				356	- ``wavefront`` and executed by a thread in the same
				357	wavefront.
				358
				359	``workgroup`` Synchronizes with, and participates in modification and
				360	seq_cst total orderings with, other operations (except
				361	image operations) for all address spaces (except private,
				362	or generic that accesses private) provided the other
				363	operation's sync scope is:
				364
				365	- ``system``, ``agent`` or ``workgroup`` and executed by a
				366	thread in the same workgroup.
				367	- ``wavefront`` and executed by a thread in the same
				368	wavefront.
				369
				370	``wavefront`` Synchronizes with, and participates in modification and
				371	seq_cst total orderings with, other operations (except
				372	image operations) for all address spaces (except private,
				373	or generic that accesses private) provided the other
				374	operation's sync scope is:
				375
				376	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				377	and executed by a thread in the same wavefront.
				378
				379	``singlethread`` Only synchronizes with, and participates in modification
				380	and seq_cst total orderings with, other operations (except
				381	image operations) running in the same thread for all
				382	address spaces (for example, in signal handlers).
				383	================ ==========================================================
				384
				385	AMDGPU Intrinsics
				386	-----------------
				387
				388	The AMDGPU backend implements the following intrinsics.
				389
				390	This section is WIP.
				391
				392	.. TODO
				393	List AMDGPU intrinsics
				394
				395	Code Object
				396	===========
				397
				398	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				399	can be linked by ``lld`` to produce a standard ELF shared code object which can
				400	be loaded and executed on an AMDGPU target.
				401
				402	Header
				403	------
				404
				405	The AMDGPU backend uses the following ELF header:
				406
				407	.. table:: AMDGPU ELF Header
				408	:name: amdgpu-elf-header-table
				409
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	410	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	411	Field Value
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	412	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	413	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				414	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	415	``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
				416	- ``ELFOSABI_AMDGPU_HSA``
				417	- ``ELFOSABI_AMDGPU_PAL``
				418	- ``ELFOSABI_AMDGPU_MESA3D``
				419	``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
				420	- ``ELFABIVERSION_AMDGPU_PAL``
				421	- ``ELFABIVERSION_AMDGPU_MESA3D``
				422	``e_type`` - ``ET_REL``
				423	- ``ET_DYN``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	424	``e_machine`` ``EM_AMDGPU``
				425	``e_entry`` 0
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	426	``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	427	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	428
				429	..
				430
				431	.. table:: AMDGPU ELF Header Enumeration Values
				432	:name: amdgpu-elf-header-enumeration-values-table
				433
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	434	=============================== =====
				435	Name Value
				436	=============================== =====
				437	``EM_AMDGPU`` 224
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	438	``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	439	``ELFOSABI_AMDGPU_HSA`` 64
				440	``ELFOSABI_AMDGPU_PAL`` 65
				441	``ELFOSABI_AMDGPU_MESA3D`` 66
				442	``ELFABIVERSION_AMDGPU_HSA`` 1
				443	``ELFABIVERSION_AMDGPU_PAL`` 0
				444	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				445	=============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	446
				447	``e_ident[EI_CLASS]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	448	The ELF class is:
				449
				450	* ``ELFCLASS32`` for ``r600`` architecture.
				451
				452	* ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
				453	bit applications.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	454
				455	``e_ident[EI_DATA]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	456	All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	457
				458	``e_ident[EI_OSABI]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	459	One of the following AMD GPU architecture specific OS ABIs
				460	(see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	461
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	462	* ``ELFOSABI_NONE`` for unknown OS.
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	463
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	464	* ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	465
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	466	* ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
				467
				468	* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	469
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	470	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	471	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				472	object conforms:
				473
				474	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				475	runtime ABI.
				476
				477	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				478	runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	479
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	480	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	481	3D runtime ABI.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	482
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	483	``e_type``
				484	Can be one of the following values:
				485
				486
				487	``ET_REL``
				488	The type produced by the AMD GPU backend compiler as it is relocatable code
				489	object.
				490
				491	``ET_DYN``
				492	The type produced by the linker as it is a shared code object.
				493
				494	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				495
				496	``e_machine``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	497	The value ``EM_AMDGPU`` is used for the machine for all processors supported
				498	by the ``r600`` and ``amdgcn`` architectures (see
				499	:ref:`amdgpu-processor-table`). The specific processor is specified in the
				500	``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
				501	:ref:`amdgpu-elf-header-e_flags-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	502
				503	``e_entry``
				504	The entry point is 0 as the entry points for individual kernels must be
				505	selected in order to invoke them through AQL packets.
				506
				507	``e_flags``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	508	The AMDGPU backend uses the following ELF header flags:
				509
				510	.. table:: AMDGPU ELF Header ``e_flags``
				511	:name: amdgpu-elf-header-e_flags-table
				512
				513	================================= ========== =============================
				514	Name Value Description
				515	================================= ========== =============================
				516	AMDGPU Processor Flag See :ref:`amdgpu-processor-table`.
				517	-------------------------------------------- -----------------------------
				518	``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
				519	mask for
				520	``EF_AMDGPU_MACH_xxx`` values
				521	defined in
				522	:ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	523	``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
				524	target feature is
				525	enabled for all code
				526	contained in the code object.
				527	See
				528	:ref:`amdgpu-target-features`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	529	================================= ========== =============================
				530
				531	.. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
				532	:name: amdgpu-ef-amdgpu-mach-table
				533
				534	================================= ========== =============================
				535	Name Value Description (see
				536	:ref:`amdgpu-processor-table`)
				537	================================= ========== =============================
				538	``EF_AMDGPU_MACH_NONE`` 0 not specified
				539	``EF_AMDGPU_MACH_R600_R600`` 1 ``r600``
				540	``EF_AMDGPU_MACH_R600_R630`` 2 ``r630``
				541	``EF_AMDGPU_MACH_R600_RS880`` 3 ``rs880``
				542	``EF_AMDGPU_MACH_R600_RV670`` 4 ``rv670``
				543	``EF_AMDGPU_MACH_R600_RV710`` 5 ``rv710``
				544	``EF_AMDGPU_MACH_R600_RV730`` 6 ``rv730``
				545	``EF_AMDGPU_MACH_R600_RV770`` 7 ``rv770``
				546	``EF_AMDGPU_MACH_R600_CEDAR`` 8 ``cedar``
				547	``EF_AMDGPU_MACH_R600_REDWOOD`` 9 ``redwood``
				548	``EF_AMDGPU_MACH_R600_SUMO`` 10 ``sumo``
				549	``EF_AMDGPU_MACH_R600_JUNIPER`` 11 ``juniper``
				550	``EF_AMDGPU_MACH_R600_CYPRESS`` 12 ``cypress``
				551	``EF_AMDGPU_MACH_R600_BARTS`` 13 ``barts``
				552	``EF_AMDGPU_MACH_R600_TURKS`` 14 ``turks``
				553	``EF_AMDGPU_MACH_R600_CAICOS`` 15 ``caicos``
				554	``EF_AMDGPU_MACH_R600_CAYMAN`` 16 ``cayman``
				555	reserved 17-31 Reserved for ``r600``
				556	architecture processors.
				557	``EF_AMDGPU_MACH_AMDGCN_GFX600`` 32 ``gfx600``
				558	``EF_AMDGPU_MACH_AMDGCN_GFX601`` 33 ``gfx601``
				559	``EF_AMDGPU_MACH_AMDGCN_GFX700`` 34 ``gfx700``
				560	``EF_AMDGPU_MACH_AMDGCN_GFX701`` 35 ``gfx701``
				561	``EF_AMDGPU_MACH_AMDGCN_GFX702`` 36 ``gfx702``
				562	``EF_AMDGPU_MACH_AMDGCN_GFX703`` 37 ``gfx703``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	563	``EF_AMDGPU_MACH_AMDGCN_GFX704`` 38 ``gfx704``
Tony Tye	a697880	2017-12-12 05:47:00 +0000	[diff] [blame^]	564	``EF_AMDGPU_MACH_AMDGCN_GFX801`` 39 ``gfx801``
				565	``EF_AMDGPU_MACH_AMDGCN_GFX802`` 40 ``gfx802``
				566	``EF_AMDGPU_MACH_AMDGCN_GFX803`` 41 ``gfx803``
				567	``EF_AMDGPU_MACH_AMDGCN_GFX810`` 42 ``gfx810``
				568	``EF_AMDGPU_MACH_AMDGCN_GFX900`` 43 ``gfx900``
				569	``EF_AMDGPU_MACH_AMDGCN_GFX902`` 44 ``gfx902``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	570	================================= ========== =============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	571
				572	Sections
				573	--------
				574
				575	An AMDGPU target ELF code object has the standard ELF sections which include:
				576
				577	.. table:: AMDGPU ELF Sections
				578	:name: amdgpu-elf-sections-table
				579
				580	================== ================ =================================
				581	Name Type Attributes
				582	================== ================ =================================
				583	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				584	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				585	``.debug_``\ \* ``SHT_PROGBITS`` none
				586	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				587	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				588	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				589	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				590	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				591	``.note`` ``SHT_NOTE`` none
				592	``.rela``\ name ``SHT_RELA`` none
				593	``.rela.dyn`` ``SHT_RELA`` none
				594	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				595	``.shstrtab`` ``SHT_STRTAB`` none
				596	``.strtab`` ``SHT_STRTAB`` none
				597	``.symtab`` ``SHT_SYMTAB`` none
				598	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				599	================== ================ =================================
				600
				601	These sections have their standard meanings (see [ELF]_) and are only generated
				602	if needed.
				603
				604	``.debug``\ \*
				605	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				606	DWARF produced by the AMDGPU backend.
				607
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	608	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	609	The standard sections used by a dynamic loader.
				610
				611	``.note``
				612	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				613	backend.
				614
				615	``.rela``\ name, ``.rela.dyn``
				616	For relocatable code objects, name is the name of the section that the
				617	relocation records apply. For example, ``.rela.text`` is the section name for
				618	relocation records associated with the ``.text`` section.
				619
				620	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				621	records from each of the relocatable code object's ``.rela``\ name sections.
				622
				623	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				624	the AMDGPU backend.
				625
				626	``.text``
				627	The executable machine code for the kernels and functions they call. Generated
				628	as position independent code. See :ref:`amdgpu-code-conventions` for
				629	information on conventions used in the isa generation.
				630
				631	.. _amdgpu-note-records:
				632
				633	Note Records
				634	------------
				635
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	636	As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
				637	be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
				638	aligned. In addition, minimal zero byte padding must be generated to ensure the
				639	``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
				640	``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	641
				642	The AMDGPU backend code object uses the following ELF note records in the
				643	``.note`` section. The Description column specifies the layout of the note
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	644	record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	645	variable size strings have a corresponding ``*_size`` field that specifies the
				646	number of bytes, including the terminating null character, in the string. The
				647	string(s) come immediately after the preceding fields.
				648
				649	Additional note records can be present.
				650
				651	.. table:: AMDGPU ELF Note Records
				652	:name: amdgpu-elf-note-records-table
				653
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	654	===== ============================== ======================================
				655	Name Type Description
				656	===== ============================== ======================================
				657	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	658	===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	659
				660	..
				661
				662	.. table:: AMDGPU ELF Note Record Enumeration Values
				663	:name: amdgpu-elf-note-record-enumeration-values-table
				664
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	665	============================== =====
				666	Name Value
				667	============================== =====
				668	reserved 0-9
				669	``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	670	reserved 11
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	671	============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	672
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	673	``NT_AMD_AMDGPU_HSA_METADATA``
				674	Specifies extensible metadata associated with the code objects executed on HSA
				675	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				676	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				677	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				678	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	679
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	680	.. _amdgpu-symbols:
				681
				682	Symbols
				683	-------
				684
				685	Symbols include the following:
				686
				687	.. table:: AMDGPU ELF Symbols
				688	:name: amdgpu-elf-symbols-table
				689
				690	===================== ============== ============= ==================
				691	Name Type Section Description
				692	===================== ============== ============= ==================
				693	link-name ``STT_OBJECT`` - ``.data`` Global variable
				694	- ``.rodata``
				695	- ``.bss``
				696	link-name\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
				697	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				698	===================== ============== ============= ==================
				699
				700	Global variable
				701	Global variables both used and defined by the compilation unit.
				702
				703	If the symbol is defined in the compilation unit then it is allocated in the
				704	appropriate section according to if it has initialized data or is readonly.
				705
				706	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				707	will resolve relocations using the definition provided by another code object
				708	or explicitly defined by the runtime.
				709
				710	All global symbols, whether defined in the compilation unit or external, are
				711	accessed by the machine code indirectly through a GOT table entry. This
				712	allows them to be preemptable. The GOT table is only supported when the target
				713	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	714
				715	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	716	Add description of linked shared object symbols. Seems undefined symbols
				717	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	718
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	719	Kernel descriptor
				720	Every HSA kernel has an associated kernel descriptor. It is the address of the
				721	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				722	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				723	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				724
				725	Kernel entry point
				726	Every HSA kernel also has a symbol for its machine code entry point.
				727
				728	.. _amdgpu-relocation-records:
				729
				730	Relocation Records
				731	------------------
				732
				733	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				734	relocatable fields are:
				735
				736	``word32``
				737	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				738	alignment. These values use the same byte order as other word values in the
				739	AMD GPU architecture.
				740
				741	``word64``
				742	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				743	alignment. These values use the same byte order as other word values in the
				744	AMD GPU architecture.
				745
				746	Following notations are used for specifying relocation calculations:
				747
				748	A
				749	Represents the addend used to compute the value of the relocatable field.
				750
				751	G
				752	Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	753	entry's symbol will reside during execution.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	754
				755	GOT
				756	Represents the address of the global offset table.
				757
				758	P
				759	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				760	of the storage unit being relocated (computed using ``r_offset``).
				761
				762	S
				763	Represents the value of the symbol whose index resides in the relocation
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	764	entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				765
				766	B
				767	Represents the base address of a loaded executable or shared object which is
				768	the difference between the ELF address and the actual load address. Relocations
				769	using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	770
				771	The following relocation types are supported:
				772
				773	.. table:: AMDGPU ELF Relocation Records
				774	:name: amdgpu-elf-relocation-records-table
				775
				776	========================== ===== ========== ==============================
				777	Relocation Type Value Field Calculation
				778	========================== ===== ========== ==============================
				779	``R_AMDGPU_NONE`` 0 none none
				780	``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
				781	``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
				782	``R_AMDGPU_ABS64`` 3 ``word64`` S + A
				783	``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
				784	``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
				785	``R_AMDGPU_ABS32`` 6 ``word32`` S + A
				786	``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
				787	``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				788	``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
				789	``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				790	``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	791	reserved 12
				792	``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	793	========================== ===== ========== ==============================
				794
				795	.. _amdgpu-dwarf:
				796
				797	DWARF
				798	-----
				799
				800	Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
				801	information that maps the code object executable code and data to the source
				802	language constructs. It can be used by tools such as debuggers and profilers.
				803
				804	Address Space Mapping
				805	~~~~~~~~~~~~~~~~~~~~~
				806
				807	The following address space mapping is used:
				808
				809	.. table:: AMDGPU DWARF Address Space Mapping
				810	:name: amdgpu-dwarf-address-space-mapping-table
				811
				812	=================== =================
				813	DWARF Address Space Memory Space
				814	=================== =================
				815	1 Private (Scratch)
				816	2 Local (group/LDS)
				817	omitted Global
				818	omitted Constant
				819	omitted Generic (Flat)
				820	not supported Region (GDS)
				821	=================== =================
				822
				823	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				824	used in the table.
				825
				826	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				827	DWARF address space of the value of the pointer when it is in the private or
				828	local address space. Otherwise the attribute is omitted.
				829
				830	An ``XDEREF`` operation is generated in location list expressions for variables
				831	that are allocated in the private and local address space. Otherwise no
				832	``XDREF`` is omitted.
				833
				834	Register Mapping
				835	~~~~~~~~~~~~~~~~
				836
				837	This section is WIP.
				838
				839	.. TODO
				840	Define DWARF register enumeration.
				841
				842	If want to present a wavefront state then should expose vector registers as
				843	64 wide (rather than per work-item view that LLVM uses). Either as separate
				844	registers, or a 64x4 byte single register. In either case use a new LANE op
				845	(akin to XDREF) to select the current lane usage in a location
				846	expression. This would also allow scalar register spilling to vector register
				847	lanes to be expressed (currently no debug information is being generated for
				848	spilling). If choose a wide single register approach then use LANE in
				849	conjunction with PIECE operation to select the dword part of the register for
				850	the current lane. If the separate register approach then use LANE to select
				851	the register.
				852
				853	Source Text
				854	~~~~~~~~~~~
				855
				856	This section is WIP.
				857
				858	.. TODO
				859	DWARF extension to include runtime generated source text.
				860
				861	.. _amdgpu-code-conventions:
				862
				863	Code Conventions
				864	================
				865
				866	This section provides code conventions used for each supported target triple OS
				867	(see :ref:`amdgpu-target-triples`).
				868
				869	AMDHSA
				870	------
				871
				872	This section provides code conventions used when the target triple OS is
				873	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				874
				875	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	876
				877	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	878	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	879
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	880	The code object metadata specifies extensible metadata associated with the code
				881	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
				882	[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
				883	(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
				884	``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
				885	information necessary to support the ROCM kernel queries. For example, the
				886	segment sizes needed in a dispatch packet. In addition, a high level language
				887	runtime may require other information to be included. For example, the AMD
				888	OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	889
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	890	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	891	:doc:`YamlIO`).
				892
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	893	.. TODO
				894	Is the string null terminated? It probably should not if YAML allows it to
				895	contain null characters, otherwise it should be.
				896
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	897	The metadata is represented as a single YAML document comprised of the mapping
				898	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				899	referenced tables.
				900
				901	For boolean values, the string values of ``false`` and ``true`` are used for
				902	false and true respectively.
				903
				904	Additional information can be added to the mappings. To avoid conflicts, any
				905	non-AMD key names should be prefixed by "vendor-name.".
				906
				907	.. table:: AMDHSA Code Object Metadata Mapping
				908	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				909
				910	========== ============== ========= =======================================
				911	String Key Value Type Required? Description
				912	========== ============== ========= =======================================
				913	"Version" sequence of Required - The first integer is the major
				914	2 integers version. Currently 1.
				915	- The second integer is the minor
				916	version. Currently 0.
				917	"Printf" sequence of Each string is encoded information
				918	strings about a printf function call. The
				919	encoded information is organized as
				920	fields separated by colon (':'):
				921
				922	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				923
				924	where:
				925
				926	``ID``
				927	A 32 bit integer as a unique id for
				928	each printf function call
				929
				930	``N``
				931	A 32 bit integer equal to the number
				932	of arguments of printf function call
				933	minus 1
				934
				935	``S[i]`` (where i = 0, 1, ... , N-1)
				936	32 bit integers for the size in bytes
				937	of the i-th FormatString argument of
				938	the printf function call
				939
				940	FormatString
				941	The format string passed to the
				942	printf function call.
				943	"Kernels" sequence of Required Sequence of the mappings for each
				944	mapping kernel in the code object. See
				945	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				946	for the definition of the mapping.
				947	========== ============== ========= =======================================
				948
				949	..
				950
				951	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				952	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				953
				954	================= ============== ========= ================================
				955	String Key Value Type Required? Description
				956	================= ============== ========= ================================
				957	"Name" string Required Source name of the kernel.
				958	"SymbolName" string Required Name of the kernel
				959	descriptor ELF symbol.
				960	"Language" string Source language of the kernel.
				961	Values include:
				962
				963	- "OpenCL C"
				964	- "OpenCL C++"
				965	- "HCC"
				966	- "OpenMP"
				967
				968	"LanguageVersion" sequence of - The first integer is the major
				969	2 integers version.
				970	- The second integer is the
				971	minor version.
				972	"Attrs" mapping Mapping of kernel attributes.
				973	See
				974	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				975	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	976	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	977	mapping kernel arguments. See
				978	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				979	for the definition of the mapping.
				980	"CodeProps" mapping Mapping of properties related to
				981	the kernel code. See
				982	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				983	for the mapping definition.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	984	================= ============== ========= ================================
				985
				986	..
				987
				988	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				989	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				990
				991	=================== ============== ========= ==============================
				992	String Key Value Type Required? Description
				993	=================== ============== ========= ==============================
				994	"ReqdWorkGroupSize" sequence of The dispatch work-group size
				995	3 integers X, Y, Z must correspond to the
				996	specified values.
				997
				998	Corresponds to the OpenCL
				999	``reqd_work_group_size``
				1000	attribute.
				1001	"WorkGroupSizeHint" sequence of The dispatch work-group size
				1002	3 integers X, Y, Z is likely to be the
				1003	specified values.
				1004
				1005	Corresponds to the OpenCL
				1006	``work_group_size_hint``
				1007	attribute.
				1008	"VecTypeHint" string The name of a scalar or vector
				1009	type.
				1010
				1011	Corresponds to the OpenCL
				1012	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	1013
				1014	"RuntimeHandle" string The external symbol name
				1015	associated with a kernel.
				1016	OpenCL runtime allocates a
				1017	global buffer for the symbol
				1018	and saves the kernel's address
				1019	to it, which is used for
				1020	device side enqueueing. Only
				1021	available for device side
				1022	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1023	=================== ============== ========= ==============================
				1024
				1025	..
				1026
				1027	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				1028	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				1029
				1030	================= ============== ========= ================================
				1031	String Key Value Type Required? Description
				1032	================= ============== ========= ================================
				1033	"Name" string Kernel argument name.
				1034	"TypeName" string Kernel argument type name.
				1035	"Size" integer Required Kernel argument size in bytes.
				1036	"Align" integer Required Kernel argument alignment in
				1037	bytes. Must be a power of two.
				1038	"ValueKind" string Required Kernel argument kind that
				1039	specifies how to set up the
				1040	corresponding argument.
				1041	Values include:
				1042
				1043	"ByValue"
				1044	The argument is copied
				1045	directly into the kernarg.
				1046
				1047	"GlobalBuffer"
				1048	A global address space pointer
				1049	to the buffer data is passed
				1050	in the kernarg.
				1051
				1052	"DynamicSharedPointer"
				1053	A group address space pointer
				1054	to dynamically allocated LDS
				1055	is passed in the kernarg.
				1056
				1057	"Sampler"
				1058	A global address space
				1059	pointer to a S# is passed in
				1060	the kernarg.
				1061
				1062	"Image"
				1063	A global address space
				1064	pointer to a T# is passed in
				1065	the kernarg.
				1066
				1067	"Pipe"
				1068	A global address space pointer
				1069	to an OpenCL pipe is passed in
				1070	the kernarg.
				1071
				1072	"Queue"
				1073	A global address space pointer
				1074	to an OpenCL device enqueue
				1075	queue is passed in the
				1076	kernarg.
				1077
				1078	"HiddenGlobalOffsetX"
				1079	The OpenCL grid dispatch
				1080	global offset for the X
				1081	dimension is passed in the
				1082	kernarg.
				1083
				1084	"HiddenGlobalOffsetY"
				1085	The OpenCL grid dispatch
				1086	global offset for the Y
				1087	dimension is passed in the
				1088	kernarg.
				1089
				1090	"HiddenGlobalOffsetZ"
				1091	The OpenCL grid dispatch
				1092	global offset for the Z
				1093	dimension is passed in the
				1094	kernarg.
				1095
				1096	"HiddenNone"
				1097	An argument that is not used
				1098	by the kernel. Space needs to
				1099	be left for it, but it does
				1100	not need to be set up.
				1101
				1102	"HiddenPrintfBuffer"
				1103	A global address space pointer
				1104	to the runtime printf buffer
				1105	is passed in kernarg.
				1106
				1107	"HiddenDefaultQueue"
				1108	A global address space pointer
				1109	to the OpenCL device enqueue
				1110	queue that should be used by
				1111	the kernel by default is
				1112	passed in the kernarg.
				1113
				1114	"HiddenCompletionAction"
Yaxun Liu	c928f2a	2017-10-30 14:30:28 +0000	[diff] [blame]	1115	A global address space pointer
				1116	to help link enqueued kernels into
				1117	the ancestor tree for determining
				1118	when the parent kernel has finished.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1119
				1120	"ValueType" string Required Kernel argument value type. Only
				1121	present if "ValueKind" is
				1122	"ByValue". For vector data
				1123	types, the value is for the
				1124	element type. Values include:
				1125
				1126	- "Struct"
				1127	- "I8"
				1128	- "U8"
				1129	- "I16"
				1130	- "U16"
				1131	- "F16"
				1132	- "I32"
				1133	- "U32"
				1134	- "F32"
				1135	- "I64"
				1136	- "U64"
				1137	- "F64"
				1138
				1139	.. TODO
				1140	How can it be determined if a
				1141	vector type, and what size
				1142	vector?
				1143	"PointeeAlign" integer Alignment in bytes of pointee
				1144	type for pointer type kernel
				1145	argument. Must be a power
				1146	of 2. Only present if
				1147	"ValueKind" is
				1148	"DynamicSharedPointer".
				1149	"AddrSpaceQual" string Kernel argument address space
				1150	qualifier. Only present if
				1151	"ValueKind" is "GlobalBuffer" or
				1152	"DynamicSharedPointer". Values
				1153	are:
				1154
				1155	- "Private"
				1156	- "Global"
				1157	- "Constant"
				1158	- "Local"
				1159	- "Generic"
				1160	- "Region"
				1161
				1162	.. TODO
				1163	Is GlobalBuffer only Global
				1164	or Constant? Is
				1165	DynamicSharedPointer always
				1166	Local? Can HCC allow Generic?
				1167	How can Private or Region
				1168	ever happen?
				1169	"AccQual" string Kernel argument access
				1170	qualifier. Only present if
				1171	"ValueKind" is "Image" or
				1172	"Pipe". Values
				1173	are:
				1174
				1175	- "ReadOnly"
				1176	- "WriteOnly"
				1177	- "ReadWrite"
				1178
				1179	.. TODO
				1180	Does this apply to
				1181	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1182	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1183	performed by the kernel on the
				1184	kernel argument. Only present if
				1185	"ValueKind" is "GlobalBuffer",
				1186	"Image", or "Pipe". This may be
				1187	more restrictive than indicated
				1188	by "AccQual" to reflect what the
				1189	kernel actual does. If not
				1190	present then the runtime must
				1191	assume what is implied by
				1192	"AccQual" and "IsConst". Values
				1193	are:
				1194
				1195	- "ReadOnly"
				1196	- "WriteOnly"
				1197	- "ReadWrite"
				1198
				1199	"IsConst" boolean Indicates if the kernel argument
				1200	is const qualified. Only present
				1201	if "ValueKind" is
				1202	"GlobalBuffer".
				1203
				1204	"IsRestrict" boolean Indicates if the kernel argument
				1205	is restrict qualified. Only
				1206	present if "ValueKind" is
				1207	"GlobalBuffer".
				1208
				1209	"IsVolatile" boolean Indicates if the kernel argument
				1210	is volatile qualified. Only
				1211	present if "ValueKind" is
				1212	"GlobalBuffer".
				1213
				1214	"IsPipe" boolean Indicates if the kernel argument
				1215	is pipe qualified. Only present
				1216	if "ValueKind" is "Pipe".
				1217
				1218	.. TODO
				1219	Can GlobalBuffer be pipe
				1220	qualified?
				1221	================= ============== ========= ================================
				1222
				1223	..
				1224
				1225	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1226	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1227
				1228	============================ ============== ========= =====================
				1229	String Key Value Type Required? Description
				1230	============================ ============== ========= =====================
				1231	"KernargSegmentSize" integer Required The size in bytes of
				1232	the kernarg segment
				1233	that holds the values
				1234	of the arguments to
				1235	the kernel.
				1236	"GroupSegmentFixedSize" integer Required The amount of group
				1237	segment memory
				1238	required by a
				1239	work-group in
				1240	bytes. This does not
				1241	include any
				1242	dynamically allocated
				1243	group segment memory
				1244	that may be added
				1245	when the kernel is
				1246	dispatched.
				1247	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1248	private address space
				1249	memory required for a
				1250	work-item in
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1251	bytes. If the kernel
				1252	uses a dynamic call
				1253	stack then additional
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1254	space must be added
				1255	to this value for the
				1256	call stack.
				1257	"KernargSegmentAlign" integer Required The maximum byte
				1258	alignment of
				1259	arguments in the
				1260	kernarg segment. Must
				1261	be a power of 2.
				1262	"WavefrontSize" integer Required Wavefront size. Must
				1263	be a power of 2.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1264	"NumSGPRs" integer Required Number of scalar
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1265	registers used by a
				1266	wavefront for
				1267	GFX6-GFX9. This
				1268	includes the special
				1269	SGPRs for VCC, Flat
				1270	Scratch (GFX7-GFX9)
				1271	and XNACK (for
				1272	GFX8-GFX9). It does
				1273	not include the 16
				1274	SGPR added if a trap
				1275	handler is
				1276	enabled. It is not
				1277	rounded up to the
				1278	allocation
				1279	granularity.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1280	"NumVGPRs" integer Required Number of vector
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1281	registers used by
				1282	each work-item for
				1283	GFX6-GFX9
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1284	"MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1285	work-group size
				1286	supported by the
				1287	kernel in work-items.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1288	Must be >=1 and
				1289	consistent with any
				1290	non-0 values in
				1291	FixedWorkGroupSize.
				1292	"FixedWorkGroupSize" sequence of Corresponds to the
				1293	3 integers dispatch work-group
				1294	size X, Y, Z. If
				1295	omitted, defaults to
				1296	0, 0, 0. If an
				1297	element is non-0 then
				1298	the kernel must only
				1299	be launched with a
				1300	matching corresponding
				1301	work-group size.
Konstantin Zhuravlyov	06ae4ec	2017-11-28 17:51:08 +0000	[diff] [blame]	1302	"NumSpilledSGPRs" integer Number of stores from
				1303	a scalar register to
				1304	a register allocator
				1305	created spill
				1306	location.
				1307	"NumSpilledVGPRs" integer Number of stores from
				1308	a vector register to
				1309	a register allocator
				1310	created spill
				1311	location.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1312	============================ ============== ========= =====================
				1313
				1314	..
				1315
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1316	Kernel Dispatch
				1317	~~~~~~~~~~~~~~~
				1318
				1319	The HSA architected queuing language (AQL) defines a user space memory interface
				1320	that can be used to control the dispatch of kernels, in an agent independent
				1321	way. An agent can have zero or more AQL queues created for it using the ROCm
				1322	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1323	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1324	mechanics and packet layouts.
				1325
				1326	The packet processor of a kernel agent is responsible for detecting and
				1327	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1328	packet processor is implemented by the hardware command processor (CP),
				1329	asynchronous dispatch controller (ADC) and shader processor input controller
				1330	(SPI).
				1331
				1332	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1333	mode driver to initialize and register the AQL queue with CP.
				1334
				1335	To dispatch a kernel the following actions are performed. This can occur in the
				1336	CPU host program, or from an HSA kernel executing on a GPU.
				1337
				1338	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1339	executed is obtained.
				1340	2. A pointer to the kernel descriptor (see
				1341	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1342	obtained. It must be for a kernel that is contained in a code object that that
				1343	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1344	associated.
				1345	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1346	for a memory region with the kernarg property for the kernel agent that will
				1347	execute the kernel. It must be at least 16 byte aligned.
				1348	4. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	1349	allocation. The layout is defined in the HSA Programmer's Language Reference
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1350	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1351	memory in the same way constant memory is accessed. (Note that the HSA
				1352	specification allows an implementation to copy the kernel argument contents to
				1353	another location that is accessed by the kernel.)
				1354	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1355	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1356	packet. The packet must be set up, and the final write must use an atomic
				1357	store release to set the packet kind to ensure the packet contents are
				1358	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1359	notify the kernel agent that the AQL queue has been updated. These rules, and
				1360	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1361	System Architecture Specification* [HSA]_.
				1362	6. A kernel dispatch packet includes information about the actual dispatch,
				1363	such as grid and work-group size, together with information from the code
				1364	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1365	the kernel symbol can be used to obtain the code object values which are
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1366	recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1367	7. CP executes micro-code and is responsible for detecting and setting up the
				1368	GPU to execute the wavefronts of a kernel dispatch.
				1369	8. CP ensures that when the a wavefront starts executing the kernel machine
				1370	code, the scalar general purpose registers (SGPR) and vector general purpose
				1371	registers (VGPR) are set up as required by the machine code. The required
				1372	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1373	register state is defined in
				1374	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1375	9. The prolog of the kernel machine code (see
				1376	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1377	before continuing executing the machine code that corresponds to the kernel.
				1378	10. When the kernel dispatch has completed execution, CP signals the completion
				1379	signal specified in the kernel dispatch packet if not 0.
				1380
				1381	.. _amdgpu-amdhsa-memory-spaces:
				1382
				1383	Memory Spaces
				1384	~~~~~~~~~~~~~
				1385
				1386	The memory space properties are:
				1387
				1388	.. table:: AMDHSA Memory Spaces
				1389	:name: amdgpu-amdhsa-memory-spaces-table
				1390
				1391	================= =========== ======== ======= ==================
				1392	Memory Space Name HSA Segment Hardware Address NULL Value
				1393	Name Name Size
				1394	================= =========== ======== ======= ==================
				1395	Private private scratch 32 0x00000000
				1396	Local group LDS 32 0xFFFFFFFF
				1397	Global global global 64 0x0000000000000000
				1398	Constant constant *same as 64 0x0000000000000000
				1399	global*
				1400	Generic flat flat 64 0x0000000000000000
				1401	Region N/A GDS 32 *not implemented
				1402	for AMDHSA*
				1403	================= =========== ======== ======= ==================
				1404
				1405	The global and constant memory spaces both use global virtual addresses, which
				1406	are the same virtual address space used by the CPU. However, some virtual
				1407	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1408	and some by both.
				1409
				1410	Using the constant memory space indicates that the data will not change during
				1411	the execution of the kernel. This allows scalar read instructions to be
				1412	used. The vector and scalar L1 caches are invalidated of volatile data before
				1413	each kernel dispatch execution to allow constant memory to change values between
				1414	kernel dispatches.
				1415
				1416	The local memory space uses the hardware Local Data Store (LDS) which is
				1417	automatically allocated when the hardware creates work-groups of wavefronts, and
				1418	freed when all the wavefronts of a work-group have terminated. The data store
				1419	(DS) instructions can be used to access it.
				1420
				1421	The private memory space uses the hardware scratch memory support. If the kernel
				1422	uses scratch, then the hardware allocates memory that is accessed using
				1423	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1424	address to physical address is:
				1425
				1426	``wavefront-scratch-base +
				1427	(private-address * wavefront-size * 4) +
				1428	(wavefront-lane-id * 4)``
				1429
				1430	There are different ways that the wavefront scratch base address is determined
				1431	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1432	memory can be accessed in an interleaved manner using buffer instruction with
				1433	the scratch buffer descriptor and per wave scratch offset, by the scratch
				1434	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1435	same private address, the interleaving results in adjacent dwords being accessed
				1436	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1437	supported except by flat and scratch instructions in GFX9.
				1438
				1439	The generic address space uses the hardware flat address support available in
				1440	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1441	local appertures), that are outside the range of addressible global memory, to
				1442	map from a flat address to a private or local address.
				1443
				1444	FLAT instructions can take a flat address and access global, private (scratch)
				1445	and group (LDS) memory depending in if the address is within one of the
				1446	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1447	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1448	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1449	(see :ref:`amdgpu-amdhsa-m0`).
				1450
				1451	To convert between a segment address and a flat address the base address of the
				1452	appertures address can be used. For GFX7-GFX8 these are available in the
				1453	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1454	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1455	GFX9 the appature base addresses are directly available as inline constant
				1456	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1457	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1458	which makes it easier to convert from flat to segment or segment to flat.
				1459
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1460	Image and Samplers
				1461	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1462
				1463	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1464	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1465	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1466	enumeration values for the queries that are not trivially deducible from the S#
				1467	representation.
				1468
				1469	HSA Signals
				1470	~~~~~~~~~~~
				1471
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1472	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1473	structure allocated in memory accessible from both the CPU and GPU. The
				1474	structure is defined by the ROCm runtime and subject to change between releases
				1475	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1476
				1477	.. _amdgpu-amdhsa-hsa-aql-queue:
				1478
				1479	HSA AQL Queue
				1480	~~~~~~~~~~~~~
				1481
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1482	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1483	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1484	fields needed to implement certain language features such as the flat address
				1485	aperture bases. It also contains fields used by CP such as managing the
				1486	allocation of scratch memory.
				1487
				1488	.. _amdgpu-amdhsa-kernel-descriptor:
				1489
				1490	Kernel Descriptor
				1491	~~~~~~~~~~~~~~~~~
				1492
				1493	A kernel descriptor consists of the information needed by CP to initiate the
				1494	execution of a kernel, including the entry point address of the machine code
				1495	that implements the kernel.
				1496
				1497	Kernel Descriptor for GFX6-GFX9
				1498	+++++++++++++++++++++++++++++++
				1499
				1500	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1501
				1502	.. table:: Kernel Descriptor for GFX6-GFX9
				1503	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1504
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1505	======= ======= =============================== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1506	Bits Size Field Name Description
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1507	======= ======= =============================== ============================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1508	31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1509	address space memory
				1510	required for a work-group
				1511	in bytes. This does not
				1512	include any dynamically
				1513	allocated local address
				1514	space memory that may be
				1515	added when the kernel is
				1516	dispatched.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1517	63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1518	private address space
				1519	memory required for a
				1520	work-item in bytes. If
				1521	is_dynamic_callstack is 1
				1522	then additional space must
				1523	be added to this value for
				1524	the call stack.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1525	127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1526	191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1527	negative) from base
				1528	address of kernel
				1529	descriptor to kernel's
				1530	entry point instruction
				1531	which must be 256 byte
				1532	aligned.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1533	223:192 4 bytes MaxFlatWorkGroupSize Maximum flat work-group
				1534	size supported by the
				1535	kernel in work-items. If
				1536	an exact work-group size
				1537	is required then must be
				1538	omitted or 0 and
				1539	ReqdWorkGroupSize* must
				1540	be set to non-0.
				1541	239:224 2 bytes ReqdWorkGroupSizeX If present and non-0 then
				1542	the kernel
				1543	must be executed with the
				1544	specified work-group size
				1545	for X.
				1546	255:240 2 bytes ReqdWorkGroupSizeY If present and non-0 then
				1547	the kernel
				1548	must be executed with the
				1549	specified work-group size
				1550	for Y.
				1551	271:256 2 bytes ReqdWorkGroupSizeZ If present and non-0 then
				1552	the kernel
				1553	must be executed with the
				1554	specified work-group size
				1555	for Z.
Mark Searles	095d4ea	2017-12-07 21:24:27 +0000	[diff] [blame]	1556	383:272 14 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1557	bytes
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1558	415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1559	program settings used by
				1560	CP to set up
				1561	``COMPUTE_PGM_RSRC1``
				1562	configuration
				1563	register. See
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1564	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1565	447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1566	program settings used by
				1567	CP to set up
				1568	``COMPUTE_PGM_RSRC2``
				1569	configuration
				1570	register. See
				1571	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1572	448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
				1573	SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1574	(see
				1575	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1576
				1577	The total number of SGPR
				1578	user data registers
				1579	requested must not exceed
				1580	16 and match value in
				1581	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1582	Any requests beyond 16
				1583	will be ignored.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1584	449 1 bit EnableSGPRDispatchPtr see above
				1585	450 1 bit EnableSGPRQueuePtr see above
				1586	451 1 bit EnableSGPRKernargSegmentPtr see above
				1587	452 1 bit EnableSGPRDispatchID see above
				1588	453 1 bit EnableSGPRFlatScratchInit see above
				1589	454 1 bit EnableSGPRPrivateSegmentSize see above
				1590	455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
				1591	should always be 0.
				1592	456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
				1593	should always be 0.
				1594	457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
				1595	should always be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	1596	463:458 6 bits Reserved, must be 0.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1597	511:464 6 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1598	bytes
				1599	512 Total size 64 bytes.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1600	======= ====================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1601
				1602	..
				1603
				1604	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1605	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1606
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1607	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1608	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1609	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1610	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1611	used by each work-item,
				1612	granularity is device
				1613	specific:
				1614
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1615	GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1616	- max_vgpr 1..256
				1617	- roundup((max_vgpg + 1)
				1618	/ 4) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1619
				1620	Used by CP to set up
				1621	``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1622	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1623	used by a wavefront,
				1624	granularity is device
				1625	specific:
				1626
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1627	GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1628	- max_sgpr 1..112
				1629	- roundup((max_sgpg + 1)
				1630	/ 8) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1631	GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1632	- max_sgpr 1..112
				1633	- roundup((max_sgpg + 1)
				1634	/ 16) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1635
				1636	Includes the special SGPRs
				1637	for VCC, Flat Scratch (for
				1638	GFX7 onwards) and XNACK
				1639	(for GFX8 onwards). It does
				1640	not include the 16 SGPR
				1641	added if a trap handler is
				1642	enabled.
				1643
				1644	Used by CP to set up
				1645	``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1646	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1647
				1648	Start executing wavefront
				1649	at the specified priority.
				1650
				1651	CP is responsible for
				1652	filling in
				1653	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1654	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1655	with specified rounding
				1656	mode for single (32
				1657	bit) floating point
				1658	precision floating point
				1659	operations.
				1660
				1661	Floating point rounding
				1662	mode values are defined in
				1663	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1664
				1665	Used by CP to set up
				1666	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1667	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1668	with specified rounding
				1669	denorm mode for half/double (16
				1670	and 64 bit) floating point
				1671	precision floating point
				1672	operations.
				1673
				1674	Floating point rounding
				1675	mode values are defined in
				1676	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1677
				1678	Used by CP to set up
				1679	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1680	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1681	with specified denorm mode
				1682	for single (32
				1683	bit) floating point
				1684	precision floating point
				1685	operations.
				1686
				1687	Floating point denorm mode
				1688	values are defined in
				1689	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1690
				1691	Used by CP to set up
				1692	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1693	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1694	with specified denorm mode
				1695	for half/double (16
				1696	and 64 bit) floating point
				1697	precision floating point
				1698	operations.
				1699
				1700	Floating point denorm mode
				1701	values are defined in
				1702	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1703
				1704	Used by CP to set up
				1705	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1706	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1707
				1708	Start executing wavefront
				1709	in privilege trap handler
				1710	mode.
				1711
				1712	CP is responsible for
				1713	filling in
				1714	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1715	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1716	with DX10 clamp mode
				1717	enabled. Used by the vector
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1718	ALU to force DX10 style
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1719	treatment of NaN's (when
				1720	set, clamp NaN to zero,
				1721	otherwise pass NaN
				1722	through).
				1723
				1724	Used by CP to set up
				1725	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1726	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1727
				1728	Start executing wavefront
				1729	in single step mode.
				1730
				1731	CP is responsible for
				1732	filling in
				1733	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1734	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1735	with IEEE mode
				1736	enabled. Floating point
				1737	opcodes that support
				1738	exception flag gathering
				1739	will quiet and propagate
				1740	signaling-NaN inputs per
				1741	IEEE 754-2008. Min_dx10 and
				1742	max_dx10 become IEEE
				1743	754-2008 compliant due to
				1744	signaling-NaN propagation
				1745	and quieting.
				1746
				1747	Used by CP to set up
				1748	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1749	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1750
				1751	Only one work-group allowed
				1752	to execute on a compute
				1753	unit.
				1754
				1755	CP is responsible for
				1756	filling in
				1757	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1758	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1759
				1760	Flag that can be used to
				1761	control debugging code.
				1762
				1763	CP is responsible for
				1764	filling in
				1765	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1766	26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1767	Reserved, must be 0.
				1768	GFX9
				1769	Wavefront starts execution
				1770	with specified fp16 overflow
				1771	mode.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1772
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1773	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1774	+/-INF values.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1775	- If 1, fp16 overflow that is the
				1776	result of an +/-INF input value
				1777	or divide by 0 produces a +/-INF,
				1778	otherwise clamps computed
				1779	overflow to +/-MAX_FP16 as
				1780	appropriate.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1781
				1782	Used by CP to set up
				1783	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1784	31:27 5 bits Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1785	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1786	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1787
				1788	..
				1789
				1790	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1791	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1792
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1793	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1794	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1795	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1796	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
				1797	_WAVE_OFFSET SGPR wave scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1798	system register (see
				1799	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1800
				1801	Used by CP to set up
				1802	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1803	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1804	user data registers
				1805	requested. This number must
				1806	match the number of user
				1807	data registers enabled.
				1808
				1809	Used by CP to set up
				1810	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1811	6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1812	TRAP instruction which
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1813	requires a trap handler to
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1814	be enabled.
				1815
				1816	CP sets
				1817	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
				1818	if the runtime has
				1819	installed a trap handler
				1820	regardless of the setting
				1821	of this field.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1822	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1823	system SGPR register for
				1824	the work-group id in the X
				1825	dimension (see
				1826	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1827
				1828	Used by CP to set up
				1829	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1830	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1831	system SGPR register for
				1832	the work-group id in the Y
				1833	dimension (see
				1834	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1835
				1836	Used by CP to set up
				1837	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1838	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1839	system SGPR register for
				1840	the work-group id in the Z
				1841	dimension (see
				1842	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1843
				1844	Used by CP to set up
				1845	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1846	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1847	system SGPR register for
				1848	work-group information (see
				1849	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1850
				1851	Used by CP to set up
				1852	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1853	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1854	VGPR system registers used
				1855	for the work-item ID.
				1856	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1857	defines the values.
				1858
				1859	Used by CP to set up
				1860	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1861	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1862
				1863	Wavefront starts execution
				1864	with address watch
				1865	exceptions enabled which
				1866	are generated when L1 has
				1867	witnessed a thread access
				1868	an *address of
				1869	interest*.
				1870
				1871	CP is responsible for
				1872	filling in the address
				1873	watch bit in
				1874	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1875	according to what the
				1876	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1877	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1878
				1879	Wavefront starts execution
				1880	with memory violation
				1881	exceptions exceptions
				1882	enabled which are generated
				1883	when a memory violation has
				1884	occurred for this wave from
				1885	L1 or LDS
				1886	(write-to-read-only-memory,
				1887	mis-aligned atomic, LDS
				1888	address out of range,
				1889	illegal address, etc.).
				1890
				1891	CP sets the memory
				1892	violation bit in
				1893	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1894	according to what the
				1895	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1896	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1897
				1898	CP uses the rounded value
				1899	from the dispatch packet,
				1900	not this value, as the
				1901	dispatch may contain
				1902	dynamically allocated group
				1903	segment memory. CP writes
				1904	directly to
				1905	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1906
				1907	Amount of group segment
				1908	(LDS) to allocate for each
				1909	work-group. Granularity is
				1910	device specific:
				1911
				1912	GFX6:
				1913	roundup(lds-size / (64 * 4))
				1914	GFX7-GFX9:
				1915	roundup(lds-size / (128 * 4))
				1916
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1917	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				1918	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1919	enabled.
				1920
				1921	Used by CP to set up
				1922	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1923	(set from bits 0..6).
				1924
				1925	IEEE 754 FP Invalid
				1926	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1927	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				1928	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1929	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1930	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				1931	_DIVISION_BY_ZERO Zero
				1932	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
				1933	_OVERFLOW
				1934	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				1935	_UNDERFLOW
				1936	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				1937	_INEXACT
				1938	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				1939	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1940	only)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1941	31 1 bit Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1942	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1943	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1944
				1945	..
				1946
				1947	.. table:: Floating Point Rounding Mode Enumeration Values
				1948	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1949
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1950	====================================== ===== ==============================
				1951	Enumeration Name Value Description
				1952	====================================== ===== ==============================
				1953	AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				1954	AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				1955	AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				1956	AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
				1957	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1958
				1959	..
				1960
				1961	.. table:: Floating Point Denorm Mode Enumeration Values
				1962	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				1963
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1964	====================================== ===== ==============================
				1965	Enumeration Name Value Description
				1966	====================================== ===== ==============================
				1967	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
				1968	Denorms
				1969	AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				1970	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				1971	AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
				1972	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1973
				1974	..
				1975
				1976	.. table:: System VGPR Work-Item ID Enumeration Values
				1977	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				1978
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1979	======================================== ===== ============================
				1980	Enumeration Name Value Description
				1981	======================================== ===== ============================
				1982	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
				1983	ID.
				1984	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				1985	dimensions ID.
				1986	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				1987	dimensions ID.
				1988	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				1989	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1990
				1991	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				1992
				1993	Initial Kernel Execution State
				1994	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1995
				1996	This section defines the register state that will be set up by the packet
				1997	processor prior to the start of execution of every wavefront. This is limited by
				1998	the constraints of the hardware controllers of CP/ADC/SPI.
				1999
				2000	The order of the SGPR registers is defined, but the compiler can specify which
				2001	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				2002	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2003	for enabled registers are dense starting at SGPR0: the first enabled register is
				2004	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				2005	an SGPR number.
				2006
				2007	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
				2008	all waves of the grid. It is possible to specify more than 16 User SGPRs using
				2009	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				2010	initialized. These are then immediately followed by the System SGPRs that are
				2011	set up by ADC/SPI and can have different values for each wave of the grid
				2012	dispatch.
				2013
				2014	SGPR register initial state is defined in
				2015	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				2016
				2017	.. table:: SGPR Register Set Up Order
				2018	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				2019
				2020	========== ========================== ====== ==============================
				2021	SGPR Order Name Number Description
				2022	(kernel descriptor enable of
				2023	field) SGPRs
				2024	========== ========================== ====== ==============================
				2025	First Private Segment Buffer 4 V# that can be used, together
				2026	(enable_sgpr_private with Scratch Wave Offset as an
				2027	_segment_buffer) offset, to access the private
				2028	memory space using a segment
				2029	address.
				2030
				2031	CP uses the value provided by
				2032	the runtime.
				2033	then Dispatch Ptr 2 64 bit address of AQL dispatch
				2034	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				2035	actually executing.
				2036	then Queue Ptr 2 64 bit address of amd_queue_t
				2037	(enable_sgpr_queue_ptr) object for AQL queue on which
				2038	the dispatch packet was
				2039	queued.
				2040	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				2041	(enable_sgpr_kernarg segment. This is directly
				2042	_segment_ptr) copied from the
				2043	kernarg_address in the kernel
				2044	dispatch packet.
				2045
				2046	Having CP load it once avoids
				2047	loading it at the beginning of
				2048	every wavefront.
				2049	then Dispatch Id 2 64 bit Dispatch ID of the
				2050	(enable_sgpr_dispatch_id) dispatch packet being
				2051	executed.
				2052	then Flat Scratch Init 2 This is 2 SGPRs:
				2053	(enable_sgpr_flat_scratch
				2054	_init) GFX6
				2055	Not supported.
				2056	GFX7-GFX8
				2057	The first SGPR is a 32 bit
				2058	byte offset from
				2059	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2060	to per SPI base of memory
				2061	for scratch for the queue
				2062	executing the kernel
				2063	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2064	from the runtime. (The
				2065	Scratch Segment Buffer base
				2066	address is
				2067	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2068	plus this offset.) The value
				2069	of Scratch Wave Offset must
				2070	be added to this offset by
				2071	the kernel machine code,
				2072	right shifted by 8, and
				2073	moved to the FLAT_SCRATCH_HI
				2074	SGPR register.
				2075	FLAT_SCRATCH_HI corresponds
				2076	to SGPRn-4 on GFX7, and
				2077	SGPRn-6 on GFX8 (where SGPRn
				2078	is the highest numbered SGPR
				2079	allocated to the wave).
				2080	FLAT_SCRATCH_HI is
				2081	multiplied by 256 (as it is
				2082	in units of 256 bytes) and
				2083	added to
				2084	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2085	to calculate the per wave
				2086	FLAT SCRATCH BASE in flat
				2087	memory instructions that
				2088	access the scratch
				2089	apperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2090
				2091	The second SGPR is 32 bit
				2092	byte size of a single
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2093	work-item's scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2094	usage. CP obtains this from
				2095	the runtime, and it is
				2096	always a multiple of DWORD.
				2097	CP checks that the value in
				2098	the kernel dispatch packet
				2099	Private Segment Byte Size is
				2100	not larger, and requests the
				2101	runtime to increase the
				2102	queue's scratch size if
				2103	necessary. The kernel code
				2104	must move it to
				2105	FLAT_SCRATCH_LO which is
				2106	SGPRn-3 on GFX7 and SGPRn-5
				2107	on GFX8. FLAT_SCRATCH_LO is
				2108	used as the FLAT SCRATCH
				2109	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2110	instructions. Having CP load
				2111	it once avoids loading it at
				2112	the beginning of every
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2113	wavefront.
				2114	GFX9
				2115	This is the
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2116	64 bit base address of the
				2117	per SPI scratch backing
				2118	memory managed by SPI for
				2119	the queue executing the
				2120	kernel dispatch. CP obtains
				2121	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2122	divides it if there are
				2123	multiple Shader Arrays each
				2124	with its own SPI). The value
				2125	of Scratch Wave Offset must
				2126	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2127	machine code and the result
				2128	moved to the FLAT_SCRATCH
				2129	SGPR which is SGPRn-6 and
				2130	SGPRn-5. It is used as the
				2131	FLAT SCRATCH BASE in flat
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2132	memory instructions.
				2133	then Private Segment Size 1 The 32 bit byte size of a
				2134	(enable_sgpr_private single
				2135	work-item's
				2136	scratch_segment_size) memory
				2137	allocation. This is the
				2138	value from the kernel
				2139	dispatch packet Private
				2140	Segment Byte Size rounded up
				2141	by CP to a multiple of
				2142	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2143
				2144	Having CP load it once avoids
				2145	loading it at the beginning of
				2146	every wavefront.
				2147
				2148	This is not used for
				2149	GFX7-GFX8 since it is the same
				2150	value as the second SGPR of
				2151	Flat Scratch Init. However, it
				2152	may be needed for GFX9 which
				2153	changes the meaning of the
				2154	Flat Scratch Init value.
				2155	then Grid Work-Group Count X 1 32 bit count of the number of
				2156	(enable_sgpr_grid work-groups in the X dimension
				2157	_workgroup_count_X) for the grid being
				2158	executed. Computed from the
				2159	fields in the kernel dispatch
				2160	packet as ((grid_size.x +
				2161	workgroup_size.x - 1) /
				2162	workgroup_size.x).
				2163	then Grid Work-Group Count Y 1 32 bit count of the number of
				2164	(enable_sgpr_grid work-groups in the Y dimension
				2165	_workgroup_count_Y && for the grid being
				2166	less than 16 previous executed. Computed from the
				2167	SGPRs) fields in the kernel dispatch
				2168	packet as ((grid_size.y +
				2169	workgroup_size.y - 1) /
				2170	workgroupSize.y).
				2171
				2172	Only initialized if <16
				2173	previous SGPRs initialized.
				2174	then Grid Work-Group Count Z 1 32 bit count of the number of
				2175	(enable_sgpr_grid work-groups in the Z dimension
				2176	_workgroup_count_Z && for the grid being
				2177	less than 16 previous executed. Computed from the
				2178	SGPRs) fields in the kernel dispatch
				2179	packet as ((grid_size.z +
				2180	workgroup_size.z - 1) /
				2181	workgroupSize.z).
				2182
				2183	Only initialized if <16
				2184	previous SGPRs initialized.
				2185	then Work-Group Id X 1 32 bit work-group id in X
				2186	(enable_sgpr_workgroup_id dimension of grid for
				2187	_X) wavefront.
				2188	then Work-Group Id Y 1 32 bit work-group id in Y
				2189	(enable_sgpr_workgroup_id dimension of grid for
				2190	_Y) wavefront.
				2191	then Work-Group Id Z 1 32 bit work-group id in Z
				2192	(enable_sgpr_workgroup_id dimension of grid for
				2193	_Z) wavefront.
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2194	then Work-Group Info 1 {first_wave, 14'b0000,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2195	(enable_sgpr_workgroup ordered_append_term[10:0],
				2196	_info) threadgroup_size_in_waves[5:0]}
				2197	then Scratch Wave Offset 1 32 bit byte offset from base
				2198	(enable_sgpr_private of scratch base of queue
				2199	_segment_wave_offset) executing the kernel
				2200	dispatch. Must be used as an
				2201	offset with Private
				2202	segment address when using
				2203	Scratch Segment Buffer. It
				2204	must be used to set up FLAT
				2205	SCRATCH for flat addressing
				2206	(see
				2207	:ref:`amdgpu-amdhsa-flat-scratch`).
				2208	========== ========================== ====== ==============================
				2209
				2210	The order of the VGPR registers is defined, but the compiler can specify which
				2211	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2212	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2213	for enabled registers are dense starting at VGPR0: the first enabled register is
				2214	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2215	VGPR number.
				2216
				2217	VGPR register initial state is defined in
				2218	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2219
				2220	.. table:: VGPR Register Set Up Order
				2221	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2222
				2223	========== ========================== ====== ==============================
				2224	VGPR Order Name Number Description
				2225	(kernel descriptor enable of
				2226	field) VGPRs
				2227	========== ========================== ====== ==============================
				2228	First Work-Item Id X 1 32 bit work item id in X
				2229	(Always initialized) dimension of work-group for
				2230	wavefront lane.
				2231	then Work-Item Id Y 1 32 bit work item id in Y
				2232	(enable_vgpr_workitem_id dimension of work-group for
				2233	> 0) wavefront lane.
				2234	then Work-Item Id Z 1 32 bit work item id in Z
				2235	(enable_vgpr_workitem_id dimension of work-group for
				2236	> 1) wavefront lane.
				2237	========== ========================== ====== ==============================
				2238
				2239	The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
				2240
				2241	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2242	registers.
				2243	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2244	combination including none.
				2245	3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
				2246	cannot included with the flat scratch init value which is per queue.
				2247	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2248	or (X, Y, Z).
				2249
				2250	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2251	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2252
				2253	The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2254	has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2255	instructions (GFX9).
				2256
				2257	If buffer operations are used then the compiler can generate a V# with the
				2258	following properties:
				2259
				2260	* base address of 0
				2261	* no swizzle
				2262	* ATC: 1 if IOMMU present (such as APU)
				2263	* ptr64: 1
				2264	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2265	APU and NC for dGPU).
				2266
				2267	.. _amdgpu-amdhsa-kernel-prolog:
				2268
				2269	Kernel Prolog
				2270	~~~~~~~~~~~~~
				2271
				2272	.. _amdgpu-amdhsa-m0:
				2273
				2274	M0
				2275	++
				2276
				2277	GFX6-GFX8
				2278	The M0 register must be initialized with a value at least the total LDS size
				2279	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2280	available in dispatch packet. For M0, it is also possible to use maximum
				2281	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2282	GFX7-GFX8).
				2283	GFX9
				2284	The M0 register is not used for range checking LDS accesses and so does not
				2285	need to be initialized in the prolog.
				2286
				2287	.. _amdgpu-amdhsa-flat-scratch:
				2288
				2289	Flat Scratch
				2290	++++++++++++
				2291
				2292	If the kernel may use flat operations to access scratch memory, the prolog code
				2293	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
				2294	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
				2295	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2296
				2297	GFX6
				2298	Flat scratch is not supported.
				2299
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2300	GFX7-GFX8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2301	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2302	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2303	being managed by SPI for the queue executing the kernel dispatch. This is
				2304	the same value used in the Scratch Segment Buffer V# base address. The
				2305	prolog must add the value of Scratch Wave Offset to get the wave's byte
				2306	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2307	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2308	by 8 before moving into FLAT_SCRATCH_LO.
				2309	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2310	work-items scratch memory usage. This is directly loaded from the kernel
				2311	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2312	DWORD. Having CP load it once avoids loading it at the beginning of every
				2313	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2314	SIZE.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2315
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2316	GFX9
				2317	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2318	memory being managed by SPI for the queue executing the kernel dispatch. The
				2319	prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
				2320	pair for use as the flat scratch base in flat memory instructions.
				2321
				2322	.. _amdgpu-amdhsa-memory-model:
				2323
				2324	Memory Model
				2325	~~~~~~~~~~~~
				2326
				2327	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2328	(see :ref:`memmodel`). The implementation is WIP.
				2329
				2330	.. TODO
				2331	Update when implementation complete.
				2332
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2333	The AMDGPU backend supports the memory synchronization scopes specified in
				2334	:ref:`amdgpu-memory-scopes`.
				2335
				2336	The code sequences used to implement the memory model are defined in table
				2337	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2338
				2339	The sequences specify the order of instructions that a single thread must
				2340	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2341	to other memory instructions executed by the same thread. This allows them to be
				2342	moved earlier or later which can allow them to be combined with other instances
				2343	of the same instruction, or hoisted/sunk out of loops to improve
				2344	performance. Only the instructions related to the memory model are given;
				2345	additional ``s_waitcnt`` instructions are required to ensure registers are
				2346	defined before being used. These may be able to be combined with the memory
				2347	model ``s_waitcnt`` instructions as described above.
				2348
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2349	The AMDGPU backend supports the following memory models:
				2350
				2351	HSA Memory Model [HSA]_
				2352	The HSA memory model uses a single happens-before relation for all address
				2353	spaces (see :ref:`amdgpu-address-spaces`).
				2354	OpenCL Memory Model [OpenCL]_
				2355	The OpenCL memory model which has separate happens-before relations for the
				2356	global and local address spaces. Only a fence specifying both global and
				2357	local address space, and seq_cst instructions join the relationships. Since
				2358	the LLVM ``memfence`` instruction does not allow an address space to be
				2359	specified the OpenCL fence has to convervatively assume both local and
				2360	global address space was specified. However, optimizations can often be
				2361	done to eliminate the additional ``s_waitcnt`` instructions when there are
				2362	no intervening memory instructions which access the corresponding address
				2363	space. The code sequences in the table indicate what can be omitted for the
				2364	OpenCL memory. The target triple environment is used to determine if the
				2365	source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2366
				2367	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2368	operations.
				2369
				2370	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2371	termed vector memory operations.
				2372
				2373	For GFX6-GFX9:
				2374
				2375	* Each agent has multiple compute units (CU).
				2376	* Each CU has multiple SIMDs that execute wavefronts.
				2377	* The wavefronts for a single work-group are executed in the same CU but may be
				2378	executed by different SIMDs.
				2379	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2380	executing on it.
				2381	* All LDS operations of a CU are performed as wavefront wide operations in a
				2382	global order and involve no caching. Completion is reported to a wavefront in
				2383	execution order.
				2384	* The LDS memory has multiple request queues shared by the SIMDs of a
				2385	CU. Therefore, the LDS operations performed by different waves of a work-group
				2386	can be reordered relative to each other, which can result in reordering the
				2387	visibility of vector memory operations with respect to LDS operations of other
				2388	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2389	ensure synchronization between LDS operations and vector memory operations
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2390	between waves of a work-group, but not between operations performed by the
				2391	same wavefront.
				2392	* The vector memory operations are performed as wavefront wide operations and
				2393	completion is reported to a wavefront in execution order. The exception is
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2394	that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2395	vector memory order if they access LDS memory, and out of LDS operation order
				2396	if they access global memory.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2397	* The vector memory operations access a single vector L1 cache shared by all
				2398	SIMDs a CU. Therefore, no special action is required for coherence between the
				2399	lanes of a single wavefront, or for coherence between wavefronts in the same
				2400	work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
				2401	executing in different work-groups as they may be executing on different CUs.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2402	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2403	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2404	scalar operations are used in a restricted way so do not impact the memory
				2405	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2406	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2407	the same agent.
				2408	* The L2 cache has independent channels to service disjoint ranges of virtual
				2409	addresses.
				2410	* Each CU has a separate request queue per channel. Therefore, the vector and
				2411	scalar memory operations performed by waves executing in different work-groups
				2412	(which may be executing on different CUs) of an agent can be reordered
				2413	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2414	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2415	previous vector memory operation has completed before executing a subsequent
				2416	vector memory or LDS operation and so can be used to meet the requirements of
				2417	acquire and release.
				2418	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2419	of virtual addresses can be set up to bypass it to ensure system coherence.
				2420
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2421	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2422	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2423	memory, atomic memory orderings are not meaningful and all accesses are treated
				2424	as non-atomic.
				2425
				2426	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2427	scalar memory instructions). Since the constant address space contents do not
				2428	change during the execution of a kernel dispatch it is not legal to perform
				2429	stores, and atomic memory orderings are not meaningful and all access are
				2430	treated as non-atomic.
				2431
				2432	A memory synchronization scope wider than work-group is not meaningful for the
				2433	group (LDS) address space and is treated as work-group.
				2434
				2435	The memory model does not support the region address space which is treated as
				2436	non-atomic.
				2437
				2438	Acquire memory ordering is not meaningful on store atomic instructions and is
				2439	treated as non-atomic.
				2440
				2441	Release memory ordering is not meaningful on load atomic instructions and is
				2442	treated a non-atomic.
				2443
				2444	Acquire-release memory ordering is not meaningful on load or store atomic
				2445	instructions and is treated as acquire and release respectively.
				2446
				2447	AMDGPU backend only uses scalar memory operations to access memory that is
				2448	proven to not change during the execution of the kernel dispatch. This includes
				2449	constant address space and global address space for program scope const
				2450	variables. Therefore the kernel machine code does not have to maintain the
				2451	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2452	and vector L1 caches are invalidated between kernel dispatches by CP since
				2453	constant address space data may change between kernel dispatch executions. See
				2454	:ref:`amdgpu-amdhsa-memory-spaces`.
				2455
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2456	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2457	case the AMDGPU backend ensures the memory location used to spill is never
				2458	accessed by vector memory operations at the same time. If scalar writes are used
				2459	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2460	return since the locations may be used for vector memory instructions by a
				2461	future wave that uses the same scratch area, or a function call that creates a
				2462	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2463	as all scalar writes are write-before-read in the same thread.
				2464
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2465	Scratch backing memory (which is used for the private address space)
				2466	is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
				2467	address space is only accessed by a single thread, and is always
				2468	write-before-read, there is never a need to invalidate these entries from the L1
				2469	cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
				2470	volatile cache lines.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2471
				2472	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2473	to invalidate the L2 cache. This also causes it to be treated as
				2474	non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
				2475	(cache coherent) and so the L2 cache will coherent with the CPU and other
				2476	agents.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2477
				2478	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2479	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2480
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2481	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2482	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2483	Ordering Sync Scope Address
				2484	Space
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2485	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2486	Non-Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2487	-----------------------------------------------------------------------------------
				2488	load none none - global - !volatile & !nontemporal
				2489	- generic
				2490	- private 1. buffer/global/flat_load
				2491	- constant
				2492	- volatile & !nontemporal
				2493
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2494	1. buffer/global/flat_load
				2495	glc=1
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2496
				2497	- nontemporal
				2498
				2499	1. buffer/global/flat_load
				2500	glc=1 slc=1
				2501
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2502	load none none - local 1. ds_load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2503	store none none - global - !nontemporal
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2504	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2505	- private 1. buffer/global/flat_store
				2506	- constant
				2507	- nontemporal
				2508
				2509	1. buffer/global/flat_stote
				2510	glc=1 slc=1
				2511
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2512	store none none - local 1. ds_store
				2513	Unordered Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2514	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2515	load atomic unordered any any Same as non-atomic.
				2516	store atomic unordered any any Same as non-atomic.
				2517	atomicrmw unordered any any *Same as monotonic
				2518	atomic*.
				2519	Monotonic Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2520	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2521	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2522	- wavefront - generic
				2523	- workgroup
				2524	load atomic monotonic - singlethread - local 1. ds_load
				2525	- wavefront
				2526	- workgroup
				2527	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2528	- system - generic glc=1
				2529	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2530	- wavefront - generic
				2531	- workgroup
				2532	- agent
				2533	- system
				2534	store atomic monotonic - singlethread - local 1. ds_store
				2535	- wavefront
				2536	- workgroup
				2537	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2538	- wavefront - generic
				2539	- workgroup
				2540	- agent
				2541	- system
				2542	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2543	- wavefront
				2544	- workgroup
				2545	Acquire Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2546	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2547	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2548	- wavefront - local
				2549	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2550	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				2551	load atomic acquire - workgroup - local 1. ds_load
				2552	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2553
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2554	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2555	- Must happen before
				2556	any following
				2557	global/generic
				2558	load/load
				2559	atomic/store/store
				2560	atomic/atomicrmw.
				2561	- Ensures any
				2562	following global
				2563	data read is no
				2564	older than the load
				2565	atomic value being
				2566	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2567	load atomic acquire - workgroup - generic 1. flat_load
				2568	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2569
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2570	- If OpenCL, omit.
				2571	- Must happen before
				2572	any following
				2573	global/generic
				2574	load/load
				2575	atomic/store/store
				2576	atomic/atomicrmw.
				2577	- Ensures any
				2578	following global
				2579	data read is no
				2580	older than the load
				2581	atomic value being
				2582	acquired.
				2583	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2584	- system glc=1
				2585	2. s_waitcnt vmcnt(0)
				2586
				2587	- Must happen before
				2588	following
				2589	buffer_wbinvl1_vol.
				2590	- Ensures the load
				2591	has completed
				2592	before invalidating
				2593	the cache.
				2594
				2595	3. buffer_wbinvl1_vol
				2596
				2597	- Must happen before
				2598	any following
				2599	global/generic
				2600	load/load
				2601	atomic/atomicrmw.
				2602	- Ensures that
				2603	following
				2604	loads will not see
				2605	stale global data.
				2606
				2607	load atomic acquire - agent - generic 1. flat_load glc=1
				2608	- system 2. s_waitcnt vmcnt(0) &
				2609	lgkmcnt(0)
				2610
				2611	- If OpenCL omit
				2612	lgkmcnt(0).
				2613	- Must happen before
				2614	following
				2615	buffer_wbinvl1_vol.
				2616	- Ensures the flat_load
				2617	has completed
				2618	before invalidating
				2619	the cache.
				2620
				2621	3. buffer_wbinvl1_vol
				2622
				2623	- Must happen before
				2624	any following
				2625	global/generic
				2626	load/load
				2627	atomic/atomicrmw.
				2628	- Ensures that
				2629	following loads
				2630	will not see stale
				2631	global data.
				2632
				2633	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2634	- wavefront - local
				2635	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2636	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				2637	atomicrmw acquire - workgroup - local 1. ds_atomic
				2638	2. waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2639
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2640	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2641	- Must happen before
				2642	any following
				2643	global/generic
				2644	load/load
				2645	atomic/store/store
				2646	atomic/atomicrmw.
				2647	- Ensures any
				2648	following global
				2649	data read is no
				2650	older than the
				2651	atomicrmw value
				2652	being acquired.
				2653
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2654	atomicrmw acquire - workgroup - generic 1. flat_atomic
				2655	2. waitcnt lgkmcnt(0)
				2656
				2657	- If OpenCL, omit.
				2658	- Must happen before
				2659	any following
				2660	global/generic
				2661	load/load
				2662	atomic/store/store
				2663	atomic/atomicrmw.
				2664	- Ensures any
				2665	following global
				2666	data read is no
				2667	older than the
				2668	atomicrmw value
				2669	being acquired.
				2670
				2671	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2672	- system 2. s_waitcnt vmcnt(0)
				2673
				2674	- Must happen before
				2675	following
				2676	buffer_wbinvl1_vol.
				2677	- Ensures the
				2678	atomicrmw has
				2679	completed before
				2680	invalidating the
				2681	cache.
				2682
				2683	3. buffer_wbinvl1_vol
				2684
				2685	- Must happen before
				2686	any following
				2687	global/generic
				2688	load/load
				2689	atomic/atomicrmw.
				2690	- Ensures that
				2691	following loads
				2692	will not see stale
				2693	global data.
				2694
				2695	atomicrmw acquire - agent - generic 1. flat_atomic
				2696	- system 2. s_waitcnt vmcnt(0) &
				2697	lgkmcnt(0)
				2698
				2699	- If OpenCL, omit
				2700	lgkmcnt(0).
				2701	- Must happen before
				2702	following
				2703	buffer_wbinvl1_vol.
				2704	- Ensures the
				2705	atomicrmw has
				2706	completed before
				2707	invalidating the
				2708	cache.
				2709
				2710	3. buffer_wbinvl1_vol
				2711
				2712	- Must happen before
				2713	any following
				2714	global/generic
				2715	load/load
				2716	atomic/atomicrmw.
				2717	- Ensures that
				2718	following loads
				2719	will not see stale
				2720	global data.
				2721
				2722	fence acquire - singlethread none none
				2723	- wavefront
				2724	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2725
				2726	- If OpenCL and
				2727	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2728	not generic, omit.
				2729	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2730	currently has no
				2731	address space on
				2732	the fence need to
				2733	conservatively
				2734	always generate. If
				2735	fence had an
				2736	address space then
				2737	set to address
				2738	space of OpenCL
				2739	fence flag, or to
				2740	generic if both
				2741	local and global
				2742	flags are
				2743	specified.
				2744	- Must happen after
				2745	any preceding
				2746	local/generic load
				2747	atomic/atomicrmw
				2748	with an equal or
				2749	wider sync scope
				2750	and memory ordering
				2751	stronger than
				2752	unordered (this is
				2753	termed the
				2754	fence-paired-atomic).
				2755	- Must happen before
				2756	any following
				2757	global/generic
				2758	load/load
				2759	atomic/store/store
				2760	atomic/atomicrmw.
				2761	- Ensures any
				2762	following global
				2763	data read is no
				2764	older than the
				2765	value read by the
				2766	fence-paired-atomic.
				2767
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2768	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				2769	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2770
				2771	- If OpenCL and
				2772	address space is
				2773	not generic, omit
				2774	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2775	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2776	currently has no
				2777	address space on
				2778	the fence need to
				2779	conservatively
				2780	always generate
				2781	(see comment for
				2782	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2783	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2784	separate s_waitcnt
				2785	vmcnt(0) and
				2786	s_waitcnt
				2787	lgkmcnt(0) to allow
				2788	them to be
				2789	independently moved
				2790	according to the
				2791	following rules.
				2792	- s_waitcnt vmcnt(0)
				2793	must happen after
				2794	any preceding
				2795	global/generic load
				2796	atomic/atomicrmw
				2797	with an equal or
				2798	wider sync scope
				2799	and memory ordering
				2800	stronger than
				2801	unordered (this is
				2802	termed the
				2803	fence-paired-atomic).
				2804	- s_waitcnt lgkmcnt(0)
				2805	must happen after
				2806	any preceding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2807	local/generic load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2808	atomic/atomicrmw
				2809	with an equal or
				2810	wider sync scope
				2811	and memory ordering
				2812	stronger than
				2813	unordered (this is
				2814	termed the
				2815	fence-paired-atomic).
				2816	- Must happen before
				2817	the following
				2818	buffer_wbinvl1_vol.
				2819	- Ensures that the
				2820	fence-paired atomic
				2821	has completed
				2822	before invalidating
				2823	the
				2824	cache. Therefore
				2825	any following
				2826	locations read must
				2827	be no older than
				2828	the value read by
				2829	the
				2830	fence-paired-atomic.
				2831
				2832	2. buffer_wbinvl1_vol
				2833
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2834	- Must happen before any
				2835	following global/generic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2836	load/load
				2837	atomic/store/store
				2838	atomic/atomicrmw.
				2839	- Ensures that
				2840	following loads
				2841	will not see stale
				2842	global data.
				2843
				2844	Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2845	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2846	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2847	- wavefront - local
				2848	- generic
				2849	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2850
				2851	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2852	- Must happen after
				2853	any preceding
				2854	local/generic
				2855	load/store/load
				2856	atomic/store
				2857	atomic/atomicrmw.
				2858	- Must happen before
				2859	the following
				2860	store.
				2861	- Ensures that all
				2862	memory operations
				2863	to local have
				2864	completed before
				2865	performing the
				2866	store that is being
				2867	released.
				2868
				2869	2. buffer/global/flat_store
				2870	store atomic release - workgroup - local 1. ds_store
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2871	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2872
				2873	- If OpenCL, omit.
				2874	- Must happen after
				2875	any preceding
				2876	local/generic
				2877	load/store/load
				2878	atomic/store
				2879	atomic/atomicrmw.
				2880	- Must happen before
				2881	the following
				2882	store.
				2883	- Ensures that all
				2884	memory operations
				2885	to local have
				2886	completed before
				2887	performing the
				2888	store that is being
				2889	released.
				2890
				2891	2. flat_store
				2892	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2893	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2894
				2895	- If OpenCL, omit
				2896	lgkmcnt(0).
				2897	- Could be split into
				2898	separate s_waitcnt
				2899	vmcnt(0) and
				2900	s_waitcnt
				2901	lgkmcnt(0) to allow
				2902	them to be
				2903	independently moved
				2904	according to the
				2905	following rules.
				2906	- s_waitcnt vmcnt(0)
				2907	must happen after
				2908	any preceding
				2909	global/generic
				2910	load/store/load
				2911	atomic/store
				2912	atomic/atomicrmw.
				2913	- s_waitcnt lgkmcnt(0)
				2914	must happen after
				2915	any preceding
				2916	local/generic
				2917	load/store/load
				2918	atomic/store
				2919	atomic/atomicrmw.
				2920	- Must happen before
				2921	the following
				2922	store.
				2923	- Ensures that all
				2924	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2925	to memory have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2926	completed before
				2927	performing the
				2928	store that is being
				2929	released.
				2930
				2931	2. buffer/global/ds/flat_store
				2932	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2933	- wavefront - local
				2934	- generic
				2935	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2936
				2937	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2938	- Must happen after
				2939	any preceding
				2940	local/generic
				2941	load/store/load
				2942	atomic/store
				2943	atomic/atomicrmw.
				2944	- Must happen before
				2945	the following
				2946	atomicrmw.
				2947	- Ensures that all
				2948	memory operations
				2949	to local have
				2950	completed before
				2951	performing the
				2952	atomicrmw that is
				2953	being released.
				2954
				2955	2. buffer/global/flat_atomic
				2956	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2957	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2958
				2959	- If OpenCL, omit.
				2960	- Must happen after
				2961	any preceding
				2962	local/generic
				2963	load/store/load
				2964	atomic/store
				2965	atomic/atomicrmw.
				2966	- Must happen before
				2967	the following
				2968	atomicrmw.
				2969	- Ensures that all
				2970	memory operations
				2971	to local have
				2972	completed before
				2973	performing the
				2974	atomicrmw that is
				2975	being released.
				2976
				2977	2. flat_atomic
				2978	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2979	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2980
				2981	- If OpenCL, omit
				2982	lgkmcnt(0).
				2983	- Could be split into
				2984	separate s_waitcnt
				2985	vmcnt(0) and
				2986	s_waitcnt
				2987	lgkmcnt(0) to allow
				2988	them to be
				2989	independently moved
				2990	according to the
				2991	following rules.
				2992	- s_waitcnt vmcnt(0)
				2993	must happen after
				2994	any preceding
				2995	global/generic
				2996	load/store/load
				2997	atomic/store
				2998	atomic/atomicrmw.
				2999	- s_waitcnt lgkmcnt(0)
				3000	must happen after
				3001	any preceding
				3002	local/generic
				3003	load/store/load
				3004	atomic/store
				3005	atomic/atomicrmw.
				3006	- Must happen before
				3007	the following
				3008	atomicrmw.
				3009	- Ensures that all
				3010	memory operations
				3011	to global and local
				3012	have completed
				3013	before performing
				3014	the atomicrmw that
				3015	is being released.
				3016
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3017	2. buffer/global/ds/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3018	fence release - singlethread none none
				3019	- wavefront
				3020	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				3021
				3022	- If OpenCL and
				3023	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3024	not generic, omit.
				3025	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3026	currently has no
				3027	address space on
				3028	the fence need to
				3029	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3030	always generate. If
				3031	fence had an
				3032	address space then
				3033	set to address
				3034	space of OpenCL
				3035	fence flag, or to
				3036	generic if both
				3037	local and global
				3038	flags are
				3039	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3040	- Must happen after
				3041	any preceding
				3042	local/generic
				3043	load/load
				3044	atomic/store/store
				3045	atomic/atomicrmw.
				3046	- Must happen before
				3047	any following store
				3048	atomic/atomicrmw
				3049	with an equal or
				3050	wider sync scope
				3051	and memory ordering
				3052	stronger than
				3053	unordered (this is
				3054	termed the
				3055	fence-paired-atomic).
				3056	- Ensures that all
				3057	memory operations
				3058	to local have
				3059	completed before
				3060	performing the
				3061	following
				3062	fence-paired-atomic.
				3063
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3064	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				3065	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3066
				3067	- If OpenCL and
				3068	address space is
				3069	not generic, omit
				3070	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3071	- If OpenCL and
				3072	address space is
				3073	local, omit
				3074	vmcnt(0).
				3075	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3076	currently has no
				3077	address space on
				3078	the fence need to
				3079	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3080	always generate. If
				3081	fence had an
				3082	address space then
				3083	set to address
				3084	space of OpenCL
				3085	fence flag, or to
				3086	generic if both
				3087	local and global
				3088	flags are
				3089	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3090	- Could be split into
				3091	separate s_waitcnt
				3092	vmcnt(0) and
				3093	s_waitcnt
				3094	lgkmcnt(0) to allow
				3095	them to be
				3096	independently moved
				3097	according to the
				3098	following rules.
				3099	- s_waitcnt vmcnt(0)
				3100	must happen after
				3101	any preceding
				3102	global/generic
				3103	load/store/load
				3104	atomic/store
				3105	atomic/atomicrmw.
				3106	- s_waitcnt lgkmcnt(0)
				3107	must happen after
				3108	any preceding
				3109	local/generic
				3110	load/store/load
				3111	atomic/store
				3112	atomic/atomicrmw.
				3113	- Must happen before
				3114	any following store
				3115	atomic/atomicrmw
				3116	with an equal or
				3117	wider sync scope
				3118	and memory ordering
				3119	stronger than
				3120	unordered (this is
				3121	termed the
				3122	fence-paired-atomic).
				3123	- Ensures that all
				3124	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3125	have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3126	completed before
				3127	performing the
				3128	following
				3129	fence-paired-atomic.
				3130
				3131	Acquire-Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3132	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3133	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3134	- wavefront - local
				3135	- generic
				3136	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3137
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3138	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3139	- Must happen after
				3140	any preceding
				3141	local/generic
				3142	load/store/load
				3143	atomic/store
				3144	atomic/atomicrmw.
				3145	- Must happen before
				3146	the following
				3147	atomicrmw.
				3148	- Ensures that all
				3149	memory operations
				3150	to local have
				3151	completed before
				3152	performing the
				3153	atomicrmw that is
				3154	being released.
				3155
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3156	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3157	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3158	2. s_waitcnt lgkmcnt(0)
				3159
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3160	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3161	- Must happen before
				3162	any following
				3163	global/generic
				3164	load/load
				3165	atomic/store/store
				3166	atomic/atomicrmw.
				3167	- Ensures any
				3168	following global
				3169	data read is no
				3170	older than the load
				3171	atomic value being
				3172	acquired.
				3173
				3174	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3175
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3176	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3177	- Must happen after
				3178	any preceding
				3179	local/generic
				3180	load/store/load
				3181	atomic/store
				3182	atomic/atomicrmw.
				3183	- Must happen before
				3184	the following
				3185	atomicrmw.
				3186	- Ensures that all
				3187	memory operations
				3188	to local have
				3189	completed before
				3190	performing the
				3191	atomicrmw that is
				3192	being released.
				3193
				3194	2. flat_atomic
				3195	3. s_waitcnt lgkmcnt(0)
				3196
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3197	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3198	- Must happen before
				3199	any following
				3200	global/generic
				3201	load/load
				3202	atomic/store/store
				3203	atomic/atomicrmw.
				3204	- Ensures any
				3205	following global
				3206	data read is no
				3207	older than the load
				3208	atomic value being
				3209	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3210
				3211	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3212	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3213
				3214	- If OpenCL, omit
				3215	lgkmcnt(0).
				3216	- Could be split into
				3217	separate s_waitcnt
				3218	vmcnt(0) and
				3219	s_waitcnt
				3220	lgkmcnt(0) to allow
				3221	them to be
				3222	independently moved
				3223	according to the
				3224	following rules.
				3225	- s_waitcnt vmcnt(0)
				3226	must happen after
				3227	any preceding
				3228	global/generic
				3229	load/store/load
				3230	atomic/store
				3231	atomic/atomicrmw.
				3232	- s_waitcnt lgkmcnt(0)
				3233	must happen after
				3234	any preceding
				3235	local/generic
				3236	load/store/load
				3237	atomic/store
				3238	atomic/atomicrmw.
				3239	- Must happen before
				3240	the following
				3241	atomicrmw.
				3242	- Ensures that all
				3243	memory operations
				3244	to global have
				3245	completed before
				3246	performing the
				3247	atomicrmw that is
				3248	being released.
				3249
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3250	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3251	3. s_waitcnt vmcnt(0)
				3252
				3253	- Must happen before
				3254	following
				3255	buffer_wbinvl1_vol.
				3256	- Ensures the
				3257	atomicrmw has
				3258	completed before
				3259	invalidating the
				3260	cache.
				3261
				3262	4. buffer_wbinvl1_vol
				3263
				3264	- Must happen before
				3265	any following
				3266	global/generic
				3267	load/load
				3268	atomic/atomicrmw.
				3269	- Ensures that
				3270	following loads
				3271	will not see stale
				3272	global data.
				3273
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3274	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3275	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3276
				3277	- If OpenCL, omit
				3278	lgkmcnt(0).
				3279	- Could be split into
				3280	separate s_waitcnt
				3281	vmcnt(0) and
				3282	s_waitcnt
				3283	lgkmcnt(0) to allow
				3284	them to be
				3285	independently moved
				3286	according to the
				3287	following rules.
				3288	- s_waitcnt vmcnt(0)
				3289	must happen after
				3290	any preceding
				3291	global/generic
				3292	load/store/load
				3293	atomic/store
				3294	atomic/atomicrmw.
				3295	- s_waitcnt lgkmcnt(0)
				3296	must happen after
				3297	any preceding
				3298	local/generic
				3299	load/store/load
				3300	atomic/store
				3301	atomic/atomicrmw.
				3302	- Must happen before
				3303	the following
				3304	atomicrmw.
				3305	- Ensures that all
				3306	memory operations
				3307	to global have
				3308	completed before
				3309	performing the
				3310	atomicrmw that is
				3311	being released.
				3312
				3313	2. flat_atomic
				3314	3. s_waitcnt vmcnt(0) &
				3315	lgkmcnt(0)
				3316
				3317	- If OpenCL, omit
				3318	lgkmcnt(0).
				3319	- Must happen before
				3320	following
				3321	buffer_wbinvl1_vol.
				3322	- Ensures the
				3323	atomicrmw has
				3324	completed before
				3325	invalidating the
				3326	cache.
				3327
				3328	4. buffer_wbinvl1_vol
				3329
				3330	- Must happen before
				3331	any following
				3332	global/generic
				3333	load/load
				3334	atomic/atomicrmw.
				3335	- Ensures that
				3336	following loads
				3337	will not see stale
				3338	global data.
				3339
				3340	fence acq_rel - singlethread none none
				3341	- wavefront
				3342	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3343
				3344	- If OpenCL and
				3345	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3346	not generic, omit.
				3347	- However,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3348	since LLVM
				3349	currently has no
				3350	address space on
				3351	the fence need to
				3352	conservatively
				3353	always generate
				3354	(see comment for
				3355	previous fence).
				3356	- Must happen after
				3357	any preceding
				3358	local/generic
				3359	load/load
				3360	atomic/store/store
				3361	atomic/atomicrmw.
				3362	- Must happen before
				3363	any following
				3364	global/generic
				3365	load/load
				3366	atomic/store/store
				3367	atomic/atomicrmw.
				3368	- Ensures that all
				3369	memory operations
				3370	to local have
				3371	completed before
				3372	performing any
				3373	following global
				3374	memory operations.
				3375	- Ensures that the
				3376	preceding
				3377	local/generic load
				3378	atomic/atomicrmw
				3379	with an equal or
				3380	wider sync scope
				3381	and memory ordering
				3382	stronger than
				3383	unordered (this is
				3384	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3385	acquire-fence-paired-atomic
				3386	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3387	before following
				3388	global memory
				3389	operations. This
				3390	satisfies the
				3391	requirements of
				3392	acquire.
				3393	- Ensures that all
				3394	previous memory
				3395	operations have
				3396	completed before a
				3397	following
				3398	local/generic store
				3399	atomic/atomicrmw
				3400	with an equal or
				3401	wider sync scope
				3402	and memory ordering
				3403	stronger than
				3404	unordered (this is
				3405	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3406	release-fence-paired-atomic
				3407	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3408	requirements of
				3409	release.
				3410
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3411	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				3412	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3413
				3414	- If OpenCL and
				3415	address space is
				3416	not generic, omit
				3417	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3418	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3419	currently has no
				3420	address space on
				3421	the fence need to
				3422	conservatively
				3423	always generate
				3424	(see comment for
				3425	previous fence).
				3426	- Could be split into
				3427	separate s_waitcnt
				3428	vmcnt(0) and
				3429	s_waitcnt
				3430	lgkmcnt(0) to allow
				3431	them to be
				3432	independently moved
				3433	according to the
				3434	following rules.
				3435	- s_waitcnt vmcnt(0)
				3436	must happen after
				3437	any preceding
				3438	global/generic
				3439	load/store/load
				3440	atomic/store
				3441	atomic/atomicrmw.
				3442	- s_waitcnt lgkmcnt(0)
				3443	must happen after
				3444	any preceding
				3445	local/generic
				3446	load/store/load
				3447	atomic/store
				3448	atomic/atomicrmw.
				3449	- Must happen before
				3450	the following
				3451	buffer_wbinvl1_vol.
				3452	- Ensures that the
				3453	preceding
				3454	global/local/generic
				3455	load
				3456	atomic/atomicrmw
				3457	with an equal or
				3458	wider sync scope
				3459	and memory ordering
				3460	stronger than
				3461	unordered (this is
				3462	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3463	acquire-fence-paired-atomic
				3464	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3465	before invalidating
				3466	the cache. This
				3467	satisfies the
				3468	requirements of
				3469	acquire.
				3470	- Ensures that all
				3471	previous memory
				3472	operations have
				3473	completed before a
				3474	following
				3475	global/local/generic
				3476	store
				3477	atomic/atomicrmw
				3478	with an equal or
				3479	wider sync scope
				3480	and memory ordering
				3481	stronger than
				3482	unordered (this is
				3483	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3484	release-fence-paired-atomic
				3485	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3486	requirements of
				3487	release.
				3488
				3489	2. buffer_wbinvl1_vol
				3490
				3491	- Must happen before
				3492	any following
				3493	global/generic
				3494	load/load
				3495	atomic/store/store
				3496	atomic/atomicrmw.
				3497	- Ensures that
				3498	following loads
				3499	will not see stale
				3500	global data. This
				3501	satisfies the
				3502	requirements of
				3503	acquire.
				3504
				3505	Sequential Consistent Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3506	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3507	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3508	- wavefront - local load atomic acquire,
				3509	- generic except must generated
				3510	all instructions even
				3511	for OpenCL.*
				3512	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3513	- generic
				3514	- Must
				3515	happen after
				3516	preceding
				3517	global/generic load
				3518	atomic/store
				3519	atomic/atomicrmw
				3520	with memory
				3521	ordering of seq_cst
				3522	and with equal or
				3523	wider sync scope.
				3524	(Note that seq_cst
				3525	fences have their
				3526	own s_waitcnt
				3527	lgkmcnt(0) and so do
				3528	not need to be
				3529	considered.)
				3530	- Ensures any
				3531	preceding
				3532	sequential
				3533	consistent local
				3534	memory instructions
				3535	have completed
				3536	before executing
				3537	this sequentially
				3538	consistent
				3539	instruction. This
				3540	prevents reordering
				3541	a seq_cst store
				3542	followed by a
				3543	seq_cst load. (Note
				3544	that seq_cst is
				3545	stronger than
				3546	acquire/release as
				3547	the reordering of
				3548	load acquire
				3549	followed by a store
				3550	release is
				3551	prevented by the
				3552	waitcnt of
				3553	the release, but
				3554	there is nothing
				3555	preventing a store
				3556	release followed by
				3557	load acquire from
				3558	competing out of
				3559	order.)
				3560
				3561	2. *Following
				3562	instructions same as
				3563	corresponding load
				3564	atomic acquire,
				3565	except must generated
				3566	all instructions even
				3567	for OpenCL.*
				3568	load atomic seq_cst - workgroup - local *Same as corresponding
				3569	load atomic acquire,
				3570	except must generated
				3571	all instructions even
				3572	for OpenCL.*
				3573	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				3574	- system - generic vmcnt(0)
				3575
				3576	- Could be split into
				3577	separate s_waitcnt
				3578	vmcnt(0)
				3579	and s_waitcnt
				3580	lgkmcnt(0) to allow
				3581	them to be
				3582	independently moved
				3583	according to the
				3584	following rules.
				3585	- waitcnt lgkmcnt(0)
				3586	must happen after
				3587	preceding
				3588	global/generic load
				3589	atomic/store
				3590	atomic/atomicrmw
				3591	with memory
				3592	ordering of seq_cst
				3593	and with equal or
				3594	wider sync scope.
				3595	(Note that seq_cst
				3596	fences have their
				3597	own s_waitcnt
				3598	lgkmcnt(0) and so do
				3599	not need to be
				3600	considered.)
				3601	- waitcnt vmcnt(0)
				3602	must happen after
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3603	preceding
				3604	global/generic load
				3605	atomic/store
				3606	atomic/atomicrmw
				3607	with memory
				3608	ordering of seq_cst
				3609	and with equal or
				3610	wider sync scope.
				3611	(Note that seq_cst
				3612	fences have their
				3613	own s_waitcnt
				3614	vmcnt(0) and so do
				3615	not need to be
				3616	considered.)
				3617	- Ensures any
				3618	preceding
				3619	sequential
				3620	consistent global
				3621	memory instructions
				3622	have completed
				3623	before executing
				3624	this sequentially
				3625	consistent
				3626	instruction. This
				3627	prevents reordering
				3628	a seq_cst store
				3629	followed by a
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3630	seq_cst load. (Note
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3631	that seq_cst is
				3632	stronger than
				3633	acquire/release as
				3634	the reordering of
				3635	load acquire
				3636	followed by a store
				3637	release is
				3638	prevented by the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3639	waitcnt of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3640	the release, but
				3641	there is nothing
				3642	preventing a store
				3643	release followed by
				3644	load acquire from
				3645	competing out of
				3646	order.)
				3647
				3648	2. *Following
				3649	instructions same as
				3650	corresponding load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3651	atomic acquire,
				3652	except must generated
				3653	all instructions even
				3654	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3655	store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3656	- wavefront - local store atomic release,
				3657	- workgroup - generic except must generated
				3658	all instructions even
				3659	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3660	store atomic seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3661	- system - generic store atomic release,
				3662	except must generated
				3663	all instructions even
				3664	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3665	atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3666	- wavefront - local atomicrmw acq_rel,
				3667	- workgroup - generic except must generated
				3668	all instructions even
				3669	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3670	atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3671	- system - generic atomicrmw acq_rel,
				3672	except must generated
				3673	all instructions even
				3674	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3675	fence seq_cst - singlethread none *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3676	- wavefront fence acq_rel,
				3677	- workgroup except must generated
				3678	- agent all instructions even
				3679	- system for OpenCL.*
				3680	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3681
				3682	The memory order also adds the single thread optimization constrains defined in
				3683	table
				3684	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3685
				3686	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3687	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3688
				3689	============ ==============================================================
				3690	LLVM Memory Optimization Constraints
				3691	Ordering
				3692	============ ==============================================================
				3693	unordered none
				3694	monotonic none
				3695	acquire - If a load atomic/atomicrmw then no following load/load
				3696	atomic/store/ store atomic/atomicrmw/fence instruction can
				3697	be moved before the acquire.
				3698	- If a fence then same as load atomic, plus no preceding
				3699	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3700	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3701	atomic/store/ store atomic/atomicrmw/fence instruction can
				3702	be moved after the release.
				3703	- If a fence then same as store atomic, plus no following
				3704	associated fence-paired-atomic can be moved before the
				3705	fence.
				3706	acq_rel Same constraints as both acquire and release.
				3707	seq_cst - If a load atomic then same constraints as acquire, plus no
				3708	preceding sequentially consistent load atomic/store
				3709	atomic/atomicrmw/fence instruction can be moved after the
				3710	seq_cst.
				3711	- If a store atomic then the same constraints as release, plus
				3712	no following sequentially consistent load atomic/store
				3713	atomic/atomicrmw/fence instruction can be moved before the
				3714	seq_cst.
				3715	- If an atomicrmw/fence then same constraints as acq_rel.
				3716	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3717
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3718	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3719	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3720
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3721	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3722	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3723	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3724
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3725	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3726	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3727
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3728	=================== =============== =============== =======================
				3729	Usage Code Sequence Trap Handler Description
				3730	Inputs
				3731	=================== =============== =============== =======================
				3732	reserved ``s_trap 0x00`` Reserved by hardware.
				3733	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3734	``queue_ptr`` ``debugtrap``
				3735	``VGPR0``: intrinsic (not
				3736	``arg`` implemented).
				3737	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3738	``queue_ptr`` terminated and its
				3739	associated queue put
				3740	into the error state.
				3741	``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
				3742	``queue_ptr`` installed handled
				3743	same as ``llvm.trap``.
				3744	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
				3745	breakpoints.
				3746	debugger ``s_trap 0x08`` Reserved for debugger.
				3747	debugger ``s_trap 0xfe`` Reserved for debugger.
				3748	debugger ``s_trap 0xff`` Reserved for debugger.
				3749	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3750
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3751	Unspecified OS
				3752	--------------
				3753
				3754	This section provides code conventions used when the target triple OS is
				3755	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3756
				3757	Trap Handler ABI
				3758	~~~~~~~~~~~~~~~~
				3759
				3760	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3761	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3762	instructions are handled as follows:
				3763
				3764	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3765	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3766
				3767	=============== =============== ===========================================
				3768	Usage Code Sequence Description
				3769	=============== =============== ===========================================
				3770	llvm.trap s_endpgm Causes wavefront to be terminated.
				3771	llvm.debugtrap none Compiler warning given that there is no
				3772	trap handler installed.
				3773	=============== =============== ===========================================
				3774
				3775	Source Languages
				3776	================
				3777
				3778	.. _amdgpu-opencl:
				3779
				3780	OpenCL
				3781	------
				3782
				3783	When generating code for the OpenCL language the target triple environment
				3784	should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
				3785
				3786	When the language is OpenCL the following differences occur:
				3787
				3788	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3789	2. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3790	3. Additional metadata is generated
				3791	(:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3792
				3793	.. TODO
				3794	Specify what affect this has. Hidden arguments added. Additional metadata
				3795	generated.
				3796
				3797	.. _amdgpu-hcc:
				3798
				3799	HCC
				3800	---
				3801
				3802	When generating code for the OpenCL language the target triple environment
				3803	should be ``hcc`` (see :ref:`amdgpu-target-triples`).
				3804
				3805	When the language is OpenCL the following differences occur:
				3806
				3807	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3808
				3809	.. TODO
				3810	Specify what affect this has.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	3811
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3812	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3813	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3814
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3815	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	3816	It supports AMDGCN GFX6-GFX9.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3817
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3818	This section describes general syntax for instructions and operands. For more
				3819	information about instructions, their semantics and supported combinations of
				3820	operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	3821	[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3822
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3823	An instruction has the following syntax (register operands are normally
				3824	comma-separated while extra operands are space-separated):
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3825
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3826	<opcode> <register_operand0>, ... <extra_operand0> ...
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3827
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3828	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3829	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3830
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3831	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3832
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3833	* SGPR registers: s0, ... or s[0], ...
				3834	* VGPR registers: v0, ... or v[0], ...
				3835	* TTMP registers: ttmp0, ... or ttmp[0], ...
				3836	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				3837	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				3838	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				3839	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				3840	* Register index expressions: v[2*2], s[1-1:2-1]
				3841	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3842
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3843	The following extra operands are supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3844
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3845	* offset, offset0, offset1
				3846	* idxen, offen bits
				3847	* glc, slc, tfe bits
				3848	* waitcnt: integer or combination of counter values
				3849	* VOP3 modifiers:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3850
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3851	- abs (\\| \\|), neg (\-)
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3852
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3853	* DPP modifiers:
				3854
				3855	- row_shl, row_shr, row_ror, row_rol
				3856	- row_mirror, row_half_mirror, row_bcast
				3857	- wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
				3858	- row_mask, bank_mask, bound_ctrl
				3859
				3860	* SDWA modifiers:
				3861
				3862	- dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
				3863	- dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
				3864	- abs, neg, sext
				3865
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3866	Instruction Examples
				3867	~~~~~~~~~~~~~~~~~~~~
				3868
				3869	DS
				3870	~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3871
				3872	.. code-block:: nasm
				3873
				3874	ds_add_u32 v2, v4 offset:16
				3875	ds_write_src2_b64 v2 offset0:4 offset1:8
				3876	ds_cmpst_f32 v2, v4, v6
				3877	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				3878
				3879
				3880	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				3881
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3882	FLAT
				3883	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3884
				3885	.. code-block:: nasm
				3886
				3887	flat_load_dword v1, v[3:4]
				3888	flat_store_dwordx3 v[3:4], v[5:7]
				3889	flat_atomic_swap v1, v[3:4], v5 glc
				3890	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				3891	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				3892
				3893	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				3894
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3895	MUBUF
				3896	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3897
				3898	.. code-block:: nasm
				3899
				3900	buffer_load_dword v1, off, s[4:7], s1
				3901	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				3902	buffer_store_format_xy v[1:2], off, s[4:7], s1
				3903	buffer_wbinvl1
				3904	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				3905
				3906	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				3907
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3908	SMRD/SMEM
				3909	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3910
				3911	.. code-block:: nasm
				3912
				3913	s_load_dword s1, s[2:3], 0xfc
				3914	s_load_dwordx8 s[8:15], s[2:3], s4
				3915	s_load_dwordx16 s[88:103], s[2:3], s4
				3916	s_dcache_inv_vol
				3917	s_memtime s[4:5]
				3918
				3919	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				3920
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3921	SOP1
				3922	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3923
				3924	.. code-block:: nasm
				3925
				3926	s_mov_b32 s1, s2
				3927	s_mov_b64 s[0:1], 0x80000000
				3928	s_cmov_b32 s1, 200
				3929	s_wqm_b64 s[2:3], s[4:5]
				3930	s_bcnt0_i32_b64 s1, s[2:3]
				3931	s_swappc_b64 s[2:3], s[4:5]
				3932	s_cbranch_join s[4:5]
				3933
				3934	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				3935
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3936	SOP2
				3937	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3938
				3939	.. code-block:: nasm
				3940
				3941	s_add_u32 s1, s2, s3
				3942	s_and_b64 s[2:3], s[4:5], s[6:7]
				3943	s_cselect_b32 s1, s2, s3
				3944	s_andn2_b32 s2, s4, s6
				3945	s_lshr_b64 s[2:3], s[4:5], s6
				3946	s_ashr_i32 s2, s4, s6
				3947	s_bfm_b64 s[2:3], s4, s6
				3948	s_bfe_i64 s[2:3], s[4:5], s6
				3949	s_cbranch_g_fork s[4:5], s[6:7]
				3950
				3951	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				3952
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3953	SOPC
				3954	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3955
				3956	.. code-block:: nasm
				3957
				3958	s_cmp_eq_i32 s1, s2
				3959	s_bitcmp1_b32 s1, s2
				3960	s_bitcmp0_b64 s[2:3], s4
				3961	s_setvskip s3, s5
				3962
				3963	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				3964
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3965	SOPP
				3966	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3967
				3968	.. code-block:: nasm
				3969
				3970	s_barrier
				3971	s_nop 2
				3972	s_endpgm
				3973	s_waitcnt 0 ; Wait for all counters to be 0
				3974	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				3975	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				3976	s_sethalt 9
				3977	s_sleep 10
				3978	s_sendmsg 0x1
				3979	s_sendmsg sendmsg(MSG_INTERRUPT)
				3980	s_trap 1
				3981
				3982	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				3983
				3984	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	3985	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3986	range or acceptable values.
				3987
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3988	VALU
				3989	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3990
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3991	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				3992	the assembler will automatically use optimal encoding based on its operands.
				3993	To force specific encoding, one can add a suffix to the opcode of the instruction:
				3994
				3995	* _e32 for 32-bit VOP1/VOP2/VOPC
				3996	* _e64 for 64-bit VOP3
				3997	* _dpp for VOP_DPP
				3998	* _sdwa for VOP_SDWA
				3999
				4000	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4001
				4002	.. code-block:: nasm
				4003
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4004	v_mov_b32 v1, v2
				4005	v_mov_b32_e32 v1, v2
				4006	v_nop
				4007	v_cvt_f64_i32_e32 v[1:2], v2
				4008	v_floor_f32_e32 v1, v2
				4009	v_bfrev_b32_e32 v1, v2
				4010	v_add_f32_e32 v1, v2, v3
				4011	v_mul_i32_i24_e64 v1, v2, 3
				4012	v_mul_i32_i24_e32 v1, -3, v3
				4013	v_mul_i32_i24_e32 v1, -100, v3
				4014	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				4015	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4016
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4017	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4018
				4019	.. code-block:: nasm
				4020
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4021	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				4022	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4023	v_mov_b32 v0, v0 wave_shl:1
				4024	v_mov_b32 v0, v0 row_mirror
				4025	v_mov_b32 v0, v0 row_bcast:31
				4026	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4027	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4028	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4029
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4030	VOP_SDWA examples:
				4031
				4032	.. code-block:: nasm
				4033
				4034	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				4035	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				4036	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				4037	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				4038	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				4039
				4040	For full list of supported instructions, refer to "Vector ALU instructions".
				4041
				4042	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4043	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4044
				4045	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				4046	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4047
				4048	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4049	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4050
				4051	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4052	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4053
				4054	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4055	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				4056
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4057
				4058	major, minor, and stepping are all integers that describe the instruction
				4059	set architecture (ISA) version of the assembly program.
				4060
				4061	vendor and arch are quoted strings. vendor should always be equal to
				4062	"AMD" and arch should always be equal to "AMDGPU".
				4063
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4064	By default, the assembler will derive the ISA version, vendor, and arch
				4065	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4066
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4067	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4068	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4069
				4070	This directives specifies that the symbol with given name is a kernel entry point
				4071	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4072
				4073	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4074	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4075
				4076	This directive marks the beginning of a list of key / value pairs that are used
				4077	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				4078	The list must be terminated by the .end_amd_kernel_code_t directive. For
				4079	any amd_kernel_code_t values that are unspecified a default value will be
				4080	used. The default value for all keys is 0, with the following exceptions:
				4081
				4082	- kernel_code_version_major defaults to 1.
				4083	- machine_kind defaults to 1.
				4084	- machine_version_major, machine_version_minor, and
				4085	machine_version_stepping are derived from the value of the -mcpu option
				4086	that is passed to the assembler.
				4087	- kernel_code_entry_byte_offset defaults to 256.
				4088	- wavefront_size defaults to 6.
				4089	- kernarg_segment_alignment, group_segment_alignment, and
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	4090	private_segment_alignment default to 4. Note that alignments are specified
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4091	as a power of two, so a value of n means an alignment of 2^ n.
				4092
				4093	The .amd_kernel_code_t directive must be placed immediately after the
				4094	function label and before any instructions.
				4095
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4096	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				4097	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4098
				4099	Here is an example of a minimal amd_kernel_code_t specification:
				4100
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	4101	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4102
				4103	.hsa_code_object_version 1,0
				4104	.hsa_code_object_isa
				4105
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4106	.hsatext
				4107	.globl hello_world
				4108	.p2align 8
				4109	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4110
				4111	hello_world:
				4112
				4113	.amd_kernel_code_t
				4114	enable_sgpr_kernarg_segment_ptr = 1
				4115	is_ptr64 = 1
				4116	compute_pgm_rsrc1_vgprs = 0
				4117	compute_pgm_rsrc1_sgprs = 0
				4118	compute_pgm_rsrc2_user_sgpr = 2
				4119	kernarg_segment_byte_size = 8
				4120	wavefront_sgpr_count = 2
				4121	workitem_vgpr_count = 3
				4122	.end_amd_kernel_code_t
				4123
				4124	s_load_dwordx2 s[0:1], s[0:1] 0x0
				4125	v_mov_b32 v0, 3.14159
				4126	s_waitcnt lgkmcnt(0)
				4127	v_mov_b32 v1, s0
				4128	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4129	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4130	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	4131	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4132	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4133
				4134	Additional Documentation
				4135	========================
				4136
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4137	.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				4138	.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				4139	.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				4140	.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				4141	.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				4142	.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				4143	.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
				4144	.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4145	.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
				4146	.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
				4147	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				4148	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				4149	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				4150	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				4151	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	4152	.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4153	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				4154	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
				4155	.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__