Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: 7062d75d92ec6e0750d65ad8221f6881764d3092 [file] [log] [blame]

Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1	=============================
				2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
				26	.. table:: AMDGPU Target Triples
				27	:name: amdgpu-target-triples-table
				28
				29	============ ======== ========= ===========
				30	Architecture Vendor OS Environment
				31	============ ======== ========= ===========
				32	r600 amd <empty> <empty>
				33	amdgcn amd <empty> <empty>
				34	amdgcn amd amdhsa <empty>
				35	amdgcn amd amdhsa opencl
				36	amdgcn amd amdhsa amdgizcl
				37	amdgcn amd amdhsa amdgiz
				38	amdgcn amd amdhsa hcc
				39	============ ======== ========= ===========
				40
				41	``r600-amd--``
				42	Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
				43	the MESA runtime.
				44
				45	``amdgcn-amd--``
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	46	Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	47	the MESA runtime.
				48
				49	``amdgcn-amd-amdhsa-``
				50	Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
				51	compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
				52
				53	``amdgcn-amd-amdhsa-opencl``
				54	Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
				55	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
				56	:ref:`amdgpu-opencl`.
				57
				58	``amdgcn-amd-amdhsa-amdgizcl``
				59	Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
				60	is used (see :ref:`amdgpu-address-spaces`).
				61
				62	``amdgcn-amd-amdhsa-amdgiz``
				63	Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
				64	used (see :ref:`amdgpu-address-spaces`).
				65
				66	``amdgcn-amd-amdhsa-hcc``
				67	Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
				68	executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
				69	:ref:`amdgpu-hcc`.
				70
				71	.. _amdgpu-processors:
				72
				73	Processors
				74	----------
				75
				76	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				77	names from both the Processor and Alternative Processor can be used.
				78
				79	.. table:: AMDGPU Processors
				80	:name: amdgpu-processors-table
				81
				82	========== =========== ============ ===== ======= ==================
				83	Processor Alternative Target dGPU/ Runtime Example
				84	Processor Triple APU Support Products
				85	Architecture
				86	========== =========== ============ ===== ======= ==================
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	87	Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	88	--------------------------------------------------------------------
				89	r600 r600 dGPU
				90	r630 r600 dGPU
				91	rs880 r600 dGPU
				92	rv670 r600 dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	93	Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	94	--------------------------------------------------------------------
				95	rv710 r600 dGPU
				96	rv730 r600 dGPU
				97	rv770 r600 dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	98	Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	99	--------------------------------------------------------------------
				100	cedar r600 dGPU
				101	redwood r600 dGPU
				102	sumo r600 dGPU
				103	juniper r600 dGPU
				104	cypress r600 dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	105	Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	106	--------------------------------------------------------------------
				107	barts r600 dGPU
				108	turks r600 dGPU
				109	caicos r600 dGPU
				110	cayman r600 dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	111	GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	112	--------------------------------------------------------------------
Konstantin Zhuravlyov	6cbcb27	2017-08-08 04:28:31 +0000	[diff] [blame]	113	gfx600 - tahiti amdgcn dGPU
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	114	gfx601 - pitcairn amdgcn dGPU
				115	- verde
				116	- oland
				117	- hainan
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	118	GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	119	--------------------------------------------------------------------
				120	gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
				121	- Radeon HD 8770
				122	- R7 260
				123	- R7 260X
				124	\ - kaveri amdgcn APU - A6-7000
				125	- A6 Pro-7050B
				126	- A8-7100
				127	- A8 Pro-7150B
				128	- A10-7300
				129	- A10 Pro-7350B
				130	- FX-7500
				131	- A8-7200P
				132	- A10-7400P
				133	- FX-7600P
				134	gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
				135	- FirePro W9100
				136	- FirePro S9150
				137	- FirePro S9170
Tony Tye	1fd77b2	2017-06-08 01:47:25 +0000	[diff] [blame]	138	gfx702 dGPU ROCm - Radeon R9 290
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	139	- Radeon R9 290x
				140	- Radeon R390
Tony Tye	1fd77b2	2017-06-08 01:47:25 +0000	[diff] [blame]	141	- Radeon R390x
				142	gfx703 - kabini amdgcn APU - E1-2100
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	143	- mullins - E1-2200
				144	- E1-2500
				145	- E2-3000
				146	- E2-3800
				147	- A4-5000
				148	- A4-5100
				149	- A6-5200
				150	- A4 Pro-3340B
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	151	GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	152	--------------------------------------------------------------------
				153	gfx800 - iceland amdgcn dGPU - FirePro S7150
				154	- FirePro S7100
				155	- FirePro W7100
				156	- Radeon R285
				157	- Radeon R9 380
				158	- Radeon R9 385
				159	- Mobile FirePro
				160	M7170
				161	gfx801 - carrizo amdgcn APU - A6-8500P
				162	- Pro A6-8500B
				163	- A8-8600P
				164	- Pro A8-8600B
				165	- FX-8800P
				166	- Pro A12-8800B
				167	\ amdgcn APU ROCm - A10-8700P
				168	- Pro A10-8700B
				169	- A10-8780P
				170	\ amdgcn APU - A10-9600P
				171	- A10-9630P
				172	- A12-9700P
				173	- A12-9730P
				174	- FX-9800P
				175	- FX-9830P
				176	\ amdgcn APU - E2-9010
				177	- A6-9210
				178	- A9-9410
				179	gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
				180	gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
				181	- Radeon R9 Fury
				182	- Radeon R9 FuryX
				183	- Radeon Pro Duo
				184	- FirePro S9300x2
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	185	- Radeon Instinct MI8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	186	\ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
				187	- Radeon RX 480
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	188	- Radeon Instinct MI6
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	189	\ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
				190	gfx804 amdgcn dGPU Same as gfx803
				191	gfx810 - stoney amdgcn APU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	192	GCN GFX9 [AMD-GCN-GFX9]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	193	--------------------------------------------------------------------
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	194	gfx900 amdgcn dGPU - Radeon Vega
				195	Frontier Edition
				196	- Radeon RX Vega 56
				197	- Radeon RX Vega 64
				198	- Radeon RX Vega 64
				199	Liquid
				200	- Radeon Instinct MI25
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	201	gfx901 amdgcn dGPU ROCm Same as gfx900
				202	except XNACK is
				203	enabled
				204	gfx902 amdgcn APU TBA
				205
				206	.. TODO
				207	Add product
				208	names.
				209	gfx903 amdgcn APU Same as gfx902
				210	except XNACK is
				211	enabled
				212	========== =========== ============ ===== ======= ==================
				213
				214	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	215
				216	Address Spaces
				217	--------------
				218
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	219	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	220
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	221	The memory space names used in the table, aside from the region memory space, is
				222	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	223
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	224	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	225
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	226	.. table:: Address Space Mapping
				227	:name: amdgpu-address-space-mapping-table
				228
				229	================== ================= ================= ================= =================
				230	LLVM Address Space Memory Space
				231	------------------ -----------------------------------------------------------------------
				232	\ Current Default amdgiz/amdgizcl hcc Future Default
				233	================== ================= ================= ================= =================
				234	0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
				235	1 Global Global Global Global
				236	2 Constant Constant Constant Region (GDS)
				237	3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
				238	4 Generic (Flat) Region (GDS) Region (GDS) Constant
				239	5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
				240	================== ================= ================= ================= =================
				241
				242	Current Default
				243	This is the current default address space mapping used for all languages
				244	except hcc. This will shortly be deprecated.
				245
				246	amdgiz/amdgizcl
				247	This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
				248	is specified as the target triple environment value.
				249
				250	hcc
				251	This is the current address space mapping used when ``hcc`` is specified as
				252	the target triple environment value.This will shortly be deprecated.
				253
				254	Future Default
				255	This will shortly be the only address space mapping for all languages using
				256	AMDGPU backend.
				257
				258	.. _amdgpu-memory-scopes:
				259
				260	Memory Scopes
				261	-------------
				262
				263	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				264	backend memory model when the target triple OS is ``amdhsa`` (see
				265	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				266
				267	The memory model supported is based on the HSA memory model [HSA]_ which is
				268	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				269	relation is transitive over the synchonizes-with relation independent of scope,
				270	and synchonizes-with allows the memory scope instances to be inclusive (see
				271	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
				272
				273	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				274	inclusion and requires the memory scopes to exactly match. However, this
				275	is conservatively correct for OpenCL.
				276
				277	.. table:: AMDHSA LLVM Sync Scopes for AMDHSA
				278	:name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
				279
				280	================ ==========================================================
				281	LLVM Sync Scope Description
				282	================ ==========================================================
				283	none The default: ``system``.
				284
				285	Synchronizes with, and participates in modification and
				286	seq_cst total orderings with, other operations (except
				287	image operations) for all address spaces (except private,
				288	or generic that accesses private) provided the other
				289	operation's sync scope is:
				290
				291	- ``system``.
				292	- ``agent`` and executed by a thread on the same agent.
				293	- ``workgroup`` and executed by a thread in the same
				294	workgroup.
				295	- ``wavefront`` and executed by a thread in the same
				296	wavefront.
				297
				298	``agent`` Synchronizes with, and participates in modification and
				299	seq_cst total orderings with, other operations (except
				300	image operations) for all address spaces (except private,
				301	or generic that accesses private) provided the other
				302	operation's sync scope is:
				303
				304	- ``system`` or ``agent`` and executed by a thread on the
				305	same agent.
				306	- ``workgroup`` and executed by a thread in the same
				307	workgroup.
				308	- ``wavefront`` and executed by a thread in the same
				309	wavefront.
				310
				311	``workgroup`` Synchronizes with, and participates in modification and
				312	seq_cst total orderings with, other operations (except
				313	image operations) for all address spaces (except private,
				314	or generic that accesses private) provided the other
				315	operation's sync scope is:
				316
				317	- ``system``, ``agent`` or ``workgroup`` and executed by a
				318	thread in the same workgroup.
				319	- ``wavefront`` and executed by a thread in the same
				320	wavefront.
				321
				322	``wavefront`` Synchronizes with, and participates in modification and
				323	seq_cst total orderings with, other operations (except
				324	image operations) for all address spaces (except private,
				325	or generic that accesses private) provided the other
				326	operation's sync scope is:
				327
				328	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				329	and executed by a thread in the same wavefront.
				330
				331	``singlethread`` Only synchronizes with, and participates in modification
				332	and seq_cst total orderings with, other operations (except
				333	image operations) running in the same thread for all
				334	address spaces (for example, in signal handlers).
				335	================ ==========================================================
				336
				337	AMDGPU Intrinsics
				338	-----------------
				339
				340	The AMDGPU backend implements the following intrinsics.
				341
				342	This section is WIP.
				343
				344	.. TODO
				345	List AMDGPU intrinsics
				346
				347	Code Object
				348	===========
				349
				350	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				351	can be linked by ``lld`` to produce a standard ELF shared code object which can
				352	be loaded and executed on an AMDGPU target.
				353
				354	Header
				355	------
				356
				357	The AMDGPU backend uses the following ELF header:
				358
				359	.. table:: AMDGPU ELF Header
				360	:name: amdgpu-elf-header-table
				361
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	362	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	363	Field Value
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	364	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	365	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				366	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Konstantin Zhuravlyov	3696352	2017-10-03 21:18:03 +0000	[diff] [blame]	367	``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``,
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	368	``ELFOSABI_AMDGPU_PAL`` or
				369	``ELFOSABI_AMDGPU_MESA3D``
Konstantin Zhuravlyov	3696352	2017-10-03 21:18:03 +0000	[diff] [blame]	370	``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``,
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	371	``ELFABIVERSION_AMDGPU_PAL`` or
				372	``ELFABIVERSION_AMDGPU_MESA3D``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	373	``e_type`` ``ET_REL`` or ``ET_DYN``
				374	``e_machine`` ``EM_AMDGPU``
				375	``e_entry`` 0
				376	``e_flags`` 0
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	377	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	378
				379	..
				380
				381	.. table:: AMDGPU ELF Header Enumeration Values
				382	:name: amdgpu-elf-header-enumeration-values-table
				383
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	384	=============================== =====
				385	Name Value
				386	=============================== =====
				387	``EM_AMDGPU`` 224
				388	``ELFOSABI_AMDGPU_HSA`` 64
				389	``ELFOSABI_AMDGPU_PAL`` 65
				390	``ELFOSABI_AMDGPU_MESA3D`` 66
				391	``ELFABIVERSION_AMDGPU_HSA`` 1
				392	``ELFABIVERSION_AMDGPU_PAL`` 0
				393	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				394	=============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	395
				396	``e_ident[EI_CLASS]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	397	The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64
				398	bit applications.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	399
				400	``e_ident[EI_DATA]``
				401	All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
				402
				403	``e_ident[EI_OSABI]``
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	404	One of the following AMD GPU architecture specific OS ABIs:
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	405
				406	* ``ELFOSABI_AMDGPU_HSA`` is used to specify that the code object conforms to
				407	the AMD HSA runtime ABI [HSA]_.
				408
				409	* ``ELFOSABI_AMDGPU_PAL`` is used to specify that the code object conforms to
				410	the AMD PAL runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	411
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	412	* ``ELFOSABI_AMDGPU_MESA3D`` is used to specify that the code object conforms
				413	to the AMD MESA runtime ABI.
				414
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	415	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	416	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				417	object conforms:
				418
				419	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				420	runtime ABI.
				421
				422	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				423	runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	424
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	425	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
				426	runtime ABI.
				427
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	428	``e_type``
				429	Can be one of the following values:
				430
				431
				432	``ET_REL``
				433	The type produced by the AMD GPU backend compiler as it is relocatable code
				434	object.
				435
				436	``ET_DYN``
				437	The type produced by the linker as it is a shared code object.
				438
				439	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				440
				441	``e_machine``
				442	The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
				443	architecture family. The specific member is specified in the
				444	``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
				445	:ref:`amdgpu-note-records`).
				446
				447	``e_entry``
				448	The entry point is 0 as the entry points for individual kernels must be
				449	selected in order to invoke them through AQL packets.
				450
				451	``e_flags``
				452	The value is 0 as no flags are used.
				453
				454	Sections
				455	--------
				456
				457	An AMDGPU target ELF code object has the standard ELF sections which include:
				458
				459	.. table:: AMDGPU ELF Sections
				460	:name: amdgpu-elf-sections-table
				461
				462	================== ================ =================================
				463	Name Type Attributes
				464	================== ================ =================================
				465	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				466	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				467	``.debug_``\ \* ``SHT_PROGBITS`` none
				468	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				469	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				470	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				471	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				472	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				473	``.note`` ``SHT_NOTE`` none
				474	``.rela``\ name ``SHT_RELA`` none
				475	``.rela.dyn`` ``SHT_RELA`` none
				476	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				477	``.shstrtab`` ``SHT_STRTAB`` none
				478	``.strtab`` ``SHT_STRTAB`` none
				479	``.symtab`` ``SHT_SYMTAB`` none
				480	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				481	================== ================ =================================
				482
				483	These sections have their standard meanings (see [ELF]_) and are only generated
				484	if needed.
				485
				486	``.debug``\ \*
				487	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				488	DWARF produced by the AMDGPU backend.
				489
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	490	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	491	The standard sections used by a dynamic loader.
				492
				493	``.note``
				494	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				495	backend.
				496
				497	``.rela``\ name, ``.rela.dyn``
				498	For relocatable code objects, name is the name of the section that the
				499	relocation records apply. For example, ``.rela.text`` is the section name for
				500	relocation records associated with the ``.text`` section.
				501
				502	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				503	records from each of the relocatable code object's ``.rela``\ name sections.
				504
				505	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				506	the AMDGPU backend.
				507
				508	``.text``
				509	The executable machine code for the kernels and functions they call. Generated
				510	as position independent code. See :ref:`amdgpu-code-conventions` for
				511	information on conventions used in the isa generation.
				512
				513	.. _amdgpu-note-records:
				514
				515	Note Records
				516	------------
				517
				518	As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
				519	the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
				520	minimal zero byte padding must be generated to ensure the ``desc`` field size is
				521	a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
				522	be at least 4 to indicate at least 8 byte alignment.
				523
				524	The AMDGPU backend code object uses the following ELF note records in the
				525	``.note`` section. The Description column specifies the layout of the note
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	526	record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	527	variable size strings have a corresponding ``*_size`` field that specifies the
				528	number of bytes, including the terminating null character, in the string. The
				529	string(s) come immediately after the preceding fields.
				530
				531	Additional note records can be present.
				532
				533	.. table:: AMDGPU ELF Note Records
				534	:name: amdgpu-elf-note-records-table
				535
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	536	===== ============================== ======================================
				537	Name Type Description
				538	===== ============================== ======================================
				539	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
				540	"AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
				541	===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	542
				543	..
				544
				545	.. table:: AMDGPU ELF Note Record Enumeration Values
				546	:name: amdgpu-elf-note-record-enumeration-values-table
				547
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	548	============================== =====
				549	Name Value
				550	============================== =====
				551	reserved 0-9
				552	``NT_AMD_AMDGPU_HSA_METADATA`` 10
				553	``NT_AMD_AMDGPU_ISA`` 11
				554	============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	555
				556	``NT_AMD_AMDGPU_ISA``
				557	Specifies the instruction set architecture used by the machine code contained
				558	in the code object.
				559
				560	This note record is required for code objects containing machine code for
				561	processors matching the ``amdgcn`` architecture in table
				562	:ref:`amdgpu-processors`.
				563
				564	The null terminated string has the following syntax:
				565
				566	architecture\ ``-``\ vendor\ ``-``\ os\ ``-``\ environment\ ``-``\ processor
				567
				568	where:
				569
				570	architecture
				571	The architecture from table :ref:`amdgpu-target-triples-table`.
				572
				573	This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
				574	:ref:`amdgpu-target-triples`).
				575
				576	vendor
				577	The vendor from table :ref:`amdgpu-target-triples-table`.
				578
				579	For the AMDGPU backend this is always ``amd``.
				580
				581	os
				582	The OS from table :ref:`amdgpu-target-triples-table`.
				583
				584	environment
				585	An environment from table :ref:`amdgpu-target-triples-table`, or blank if
				586	the environment has no affect on the execution of the code object.
				587
				588	For the AMDGPU backend this is currently always blank.
				589	processor
				590	The processor from table :ref:`amdgpu-processors-table`.
				591
				592	For example:
				593
				594	``amdgcn-amd-amdhsa--gfx901``
				595
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	596	``NT_AMD_AMDGPU_HSA_METADATA``
				597	Specifies extensible metadata associated with the code objects executed on HSA
				598	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				599	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				600	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				601	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	602
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	603	.. _amdgpu-symbols:
				604
				605	Symbols
				606	-------
				607
				608	Symbols include the following:
				609
				610	.. table:: AMDGPU ELF Symbols
				611	:name: amdgpu-elf-symbols-table
				612
				613	===================== ============== ============= ==================
				614	Name Type Section Description
				615	===================== ============== ============= ==================
				616	link-name ``STT_OBJECT`` - ``.data`` Global variable
				617	- ``.rodata``
				618	- ``.bss``
				619	link-name\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
				620	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				621	===================== ============== ============= ==================
				622
				623	Global variable
				624	Global variables both used and defined by the compilation unit.
				625
				626	If the symbol is defined in the compilation unit then it is allocated in the
				627	appropriate section according to if it has initialized data or is readonly.
				628
				629	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				630	will resolve relocations using the definition provided by another code object
				631	or explicitly defined by the runtime.
				632
				633	All global symbols, whether defined in the compilation unit or external, are
				634	accessed by the machine code indirectly through a GOT table entry. This
				635	allows them to be preemptable. The GOT table is only supported when the target
				636	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	637
				638	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	639	Add description of linked shared object symbols. Seems undefined symbols
				640	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	641
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	642	Kernel descriptor
				643	Every HSA kernel has an associated kernel descriptor. It is the address of the
				644	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				645	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				646	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				647
				648	Kernel entry point
				649	Every HSA kernel also has a symbol for its machine code entry point.
				650
				651	.. _amdgpu-relocation-records:
				652
				653	Relocation Records
				654	------------------
				655
				656	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				657	relocatable fields are:
				658
				659	``word32``
				660	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				661	alignment. These values use the same byte order as other word values in the
				662	AMD GPU architecture.
				663
				664	``word64``
				665	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				666	alignment. These values use the same byte order as other word values in the
				667	AMD GPU architecture.
				668
				669	Following notations are used for specifying relocation calculations:
				670
				671	A
				672	Represents the addend used to compute the value of the relocatable field.
				673
				674	G
				675	Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	676	entry's symbol will reside during execution.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	677
				678	GOT
				679	Represents the address of the global offset table.
				680
				681	P
				682	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				683	of the storage unit being relocated (computed using ``r_offset``).
				684
				685	S
				686	Represents the value of the symbol whose index resides in the relocation
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	687	entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				688
				689	B
				690	Represents the base address of a loaded executable or shared object which is
				691	the difference between the ELF address and the actual load address. Relocations
				692	using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	693
				694	The following relocation types are supported:
				695
				696	.. table:: AMDGPU ELF Relocation Records
				697	:name: amdgpu-elf-relocation-records-table
				698
				699	========================== ===== ========== ==============================
				700	Relocation Type Value Field Calculation
				701	========================== ===== ========== ==============================
				702	``R_AMDGPU_NONE`` 0 none none
				703	``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
				704	``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
				705	``R_AMDGPU_ABS64`` 3 ``word64`` S + A
				706	``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
				707	``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
				708	``R_AMDGPU_ABS32`` 6 ``word32`` S + A
				709	``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
				710	``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				711	``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
				712	``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				713	``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	714	reserved 12
				715	``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	716	========================== ===== ========== ==============================
				717
				718	.. _amdgpu-dwarf:
				719
				720	DWARF
				721	-----
				722
				723	Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
				724	information that maps the code object executable code and data to the source
				725	language constructs. It can be used by tools such as debuggers and profilers.
				726
				727	Address Space Mapping
				728	~~~~~~~~~~~~~~~~~~~~~
				729
				730	The following address space mapping is used:
				731
				732	.. table:: AMDGPU DWARF Address Space Mapping
				733	:name: amdgpu-dwarf-address-space-mapping-table
				734
				735	=================== =================
				736	DWARF Address Space Memory Space
				737	=================== =================
				738	1 Private (Scratch)
				739	2 Local (group/LDS)
				740	omitted Global
				741	omitted Constant
				742	omitted Generic (Flat)
				743	not supported Region (GDS)
				744	=================== =================
				745
				746	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				747	used in the table.
				748
				749	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				750	DWARF address space of the value of the pointer when it is in the private or
				751	local address space. Otherwise the attribute is omitted.
				752
				753	An ``XDEREF`` operation is generated in location list expressions for variables
				754	that are allocated in the private and local address space. Otherwise no
				755	``XDREF`` is omitted.
				756
				757	Register Mapping
				758	~~~~~~~~~~~~~~~~
				759
				760	This section is WIP.
				761
				762	.. TODO
				763	Define DWARF register enumeration.
				764
				765	If want to present a wavefront state then should expose vector registers as
				766	64 wide (rather than per work-item view that LLVM uses). Either as separate
				767	registers, or a 64x4 byte single register. In either case use a new LANE op
				768	(akin to XDREF) to select the current lane usage in a location
				769	expression. This would also allow scalar register spilling to vector register
				770	lanes to be expressed (currently no debug information is being generated for
				771	spilling). If choose a wide single register approach then use LANE in
				772	conjunction with PIECE operation to select the dword part of the register for
				773	the current lane. If the separate register approach then use LANE to select
				774	the register.
				775
				776	Source Text
				777	~~~~~~~~~~~
				778
				779	This section is WIP.
				780
				781	.. TODO
				782	DWARF extension to include runtime generated source text.
				783
				784	.. _amdgpu-code-conventions:
				785
				786	Code Conventions
				787	================
				788
				789	This section provides code conventions used for each supported target triple OS
				790	(see :ref:`amdgpu-target-triples`).
				791
				792	AMDHSA
				793	------
				794
				795	This section provides code conventions used when the target triple OS is
				796	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				797
				798	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	799
				800	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	801	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	802
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	803	The code object metadata specifies extensible metadata associated with the code
				804	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
				805	[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
				806	(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
				807	``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
				808	information necessary to support the ROCM kernel queries. For example, the
				809	segment sizes needed in a dispatch packet. In addition, a high level language
				810	runtime may require other information to be included. For example, the AMD
				811	OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	812
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	813	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	814	:doc:`YamlIO`).
				815
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	816	.. TODO
				817	Is the string null terminated? It probably should not if YAML allows it to
				818	contain null characters, otherwise it should be.
				819
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	820	The metadata is represented as a single YAML document comprised of the mapping
				821	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				822	referenced tables.
				823
				824	For boolean values, the string values of ``false`` and ``true`` are used for
				825	false and true respectively.
				826
				827	Additional information can be added to the mappings. To avoid conflicts, any
				828	non-AMD key names should be prefixed by "vendor-name.".
				829
				830	.. table:: AMDHSA Code Object Metadata Mapping
				831	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				832
				833	========== ============== ========= =======================================
				834	String Key Value Type Required? Description
				835	========== ============== ========= =======================================
				836	"Version" sequence of Required - The first integer is the major
				837	2 integers version. Currently 1.
				838	- The second integer is the minor
				839	version. Currently 0.
				840	"Printf" sequence of Each string is encoded information
				841	strings about a printf function call. The
				842	encoded information is organized as
				843	fields separated by colon (':'):
				844
				845	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				846
				847	where:
				848
				849	``ID``
				850	A 32 bit integer as a unique id for
				851	each printf function call
				852
				853	``N``
				854	A 32 bit integer equal to the number
				855	of arguments of printf function call
				856	minus 1
				857
				858	``S[i]`` (where i = 0, 1, ... , N-1)
				859	32 bit integers for the size in bytes
				860	of the i-th FormatString argument of
				861	the printf function call
				862
				863	FormatString
				864	The format string passed to the
				865	printf function call.
				866	"Kernels" sequence of Required Sequence of the mappings for each
				867	mapping kernel in the code object. See
				868	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				869	for the definition of the mapping.
				870	========== ============== ========= =======================================
				871
				872	..
				873
				874	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				875	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				876
				877	================= ============== ========= ================================
				878	String Key Value Type Required? Description
				879	================= ============== ========= ================================
				880	"Name" string Required Source name of the kernel.
				881	"SymbolName" string Required Name of the kernel
				882	descriptor ELF symbol.
				883	"Language" string Source language of the kernel.
				884	Values include:
				885
				886	- "OpenCL C"
				887	- "OpenCL C++"
				888	- "HCC"
				889	- "OpenMP"
				890
				891	"LanguageVersion" sequence of - The first integer is the major
				892	2 integers version.
				893	- The second integer is the
				894	minor version.
				895	"Attrs" mapping Mapping of kernel attributes.
				896	See
				897	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				898	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	899	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	900	mapping kernel arguments. See
				901	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				902	for the definition of the mapping.
				903	"CodeProps" mapping Mapping of properties related to
				904	the kernel code. See
				905	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				906	for the mapping definition.
				907	"DebugProps" mapping Mapping of properties related to
				908	the kernel debugging. See
				909	:ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
				910	for the mapping definition.
				911	================= ============== ========= ================================
				912
				913	..
				914
				915	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				916	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				917
				918	=================== ============== ========= ==============================
				919	String Key Value Type Required? Description
				920	=================== ============== ========= ==============================
				921	"ReqdWorkGroupSize" sequence of The dispatch work-group size
				922	3 integers X, Y, Z must correspond to the
				923	specified values.
				924
				925	Corresponds to the OpenCL
				926	``reqd_work_group_size``
				927	attribute.
				928	"WorkGroupSizeHint" sequence of The dispatch work-group size
				929	3 integers X, Y, Z is likely to be the
				930	specified values.
				931
				932	Corresponds to the OpenCL
				933	``work_group_size_hint``
				934	attribute.
				935	"VecTypeHint" string The name of a scalar or vector
				936	type.
				937
				938	Corresponds to the OpenCL
				939	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	940
				941	"RuntimeHandle" string The external symbol name
				942	associated with a kernel.
				943	OpenCL runtime allocates a
				944	global buffer for the symbol
				945	and saves the kernel's address
				946	to it, which is used for
				947	device side enqueueing. Only
				948	available for device side
				949	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	950	=================== ============== ========= ==============================
				951
				952	..
				953
				954	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				955	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				956
				957	================= ============== ========= ================================
				958	String Key Value Type Required? Description
				959	================= ============== ========= ================================
				960	"Name" string Kernel argument name.
				961	"TypeName" string Kernel argument type name.
				962	"Size" integer Required Kernel argument size in bytes.
				963	"Align" integer Required Kernel argument alignment in
				964	bytes. Must be a power of two.
				965	"ValueKind" string Required Kernel argument kind that
				966	specifies how to set up the
				967	corresponding argument.
				968	Values include:
				969
				970	"ByValue"
				971	The argument is copied
				972	directly into the kernarg.
				973
				974	"GlobalBuffer"
				975	A global address space pointer
				976	to the buffer data is passed
				977	in the kernarg.
				978
				979	"DynamicSharedPointer"
				980	A group address space pointer
				981	to dynamically allocated LDS
				982	is passed in the kernarg.
				983
				984	"Sampler"
				985	A global address space
				986	pointer to a S# is passed in
				987	the kernarg.
				988
				989	"Image"
				990	A global address space
				991	pointer to a T# is passed in
				992	the kernarg.
				993
				994	"Pipe"
				995	A global address space pointer
				996	to an OpenCL pipe is passed in
				997	the kernarg.
				998
				999	"Queue"
				1000	A global address space pointer
				1001	to an OpenCL device enqueue
				1002	queue is passed in the
				1003	kernarg.
				1004
				1005	"HiddenGlobalOffsetX"
				1006	The OpenCL grid dispatch
				1007	global offset for the X
				1008	dimension is passed in the
				1009	kernarg.
				1010
				1011	"HiddenGlobalOffsetY"
				1012	The OpenCL grid dispatch
				1013	global offset for the Y
				1014	dimension is passed in the
				1015	kernarg.
				1016
				1017	"HiddenGlobalOffsetZ"
				1018	The OpenCL grid dispatch
				1019	global offset for the Z
				1020	dimension is passed in the
				1021	kernarg.
				1022
				1023	"HiddenNone"
				1024	An argument that is not used
				1025	by the kernel. Space needs to
				1026	be left for it, but it does
				1027	not need to be set up.
				1028
				1029	"HiddenPrintfBuffer"
				1030	A global address space pointer
				1031	to the runtime printf buffer
				1032	is passed in kernarg.
				1033
				1034	"HiddenDefaultQueue"
				1035	A global address space pointer
				1036	to the OpenCL device enqueue
				1037	queue that should be used by
				1038	the kernel by default is
				1039	passed in the kernarg.
				1040
				1041	"HiddenCompletionAction"
Yaxun Liu	c928f2a	2017-10-30 14:30:28 +0000	[diff] [blame]	1042	A global address space pointer
				1043	to help link enqueued kernels into
				1044	the ancestor tree for determining
				1045	when the parent kernel has finished.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1046
				1047	"ValueType" string Required Kernel argument value type. Only
				1048	present if "ValueKind" is
				1049	"ByValue". For vector data
				1050	types, the value is for the
				1051	element type. Values include:
				1052
				1053	- "Struct"
				1054	- "I8"
				1055	- "U8"
				1056	- "I16"
				1057	- "U16"
				1058	- "F16"
				1059	- "I32"
				1060	- "U32"
				1061	- "F32"
				1062	- "I64"
				1063	- "U64"
				1064	- "F64"
				1065
				1066	.. TODO
				1067	How can it be determined if a
				1068	vector type, and what size
				1069	vector?
				1070	"PointeeAlign" integer Alignment in bytes of pointee
				1071	type for pointer type kernel
				1072	argument. Must be a power
				1073	of 2. Only present if
				1074	"ValueKind" is
				1075	"DynamicSharedPointer".
				1076	"AddrSpaceQual" string Kernel argument address space
				1077	qualifier. Only present if
				1078	"ValueKind" is "GlobalBuffer" or
				1079	"DynamicSharedPointer". Values
				1080	are:
				1081
				1082	- "Private"
				1083	- "Global"
				1084	- "Constant"
				1085	- "Local"
				1086	- "Generic"
				1087	- "Region"
				1088
				1089	.. TODO
				1090	Is GlobalBuffer only Global
				1091	or Constant? Is
				1092	DynamicSharedPointer always
				1093	Local? Can HCC allow Generic?
				1094	How can Private or Region
				1095	ever happen?
				1096	"AccQual" string Kernel argument access
				1097	qualifier. Only present if
				1098	"ValueKind" is "Image" or
				1099	"Pipe". Values
				1100	are:
				1101
				1102	- "ReadOnly"
				1103	- "WriteOnly"
				1104	- "ReadWrite"
				1105
				1106	.. TODO
				1107	Does this apply to
				1108	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1109	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1110	performed by the kernel on the
				1111	kernel argument. Only present if
				1112	"ValueKind" is "GlobalBuffer",
				1113	"Image", or "Pipe". This may be
				1114	more restrictive than indicated
				1115	by "AccQual" to reflect what the
				1116	kernel actual does. If not
				1117	present then the runtime must
				1118	assume what is implied by
				1119	"AccQual" and "IsConst". Values
				1120	are:
				1121
				1122	- "ReadOnly"
				1123	- "WriteOnly"
				1124	- "ReadWrite"
				1125
				1126	"IsConst" boolean Indicates if the kernel argument
				1127	is const qualified. Only present
				1128	if "ValueKind" is
				1129	"GlobalBuffer".
				1130
				1131	"IsRestrict" boolean Indicates if the kernel argument
				1132	is restrict qualified. Only
				1133	present if "ValueKind" is
				1134	"GlobalBuffer".
				1135
				1136	"IsVolatile" boolean Indicates if the kernel argument
				1137	is volatile qualified. Only
				1138	present if "ValueKind" is
				1139	"GlobalBuffer".
				1140
				1141	"IsPipe" boolean Indicates if the kernel argument
				1142	is pipe qualified. Only present
				1143	if "ValueKind" is "Pipe".
				1144
				1145	.. TODO
				1146	Can GlobalBuffer be pipe
				1147	qualified?
				1148	================= ============== ========= ================================
				1149
				1150	..
				1151
				1152	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1153	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1154
				1155	============================ ============== ========= =====================
				1156	String Key Value Type Required? Description
				1157	============================ ============== ========= =====================
				1158	"KernargSegmentSize" integer Required The size in bytes of
				1159	the kernarg segment
				1160	that holds the values
				1161	of the arguments to
				1162	the kernel.
				1163	"GroupSegmentFixedSize" integer Required The amount of group
				1164	segment memory
				1165	required by a
				1166	work-group in
				1167	bytes. This does not
				1168	include any
				1169	dynamically allocated
				1170	group segment memory
				1171	that may be added
				1172	when the kernel is
				1173	dispatched.
				1174	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1175	private address space
				1176	memory required for a
				1177	work-item in
				1178	bytes. If
				1179	IsDynamicCallstack
				1180	is 1 then additional
				1181	space must be added
				1182	to this value for the
				1183	call stack.
				1184	"KernargSegmentAlign" integer Required The maximum byte
				1185	alignment of
				1186	arguments in the
				1187	kernarg segment. Must
				1188	be a power of 2.
				1189	"WavefrontSize" integer Required Wavefront size. Must
				1190	be a power of 2.
				1191	"NumSGPRs" integer Number of scalar
				1192	registers used by a
				1193	wavefront for
				1194	GFX6-GFX9. This
				1195	includes the special
				1196	SGPRs for VCC, Flat
				1197	Scratch (GFX7-GFX9)
				1198	and XNACK (for
				1199	GFX8-GFX9). It does
				1200	not include the 16
				1201	SGPR added if a trap
				1202	handler is
				1203	enabled. It is not
				1204	rounded up to the
				1205	allocation
				1206	granularity.
				1207	"NumVGPRs" integer Number of vector
				1208	registers used by
				1209	each work-item for
				1210	GFX6-GFX9
Konstantin Zhuravlyov	8d5e9e1	2017-10-18 17:31:09 +0000	[diff] [blame]	1211	"MaxFlatWorkGroupSize" integer Maximum flat
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1212	work-group size
				1213	supported by the
				1214	kernel in work-items.
				1215	"IsDynamicCallStack" boolean Indicates if the
				1216	generated machine
				1217	code is using a
				1218	dynamically sized
				1219	call stack.
				1220	"IsXNACKEnabled" boolean Indicates if the
				1221	generated machine
				1222	code is capable of
				1223	supporting XNACK.
				1224	============================ ============== ========= =====================
				1225
				1226	..
				1227
				1228	.. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
				1229	:name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
				1230
				1231	=================================== ============== ========= ==============
				1232	String Key Value Type Required? Description
				1233	=================================== ============== ========= ==============
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1234	"DebuggerABIVersion" sequence of
				1235	2 integers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1236	"ReservedNumVGPRs" integer
				1237	"ReservedFirstVGPR" integer
				1238	"PrivateSegmentBufferSGPR" integer
				1239	"WavefrontPrivateSegmentOffsetSGPR" integer
				1240	=================================== ============== ========= ==============
				1241
				1242	.. TODO
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1243	Plan to remove the debug properties metadata.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1244
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1245	Kernel Dispatch
				1246	~~~~~~~~~~~~~~~
				1247
				1248	The HSA architected queuing language (AQL) defines a user space memory interface
				1249	that can be used to control the dispatch of kernels, in an agent independent
				1250	way. An agent can have zero or more AQL queues created for it using the ROCm
				1251	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1252	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1253	mechanics and packet layouts.
				1254
				1255	The packet processor of a kernel agent is responsible for detecting and
				1256	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1257	packet processor is implemented by the hardware command processor (CP),
				1258	asynchronous dispatch controller (ADC) and shader processor input controller
				1259	(SPI).
				1260
				1261	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1262	mode driver to initialize and register the AQL queue with CP.
				1263
				1264	To dispatch a kernel the following actions are performed. This can occur in the
				1265	CPU host program, or from an HSA kernel executing on a GPU.
				1266
				1267	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1268	executed is obtained.
				1269	2. A pointer to the kernel descriptor (see
				1270	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1271	obtained. It must be for a kernel that is contained in a code object that that
				1272	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1273	associated.
				1274	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1275	for a memory region with the kernarg property for the kernel agent that will
				1276	execute the kernel. It must be at least 16 byte aligned.
				1277	4. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	1278	allocation. The layout is defined in the HSA Programmer's Language Reference
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1279	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1280	memory in the same way constant memory is accessed. (Note that the HSA
				1281	specification allows an implementation to copy the kernel argument contents to
				1282	another location that is accessed by the kernel.)
				1283	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1284	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1285	packet. The packet must be set up, and the final write must use an atomic
				1286	store release to set the packet kind to ensure the packet contents are
				1287	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1288	notify the kernel agent that the AQL queue has been updated. These rules, and
				1289	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1290	System Architecture Specification* [HSA]_.
				1291	6. A kernel dispatch packet includes information about the actual dispatch,
				1292	such as grid and work-group size, together with information from the code
				1293	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1294	the kernel symbol can be used to obtain the code object values which are
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1295	recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1296	7. CP executes micro-code and is responsible for detecting and setting up the
				1297	GPU to execute the wavefronts of a kernel dispatch.
				1298	8. CP ensures that when the a wavefront starts executing the kernel machine
				1299	code, the scalar general purpose registers (SGPR) and vector general purpose
				1300	registers (VGPR) are set up as required by the machine code. The required
				1301	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1302	register state is defined in
				1303	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1304	9. The prolog of the kernel machine code (see
				1305	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1306	before continuing executing the machine code that corresponds to the kernel.
				1307	10. When the kernel dispatch has completed execution, CP signals the completion
				1308	signal specified in the kernel dispatch packet if not 0.
				1309
				1310	.. _amdgpu-amdhsa-memory-spaces:
				1311
				1312	Memory Spaces
				1313	~~~~~~~~~~~~~
				1314
				1315	The memory space properties are:
				1316
				1317	.. table:: AMDHSA Memory Spaces
				1318	:name: amdgpu-amdhsa-memory-spaces-table
				1319
				1320	================= =========== ======== ======= ==================
				1321	Memory Space Name HSA Segment Hardware Address NULL Value
				1322	Name Name Size
				1323	================= =========== ======== ======= ==================
				1324	Private private scratch 32 0x00000000
				1325	Local group LDS 32 0xFFFFFFFF
				1326	Global global global 64 0x0000000000000000
				1327	Constant constant *same as 64 0x0000000000000000
				1328	global*
				1329	Generic flat flat 64 0x0000000000000000
				1330	Region N/A GDS 32 *not implemented
				1331	for AMDHSA*
				1332	================= =========== ======== ======= ==================
				1333
				1334	The global and constant memory spaces both use global virtual addresses, which
				1335	are the same virtual address space used by the CPU. However, some virtual
				1336	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1337	and some by both.
				1338
				1339	Using the constant memory space indicates that the data will not change during
				1340	the execution of the kernel. This allows scalar read instructions to be
				1341	used. The vector and scalar L1 caches are invalidated of volatile data before
				1342	each kernel dispatch execution to allow constant memory to change values between
				1343	kernel dispatches.
				1344
				1345	The local memory space uses the hardware Local Data Store (LDS) which is
				1346	automatically allocated when the hardware creates work-groups of wavefronts, and
				1347	freed when all the wavefronts of a work-group have terminated. The data store
				1348	(DS) instructions can be used to access it.
				1349
				1350	The private memory space uses the hardware scratch memory support. If the kernel
				1351	uses scratch, then the hardware allocates memory that is accessed using
				1352	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1353	address to physical address is:
				1354
				1355	``wavefront-scratch-base +
				1356	(private-address * wavefront-size * 4) +
				1357	(wavefront-lane-id * 4)``
				1358
				1359	There are different ways that the wavefront scratch base address is determined
				1360	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1361	memory can be accessed in an interleaved manner using buffer instruction with
				1362	the scratch buffer descriptor and per wave scratch offset, by the scratch
				1363	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1364	same private address, the interleaving results in adjacent dwords being accessed
				1365	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1366	supported except by flat and scratch instructions in GFX9.
				1367
				1368	The generic address space uses the hardware flat address support available in
				1369	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1370	local appertures), that are outside the range of addressible global memory, to
				1371	map from a flat address to a private or local address.
				1372
				1373	FLAT instructions can take a flat address and access global, private (scratch)
				1374	and group (LDS) memory depending in if the address is within one of the
				1375	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1376	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1377	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1378	(see :ref:`amdgpu-amdhsa-m0`).
				1379
				1380	To convert between a segment address and a flat address the base address of the
				1381	appertures address can be used. For GFX7-GFX8 these are available in the
				1382	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1383	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1384	GFX9 the appature base addresses are directly available as inline constant
				1385	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1386	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1387	which makes it easier to convert from flat to segment or segment to flat.
				1388
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1389	Image and Samplers
				1390	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1391
				1392	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1393	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1394	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1395	enumeration values for the queries that are not trivially deducible from the S#
				1396	representation.
				1397
				1398	HSA Signals
				1399	~~~~~~~~~~~
				1400
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1401	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1402	structure allocated in memory accessible from both the CPU and GPU. The
				1403	structure is defined by the ROCm runtime and subject to change between releases
				1404	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1405
				1406	.. _amdgpu-amdhsa-hsa-aql-queue:
				1407
				1408	HSA AQL Queue
				1409	~~~~~~~~~~~~~
				1410
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1411	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1412	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1413	fields needed to implement certain language features such as the flat address
				1414	aperture bases. It also contains fields used by CP such as managing the
				1415	allocation of scratch memory.
				1416
				1417	.. _amdgpu-amdhsa-kernel-descriptor:
				1418
				1419	Kernel Descriptor
				1420	~~~~~~~~~~~~~~~~~
				1421
				1422	A kernel descriptor consists of the information needed by CP to initiate the
				1423	execution of a kernel, including the entry point address of the machine code
				1424	that implements the kernel.
				1425
				1426	Kernel Descriptor for GFX6-GFX9
				1427	+++++++++++++++++++++++++++++++
				1428
				1429	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1430
				1431	.. table:: Kernel Descriptor for GFX6-GFX9
				1432	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1433
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1434	======= ======= =============================== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1435	Bits Size Field Name Description
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1436	======= ======= =============================== ============================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1437	31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1438	address space memory
				1439	required for a work-group
				1440	in bytes. This does not
				1441	include any dynamically
				1442	allocated local address
				1443	space memory that may be
				1444	added when the kernel is
				1445	dispatched.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1446	63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1447	private address space
				1448	memory required for a
				1449	work-item in bytes. If
				1450	is_dynamic_callstack is 1
				1451	then additional space must
				1452	be added to this value for
				1453	the call stack.
Konstantin Zhuravlyov	8d5e9e1	2017-10-18 17:31:09 +0000	[diff] [blame]	1454	95:64 4 bytes MaxFlatWorkGroupSize Maximum flat work-group
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1455	size supported by the
				1456	kernel in work-items.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1457	96 1 bit IsDynamicCallStack Indicates if the generated
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1458	machine code is using a
				1459	dynamically sized call
				1460	stack.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1461	97 1 bit IsXNACKEnabled Indicates if the generated
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1462	machine code is capable of
				1463	suppoting XNACK.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1464	127:98 30 bits Reserved, must be 0.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1465	191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1466	negative) from base
				1467	address of kernel
				1468	descriptor to kernel's
				1469	entry point instruction
				1470	which must be 256 byte
				1471	aligned.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1472	383:192 24 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1473	bytes
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1474	415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1475	program settings used by
				1476	CP to set up
				1477	``COMPUTE_PGM_RSRC1``
				1478	configuration
				1479	register. See
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1480	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1481	447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1482	program settings used by
				1483	CP to set up
				1484	``COMPUTE_PGM_RSRC2``
				1485	configuration
				1486	register. See
				1487	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1488	448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
				1489	SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1490	(see
				1491	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1492
				1493	The total number of SGPR
				1494	user data registers
				1495	requested must not exceed
				1496	16 and match value in
				1497	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1498	Any requests beyond 16
				1499	will be ignored.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1500	449 1 bit EnableSGPRDispatchPtr see above
				1501	450 1 bit EnableSGPRQueuePtr see above
				1502	451 1 bit EnableSGPRKernargSegmentPtr see above
				1503	452 1 bit EnableSGPRDispatchID see above
				1504	453 1 bit EnableSGPRFlatScratchInit see above
				1505	454 1 bit EnableSGPRPrivateSegmentSize see above
				1506	455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
				1507	should always be 0.
				1508	456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
				1509	should always be 0.
				1510	457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
				1511	should always be 0.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1512	463:458 6 bits Reserved, must be 0.
				1513	511:464 6 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1514	bytes
				1515	512 Total size 64 bytes.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1516	======= ====================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1517
				1518	..
				1519
				1520	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1521	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1522
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1523	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1524	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1525	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1526	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1527	used by each work-item,
				1528	granularity is device
				1529	specific:
				1530
				1531	GFX6-9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1532	- max_vgpr 1..256
				1533	- roundup((max_vgpg + 1)
				1534	/ 4) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1535
				1536	Used by CP to set up
				1537	``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1538	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1539	used by a wavefront,
				1540	granularity is device
				1541	specific:
				1542
				1543	GFX6-8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1544	- max_sgpr 1..112
				1545	- roundup((max_sgpg + 1)
				1546	/ 8) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1547	GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1548	- max_sgpr 1..112
				1549	- roundup((max_sgpg + 1)
				1550	/ 16) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1551
				1552	Includes the special SGPRs
				1553	for VCC, Flat Scratch (for
				1554	GFX7 onwards) and XNACK
				1555	(for GFX8 onwards). It does
				1556	not include the 16 SGPR
				1557	added if a trap handler is
				1558	enabled.
				1559
				1560	Used by CP to set up
				1561	``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1562	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1563
				1564	Start executing wavefront
				1565	at the specified priority.
				1566
				1567	CP is responsible for
				1568	filling in
				1569	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1570	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1571	with specified rounding
				1572	mode for single (32
				1573	bit) floating point
				1574	precision floating point
				1575	operations.
				1576
				1577	Floating point rounding
				1578	mode values are defined in
				1579	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1580
				1581	Used by CP to set up
				1582	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1583	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1584	with specified rounding
				1585	denorm mode for half/double (16
				1586	and 64 bit) floating point
				1587	precision floating point
				1588	operations.
				1589
				1590	Floating point rounding
				1591	mode values are defined in
				1592	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1593
				1594	Used by CP to set up
				1595	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1596	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1597	with specified denorm mode
				1598	for single (32
				1599	bit) floating point
				1600	precision floating point
				1601	operations.
				1602
				1603	Floating point denorm mode
				1604	values are defined in
				1605	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1606
				1607	Used by CP to set up
				1608	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1609	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1610	with specified denorm mode
				1611	for half/double (16
				1612	and 64 bit) floating point
				1613	precision floating point
				1614	operations.
				1615
				1616	Floating point denorm mode
				1617	values are defined in
				1618	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1619
				1620	Used by CP to set up
				1621	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1622	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1623
				1624	Start executing wavefront
				1625	in privilege trap handler
				1626	mode.
				1627
				1628	CP is responsible for
				1629	filling in
				1630	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1631	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1632	with DX10 clamp mode
				1633	enabled. Used by the vector
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1634	ALU to force DX10 style
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1635	treatment of NaN's (when
				1636	set, clamp NaN to zero,
				1637	otherwise pass NaN
				1638	through).
				1639
				1640	Used by CP to set up
				1641	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1642	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1643
				1644	Start executing wavefront
				1645	in single step mode.
				1646
				1647	CP is responsible for
				1648	filling in
				1649	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1650	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1651	with IEEE mode
				1652	enabled. Floating point
				1653	opcodes that support
				1654	exception flag gathering
				1655	will quiet and propagate
				1656	signaling-NaN inputs per
				1657	IEEE 754-2008. Min_dx10 and
				1658	max_dx10 become IEEE
				1659	754-2008 compliant due to
				1660	signaling-NaN propagation
				1661	and quieting.
				1662
				1663	Used by CP to set up
				1664	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1665	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1666
				1667	Only one work-group allowed
				1668	to execute on a compute
				1669	unit.
				1670
				1671	CP is responsible for
				1672	filling in
				1673	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1674	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1675
				1676	Flag that can be used to
				1677	control debugging code.
				1678
				1679	CP is responsible for
				1680	filling in
				1681	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1682	26 1 bit FP16_OVFL GFX6-8
				1683	Reserved, must be 0.
				1684	GFX9
				1685	Wavefront starts execution
				1686	with specified fp16 overflow
				1687	mode.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1688
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1689	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1690	+/-INF values.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1691	- If 1, fp16 overflow that is the
				1692	result of an +/-INF input value
				1693	or divide by 0 produces a +/-INF,
				1694	otherwise clamps computed
				1695	overflow to +/-MAX_FP16 as
				1696	appropriate.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1697
				1698	Used by CP to set up
				1699	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1700	31:27 5 bits Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1701	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1702	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1703
				1704	..
				1705
				1706	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1707	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1708
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1709	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1710	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1711	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1712	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
				1713	_WAVE_OFFSET SGPR wave scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1714	system register (see
				1715	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1716
				1717	Used by CP to set up
				1718	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1719	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1720	user data registers
				1721	requested. This number must
				1722	match the number of user
				1723	data registers enabled.
				1724
				1725	Used by CP to set up
				1726	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1727	6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1728	TRAP instruction which
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1729	requires a trap handler to
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1730	be enabled.
				1731
				1732	CP sets
				1733	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
				1734	if the runtime has
				1735	installed a trap handler
				1736	regardless of the setting
				1737	of this field.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1738	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1739	system SGPR register for
				1740	the work-group id in the X
				1741	dimension (see
				1742	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1743
				1744	Used by CP to set up
				1745	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1746	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1747	system SGPR register for
				1748	the work-group id in the Y
				1749	dimension (see
				1750	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1751
				1752	Used by CP to set up
				1753	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1754	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1755	system SGPR register for
				1756	the work-group id in the Z
				1757	dimension (see
				1758	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1759
				1760	Used by CP to set up
				1761	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1762	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1763	system SGPR register for
				1764	work-group information (see
				1765	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1766
				1767	Used by CP to set up
				1768	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1769	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1770	VGPR system registers used
				1771	for the work-item ID.
				1772	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1773	defines the values.
				1774
				1775	Used by CP to set up
				1776	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1777	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1778
				1779	Wavefront starts execution
				1780	with address watch
				1781	exceptions enabled which
				1782	are generated when L1 has
				1783	witnessed a thread access
				1784	an *address of
				1785	interest*.
				1786
				1787	CP is responsible for
				1788	filling in the address
				1789	watch bit in
				1790	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1791	according to what the
				1792	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1793	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1794
				1795	Wavefront starts execution
				1796	with memory violation
				1797	exceptions exceptions
				1798	enabled which are generated
				1799	when a memory violation has
				1800	occurred for this wave from
				1801	L1 or LDS
				1802	(write-to-read-only-memory,
				1803	mis-aligned atomic, LDS
				1804	address out of range,
				1805	illegal address, etc.).
				1806
				1807	CP sets the memory
				1808	violation bit in
				1809	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1810	according to what the
				1811	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1812	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1813
				1814	CP uses the rounded value
				1815	from the dispatch packet,
				1816	not this value, as the
				1817	dispatch may contain
				1818	dynamically allocated group
				1819	segment memory. CP writes
				1820	directly to
				1821	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1822
				1823	Amount of group segment
				1824	(LDS) to allocate for each
				1825	work-group. Granularity is
				1826	device specific:
				1827
				1828	GFX6:
				1829	roundup(lds-size / (64 * 4))
				1830	GFX7-GFX9:
				1831	roundup(lds-size / (128 * 4))
				1832
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1833	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				1834	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1835	enabled.
				1836
				1837	Used by CP to set up
				1838	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1839	(set from bits 0..6).
				1840
				1841	IEEE 754 FP Invalid
				1842	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1843	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				1844	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1845	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1846	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				1847	_DIVISION_BY_ZERO Zero
				1848	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
				1849	_OVERFLOW
				1850	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				1851	_UNDERFLOW
				1852	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				1853	_INEXACT
				1854	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				1855	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1856	only)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1857	31 1 bit Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1858	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1859	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1860
				1861	..
				1862
				1863	.. table:: Floating Point Rounding Mode Enumeration Values
				1864	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1865
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1866	====================================== ===== ==============================
				1867	Enumeration Name Value Description
				1868	====================================== ===== ==============================
				1869	AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				1870	AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				1871	AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				1872	AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
				1873	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1874
				1875	..
				1876
				1877	.. table:: Floating Point Denorm Mode Enumeration Values
				1878	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				1879
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1880	====================================== ===== ==============================
				1881	Enumeration Name Value Description
				1882	====================================== ===== ==============================
				1883	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
				1884	Denorms
				1885	AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				1886	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				1887	AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
				1888	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1889
				1890	..
				1891
				1892	.. table:: System VGPR Work-Item ID Enumeration Values
				1893	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				1894
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1895	======================================== ===== ============================
				1896	Enumeration Name Value Description
				1897	======================================== ===== ============================
				1898	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
				1899	ID.
				1900	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				1901	dimensions ID.
				1902	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				1903	dimensions ID.
				1904	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				1905	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1906
				1907	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				1908
				1909	Initial Kernel Execution State
				1910	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1911
				1912	This section defines the register state that will be set up by the packet
				1913	processor prior to the start of execution of every wavefront. This is limited by
				1914	the constraints of the hardware controllers of CP/ADC/SPI.
				1915
				1916	The order of the SGPR registers is defined, but the compiler can specify which
				1917	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				1918	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				1919	for enabled registers are dense starting at SGPR0: the first enabled register is
				1920	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				1921	an SGPR number.
				1922
				1923	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
				1924	all waves of the grid. It is possible to specify more than 16 User SGPRs using
				1925	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				1926	initialized. These are then immediately followed by the System SGPRs that are
				1927	set up by ADC/SPI and can have different values for each wave of the grid
				1928	dispatch.
				1929
				1930	SGPR register initial state is defined in
				1931	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				1932
				1933	.. table:: SGPR Register Set Up Order
				1934	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				1935
				1936	========== ========================== ====== ==============================
				1937	SGPR Order Name Number Description
				1938	(kernel descriptor enable of
				1939	field) SGPRs
				1940	========== ========================== ====== ==============================
				1941	First Private Segment Buffer 4 V# that can be used, together
				1942	(enable_sgpr_private with Scratch Wave Offset as an
				1943	_segment_buffer) offset, to access the private
				1944	memory space using a segment
				1945	address.
				1946
				1947	CP uses the value provided by
				1948	the runtime.
				1949	then Dispatch Ptr 2 64 bit address of AQL dispatch
				1950	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				1951	actually executing.
				1952	then Queue Ptr 2 64 bit address of amd_queue_t
				1953	(enable_sgpr_queue_ptr) object for AQL queue on which
				1954	the dispatch packet was
				1955	queued.
				1956	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				1957	(enable_sgpr_kernarg segment. This is directly
				1958	_segment_ptr) copied from the
				1959	kernarg_address in the kernel
				1960	dispatch packet.
				1961
				1962	Having CP load it once avoids
				1963	loading it at the beginning of
				1964	every wavefront.
				1965	then Dispatch Id 2 64 bit Dispatch ID of the
				1966	(enable_sgpr_dispatch_id) dispatch packet being
				1967	executed.
				1968	then Flat Scratch Init 2 This is 2 SGPRs:
				1969	(enable_sgpr_flat_scratch
				1970	_init) GFX6
				1971	Not supported.
				1972	GFX7-GFX8
				1973	The first SGPR is a 32 bit
				1974	byte offset from
				1975	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				1976	to per SPI base of memory
				1977	for scratch for the queue
				1978	executing the kernel
				1979	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1980	from the runtime. (The
				1981	Scratch Segment Buffer base
				1982	address is
				1983	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				1984	plus this offset.) The value
				1985	of Scratch Wave Offset must
				1986	be added to this offset by
				1987	the kernel machine code,
				1988	right shifted by 8, and
				1989	moved to the FLAT_SCRATCH_HI
				1990	SGPR register.
				1991	FLAT_SCRATCH_HI corresponds
				1992	to SGPRn-4 on GFX7, and
				1993	SGPRn-6 on GFX8 (where SGPRn
				1994	is the highest numbered SGPR
				1995	allocated to the wave).
				1996	FLAT_SCRATCH_HI is
				1997	multiplied by 256 (as it is
				1998	in units of 256 bytes) and
				1999	added to
				2000	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2001	to calculate the per wave
				2002	FLAT SCRATCH BASE in flat
				2003	memory instructions that
				2004	access the scratch
				2005	apperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2006
				2007	The second SGPR is 32 bit
				2008	byte size of a single
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2009	work-item's scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2010	usage. CP obtains this from
				2011	the runtime, and it is
				2012	always a multiple of DWORD.
				2013	CP checks that the value in
				2014	the kernel dispatch packet
				2015	Private Segment Byte Size is
				2016	not larger, and requests the
				2017	runtime to increase the
				2018	queue's scratch size if
				2019	necessary. The kernel code
				2020	must move it to
				2021	FLAT_SCRATCH_LO which is
				2022	SGPRn-3 on GFX7 and SGPRn-5
				2023	on GFX8. FLAT_SCRATCH_LO is
				2024	used as the FLAT SCRATCH
				2025	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2026	instructions. Having CP load
				2027	it once avoids loading it at
				2028	the beginning of every
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2029	wavefront. GFX9 This is the
				2030	64 bit base address of the
				2031	per SPI scratch backing
				2032	memory managed by SPI for
				2033	the queue executing the
				2034	kernel dispatch. CP obtains
				2035	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2036	divides it if there are
				2037	multiple Shader Arrays each
				2038	with its own SPI). The value
				2039	of Scratch Wave Offset must
				2040	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2041	machine code and the result
				2042	moved to the FLAT_SCRATCH
				2043	SGPR which is SGPRn-6 and
				2044	SGPRn-5. It is used as the
				2045	FLAT SCRATCH BASE in flat
				2046	memory instructions. then
				2047	Private Segment Size 1 The
				2048	32 bit byte size of a
				2049	(enable_sgpr_private single
				2050	work-item's
				2051	scratch_segment_size) memory
				2052	allocation. This is the
				2053	value from the kernel
				2054	dispatch packet Private
				2055	Segment Byte Size rounded up
				2056	by CP to a multiple of
				2057	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2058
				2059	Having CP load it once avoids
				2060	loading it at the beginning of
				2061	every wavefront.
				2062
				2063	This is not used for
				2064	GFX7-GFX8 since it is the same
				2065	value as the second SGPR of
				2066	Flat Scratch Init. However, it
				2067	may be needed for GFX9 which
				2068	changes the meaning of the
				2069	Flat Scratch Init value.
				2070	then Grid Work-Group Count X 1 32 bit count of the number of
				2071	(enable_sgpr_grid work-groups in the X dimension
				2072	_workgroup_count_X) for the grid being
				2073	executed. Computed from the
				2074	fields in the kernel dispatch
				2075	packet as ((grid_size.x +
				2076	workgroup_size.x - 1) /
				2077	workgroup_size.x).
				2078	then Grid Work-Group Count Y 1 32 bit count of the number of
				2079	(enable_sgpr_grid work-groups in the Y dimension
				2080	_workgroup_count_Y && for the grid being
				2081	less than 16 previous executed. Computed from the
				2082	SGPRs) fields in the kernel dispatch
				2083	packet as ((grid_size.y +
				2084	workgroup_size.y - 1) /
				2085	workgroupSize.y).
				2086
				2087	Only initialized if <16
				2088	previous SGPRs initialized.
				2089	then Grid Work-Group Count Z 1 32 bit count of the number of
				2090	(enable_sgpr_grid work-groups in the Z dimension
				2091	_workgroup_count_Z && for the grid being
				2092	less than 16 previous executed. Computed from the
				2093	SGPRs) fields in the kernel dispatch
				2094	packet as ((grid_size.z +
				2095	workgroup_size.z - 1) /
				2096	workgroupSize.z).
				2097
				2098	Only initialized if <16
				2099	previous SGPRs initialized.
				2100	then Work-Group Id X 1 32 bit work-group id in X
				2101	(enable_sgpr_workgroup_id dimension of grid for
				2102	_X) wavefront.
				2103	then Work-Group Id Y 1 32 bit work-group id in Y
				2104	(enable_sgpr_workgroup_id dimension of grid for
				2105	_Y) wavefront.
				2106	then Work-Group Id Z 1 32 bit work-group id in Z
				2107	(enable_sgpr_workgroup_id dimension of grid for
				2108	_Z) wavefront.
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2109	then Work-Group Info 1 {first_wave, 14'b0000,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2110	(enable_sgpr_workgroup ordered_append_term[10:0],
				2111	_info) threadgroup_size_in_waves[5:0]}
				2112	then Scratch Wave Offset 1 32 bit byte offset from base
				2113	(enable_sgpr_private of scratch base of queue
				2114	_segment_wave_offset) executing the kernel
				2115	dispatch. Must be used as an
				2116	offset with Private
				2117	segment address when using
				2118	Scratch Segment Buffer. It
				2119	must be used to set up FLAT
				2120	SCRATCH for flat addressing
				2121	(see
				2122	:ref:`amdgpu-amdhsa-flat-scratch`).
				2123	========== ========================== ====== ==============================
				2124
				2125	The order of the VGPR registers is defined, but the compiler can specify which
				2126	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2127	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2128	for enabled registers are dense starting at VGPR0: the first enabled register is
				2129	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2130	VGPR number.
				2131
				2132	VGPR register initial state is defined in
				2133	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2134
				2135	.. table:: VGPR Register Set Up Order
				2136	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2137
				2138	========== ========================== ====== ==============================
				2139	VGPR Order Name Number Description
				2140	(kernel descriptor enable of
				2141	field) VGPRs
				2142	========== ========================== ====== ==============================
				2143	First Work-Item Id X 1 32 bit work item id in X
				2144	(Always initialized) dimension of work-group for
				2145	wavefront lane.
				2146	then Work-Item Id Y 1 32 bit work item id in Y
				2147	(enable_vgpr_workitem_id dimension of work-group for
				2148	> 0) wavefront lane.
				2149	then Work-Item Id Z 1 32 bit work item id in Z
				2150	(enable_vgpr_workitem_id dimension of work-group for
				2151	> 1) wavefront lane.
				2152	========== ========================== ====== ==============================
				2153
				2154	The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
				2155
				2156	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2157	registers.
				2158	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2159	combination including none.
				2160	3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
				2161	cannot included with the flat scratch init value which is per queue.
				2162	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2163	or (X, Y, Z).
				2164
				2165	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2166	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2167
				2168	The global segment can be accessed either using buffer instructions (GFX6 which
				2169	has V# 64 bit address support), flat instructions (GFX7-9), or global
				2170	instructions (GFX9).
				2171
				2172	If buffer operations are used then the compiler can generate a V# with the
				2173	following properties:
				2174
				2175	* base address of 0
				2176	* no swizzle
				2177	* ATC: 1 if IOMMU present (such as APU)
				2178	* ptr64: 1
				2179	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2180	APU and NC for dGPU).
				2181
				2182	.. _amdgpu-amdhsa-kernel-prolog:
				2183
				2184	Kernel Prolog
				2185	~~~~~~~~~~~~~
				2186
				2187	.. _amdgpu-amdhsa-m0:
				2188
				2189	M0
				2190	++
				2191
				2192	GFX6-GFX8
				2193	The M0 register must be initialized with a value at least the total LDS size
				2194	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2195	available in dispatch packet. For M0, it is also possible to use maximum
				2196	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2197	GFX7-GFX8).
				2198	GFX9
				2199	The M0 register is not used for range checking LDS accesses and so does not
				2200	need to be initialized in the prolog.
				2201
				2202	.. _amdgpu-amdhsa-flat-scratch:
				2203
				2204	Flat Scratch
				2205	++++++++++++
				2206
				2207	If the kernel may use flat operations to access scratch memory, the prolog code
				2208	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
				2209	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
				2210	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2211
				2212	GFX6
				2213	Flat scratch is not supported.
				2214
				2215	GFX7-8
				2216	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2217	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2218	being managed by SPI for the queue executing the kernel dispatch. This is
				2219	the same value used in the Scratch Segment Buffer V# base address. The
				2220	prolog must add the value of Scratch Wave Offset to get the wave's byte
				2221	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2222	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2223	by 8 before moving into FLAT_SCRATCH_LO.
				2224	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2225	work-items scratch memory usage. This is directly loaded from the kernel
				2226	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2227	DWORD. Having CP load it once avoids loading it at the beginning of every
				2228	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2229	SIZE.
				2230	GFX9
				2231	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2232	memory being managed by SPI for the queue executing the kernel dispatch. The
				2233	prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
				2234	pair for use as the flat scratch base in flat memory instructions.
				2235
				2236	.. _amdgpu-amdhsa-memory-model:
				2237
				2238	Memory Model
				2239	~~~~~~~~~~~~
				2240
				2241	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2242	(see :ref:`memmodel`). The implementation is WIP.
				2243
				2244	.. TODO
				2245	Update when implementation complete.
				2246
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2247	The AMDGPU backend supports the memory synchronization scopes specified in
				2248	:ref:`amdgpu-memory-scopes`.
				2249
				2250	The code sequences used to implement the memory model are defined in table
				2251	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2252
				2253	The sequences specify the order of instructions that a single thread must
				2254	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2255	to other memory instructions executed by the same thread. This allows them to be
				2256	moved earlier or later which can allow them to be combined with other instances
				2257	of the same instruction, or hoisted/sunk out of loops to improve
				2258	performance. Only the instructions related to the memory model are given;
				2259	additional ``s_waitcnt`` instructions are required to ensure registers are
				2260	defined before being used. These may be able to be combined with the memory
				2261	model ``s_waitcnt`` instructions as described above.
				2262
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2263	The AMDGPU backend supports the following memory models:
				2264
				2265	HSA Memory Model [HSA]_
				2266	The HSA memory model uses a single happens-before relation for all address
				2267	spaces (see :ref:`amdgpu-address-spaces`).
				2268	OpenCL Memory Model [OpenCL]_
				2269	The OpenCL memory model which has separate happens-before relations for the
				2270	global and local address spaces. Only a fence specifying both global and
				2271	local address space, and seq_cst instructions join the relationships. Since
				2272	the LLVM ``memfence`` instruction does not allow an address space to be
				2273	specified the OpenCL fence has to convervatively assume both local and
				2274	global address space was specified. However, optimizations can often be
				2275	done to eliminate the additional ``s_waitcnt`` instructions when there are
				2276	no intervening memory instructions which access the corresponding address
				2277	space. The code sequences in the table indicate what can be omitted for the
				2278	OpenCL memory. The target triple environment is used to determine if the
				2279	source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2280
				2281	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2282	operations.
				2283
				2284	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2285	termed vector memory operations.
				2286
				2287	For GFX6-GFX9:
				2288
				2289	* Each agent has multiple compute units (CU).
				2290	* Each CU has multiple SIMDs that execute wavefronts.
				2291	* The wavefronts for a single work-group are executed in the same CU but may be
				2292	executed by different SIMDs.
				2293	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2294	executing on it.
				2295	* All LDS operations of a CU are performed as wavefront wide operations in a
				2296	global order and involve no caching. Completion is reported to a wavefront in
				2297	execution order.
				2298	* The LDS memory has multiple request queues shared by the SIMDs of a
				2299	CU. Therefore, the LDS operations performed by different waves of a work-group
				2300	can be reordered relative to each other, which can result in reordering the
				2301	visibility of vector memory operations with respect to LDS operations of other
				2302	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2303	ensure synchronization between LDS operations and vector memory operations
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2304	between waves of a work-group, but not between operations performed by the
				2305	same wavefront.
				2306	* The vector memory operations are performed as wavefront wide operations and
				2307	completion is reported to a wavefront in execution order. The exception is
				2308	that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
				2309	vector memory order if they access LDS memory, and out of LDS operation order
				2310	if they access global memory.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2311	* The vector memory operations access a single vector L1 cache shared by all
				2312	SIMDs a CU. Therefore, no special action is required for coherence between the
				2313	lanes of a single wavefront, or for coherence between wavefronts in the same
				2314	work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
				2315	executing in different work-groups as they may be executing on different CUs.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2316	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2317	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2318	scalar operations are used in a restricted way so do not impact the memory
				2319	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2320	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2321	the same agent.
				2322	* The L2 cache has independent channels to service disjoint ranges of virtual
				2323	addresses.
				2324	* Each CU has a separate request queue per channel. Therefore, the vector and
				2325	scalar memory operations performed by waves executing in different work-groups
				2326	(which may be executing on different CUs) of an agent can be reordered
				2327	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2328	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2329	previous vector memory operation has completed before executing a subsequent
				2330	vector memory or LDS operation and so can be used to meet the requirements of
				2331	acquire and release.
				2332	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2333	of virtual addresses can be set up to bypass it to ensure system coherence.
				2334
				2335	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
				2336	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2337	memory, atomic memory orderings are not meaningful and all accesses are treated
				2338	as non-atomic.
				2339
				2340	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2341	scalar memory instructions). Since the constant address space contents do not
				2342	change during the execution of a kernel dispatch it is not legal to perform
				2343	stores, and atomic memory orderings are not meaningful and all access are
				2344	treated as non-atomic.
				2345
				2346	A memory synchronization scope wider than work-group is not meaningful for the
				2347	group (LDS) address space and is treated as work-group.
				2348
				2349	The memory model does not support the region address space which is treated as
				2350	non-atomic.
				2351
				2352	Acquire memory ordering is not meaningful on store atomic instructions and is
				2353	treated as non-atomic.
				2354
				2355	Release memory ordering is not meaningful on load atomic instructions and is
				2356	treated a non-atomic.
				2357
				2358	Acquire-release memory ordering is not meaningful on load or store atomic
				2359	instructions and is treated as acquire and release respectively.
				2360
				2361	AMDGPU backend only uses scalar memory operations to access memory that is
				2362	proven to not change during the execution of the kernel dispatch. This includes
				2363	constant address space and global address space for program scope const
				2364	variables. Therefore the kernel machine code does not have to maintain the
				2365	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2366	and vector L1 caches are invalidated between kernel dispatches by CP since
				2367	constant address space data may change between kernel dispatch executions. See
				2368	:ref:`amdgpu-amdhsa-memory-spaces`.
				2369
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2370	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2371	case the AMDGPU backend ensures the memory location used to spill is never
				2372	accessed by vector memory operations at the same time. If scalar writes are used
				2373	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2374	return since the locations may be used for vector memory instructions by a
				2375	future wave that uses the same scratch area, or a function call that creates a
				2376	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2377	as all scalar writes are write-before-read in the same thread.
				2378
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2379	Scratch backing memory (which is used for the private address space)
				2380	is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
				2381	address space is only accessed by a single thread, and is always
				2382	write-before-read, there is never a need to invalidate these entries from the L1
				2383	cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
				2384	volatile cache lines.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2385
				2386	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2387	to invalidate the L2 cache. This also causes it to be treated as
				2388	non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
				2389	(cache coherent) and so the L2 cache will coherent with the CPU and other
				2390	agents.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2391
				2392	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2393	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2394
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2395	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2396	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2397	Ordering Sync Scope Address
				2398	Space
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2399	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2400	Non-Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2401	-----------------------------------------------------------------------------------
				2402	load none none - global - !volatile & !nontemporal
				2403	- generic
				2404	- private 1. buffer/global/flat_load
				2405	- constant
				2406	- volatile & !nontemporal
				2407
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2408	1. buffer/global/flat_load
				2409	glc=1
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2410
				2411	- nontemporal
				2412
				2413	1. buffer/global/flat_load
				2414	glc=1 slc=1
				2415
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2416	load none none - local 1. ds_load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2417	store none none - global - !nontemporal
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2418	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2419	- private 1. buffer/global/flat_store
				2420	- constant
				2421	- nontemporal
				2422
				2423	1. buffer/global/flat_stote
				2424	glc=1 slc=1
				2425
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2426	store none none - local 1. ds_store
				2427	Unordered Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2428	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2429	load atomic unordered any any Same as non-atomic.
				2430	store atomic unordered any any Same as non-atomic.
				2431	atomicrmw unordered any any *Same as monotonic
				2432	atomic*.
				2433	Monotonic Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2434	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2435	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2436	- wavefront - generic
				2437	- workgroup
				2438	load atomic monotonic - singlethread - local 1. ds_load
				2439	- wavefront
				2440	- workgroup
				2441	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2442	- system - generic glc=1
				2443	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2444	- wavefront - generic
				2445	- workgroup
				2446	- agent
				2447	- system
				2448	store atomic monotonic - singlethread - local 1. ds_store
				2449	- wavefront
				2450	- workgroup
				2451	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2452	- wavefront - generic
				2453	- workgroup
				2454	- agent
				2455	- system
				2456	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2457	- wavefront
				2458	- workgroup
				2459	Acquire Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2460	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2461	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2462	- wavefront - local
				2463	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2464	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				2465	load atomic acquire - workgroup - local 1. ds_load
				2466	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2467
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2468	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2469	- Must happen before
				2470	any following
				2471	global/generic
				2472	load/load
				2473	atomic/store/store
				2474	atomic/atomicrmw.
				2475	- Ensures any
				2476	following global
				2477	data read is no
				2478	older than the load
				2479	atomic value being
				2480	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2481	load atomic acquire - workgroup - generic 1. flat_load
				2482	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2483
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2484	- If OpenCL, omit.
				2485	- Must happen before
				2486	any following
				2487	global/generic
				2488	load/load
				2489	atomic/store/store
				2490	atomic/atomicrmw.
				2491	- Ensures any
				2492	following global
				2493	data read is no
				2494	older than the load
				2495	atomic value being
				2496	acquired.
				2497	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2498	- system glc=1
				2499	2. s_waitcnt vmcnt(0)
				2500
				2501	- Must happen before
				2502	following
				2503	buffer_wbinvl1_vol.
				2504	- Ensures the load
				2505	has completed
				2506	before invalidating
				2507	the cache.
				2508
				2509	3. buffer_wbinvl1_vol
				2510
				2511	- Must happen before
				2512	any following
				2513	global/generic
				2514	load/load
				2515	atomic/atomicrmw.
				2516	- Ensures that
				2517	following
				2518	loads will not see
				2519	stale global data.
				2520
				2521	load atomic acquire - agent - generic 1. flat_load glc=1
				2522	- system 2. s_waitcnt vmcnt(0) &
				2523	lgkmcnt(0)
				2524
				2525	- If OpenCL omit
				2526	lgkmcnt(0).
				2527	- Must happen before
				2528	following
				2529	buffer_wbinvl1_vol.
				2530	- Ensures the flat_load
				2531	has completed
				2532	before invalidating
				2533	the cache.
				2534
				2535	3. buffer_wbinvl1_vol
				2536
				2537	- Must happen before
				2538	any following
				2539	global/generic
				2540	load/load
				2541	atomic/atomicrmw.
				2542	- Ensures that
				2543	following loads
				2544	will not see stale
				2545	global data.
				2546
				2547	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2548	- wavefront - local
				2549	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2550	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				2551	atomicrmw acquire - workgroup - local 1. ds_atomic
				2552	2. waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2553
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2554	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2555	- Must happen before
				2556	any following
				2557	global/generic
				2558	load/load
				2559	atomic/store/store
				2560	atomic/atomicrmw.
				2561	- Ensures any
				2562	following global
				2563	data read is no
				2564	older than the
				2565	atomicrmw value
				2566	being acquired.
				2567
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2568	atomicrmw acquire - workgroup - generic 1. flat_atomic
				2569	2. waitcnt lgkmcnt(0)
				2570
				2571	- If OpenCL, omit.
				2572	- Must happen before
				2573	any following
				2574	global/generic
				2575	load/load
				2576	atomic/store/store
				2577	atomic/atomicrmw.
				2578	- Ensures any
				2579	following global
				2580	data read is no
				2581	older than the
				2582	atomicrmw value
				2583	being acquired.
				2584
				2585	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2586	- system 2. s_waitcnt vmcnt(0)
				2587
				2588	- Must happen before
				2589	following
				2590	buffer_wbinvl1_vol.
				2591	- Ensures the
				2592	atomicrmw has
				2593	completed before
				2594	invalidating the
				2595	cache.
				2596
				2597	3. buffer_wbinvl1_vol
				2598
				2599	- Must happen before
				2600	any following
				2601	global/generic
				2602	load/load
				2603	atomic/atomicrmw.
				2604	- Ensures that
				2605	following loads
				2606	will not see stale
				2607	global data.
				2608
				2609	atomicrmw acquire - agent - generic 1. flat_atomic
				2610	- system 2. s_waitcnt vmcnt(0) &
				2611	lgkmcnt(0)
				2612
				2613	- If OpenCL, omit
				2614	lgkmcnt(0).
				2615	- Must happen before
				2616	following
				2617	buffer_wbinvl1_vol.
				2618	- Ensures the
				2619	atomicrmw has
				2620	completed before
				2621	invalidating the
				2622	cache.
				2623
				2624	3. buffer_wbinvl1_vol
				2625
				2626	- Must happen before
				2627	any following
				2628	global/generic
				2629	load/load
				2630	atomic/atomicrmw.
				2631	- Ensures that
				2632	following loads
				2633	will not see stale
				2634	global data.
				2635
				2636	fence acquire - singlethread none none
				2637	- wavefront
				2638	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2639
				2640	- If OpenCL and
				2641	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2642	not generic, omit.
				2643	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2644	currently has no
				2645	address space on
				2646	the fence need to
				2647	conservatively
				2648	always generate. If
				2649	fence had an
				2650	address space then
				2651	set to address
				2652	space of OpenCL
				2653	fence flag, or to
				2654	generic if both
				2655	local and global
				2656	flags are
				2657	specified.
				2658	- Must happen after
				2659	any preceding
				2660	local/generic load
				2661	atomic/atomicrmw
				2662	with an equal or
				2663	wider sync scope
				2664	and memory ordering
				2665	stronger than
				2666	unordered (this is
				2667	termed the
				2668	fence-paired-atomic).
				2669	- Must happen before
				2670	any following
				2671	global/generic
				2672	load/load
				2673	atomic/store/store
				2674	atomic/atomicrmw.
				2675	- Ensures any
				2676	following global
				2677	data read is no
				2678	older than the
				2679	value read by the
				2680	fence-paired-atomic.
				2681
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2682	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				2683	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2684
				2685	- If OpenCL and
				2686	address space is
				2687	not generic, omit
				2688	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2689	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2690	currently has no
				2691	address space on
				2692	the fence need to
				2693	conservatively
				2694	always generate
				2695	(see comment for
				2696	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2697	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2698	separate s_waitcnt
				2699	vmcnt(0) and
				2700	s_waitcnt
				2701	lgkmcnt(0) to allow
				2702	them to be
				2703	independently moved
				2704	according to the
				2705	following rules.
				2706	- s_waitcnt vmcnt(0)
				2707	must happen after
				2708	any preceding
				2709	global/generic load
				2710	atomic/atomicrmw
				2711	with an equal or
				2712	wider sync scope
				2713	and memory ordering
				2714	stronger than
				2715	unordered (this is
				2716	termed the
				2717	fence-paired-atomic).
				2718	- s_waitcnt lgkmcnt(0)
				2719	must happen after
				2720	any preceding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2721	local/generic load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2722	atomic/atomicrmw
				2723	with an equal or
				2724	wider sync scope
				2725	and memory ordering
				2726	stronger than
				2727	unordered (this is
				2728	termed the
				2729	fence-paired-atomic).
				2730	- Must happen before
				2731	the following
				2732	buffer_wbinvl1_vol.
				2733	- Ensures that the
				2734	fence-paired atomic
				2735	has completed
				2736	before invalidating
				2737	the
				2738	cache. Therefore
				2739	any following
				2740	locations read must
				2741	be no older than
				2742	the value read by
				2743	the
				2744	fence-paired-atomic.
				2745
				2746	2. buffer_wbinvl1_vol
				2747
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2748	- Must happen before any
				2749	following global/generic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2750	load/load
				2751	atomic/store/store
				2752	atomic/atomicrmw.
				2753	- Ensures that
				2754	following loads
				2755	will not see stale
				2756	global data.
				2757
				2758	Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2759	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2760	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2761	- wavefront - local
				2762	- generic
				2763	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2764
				2765	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2766	- Must happen after
				2767	any preceding
				2768	local/generic
				2769	load/store/load
				2770	atomic/store
				2771	atomic/atomicrmw.
				2772	- Must happen before
				2773	the following
				2774	store.
				2775	- Ensures that all
				2776	memory operations
				2777	to local have
				2778	completed before
				2779	performing the
				2780	store that is being
				2781	released.
				2782
				2783	2. buffer/global/flat_store
				2784	store atomic release - workgroup - local 1. ds_store
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2785	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2786
				2787	- If OpenCL, omit.
				2788	- Must happen after
				2789	any preceding
				2790	local/generic
				2791	load/store/load
				2792	atomic/store
				2793	atomic/atomicrmw.
				2794	- Must happen before
				2795	the following
				2796	store.
				2797	- Ensures that all
				2798	memory operations
				2799	to local have
				2800	completed before
				2801	performing the
				2802	store that is being
				2803	released.
				2804
				2805	2. flat_store
				2806	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2807	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2808
				2809	- If OpenCL, omit
				2810	lgkmcnt(0).
				2811	- Could be split into
				2812	separate s_waitcnt
				2813	vmcnt(0) and
				2814	s_waitcnt
				2815	lgkmcnt(0) to allow
				2816	them to be
				2817	independently moved
				2818	according to the
				2819	following rules.
				2820	- s_waitcnt vmcnt(0)
				2821	must happen after
				2822	any preceding
				2823	global/generic
				2824	load/store/load
				2825	atomic/store
				2826	atomic/atomicrmw.
				2827	- s_waitcnt lgkmcnt(0)
				2828	must happen after
				2829	any preceding
				2830	local/generic
				2831	load/store/load
				2832	atomic/store
				2833	atomic/atomicrmw.
				2834	- Must happen before
				2835	the following
				2836	store.
				2837	- Ensures that all
				2838	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2839	to memory have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2840	completed before
				2841	performing the
				2842	store that is being
				2843	released.
				2844
				2845	2. buffer/global/ds/flat_store
				2846	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2847	- wavefront - local
				2848	- generic
				2849	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2850
				2851	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2852	- Must happen after
				2853	any preceding
				2854	local/generic
				2855	load/store/load
				2856	atomic/store
				2857	atomic/atomicrmw.
				2858	- Must happen before
				2859	the following
				2860	atomicrmw.
				2861	- Ensures that all
				2862	memory operations
				2863	to local have
				2864	completed before
				2865	performing the
				2866	atomicrmw that is
				2867	being released.
				2868
				2869	2. buffer/global/flat_atomic
				2870	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2871	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2872
				2873	- If OpenCL, omit.
				2874	- Must happen after
				2875	any preceding
				2876	local/generic
				2877	load/store/load
				2878	atomic/store
				2879	atomic/atomicrmw.
				2880	- Must happen before
				2881	the following
				2882	atomicrmw.
				2883	- Ensures that all
				2884	memory operations
				2885	to local have
				2886	completed before
				2887	performing the
				2888	atomicrmw that is
				2889	being released.
				2890
				2891	2. flat_atomic
				2892	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2893	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2894
				2895	- If OpenCL, omit
				2896	lgkmcnt(0).
				2897	- Could be split into
				2898	separate s_waitcnt
				2899	vmcnt(0) and
				2900	s_waitcnt
				2901	lgkmcnt(0) to allow
				2902	them to be
				2903	independently moved
				2904	according to the
				2905	following rules.
				2906	- s_waitcnt vmcnt(0)
				2907	must happen after
				2908	any preceding
				2909	global/generic
				2910	load/store/load
				2911	atomic/store
				2912	atomic/atomicrmw.
				2913	- s_waitcnt lgkmcnt(0)
				2914	must happen after
				2915	any preceding
				2916	local/generic
				2917	load/store/load
				2918	atomic/store
				2919	atomic/atomicrmw.
				2920	- Must happen before
				2921	the following
				2922	atomicrmw.
				2923	- Ensures that all
				2924	memory operations
				2925	to global and local
				2926	have completed
				2927	before performing
				2928	the atomicrmw that
				2929	is being released.
				2930
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2931	2. buffer/global/ds/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2932	fence release - singlethread none none
				2933	- wavefront
				2934	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				2935
				2936	- If OpenCL and
				2937	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2938	not generic, omit.
				2939	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2940	currently has no
				2941	address space on
				2942	the fence need to
				2943	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2944	always generate. If
				2945	fence had an
				2946	address space then
				2947	set to address
				2948	space of OpenCL
				2949	fence flag, or to
				2950	generic if both
				2951	local and global
				2952	flags are
				2953	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2954	- Must happen after
				2955	any preceding
				2956	local/generic
				2957	load/load
				2958	atomic/store/store
				2959	atomic/atomicrmw.
				2960	- Must happen before
				2961	any following store
				2962	atomic/atomicrmw
				2963	with an equal or
				2964	wider sync scope
				2965	and memory ordering
				2966	stronger than
				2967	unordered (this is
				2968	termed the
				2969	fence-paired-atomic).
				2970	- Ensures that all
				2971	memory operations
				2972	to local have
				2973	completed before
				2974	performing the
				2975	following
				2976	fence-paired-atomic.
				2977
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2978	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				2979	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2980
				2981	- If OpenCL and
				2982	address space is
				2983	not generic, omit
				2984	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2985	- If OpenCL and
				2986	address space is
				2987	local, omit
				2988	vmcnt(0).
				2989	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2990	currently has no
				2991	address space on
				2992	the fence need to
				2993	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2994	always generate. If
				2995	fence had an
				2996	address space then
				2997	set to address
				2998	space of OpenCL
				2999	fence flag, or to
				3000	generic if both
				3001	local and global
				3002	flags are
				3003	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3004	- Could be split into
				3005	separate s_waitcnt
				3006	vmcnt(0) and
				3007	s_waitcnt
				3008	lgkmcnt(0) to allow
				3009	them to be
				3010	independently moved
				3011	according to the
				3012	following rules.
				3013	- s_waitcnt vmcnt(0)
				3014	must happen after
				3015	any preceding
				3016	global/generic
				3017	load/store/load
				3018	atomic/store
				3019	atomic/atomicrmw.
				3020	- s_waitcnt lgkmcnt(0)
				3021	must happen after
				3022	any preceding
				3023	local/generic
				3024	load/store/load
				3025	atomic/store
				3026	atomic/atomicrmw.
				3027	- Must happen before
				3028	any following store
				3029	atomic/atomicrmw
				3030	with an equal or
				3031	wider sync scope
				3032	and memory ordering
				3033	stronger than
				3034	unordered (this is
				3035	termed the
				3036	fence-paired-atomic).
				3037	- Ensures that all
				3038	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3039	have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3040	completed before
				3041	performing the
				3042	following
				3043	fence-paired-atomic.
				3044
				3045	Acquire-Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3046	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3047	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3048	- wavefront - local
				3049	- generic
				3050	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3051
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3052	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3053	- Must happen after
				3054	any preceding
				3055	local/generic
				3056	load/store/load
				3057	atomic/store
				3058	atomic/atomicrmw.
				3059	- Must happen before
				3060	the following
				3061	atomicrmw.
				3062	- Ensures that all
				3063	memory operations
				3064	to local have
				3065	completed before
				3066	performing the
				3067	atomicrmw that is
				3068	being released.
				3069
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3070	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3071	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3072	2. s_waitcnt lgkmcnt(0)
				3073
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3074	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3075	- Must happen before
				3076	any following
				3077	global/generic
				3078	load/load
				3079	atomic/store/store
				3080	atomic/atomicrmw.
				3081	- Ensures any
				3082	following global
				3083	data read is no
				3084	older than the load
				3085	atomic value being
				3086	acquired.
				3087
				3088	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3089
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3090	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3091	- Must happen after
				3092	any preceding
				3093	local/generic
				3094	load/store/load
				3095	atomic/store
				3096	atomic/atomicrmw.
				3097	- Must happen before
				3098	the following
				3099	atomicrmw.
				3100	- Ensures that all
				3101	memory operations
				3102	to local have
				3103	completed before
				3104	performing the
				3105	atomicrmw that is
				3106	being released.
				3107
				3108	2. flat_atomic
				3109	3. s_waitcnt lgkmcnt(0)
				3110
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3111	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3112	- Must happen before
				3113	any following
				3114	global/generic
				3115	load/load
				3116	atomic/store/store
				3117	atomic/atomicrmw.
				3118	- Ensures any
				3119	following global
				3120	data read is no
				3121	older than the load
				3122	atomic value being
				3123	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3124
				3125	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3126	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3127
				3128	- If OpenCL, omit
				3129	lgkmcnt(0).
				3130	- Could be split into
				3131	separate s_waitcnt
				3132	vmcnt(0) and
				3133	s_waitcnt
				3134	lgkmcnt(0) to allow
				3135	them to be
				3136	independently moved
				3137	according to the
				3138	following rules.
				3139	- s_waitcnt vmcnt(0)
				3140	must happen after
				3141	any preceding
				3142	global/generic
				3143	load/store/load
				3144	atomic/store
				3145	atomic/atomicrmw.
				3146	- s_waitcnt lgkmcnt(0)
				3147	must happen after
				3148	any preceding
				3149	local/generic
				3150	load/store/load
				3151	atomic/store
				3152	atomic/atomicrmw.
				3153	- Must happen before
				3154	the following
				3155	atomicrmw.
				3156	- Ensures that all
				3157	memory operations
				3158	to global have
				3159	completed before
				3160	performing the
				3161	atomicrmw that is
				3162	being released.
				3163
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3164	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3165	3. s_waitcnt vmcnt(0)
				3166
				3167	- Must happen before
				3168	following
				3169	buffer_wbinvl1_vol.
				3170	- Ensures the
				3171	atomicrmw has
				3172	completed before
				3173	invalidating the
				3174	cache.
				3175
				3176	4. buffer_wbinvl1_vol
				3177
				3178	- Must happen before
				3179	any following
				3180	global/generic
				3181	load/load
				3182	atomic/atomicrmw.
				3183	- Ensures that
				3184	following loads
				3185	will not see stale
				3186	global data.
				3187
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3188	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3189	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3190
				3191	- If OpenCL, omit
				3192	lgkmcnt(0).
				3193	- Could be split into
				3194	separate s_waitcnt
				3195	vmcnt(0) and
				3196	s_waitcnt
				3197	lgkmcnt(0) to allow
				3198	them to be
				3199	independently moved
				3200	according to the
				3201	following rules.
				3202	- s_waitcnt vmcnt(0)
				3203	must happen after
				3204	any preceding
				3205	global/generic
				3206	load/store/load
				3207	atomic/store
				3208	atomic/atomicrmw.
				3209	- s_waitcnt lgkmcnt(0)
				3210	must happen after
				3211	any preceding
				3212	local/generic
				3213	load/store/load
				3214	atomic/store
				3215	atomic/atomicrmw.
				3216	- Must happen before
				3217	the following
				3218	atomicrmw.
				3219	- Ensures that all
				3220	memory operations
				3221	to global have
				3222	completed before
				3223	performing the
				3224	atomicrmw that is
				3225	being released.
				3226
				3227	2. flat_atomic
				3228	3. s_waitcnt vmcnt(0) &
				3229	lgkmcnt(0)
				3230
				3231	- If OpenCL, omit
				3232	lgkmcnt(0).
				3233	- Must happen before
				3234	following
				3235	buffer_wbinvl1_vol.
				3236	- Ensures the
				3237	atomicrmw has
				3238	completed before
				3239	invalidating the
				3240	cache.
				3241
				3242	4. buffer_wbinvl1_vol
				3243
				3244	- Must happen before
				3245	any following
				3246	global/generic
				3247	load/load
				3248	atomic/atomicrmw.
				3249	- Ensures that
				3250	following loads
				3251	will not see stale
				3252	global data.
				3253
				3254	fence acq_rel - singlethread none none
				3255	- wavefront
				3256	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3257
				3258	- If OpenCL and
				3259	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3260	not generic, omit.
				3261	- However,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3262	since LLVM
				3263	currently has no
				3264	address space on
				3265	the fence need to
				3266	conservatively
				3267	always generate
				3268	(see comment for
				3269	previous fence).
				3270	- Must happen after
				3271	any preceding
				3272	local/generic
				3273	load/load
				3274	atomic/store/store
				3275	atomic/atomicrmw.
				3276	- Must happen before
				3277	any following
				3278	global/generic
				3279	load/load
				3280	atomic/store/store
				3281	atomic/atomicrmw.
				3282	- Ensures that all
				3283	memory operations
				3284	to local have
				3285	completed before
				3286	performing any
				3287	following global
				3288	memory operations.
				3289	- Ensures that the
				3290	preceding
				3291	local/generic load
				3292	atomic/atomicrmw
				3293	with an equal or
				3294	wider sync scope
				3295	and memory ordering
				3296	stronger than
				3297	unordered (this is
				3298	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3299	acquire-fence-paired-atomic
				3300	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3301	before following
				3302	global memory
				3303	operations. This
				3304	satisfies the
				3305	requirements of
				3306	acquire.
				3307	- Ensures that all
				3308	previous memory
				3309	operations have
				3310	completed before a
				3311	following
				3312	local/generic store
				3313	atomic/atomicrmw
				3314	with an equal or
				3315	wider sync scope
				3316	and memory ordering
				3317	stronger than
				3318	unordered (this is
				3319	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3320	release-fence-paired-atomic
				3321	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3322	requirements of
				3323	release.
				3324
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3325	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				3326	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3327
				3328	- If OpenCL and
				3329	address space is
				3330	not generic, omit
				3331	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3332	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3333	currently has no
				3334	address space on
				3335	the fence need to
				3336	conservatively
				3337	always generate
				3338	(see comment for
				3339	previous fence).
				3340	- Could be split into
				3341	separate s_waitcnt
				3342	vmcnt(0) and
				3343	s_waitcnt
				3344	lgkmcnt(0) to allow
				3345	them to be
				3346	independently moved
				3347	according to the
				3348	following rules.
				3349	- s_waitcnt vmcnt(0)
				3350	must happen after
				3351	any preceding
				3352	global/generic
				3353	load/store/load
				3354	atomic/store
				3355	atomic/atomicrmw.
				3356	- s_waitcnt lgkmcnt(0)
				3357	must happen after
				3358	any preceding
				3359	local/generic
				3360	load/store/load
				3361	atomic/store
				3362	atomic/atomicrmw.
				3363	- Must happen before
				3364	the following
				3365	buffer_wbinvl1_vol.
				3366	- Ensures that the
				3367	preceding
				3368	global/local/generic
				3369	load
				3370	atomic/atomicrmw
				3371	with an equal or
				3372	wider sync scope
				3373	and memory ordering
				3374	stronger than
				3375	unordered (this is
				3376	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3377	acquire-fence-paired-atomic
				3378	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3379	before invalidating
				3380	the cache. This
				3381	satisfies the
				3382	requirements of
				3383	acquire.
				3384	- Ensures that all
				3385	previous memory
				3386	operations have
				3387	completed before a
				3388	following
				3389	global/local/generic
				3390	store
				3391	atomic/atomicrmw
				3392	with an equal or
				3393	wider sync scope
				3394	and memory ordering
				3395	stronger than
				3396	unordered (this is
				3397	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3398	release-fence-paired-atomic
				3399	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3400	requirements of
				3401	release.
				3402
				3403	2. buffer_wbinvl1_vol
				3404
				3405	- Must happen before
				3406	any following
				3407	global/generic
				3408	load/load
				3409	atomic/store/store
				3410	atomic/atomicrmw.
				3411	- Ensures that
				3412	following loads
				3413	will not see stale
				3414	global data. This
				3415	satisfies the
				3416	requirements of
				3417	acquire.
				3418
				3419	Sequential Consistent Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3420	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3421	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3422	- wavefront - local load atomic acquire,
				3423	- generic except must generated
				3424	all instructions even
				3425	for OpenCL.*
				3426	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3427	- generic
				3428	- Must
				3429	happen after
				3430	preceding
				3431	global/generic load
				3432	atomic/store
				3433	atomic/atomicrmw
				3434	with memory
				3435	ordering of seq_cst
				3436	and with equal or
				3437	wider sync scope.
				3438	(Note that seq_cst
				3439	fences have their
				3440	own s_waitcnt
				3441	lgkmcnt(0) and so do
				3442	not need to be
				3443	considered.)
				3444	- Ensures any
				3445	preceding
				3446	sequential
				3447	consistent local
				3448	memory instructions
				3449	have completed
				3450	before executing
				3451	this sequentially
				3452	consistent
				3453	instruction. This
				3454	prevents reordering
				3455	a seq_cst store
				3456	followed by a
				3457	seq_cst load. (Note
				3458	that seq_cst is
				3459	stronger than
				3460	acquire/release as
				3461	the reordering of
				3462	load acquire
				3463	followed by a store
				3464	release is
				3465	prevented by the
				3466	waitcnt of
				3467	the release, but
				3468	there is nothing
				3469	preventing a store
				3470	release followed by
				3471	load acquire from
				3472	competing out of
				3473	order.)
				3474
				3475	2. *Following
				3476	instructions same as
				3477	corresponding load
				3478	atomic acquire,
				3479	except must generated
				3480	all instructions even
				3481	for OpenCL.*
				3482	load atomic seq_cst - workgroup - local *Same as corresponding
				3483	load atomic acquire,
				3484	except must generated
				3485	all instructions even
				3486	for OpenCL.*
				3487	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				3488	- system - generic vmcnt(0)
				3489
				3490	- Could be split into
				3491	separate s_waitcnt
				3492	vmcnt(0)
				3493	and s_waitcnt
				3494	lgkmcnt(0) to allow
				3495	them to be
				3496	independently moved
				3497	according to the
				3498	following rules.
				3499	- waitcnt lgkmcnt(0)
				3500	must happen after
				3501	preceding
				3502	global/generic load
				3503	atomic/store
				3504	atomic/atomicrmw
				3505	with memory
				3506	ordering of seq_cst
				3507	and with equal or
				3508	wider sync scope.
				3509	(Note that seq_cst
				3510	fences have their
				3511	own s_waitcnt
				3512	lgkmcnt(0) and so do
				3513	not need to be
				3514	considered.)
				3515	- waitcnt vmcnt(0)
				3516	must happen after
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3517	preceding
				3518	global/generic load
				3519	atomic/store
				3520	atomic/atomicrmw
				3521	with memory
				3522	ordering of seq_cst
				3523	and with equal or
				3524	wider sync scope.
				3525	(Note that seq_cst
				3526	fences have their
				3527	own s_waitcnt
				3528	vmcnt(0) and so do
				3529	not need to be
				3530	considered.)
				3531	- Ensures any
				3532	preceding
				3533	sequential
				3534	consistent global
				3535	memory instructions
				3536	have completed
				3537	before executing
				3538	this sequentially
				3539	consistent
				3540	instruction. This
				3541	prevents reordering
				3542	a seq_cst store
				3543	followed by a
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3544	seq_cst load. (Note
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3545	that seq_cst is
				3546	stronger than
				3547	acquire/release as
				3548	the reordering of
				3549	load acquire
				3550	followed by a store
				3551	release is
				3552	prevented by the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3553	waitcnt of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3554	the release, but
				3555	there is nothing
				3556	preventing a store
				3557	release followed by
				3558	load acquire from
				3559	competing out of
				3560	order.)
				3561
				3562	2. *Following
				3563	instructions same as
				3564	corresponding load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3565	atomic acquire,
				3566	except must generated
				3567	all instructions even
				3568	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3569	store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3570	- wavefront - local store atomic release,
				3571	- workgroup - generic except must generated
				3572	all instructions even
				3573	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3574	store atomic seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3575	- system - generic store atomic release,
				3576	except must generated
				3577	all instructions even
				3578	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3579	atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3580	- wavefront - local atomicrmw acq_rel,
				3581	- workgroup - generic except must generated
				3582	all instructions even
				3583	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3584	atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3585	- system - generic atomicrmw acq_rel,
				3586	except must generated
				3587	all instructions even
				3588	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3589	fence seq_cst - singlethread none *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3590	- wavefront fence acq_rel,
				3591	- workgroup except must generated
				3592	- agent all instructions even
				3593	- system for OpenCL.*
				3594	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3595
				3596	The memory order also adds the single thread optimization constrains defined in
				3597	table
				3598	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3599
				3600	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3601	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3602
				3603	============ ==============================================================
				3604	LLVM Memory Optimization Constraints
				3605	Ordering
				3606	============ ==============================================================
				3607	unordered none
				3608	monotonic none
				3609	acquire - If a load atomic/atomicrmw then no following load/load
				3610	atomic/store/ store atomic/atomicrmw/fence instruction can
				3611	be moved before the acquire.
				3612	- If a fence then same as load atomic, plus no preceding
				3613	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3614	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3615	atomic/store/ store atomic/atomicrmw/fence instruction can
				3616	be moved after the release.
				3617	- If a fence then same as store atomic, plus no following
				3618	associated fence-paired-atomic can be moved before the
				3619	fence.
				3620	acq_rel Same constraints as both acquire and release.
				3621	seq_cst - If a load atomic then same constraints as acquire, plus no
				3622	preceding sequentially consistent load atomic/store
				3623	atomic/atomicrmw/fence instruction can be moved after the
				3624	seq_cst.
				3625	- If a store atomic then the same constraints as release, plus
				3626	no following sequentially consistent load atomic/store
				3627	atomic/atomicrmw/fence instruction can be moved before the
				3628	seq_cst.
				3629	- If an atomicrmw/fence then same constraints as acq_rel.
				3630	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3631
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3632	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3633	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3634
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3635	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3636	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3637	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3638
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3639	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3640	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3641
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3642	=================== =============== =============== =======================
				3643	Usage Code Sequence Trap Handler Description
				3644	Inputs
				3645	=================== =============== =============== =======================
				3646	reserved ``s_trap 0x00`` Reserved by hardware.
				3647	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3648	``queue_ptr`` ``debugtrap``
				3649	``VGPR0``: intrinsic (not
				3650	``arg`` implemented).
				3651	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3652	``queue_ptr`` terminated and its
				3653	associated queue put
				3654	into the error state.
				3655	``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
				3656	``queue_ptr`` installed handled
				3657	same as ``llvm.trap``.
				3658	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
				3659	breakpoints.
				3660	debugger ``s_trap 0x08`` Reserved for debugger.
				3661	debugger ``s_trap 0xfe`` Reserved for debugger.
				3662	debugger ``s_trap 0xff`` Reserved for debugger.
				3663	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3664
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3665	Unspecified OS
				3666	--------------
				3667
				3668	This section provides code conventions used when the target triple OS is
				3669	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3670
				3671	Trap Handler ABI
				3672	~~~~~~~~~~~~~~~~
				3673
				3674	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3675	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3676	instructions are handled as follows:
				3677
				3678	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3679	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3680
				3681	=============== =============== ===========================================
				3682	Usage Code Sequence Description
				3683	=============== =============== ===========================================
				3684	llvm.trap s_endpgm Causes wavefront to be terminated.
				3685	llvm.debugtrap none Compiler warning given that there is no
				3686	trap handler installed.
				3687	=============== =============== ===========================================
				3688
				3689	Source Languages
				3690	================
				3691
				3692	.. _amdgpu-opencl:
				3693
				3694	OpenCL
				3695	------
				3696
				3697	When generating code for the OpenCL language the target triple environment
				3698	should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
				3699
				3700	When the language is OpenCL the following differences occur:
				3701
				3702	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3703	2. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3704	3. Additional metadata is generated
				3705	(:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3706
				3707	.. TODO
				3708	Specify what affect this has. Hidden arguments added. Additional metadata
				3709	generated.
				3710
				3711	.. _amdgpu-hcc:
				3712
				3713	HCC
				3714	---
				3715
				3716	When generating code for the OpenCL language the target triple environment
				3717	should be ``hcc`` (see :ref:`amdgpu-target-triples`).
				3718
				3719	When the language is OpenCL the following differences occur:
				3720
				3721	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3722
				3723	.. TODO
				3724	Specify what affect this has.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	3725
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3726	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3727	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3728
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3729	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3730	It supports AMDGCN GFX6-GFX8.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3731
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3732	This section describes general syntax for instructions and operands. For more
				3733	information about instructions, their semantics and supported combinations of
				3734	operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	3735	[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3736
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3737	An instruction has the following syntax (register operands are normally
				3738	comma-separated while extra operands are space-separated):
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3739
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3740	<opcode> <register_operand0>, ... <extra_operand0> ...
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3741
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3742	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3743	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3744
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3745	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3746
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3747	* SGPR registers: s0, ... or s[0], ...
				3748	* VGPR registers: v0, ... or v[0], ...
				3749	* TTMP registers: ttmp0, ... or ttmp[0], ...
				3750	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				3751	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				3752	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				3753	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				3754	* Register index expressions: v[2*2], s[1-1:2-1]
				3755	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3756
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3757	The following extra operands are supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3758
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3759	* offset, offset0, offset1
				3760	* idxen, offen bits
				3761	* glc, slc, tfe bits
				3762	* waitcnt: integer or combination of counter values
				3763	* VOP3 modifiers:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3764
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3765	- abs (\\| \\|), neg (\-)
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3766
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3767	* DPP modifiers:
				3768
				3769	- row_shl, row_shr, row_ror, row_rol
				3770	- row_mirror, row_half_mirror, row_bcast
				3771	- wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
				3772	- row_mask, bank_mask, bound_ctrl
				3773
				3774	* SDWA modifiers:
				3775
				3776	- dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
				3777	- dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
				3778	- abs, neg, sext
				3779
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3780	Instruction Examples
				3781	~~~~~~~~~~~~~~~~~~~~
				3782
				3783	DS
				3784	~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3785
				3786	.. code-block:: nasm
				3787
				3788	ds_add_u32 v2, v4 offset:16
				3789	ds_write_src2_b64 v2 offset0:4 offset1:8
				3790	ds_cmpst_f32 v2, v4, v6
				3791	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				3792
				3793
				3794	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				3795
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3796	FLAT
				3797	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3798
				3799	.. code-block:: nasm
				3800
				3801	flat_load_dword v1, v[3:4]
				3802	flat_store_dwordx3 v[3:4], v[5:7]
				3803	flat_atomic_swap v1, v[3:4], v5 glc
				3804	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				3805	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				3806
				3807	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				3808
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3809	MUBUF
				3810	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3811
				3812	.. code-block:: nasm
				3813
				3814	buffer_load_dword v1, off, s[4:7], s1
				3815	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				3816	buffer_store_format_xy v[1:2], off, s[4:7], s1
				3817	buffer_wbinvl1
				3818	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				3819
				3820	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				3821
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3822	SMRD/SMEM
				3823	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3824
				3825	.. code-block:: nasm
				3826
				3827	s_load_dword s1, s[2:3], 0xfc
				3828	s_load_dwordx8 s[8:15], s[2:3], s4
				3829	s_load_dwordx16 s[88:103], s[2:3], s4
				3830	s_dcache_inv_vol
				3831	s_memtime s[4:5]
				3832
				3833	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				3834
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3835	SOP1
				3836	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3837
				3838	.. code-block:: nasm
				3839
				3840	s_mov_b32 s1, s2
				3841	s_mov_b64 s[0:1], 0x80000000
				3842	s_cmov_b32 s1, 200
				3843	s_wqm_b64 s[2:3], s[4:5]
				3844	s_bcnt0_i32_b64 s1, s[2:3]
				3845	s_swappc_b64 s[2:3], s[4:5]
				3846	s_cbranch_join s[4:5]
				3847
				3848	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				3849
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3850	SOP2
				3851	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3852
				3853	.. code-block:: nasm
				3854
				3855	s_add_u32 s1, s2, s3
				3856	s_and_b64 s[2:3], s[4:5], s[6:7]
				3857	s_cselect_b32 s1, s2, s3
				3858	s_andn2_b32 s2, s4, s6
				3859	s_lshr_b64 s[2:3], s[4:5], s6
				3860	s_ashr_i32 s2, s4, s6
				3861	s_bfm_b64 s[2:3], s4, s6
				3862	s_bfe_i64 s[2:3], s[4:5], s6
				3863	s_cbranch_g_fork s[4:5], s[6:7]
				3864
				3865	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				3866
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3867	SOPC
				3868	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3869
				3870	.. code-block:: nasm
				3871
				3872	s_cmp_eq_i32 s1, s2
				3873	s_bitcmp1_b32 s1, s2
				3874	s_bitcmp0_b64 s[2:3], s4
				3875	s_setvskip s3, s5
				3876
				3877	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				3878
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3879	SOPP
				3880	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3881
				3882	.. code-block:: nasm
				3883
				3884	s_barrier
				3885	s_nop 2
				3886	s_endpgm
				3887	s_waitcnt 0 ; Wait for all counters to be 0
				3888	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				3889	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				3890	s_sethalt 9
				3891	s_sleep 10
				3892	s_sendmsg 0x1
				3893	s_sendmsg sendmsg(MSG_INTERRUPT)
				3894	s_trap 1
				3895
				3896	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				3897
				3898	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	3899	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3900	range or acceptable values.
				3901
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3902	VALU
				3903	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3904
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3905	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				3906	the assembler will automatically use optimal encoding based on its operands.
				3907	To force specific encoding, one can add a suffix to the opcode of the instruction:
				3908
				3909	* _e32 for 32-bit VOP1/VOP2/VOPC
				3910	* _e64 for 64-bit VOP3
				3911	* _dpp for VOP_DPP
				3912	* _sdwa for VOP_SDWA
				3913
				3914	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3915
				3916	.. code-block:: nasm
				3917
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3918	v_mov_b32 v1, v2
				3919	v_mov_b32_e32 v1, v2
				3920	v_nop
				3921	v_cvt_f64_i32_e32 v[1:2], v2
				3922	v_floor_f32_e32 v1, v2
				3923	v_bfrev_b32_e32 v1, v2
				3924	v_add_f32_e32 v1, v2, v3
				3925	v_mul_i32_i24_e64 v1, v2, 3
				3926	v_mul_i32_i24_e32 v1, -3, v3
				3927	v_mul_i32_i24_e32 v1, -100, v3
				3928	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				3929	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3930
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3931	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3932
				3933	.. code-block:: nasm
				3934
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3935	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				3936	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3937	v_mov_b32 v0, v0 wave_shl:1
				3938	v_mov_b32 v0, v0 row_mirror
				3939	v_mov_b32 v0, v0 row_bcast:31
				3940	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3941	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3942	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3943
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3944	VOP_SDWA examples:
				3945
				3946	.. code-block:: nasm
				3947
				3948	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				3949	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				3950	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				3951	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				3952	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				3953
				3954	For full list of supported instructions, refer to "Vector ALU instructions".
				3955
				3956	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3957	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3958
				3959	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				3960	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3961
				3962	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3963	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3964
				3965	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3966	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3967
				3968	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3969	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				3970
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3971
				3972	major, minor, and stepping are all integers that describe the instruction
				3973	set architecture (ISA) version of the assembly program.
				3974
				3975	vendor and arch are quoted strings. vendor should always be equal to
				3976	"AMD" and arch should always be equal to "AMDGPU".
				3977
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3978	By default, the assembler will derive the ISA version, vendor, and arch
				3979	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3980
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3981	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3982	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3983
				3984	This directives specifies that the symbol with given name is a kernel entry point
				3985	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3986
				3987	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3988	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3989
				3990	This directive marks the beginning of a list of key / value pairs that are used
				3991	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				3992	The list must be terminated by the .end_amd_kernel_code_t directive. For
				3993	any amd_kernel_code_t values that are unspecified a default value will be
				3994	used. The default value for all keys is 0, with the following exceptions:
				3995
				3996	- kernel_code_version_major defaults to 1.
				3997	- machine_kind defaults to 1.
				3998	- machine_version_major, machine_version_minor, and
				3999	machine_version_stepping are derived from the value of the -mcpu option
				4000	that is passed to the assembler.
				4001	- kernel_code_entry_byte_offset defaults to 256.
				4002	- wavefront_size defaults to 6.
				4003	- kernarg_segment_alignment, group_segment_alignment, and
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	4004	private_segment_alignment default to 4. Note that alignments are specified
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4005	as a power of two, so a value of n means an alignment of 2^ n.
				4006
				4007	The .amd_kernel_code_t directive must be placed immediately after the
				4008	function label and before any instructions.
				4009
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4010	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				4011	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4012
				4013	Here is an example of a minimal amd_kernel_code_t specification:
				4014
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	4015	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4016
				4017	.hsa_code_object_version 1,0
				4018	.hsa_code_object_isa
				4019
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4020	.hsatext
				4021	.globl hello_world
				4022	.p2align 8
				4023	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4024
				4025	hello_world:
				4026
				4027	.amd_kernel_code_t
				4028	enable_sgpr_kernarg_segment_ptr = 1
				4029	is_ptr64 = 1
				4030	compute_pgm_rsrc1_vgprs = 0
				4031	compute_pgm_rsrc1_sgprs = 0
				4032	compute_pgm_rsrc2_user_sgpr = 2
				4033	kernarg_segment_byte_size = 8
				4034	wavefront_sgpr_count = 2
				4035	workitem_vgpr_count = 3
				4036	.end_amd_kernel_code_t
				4037
				4038	s_load_dwordx2 s[0:1], s[0:1] 0x0
				4039	v_mov_b32 v0, 3.14159
				4040	s_waitcnt lgkmcnt(0)
				4041	v_mov_b32 v1, s0
				4042	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4043	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4044	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	4045	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4046	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4047
				4048	Additional Documentation
				4049	========================
				4050
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4051	.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				4052	.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				4053	.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				4054	.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				4055	.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				4056	.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				4057	.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
				4058	.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4059	.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
				4060	.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
				4061	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				4062	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				4063	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				4064	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				4065	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	4066	.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4067	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				4068	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
				4069	.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__