Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: 268abc2ff866b7fe550948876f2b4a59a33943dc [file] [log] [blame]

Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1	=============================
				2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
				26	.. table:: AMDGPU Target Triples
				27	:name: amdgpu-target-triples-table
				28
				29	============ ======== ========= ===========
				30	Architecture Vendor OS Environment
				31	============ ======== ========= ===========
				32	r600 amd <empty> <empty>
				33	amdgcn amd <empty> <empty>
				34	amdgcn amd amdhsa <empty>
				35	amdgcn amd amdhsa opencl
				36	amdgcn amd amdhsa amdgizcl
				37	amdgcn amd amdhsa amdgiz
				38	amdgcn amd amdhsa hcc
				39	============ ======== ========= ===========
				40
				41	``r600-amd--``
				42	Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
				43	the MESA runtime.
				44
				45	``amdgcn-amd--``
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	46	Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	47	the MESA runtime.
				48
				49	``amdgcn-amd-amdhsa-``
				50	Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
				51	compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
				52
				53	``amdgcn-amd-amdhsa-opencl``
				54	Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
				55	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
				56	:ref:`amdgpu-opencl`.
				57
				58	``amdgcn-amd-amdhsa-amdgizcl``
				59	Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
				60	is used (see :ref:`amdgpu-address-spaces`).
				61
				62	``amdgcn-amd-amdhsa-amdgiz``
				63	Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
				64	used (see :ref:`amdgpu-address-spaces`).
				65
				66	``amdgcn-amd-amdhsa-hcc``
				67	Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
				68	executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
				69	:ref:`amdgpu-hcc`.
				70
				71	.. _amdgpu-processors:
				72
				73	Processors
				74	----------
				75
				76	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				77	names from both the Processor and Alternative Processor can be used.
				78
				79	.. table:: AMDGPU Processors
				80	:name: amdgpu-processors-table
				81
				82	========== =========== ============ ===== ======= ==================
				83	Processor Alternative Target dGPU/ Runtime Example
				84	Processor Triple APU Support Products
				85	Architecture
				86	========== =========== ============ ===== ======= ==================
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	87	Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	88	--------------------------------------------------------------------
				89	r600 r600 dGPU
				90	r630 r600 dGPU
				91	rs880 r600 dGPU
				92	rv670 r600 dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	93	Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	94	--------------------------------------------------------------------
				95	rv710 r600 dGPU
				96	rv730 r600 dGPU
				97	rv770 r600 dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	98	Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	99	--------------------------------------------------------------------
				100	cedar r600 dGPU
				101	redwood r600 dGPU
				102	sumo r600 dGPU
				103	juniper r600 dGPU
				104	cypress r600 dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	105	Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	106	--------------------------------------------------------------------
				107	barts r600 dGPU
				108	turks r600 dGPU
				109	caicos r600 dGPU
				110	cayman r600 dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	111	GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	112	--------------------------------------------------------------------
Konstantin Zhuravlyov	6cbcb27	2017-08-08 04:28:31 +0000	[diff] [blame]	113	gfx600 - tahiti amdgcn dGPU
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	114	gfx601 - pitcairn amdgcn dGPU
				115	- verde
				116	- oland
				117	- hainan
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	118	GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	119	--------------------------------------------------------------------
				120	gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
				121	- Radeon HD 8770
				122	- R7 260
				123	- R7 260X
				124	\ - kaveri amdgcn APU - A6-7000
				125	- A6 Pro-7050B
				126	- A8-7100
				127	- A8 Pro-7150B
				128	- A10-7300
				129	- A10 Pro-7350B
				130	- FX-7500
				131	- A8-7200P
				132	- A10-7400P
				133	- FX-7600P
				134	gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
				135	- FirePro W9100
				136	- FirePro S9150
				137	- FirePro S9170
Tony Tye	1fd77b2	2017-06-08 01:47:25 +0000	[diff] [blame]	138	gfx702 dGPU ROCm - Radeon R9 290
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	139	- Radeon R9 290x
				140	- Radeon R390
Tony Tye	1fd77b2	2017-06-08 01:47:25 +0000	[diff] [blame]	141	- Radeon R390x
				142	gfx703 - kabini amdgcn APU - E1-2100
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	143	- mullins - E1-2200
				144	- E1-2500
				145	- E2-3000
				146	- E2-3800
				147	- A4-5000
				148	- A4-5100
				149	- A6-5200
				150	- A4 Pro-3340B
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	151	GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	152	--------------------------------------------------------------------
				153	gfx800 - iceland amdgcn dGPU - FirePro S7150
				154	- FirePro S7100
				155	- FirePro W7100
				156	- Radeon R285
				157	- Radeon R9 380
				158	- Radeon R9 385
				159	- Mobile FirePro
				160	M7170
				161	gfx801 - carrizo amdgcn APU - A6-8500P
				162	- Pro A6-8500B
				163	- A8-8600P
				164	- Pro A8-8600B
				165	- FX-8800P
				166	- Pro A12-8800B
				167	\ amdgcn APU ROCm - A10-8700P
				168	- Pro A10-8700B
				169	- A10-8780P
				170	\ amdgcn APU - A10-9600P
				171	- A10-9630P
				172	- A12-9700P
				173	- A12-9730P
				174	- FX-9800P
				175	- FX-9830P
				176	\ amdgcn APU - E2-9010
				177	- A6-9210
				178	- A9-9410
				179	gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
				180	gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
				181	- Radeon R9 Fury
				182	- Radeon R9 FuryX
				183	- Radeon Pro Duo
				184	- FirePro S9300x2
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	185	- Radeon Instinct MI8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	186	\ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
				187	- Radeon RX 480
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	188	- Radeon Instinct MI6
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	189	\ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
				190	gfx804 amdgcn dGPU Same as gfx803
				191	gfx810 - stoney amdgcn APU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	192	GCN GFX9 [AMD-GCN-GFX9]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	193	--------------------------------------------------------------------
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	194	gfx900 amdgcn dGPU - Radeon Vega
				195	Frontier Edition
				196	- Radeon RX Vega 56
				197	- Radeon RX Vega 64
				198	- Radeon RX Vega 64
				199	Liquid
				200	- Radeon Instinct MI25
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	201	gfx901 amdgcn dGPU ROCm Same as gfx900
				202	except XNACK is
				203	enabled
				204	gfx902 amdgcn APU TBA
				205
				206	.. TODO
				207	Add product
				208	names.
				209	gfx903 amdgcn APU Same as gfx902
				210	except XNACK is
				211	enabled
				212	========== =========== ============ ===== ======= ==================
				213
				214	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	215
				216	Address Spaces
				217	--------------
				218
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	219	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	220
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	221	The memory space names used in the table, aside from the region memory space, is
				222	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	223
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	224	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	225
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	226	.. table:: Address Space Mapping
				227	:name: amdgpu-address-space-mapping-table
				228
				229	================== ================= ================= ================= =================
				230	LLVM Address Space Memory Space
				231	------------------ -----------------------------------------------------------------------
				232	\ Current Default amdgiz/amdgizcl hcc Future Default
				233	================== ================= ================= ================= =================
				234	0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
				235	1 Global Global Global Global
				236	2 Constant Constant Constant Region (GDS)
				237	3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
				238	4 Generic (Flat) Region (GDS) Region (GDS) Constant
				239	5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
				240	================== ================= ================= ================= =================
				241
				242	Current Default
				243	This is the current default address space mapping used for all languages
				244	except hcc. This will shortly be deprecated.
				245
				246	amdgiz/amdgizcl
				247	This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
				248	is specified as the target triple environment value.
				249
				250	hcc
				251	This is the current address space mapping used when ``hcc`` is specified as
				252	the target triple environment value.This will shortly be deprecated.
				253
				254	Future Default
				255	This will shortly be the only address space mapping for all languages using
				256	AMDGPU backend.
				257
				258	.. _amdgpu-memory-scopes:
				259
				260	Memory Scopes
				261	-------------
				262
				263	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				264	backend memory model when the target triple OS is ``amdhsa`` (see
				265	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				266
				267	The memory model supported is based on the HSA memory model [HSA]_ which is
				268	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				269	relation is transitive over the synchonizes-with relation independent of scope,
				270	and synchonizes-with allows the memory scope instances to be inclusive (see
				271	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
				272
				273	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				274	inclusion and requires the memory scopes to exactly match. However, this
				275	is conservatively correct for OpenCL.
				276
				277	.. table:: AMDHSA LLVM Sync Scopes for AMDHSA
				278	:name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
				279
				280	================ ==========================================================
				281	LLVM Sync Scope Description
				282	================ ==========================================================
				283	none The default: ``system``.
				284
				285	Synchronizes with, and participates in modification and
				286	seq_cst total orderings with, other operations (except
				287	image operations) for all address spaces (except private,
				288	or generic that accesses private) provided the other
				289	operation's sync scope is:
				290
				291	- ``system``.
				292	- ``agent`` and executed by a thread on the same agent.
				293	- ``workgroup`` and executed by a thread in the same
				294	workgroup.
				295	- ``wavefront`` and executed by a thread in the same
				296	wavefront.
				297
				298	``agent`` Synchronizes with, and participates in modification and
				299	seq_cst total orderings with, other operations (except
				300	image operations) for all address spaces (except private,
				301	or generic that accesses private) provided the other
				302	operation's sync scope is:
				303
				304	- ``system`` or ``agent`` and executed by a thread on the
				305	same agent.
				306	- ``workgroup`` and executed by a thread in the same
				307	workgroup.
				308	- ``wavefront`` and executed by a thread in the same
				309	wavefront.
				310
				311	``workgroup`` Synchronizes with, and participates in modification and
				312	seq_cst total orderings with, other operations (except
				313	image operations) for all address spaces (except private,
				314	or generic that accesses private) provided the other
				315	operation's sync scope is:
				316
				317	- ``system``, ``agent`` or ``workgroup`` and executed by a
				318	thread in the same workgroup.
				319	- ``wavefront`` and executed by a thread in the same
				320	wavefront.
				321
				322	``wavefront`` Synchronizes with, and participates in modification and
				323	seq_cst total orderings with, other operations (except
				324	image operations) for all address spaces (except private,
				325	or generic that accesses private) provided the other
				326	operation's sync scope is:
				327
				328	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				329	and executed by a thread in the same wavefront.
				330
				331	``singlethread`` Only synchronizes with, and participates in modification
				332	and seq_cst total orderings with, other operations (except
				333	image operations) running in the same thread for all
				334	address spaces (for example, in signal handlers).
				335	================ ==========================================================
				336
				337	AMDGPU Intrinsics
				338	-----------------
				339
				340	The AMDGPU backend implements the following intrinsics.
				341
				342	This section is WIP.
				343
				344	.. TODO
				345	List AMDGPU intrinsics
				346
				347	Code Object
				348	===========
				349
				350	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				351	can be linked by ``lld`` to produce a standard ELF shared code object which can
				352	be loaded and executed on an AMDGPU target.
				353
				354	Header
				355	------
				356
				357	The AMDGPU backend uses the following ELF header:
				358
				359	.. table:: AMDGPU ELF Header
				360	:name: amdgpu-elf-header-table
				361
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	362	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	363	Field Value
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	364	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	365	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				366	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Konstantin Zhuravlyov	3696352	2017-10-03 21:18:03 +0000	[diff] [blame]	367	``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``,
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	368	``ELFOSABI_AMDGPU_PAL`` or
				369	``ELFOSABI_AMDGPU_MESA3D``
Konstantin Zhuravlyov	3696352	2017-10-03 21:18:03 +0000	[diff] [blame]	370	``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``,
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	371	``ELFABIVERSION_AMDGPU_PAL`` or
				372	``ELFABIVERSION_AMDGPU_MESA3D``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	373	``e_type`` ``ET_REL`` or ``ET_DYN``
				374	``e_machine`` ``EM_AMDGPU``
				375	``e_entry`` 0
				376	``e_flags`` 0
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	377	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	378
				379	..
				380
				381	.. table:: AMDGPU ELF Header Enumeration Values
				382	:name: amdgpu-elf-header-enumeration-values-table
				383
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	384	=============================== =====
				385	Name Value
				386	=============================== =====
				387	``EM_AMDGPU`` 224
				388	``ELFOSABI_AMDGPU_HSA`` 64
				389	``ELFOSABI_AMDGPU_PAL`` 65
				390	``ELFOSABI_AMDGPU_MESA3D`` 66
				391	``ELFABIVERSION_AMDGPU_HSA`` 1
				392	``ELFABIVERSION_AMDGPU_PAL`` 0
				393	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				394	=============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	395
				396	``e_ident[EI_CLASS]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	397	The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64
				398	bit applications.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	399
				400	``e_ident[EI_DATA]``
				401	All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
				402
				403	``e_ident[EI_OSABI]``
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	404	One of the following AMD GPU architecture specific OS ABIs:
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	405
				406	* ``ELFOSABI_AMDGPU_HSA`` is used to specify that the code object conforms to
				407	the AMD HSA runtime ABI [HSA]_.
				408
				409	* ``ELFOSABI_AMDGPU_PAL`` is used to specify that the code object conforms to
				410	the AMD PAL runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	411
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	412	* ``ELFOSABI_AMDGPU_MESA3D`` is used to specify that the code object conforms
				413	to the AMD MESA runtime ABI.
				414
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	415	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	416	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				417	object conforms:
				418
				419	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				420	runtime ABI.
				421
				422	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				423	runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	424
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	425	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
				426	runtime ABI.
				427
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	428	``e_type``
				429	Can be one of the following values:
				430
				431
				432	``ET_REL``
				433	The type produced by the AMD GPU backend compiler as it is relocatable code
				434	object.
				435
				436	``ET_DYN``
				437	The type produced by the linker as it is a shared code object.
				438
				439	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				440
				441	``e_machine``
				442	The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
				443	architecture family. The specific member is specified in the
				444	``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
				445	:ref:`amdgpu-note-records`).
				446
				447	``e_entry``
				448	The entry point is 0 as the entry points for individual kernels must be
				449	selected in order to invoke them through AQL packets.
				450
				451	``e_flags``
				452	The value is 0 as no flags are used.
				453
				454	Sections
				455	--------
				456
				457	An AMDGPU target ELF code object has the standard ELF sections which include:
				458
				459	.. table:: AMDGPU ELF Sections
				460	:name: amdgpu-elf-sections-table
				461
				462	================== ================ =================================
				463	Name Type Attributes
				464	================== ================ =================================
				465	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				466	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				467	``.debug_``\ \* ``SHT_PROGBITS`` none
				468	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				469	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				470	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				471	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				472	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				473	``.note`` ``SHT_NOTE`` none
				474	``.rela``\ name ``SHT_RELA`` none
				475	``.rela.dyn`` ``SHT_RELA`` none
				476	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				477	``.shstrtab`` ``SHT_STRTAB`` none
				478	``.strtab`` ``SHT_STRTAB`` none
				479	``.symtab`` ``SHT_SYMTAB`` none
				480	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				481	================== ================ =================================
				482
				483	These sections have their standard meanings (see [ELF]_) and are only generated
				484	if needed.
				485
				486	``.debug``\ \*
				487	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				488	DWARF produced by the AMDGPU backend.
				489
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	490	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	491	The standard sections used by a dynamic loader.
				492
				493	``.note``
				494	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				495	backend.
				496
				497	``.rela``\ name, ``.rela.dyn``
				498	For relocatable code objects, name is the name of the section that the
				499	relocation records apply. For example, ``.rela.text`` is the section name for
				500	relocation records associated with the ``.text`` section.
				501
				502	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				503	records from each of the relocatable code object's ``.rela``\ name sections.
				504
				505	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				506	the AMDGPU backend.
				507
				508	``.text``
				509	The executable machine code for the kernels and functions they call. Generated
				510	as position independent code. See :ref:`amdgpu-code-conventions` for
				511	information on conventions used in the isa generation.
				512
				513	.. _amdgpu-note-records:
				514
				515	Note Records
				516	------------
				517
				518	As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
				519	the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
				520	minimal zero byte padding must be generated to ensure the ``desc`` field size is
				521	a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
				522	be at least 4 to indicate at least 8 byte alignment.
				523
				524	The AMDGPU backend code object uses the following ELF note records in the
				525	``.note`` section. The Description column specifies the layout of the note
				526	record’s ``desc`` field. All fields are consecutive bytes. Note records with
				527	variable size strings have a corresponding ``*_size`` field that specifies the
				528	number of bytes, including the terminating null character, in the string. The
				529	string(s) come immediately after the preceding fields.
				530
				531	Additional note records can be present.
				532
				533	.. table:: AMDGPU ELF Note Records
				534	:name: amdgpu-elf-note-records-table
				535
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	536	===== ============================== ======================================
				537	Name Type Description
				538	===== ============================== ======================================
				539	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
				540	"AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
				541	===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	542
				543	..
				544
				545	.. table:: AMDGPU ELF Note Record Enumeration Values
				546	:name: amdgpu-elf-note-record-enumeration-values-table
				547
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	548	============================== =====
				549	Name Value
				550	============================== =====
				551	reserved 0-9
				552	``NT_AMD_AMDGPU_HSA_METADATA`` 10
				553	``NT_AMD_AMDGPU_ISA`` 11
				554	============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	555
				556	``NT_AMD_AMDGPU_ISA``
				557	Specifies the instruction set architecture used by the machine code contained
				558	in the code object.
				559
				560	This note record is required for code objects containing machine code for
				561	processors matching the ``amdgcn`` architecture in table
				562	:ref:`amdgpu-processors`.
				563
				564	The null terminated string has the following syntax:
				565
				566	architecture\ ``-``\ vendor\ ``-``\ os\ ``-``\ environment\ ``-``\ processor
				567
				568	where:
				569
				570	architecture
				571	The architecture from table :ref:`amdgpu-target-triples-table`.
				572
				573	This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
				574	:ref:`amdgpu-target-triples`).
				575
				576	vendor
				577	The vendor from table :ref:`amdgpu-target-triples-table`.
				578
				579	For the AMDGPU backend this is always ``amd``.
				580
				581	os
				582	The OS from table :ref:`amdgpu-target-triples-table`.
				583
				584	environment
				585	An environment from table :ref:`amdgpu-target-triples-table`, or blank if
				586	the environment has no affect on the execution of the code object.
				587
				588	For the AMDGPU backend this is currently always blank.
				589	processor
				590	The processor from table :ref:`amdgpu-processors-table`.
				591
				592	For example:
				593
				594	``amdgcn-amd-amdhsa--gfx901``
				595
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	596	``NT_AMD_AMDGPU_HSA_METADATA``
				597	Specifies extensible metadata associated with the code objects executed on HSA
				598	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				599	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				600	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				601	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	602
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	603	.. _amdgpu-symbols:
				604
				605	Symbols
				606	-------
				607
				608	Symbols include the following:
				609
				610	.. table:: AMDGPU ELF Symbols
				611	:name: amdgpu-elf-symbols-table
				612
				613	===================== ============== ============= ==================
				614	Name Type Section Description
				615	===================== ============== ============= ==================
				616	link-name ``STT_OBJECT`` - ``.data`` Global variable
				617	- ``.rodata``
				618	- ``.bss``
				619	link-name\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
				620	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				621	===================== ============== ============= ==================
				622
				623	Global variable
				624	Global variables both used and defined by the compilation unit.
				625
				626	If the symbol is defined in the compilation unit then it is allocated in the
				627	appropriate section according to if it has initialized data or is readonly.
				628
				629	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				630	will resolve relocations using the definition provided by another code object
				631	or explicitly defined by the runtime.
				632
				633	All global symbols, whether defined in the compilation unit or external, are
				634	accessed by the machine code indirectly through a GOT table entry. This
				635	allows them to be preemptable. The GOT table is only supported when the target
				636	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	637
				638	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	639	Add description of linked shared object symbols. Seems undefined symbols
				640	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	641
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	642	Kernel descriptor
				643	Every HSA kernel has an associated kernel descriptor. It is the address of the
				644	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				645	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				646	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				647
				648	Kernel entry point
				649	Every HSA kernel also has a symbol for its machine code entry point.
				650
				651	.. _amdgpu-relocation-records:
				652
				653	Relocation Records
				654	------------------
				655
				656	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				657	relocatable fields are:
				658
				659	``word32``
				660	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				661	alignment. These values use the same byte order as other word values in the
				662	AMD GPU architecture.
				663
				664	``word64``
				665	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				666	alignment. These values use the same byte order as other word values in the
				667	AMD GPU architecture.
				668
				669	Following notations are used for specifying relocation calculations:
				670
				671	A
				672	Represents the addend used to compute the value of the relocatable field.
				673
				674	G
				675	Represents the offset into the global offset table at which the relocation
				676	entry’s symbol will reside during execution.
				677
				678	GOT
				679	Represents the address of the global offset table.
				680
				681	P
				682	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				683	of the storage unit being relocated (computed using ``r_offset``).
				684
				685	S
				686	Represents the value of the symbol whose index resides in the relocation
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	687	entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				688
				689	B
				690	Represents the base address of a loaded executable or shared object which is
				691	the difference between the ELF address and the actual load address. Relocations
				692	using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	693
				694	The following relocation types are supported:
				695
				696	.. table:: AMDGPU ELF Relocation Records
				697	:name: amdgpu-elf-relocation-records-table
				698
				699	========================== ===== ========== ==============================
				700	Relocation Type Value Field Calculation
				701	========================== ===== ========== ==============================
				702	``R_AMDGPU_NONE`` 0 none none
				703	``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
				704	``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
				705	``R_AMDGPU_ABS64`` 3 ``word64`` S + A
				706	``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
				707	``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
				708	``R_AMDGPU_ABS32`` 6 ``word32`` S + A
				709	``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
				710	``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				711	``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
				712	``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				713	``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	714	reserved 12
				715	``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	716	========================== ===== ========== ==============================
				717
				718	.. _amdgpu-dwarf:
				719
				720	DWARF
				721	-----
				722
				723	Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
				724	information that maps the code object executable code and data to the source
				725	language constructs. It can be used by tools such as debuggers and profilers.
				726
				727	Address Space Mapping
				728	~~~~~~~~~~~~~~~~~~~~~
				729
				730	The following address space mapping is used:
				731
				732	.. table:: AMDGPU DWARF Address Space Mapping
				733	:name: amdgpu-dwarf-address-space-mapping-table
				734
				735	=================== =================
				736	DWARF Address Space Memory Space
				737	=================== =================
				738	1 Private (Scratch)
				739	2 Local (group/LDS)
				740	omitted Global
				741	omitted Constant
				742	omitted Generic (Flat)
				743	not supported Region (GDS)
				744	=================== =================
				745
				746	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				747	used in the table.
				748
				749	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				750	DWARF address space of the value of the pointer when it is in the private or
				751	local address space. Otherwise the attribute is omitted.
				752
				753	An ``XDEREF`` operation is generated in location list expressions for variables
				754	that are allocated in the private and local address space. Otherwise no
				755	``XDREF`` is omitted.
				756
				757	Register Mapping
				758	~~~~~~~~~~~~~~~~
				759
				760	This section is WIP.
				761
				762	.. TODO
				763	Define DWARF register enumeration.
				764
				765	If want to present a wavefront state then should expose vector registers as
				766	64 wide (rather than per work-item view that LLVM uses). Either as separate
				767	registers, or a 64x4 byte single register. In either case use a new LANE op
				768	(akin to XDREF) to select the current lane usage in a location
				769	expression. This would also allow scalar register spilling to vector register
				770	lanes to be expressed (currently no debug information is being generated for
				771	spilling). If choose a wide single register approach then use LANE in
				772	conjunction with PIECE operation to select the dword part of the register for
				773	the current lane. If the separate register approach then use LANE to select
				774	the register.
				775
				776	Source Text
				777	~~~~~~~~~~~
				778
				779	This section is WIP.
				780
				781	.. TODO
				782	DWARF extension to include runtime generated source text.
				783
				784	.. _amdgpu-code-conventions:
				785
				786	Code Conventions
				787	================
				788
				789	This section provides code conventions used for each supported target triple OS
				790	(see :ref:`amdgpu-target-triples`).
				791
				792	AMDHSA
				793	------
				794
				795	This section provides code conventions used when the target triple OS is
				796	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				797
				798	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	799
				800	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	801	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	802
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	803	The code object metadata specifies extensible metadata associated with the code
				804	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
				805	[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
				806	(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
				807	``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
				808	information necessary to support the ROCM kernel queries. For example, the
				809	segment sizes needed in a dispatch packet. In addition, a high level language
				810	runtime may require other information to be included. For example, the AMD
				811	OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	812
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	813	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	814	:doc:`YamlIO`).
				815
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	816	.. TODO
				817	Is the string null terminated? It probably should not if YAML allows it to
				818	contain null characters, otherwise it should be.
				819
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	820	The metadata is represented as a single YAML document comprised of the mapping
				821	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				822	referenced tables.
				823
				824	For boolean values, the string values of ``false`` and ``true`` are used for
				825	false and true respectively.
				826
				827	Additional information can be added to the mappings. To avoid conflicts, any
				828	non-AMD key names should be prefixed by "vendor-name.".
				829
				830	.. table:: AMDHSA Code Object Metadata Mapping
				831	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				832
				833	========== ============== ========= =======================================
				834	String Key Value Type Required? Description
				835	========== ============== ========= =======================================
				836	"Version" sequence of Required - The first integer is the major
				837	2 integers version. Currently 1.
				838	- The second integer is the minor
				839	version. Currently 0.
				840	"Printf" sequence of Each string is encoded information
				841	strings about a printf function call. The
				842	encoded information is organized as
				843	fields separated by colon (':'):
				844
				845	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				846
				847	where:
				848
				849	``ID``
				850	A 32 bit integer as a unique id for
				851	each printf function call
				852
				853	``N``
				854	A 32 bit integer equal to the number
				855	of arguments of printf function call
				856	minus 1
				857
				858	``S[i]`` (where i = 0, 1, ... , N-1)
				859	32 bit integers for the size in bytes
				860	of the i-th FormatString argument of
				861	the printf function call
				862
				863	FormatString
				864	The format string passed to the
				865	printf function call.
				866	"Kernels" sequence of Required Sequence of the mappings for each
				867	mapping kernel in the code object. See
				868	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				869	for the definition of the mapping.
				870	========== ============== ========= =======================================
				871
				872	..
				873
				874	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				875	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				876
				877	================= ============== ========= ================================
				878	String Key Value Type Required? Description
				879	================= ============== ========= ================================
				880	"Name" string Required Source name of the kernel.
				881	"SymbolName" string Required Name of the kernel
				882	descriptor ELF symbol.
				883	"Language" string Source language of the kernel.
				884	Values include:
				885
				886	- "OpenCL C"
				887	- "OpenCL C++"
				888	- "HCC"
				889	- "OpenMP"
				890
				891	"LanguageVersion" sequence of - The first integer is the major
				892	2 integers version.
				893	- The second integer is the
				894	minor version.
				895	"Attrs" mapping Mapping of kernel attributes.
				896	See
				897	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				898	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	899	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	900	mapping kernel arguments. See
				901	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				902	for the definition of the mapping.
				903	"CodeProps" mapping Mapping of properties related to
				904	the kernel code. See
				905	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				906	for the mapping definition.
				907	"DebugProps" mapping Mapping of properties related to
				908	the kernel debugging. See
				909	:ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
				910	for the mapping definition.
				911	================= ============== ========= ================================
				912
				913	..
				914
				915	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				916	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				917
				918	=================== ============== ========= ==============================
				919	String Key Value Type Required? Description
				920	=================== ============== ========= ==============================
				921	"ReqdWorkGroupSize" sequence of The dispatch work-group size
				922	3 integers X, Y, Z must correspond to the
				923	specified values.
				924
				925	Corresponds to the OpenCL
				926	``reqd_work_group_size``
				927	attribute.
				928	"WorkGroupSizeHint" sequence of The dispatch work-group size
				929	3 integers X, Y, Z is likely to be the
				930	specified values.
				931
				932	Corresponds to the OpenCL
				933	``work_group_size_hint``
				934	attribute.
				935	"VecTypeHint" string The name of a scalar or vector
				936	type.
				937
				938	Corresponds to the OpenCL
				939	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	940
				941	"RuntimeHandle" string The external symbol name
				942	associated with a kernel.
				943	OpenCL runtime allocates a
				944	global buffer for the symbol
				945	and saves the kernel's address
				946	to it, which is used for
				947	device side enqueueing. Only
				948	available for device side
				949	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	950	=================== ============== ========= ==============================
				951
				952	..
				953
				954	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				955	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				956
				957	================= ============== ========= ================================
				958	String Key Value Type Required? Description
				959	================= ============== ========= ================================
				960	"Name" string Kernel argument name.
				961	"TypeName" string Kernel argument type name.
				962	"Size" integer Required Kernel argument size in bytes.
				963	"Align" integer Required Kernel argument alignment in
				964	bytes. Must be a power of two.
				965	"ValueKind" string Required Kernel argument kind that
				966	specifies how to set up the
				967	corresponding argument.
				968	Values include:
				969
				970	"ByValue"
				971	The argument is copied
				972	directly into the kernarg.
				973
				974	"GlobalBuffer"
				975	A global address space pointer
				976	to the buffer data is passed
				977	in the kernarg.
				978
				979	"DynamicSharedPointer"
				980	A group address space pointer
				981	to dynamically allocated LDS
				982	is passed in the kernarg.
				983
				984	"Sampler"
				985	A global address space
				986	pointer to a S# is passed in
				987	the kernarg.
				988
				989	"Image"
				990	A global address space
				991	pointer to a T# is passed in
				992	the kernarg.
				993
				994	"Pipe"
				995	A global address space pointer
				996	to an OpenCL pipe is passed in
				997	the kernarg.
				998
				999	"Queue"
				1000	A global address space pointer
				1001	to an OpenCL device enqueue
				1002	queue is passed in the
				1003	kernarg.
				1004
				1005	"HiddenGlobalOffsetX"
				1006	The OpenCL grid dispatch
				1007	global offset for the X
				1008	dimension is passed in the
				1009	kernarg.
				1010
				1011	"HiddenGlobalOffsetY"
				1012	The OpenCL grid dispatch
				1013	global offset for the Y
				1014	dimension is passed in the
				1015	kernarg.
				1016
				1017	"HiddenGlobalOffsetZ"
				1018	The OpenCL grid dispatch
				1019	global offset for the Z
				1020	dimension is passed in the
				1021	kernarg.
				1022
				1023	"HiddenNone"
				1024	An argument that is not used
				1025	by the kernel. Space needs to
				1026	be left for it, but it does
				1027	not need to be set up.
				1028
				1029	"HiddenPrintfBuffer"
				1030	A global address space pointer
				1031	to the runtime printf buffer
				1032	is passed in kernarg.
				1033
				1034	"HiddenDefaultQueue"
				1035	A global address space pointer
				1036	to the OpenCL device enqueue
				1037	queue that should be used by
				1038	the kernel by default is
				1039	passed in the kernarg.
				1040
				1041	"HiddenCompletionAction"
				1042	TBD
				1043
				1044	.. TODO
				1045	Add description.
				1046
				1047	"ValueType" string Required Kernel argument value type. Only
				1048	present if "ValueKind" is
				1049	"ByValue". For vector data
				1050	types, the value is for the
				1051	element type. Values include:
				1052
				1053	- "Struct"
				1054	- "I8"
				1055	- "U8"
				1056	- "I16"
				1057	- "U16"
				1058	- "F16"
				1059	- "I32"
				1060	- "U32"
				1061	- "F32"
				1062	- "I64"
				1063	- "U64"
				1064	- "F64"
				1065
				1066	.. TODO
				1067	How can it be determined if a
				1068	vector type, and what size
				1069	vector?
				1070	"PointeeAlign" integer Alignment in bytes of pointee
				1071	type for pointer type kernel
				1072	argument. Must be a power
				1073	of 2. Only present if
				1074	"ValueKind" is
				1075	"DynamicSharedPointer".
				1076	"AddrSpaceQual" string Kernel argument address space
				1077	qualifier. Only present if
				1078	"ValueKind" is "GlobalBuffer" or
				1079	"DynamicSharedPointer". Values
				1080	are:
				1081
				1082	- "Private"
				1083	- "Global"
				1084	- "Constant"
				1085	- "Local"
				1086	- "Generic"
				1087	- "Region"
				1088
				1089	.. TODO
				1090	Is GlobalBuffer only Global
				1091	or Constant? Is
				1092	DynamicSharedPointer always
				1093	Local? Can HCC allow Generic?
				1094	How can Private or Region
				1095	ever happen?
				1096	"AccQual" string Kernel argument access
				1097	qualifier. Only present if
				1098	"ValueKind" is "Image" or
				1099	"Pipe". Values
				1100	are:
				1101
				1102	- "ReadOnly"
				1103	- "WriteOnly"
				1104	- "ReadWrite"
				1105
				1106	.. TODO
				1107	Does this apply to
				1108	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1109	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1110	performed by the kernel on the
				1111	kernel argument. Only present if
				1112	"ValueKind" is "GlobalBuffer",
				1113	"Image", or "Pipe". This may be
				1114	more restrictive than indicated
				1115	by "AccQual" to reflect what the
				1116	kernel actual does. If not
				1117	present then the runtime must
				1118	assume what is implied by
				1119	"AccQual" and "IsConst". Values
				1120	are:
				1121
				1122	- "ReadOnly"
				1123	- "WriteOnly"
				1124	- "ReadWrite"
				1125
				1126	"IsConst" boolean Indicates if the kernel argument
				1127	is const qualified. Only present
				1128	if "ValueKind" is
				1129	"GlobalBuffer".
				1130
				1131	"IsRestrict" boolean Indicates if the kernel argument
				1132	is restrict qualified. Only
				1133	present if "ValueKind" is
				1134	"GlobalBuffer".
				1135
				1136	"IsVolatile" boolean Indicates if the kernel argument
				1137	is volatile qualified. Only
				1138	present if "ValueKind" is
				1139	"GlobalBuffer".
				1140
				1141	"IsPipe" boolean Indicates if the kernel argument
				1142	is pipe qualified. Only present
				1143	if "ValueKind" is "Pipe".
				1144
				1145	.. TODO
				1146	Can GlobalBuffer be pipe
				1147	qualified?
				1148	================= ============== ========= ================================
				1149
				1150	..
				1151
				1152	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1153	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1154
				1155	============================ ============== ========= =====================
				1156	String Key Value Type Required? Description
				1157	============================ ============== ========= =====================
				1158	"KernargSegmentSize" integer Required The size in bytes of
				1159	the kernarg segment
				1160	that holds the values
				1161	of the arguments to
				1162	the kernel.
				1163	"GroupSegmentFixedSize" integer Required The amount of group
				1164	segment memory
				1165	required by a
				1166	work-group in
				1167	bytes. This does not
				1168	include any
				1169	dynamically allocated
				1170	group segment memory
				1171	that may be added
				1172	when the kernel is
				1173	dispatched.
				1174	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1175	private address space
				1176	memory required for a
				1177	work-item in
				1178	bytes. If
				1179	IsDynamicCallstack
				1180	is 1 then additional
				1181	space must be added
				1182	to this value for the
				1183	call stack.
				1184	"KernargSegmentAlign" integer Required The maximum byte
				1185	alignment of
				1186	arguments in the
				1187	kernarg segment. Must
				1188	be a power of 2.
				1189	"WavefrontSize" integer Required Wavefront size. Must
				1190	be a power of 2.
				1191	"NumSGPRs" integer Number of scalar
				1192	registers used by a
				1193	wavefront for
				1194	GFX6-GFX9. This
				1195	includes the special
				1196	SGPRs for VCC, Flat
				1197	Scratch (GFX7-GFX9)
				1198	and XNACK (for
				1199	GFX8-GFX9). It does
				1200	not include the 16
				1201	SGPR added if a trap
				1202	handler is
				1203	enabled. It is not
				1204	rounded up to the
				1205	allocation
				1206	granularity.
				1207	"NumVGPRs" integer Number of vector
				1208	registers used by
				1209	each work-item for
				1210	GFX6-GFX9
Konstantin Zhuravlyov	8d5e9e1	2017-10-18 17:31:09 +0000	[diff] [blame]	1211	"MaxFlatWorkGroupSize" integer Maximum flat
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1212	work-group size
				1213	supported by the
				1214	kernel in work-items.
				1215	"IsDynamicCallStack" boolean Indicates if the
				1216	generated machine
				1217	code is using a
				1218	dynamically sized
				1219	call stack.
				1220	"IsXNACKEnabled" boolean Indicates if the
				1221	generated machine
				1222	code is capable of
				1223	supporting XNACK.
				1224	============================ ============== ========= =====================
				1225
				1226	..
				1227
				1228	.. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
				1229	:name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
				1230
				1231	=================================== ============== ========= ==============
				1232	String Key Value Type Required? Description
				1233	=================================== ============== ========= ==============
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1234	"DebuggerABIVersion" sequence of
				1235	2 integers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1236	"ReservedNumVGPRs" integer
				1237	"ReservedFirstVGPR" integer
				1238	"PrivateSegmentBufferSGPR" integer
				1239	"WavefrontPrivateSegmentOffsetSGPR" integer
				1240	=================================== ============== ========= ==============
				1241
				1242	.. TODO
				1243	Plan to remove the debug properties metadata.
				1244
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1245	Kernel Dispatch
				1246	~~~~~~~~~~~~~~~
				1247
				1248	The HSA architected queuing language (AQL) defines a user space memory interface
				1249	that can be used to control the dispatch of kernels, in an agent independent
				1250	way. An agent can have zero or more AQL queues created for it using the ROCm
				1251	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1252	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1253	mechanics and packet layouts.
				1254
				1255	The packet processor of a kernel agent is responsible for detecting and
				1256	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1257	packet processor is implemented by the hardware command processor (CP),
				1258	asynchronous dispatch controller (ADC) and shader processor input controller
				1259	(SPI).
				1260
				1261	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1262	mode driver to initialize and register the AQL queue with CP.
				1263
				1264	To dispatch a kernel the following actions are performed. This can occur in the
				1265	CPU host program, or from an HSA kernel executing on a GPU.
				1266
				1267	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1268	executed is obtained.
				1269	2. A pointer to the kernel descriptor (see
				1270	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1271	obtained. It must be for a kernel that is contained in a code object that that
				1272	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1273	associated.
				1274	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1275	for a memory region with the kernarg property for the kernel agent that will
				1276	execute the kernel. It must be at least 16 byte aligned.
				1277	4. Kernel argument values are assigned to the kernel argument memory
				1278	allocation. The layout is defined in the HSA Programmer’s Language Reference
				1279	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1280	memory in the same way constant memory is accessed. (Note that the HSA
				1281	specification allows an implementation to copy the kernel argument contents to
				1282	another location that is accessed by the kernel.)
				1283	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1284	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1285	packet. The packet must be set up, and the final write must use an atomic
				1286	store release to set the packet kind to ensure the packet contents are
				1287	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1288	notify the kernel agent that the AQL queue has been updated. These rules, and
				1289	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1290	System Architecture Specification* [HSA]_.
				1291	6. A kernel dispatch packet includes information about the actual dispatch,
				1292	such as grid and work-group size, together with information from the code
				1293	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1294	the kernel symbol can be used to obtain the code object values which are
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1295	recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1296	7. CP executes micro-code and is responsible for detecting and setting up the
				1297	GPU to execute the wavefronts of a kernel dispatch.
				1298	8. CP ensures that when the a wavefront starts executing the kernel machine
				1299	code, the scalar general purpose registers (SGPR) and vector general purpose
				1300	registers (VGPR) are set up as required by the machine code. The required
				1301	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1302	register state is defined in
				1303	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1304	9. The prolog of the kernel machine code (see
				1305	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1306	before continuing executing the machine code that corresponds to the kernel.
				1307	10. When the kernel dispatch has completed execution, CP signals the completion
				1308	signal specified in the kernel dispatch packet if not 0.
				1309
				1310	.. _amdgpu-amdhsa-memory-spaces:
				1311
				1312	Memory Spaces
				1313	~~~~~~~~~~~~~
				1314
				1315	The memory space properties are:
				1316
				1317	.. table:: AMDHSA Memory Spaces
				1318	:name: amdgpu-amdhsa-memory-spaces-table
				1319
				1320	================= =========== ======== ======= ==================
				1321	Memory Space Name HSA Segment Hardware Address NULL Value
				1322	Name Name Size
				1323	================= =========== ======== ======= ==================
				1324	Private private scratch 32 0x00000000
				1325	Local group LDS 32 0xFFFFFFFF
				1326	Global global global 64 0x0000000000000000
				1327	Constant constant *same as 64 0x0000000000000000
				1328	global*
				1329	Generic flat flat 64 0x0000000000000000
				1330	Region N/A GDS 32 *not implemented
				1331	for AMDHSA*
				1332	================= =========== ======== ======= ==================
				1333
				1334	The global and constant memory spaces both use global virtual addresses, which
				1335	are the same virtual address space used by the CPU. However, some virtual
				1336	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1337	and some by both.
				1338
				1339	Using the constant memory space indicates that the data will not change during
				1340	the execution of the kernel. This allows scalar read instructions to be
				1341	used. The vector and scalar L1 caches are invalidated of volatile data before
				1342	each kernel dispatch execution to allow constant memory to change values between
				1343	kernel dispatches.
				1344
				1345	The local memory space uses the hardware Local Data Store (LDS) which is
				1346	automatically allocated when the hardware creates work-groups of wavefronts, and
				1347	freed when all the wavefronts of a work-group have terminated. The data store
				1348	(DS) instructions can be used to access it.
				1349
				1350	The private memory space uses the hardware scratch memory support. If the kernel
				1351	uses scratch, then the hardware allocates memory that is accessed using
				1352	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1353	address to physical address is:
				1354
				1355	``wavefront-scratch-base +
				1356	(private-address * wavefront-size * 4) +
				1357	(wavefront-lane-id * 4)``
				1358
				1359	There are different ways that the wavefront scratch base address is determined
				1360	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1361	memory can be accessed in an interleaved manner using buffer instruction with
				1362	the scratch buffer descriptor and per wave scratch offset, by the scratch
				1363	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1364	same private address, the interleaving results in adjacent dwords being accessed
				1365	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1366	supported except by flat and scratch instructions in GFX9.
				1367
				1368	The generic address space uses the hardware flat address support available in
				1369	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1370	local appertures), that are outside the range of addressible global memory, to
				1371	map from a flat address to a private or local address.
				1372
				1373	FLAT instructions can take a flat address and access global, private (scratch)
				1374	and group (LDS) memory depending in if the address is within one of the
				1375	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1376	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1377	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1378	(see :ref:`amdgpu-amdhsa-m0`).
				1379
				1380	To convert between a segment address and a flat address the base address of the
				1381	appertures address can be used. For GFX7-GFX8 these are available in the
				1382	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1383	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1384	GFX9 the appature base addresses are directly available as inline constant
				1385	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1386	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1387	which makes it easier to convert from flat to segment or segment to flat.
				1388
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1389	Image and Samplers
				1390	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1391
				1392	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1393	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1394	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1395	enumeration values for the queries that are not trivially deducible from the S#
				1396	representation.
				1397
				1398	HSA Signals
				1399	~~~~~~~~~~~
				1400
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1401	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1402	structure allocated in memory accessible from both the CPU and GPU. The
				1403	structure is defined by the ROCm runtime and subject to change between releases
				1404	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1405
				1406	.. _amdgpu-amdhsa-hsa-aql-queue:
				1407
				1408	HSA AQL Queue
				1409	~~~~~~~~~~~~~
				1410
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1411	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1412	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1413	fields needed to implement certain language features such as the flat address
				1414	aperture bases. It also contains fields used by CP such as managing the
				1415	allocation of scratch memory.
				1416
				1417	.. _amdgpu-amdhsa-kernel-descriptor:
				1418
				1419	Kernel Descriptor
				1420	~~~~~~~~~~~~~~~~~
				1421
				1422	A kernel descriptor consists of the information needed by CP to initiate the
				1423	execution of a kernel, including the entry point address of the machine code
				1424	that implements the kernel.
				1425
				1426	Kernel Descriptor for GFX6-GFX9
				1427	+++++++++++++++++++++++++++++++
				1428
				1429	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1430
				1431	.. table:: Kernel Descriptor for GFX6-GFX9
				1432	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1433
				1434	======= ======= =============================== ===========================
				1435	Bits Size Field Name Description
				1436	======= ======= =============================== ===========================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1437	31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1438	address space memory
				1439	required for a work-group
				1440	in bytes. This does not
				1441	include any dynamically
				1442	allocated local address
				1443	space memory that may be
				1444	added when the kernel is
				1445	dispatched.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1446	63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1447	private address space
				1448	memory required for a
				1449	work-item in bytes. If
				1450	is_dynamic_callstack is 1
				1451	then additional space must
				1452	be added to this value for
				1453	the call stack.
Konstantin Zhuravlyov	8d5e9e1	2017-10-18 17:31:09 +0000	[diff] [blame]	1454	95:64 4 bytes MaxFlatWorkGroupSize Maximum flat work-group
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1455	size supported by the
				1456	kernel in work-items.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1457	96 1 bit IsDynamicCallStack Indicates if the generated
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1458	machine code is using a
				1459	dynamically sized call
				1460	stack.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1461	97 1 bit IsXNACKEnabled Indicates if the generated
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1462	machine code is capable of
				1463	suppoting XNACK.
				1464	127:98 30 bits Reserved. Must be 0.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1465	191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1466	negative) from base
				1467	address of kernel
				1468	descriptor to kernel's
				1469	entry point instruction
				1470	which must be 256 byte
				1471	aligned.
				1472	383:192 24 Reserved. Must be 0.
				1473	bytes
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1474	415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1475	program settings used by
				1476	CP to set up
				1477	``COMPUTE_PGM_RSRC1``
				1478	configuration
				1479	register. See
				1480	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1481	447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1482	program settings used by
				1483	CP to set up
				1484	``COMPUTE_PGM_RSRC2``
				1485	configuration
				1486	register. See
				1487	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1488	448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
				1489	SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1490	(see
				1491	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1492
				1493	The total number of SGPR
				1494	user data registers
				1495	requested must not exceed
				1496	16 and match value in
				1497	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1498	Any requests beyond 16
				1499	will be ignored.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1500	449 1 bit EnableSGPRDispatchPtr see above
				1501	450 1 bit EnableSGPRQueuePtr see above
				1502	451 1 bit EnableSGPRKernargSegmentPtr see above
				1503	452 1 bit EnableSGPRDispatchID see above
				1504	453 1 bit EnableSGPRFlatScratchInit see above
				1505	454 1 bit EnableSGPRPrivateSegmentSize see above
				1506	455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
				1507	should always be 0.
				1508	456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
				1509	should always be 0.
				1510	457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
				1511	should always be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1512	463:458 6 bits Reserved. Must be 0.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1513	511:464 6 Reserved. Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1514	bytes
				1515	512 Total size 64 bytes.
				1516	======= ===================================================================
				1517
				1518	..
				1519
				1520	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
				1521	:name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
				1522
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1523	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1524	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1525	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1526	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1527	used by each work-item,
				1528	granularity is device
				1529	specific:
				1530
				1531	GFX6-9
				1532	roundup((max-vgpg + 1)
				1533	/ 4) - 1
				1534
				1535	Used by CP to set up
				1536	``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1537	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1538	used by a wavefront,
				1539	granularity is device
				1540	specific:
				1541
				1542	GFX6-8
				1543	roundup((max-sgpg + 1)
				1544	/ 8) - 1
				1545	GFX9
				1546	roundup((max-sgpg + 1)
				1547	/ 16) - 1
				1548
				1549	Includes the special SGPRs
				1550	for VCC, Flat Scratch (for
				1551	GFX7 onwards) and XNACK
				1552	(for GFX8 onwards). It does
				1553	not include the 16 SGPR
				1554	added if a trap handler is
				1555	enabled.
				1556
				1557	Used by CP to set up
				1558	``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1559	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1560
				1561	Start executing wavefront
				1562	at the specified priority.
				1563
				1564	CP is responsible for
				1565	filling in
				1566	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1567	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1568	with specified rounding
				1569	mode for single (32
				1570	bit) floating point
				1571	precision floating point
				1572	operations.
				1573
				1574	Floating point rounding
				1575	mode values are defined in
				1576	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1577
				1578	Used by CP to set up
				1579	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1580	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1581	with specified rounding
				1582	denorm mode for half/double (16
				1583	and 64 bit) floating point
				1584	precision floating point
				1585	operations.
				1586
				1587	Floating point rounding
				1588	mode values are defined in
				1589	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1590
				1591	Used by CP to set up
				1592	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1593	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1594	with specified denorm mode
				1595	for single (32
				1596	bit) floating point
				1597	precision floating point
				1598	operations.
				1599
				1600	Floating point denorm mode
				1601	values are defined in
				1602	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1603
				1604	Used by CP to set up
				1605	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1606	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1607	with specified denorm mode
				1608	for half/double (16
				1609	and 64 bit) floating point
				1610	precision floating point
				1611	operations.
				1612
				1613	Floating point denorm mode
				1614	values are defined in
				1615	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1616
				1617	Used by CP to set up
				1618	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1619	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1620
				1621	Start executing wavefront
				1622	in privilege trap handler
				1623	mode.
				1624
				1625	CP is responsible for
				1626	filling in
				1627	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1628	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1629	with DX10 clamp mode
				1630	enabled. Used by the vector
				1631	ALU to force DX-10 style
				1632	treatment of NaN's (when
				1633	set, clamp NaN to zero,
				1634	otherwise pass NaN
				1635	through).
				1636
				1637	Used by CP to set up
				1638	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1639	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1640
				1641	Start executing wavefront
				1642	in single step mode.
				1643
				1644	CP is responsible for
				1645	filling in
				1646	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1647	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1648	with IEEE mode
				1649	enabled. Floating point
				1650	opcodes that support
				1651	exception flag gathering
				1652	will quiet and propagate
				1653	signaling-NaN inputs per
				1654	IEEE 754-2008. Min_dx10 and
				1655	max_dx10 become IEEE
				1656	754-2008 compliant due to
				1657	signaling-NaN propagation
				1658	and quieting.
				1659
				1660	Used by CP to set up
				1661	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1662	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1663
				1664	Only one work-group allowed
				1665	to execute on a compute
				1666	unit.
				1667
				1668	CP is responsible for
				1669	filling in
				1670	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1671	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1672
				1673	Flag that can be used to
				1674	control debugging code.
				1675
				1676	CP is responsible for
				1677	filling in
				1678	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1679	26 1 bit FP16_OVFL GFX6-8:
				1680	Reserved. Must be 0.
				1681	GFX9:
				1682	Wavefront starts
				1683	execution with specified
				1684	fp16 overflow mode.
				1685
				1686	- If 0, then fp16
				1687	overflow generates
				1688	+/-INF values.
				1689	- If 1, then fp16
				1690	overflow that is the
				1691	result of an +/-INF
				1692	input value or divide
				1693	by 0 generates a
				1694	+/-INF, otherwise
				1695	clamps computed
				1696	overflow to +/-MAX_FP16
				1697	as appropriate.
				1698
				1699	Used by CP to set up
				1700	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
				1701	31:27 5 bits Reserved. Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1702	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1703	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1704
				1705	..
				1706
				1707	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1708	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1709
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1710	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1711	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1712	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1713	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
				1714	_WAVE_OFFSET SGPR wave scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1715	system register (see
				1716	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1717
				1718	Used by CP to set up
				1719	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1720	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1721	user data registers
				1722	requested. This number must
				1723	match the number of user
				1724	data registers enabled.
				1725
				1726	Used by CP to set up
				1727	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1728	6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1729	TRAP instruction which
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1730	requires a trap handler to
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1731	be enabled.
				1732
				1733	CP sets
				1734	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
				1735	if the runtime has
				1736	installed a trap handler
				1737	regardless of the setting
				1738	of this field.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1739	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1740	system SGPR register for
				1741	the work-group id in the X
				1742	dimension (see
				1743	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1744
				1745	Used by CP to set up
				1746	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1747	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1748	system SGPR register for
				1749	the work-group id in the Y
				1750	dimension (see
				1751	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1752
				1753	Used by CP to set up
				1754	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1755	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1756	system SGPR register for
				1757	the work-group id in the Z
				1758	dimension (see
				1759	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1760
				1761	Used by CP to set up
				1762	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1763	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1764	system SGPR register for
				1765	work-group information (see
				1766	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1767
				1768	Used by CP to set up
				1769	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1770	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1771	VGPR system registers used
				1772	for the work-item ID.
				1773	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1774	defines the values.
				1775
				1776	Used by CP to set up
				1777	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1778	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1779
				1780	Wavefront starts execution
				1781	with address watch
				1782	exceptions enabled which
				1783	are generated when L1 has
				1784	witnessed a thread access
				1785	an *address of
				1786	interest*.
				1787
				1788	CP is responsible for
				1789	filling in the address
				1790	watch bit in
				1791	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1792	according to what the
				1793	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1794	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1795
				1796	Wavefront starts execution
				1797	with memory violation
				1798	exceptions exceptions
				1799	enabled which are generated
				1800	when a memory violation has
				1801	occurred for this wave from
				1802	L1 or LDS
				1803	(write-to-read-only-memory,
				1804	mis-aligned atomic, LDS
				1805	address out of range,
				1806	illegal address, etc.).
				1807
				1808	CP sets the memory
				1809	violation bit in
				1810	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1811	according to what the
				1812	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1813	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1814
				1815	CP uses the rounded value
				1816	from the dispatch packet,
				1817	not this value, as the
				1818	dispatch may contain
				1819	dynamically allocated group
				1820	segment memory. CP writes
				1821	directly to
				1822	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1823
				1824	Amount of group segment
				1825	(LDS) to allocate for each
				1826	work-group. Granularity is
				1827	device specific:
				1828
				1829	GFX6:
				1830	roundup(lds-size / (64 * 4))
				1831	GFX7-GFX9:
				1832	roundup(lds-size / (128 * 4))
				1833
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1834	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				1835	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1836	enabled.
				1837
				1838	Used by CP to set up
				1839	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1840	(set from bits 0..6).
				1841
				1842	IEEE 754 FP Invalid
				1843	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1844	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				1845	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1846	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1847	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				1848	_DIVISION_BY_ZERO Zero
				1849	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
				1850	_OVERFLOW
				1851	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				1852	_UNDERFLOW
				1853	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				1854	_INEXACT
				1855	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				1856	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1857	only)
				1858	31 1 bit Reserved. Must be 0.
				1859	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1860	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1861
				1862	..
				1863
				1864	.. table:: Floating Point Rounding Mode Enumeration Values
				1865	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1866
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1867	====================================== ===== ==============================
				1868	Enumeration Name Value Description
				1869	====================================== ===== ==============================
				1870	AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				1871	AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				1872	AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				1873	AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
				1874	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1875
				1876	..
				1877
				1878	.. table:: Floating Point Denorm Mode Enumeration Values
				1879	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				1880
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1881	====================================== ===== ==============================
				1882	Enumeration Name Value Description
				1883	====================================== ===== ==============================
				1884	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
				1885	Denorms
				1886	AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				1887	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				1888	AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
				1889	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1890
				1891	..
				1892
				1893	.. table:: System VGPR Work-Item ID Enumeration Values
				1894	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				1895
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1896	======================================== ===== ============================
				1897	Enumeration Name Value Description
				1898	======================================== ===== ============================
				1899	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
				1900	ID.
				1901	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				1902	dimensions ID.
				1903	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				1904	dimensions ID.
				1905	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				1906	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1907
				1908	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				1909
				1910	Initial Kernel Execution State
				1911	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1912
				1913	This section defines the register state that will be set up by the packet
				1914	processor prior to the start of execution of every wavefront. This is limited by
				1915	the constraints of the hardware controllers of CP/ADC/SPI.
				1916
				1917	The order of the SGPR registers is defined, but the compiler can specify which
				1918	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				1919	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				1920	for enabled registers are dense starting at SGPR0: the first enabled register is
				1921	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				1922	an SGPR number.
				1923
				1924	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
				1925	all waves of the grid. It is possible to specify more than 16 User SGPRs using
				1926	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				1927	initialized. These are then immediately followed by the System SGPRs that are
				1928	set up by ADC/SPI and can have different values for each wave of the grid
				1929	dispatch.
				1930
				1931	SGPR register initial state is defined in
				1932	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				1933
				1934	.. table:: SGPR Register Set Up Order
				1935	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				1936
				1937	========== ========================== ====== ==============================
				1938	SGPR Order Name Number Description
				1939	(kernel descriptor enable of
				1940	field) SGPRs
				1941	========== ========================== ====== ==============================
				1942	First Private Segment Buffer 4 V# that can be used, together
				1943	(enable_sgpr_private with Scratch Wave Offset as an
				1944	_segment_buffer) offset, to access the private
				1945	memory space using a segment
				1946	address.
				1947
				1948	CP uses the value provided by
				1949	the runtime.
				1950	then Dispatch Ptr 2 64 bit address of AQL dispatch
				1951	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				1952	actually executing.
				1953	then Queue Ptr 2 64 bit address of amd_queue_t
				1954	(enable_sgpr_queue_ptr) object for AQL queue on which
				1955	the dispatch packet was
				1956	queued.
				1957	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				1958	(enable_sgpr_kernarg segment. This is directly
				1959	_segment_ptr) copied from the
				1960	kernarg_address in the kernel
				1961	dispatch packet.
				1962
				1963	Having CP load it once avoids
				1964	loading it at the beginning of
				1965	every wavefront.
				1966	then Dispatch Id 2 64 bit Dispatch ID of the
				1967	(enable_sgpr_dispatch_id) dispatch packet being
				1968	executed.
				1969	then Flat Scratch Init 2 This is 2 SGPRs:
				1970	(enable_sgpr_flat_scratch
				1971	_init) GFX6
				1972	Not supported.
				1973	GFX7-GFX8
				1974	The first SGPR is a 32 bit
				1975	byte offset from
				1976	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				1977	to per SPI base of memory
				1978	for scratch for the queue
				1979	executing the kernel
				1980	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1981	from the runtime. (The
				1982	Scratch Segment Buffer base
				1983	address is
				1984	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				1985	plus this offset.) The value
				1986	of Scratch Wave Offset must
				1987	be added to this offset by
				1988	the kernel machine code,
				1989	right shifted by 8, and
				1990	moved to the FLAT_SCRATCH_HI
				1991	SGPR register.
				1992	FLAT_SCRATCH_HI corresponds
				1993	to SGPRn-4 on GFX7, and
				1994	SGPRn-6 on GFX8 (where SGPRn
				1995	is the highest numbered SGPR
				1996	allocated to the wave).
				1997	FLAT_SCRATCH_HI is
				1998	multiplied by 256 (as it is
				1999	in units of 256 bytes) and
				2000	added to
				2001	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2002	to calculate the per wave
				2003	FLAT SCRATCH BASE in flat
				2004	memory instructions that
				2005	access the scratch
				2006	apperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2007
				2008	The second SGPR is 32 bit
				2009	byte size of a single
				2010	work-item’s scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2011	usage. CP obtains this from
				2012	the runtime, and it is
				2013	always a multiple of DWORD.
				2014	CP checks that the value in
				2015	the kernel dispatch packet
				2016	Private Segment Byte Size is
				2017	not larger, and requests the
				2018	runtime to increase the
				2019	queue's scratch size if
				2020	necessary. The kernel code
				2021	must move it to
				2022	FLAT_SCRATCH_LO which is
				2023	SGPRn-3 on GFX7 and SGPRn-5
				2024	on GFX8. FLAT_SCRATCH_LO is
				2025	used as the FLAT SCRATCH
				2026	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2027	instructions. Having CP load
				2028	it once avoids loading it at
				2029	the beginning of every
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2030	wavefront. GFX9 This is the
				2031	64 bit base address of the
				2032	per SPI scratch backing
				2033	memory managed by SPI for
				2034	the queue executing the
				2035	kernel dispatch. CP obtains
				2036	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2037	divides it if there are
				2038	multiple Shader Arrays each
				2039	with its own SPI). The value
				2040	of Scratch Wave Offset must
				2041	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2042	machine code and the result
				2043	moved to the FLAT_SCRATCH
				2044	SGPR which is SGPRn-6 and
				2045	SGPRn-5. It is used as the
				2046	FLAT SCRATCH BASE in flat
				2047	memory instructions. then
				2048	Private Segment Size 1 The
				2049	32 bit byte size of a
				2050	(enable_sgpr_private single
				2051	work-item's
				2052	scratch_segment_size) memory
				2053	allocation. This is the
				2054	value from the kernel
				2055	dispatch packet Private
				2056	Segment Byte Size rounded up
				2057	by CP to a multiple of
				2058	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2059
				2060	Having CP load it once avoids
				2061	loading it at the beginning of
				2062	every wavefront.
				2063
				2064	This is not used for
				2065	GFX7-GFX8 since it is the same
				2066	value as the second SGPR of
				2067	Flat Scratch Init. However, it
				2068	may be needed for GFX9 which
				2069	changes the meaning of the
				2070	Flat Scratch Init value.
				2071	then Grid Work-Group Count X 1 32 bit count of the number of
				2072	(enable_sgpr_grid work-groups in the X dimension
				2073	_workgroup_count_X) for the grid being
				2074	executed. Computed from the
				2075	fields in the kernel dispatch
				2076	packet as ((grid_size.x +
				2077	workgroup_size.x - 1) /
				2078	workgroup_size.x).
				2079	then Grid Work-Group Count Y 1 32 bit count of the number of
				2080	(enable_sgpr_grid work-groups in the Y dimension
				2081	_workgroup_count_Y && for the grid being
				2082	less than 16 previous executed. Computed from the
				2083	SGPRs) fields in the kernel dispatch
				2084	packet as ((grid_size.y +
				2085	workgroup_size.y - 1) /
				2086	workgroupSize.y).
				2087
				2088	Only initialized if <16
				2089	previous SGPRs initialized.
				2090	then Grid Work-Group Count Z 1 32 bit count of the number of
				2091	(enable_sgpr_grid work-groups in the Z dimension
				2092	_workgroup_count_Z && for the grid being
				2093	less than 16 previous executed. Computed from the
				2094	SGPRs) fields in the kernel dispatch
				2095	packet as ((grid_size.z +
				2096	workgroup_size.z - 1) /
				2097	workgroupSize.z).
				2098
				2099	Only initialized if <16
				2100	previous SGPRs initialized.
				2101	then Work-Group Id X 1 32 bit work-group id in X
				2102	(enable_sgpr_workgroup_id dimension of grid for
				2103	_X) wavefront.
				2104	then Work-Group Id Y 1 32 bit work-group id in Y
				2105	(enable_sgpr_workgroup_id dimension of grid for
				2106	_Y) wavefront.
				2107	then Work-Group Id Z 1 32 bit work-group id in Z
				2108	(enable_sgpr_workgroup_id dimension of grid for
				2109	_Z) wavefront.
				2110	then Work-Group Info 1 {first_wave, 14’b0000,
				2111	(enable_sgpr_workgroup ordered_append_term[10:0],
				2112	_info) threadgroup_size_in_waves[5:0]}
				2113	then Scratch Wave Offset 1 32 bit byte offset from base
				2114	(enable_sgpr_private of scratch base of queue
				2115	_segment_wave_offset) executing the kernel
				2116	dispatch. Must be used as an
				2117	offset with Private
				2118	segment address when using
				2119	Scratch Segment Buffer. It
				2120	must be used to set up FLAT
				2121	SCRATCH for flat addressing
				2122	(see
				2123	:ref:`amdgpu-amdhsa-flat-scratch`).
				2124	========== ========================== ====== ==============================
				2125
				2126	The order of the VGPR registers is defined, but the compiler can specify which
				2127	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2128	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2129	for enabled registers are dense starting at VGPR0: the first enabled register is
				2130	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2131	VGPR number.
				2132
				2133	VGPR register initial state is defined in
				2134	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2135
				2136	.. table:: VGPR Register Set Up Order
				2137	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2138
				2139	========== ========================== ====== ==============================
				2140	VGPR Order Name Number Description
				2141	(kernel descriptor enable of
				2142	field) VGPRs
				2143	========== ========================== ====== ==============================
				2144	First Work-Item Id X 1 32 bit work item id in X
				2145	(Always initialized) dimension of work-group for
				2146	wavefront lane.
				2147	then Work-Item Id Y 1 32 bit work item id in Y
				2148	(enable_vgpr_workitem_id dimension of work-group for
				2149	> 0) wavefront lane.
				2150	then Work-Item Id Z 1 32 bit work item id in Z
				2151	(enable_vgpr_workitem_id dimension of work-group for
				2152	> 1) wavefront lane.
				2153	========== ========================== ====== ==============================
				2154
				2155	The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
				2156
				2157	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2158	registers.
				2159	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2160	combination including none.
				2161	3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
				2162	cannot included with the flat scratch init value which is per queue.
				2163	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2164	or (X, Y, Z).
				2165
				2166	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2167	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2168
				2169	The global segment can be accessed either using buffer instructions (GFX6 which
				2170	has V# 64 bit address support), flat instructions (GFX7-9), or global
				2171	instructions (GFX9).
				2172
				2173	If buffer operations are used then the compiler can generate a V# with the
				2174	following properties:
				2175
				2176	* base address of 0
				2177	* no swizzle
				2178	* ATC: 1 if IOMMU present (such as APU)
				2179	* ptr64: 1
				2180	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2181	APU and NC for dGPU).
				2182
				2183	.. _amdgpu-amdhsa-kernel-prolog:
				2184
				2185	Kernel Prolog
				2186	~~~~~~~~~~~~~
				2187
				2188	.. _amdgpu-amdhsa-m0:
				2189
				2190	M0
				2191	++
				2192
				2193	GFX6-GFX8
				2194	The M0 register must be initialized with a value at least the total LDS size
				2195	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2196	available in dispatch packet. For M0, it is also possible to use maximum
				2197	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2198	GFX7-GFX8).
				2199	GFX9
				2200	The M0 register is not used for range checking LDS accesses and so does not
				2201	need to be initialized in the prolog.
				2202
				2203	.. _amdgpu-amdhsa-flat-scratch:
				2204
				2205	Flat Scratch
				2206	++++++++++++
				2207
				2208	If the kernel may use flat operations to access scratch memory, the prolog code
				2209	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
				2210	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
				2211	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2212
				2213	GFX6
				2214	Flat scratch is not supported.
				2215
				2216	GFX7-8
				2217	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2218	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2219	being managed by SPI for the queue executing the kernel dispatch. This is
				2220	the same value used in the Scratch Segment Buffer V# base address. The
				2221	prolog must add the value of Scratch Wave Offset to get the wave's byte
				2222	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2223	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2224	by 8 before moving into FLAT_SCRATCH_LO.
				2225	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2226	work-items scratch memory usage. This is directly loaded from the kernel
				2227	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2228	DWORD. Having CP load it once avoids loading it at the beginning of every
				2229	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2230	SIZE.
				2231	GFX9
				2232	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2233	memory being managed by SPI for the queue executing the kernel dispatch. The
				2234	prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
				2235	pair for use as the flat scratch base in flat memory instructions.
				2236
				2237	.. _amdgpu-amdhsa-memory-model:
				2238
				2239	Memory Model
				2240	~~~~~~~~~~~~
				2241
				2242	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2243	(see :ref:`memmodel`). The implementation is WIP.
				2244
				2245	.. TODO
				2246	Update when implementation complete.
				2247
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2248	Support more relaxed OpenCL memory model to be controlled by environment
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2249	component of target triple.
				2250
				2251	The AMDGPU backend supports the memory synchronization scopes specified in
				2252	:ref:`amdgpu-memory-scopes`.
				2253
				2254	The code sequences used to implement the memory model are defined in table
				2255	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2256
				2257	The sequences specify the order of instructions that a single thread must
				2258	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2259	to other memory instructions executed by the same thread. This allows them to be
				2260	moved earlier or later which can allow them to be combined with other instances
				2261	of the same instruction, or hoisted/sunk out of loops to improve
				2262	performance. Only the instructions related to the memory model are given;
				2263	additional ``s_waitcnt`` instructions are required to ensure registers are
				2264	defined before being used. These may be able to be combined with the memory
				2265	model ``s_waitcnt`` instructions as described above.
				2266
				2267	The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
				2268	OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
				2269	relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
				2270	memory model which has separate happens-before relations for the global and
				2271	local address spaces, and only a fence specifying both global and local address
				2272	space joins the relationships. Since the LLVM ``memfence`` instruction does not
				2273	allow an address space to be specified the OpenCL fence has to convervatively
				2274	assume both local and global address space was specified. However, optimizations
				2275	can often be done to eliminate the additional ``s_waitcnt``instructions when
				2276	there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
				2277	instructions. The code sequences in the table indicate what can be omitted for
				2278	the OpenCL memory. The target triple environment is used to determine if the
				2279	source language is OpenCL (see :ref:`amdgpu-opencl`).
				2280
				2281	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2282	operations.
				2283
				2284	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2285	termed vector memory operations.
				2286
				2287	For GFX6-GFX9:
				2288
				2289	* Each agent has multiple compute units (CU).
				2290	* Each CU has multiple SIMDs that execute wavefronts.
				2291	* The wavefronts for a single work-group are executed in the same CU but may be
				2292	executed by different SIMDs.
				2293	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2294	executing on it.
				2295	* All LDS operations of a CU are performed as wavefront wide operations in a
				2296	global order and involve no caching. Completion is reported to a wavefront in
				2297	execution order.
				2298	* The LDS memory has multiple request queues shared by the SIMDs of a
				2299	CU. Therefore, the LDS operations performed by different waves of a work-group
				2300	can be reordered relative to each other, which can result in reordering the
				2301	visibility of vector memory operations with respect to LDS operations of other
				2302	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2303	ensure synchronization between LDS operations and vector memory operations
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2304	between waves of a work-group, but not between operations performed by the
				2305	same wavefront.
				2306	* The vector memory operations are performed as wavefront wide operations and
				2307	completion is reported to a wavefront in execution order. The exception is
				2308	that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
				2309	vector memory order if they access LDS memory, and out of LDS operation order
				2310	if they access global memory.
				2311	* The vector memory operations access a vector L1 cache shared by all wavefronts
				2312	on a CU. Therefore, no special action is required for coherence between
				2313	wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
				2314	coherence between waves executing in different work-groups as they may be
				2315	executing on different CUs.
				2316	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2317	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2318	scalar operations are used in a restricted way so do not impact the memory
				2319	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2320	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2321	the same agent.
				2322	* The L2 cache has independent channels to service disjoint ranges of virtual
				2323	addresses.
				2324	* Each CU has a separate request queue per channel. Therefore, the vector and
				2325	scalar memory operations performed by waves executing in different work-groups
				2326	(which may be executing on different CUs) of an agent can be reordered
				2327	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2328	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2329	previous vector memory operation has completed before executing a subsequent
				2330	vector memory or LDS operation and so can be used to meet the requirements of
				2331	acquire and release.
				2332	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2333	of virtual addresses can be set up to bypass it to ensure system coherence.
				2334
				2335	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
				2336	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2337	memory, atomic memory orderings are not meaningful and all accesses are treated
				2338	as non-atomic.
				2339
				2340	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2341	scalar memory instructions). Since the constant address space contents do not
				2342	change during the execution of a kernel dispatch it is not legal to perform
				2343	stores, and atomic memory orderings are not meaningful and all access are
				2344	treated as non-atomic.
				2345
				2346	A memory synchronization scope wider than work-group is not meaningful for the
				2347	group (LDS) address space and is treated as work-group.
				2348
				2349	The memory model does not support the region address space which is treated as
				2350	non-atomic.
				2351
				2352	Acquire memory ordering is not meaningful on store atomic instructions and is
				2353	treated as non-atomic.
				2354
				2355	Release memory ordering is not meaningful on load atomic instructions and is
				2356	treated a non-atomic.
				2357
				2358	Acquire-release memory ordering is not meaningful on load or store atomic
				2359	instructions and is treated as acquire and release respectively.
				2360
				2361	AMDGPU backend only uses scalar memory operations to access memory that is
				2362	proven to not change during the execution of the kernel dispatch. This includes
				2363	constant address space and global address space for program scope const
				2364	variables. Therefore the kernel machine code does not have to maintain the
				2365	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2366	and vector L1 caches are invalidated between kernel dispatches by CP since
				2367	constant address space data may change between kernel dispatch executions. See
				2368	:ref:`amdgpu-amdhsa-memory-spaces`.
				2369
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2370	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2371	case the AMDGPU backend ensures the memory location used to spill is never
				2372	accessed by vector memory operations at the same time. If scalar writes are used
				2373	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2374	return since the locations may be used for vector memory instructions by a
				2375	future wave that uses the same scratch area, or a function call that creates a
				2376	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2377	as all scalar writes are write-before-read in the same thread.
				2378
				2379	Scratch backing memory (which is used for the private address space) is accessed
				2380	with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
				2381	is only accessed by a single thread, and is always write-before-read,
				2382	there is never a need to invalidate these entries from the L1 cache. Hence all
				2383	cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
				2384	lines.
				2385
				2386	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
				2387	to invalidate the L2 cache. This also causes it to be treated as non-volatile
				2388	and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
				2389	coherent) and so the L2 cache will coherent with the CPU and other agents.
				2390
				2391	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2392	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2393
				2394	============ ============ ============== ========== =======================
				2395	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2396	Ordering Sync Scope Address
				2397	Space
				2398	============ ============ ============== ========== =======================
				2399	Non-Atomic
				2400	---------------------------------------------------------------------------
				2401	load none none - global non-volatile
				2402	- generic 1. buffer/global/flat_load
				2403	volatile
				2404	1. buffer/global/flat_load
				2405	glc=1
				2406	load none none - local 1. ds_load
				2407	store none none - global 1. buffer/global/flat_store
				2408	- generic
				2409	store none none - local 1. ds_store
				2410	Unordered Atomic
				2411	---------------------------------------------------------------------------
				2412	load atomic unordered any any Same as non-atomic.
				2413	store atomic unordered any any Same as non-atomic.
				2414	atomicrmw unordered any any *Same as monotonic
				2415	atomic*.
				2416	Monotonic Atomic
				2417	---------------------------------------------------------------------------
				2418	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2419	- wavefront - generic
				2420	- workgroup
				2421	load atomic monotonic - singlethread - local 1. ds_load
				2422	- wavefront
				2423	- workgroup
				2424	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2425	- system - generic glc=1
				2426	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2427	- wavefront - generic
				2428	- workgroup
				2429	- agent
				2430	- system
				2431	store atomic monotonic - singlethread - local 1. ds_store
				2432	- wavefront
				2433	- workgroup
				2434	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2435	- wavefront - generic
				2436	- workgroup
				2437	- agent
				2438	- system
				2439	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2440	- wavefront
				2441	- workgroup
				2442	Acquire Atomic
				2443	---------------------------------------------------------------------------
				2444	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2445	- wavefront - local
				2446	- generic
				2447	load atomic acquire - workgroup - global 1. buffer/global_load
				2448	load atomic acquire - workgroup - local 1. ds/flat_load
				2449	- generic 2. s_waitcnt lgkmcnt(0)
				2450
				2451	- If OpenCL, omit
				2452	waitcnt.
				2453	- Must happen before
				2454	any following
				2455	global/generic
				2456	load/load
				2457	atomic/store/store
				2458	atomic/atomicrmw.
				2459	- Ensures any
				2460	following global
				2461	data read is no
				2462	older than the load
				2463	atomic value being
				2464	acquired.
				2465
				2466	load atomic acquire - agent - global 1. buffer/global_load
				2467	- system glc=1
				2468	2. s_waitcnt vmcnt(0)
				2469
				2470	- Must happen before
				2471	following
				2472	buffer_wbinvl1_vol.
				2473	- Ensures the load
				2474	has completed
				2475	before invalidating
				2476	the cache.
				2477
				2478	3. buffer_wbinvl1_vol
				2479
				2480	- Must happen before
				2481	any following
				2482	global/generic
				2483	load/load
				2484	atomic/atomicrmw.
				2485	- Ensures that
				2486	following
				2487	loads will not see
				2488	stale global data.
				2489
				2490	load atomic acquire - agent - generic 1. flat_load glc=1
				2491	- system 2. s_waitcnt vmcnt(0) &
				2492	lgkmcnt(0)
				2493
				2494	- If OpenCL omit
				2495	lgkmcnt(0).
				2496	- Must happen before
				2497	following
				2498	buffer_wbinvl1_vol.
				2499	- Ensures the flat_load
				2500	has completed
				2501	before invalidating
				2502	the cache.
				2503
				2504	3. buffer_wbinvl1_vol
				2505
				2506	- Must happen before
				2507	any following
				2508	global/generic
				2509	load/load
				2510	atomic/atomicrmw.
				2511	- Ensures that
				2512	following loads
				2513	will not see stale
				2514	global data.
				2515
				2516	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2517	- wavefront - local
				2518	- generic
				2519	atomicrmw acquire - workgroup - global 1. buffer/global_atomic
				2520	atomicrmw acquire - workgroup - local 1. ds/flat_atomic
				2521	- generic 2. waitcnt lgkmcnt(0)
				2522
				2523	- If OpenCL, omit
				2524	waitcnt.
				2525	- Must happen before
				2526	any following
				2527	global/generic
				2528	load/load
				2529	atomic/store/store
				2530	atomic/atomicrmw.
				2531	- Ensures any
				2532	following global
				2533	data read is no
				2534	older than the
				2535	atomicrmw value
				2536	being acquired.
				2537
				2538	atomicrmw acquire - agent - global 1. buffer/global_atomic
				2539	- system 2. s_waitcnt vmcnt(0)
				2540
				2541	- Must happen before
				2542	following
				2543	buffer_wbinvl1_vol.
				2544	- Ensures the
				2545	atomicrmw has
				2546	completed before
				2547	invalidating the
				2548	cache.
				2549
				2550	3. buffer_wbinvl1_vol
				2551
				2552	- Must happen before
				2553	any following
				2554	global/generic
				2555	load/load
				2556	atomic/atomicrmw.
				2557	- Ensures that
				2558	following loads
				2559	will not see stale
				2560	global data.
				2561
				2562	atomicrmw acquire - agent - generic 1. flat_atomic
				2563	- system 2. s_waitcnt vmcnt(0) &
				2564	lgkmcnt(0)
				2565
				2566	- If OpenCL, omit
				2567	lgkmcnt(0).
				2568	- Must happen before
				2569	following
				2570	buffer_wbinvl1_vol.
				2571	- Ensures the
				2572	atomicrmw has
				2573	completed before
				2574	invalidating the
				2575	cache.
				2576
				2577	3. buffer_wbinvl1_vol
				2578
				2579	- Must happen before
				2580	any following
				2581	global/generic
				2582	load/load
				2583	atomic/atomicrmw.
				2584	- Ensures that
				2585	following loads
				2586	will not see stale
				2587	global data.
				2588
				2589	fence acquire - singlethread none none
				2590	- wavefront
				2591	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2592
				2593	- If OpenCL and
				2594	address space is
				2595	not generic, omit
				2596	waitcnt. However,
				2597	since LLVM
				2598	currently has no
				2599	address space on
				2600	the fence need to
				2601	conservatively
				2602	always generate. If
				2603	fence had an
				2604	address space then
				2605	set to address
				2606	space of OpenCL
				2607	fence flag, or to
				2608	generic if both
				2609	local and global
				2610	flags are
				2611	specified.
				2612	- Must happen after
				2613	any preceding
				2614	local/generic load
				2615	atomic/atomicrmw
				2616	with an equal or
				2617	wider sync scope
				2618	and memory ordering
				2619	stronger than
				2620	unordered (this is
				2621	termed the
				2622	fence-paired-atomic).
				2623	- Must happen before
				2624	any following
				2625	global/generic
				2626	load/load
				2627	atomic/store/store
				2628	atomic/atomicrmw.
				2629	- Ensures any
				2630	following global
				2631	data read is no
				2632	older than the
				2633	value read by the
				2634	fence-paired-atomic.
				2635
				2636	fence acquire - agent none 1. s_waitcnt vmcnt(0) &
				2637	- system lgkmcnt(0)
				2638
				2639	- If OpenCL and
				2640	address space is
				2641	not generic, omit
				2642	lgkmcnt(0).
				2643	However, since LLVM
				2644	currently has no
				2645	address space on
				2646	the fence need to
				2647	conservatively
				2648	always generate
				2649	(see comment for
				2650	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2651	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2652	separate s_waitcnt
				2653	vmcnt(0) and
				2654	s_waitcnt
				2655	lgkmcnt(0) to allow
				2656	them to be
				2657	independently moved
				2658	according to the
				2659	following rules.
				2660	- s_waitcnt vmcnt(0)
				2661	must happen after
				2662	any preceding
				2663	global/generic load
				2664	atomic/atomicrmw
				2665	with an equal or
				2666	wider sync scope
				2667	and memory ordering
				2668	stronger than
				2669	unordered (this is
				2670	termed the
				2671	fence-paired-atomic).
				2672	- s_waitcnt lgkmcnt(0)
				2673	must happen after
				2674	any preceding
				2675	group/generic load
				2676	atomic/atomicrmw
				2677	with an equal or
				2678	wider sync scope
				2679	and memory ordering
				2680	stronger than
				2681	unordered (this is
				2682	termed the
				2683	fence-paired-atomic).
				2684	- Must happen before
				2685	the following
				2686	buffer_wbinvl1_vol.
				2687	- Ensures that the
				2688	fence-paired atomic
				2689	has completed
				2690	before invalidating
				2691	the
				2692	cache. Therefore
				2693	any following
				2694	locations read must
				2695	be no older than
				2696	the value read by
				2697	the
				2698	fence-paired-atomic.
				2699
				2700	2. buffer_wbinvl1_vol
				2701
				2702	- Must happen before
				2703	any following global/generic
				2704	load/load
				2705	atomic/store/store
				2706	atomic/atomicrmw.
				2707	- Ensures that
				2708	following loads
				2709	will not see stale
				2710	global data.
				2711
				2712	Release Atomic
				2713	---------------------------------------------------------------------------
				2714	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2715	- wavefront - local
				2716	- generic
				2717	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
				2718	- generic
				2719	- If OpenCL, omit
				2720	waitcnt.
				2721	- Must happen after
				2722	any preceding
				2723	local/generic
				2724	load/store/load
				2725	atomic/store
				2726	atomic/atomicrmw.
				2727	- Must happen before
				2728	the following
				2729	store.
				2730	- Ensures that all
				2731	memory operations
				2732	to local have
				2733	completed before
				2734	performing the
				2735	store that is being
				2736	released.
				2737
				2738	2. buffer/global/flat_store
				2739	store atomic release - workgroup - local 1. ds_store
				2740	store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
				2741	- system - generic lgkmcnt(0)
				2742
				2743	- If OpenCL, omit
				2744	lgkmcnt(0).
				2745	- Could be split into
				2746	separate s_waitcnt
				2747	vmcnt(0) and
				2748	s_waitcnt
				2749	lgkmcnt(0) to allow
				2750	them to be
				2751	independently moved
				2752	according to the
				2753	following rules.
				2754	- s_waitcnt vmcnt(0)
				2755	must happen after
				2756	any preceding
				2757	global/generic
				2758	load/store/load
				2759	atomic/store
				2760	atomic/atomicrmw.
				2761	- s_waitcnt lgkmcnt(0)
				2762	must happen after
				2763	any preceding
				2764	local/generic
				2765	load/store/load
				2766	atomic/store
				2767	atomic/atomicrmw.
				2768	- Must happen before
				2769	the following
				2770	store.
				2771	- Ensures that all
				2772	memory operations
				2773	to global have
				2774	completed before
				2775	performing the
				2776	store that is being
				2777	released.
				2778
				2779	2. buffer/global/ds/flat_store
				2780	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2781	- wavefront - local
				2782	- generic
				2783	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
				2784	- generic
				2785	- If OpenCL, omit
				2786	waitcnt.
				2787	- Must happen after
				2788	any preceding
				2789	local/generic
				2790	load/store/load
				2791	atomic/store
				2792	atomic/atomicrmw.
				2793	- Must happen before
				2794	the following
				2795	atomicrmw.
				2796	- Ensures that all
				2797	memory operations
				2798	to local have
				2799	completed before
				2800	performing the
				2801	atomicrmw that is
				2802	being released.
				2803
				2804	2. buffer/global/flat_atomic
				2805	atomicrmw release - workgroup - local 1. ds_atomic
				2806	atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
				2807	- system - generic lgkmcnt(0)
				2808
				2809	- If OpenCL, omit
				2810	lgkmcnt(0).
				2811	- Could be split into
				2812	separate s_waitcnt
				2813	vmcnt(0) and
				2814	s_waitcnt
				2815	lgkmcnt(0) to allow
				2816	them to be
				2817	independently moved
				2818	according to the
				2819	following rules.
				2820	- s_waitcnt vmcnt(0)
				2821	must happen after
				2822	any preceding
				2823	global/generic
				2824	load/store/load
				2825	atomic/store
				2826	atomic/atomicrmw.
				2827	- s_waitcnt lgkmcnt(0)
				2828	must happen after
				2829	any preceding
				2830	local/generic
				2831	load/store/load
				2832	atomic/store
				2833	atomic/atomicrmw.
				2834	- Must happen before
				2835	the following
				2836	atomicrmw.
				2837	- Ensures that all
				2838	memory operations
				2839	to global and local
				2840	have completed
				2841	before performing
				2842	the atomicrmw that
				2843	is being released.
				2844
				2845	2. buffer/global/ds/flat_atomic*
				2846	fence release - singlethread none none
				2847	- wavefront
				2848	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				2849
				2850	- If OpenCL and
				2851	address space is
				2852	not generic, omit
				2853	waitcnt. However,
				2854	since LLVM
				2855	currently has no
				2856	address space on
				2857	the fence need to
				2858	conservatively
				2859	always generate
				2860	(see comment for
				2861	previous fence).
				2862	- Must happen after
				2863	any preceding
				2864	local/generic
				2865	load/load
				2866	atomic/store/store
				2867	atomic/atomicrmw.
				2868	- Must happen before
				2869	any following store
				2870	atomic/atomicrmw
				2871	with an equal or
				2872	wider sync scope
				2873	and memory ordering
				2874	stronger than
				2875	unordered (this is
				2876	termed the
				2877	fence-paired-atomic).
				2878	- Ensures that all
				2879	memory operations
				2880	to local have
				2881	completed before
				2882	performing the
				2883	following
				2884	fence-paired-atomic.
				2885
				2886	fence release - agent none 1. s_waitcnt vmcnt(0) &
				2887	- system lgkmcnt(0)
				2888
				2889	- If OpenCL and
				2890	address space is
				2891	not generic, omit
				2892	lgkmcnt(0).
				2893	However, since LLVM
				2894	currently has no
				2895	address space on
				2896	the fence need to
				2897	conservatively
				2898	always generate
				2899	(see comment for
				2900	previous fence).
				2901	- Could be split into
				2902	separate s_waitcnt
				2903	vmcnt(0) and
				2904	s_waitcnt
				2905	lgkmcnt(0) to allow
				2906	them to be
				2907	independently moved
				2908	according to the
				2909	following rules.
				2910	- s_waitcnt vmcnt(0)
				2911	must happen after
				2912	any preceding
				2913	global/generic
				2914	load/store/load
				2915	atomic/store
				2916	atomic/atomicrmw.
				2917	- s_waitcnt lgkmcnt(0)
				2918	must happen after
				2919	any preceding
				2920	local/generic
				2921	load/store/load
				2922	atomic/store
				2923	atomic/atomicrmw.
				2924	- Must happen before
				2925	any following store
				2926	atomic/atomicrmw
				2927	with an equal or
				2928	wider sync scope
				2929	and memory ordering
				2930	stronger than
				2931	unordered (this is
				2932	termed the
				2933	fence-paired-atomic).
				2934	- Ensures that all
				2935	memory operations
				2936	to global have
				2937	completed before
				2938	performing the
				2939	following
				2940	fence-paired-atomic.
				2941
				2942	Acquire-Release Atomic
				2943	---------------------------------------------------------------------------
				2944	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				2945	- wavefront - local
				2946	- generic
				2947	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				2948
				2949	- If OpenCL, omit
				2950	waitcnt.
				2951	- Must happen after
				2952	any preceding
				2953	local/generic
				2954	load/store/load
				2955	atomic/store
				2956	atomic/atomicrmw.
				2957	- Must happen before
				2958	the following
				2959	atomicrmw.
				2960	- Ensures that all
				2961	memory operations
				2962	to local have
				2963	completed before
				2964	performing the
				2965	atomicrmw that is
				2966	being released.
				2967
				2968	2. buffer/global_atomic
				2969	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				2970	2. s_waitcnt lgkmcnt(0)
				2971
				2972	- If OpenCL, omit
				2973	waitcnt.
				2974	- Must happen before
				2975	any following
				2976	global/generic
				2977	load/load
				2978	atomic/store/store
				2979	atomic/atomicrmw.
				2980	- Ensures any
				2981	following global
				2982	data read is no
				2983	older than the load
				2984	atomic value being
				2985	acquired.
				2986
				2987	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2988
				2989	- If OpenCL, omit
				2990	waitcnt.
				2991	- Must happen after
				2992	any preceding
				2993	local/generic
				2994	load/store/load
				2995	atomic/store
				2996	atomic/atomicrmw.
				2997	- Must happen before
				2998	the following
				2999	atomicrmw.
				3000	- Ensures that all
				3001	memory operations
				3002	to local have
				3003	completed before
				3004	performing the
				3005	atomicrmw that is
				3006	being released.
				3007
				3008	2. flat_atomic
				3009	3. s_waitcnt lgkmcnt(0)
				3010
				3011	- If OpenCL, omit
				3012	waitcnt.
				3013	- Must happen before
				3014	any following
				3015	global/generic
				3016	load/load
				3017	atomic/store/store
				3018	atomic/atomicrmw.
				3019	- Ensures any
				3020	following global
				3021	data read is no
				3022	older than the load
				3023	atomic value being
				3024	acquired.
				3025	atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
				3026	- system lgkmcnt(0)
				3027
				3028	- If OpenCL, omit
				3029	lgkmcnt(0).
				3030	- Could be split into
				3031	separate s_waitcnt
				3032	vmcnt(0) and
				3033	s_waitcnt
				3034	lgkmcnt(0) to allow
				3035	them to be
				3036	independently moved
				3037	according to the
				3038	following rules.
				3039	- s_waitcnt vmcnt(0)
				3040	must happen after
				3041	any preceding
				3042	global/generic
				3043	load/store/load
				3044	atomic/store
				3045	atomic/atomicrmw.
				3046	- s_waitcnt lgkmcnt(0)
				3047	must happen after
				3048	any preceding
				3049	local/generic
				3050	load/store/load
				3051	atomic/store
				3052	atomic/atomicrmw.
				3053	- Must happen before
				3054	the following
				3055	atomicrmw.
				3056	- Ensures that all
				3057	memory operations
				3058	to global have
				3059	completed before
				3060	performing the
				3061	atomicrmw that is
				3062	being released.
				3063
				3064	2. buffer/global_atomic
				3065	3. s_waitcnt vmcnt(0)
				3066
				3067	- Must happen before
				3068	following
				3069	buffer_wbinvl1_vol.
				3070	- Ensures the
				3071	atomicrmw has
				3072	completed before
				3073	invalidating the
				3074	cache.
				3075
				3076	4. buffer_wbinvl1_vol
				3077
				3078	- Must happen before
				3079	any following
				3080	global/generic
				3081	load/load
				3082	atomic/atomicrmw.
				3083	- Ensures that
				3084	following loads
				3085	will not see stale
				3086	global data.
				3087
				3088	atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
				3089	- system lgkmcnt(0)
				3090
				3091	- If OpenCL, omit
				3092	lgkmcnt(0).
				3093	- Could be split into
				3094	separate s_waitcnt
				3095	vmcnt(0) and
				3096	s_waitcnt
				3097	lgkmcnt(0) to allow
				3098	them to be
				3099	independently moved
				3100	according to the
				3101	following rules.
				3102	- s_waitcnt vmcnt(0)
				3103	must happen after
				3104	any preceding
				3105	global/generic
				3106	load/store/load
				3107	atomic/store
				3108	atomic/atomicrmw.
				3109	- s_waitcnt lgkmcnt(0)
				3110	must happen after
				3111	any preceding
				3112	local/generic
				3113	load/store/load
				3114	atomic/store
				3115	atomic/atomicrmw.
				3116	- Must happen before
				3117	the following
				3118	atomicrmw.
				3119	- Ensures that all
				3120	memory operations
				3121	to global have
				3122	completed before
				3123	performing the
				3124	atomicrmw that is
				3125	being released.
				3126
				3127	2. flat_atomic
				3128	3. s_waitcnt vmcnt(0) &
				3129	lgkmcnt(0)
				3130
				3131	- If OpenCL, omit
				3132	lgkmcnt(0).
				3133	- Must happen before
				3134	following
				3135	buffer_wbinvl1_vol.
				3136	- Ensures the
				3137	atomicrmw has
				3138	completed before
				3139	invalidating the
				3140	cache.
				3141
				3142	4. buffer_wbinvl1_vol
				3143
				3144	- Must happen before
				3145	any following
				3146	global/generic
				3147	load/load
				3148	atomic/atomicrmw.
				3149	- Ensures that
				3150	following loads
				3151	will not see stale
				3152	global data.
				3153
				3154	fence acq_rel - singlethread none none
				3155	- wavefront
				3156	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3157
				3158	- If OpenCL and
				3159	address space is
				3160	not generic, omit
				3161	waitcnt. However,
				3162	since LLVM
				3163	currently has no
				3164	address space on
				3165	the fence need to
				3166	conservatively
				3167	always generate
				3168	(see comment for
				3169	previous fence).
				3170	- Must happen after
				3171	any preceding
				3172	local/generic
				3173	load/load
				3174	atomic/store/store
				3175	atomic/atomicrmw.
				3176	- Must happen before
				3177	any following
				3178	global/generic
				3179	load/load
				3180	atomic/store/store
				3181	atomic/atomicrmw.
				3182	- Ensures that all
				3183	memory operations
				3184	to local have
				3185	completed before
				3186	performing any
				3187	following global
				3188	memory operations.
				3189	- Ensures that the
				3190	preceding
				3191	local/generic load
				3192	atomic/atomicrmw
				3193	with an equal or
				3194	wider sync scope
				3195	and memory ordering
				3196	stronger than
				3197	unordered (this is
				3198	termed the
				3199	fence-paired-atomic)
				3200	has completed
				3201	before following
				3202	global memory
				3203	operations. This
				3204	satisfies the
				3205	requirements of
				3206	acquire.
				3207	- Ensures that all
				3208	previous memory
				3209	operations have
				3210	completed before a
				3211	following
				3212	local/generic store
				3213	atomic/atomicrmw
				3214	with an equal or
				3215	wider sync scope
				3216	and memory ordering
				3217	stronger than
				3218	unordered (this is
				3219	termed the
				3220	fence-paired-atomic).
				3221	This satisfies the
				3222	requirements of
				3223	release.
				3224
				3225	fence acq_rel - agent none 1. s_waitcnt vmcnt(0) &
				3226	- system lgkmcnt(0)
				3227
				3228	- If OpenCL and
				3229	address space is
				3230	not generic, omit
				3231	lgkmcnt(0).
				3232	However, since LLVM
				3233	currently has no
				3234	address space on
				3235	the fence need to
				3236	conservatively
				3237	always generate
				3238	(see comment for
				3239	previous fence).
				3240	- Could be split into
				3241	separate s_waitcnt
				3242	vmcnt(0) and
				3243	s_waitcnt
				3244	lgkmcnt(0) to allow
				3245	them to be
				3246	independently moved
				3247	according to the
				3248	following rules.
				3249	- s_waitcnt vmcnt(0)
				3250	must happen after
				3251	any preceding
				3252	global/generic
				3253	load/store/load
				3254	atomic/store
				3255	atomic/atomicrmw.
				3256	- s_waitcnt lgkmcnt(0)
				3257	must happen after
				3258	any preceding
				3259	local/generic
				3260	load/store/load
				3261	atomic/store
				3262	atomic/atomicrmw.
				3263	- Must happen before
				3264	the following
				3265	buffer_wbinvl1_vol.
				3266	- Ensures that the
				3267	preceding
				3268	global/local/generic
				3269	load
				3270	atomic/atomicrmw
				3271	with an equal or
				3272	wider sync scope
				3273	and memory ordering
				3274	stronger than
				3275	unordered (this is
				3276	termed the
				3277	fence-paired-atomic)
				3278	has completed
				3279	before invalidating
				3280	the cache. This
				3281	satisfies the
				3282	requirements of
				3283	acquire.
				3284	- Ensures that all
				3285	previous memory
				3286	operations have
				3287	completed before a
				3288	following
				3289	global/local/generic
				3290	store
				3291	atomic/atomicrmw
				3292	with an equal or
				3293	wider sync scope
				3294	and memory ordering
				3295	stronger than
				3296	unordered (this is
				3297	termed the
				3298	fence-paired-atomic).
				3299	This satisfies the
				3300	requirements of
				3301	release.
				3302
				3303	2. buffer_wbinvl1_vol
				3304
				3305	- Must happen before
				3306	any following
				3307	global/generic
				3308	load/load
				3309	atomic/store/store
				3310	atomic/atomicrmw.
				3311	- Ensures that
				3312	following loads
				3313	will not see stale
				3314	global data. This
				3315	satisfies the
				3316	requirements of
				3317	acquire.
				3318
				3319	Sequential Consistent Atomic
				3320	---------------------------------------------------------------------------
				3321	load atomic seq_cst - singlethread - global *Same as corresponding
				3322	- wavefront - local load atomic acquire*.
				3323	- workgroup - generic
				3324	load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
				3325	- system - local
				3326	- generic - Must happen after
				3327	preceding
				3328	global/generic load
				3329	atomic/store
				3330	atomic/atomicrmw
				3331	with memory
				3332	ordering of seq_cst
				3333	and with equal or
				3334	wider sync scope.
				3335	(Note that seq_cst
				3336	fences have their
				3337	own s_waitcnt
				3338	vmcnt(0) and so do
				3339	not need to be
				3340	considered.)
				3341	- Ensures any
				3342	preceding
				3343	sequential
				3344	consistent global
				3345	memory instructions
				3346	have completed
				3347	before executing
				3348	this sequentially
				3349	consistent
				3350	instruction. This
				3351	prevents reordering
				3352	a seq_cst store
				3353	followed by a
				3354	seq_cst load (Note
				3355	that seq_cst is
				3356	stronger than
				3357	acquire/release as
				3358	the reordering of
				3359	load acquire
				3360	followed by a store
				3361	release is
				3362	prevented by the
				3363	waitcnt vmcnt(0) of
				3364	the release, but
				3365	there is nothing
				3366	preventing a store
				3367	release followed by
				3368	load acquire from
				3369	competing out of
				3370	order.)
				3371
				3372	2. *Following
				3373	instructions same as
				3374	corresponding load
				3375	atomic acquire*.
				3376
				3377	store atomic seq_cst - singlethread - global *Same as corresponding
				3378	- wavefront - local store atomic release*.
				3379	- workgroup - generic
				3380	store atomic seq_cst - agent - global *Same as corresponding
				3381	- system - generic store atomic release*.
				3382	atomicrmw seq_cst - singlethread - global *Same as corresponding
				3383	- wavefront - local atomicrmw acq_rel*.
				3384	- workgroup - generic
				3385	atomicrmw seq_cst - agent - global *Same as corresponding
				3386	- system - generic atomicrmw acq_rel*.
				3387	fence seq_cst - singlethread none *Same as corresponding
				3388	- wavefront fence acq_rel*.
				3389	- workgroup
				3390	- agent
				3391	- system
				3392	============ ============ ============== ========== =======================
				3393
				3394	The memory order also adds the single thread optimization constrains defined in
				3395	table
				3396	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3397
				3398	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3399	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3400
				3401	============ ==============================================================
				3402	LLVM Memory Optimization Constraints
				3403	Ordering
				3404	============ ==============================================================
				3405	unordered none
				3406	monotonic none
				3407	acquire - If a load atomic/atomicrmw then no following load/load
				3408	atomic/store/ store atomic/atomicrmw/fence instruction can
				3409	be moved before the acquire.
				3410	- If a fence then same as load atomic, plus no preceding
				3411	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3412	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3413	atomic/store/ store atomic/atomicrmw/fence instruction can
				3414	be moved after the release.
				3415	- If a fence then same as store atomic, plus no following
				3416	associated fence-paired-atomic can be moved before the
				3417	fence.
				3418	acq_rel Same constraints as both acquire and release.
				3419	seq_cst - If a load atomic then same constraints as acquire, plus no
				3420	preceding sequentially consistent load atomic/store
				3421	atomic/atomicrmw/fence instruction can be moved after the
				3422	seq_cst.
				3423	- If a store atomic then the same constraints as release, plus
				3424	no following sequentially consistent load atomic/store
				3425	atomic/atomicrmw/fence instruction can be moved before the
				3426	seq_cst.
				3427	- If an atomicrmw/fence then same constraints as acq_rel.
				3428	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3429
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3430	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3431	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3432
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3433	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3434	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3435	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3436
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3437	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3438	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3439
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3440	=================== =============== =============== =======================
				3441	Usage Code Sequence Trap Handler Description
				3442	Inputs
				3443	=================== =============== =============== =======================
				3444	reserved ``s_trap 0x00`` Reserved by hardware.
				3445	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3446	``queue_ptr`` ``debugtrap``
				3447	``VGPR0``: intrinsic (not
				3448	``arg`` implemented).
				3449	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3450	``queue_ptr`` terminated and its
				3451	associated queue put
				3452	into the error state.
				3453	``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
				3454	``queue_ptr`` installed handled
				3455	same as ``llvm.trap``.
				3456	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
				3457	breakpoints.
				3458	debugger ``s_trap 0x08`` Reserved for debugger.
				3459	debugger ``s_trap 0xfe`` Reserved for debugger.
				3460	debugger ``s_trap 0xff`` Reserved for debugger.
				3461	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3462
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3463	Unspecified OS
				3464	--------------
				3465
				3466	This section provides code conventions used when the target triple OS is
				3467	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3468
				3469	Trap Handler ABI
				3470	~~~~~~~~~~~~~~~~
				3471
				3472	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3473	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3474	instructions are handled as follows:
				3475
				3476	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3477	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3478
				3479	=============== =============== ===========================================
				3480	Usage Code Sequence Description
				3481	=============== =============== ===========================================
				3482	llvm.trap s_endpgm Causes wavefront to be terminated.
				3483	llvm.debugtrap none Compiler warning given that there is no
				3484	trap handler installed.
				3485	=============== =============== ===========================================
				3486
				3487	Source Languages
				3488	================
				3489
				3490	.. _amdgpu-opencl:
				3491
				3492	OpenCL
				3493	------
				3494
				3495	When generating code for the OpenCL language the target triple environment
				3496	should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
				3497
				3498	When the language is OpenCL the following differences occur:
				3499
				3500	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3501	2. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3502	3. Additional metadata is generated
				3503	(:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3504
				3505	.. TODO
				3506	Specify what affect this has. Hidden arguments added. Additional metadata
				3507	generated.
				3508
				3509	.. _amdgpu-hcc:
				3510
				3511	HCC
				3512	---
				3513
				3514	When generating code for the OpenCL language the target triple environment
				3515	should be ``hcc`` (see :ref:`amdgpu-target-triples`).
				3516
				3517	When the language is OpenCL the following differences occur:
				3518
				3519	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3520
				3521	.. TODO
				3522	Specify what affect this has.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	3523
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3524	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3525	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3526
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3527	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3528	It supports AMDGCN GFX6-GFX8.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3529
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3530	This section describes general syntax for instructions and operands. For more
				3531	information about instructions, their semantics and supported combinations of
				3532	operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	3533	[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3534
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3535	An instruction has the following syntax (register operands are normally
				3536	comma-separated while extra operands are space-separated):
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3537
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3538	<opcode> <register_operand0>, ... <extra_operand0> ...
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3539
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3540	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3541	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3542
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3543	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3544
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3545	* SGPR registers: s0, ... or s[0], ...
				3546	* VGPR registers: v0, ... or v[0], ...
				3547	* TTMP registers: ttmp0, ... or ttmp[0], ...
				3548	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				3549	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				3550	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				3551	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				3552	* Register index expressions: v[2*2], s[1-1:2-1]
				3553	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3554
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3555	The following extra operands are supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3556
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3557	* offset, offset0, offset1
				3558	* idxen, offen bits
				3559	* glc, slc, tfe bits
				3560	* waitcnt: integer or combination of counter values
				3561	* VOP3 modifiers:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3562
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3563	- abs (\\| \\|), neg (\-)
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3564
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3565	* DPP modifiers:
				3566
				3567	- row_shl, row_shr, row_ror, row_rol
				3568	- row_mirror, row_half_mirror, row_bcast
				3569	- wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
				3570	- row_mask, bank_mask, bound_ctrl
				3571
				3572	* SDWA modifiers:
				3573
				3574	- dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
				3575	- dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
				3576	- abs, neg, sext
				3577
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3578	Instruction Examples
				3579	~~~~~~~~~~~~~~~~~~~~
				3580
				3581	DS
				3582	~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3583
				3584	.. code-block:: nasm
				3585
				3586	ds_add_u32 v2, v4 offset:16
				3587	ds_write_src2_b64 v2 offset0:4 offset1:8
				3588	ds_cmpst_f32 v2, v4, v6
				3589	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				3590
				3591
				3592	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				3593
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3594	FLAT
				3595	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3596
				3597	.. code-block:: nasm
				3598
				3599	flat_load_dword v1, v[3:4]
				3600	flat_store_dwordx3 v[3:4], v[5:7]
				3601	flat_atomic_swap v1, v[3:4], v5 glc
				3602	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				3603	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				3604
				3605	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				3606
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3607	MUBUF
				3608	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3609
				3610	.. code-block:: nasm
				3611
				3612	buffer_load_dword v1, off, s[4:7], s1
				3613	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				3614	buffer_store_format_xy v[1:2], off, s[4:7], s1
				3615	buffer_wbinvl1
				3616	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				3617
				3618	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				3619
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3620	SMRD/SMEM
				3621	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3622
				3623	.. code-block:: nasm
				3624
				3625	s_load_dword s1, s[2:3], 0xfc
				3626	s_load_dwordx8 s[8:15], s[2:3], s4
				3627	s_load_dwordx16 s[88:103], s[2:3], s4
				3628	s_dcache_inv_vol
				3629	s_memtime s[4:5]
				3630
				3631	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				3632
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3633	SOP1
				3634	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3635
				3636	.. code-block:: nasm
				3637
				3638	s_mov_b32 s1, s2
				3639	s_mov_b64 s[0:1], 0x80000000
				3640	s_cmov_b32 s1, 200
				3641	s_wqm_b64 s[2:3], s[4:5]
				3642	s_bcnt0_i32_b64 s1, s[2:3]
				3643	s_swappc_b64 s[2:3], s[4:5]
				3644	s_cbranch_join s[4:5]
				3645
				3646	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				3647
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3648	SOP2
				3649	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3650
				3651	.. code-block:: nasm
				3652
				3653	s_add_u32 s1, s2, s3
				3654	s_and_b64 s[2:3], s[4:5], s[6:7]
				3655	s_cselect_b32 s1, s2, s3
				3656	s_andn2_b32 s2, s4, s6
				3657	s_lshr_b64 s[2:3], s[4:5], s6
				3658	s_ashr_i32 s2, s4, s6
				3659	s_bfm_b64 s[2:3], s4, s6
				3660	s_bfe_i64 s[2:3], s[4:5], s6
				3661	s_cbranch_g_fork s[4:5], s[6:7]
				3662
				3663	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				3664
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3665	SOPC
				3666	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3667
				3668	.. code-block:: nasm
				3669
				3670	s_cmp_eq_i32 s1, s2
				3671	s_bitcmp1_b32 s1, s2
				3672	s_bitcmp0_b64 s[2:3], s4
				3673	s_setvskip s3, s5
				3674
				3675	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				3676
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3677	SOPP
				3678	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3679
				3680	.. code-block:: nasm
				3681
				3682	s_barrier
				3683	s_nop 2
				3684	s_endpgm
				3685	s_waitcnt 0 ; Wait for all counters to be 0
				3686	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				3687	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				3688	s_sethalt 9
				3689	s_sleep 10
				3690	s_sendmsg 0x1
				3691	s_sendmsg sendmsg(MSG_INTERRUPT)
				3692	s_trap 1
				3693
				3694	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				3695
				3696	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	3697	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3698	range or acceptable values.
				3699
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3700	VALU
				3701	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3702
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3703	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				3704	the assembler will automatically use optimal encoding based on its operands.
				3705	To force specific encoding, one can add a suffix to the opcode of the instruction:
				3706
				3707	* _e32 for 32-bit VOP1/VOP2/VOPC
				3708	* _e64 for 64-bit VOP3
				3709	* _dpp for VOP_DPP
				3710	* _sdwa for VOP_SDWA
				3711
				3712	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3713
				3714	.. code-block:: nasm
				3715
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3716	v_mov_b32 v1, v2
				3717	v_mov_b32_e32 v1, v2
				3718	v_nop
				3719	v_cvt_f64_i32_e32 v[1:2], v2
				3720	v_floor_f32_e32 v1, v2
				3721	v_bfrev_b32_e32 v1, v2
				3722	v_add_f32_e32 v1, v2, v3
				3723	v_mul_i32_i24_e64 v1, v2, 3
				3724	v_mul_i32_i24_e32 v1, -3, v3
				3725	v_mul_i32_i24_e32 v1, -100, v3
				3726	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				3727	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3728
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3729	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3730
				3731	.. code-block:: nasm
				3732
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3733	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				3734	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3735	v_mov_b32 v0, v0 wave_shl:1
				3736	v_mov_b32 v0, v0 row_mirror
				3737	v_mov_b32 v0, v0 row_bcast:31
				3738	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3739	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3740	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3741
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3742	VOP_SDWA examples:
				3743
				3744	.. code-block:: nasm
				3745
				3746	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				3747	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				3748	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				3749	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				3750	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				3751
				3752	For full list of supported instructions, refer to "Vector ALU instructions".
				3753
				3754	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3755	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3756
				3757	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				3758	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3759
				3760	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3761	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3762
				3763	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3764	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3765
				3766	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3767	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				3768
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3769
				3770	major, minor, and stepping are all integers that describe the instruction
				3771	set architecture (ISA) version of the assembly program.
				3772
				3773	vendor and arch are quoted strings. vendor should always be equal to
				3774	"AMD" and arch should always be equal to "AMDGPU".
				3775
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3776	By default, the assembler will derive the ISA version, vendor, and arch
				3777	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3778
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3779	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3780	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3781
				3782	This directives specifies that the symbol with given name is a kernel entry point
				3783	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3784
				3785	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3786	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3787
				3788	This directive marks the beginning of a list of key / value pairs that are used
				3789	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				3790	The list must be terminated by the .end_amd_kernel_code_t directive. For
				3791	any amd_kernel_code_t values that are unspecified a default value will be
				3792	used. The default value for all keys is 0, with the following exceptions:
				3793
				3794	- kernel_code_version_major defaults to 1.
				3795	- machine_kind defaults to 1.
				3796	- machine_version_major, machine_version_minor, and
				3797	machine_version_stepping are derived from the value of the -mcpu option
				3798	that is passed to the assembler.
				3799	- kernel_code_entry_byte_offset defaults to 256.
				3800	- wavefront_size defaults to 6.
				3801	- kernarg_segment_alignment, group_segment_alignment, and
				3802	private_segment_alignment default to 4. Note that alignments are specified
				3803	as a power of two, so a value of n means an alignment of 2^ n.
				3804
				3805	The .amd_kernel_code_t directive must be placed immediately after the
				3806	function label and before any instructions.
				3807
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3808	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				3809	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3810
				3811	Here is an example of a minimal amd_kernel_code_t specification:
				3812
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	3813	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3814
				3815	.hsa_code_object_version 1,0
				3816	.hsa_code_object_isa
				3817
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	3818	.hsatext
				3819	.globl hello_world
				3820	.p2align 8
				3821	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3822
				3823	hello_world:
				3824
				3825	.amd_kernel_code_t
				3826	enable_sgpr_kernarg_segment_ptr = 1
				3827	is_ptr64 = 1
				3828	compute_pgm_rsrc1_vgprs = 0
				3829	compute_pgm_rsrc1_sgprs = 0
				3830	compute_pgm_rsrc2_user_sgpr = 2
				3831	kernarg_segment_byte_size = 8
				3832	wavefront_sgpr_count = 2
				3833	workitem_vgpr_count = 3
				3834	.end_amd_kernel_code_t
				3835
				3836	s_load_dwordx2 s[0:1], s[0:1] 0x0
				3837	v_mov_b32 v0, 3.14159
				3838	s_waitcnt lgkmcnt(0)
				3839	v_mov_b32 v1, s0
				3840	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	3841	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3842	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	3843	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	3844	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3845
				3846	Additional Documentation
				3847	========================
				3848
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	3849	.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				3850	.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				3851	.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				3852	.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				3853	.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				3854	.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				3855	.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
				3856	.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3857	.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
				3858	.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
				3859	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				3860	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				3861	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				3862	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				3863	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
				3864	.. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
				3865	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				3866	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
				3867	.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__