Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: 8b94a3e6a5a57a9a8f55b7bcfe6f82007230fa6a [file] [log] [blame]

Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1	=============================
				2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
				26	.. table:: AMDGPU Target Triples
				27	:name: amdgpu-target-triples-table
				28
				29	============ ======== ========= ===========
				30	Architecture Vendor OS Environment
				31	============ ======== ========= ===========
				32	r600 amd <empty> <empty>
				33	amdgcn amd <empty> <empty>
				34	amdgcn amd amdhsa <empty>
				35	amdgcn amd amdhsa opencl
				36	amdgcn amd amdhsa amdgizcl
				37	amdgcn amd amdhsa amdgiz
				38	amdgcn amd amdhsa hcc
				39	============ ======== ========= ===========
				40
				41	``r600-amd--``
				42	Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
				43	the MESA runtime.
				44
				45	``amdgcn-amd--``
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	46	Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	47	the MESA runtime.
				48
				49	``amdgcn-amd-amdhsa-``
				50	Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
				51	compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
				52
				53	``amdgcn-amd-amdhsa-opencl``
				54	Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
				55	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
				56	:ref:`amdgpu-opencl`.
				57
				58	``amdgcn-amd-amdhsa-amdgizcl``
				59	Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
				60	is used (see :ref:`amdgpu-address-spaces`).
				61
				62	``amdgcn-amd-amdhsa-amdgiz``
				63	Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
				64	used (see :ref:`amdgpu-address-spaces`).
				65
				66	``amdgcn-amd-amdhsa-hcc``
				67	Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
				68	executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
				69	:ref:`amdgpu-hcc`.
				70
				71	.. _amdgpu-processors:
				72
				73	Processors
				74	----------
				75
				76	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				77	names from both the Processor and Alternative Processor can be used.
				78
				79	.. table:: AMDGPU Processors
				80	:name: amdgpu-processors-table
				81
				82	========== =========== ============ ===== ======= ==================
				83	Processor Alternative Target dGPU/ Runtime Example
				84	Processor Triple APU Support Products
				85	Architecture
				86	========== =========== ============ ===== ======= ==================
				87	R600 [AMD-R6xx]_
				88	--------------------------------------------------------------------
				89	r600 r600 dGPU
				90	r630 r600 dGPU
				91	rs880 r600 dGPU
				92	rv670 r600 dGPU
				93	R700 [AMD-R7xx]_
				94	--------------------------------------------------------------------
				95	rv710 r600 dGPU
				96	rv730 r600 dGPU
				97	rv770 r600 dGPU
				98	Evergreen [AMD-Evergreen]_
				99	--------------------------------------------------------------------
				100	cedar r600 dGPU
				101	redwood r600 dGPU
				102	sumo r600 dGPU
				103	juniper r600 dGPU
				104	cypress r600 dGPU
				105	Northern Islands [AMD-Cayman-Trinity]_
				106	--------------------------------------------------------------------
				107	barts r600 dGPU
				108	turks r600 dGPU
				109	caicos r600 dGPU
				110	cayman r600 dGPU
				111	GCN GFX6 (Southern Islands (SI)) [AMD-Souther-Islands]_
				112	--------------------------------------------------------------------
Konstantin Zhuravlyov	6cbcb27	2017-08-08 04:28:31 +0000	[diff] [blame]	113	gfx600 - tahiti amdgcn dGPU
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	114	gfx601 - pitcairn amdgcn dGPU
				115	- verde
				116	- oland
				117	- hainan
				118	GCN GFX7 (Sea Islands (CI)) [AMD-Sea-Islands]_
				119	--------------------------------------------------------------------
				120	gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
				121	- Radeon HD 8770
				122	- R7 260
				123	- R7 260X
				124	\ - kaveri amdgcn APU - A6-7000
				125	- A6 Pro-7050B
				126	- A8-7100
				127	- A8 Pro-7150B
				128	- A10-7300
				129	- A10 Pro-7350B
				130	- FX-7500
				131	- A8-7200P
				132	- A10-7400P
				133	- FX-7600P
				134	gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
				135	- FirePro W9100
				136	- FirePro S9150
				137	- FirePro S9170
Tony Tye	1fd77b2	2017-06-08 01:47:25 +0000	[diff] [blame]	138	gfx702 dGPU ROCm - Radeon R9 290
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	139	- Radeon R9 290x
				140	- Radeon R390
Tony Tye	1fd77b2	2017-06-08 01:47:25 +0000	[diff] [blame]	141	- Radeon R390x
				142	gfx703 - kabini amdgcn APU - E1-2100
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	143	- mullins - E1-2200
				144	- E1-2500
				145	- E2-3000
				146	- E2-3800
				147	- A4-5000
				148	- A4-5100
				149	- A6-5200
				150	- A4 Pro-3340B
				151	GCN GFX8 (Volcanic Islands (VI)) [AMD-Volcanic-Islands]_
				152	--------------------------------------------------------------------
				153	gfx800 - iceland amdgcn dGPU - FirePro S7150
				154	- FirePro S7100
				155	- FirePro W7100
				156	- Radeon R285
				157	- Radeon R9 380
				158	- Radeon R9 385
				159	- Mobile FirePro
				160	M7170
				161	gfx801 - carrizo amdgcn APU - A6-8500P
				162	- Pro A6-8500B
				163	- A8-8600P
				164	- Pro A8-8600B
				165	- FX-8800P
				166	- Pro A12-8800B
				167	\ amdgcn APU ROCm - A10-8700P
				168	- Pro A10-8700B
				169	- A10-8780P
				170	\ amdgcn APU - A10-9600P
				171	- A10-9630P
				172	- A12-9700P
				173	- A12-9730P
				174	- FX-9800P
				175	- FX-9830P
				176	\ amdgcn APU - E2-9010
				177	- A6-9210
				178	- A9-9410
				179	gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
				180	gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
				181	- Radeon R9 Fury
				182	- Radeon R9 FuryX
				183	- Radeon Pro Duo
				184	- FirePro S9300x2
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	185	- Radeon Instinct MI8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	186	\ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
				187	- Radeon RX 480
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	188	- Radeon Instinct MI6
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	189	\ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
				190	gfx804 amdgcn dGPU Same as gfx803
				191	gfx810 - stoney amdgcn APU
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	192	GCN GFX9 [AMD-Vega]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	193	--------------------------------------------------------------------
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	194	gfx900 amdgcn dGPU - Radeon Vega
				195	Frontier Edition
				196	- Radeon RX Vega 56
				197	- Radeon RX Vega 64
				198	- Radeon RX Vega 64
				199	Liquid
				200	- Radeon Instinct MI25
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	201	gfx901 amdgcn dGPU ROCm Same as gfx900
				202	except XNACK is
				203	enabled
				204	gfx902 amdgcn APU TBA
				205
				206	.. TODO
				207	Add product
				208	names.
				209	gfx903 amdgcn APU Same as gfx902
				210	except XNACK is
				211	enabled
				212	========== =========== ============ ===== ======= ==================
				213
				214	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	215
				216	Address Spaces
				217	--------------
				218
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	219	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	220
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	221	The memory space names used in the table, aside from the region memory space, is
				222	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	223
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	224	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	225
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	226	.. table:: Address Space Mapping
				227	:name: amdgpu-address-space-mapping-table
				228
				229	================== ================= ================= ================= =================
				230	LLVM Address Space Memory Space
				231	------------------ -----------------------------------------------------------------------
				232	\ Current Default amdgiz/amdgizcl hcc Future Default
				233	================== ================= ================= ================= =================
				234	0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
				235	1 Global Global Global Global
				236	2 Constant Constant Constant Region (GDS)
				237	3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
				238	4 Generic (Flat) Region (GDS) Region (GDS) Constant
				239	5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
				240	================== ================= ================= ================= =================
				241
				242	Current Default
				243	This is the current default address space mapping used for all languages
				244	except hcc. This will shortly be deprecated.
				245
				246	amdgiz/amdgizcl
				247	This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
				248	is specified as the target triple environment value.
				249
				250	hcc
				251	This is the current address space mapping used when ``hcc`` is specified as
				252	the target triple environment value.This will shortly be deprecated.
				253
				254	Future Default
				255	This will shortly be the only address space mapping for all languages using
				256	AMDGPU backend.
				257
				258	.. _amdgpu-memory-scopes:
				259
				260	Memory Scopes
				261	-------------
				262
				263	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				264	backend memory model when the target triple OS is ``amdhsa`` (see
				265	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				266
				267	The memory model supported is based on the HSA memory model [HSA]_ which is
				268	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				269	relation is transitive over the synchonizes-with relation independent of scope,
				270	and synchonizes-with allows the memory scope instances to be inclusive (see
				271	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
				272
				273	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				274	inclusion and requires the memory scopes to exactly match. However, this
				275	is conservatively correct for OpenCL.
				276
				277	.. table:: AMDHSA LLVM Sync Scopes for AMDHSA
				278	:name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
				279
				280	================ ==========================================================
				281	LLVM Sync Scope Description
				282	================ ==========================================================
				283	none The default: ``system``.
				284
				285	Synchronizes with, and participates in modification and
				286	seq_cst total orderings with, other operations (except
				287	image operations) for all address spaces (except private,
				288	or generic that accesses private) provided the other
				289	operation's sync scope is:
				290
				291	- ``system``.
				292	- ``agent`` and executed by a thread on the same agent.
				293	- ``workgroup`` and executed by a thread in the same
				294	workgroup.
				295	- ``wavefront`` and executed by a thread in the same
				296	wavefront.
				297
				298	``agent`` Synchronizes with, and participates in modification and
				299	seq_cst total orderings with, other operations (except
				300	image operations) for all address spaces (except private,
				301	or generic that accesses private) provided the other
				302	operation's sync scope is:
				303
				304	- ``system`` or ``agent`` and executed by a thread on the
				305	same agent.
				306	- ``workgroup`` and executed by a thread in the same
				307	workgroup.
				308	- ``wavefront`` and executed by a thread in the same
				309	wavefront.
				310
				311	``workgroup`` Synchronizes with, and participates in modification and
				312	seq_cst total orderings with, other operations (except
				313	image operations) for all address spaces (except private,
				314	or generic that accesses private) provided the other
				315	operation's sync scope is:
				316
				317	- ``system``, ``agent`` or ``workgroup`` and executed by a
				318	thread in the same workgroup.
				319	- ``wavefront`` and executed by a thread in the same
				320	wavefront.
				321
				322	``wavefront`` Synchronizes with, and participates in modification and
				323	seq_cst total orderings with, other operations (except
				324	image operations) for all address spaces (except private,
				325	or generic that accesses private) provided the other
				326	operation's sync scope is:
				327
				328	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				329	and executed by a thread in the same wavefront.
				330
				331	``singlethread`` Only synchronizes with, and participates in modification
				332	and seq_cst total orderings with, other operations (except
				333	image operations) running in the same thread for all
				334	address spaces (for example, in signal handlers).
				335	================ ==========================================================
				336
				337	AMDGPU Intrinsics
				338	-----------------
				339
				340	The AMDGPU backend implements the following intrinsics.
				341
				342	This section is WIP.
				343
				344	.. TODO
				345	List AMDGPU intrinsics
				346
				347	Code Object
				348	===========
				349
				350	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				351	can be linked by ``lld`` to produce a standard ELF shared code object which can
				352	be loaded and executed on an AMDGPU target.
				353
				354	Header
				355	------
				356
				357	The AMDGPU backend uses the following ELF header:
				358
				359	.. table:: AMDGPU ELF Header
				360	:name: amdgpu-elf-header-table
				361
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	362	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	363	Field Value
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	364	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	365	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				366	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Konstantin Zhuravlyov	3696352	2017-10-03 21:18:03 +0000	[diff] [blame]	367	``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``,
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	368	``ELFOSABI_AMDGPU_PAL`` or
				369	``ELFOSABI_AMDGPU_MESA3D``
Konstantin Zhuravlyov	3696352	2017-10-03 21:18:03 +0000	[diff] [blame]	370	``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``,
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	371	``ELFABIVERSION_AMDGPU_PAL`` or
				372	``ELFABIVERSION_AMDGPU_MESA3D``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	373	``e_type`` ``ET_REL`` or ``ET_DYN``
				374	``e_machine`` ``EM_AMDGPU``
				375	``e_entry`` 0
				376	``e_flags`` 0
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	377	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	378
				379	..
				380
				381	.. table:: AMDGPU ELF Header Enumeration Values
				382	:name: amdgpu-elf-header-enumeration-values-table
				383
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	384	=============================== =====
				385	Name Value
				386	=============================== =====
				387	``EM_AMDGPU`` 224
				388	``ELFOSABI_AMDGPU_HSA`` 64
				389	``ELFOSABI_AMDGPU_PAL`` 65
				390	``ELFOSABI_AMDGPU_MESA3D`` 66
				391	``ELFABIVERSION_AMDGPU_HSA`` 1
				392	``ELFABIVERSION_AMDGPU_PAL`` 0
				393	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				394	=============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	395
				396	``e_ident[EI_CLASS]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	397	The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64
				398	bit applications.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	399
				400	``e_ident[EI_DATA]``
				401	All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
				402
				403	``e_ident[EI_OSABI]``
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	404	One of the following AMD GPU architecture specific OS ABIs:
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	405
				406	* ``ELFOSABI_AMDGPU_HSA`` is used to specify that the code object conforms to
				407	the AMD HSA runtime ABI [HSA]_.
				408
				409	* ``ELFOSABI_AMDGPU_PAL`` is used to specify that the code object conforms to
				410	the AMD PAL runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	411
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	412	* ``ELFOSABI_AMDGPU_MESA3D`` is used to specify that the code object conforms
				413	to the AMD MESA runtime ABI.
				414
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	415	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	416	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				417	object conforms:
				418
				419	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				420	runtime ABI.
				421
				422	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				423	runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	424
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	425	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
				426	runtime ABI.
				427
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	428	``e_type``
				429	Can be one of the following values:
				430
				431
				432	``ET_REL``
				433	The type produced by the AMD GPU backend compiler as it is relocatable code
				434	object.
				435
				436	``ET_DYN``
				437	The type produced by the linker as it is a shared code object.
				438
				439	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				440
				441	``e_machine``
				442	The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
				443	architecture family. The specific member is specified in the
				444	``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
				445	:ref:`amdgpu-note-records`).
				446
				447	``e_entry``
				448	The entry point is 0 as the entry points for individual kernels must be
				449	selected in order to invoke them through AQL packets.
				450
				451	``e_flags``
				452	The value is 0 as no flags are used.
				453
				454	Sections
				455	--------
				456
				457	An AMDGPU target ELF code object has the standard ELF sections which include:
				458
				459	.. table:: AMDGPU ELF Sections
				460	:name: amdgpu-elf-sections-table
				461
				462	================== ================ =================================
				463	Name Type Attributes
				464	================== ================ =================================
				465	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				466	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				467	``.debug_``\ \* ``SHT_PROGBITS`` none
				468	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				469	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				470	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				471	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				472	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				473	``.note`` ``SHT_NOTE`` none
				474	``.rela``\ name ``SHT_RELA`` none
				475	``.rela.dyn`` ``SHT_RELA`` none
				476	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				477	``.shstrtab`` ``SHT_STRTAB`` none
				478	``.strtab`` ``SHT_STRTAB`` none
				479	``.symtab`` ``SHT_SYMTAB`` none
				480	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				481	================== ================ =================================
				482
				483	These sections have their standard meanings (see [ELF]_) and are only generated
				484	if needed.
				485
				486	``.debug``\ \*
				487	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				488	DWARF produced by the AMDGPU backend.
				489
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	490	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	491	The standard sections used by a dynamic loader.
				492
				493	``.note``
				494	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				495	backend.
				496
				497	``.rela``\ name, ``.rela.dyn``
				498	For relocatable code objects, name is the name of the section that the
				499	relocation records apply. For example, ``.rela.text`` is the section name for
				500	relocation records associated with the ``.text`` section.
				501
				502	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				503	records from each of the relocatable code object's ``.rela``\ name sections.
				504
				505	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				506	the AMDGPU backend.
				507
				508	``.text``
				509	The executable machine code for the kernels and functions they call. Generated
				510	as position independent code. See :ref:`amdgpu-code-conventions` for
				511	information on conventions used in the isa generation.
				512
				513	.. _amdgpu-note-records:
				514
				515	Note Records
				516	------------
				517
				518	As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
				519	the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
				520	minimal zero byte padding must be generated to ensure the ``desc`` field size is
				521	a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
				522	be at least 4 to indicate at least 8 byte alignment.
				523
				524	The AMDGPU backend code object uses the following ELF note records in the
				525	``.note`` section. The Description column specifies the layout of the note
				526	record’s ``desc`` field. All fields are consecutive bytes. Note records with
				527	variable size strings have a corresponding ``*_size`` field that specifies the
				528	number of bytes, including the terminating null character, in the string. The
				529	string(s) come immediately after the preceding fields.
				530
				531	Additional note records can be present.
				532
				533	.. table:: AMDGPU ELF Note Records
				534	:name: amdgpu-elf-note-records-table
				535
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	536	===== ============================== ======================================
				537	Name Type Description
				538	===== ============================== ======================================
				539	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
				540	"AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
				541	===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	542
				543	..
				544
				545	.. table:: AMDGPU ELF Note Record Enumeration Values
				546	:name: amdgpu-elf-note-record-enumeration-values-table
				547
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	548	============================== =====
				549	Name Value
				550	============================== =====
				551	reserved 0-9
				552	``NT_AMD_AMDGPU_HSA_METADATA`` 10
				553	``NT_AMD_AMDGPU_ISA`` 11
				554	============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	555
				556	``NT_AMD_AMDGPU_ISA``
				557	Specifies the instruction set architecture used by the machine code contained
				558	in the code object.
				559
				560	This note record is required for code objects containing machine code for
				561	processors matching the ``amdgcn`` architecture in table
				562	:ref:`amdgpu-processors`.
				563
				564	The null terminated string has the following syntax:
				565
				566	architecture\ ``-``\ vendor\ ``-``\ os\ ``-``\ environment\ ``-``\ processor
				567
				568	where:
				569
				570	architecture
				571	The architecture from table :ref:`amdgpu-target-triples-table`.
				572
				573	This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
				574	:ref:`amdgpu-target-triples`).
				575
				576	vendor
				577	The vendor from table :ref:`amdgpu-target-triples-table`.
				578
				579	For the AMDGPU backend this is always ``amd``.
				580
				581	os
				582	The OS from table :ref:`amdgpu-target-triples-table`.
				583
				584	environment
				585	An environment from table :ref:`amdgpu-target-triples-table`, or blank if
				586	the environment has no affect on the execution of the code object.
				587
				588	For the AMDGPU backend this is currently always blank.
				589	processor
				590	The processor from table :ref:`amdgpu-processors-table`.
				591
				592	For example:
				593
				594	``amdgcn-amd-amdhsa--gfx901``
				595
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	596	``NT_AMD_AMDGPU_HSA_METADATA``
				597	Specifies extensible metadata associated with the code objects executed on HSA
				598	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				599	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				600	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				601	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	602
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	603	.. _amdgpu-symbols:
				604
				605	Symbols
				606	-------
				607
				608	Symbols include the following:
				609
				610	.. table:: AMDGPU ELF Symbols
				611	:name: amdgpu-elf-symbols-table
				612
				613	===================== ============== ============= ==================
				614	Name Type Section Description
				615	===================== ============== ============= ==================
				616	link-name ``STT_OBJECT`` - ``.data`` Global variable
				617	- ``.rodata``
				618	- ``.bss``
				619	link-name\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
				620	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				621	===================== ============== ============= ==================
				622
				623	Global variable
				624	Global variables both used and defined by the compilation unit.
				625
				626	If the symbol is defined in the compilation unit then it is allocated in the
				627	appropriate section according to if it has initialized data or is readonly.
				628
				629	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				630	will resolve relocations using the definition provided by another code object
				631	or explicitly defined by the runtime.
				632
				633	All global symbols, whether defined in the compilation unit or external, are
				634	accessed by the machine code indirectly through a GOT table entry. This
				635	allows them to be preemptable. The GOT table is only supported when the target
				636	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	637
				638	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	639	Add description of linked shared object symbols. Seems undefined symbols
				640	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	641
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	642	Kernel descriptor
				643	Every HSA kernel has an associated kernel descriptor. It is the address of the
				644	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				645	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				646	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				647
				648	Kernel entry point
				649	Every HSA kernel also has a symbol for its machine code entry point.
				650
				651	.. _amdgpu-relocation-records:
				652
				653	Relocation Records
				654	------------------
				655
				656	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				657	relocatable fields are:
				658
				659	``word32``
				660	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				661	alignment. These values use the same byte order as other word values in the
				662	AMD GPU architecture.
				663
				664	``word64``
				665	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				666	alignment. These values use the same byte order as other word values in the
				667	AMD GPU architecture.
				668
				669	Following notations are used for specifying relocation calculations:
				670
				671	A
				672	Represents the addend used to compute the value of the relocatable field.
				673
				674	G
				675	Represents the offset into the global offset table at which the relocation
				676	entry’s symbol will reside during execution.
				677
				678	GOT
				679	Represents the address of the global offset table.
				680
				681	P
				682	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				683	of the storage unit being relocated (computed using ``r_offset``).
				684
				685	S
				686	Represents the value of the symbol whose index resides in the relocation
				687	entry.
				688
				689	The following relocation types are supported:
				690
				691	.. table:: AMDGPU ELF Relocation Records
				692	:name: amdgpu-elf-relocation-records-table
				693
				694	========================== ===== ========== ==============================
				695	Relocation Type Value Field Calculation
				696	========================== ===== ========== ==============================
				697	``R_AMDGPU_NONE`` 0 none none
				698	``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
				699	``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
				700	``R_AMDGPU_ABS64`` 3 ``word64`` S + A
				701	``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
				702	``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
				703	``R_AMDGPU_ABS32`` 6 ``word32`` S + A
				704	``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
				705	``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				706	``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
				707	``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				708	``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
				709	========================== ===== ========== ==============================
				710
				711	.. _amdgpu-dwarf:
				712
				713	DWARF
				714	-----
				715
				716	Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
				717	information that maps the code object executable code and data to the source
				718	language constructs. It can be used by tools such as debuggers and profilers.
				719
				720	Address Space Mapping
				721	~~~~~~~~~~~~~~~~~~~~~
				722
				723	The following address space mapping is used:
				724
				725	.. table:: AMDGPU DWARF Address Space Mapping
				726	:name: amdgpu-dwarf-address-space-mapping-table
				727
				728	=================== =================
				729	DWARF Address Space Memory Space
				730	=================== =================
				731	1 Private (Scratch)
				732	2 Local (group/LDS)
				733	omitted Global
				734	omitted Constant
				735	omitted Generic (Flat)
				736	not supported Region (GDS)
				737	=================== =================
				738
				739	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				740	used in the table.
				741
				742	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				743	DWARF address space of the value of the pointer when it is in the private or
				744	local address space. Otherwise the attribute is omitted.
				745
				746	An ``XDEREF`` operation is generated in location list expressions for variables
				747	that are allocated in the private and local address space. Otherwise no
				748	``XDREF`` is omitted.
				749
				750	Register Mapping
				751	~~~~~~~~~~~~~~~~
				752
				753	This section is WIP.
				754
				755	.. TODO
				756	Define DWARF register enumeration.
				757
				758	If want to present a wavefront state then should expose vector registers as
				759	64 wide (rather than per work-item view that LLVM uses). Either as separate
				760	registers, or a 64x4 byte single register. In either case use a new LANE op
				761	(akin to XDREF) to select the current lane usage in a location
				762	expression. This would also allow scalar register spilling to vector register
				763	lanes to be expressed (currently no debug information is being generated for
				764	spilling). If choose a wide single register approach then use LANE in
				765	conjunction with PIECE operation to select the dword part of the register for
				766	the current lane. If the separate register approach then use LANE to select
				767	the register.
				768
				769	Source Text
				770	~~~~~~~~~~~
				771
				772	This section is WIP.
				773
				774	.. TODO
				775	DWARF extension to include runtime generated source text.
				776
				777	.. _amdgpu-code-conventions:
				778
				779	Code Conventions
				780	================
				781
				782	This section provides code conventions used for each supported target triple OS
				783	(see :ref:`amdgpu-target-triples`).
				784
				785	AMDHSA
				786	------
				787
				788	This section provides code conventions used when the target triple OS is
				789	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				790
				791	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	792
				793	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	794	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	795
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	796	The code object metadata specifies extensible metadata associated with the code
				797	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
				798	[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
				799	(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
				800	``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
				801	information necessary to support the ROCM kernel queries. For example, the
				802	segment sizes needed in a dispatch packet. In addition, a high level language
				803	runtime may require other information to be included. For example, the AMD
				804	OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	805
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	806	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	807	:doc:`YamlIO`).
				808
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	809	.. TODO
				810	Is the string null terminated? It probably should not if YAML allows it to
				811	contain null characters, otherwise it should be.
				812
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	813	The metadata is represented as a single YAML document comprised of the mapping
				814	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				815	referenced tables.
				816
				817	For boolean values, the string values of ``false`` and ``true`` are used for
				818	false and true respectively.
				819
				820	Additional information can be added to the mappings. To avoid conflicts, any
				821	non-AMD key names should be prefixed by "vendor-name.".
				822
				823	.. table:: AMDHSA Code Object Metadata Mapping
				824	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				825
				826	========== ============== ========= =======================================
				827	String Key Value Type Required? Description
				828	========== ============== ========= =======================================
				829	"Version" sequence of Required - The first integer is the major
				830	2 integers version. Currently 1.
				831	- The second integer is the minor
				832	version. Currently 0.
				833	"Printf" sequence of Each string is encoded information
				834	strings about a printf function call. The
				835	encoded information is organized as
				836	fields separated by colon (':'):
				837
				838	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				839
				840	where:
				841
				842	``ID``
				843	A 32 bit integer as a unique id for
				844	each printf function call
				845
				846	``N``
				847	A 32 bit integer equal to the number
				848	of arguments of printf function call
				849	minus 1
				850
				851	``S[i]`` (where i = 0, 1, ... , N-1)
				852	32 bit integers for the size in bytes
				853	of the i-th FormatString argument of
				854	the printf function call
				855
				856	FormatString
				857	The format string passed to the
				858	printf function call.
				859	"Kernels" sequence of Required Sequence of the mappings for each
				860	mapping kernel in the code object. See
				861	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				862	for the definition of the mapping.
				863	========== ============== ========= =======================================
				864
				865	..
				866
				867	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				868	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				869
				870	================= ============== ========= ================================
				871	String Key Value Type Required? Description
				872	================= ============== ========= ================================
				873	"Name" string Required Source name of the kernel.
				874	"SymbolName" string Required Name of the kernel
				875	descriptor ELF symbol.
				876	"Language" string Source language of the kernel.
				877	Values include:
				878
				879	- "OpenCL C"
				880	- "OpenCL C++"
				881	- "HCC"
				882	- "OpenMP"
				883
				884	"LanguageVersion" sequence of - The first integer is the major
				885	2 integers version.
				886	- The second integer is the
				887	minor version.
				888	"Attrs" mapping Mapping of kernel attributes.
				889	See
				890	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				891	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	892	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	893	mapping kernel arguments. See
				894	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				895	for the definition of the mapping.
				896	"CodeProps" mapping Mapping of properties related to
				897	the kernel code. See
				898	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				899	for the mapping definition.
				900	"DebugProps" mapping Mapping of properties related to
				901	the kernel debugging. See
				902	:ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
				903	for the mapping definition.
				904	================= ============== ========= ================================
				905
				906	..
				907
				908	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				909	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				910
				911	=================== ============== ========= ==============================
				912	String Key Value Type Required? Description
				913	=================== ============== ========= ==============================
				914	"ReqdWorkGroupSize" sequence of The dispatch work-group size
				915	3 integers X, Y, Z must correspond to the
				916	specified values.
				917
				918	Corresponds to the OpenCL
				919	``reqd_work_group_size``
				920	attribute.
				921	"WorkGroupSizeHint" sequence of The dispatch work-group size
				922	3 integers X, Y, Z is likely to be the
				923	specified values.
				924
				925	Corresponds to the OpenCL
				926	``work_group_size_hint``
				927	attribute.
				928	"VecTypeHint" string The name of a scalar or vector
				929	type.
				930
				931	Corresponds to the OpenCL
				932	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	933
				934	"RuntimeHandle" string The external symbol name
				935	associated with a kernel.
				936	OpenCL runtime allocates a
				937	global buffer for the symbol
				938	and saves the kernel's address
				939	to it, which is used for
				940	device side enqueueing. Only
				941	available for device side
				942	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	943	=================== ============== ========= ==============================
				944
				945	..
				946
				947	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				948	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				949
				950	================= ============== ========= ================================
				951	String Key Value Type Required? Description
				952	================= ============== ========= ================================
				953	"Name" string Kernel argument name.
				954	"TypeName" string Kernel argument type name.
				955	"Size" integer Required Kernel argument size in bytes.
				956	"Align" integer Required Kernel argument alignment in
				957	bytes. Must be a power of two.
				958	"ValueKind" string Required Kernel argument kind that
				959	specifies how to set up the
				960	corresponding argument.
				961	Values include:
				962
				963	"ByValue"
				964	The argument is copied
				965	directly into the kernarg.
				966
				967	"GlobalBuffer"
				968	A global address space pointer
				969	to the buffer data is passed
				970	in the kernarg.
				971
				972	"DynamicSharedPointer"
				973	A group address space pointer
				974	to dynamically allocated LDS
				975	is passed in the kernarg.
				976
				977	"Sampler"
				978	A global address space
				979	pointer to a S# is passed in
				980	the kernarg.
				981
				982	"Image"
				983	A global address space
				984	pointer to a T# is passed in
				985	the kernarg.
				986
				987	"Pipe"
				988	A global address space pointer
				989	to an OpenCL pipe is passed in
				990	the kernarg.
				991
				992	"Queue"
				993	A global address space pointer
				994	to an OpenCL device enqueue
				995	queue is passed in the
				996	kernarg.
				997
				998	"HiddenGlobalOffsetX"
				999	The OpenCL grid dispatch
				1000	global offset for the X
				1001	dimension is passed in the
				1002	kernarg.
				1003
				1004	"HiddenGlobalOffsetY"
				1005	The OpenCL grid dispatch
				1006	global offset for the Y
				1007	dimension is passed in the
				1008	kernarg.
				1009
				1010	"HiddenGlobalOffsetZ"
				1011	The OpenCL grid dispatch
				1012	global offset for the Z
				1013	dimension is passed in the
				1014	kernarg.
				1015
				1016	"HiddenNone"
				1017	An argument that is not used
				1018	by the kernel. Space needs to
				1019	be left for it, but it does
				1020	not need to be set up.
				1021
				1022	"HiddenPrintfBuffer"
				1023	A global address space pointer
				1024	to the runtime printf buffer
				1025	is passed in kernarg.
				1026
				1027	"HiddenDefaultQueue"
				1028	A global address space pointer
				1029	to the OpenCL device enqueue
				1030	queue that should be used by
				1031	the kernel by default is
				1032	passed in the kernarg.
				1033
				1034	"HiddenCompletionAction"
				1035	TBD
				1036
				1037	.. TODO
				1038	Add description.
				1039
				1040	"ValueType" string Required Kernel argument value type. Only
				1041	present if "ValueKind" is
				1042	"ByValue". For vector data
				1043	types, the value is for the
				1044	element type. Values include:
				1045
				1046	- "Struct"
				1047	- "I8"
				1048	- "U8"
				1049	- "I16"
				1050	- "U16"
				1051	- "F16"
				1052	- "I32"
				1053	- "U32"
				1054	- "F32"
				1055	- "I64"
				1056	- "U64"
				1057	- "F64"
				1058
				1059	.. TODO
				1060	How can it be determined if a
				1061	vector type, and what size
				1062	vector?
				1063	"PointeeAlign" integer Alignment in bytes of pointee
				1064	type for pointer type kernel
				1065	argument. Must be a power
				1066	of 2. Only present if
				1067	"ValueKind" is
				1068	"DynamicSharedPointer".
				1069	"AddrSpaceQual" string Kernel argument address space
				1070	qualifier. Only present if
				1071	"ValueKind" is "GlobalBuffer" or
				1072	"DynamicSharedPointer". Values
				1073	are:
				1074
				1075	- "Private"
				1076	- "Global"
				1077	- "Constant"
				1078	- "Local"
				1079	- "Generic"
				1080	- "Region"
				1081
				1082	.. TODO
				1083	Is GlobalBuffer only Global
				1084	or Constant? Is
				1085	DynamicSharedPointer always
				1086	Local? Can HCC allow Generic?
				1087	How can Private or Region
				1088	ever happen?
				1089	"AccQual" string Kernel argument access
				1090	qualifier. Only present if
				1091	"ValueKind" is "Image" or
				1092	"Pipe". Values
				1093	are:
				1094
				1095	- "ReadOnly"
				1096	- "WriteOnly"
				1097	- "ReadWrite"
				1098
				1099	.. TODO
				1100	Does this apply to
				1101	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1102	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1103	performed by the kernel on the
				1104	kernel argument. Only present if
				1105	"ValueKind" is "GlobalBuffer",
				1106	"Image", or "Pipe". This may be
				1107	more restrictive than indicated
				1108	by "AccQual" to reflect what the
				1109	kernel actual does. If not
				1110	present then the runtime must
				1111	assume what is implied by
				1112	"AccQual" and "IsConst". Values
				1113	are:
				1114
				1115	- "ReadOnly"
				1116	- "WriteOnly"
				1117	- "ReadWrite"
				1118
				1119	"IsConst" boolean Indicates if the kernel argument
				1120	is const qualified. Only present
				1121	if "ValueKind" is
				1122	"GlobalBuffer".
				1123
				1124	"IsRestrict" boolean Indicates if the kernel argument
				1125	is restrict qualified. Only
				1126	present if "ValueKind" is
				1127	"GlobalBuffer".
				1128
				1129	"IsVolatile" boolean Indicates if the kernel argument
				1130	is volatile qualified. Only
				1131	present if "ValueKind" is
				1132	"GlobalBuffer".
				1133
				1134	"IsPipe" boolean Indicates if the kernel argument
				1135	is pipe qualified. Only present
				1136	if "ValueKind" is "Pipe".
				1137
				1138	.. TODO
				1139	Can GlobalBuffer be pipe
				1140	qualified?
				1141	================= ============== ========= ================================
				1142
				1143	..
				1144
				1145	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1146	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1147
				1148	============================ ============== ========= =====================
				1149	String Key Value Type Required? Description
				1150	============================ ============== ========= =====================
				1151	"KernargSegmentSize" integer Required The size in bytes of
				1152	the kernarg segment
				1153	that holds the values
				1154	of the arguments to
				1155	the kernel.
				1156	"GroupSegmentFixedSize" integer Required The amount of group
				1157	segment memory
				1158	required by a
				1159	work-group in
				1160	bytes. This does not
				1161	include any
				1162	dynamically allocated
				1163	group segment memory
				1164	that may be added
				1165	when the kernel is
				1166	dispatched.
				1167	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1168	private address space
				1169	memory required for a
				1170	work-item in
				1171	bytes. If
				1172	IsDynamicCallstack
				1173	is 1 then additional
				1174	space must be added
				1175	to this value for the
				1176	call stack.
				1177	"KernargSegmentAlign" integer Required The maximum byte
				1178	alignment of
				1179	arguments in the
				1180	kernarg segment. Must
				1181	be a power of 2.
				1182	"WavefrontSize" integer Required Wavefront size. Must
				1183	be a power of 2.
				1184	"NumSGPRs" integer Number of scalar
				1185	registers used by a
				1186	wavefront for
				1187	GFX6-GFX9. This
				1188	includes the special
				1189	SGPRs for VCC, Flat
				1190	Scratch (GFX7-GFX9)
				1191	and XNACK (for
				1192	GFX8-GFX9). It does
				1193	not include the 16
				1194	SGPR added if a trap
				1195	handler is
				1196	enabled. It is not
				1197	rounded up to the
				1198	allocation
				1199	granularity.
				1200	"NumVGPRs" integer Number of vector
				1201	registers used by
				1202	each work-item for
				1203	GFX6-GFX9
				1204	"MaxFlatWorkgroupSize" integer Maximum flat
				1205	work-group size
				1206	supported by the
				1207	kernel in work-items.
				1208	"IsDynamicCallStack" boolean Indicates if the
				1209	generated machine
				1210	code is using a
				1211	dynamically sized
				1212	call stack.
				1213	"IsXNACKEnabled" boolean Indicates if the
				1214	generated machine
				1215	code is capable of
				1216	supporting XNACK.
				1217	============================ ============== ========= =====================
				1218
				1219	..
				1220
				1221	.. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
				1222	:name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
				1223
				1224	=================================== ============== ========= ==============
				1225	String Key Value Type Required? Description
				1226	=================================== ============== ========= ==============
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1227	"DebuggerABIVersion" sequence of
				1228	2 integers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1229	"ReservedNumVGPRs" integer
				1230	"ReservedFirstVGPR" integer
				1231	"PrivateSegmentBufferSGPR" integer
				1232	"WavefrontPrivateSegmentOffsetSGPR" integer
				1233	=================================== ============== ========= ==============
				1234
				1235	.. TODO
				1236	Plan to remove the debug properties metadata.
				1237
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1238	Kernel Dispatch
				1239	~~~~~~~~~~~~~~~
				1240
				1241	The HSA architected queuing language (AQL) defines a user space memory interface
				1242	that can be used to control the dispatch of kernels, in an agent independent
				1243	way. An agent can have zero or more AQL queues created for it using the ROCm
				1244	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1245	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1246	mechanics and packet layouts.
				1247
				1248	The packet processor of a kernel agent is responsible for detecting and
				1249	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1250	packet processor is implemented by the hardware command processor (CP),
				1251	asynchronous dispatch controller (ADC) and shader processor input controller
				1252	(SPI).
				1253
				1254	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1255	mode driver to initialize and register the AQL queue with CP.
				1256
				1257	To dispatch a kernel the following actions are performed. This can occur in the
				1258	CPU host program, or from an HSA kernel executing on a GPU.
				1259
				1260	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1261	executed is obtained.
				1262	2. A pointer to the kernel descriptor (see
				1263	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1264	obtained. It must be for a kernel that is contained in a code object that that
				1265	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1266	associated.
				1267	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1268	for a memory region with the kernarg property for the kernel agent that will
				1269	execute the kernel. It must be at least 16 byte aligned.
				1270	4. Kernel argument values are assigned to the kernel argument memory
				1271	allocation. The layout is defined in the HSA Programmer’s Language Reference
				1272	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1273	memory in the same way constant memory is accessed. (Note that the HSA
				1274	specification allows an implementation to copy the kernel argument contents to
				1275	another location that is accessed by the kernel.)
				1276	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1277	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1278	packet. The packet must be set up, and the final write must use an atomic
				1279	store release to set the packet kind to ensure the packet contents are
				1280	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1281	notify the kernel agent that the AQL queue has been updated. These rules, and
				1282	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1283	System Architecture Specification* [HSA]_.
				1284	6. A kernel dispatch packet includes information about the actual dispatch,
				1285	such as grid and work-group size, together with information from the code
				1286	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1287	the kernel symbol can be used to obtain the code object values which are
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1288	recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1289	7. CP executes micro-code and is responsible for detecting and setting up the
				1290	GPU to execute the wavefronts of a kernel dispatch.
				1291	8. CP ensures that when the a wavefront starts executing the kernel machine
				1292	code, the scalar general purpose registers (SGPR) and vector general purpose
				1293	registers (VGPR) are set up as required by the machine code. The required
				1294	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1295	register state is defined in
				1296	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1297	9. The prolog of the kernel machine code (see
				1298	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1299	before continuing executing the machine code that corresponds to the kernel.
				1300	10. When the kernel dispatch has completed execution, CP signals the completion
				1301	signal specified in the kernel dispatch packet if not 0.
				1302
				1303	.. _amdgpu-amdhsa-memory-spaces:
				1304
				1305	Memory Spaces
				1306	~~~~~~~~~~~~~
				1307
				1308	The memory space properties are:
				1309
				1310	.. table:: AMDHSA Memory Spaces
				1311	:name: amdgpu-amdhsa-memory-spaces-table
				1312
				1313	================= =========== ======== ======= ==================
				1314	Memory Space Name HSA Segment Hardware Address NULL Value
				1315	Name Name Size
				1316	================= =========== ======== ======= ==================
				1317	Private private scratch 32 0x00000000
				1318	Local group LDS 32 0xFFFFFFFF
				1319	Global global global 64 0x0000000000000000
				1320	Constant constant *same as 64 0x0000000000000000
				1321	global*
				1322	Generic flat flat 64 0x0000000000000000
				1323	Region N/A GDS 32 *not implemented
				1324	for AMDHSA*
				1325	================= =========== ======== ======= ==================
				1326
				1327	The global and constant memory spaces both use global virtual addresses, which
				1328	are the same virtual address space used by the CPU. However, some virtual
				1329	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1330	and some by both.
				1331
				1332	Using the constant memory space indicates that the data will not change during
				1333	the execution of the kernel. This allows scalar read instructions to be
				1334	used. The vector and scalar L1 caches are invalidated of volatile data before
				1335	each kernel dispatch execution to allow constant memory to change values between
				1336	kernel dispatches.
				1337
				1338	The local memory space uses the hardware Local Data Store (LDS) which is
				1339	automatically allocated when the hardware creates work-groups of wavefronts, and
				1340	freed when all the wavefronts of a work-group have terminated. The data store
				1341	(DS) instructions can be used to access it.
				1342
				1343	The private memory space uses the hardware scratch memory support. If the kernel
				1344	uses scratch, then the hardware allocates memory that is accessed using
				1345	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1346	address to physical address is:
				1347
				1348	``wavefront-scratch-base +
				1349	(private-address * wavefront-size * 4) +
				1350	(wavefront-lane-id * 4)``
				1351
				1352	There are different ways that the wavefront scratch base address is determined
				1353	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1354	memory can be accessed in an interleaved manner using buffer instruction with
				1355	the scratch buffer descriptor and per wave scratch offset, by the scratch
				1356	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1357	same private address, the interleaving results in adjacent dwords being accessed
				1358	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1359	supported except by flat and scratch instructions in GFX9.
				1360
				1361	The generic address space uses the hardware flat address support available in
				1362	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1363	local appertures), that are outside the range of addressible global memory, to
				1364	map from a flat address to a private or local address.
				1365
				1366	FLAT instructions can take a flat address and access global, private (scratch)
				1367	and group (LDS) memory depending in if the address is within one of the
				1368	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1369	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1370	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1371	(see :ref:`amdgpu-amdhsa-m0`).
				1372
				1373	To convert between a segment address and a flat address the base address of the
				1374	appertures address can be used. For GFX7-GFX8 these are available in the
				1375	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1376	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1377	GFX9 the appature base addresses are directly available as inline constant
				1378	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1379	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1380	which makes it easier to convert from flat to segment or segment to flat.
				1381
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1382	Image and Samplers
				1383	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1384
				1385	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1386	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1387	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1388	enumeration values for the queries that are not trivially deducible from the S#
				1389	representation.
				1390
				1391	HSA Signals
				1392	~~~~~~~~~~~
				1393
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1394	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1395	structure allocated in memory accessible from both the CPU and GPU. The
				1396	structure is defined by the ROCm runtime and subject to change between releases
				1397	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1398
				1399	.. _amdgpu-amdhsa-hsa-aql-queue:
				1400
				1401	HSA AQL Queue
				1402	~~~~~~~~~~~~~
				1403
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1404	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1405	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1406	fields needed to implement certain language features such as the flat address
				1407	aperture bases. It also contains fields used by CP such as managing the
				1408	allocation of scratch memory.
				1409
				1410	.. _amdgpu-amdhsa-kernel-descriptor:
				1411
				1412	Kernel Descriptor
				1413	~~~~~~~~~~~~~~~~~
				1414
				1415	A kernel descriptor consists of the information needed by CP to initiate the
				1416	execution of a kernel, including the entry point address of the machine code
				1417	that implements the kernel.
				1418
				1419	Kernel Descriptor for GFX6-GFX9
				1420	+++++++++++++++++++++++++++++++
				1421
				1422	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1423
				1424	.. table:: Kernel Descriptor for GFX6-GFX9
				1425	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1426
				1427	======= ======= =============================== ===========================
				1428	Bits Size Field Name Description
				1429	======= ======= =============================== ===========================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1430	31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1431	address space memory
				1432	required for a work-group
				1433	in bytes. This does not
				1434	include any dynamically
				1435	allocated local address
				1436	space memory that may be
				1437	added when the kernel is
				1438	dispatched.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1439	63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1440	private address space
				1441	memory required for a
				1442	work-item in bytes. If
				1443	is_dynamic_callstack is 1
				1444	then additional space must
				1445	be added to this value for
				1446	the call stack.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1447	95:64 4 bytes MaxFlatWorkgroupSize Maximum flat work-group
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1448	size supported by the
				1449	kernel in work-items.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1450	96 1 bit IsDynamicCallStack Indicates if the generated
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1451	machine code is using a
				1452	dynamically sized call
				1453	stack.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1454	97 1 bit IsXNACKEnabled Indicates if the generated
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1455	machine code is capable of
				1456	suppoting XNACK.
				1457	127:98 30 bits Reserved. Must be 0.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1458	191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1459	negative) from base
				1460	address of kernel
				1461	descriptor to kernel's
				1462	entry point instruction
				1463	which must be 256 byte
				1464	aligned.
				1465	383:192 24 Reserved. Must be 0.
				1466	bytes
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1467	415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1468	program settings used by
				1469	CP to set up
				1470	``COMPUTE_PGM_RSRC1``
				1471	configuration
				1472	register. See
				1473	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1474	447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1475	program settings used by
				1476	CP to set up
				1477	``COMPUTE_PGM_RSRC2``
				1478	configuration
				1479	register. See
				1480	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1481	448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
				1482	SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1483	(see
				1484	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1485
				1486	The total number of SGPR
				1487	user data registers
				1488	requested must not exceed
				1489	16 and match value in
				1490	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1491	Any requests beyond 16
				1492	will be ignored.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1493	449 1 bit EnableSGPRDispatchPtr see above
				1494	450 1 bit EnableSGPRQueuePtr see above
				1495	451 1 bit EnableSGPRKernargSegmentPtr see above
				1496	452 1 bit EnableSGPRDispatchID see above
				1497	453 1 bit EnableSGPRFlatScratchInit see above
				1498	454 1 bit EnableSGPRPrivateSegmentSize see above
				1499	455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
				1500	should always be 0.
				1501	456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
				1502	should always be 0.
				1503	457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
				1504	should always be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1505	463:458 6 bits Reserved. Must be 0.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1506	511:464 6 Reserved. Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1507	bytes
				1508	512 Total size 64 bytes.
				1509	======= ===================================================================
				1510
				1511	..
				1512
				1513	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
				1514	:name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
				1515
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1516	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1517	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1518	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1519	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1520	used by each work-item,
				1521	granularity is device
				1522	specific:
				1523
				1524	GFX6-9
				1525	roundup((max-vgpg + 1)
				1526	/ 4) - 1
				1527
				1528	Used by CP to set up
				1529	``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1530	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1531	used by a wavefront,
				1532	granularity is device
				1533	specific:
				1534
				1535	GFX6-8
				1536	roundup((max-sgpg + 1)
				1537	/ 8) - 1
				1538	GFX9
				1539	roundup((max-sgpg + 1)
				1540	/ 16) - 1
				1541
				1542	Includes the special SGPRs
				1543	for VCC, Flat Scratch (for
				1544	GFX7 onwards) and XNACK
				1545	(for GFX8 onwards). It does
				1546	not include the 16 SGPR
				1547	added if a trap handler is
				1548	enabled.
				1549
				1550	Used by CP to set up
				1551	``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1552	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1553
				1554	Start executing wavefront
				1555	at the specified priority.
				1556
				1557	CP is responsible for
				1558	filling in
				1559	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1560	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1561	with specified rounding
				1562	mode for single (32
				1563	bit) floating point
				1564	precision floating point
				1565	operations.
				1566
				1567	Floating point rounding
				1568	mode values are defined in
				1569	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1570
				1571	Used by CP to set up
				1572	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1573	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1574	with specified rounding
				1575	denorm mode for half/double (16
				1576	and 64 bit) floating point
				1577	precision floating point
				1578	operations.
				1579
				1580	Floating point rounding
				1581	mode values are defined in
				1582	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1583
				1584	Used by CP to set up
				1585	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1586	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1587	with specified denorm mode
				1588	for single (32
				1589	bit) floating point
				1590	precision floating point
				1591	operations.
				1592
				1593	Floating point denorm mode
				1594	values are defined in
				1595	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1596
				1597	Used by CP to set up
				1598	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1599	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1600	with specified denorm mode
				1601	for half/double (16
				1602	and 64 bit) floating point
				1603	precision floating point
				1604	operations.
				1605
				1606	Floating point denorm mode
				1607	values are defined in
				1608	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1609
				1610	Used by CP to set up
				1611	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1612	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1613
				1614	Start executing wavefront
				1615	in privilege trap handler
				1616	mode.
				1617
				1618	CP is responsible for
				1619	filling in
				1620	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1621	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1622	with DX10 clamp mode
				1623	enabled. Used by the vector
				1624	ALU to force DX-10 style
				1625	treatment of NaN's (when
				1626	set, clamp NaN to zero,
				1627	otherwise pass NaN
				1628	through).
				1629
				1630	Used by CP to set up
				1631	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1632	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1633
				1634	Start executing wavefront
				1635	in single step mode.
				1636
				1637	CP is responsible for
				1638	filling in
				1639	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1640	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1641	with IEEE mode
				1642	enabled. Floating point
				1643	opcodes that support
				1644	exception flag gathering
				1645	will quiet and propagate
				1646	signaling-NaN inputs per
				1647	IEEE 754-2008. Min_dx10 and
				1648	max_dx10 become IEEE
				1649	754-2008 compliant due to
				1650	signaling-NaN propagation
				1651	and quieting.
				1652
				1653	Used by CP to set up
				1654	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1655	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1656
				1657	Only one work-group allowed
				1658	to execute on a compute
				1659	unit.
				1660
				1661	CP is responsible for
				1662	filling in
				1663	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1664	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1665
				1666	Flag that can be used to
				1667	control debugging code.
				1668
				1669	CP is responsible for
				1670	filling in
				1671	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1672	26 1 bit FP16_OVFL GFX6-8:
				1673	Reserved. Must be 0.
				1674	GFX9:
				1675	Wavefront starts
				1676	execution with specified
				1677	fp16 overflow mode.
				1678
				1679	- If 0, then fp16
				1680	overflow generates
				1681	+/-INF values.
				1682	- If 1, then fp16
				1683	overflow that is the
				1684	result of an +/-INF
				1685	input value or divide
				1686	by 0 generates a
				1687	+/-INF, otherwise
				1688	clamps computed
				1689	overflow to +/-MAX_FP16
				1690	as appropriate.
				1691
				1692	Used by CP to set up
				1693	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
				1694	31:27 5 bits Reserved. Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1695	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1696	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1697
				1698	..
				1699
				1700	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1701	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1702
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1703	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1704	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1705	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1706	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
				1707	_WAVE_OFFSET SGPR wave scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1708	system register (see
				1709	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1710
				1711	Used by CP to set up
				1712	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1713	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1714	user data registers
				1715	requested. This number must
				1716	match the number of user
				1717	data registers enabled.
				1718
				1719	Used by CP to set up
				1720	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1721	6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1722	TRAP instruction which
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1723	requires a trap handler to
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1724	be enabled.
				1725
				1726	CP sets
				1727	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
				1728	if the runtime has
				1729	installed a trap handler
				1730	regardless of the setting
				1731	of this field.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1732	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1733	system SGPR register for
				1734	the work-group id in the X
				1735	dimension (see
				1736	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1737
				1738	Used by CP to set up
				1739	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1740	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1741	system SGPR register for
				1742	the work-group id in the Y
				1743	dimension (see
				1744	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1745
				1746	Used by CP to set up
				1747	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1748	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1749	system SGPR register for
				1750	the work-group id in the Z
				1751	dimension (see
				1752	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1753
				1754	Used by CP to set up
				1755	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1756	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1757	system SGPR register for
				1758	work-group information (see
				1759	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1760
				1761	Used by CP to set up
				1762	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1763	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1764	VGPR system registers used
				1765	for the work-item ID.
				1766	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1767	defines the values.
				1768
				1769	Used by CP to set up
				1770	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1771	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1772
				1773	Wavefront starts execution
				1774	with address watch
				1775	exceptions enabled which
				1776	are generated when L1 has
				1777	witnessed a thread access
				1778	an *address of
				1779	interest*.
				1780
				1781	CP is responsible for
				1782	filling in the address
				1783	watch bit in
				1784	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1785	according to what the
				1786	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1787	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1788
				1789	Wavefront starts execution
				1790	with memory violation
				1791	exceptions exceptions
				1792	enabled which are generated
				1793	when a memory violation has
				1794	occurred for this wave from
				1795	L1 or LDS
				1796	(write-to-read-only-memory,
				1797	mis-aligned atomic, LDS
				1798	address out of range,
				1799	illegal address, etc.).
				1800
				1801	CP sets the memory
				1802	violation bit in
				1803	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1804	according to what the
				1805	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1806	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1807
				1808	CP uses the rounded value
				1809	from the dispatch packet,
				1810	not this value, as the
				1811	dispatch may contain
				1812	dynamically allocated group
				1813	segment memory. CP writes
				1814	directly to
				1815	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1816
				1817	Amount of group segment
				1818	(LDS) to allocate for each
				1819	work-group. Granularity is
				1820	device specific:
				1821
				1822	GFX6:
				1823	roundup(lds-size / (64 * 4))
				1824	GFX7-GFX9:
				1825	roundup(lds-size / (128 * 4))
				1826
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1827	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				1828	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1829	enabled.
				1830
				1831	Used by CP to set up
				1832	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1833	(set from bits 0..6).
				1834
				1835	IEEE 754 FP Invalid
				1836	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1837	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				1838	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1839	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1840	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				1841	_DIVISION_BY_ZERO Zero
				1842	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
				1843	_OVERFLOW
				1844	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				1845	_UNDERFLOW
				1846	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				1847	_INEXACT
				1848	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				1849	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1850	only)
				1851	31 1 bit Reserved. Must be 0.
				1852	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1853	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1854
				1855	..
				1856
				1857	.. table:: Floating Point Rounding Mode Enumeration Values
				1858	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1859
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1860	====================================== ===== ==============================
				1861	Enumeration Name Value Description
				1862	====================================== ===== ==============================
				1863	AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				1864	AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				1865	AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				1866	AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
				1867	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1868
				1869	..
				1870
				1871	.. table:: Floating Point Denorm Mode Enumeration Values
				1872	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				1873
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1874	====================================== ===== ==============================
				1875	Enumeration Name Value Description
				1876	====================================== ===== ==============================
				1877	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
				1878	Denorms
				1879	AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				1880	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				1881	AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
				1882	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1883
				1884	..
				1885
				1886	.. table:: System VGPR Work-Item ID Enumeration Values
				1887	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				1888
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame^]	1889	======================================== ===== ============================
				1890	Enumeration Name Value Description
				1891	======================================== ===== ============================
				1892	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
				1893	ID.
				1894	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				1895	dimensions ID.
				1896	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				1897	dimensions ID.
				1898	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				1899	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1900
				1901	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				1902
				1903	Initial Kernel Execution State
				1904	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1905
				1906	This section defines the register state that will be set up by the packet
				1907	processor prior to the start of execution of every wavefront. This is limited by
				1908	the constraints of the hardware controllers of CP/ADC/SPI.
				1909
				1910	The order of the SGPR registers is defined, but the compiler can specify which
				1911	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				1912	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				1913	for enabled registers are dense starting at SGPR0: the first enabled register is
				1914	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				1915	an SGPR number.
				1916
				1917	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
				1918	all waves of the grid. It is possible to specify more than 16 User SGPRs using
				1919	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				1920	initialized. These are then immediately followed by the System SGPRs that are
				1921	set up by ADC/SPI and can have different values for each wave of the grid
				1922	dispatch.
				1923
				1924	SGPR register initial state is defined in
				1925	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				1926
				1927	.. table:: SGPR Register Set Up Order
				1928	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				1929
				1930	========== ========================== ====== ==============================
				1931	SGPR Order Name Number Description
				1932	(kernel descriptor enable of
				1933	field) SGPRs
				1934	========== ========================== ====== ==============================
				1935	First Private Segment Buffer 4 V# that can be used, together
				1936	(enable_sgpr_private with Scratch Wave Offset as an
				1937	_segment_buffer) offset, to access the private
				1938	memory space using a segment
				1939	address.
				1940
				1941	CP uses the value provided by
				1942	the runtime.
				1943	then Dispatch Ptr 2 64 bit address of AQL dispatch
				1944	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				1945	actually executing.
				1946	then Queue Ptr 2 64 bit address of amd_queue_t
				1947	(enable_sgpr_queue_ptr) object for AQL queue on which
				1948	the dispatch packet was
				1949	queued.
				1950	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				1951	(enable_sgpr_kernarg segment. This is directly
				1952	_segment_ptr) copied from the
				1953	kernarg_address in the kernel
				1954	dispatch packet.
				1955
				1956	Having CP load it once avoids
				1957	loading it at the beginning of
				1958	every wavefront.
				1959	then Dispatch Id 2 64 bit Dispatch ID of the
				1960	(enable_sgpr_dispatch_id) dispatch packet being
				1961	executed.
				1962	then Flat Scratch Init 2 This is 2 SGPRs:
				1963	(enable_sgpr_flat_scratch
				1964	_init) GFX6
				1965	Not supported.
				1966	GFX7-GFX8
				1967	The first SGPR is a 32 bit
				1968	byte offset from
				1969	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				1970	to per SPI base of memory
				1971	for scratch for the queue
				1972	executing the kernel
				1973	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1974	from the runtime. (The
				1975	Scratch Segment Buffer base
				1976	address is
				1977	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				1978	plus this offset.) The value
				1979	of Scratch Wave Offset must
				1980	be added to this offset by
				1981	the kernel machine code,
				1982	right shifted by 8, and
				1983	moved to the FLAT_SCRATCH_HI
				1984	SGPR register.
				1985	FLAT_SCRATCH_HI corresponds
				1986	to SGPRn-4 on GFX7, and
				1987	SGPRn-6 on GFX8 (where SGPRn
				1988	is the highest numbered SGPR
				1989	allocated to the wave).
				1990	FLAT_SCRATCH_HI is
				1991	multiplied by 256 (as it is
				1992	in units of 256 bytes) and
				1993	added to
				1994	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				1995	to calculate the per wave
				1996	FLAT SCRATCH BASE in flat
				1997	memory instructions that
				1998	access the scratch
				1999	apperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2000
				2001	The second SGPR is 32 bit
				2002	byte size of a single
				2003	work-item’s scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2004	usage. CP obtains this from
				2005	the runtime, and it is
				2006	always a multiple of DWORD.
				2007	CP checks that the value in
				2008	the kernel dispatch packet
				2009	Private Segment Byte Size is
				2010	not larger, and requests the
				2011	runtime to increase the
				2012	queue's scratch size if
				2013	necessary. The kernel code
				2014	must move it to
				2015	FLAT_SCRATCH_LO which is
				2016	SGPRn-3 on GFX7 and SGPRn-5
				2017	on GFX8. FLAT_SCRATCH_LO is
				2018	used as the FLAT SCRATCH
				2019	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2020	instructions. Having CP load
				2021	it once avoids loading it at
				2022	the beginning of every
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2023	wavefront. GFX9 This is the
				2024	64 bit base address of the
				2025	per SPI scratch backing
				2026	memory managed by SPI for
				2027	the queue executing the
				2028	kernel dispatch. CP obtains
				2029	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2030	divides it if there are
				2031	multiple Shader Arrays each
				2032	with its own SPI). The value
				2033	of Scratch Wave Offset must
				2034	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2035	machine code and the result
				2036	moved to the FLAT_SCRATCH
				2037	SGPR which is SGPRn-6 and
				2038	SGPRn-5. It is used as the
				2039	FLAT SCRATCH BASE in flat
				2040	memory instructions. then
				2041	Private Segment Size 1 The
				2042	32 bit byte size of a
				2043	(enable_sgpr_private single
				2044	work-item's
				2045	scratch_segment_size) memory
				2046	allocation. This is the
				2047	value from the kernel
				2048	dispatch packet Private
				2049	Segment Byte Size rounded up
				2050	by CP to a multiple of
				2051	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2052
				2053	Having CP load it once avoids
				2054	loading it at the beginning of
				2055	every wavefront.
				2056
				2057	This is not used for
				2058	GFX7-GFX8 since it is the same
				2059	value as the second SGPR of
				2060	Flat Scratch Init. However, it
				2061	may be needed for GFX9 which
				2062	changes the meaning of the
				2063	Flat Scratch Init value.
				2064	then Grid Work-Group Count X 1 32 bit count of the number of
				2065	(enable_sgpr_grid work-groups in the X dimension
				2066	_workgroup_count_X) for the grid being
				2067	executed. Computed from the
				2068	fields in the kernel dispatch
				2069	packet as ((grid_size.x +
				2070	workgroup_size.x - 1) /
				2071	workgroup_size.x).
				2072	then Grid Work-Group Count Y 1 32 bit count of the number of
				2073	(enable_sgpr_grid work-groups in the Y dimension
				2074	_workgroup_count_Y && for the grid being
				2075	less than 16 previous executed. Computed from the
				2076	SGPRs) fields in the kernel dispatch
				2077	packet as ((grid_size.y +
				2078	workgroup_size.y - 1) /
				2079	workgroupSize.y).
				2080
				2081	Only initialized if <16
				2082	previous SGPRs initialized.
				2083	then Grid Work-Group Count Z 1 32 bit count of the number of
				2084	(enable_sgpr_grid work-groups in the Z dimension
				2085	_workgroup_count_Z && for the grid being
				2086	less than 16 previous executed. Computed from the
				2087	SGPRs) fields in the kernel dispatch
				2088	packet as ((grid_size.z +
				2089	workgroup_size.z - 1) /
				2090	workgroupSize.z).
				2091
				2092	Only initialized if <16
				2093	previous SGPRs initialized.
				2094	then Work-Group Id X 1 32 bit work-group id in X
				2095	(enable_sgpr_workgroup_id dimension of grid for
				2096	_X) wavefront.
				2097	then Work-Group Id Y 1 32 bit work-group id in Y
				2098	(enable_sgpr_workgroup_id dimension of grid for
				2099	_Y) wavefront.
				2100	then Work-Group Id Z 1 32 bit work-group id in Z
				2101	(enable_sgpr_workgroup_id dimension of grid for
				2102	_Z) wavefront.
				2103	then Work-Group Info 1 {first_wave, 14’b0000,
				2104	(enable_sgpr_workgroup ordered_append_term[10:0],
				2105	_info) threadgroup_size_in_waves[5:0]}
				2106	then Scratch Wave Offset 1 32 bit byte offset from base
				2107	(enable_sgpr_private of scratch base of queue
				2108	_segment_wave_offset) executing the kernel
				2109	dispatch. Must be used as an
				2110	offset with Private
				2111	segment address when using
				2112	Scratch Segment Buffer. It
				2113	must be used to set up FLAT
				2114	SCRATCH for flat addressing
				2115	(see
				2116	:ref:`amdgpu-amdhsa-flat-scratch`).
				2117	========== ========================== ====== ==============================
				2118
				2119	The order of the VGPR registers is defined, but the compiler can specify which
				2120	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2121	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2122	for enabled registers are dense starting at VGPR0: the first enabled register is
				2123	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2124	VGPR number.
				2125
				2126	VGPR register initial state is defined in
				2127	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2128
				2129	.. table:: VGPR Register Set Up Order
				2130	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2131
				2132	========== ========================== ====== ==============================
				2133	VGPR Order Name Number Description
				2134	(kernel descriptor enable of
				2135	field) VGPRs
				2136	========== ========================== ====== ==============================
				2137	First Work-Item Id X 1 32 bit work item id in X
				2138	(Always initialized) dimension of work-group for
				2139	wavefront lane.
				2140	then Work-Item Id Y 1 32 bit work item id in Y
				2141	(enable_vgpr_workitem_id dimension of work-group for
				2142	> 0) wavefront lane.
				2143	then Work-Item Id Z 1 32 bit work item id in Z
				2144	(enable_vgpr_workitem_id dimension of work-group for
				2145	> 1) wavefront lane.
				2146	========== ========================== ====== ==============================
				2147
				2148	The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
				2149
				2150	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2151	registers.
				2152	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2153	combination including none.
				2154	3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
				2155	cannot included with the flat scratch init value which is per queue.
				2156	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2157	or (X, Y, Z).
				2158
				2159	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2160	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2161
				2162	The global segment can be accessed either using buffer instructions (GFX6 which
				2163	has V# 64 bit address support), flat instructions (GFX7-9), or global
				2164	instructions (GFX9).
				2165
				2166	If buffer operations are used then the compiler can generate a V# with the
				2167	following properties:
				2168
				2169	* base address of 0
				2170	* no swizzle
				2171	* ATC: 1 if IOMMU present (such as APU)
				2172	* ptr64: 1
				2173	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2174	APU and NC for dGPU).
				2175
				2176	.. _amdgpu-amdhsa-kernel-prolog:
				2177
				2178	Kernel Prolog
				2179	~~~~~~~~~~~~~
				2180
				2181	.. _amdgpu-amdhsa-m0:
				2182
				2183	M0
				2184	++
				2185
				2186	GFX6-GFX8
				2187	The M0 register must be initialized with a value at least the total LDS size
				2188	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2189	available in dispatch packet. For M0, it is also possible to use maximum
				2190	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2191	GFX7-GFX8).
				2192	GFX9
				2193	The M0 register is not used for range checking LDS accesses and so does not
				2194	need to be initialized in the prolog.
				2195
				2196	.. _amdgpu-amdhsa-flat-scratch:
				2197
				2198	Flat Scratch
				2199	++++++++++++
				2200
				2201	If the kernel may use flat operations to access scratch memory, the prolog code
				2202	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
				2203	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
				2204	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2205
				2206	GFX6
				2207	Flat scratch is not supported.
				2208
				2209	GFX7-8
				2210	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2211	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2212	being managed by SPI for the queue executing the kernel dispatch. This is
				2213	the same value used in the Scratch Segment Buffer V# base address. The
				2214	prolog must add the value of Scratch Wave Offset to get the wave's byte
				2215	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2216	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2217	by 8 before moving into FLAT_SCRATCH_LO.
				2218	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2219	work-items scratch memory usage. This is directly loaded from the kernel
				2220	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2221	DWORD. Having CP load it once avoids loading it at the beginning of every
				2222	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2223	SIZE.
				2224	GFX9
				2225	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2226	memory being managed by SPI for the queue executing the kernel dispatch. The
				2227	prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
				2228	pair for use as the flat scratch base in flat memory instructions.
				2229
				2230	.. _amdgpu-amdhsa-memory-model:
				2231
				2232	Memory Model
				2233	~~~~~~~~~~~~
				2234
				2235	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2236	(see :ref:`memmodel`). The implementation is WIP.
				2237
				2238	.. TODO
				2239	Update when implementation complete.
				2240
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2241	Support more relaxed OpenCL memory model to be controlled by environment
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2242	component of target triple.
				2243
				2244	The AMDGPU backend supports the memory synchronization scopes specified in
				2245	:ref:`amdgpu-memory-scopes`.
				2246
				2247	The code sequences used to implement the memory model are defined in table
				2248	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2249
				2250	The sequences specify the order of instructions that a single thread must
				2251	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2252	to other memory instructions executed by the same thread. This allows them to be
				2253	moved earlier or later which can allow them to be combined with other instances
				2254	of the same instruction, or hoisted/sunk out of loops to improve
				2255	performance. Only the instructions related to the memory model are given;
				2256	additional ``s_waitcnt`` instructions are required to ensure registers are
				2257	defined before being used. These may be able to be combined with the memory
				2258	model ``s_waitcnt`` instructions as described above.
				2259
				2260	The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
				2261	OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
				2262	relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
				2263	memory model which has separate happens-before relations for the global and
				2264	local address spaces, and only a fence specifying both global and local address
				2265	space joins the relationships. Since the LLVM ``memfence`` instruction does not
				2266	allow an address space to be specified the OpenCL fence has to convervatively
				2267	assume both local and global address space was specified. However, optimizations
				2268	can often be done to eliminate the additional ``s_waitcnt``instructions when
				2269	there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
				2270	instructions. The code sequences in the table indicate what can be omitted for
				2271	the OpenCL memory. The target triple environment is used to determine if the
				2272	source language is OpenCL (see :ref:`amdgpu-opencl`).
				2273
				2274	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2275	operations.
				2276
				2277	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2278	termed vector memory operations.
				2279
				2280	For GFX6-GFX9:
				2281
				2282	* Each agent has multiple compute units (CU).
				2283	* Each CU has multiple SIMDs that execute wavefronts.
				2284	* The wavefronts for a single work-group are executed in the same CU but may be
				2285	executed by different SIMDs.
				2286	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2287	executing on it.
				2288	* All LDS operations of a CU are performed as wavefront wide operations in a
				2289	global order and involve no caching. Completion is reported to a wavefront in
				2290	execution order.
				2291	* The LDS memory has multiple request queues shared by the SIMDs of a
				2292	CU. Therefore, the LDS operations performed by different waves of a work-group
				2293	can be reordered relative to each other, which can result in reordering the
				2294	visibility of vector memory operations with respect to LDS operations of other
				2295	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2296	ensure synchronization between LDS operations and vector memory operations
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2297	between waves of a work-group, but not between operations performed by the
				2298	same wavefront.
				2299	* The vector memory operations are performed as wavefront wide operations and
				2300	completion is reported to a wavefront in execution order. The exception is
				2301	that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
				2302	vector memory order if they access LDS memory, and out of LDS operation order
				2303	if they access global memory.
				2304	* The vector memory operations access a vector L1 cache shared by all wavefronts
				2305	on a CU. Therefore, no special action is required for coherence between
				2306	wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
				2307	coherence between waves executing in different work-groups as they may be
				2308	executing on different CUs.
				2309	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2310	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2311	scalar operations are used in a restricted way so do not impact the memory
				2312	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2313	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2314	the same agent.
				2315	* The L2 cache has independent channels to service disjoint ranges of virtual
				2316	addresses.
				2317	* Each CU has a separate request queue per channel. Therefore, the vector and
				2318	scalar memory operations performed by waves executing in different work-groups
				2319	(which may be executing on different CUs) of an agent can be reordered
				2320	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2321	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2322	previous vector memory operation has completed before executing a subsequent
				2323	vector memory or LDS operation and so can be used to meet the requirements of
				2324	acquire and release.
				2325	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2326	of virtual addresses can be set up to bypass it to ensure system coherence.
				2327
				2328	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
				2329	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2330	memory, atomic memory orderings are not meaningful and all accesses are treated
				2331	as non-atomic.
				2332
				2333	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2334	scalar memory instructions). Since the constant address space contents do not
				2335	change during the execution of a kernel dispatch it is not legal to perform
				2336	stores, and atomic memory orderings are not meaningful and all access are
				2337	treated as non-atomic.
				2338
				2339	A memory synchronization scope wider than work-group is not meaningful for the
				2340	group (LDS) address space and is treated as work-group.
				2341
				2342	The memory model does not support the region address space which is treated as
				2343	non-atomic.
				2344
				2345	Acquire memory ordering is not meaningful on store atomic instructions and is
				2346	treated as non-atomic.
				2347
				2348	Release memory ordering is not meaningful on load atomic instructions and is
				2349	treated a non-atomic.
				2350
				2351	Acquire-release memory ordering is not meaningful on load or store atomic
				2352	instructions and is treated as acquire and release respectively.
				2353
				2354	AMDGPU backend only uses scalar memory operations to access memory that is
				2355	proven to not change during the execution of the kernel dispatch. This includes
				2356	constant address space and global address space for program scope const
				2357	variables. Therefore the kernel machine code does not have to maintain the
				2358	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2359	and vector L1 caches are invalidated between kernel dispatches by CP since
				2360	constant address space data may change between kernel dispatch executions. See
				2361	:ref:`amdgpu-amdhsa-memory-spaces`.
				2362
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2363	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2364	case the AMDGPU backend ensures the memory location used to spill is never
				2365	accessed by vector memory operations at the same time. If scalar writes are used
				2366	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2367	return since the locations may be used for vector memory instructions by a
				2368	future wave that uses the same scratch area, or a function call that creates a
				2369	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2370	as all scalar writes are write-before-read in the same thread.
				2371
				2372	Scratch backing memory (which is used for the private address space) is accessed
				2373	with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
				2374	is only accessed by a single thread, and is always write-before-read,
				2375	there is never a need to invalidate these entries from the L1 cache. Hence all
				2376	cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
				2377	lines.
				2378
				2379	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
				2380	to invalidate the L2 cache. This also causes it to be treated as non-volatile
				2381	and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
				2382	coherent) and so the L2 cache will coherent with the CPU and other agents.
				2383
				2384	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2385	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2386
				2387	============ ============ ============== ========== =======================
				2388	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2389	Ordering Sync Scope Address
				2390	Space
				2391	============ ============ ============== ========== =======================
				2392	Non-Atomic
				2393	---------------------------------------------------------------------------
				2394	load none none - global non-volatile
				2395	- generic 1. buffer/global/flat_load
				2396	volatile
				2397	1. buffer/global/flat_load
				2398	glc=1
				2399	load none none - local 1. ds_load
				2400	store none none - global 1. buffer/global/flat_store
				2401	- generic
				2402	store none none - local 1. ds_store
				2403	Unordered Atomic
				2404	---------------------------------------------------------------------------
				2405	load atomic unordered any any Same as non-atomic.
				2406	store atomic unordered any any Same as non-atomic.
				2407	atomicrmw unordered any any *Same as monotonic
				2408	atomic*.
				2409	Monotonic Atomic
				2410	---------------------------------------------------------------------------
				2411	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2412	- wavefront - generic
				2413	- workgroup
				2414	load atomic monotonic - singlethread - local 1. ds_load
				2415	- wavefront
				2416	- workgroup
				2417	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2418	- system - generic glc=1
				2419	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2420	- wavefront - generic
				2421	- workgroup
				2422	- agent
				2423	- system
				2424	store atomic monotonic - singlethread - local 1. ds_store
				2425	- wavefront
				2426	- workgroup
				2427	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2428	- wavefront - generic
				2429	- workgroup
				2430	- agent
				2431	- system
				2432	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2433	- wavefront
				2434	- workgroup
				2435	Acquire Atomic
				2436	---------------------------------------------------------------------------
				2437	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2438	- wavefront - local
				2439	- generic
				2440	load atomic acquire - workgroup - global 1. buffer/global_load
				2441	load atomic acquire - workgroup - local 1. ds/flat_load
				2442	- generic 2. s_waitcnt lgkmcnt(0)
				2443
				2444	- If OpenCL, omit
				2445	waitcnt.
				2446	- Must happen before
				2447	any following
				2448	global/generic
				2449	load/load
				2450	atomic/store/store
				2451	atomic/atomicrmw.
				2452	- Ensures any
				2453	following global
				2454	data read is no
				2455	older than the load
				2456	atomic value being
				2457	acquired.
				2458
				2459	load atomic acquire - agent - global 1. buffer/global_load
				2460	- system glc=1
				2461	2. s_waitcnt vmcnt(0)
				2462
				2463	- Must happen before
				2464	following
				2465	buffer_wbinvl1_vol.
				2466	- Ensures the load
				2467	has completed
				2468	before invalidating
				2469	the cache.
				2470
				2471	3. buffer_wbinvl1_vol
				2472
				2473	- Must happen before
				2474	any following
				2475	global/generic
				2476	load/load
				2477	atomic/atomicrmw.
				2478	- Ensures that
				2479	following
				2480	loads will not see
				2481	stale global data.
				2482
				2483	load atomic acquire - agent - generic 1. flat_load glc=1
				2484	- system 2. s_waitcnt vmcnt(0) &
				2485	lgkmcnt(0)
				2486
				2487	- If OpenCL omit
				2488	lgkmcnt(0).
				2489	- Must happen before
				2490	following
				2491	buffer_wbinvl1_vol.
				2492	- Ensures the flat_load
				2493	has completed
				2494	before invalidating
				2495	the cache.
				2496
				2497	3. buffer_wbinvl1_vol
				2498
				2499	- Must happen before
				2500	any following
				2501	global/generic
				2502	load/load
				2503	atomic/atomicrmw.
				2504	- Ensures that
				2505	following loads
				2506	will not see stale
				2507	global data.
				2508
				2509	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2510	- wavefront - local
				2511	- generic
				2512	atomicrmw acquire - workgroup - global 1. buffer/global_atomic
				2513	atomicrmw acquire - workgroup - local 1. ds/flat_atomic
				2514	- generic 2. waitcnt lgkmcnt(0)
				2515
				2516	- If OpenCL, omit
				2517	waitcnt.
				2518	- Must happen before
				2519	any following
				2520	global/generic
				2521	load/load
				2522	atomic/store/store
				2523	atomic/atomicrmw.
				2524	- Ensures any
				2525	following global
				2526	data read is no
				2527	older than the
				2528	atomicrmw value
				2529	being acquired.
				2530
				2531	atomicrmw acquire - agent - global 1. buffer/global_atomic
				2532	- system 2. s_waitcnt vmcnt(0)
				2533
				2534	- Must happen before
				2535	following
				2536	buffer_wbinvl1_vol.
				2537	- Ensures the
				2538	atomicrmw has
				2539	completed before
				2540	invalidating the
				2541	cache.
				2542
				2543	3. buffer_wbinvl1_vol
				2544
				2545	- Must happen before
				2546	any following
				2547	global/generic
				2548	load/load
				2549	atomic/atomicrmw.
				2550	- Ensures that
				2551	following loads
				2552	will not see stale
				2553	global data.
				2554
				2555	atomicrmw acquire - agent - generic 1. flat_atomic
				2556	- system 2. s_waitcnt vmcnt(0) &
				2557	lgkmcnt(0)
				2558
				2559	- If OpenCL, omit
				2560	lgkmcnt(0).
				2561	- Must happen before
				2562	following
				2563	buffer_wbinvl1_vol.
				2564	- Ensures the
				2565	atomicrmw has
				2566	completed before
				2567	invalidating the
				2568	cache.
				2569
				2570	3. buffer_wbinvl1_vol
				2571
				2572	- Must happen before
				2573	any following
				2574	global/generic
				2575	load/load
				2576	atomic/atomicrmw.
				2577	- Ensures that
				2578	following loads
				2579	will not see stale
				2580	global data.
				2581
				2582	fence acquire - singlethread none none
				2583	- wavefront
				2584	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2585
				2586	- If OpenCL and
				2587	address space is
				2588	not generic, omit
				2589	waitcnt. However,
				2590	since LLVM
				2591	currently has no
				2592	address space on
				2593	the fence need to
				2594	conservatively
				2595	always generate. If
				2596	fence had an
				2597	address space then
				2598	set to address
				2599	space of OpenCL
				2600	fence flag, or to
				2601	generic if both
				2602	local and global
				2603	flags are
				2604	specified.
				2605	- Must happen after
				2606	any preceding
				2607	local/generic load
				2608	atomic/atomicrmw
				2609	with an equal or
				2610	wider sync scope
				2611	and memory ordering
				2612	stronger than
				2613	unordered (this is
				2614	termed the
				2615	fence-paired-atomic).
				2616	- Must happen before
				2617	any following
				2618	global/generic
				2619	load/load
				2620	atomic/store/store
				2621	atomic/atomicrmw.
				2622	- Ensures any
				2623	following global
				2624	data read is no
				2625	older than the
				2626	value read by the
				2627	fence-paired-atomic.
				2628
				2629	fence acquire - agent none 1. s_waitcnt vmcnt(0) &
				2630	- system lgkmcnt(0)
				2631
				2632	- If OpenCL and
				2633	address space is
				2634	not generic, omit
				2635	lgkmcnt(0).
				2636	However, since LLVM
				2637	currently has no
				2638	address space on
				2639	the fence need to
				2640	conservatively
				2641	always generate
				2642	(see comment for
				2643	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2644	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2645	separate s_waitcnt
				2646	vmcnt(0) and
				2647	s_waitcnt
				2648	lgkmcnt(0) to allow
				2649	them to be
				2650	independently moved
				2651	according to the
				2652	following rules.
				2653	- s_waitcnt vmcnt(0)
				2654	must happen after
				2655	any preceding
				2656	global/generic load
				2657	atomic/atomicrmw
				2658	with an equal or
				2659	wider sync scope
				2660	and memory ordering
				2661	stronger than
				2662	unordered (this is
				2663	termed the
				2664	fence-paired-atomic).
				2665	- s_waitcnt lgkmcnt(0)
				2666	must happen after
				2667	any preceding
				2668	group/generic load
				2669	atomic/atomicrmw
				2670	with an equal or
				2671	wider sync scope
				2672	and memory ordering
				2673	stronger than
				2674	unordered (this is
				2675	termed the
				2676	fence-paired-atomic).
				2677	- Must happen before
				2678	the following
				2679	buffer_wbinvl1_vol.
				2680	- Ensures that the
				2681	fence-paired atomic
				2682	has completed
				2683	before invalidating
				2684	the
				2685	cache. Therefore
				2686	any following
				2687	locations read must
				2688	be no older than
				2689	the value read by
				2690	the
				2691	fence-paired-atomic.
				2692
				2693	2. buffer_wbinvl1_vol
				2694
				2695	- Must happen before
				2696	any following global/generic
				2697	load/load
				2698	atomic/store/store
				2699	atomic/atomicrmw.
				2700	- Ensures that
				2701	following loads
				2702	will not see stale
				2703	global data.
				2704
				2705	Release Atomic
				2706	---------------------------------------------------------------------------
				2707	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2708	- wavefront - local
				2709	- generic
				2710	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
				2711	- generic
				2712	- If OpenCL, omit
				2713	waitcnt.
				2714	- Must happen after
				2715	any preceding
				2716	local/generic
				2717	load/store/load
				2718	atomic/store
				2719	atomic/atomicrmw.
				2720	- Must happen before
				2721	the following
				2722	store.
				2723	- Ensures that all
				2724	memory operations
				2725	to local have
				2726	completed before
				2727	performing the
				2728	store that is being
				2729	released.
				2730
				2731	2. buffer/global/flat_store
				2732	store atomic release - workgroup - local 1. ds_store
				2733	store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
				2734	- system - generic lgkmcnt(0)
				2735
				2736	- If OpenCL, omit
				2737	lgkmcnt(0).
				2738	- Could be split into
				2739	separate s_waitcnt
				2740	vmcnt(0) and
				2741	s_waitcnt
				2742	lgkmcnt(0) to allow
				2743	them to be
				2744	independently moved
				2745	according to the
				2746	following rules.
				2747	- s_waitcnt vmcnt(0)
				2748	must happen after
				2749	any preceding
				2750	global/generic
				2751	load/store/load
				2752	atomic/store
				2753	atomic/atomicrmw.
				2754	- s_waitcnt lgkmcnt(0)
				2755	must happen after
				2756	any preceding
				2757	local/generic
				2758	load/store/load
				2759	atomic/store
				2760	atomic/atomicrmw.
				2761	- Must happen before
				2762	the following
				2763	store.
				2764	- Ensures that all
				2765	memory operations
				2766	to global have
				2767	completed before
				2768	performing the
				2769	store that is being
				2770	released.
				2771
				2772	2. buffer/global/ds/flat_store
				2773	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2774	- wavefront - local
				2775	- generic
				2776	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
				2777	- generic
				2778	- If OpenCL, omit
				2779	waitcnt.
				2780	- Must happen after
				2781	any preceding
				2782	local/generic
				2783	load/store/load
				2784	atomic/store
				2785	atomic/atomicrmw.
				2786	- Must happen before
				2787	the following
				2788	atomicrmw.
				2789	- Ensures that all
				2790	memory operations
				2791	to local have
				2792	completed before
				2793	performing the
				2794	atomicrmw that is
				2795	being released.
				2796
				2797	2. buffer/global/flat_atomic
				2798	atomicrmw release - workgroup - local 1. ds_atomic
				2799	atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
				2800	- system - generic lgkmcnt(0)
				2801
				2802	- If OpenCL, omit
				2803	lgkmcnt(0).
				2804	- Could be split into
				2805	separate s_waitcnt
				2806	vmcnt(0) and
				2807	s_waitcnt
				2808	lgkmcnt(0) to allow
				2809	them to be
				2810	independently moved
				2811	according to the
				2812	following rules.
				2813	- s_waitcnt vmcnt(0)
				2814	must happen after
				2815	any preceding
				2816	global/generic
				2817	load/store/load
				2818	atomic/store
				2819	atomic/atomicrmw.
				2820	- s_waitcnt lgkmcnt(0)
				2821	must happen after
				2822	any preceding
				2823	local/generic
				2824	load/store/load
				2825	atomic/store
				2826	atomic/atomicrmw.
				2827	- Must happen before
				2828	the following
				2829	atomicrmw.
				2830	- Ensures that all
				2831	memory operations
				2832	to global and local
				2833	have completed
				2834	before performing
				2835	the atomicrmw that
				2836	is being released.
				2837
				2838	2. buffer/global/ds/flat_atomic*
				2839	fence release - singlethread none none
				2840	- wavefront
				2841	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				2842
				2843	- If OpenCL and
				2844	address space is
				2845	not generic, omit
				2846	waitcnt. However,
				2847	since LLVM
				2848	currently has no
				2849	address space on
				2850	the fence need to
				2851	conservatively
				2852	always generate
				2853	(see comment for
				2854	previous fence).
				2855	- Must happen after
				2856	any preceding
				2857	local/generic
				2858	load/load
				2859	atomic/store/store
				2860	atomic/atomicrmw.
				2861	- Must happen before
				2862	any following store
				2863	atomic/atomicrmw
				2864	with an equal or
				2865	wider sync scope
				2866	and memory ordering
				2867	stronger than
				2868	unordered (this is
				2869	termed the
				2870	fence-paired-atomic).
				2871	- Ensures that all
				2872	memory operations
				2873	to local have
				2874	completed before
				2875	performing the
				2876	following
				2877	fence-paired-atomic.
				2878
				2879	fence release - agent none 1. s_waitcnt vmcnt(0) &
				2880	- system lgkmcnt(0)
				2881
				2882	- If OpenCL and
				2883	address space is
				2884	not generic, omit
				2885	lgkmcnt(0).
				2886	However, since LLVM
				2887	currently has no
				2888	address space on
				2889	the fence need to
				2890	conservatively
				2891	always generate
				2892	(see comment for
				2893	previous fence).
				2894	- Could be split into
				2895	separate s_waitcnt
				2896	vmcnt(0) and
				2897	s_waitcnt
				2898	lgkmcnt(0) to allow
				2899	them to be
				2900	independently moved
				2901	according to the
				2902	following rules.
				2903	- s_waitcnt vmcnt(0)
				2904	must happen after
				2905	any preceding
				2906	global/generic
				2907	load/store/load
				2908	atomic/store
				2909	atomic/atomicrmw.
				2910	- s_waitcnt lgkmcnt(0)
				2911	must happen after
				2912	any preceding
				2913	local/generic
				2914	load/store/load
				2915	atomic/store
				2916	atomic/atomicrmw.
				2917	- Must happen before
				2918	any following store
				2919	atomic/atomicrmw
				2920	with an equal or
				2921	wider sync scope
				2922	and memory ordering
				2923	stronger than
				2924	unordered (this is
				2925	termed the
				2926	fence-paired-atomic).
				2927	- Ensures that all
				2928	memory operations
				2929	to global have
				2930	completed before
				2931	performing the
				2932	following
				2933	fence-paired-atomic.
				2934
				2935	Acquire-Release Atomic
				2936	---------------------------------------------------------------------------
				2937	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				2938	- wavefront - local
				2939	- generic
				2940	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				2941
				2942	- If OpenCL, omit
				2943	waitcnt.
				2944	- Must happen after
				2945	any preceding
				2946	local/generic
				2947	load/store/load
				2948	atomic/store
				2949	atomic/atomicrmw.
				2950	- Must happen before
				2951	the following
				2952	atomicrmw.
				2953	- Ensures that all
				2954	memory operations
				2955	to local have
				2956	completed before
				2957	performing the
				2958	atomicrmw that is
				2959	being released.
				2960
				2961	2. buffer/global_atomic
				2962	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				2963	2. s_waitcnt lgkmcnt(0)
				2964
				2965	- If OpenCL, omit
				2966	waitcnt.
				2967	- Must happen before
				2968	any following
				2969	global/generic
				2970	load/load
				2971	atomic/store/store
				2972	atomic/atomicrmw.
				2973	- Ensures any
				2974	following global
				2975	data read is no
				2976	older than the load
				2977	atomic value being
				2978	acquired.
				2979
				2980	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2981
				2982	- If OpenCL, omit
				2983	waitcnt.
				2984	- Must happen after
				2985	any preceding
				2986	local/generic
				2987	load/store/load
				2988	atomic/store
				2989	atomic/atomicrmw.
				2990	- Must happen before
				2991	the following
				2992	atomicrmw.
				2993	- Ensures that all
				2994	memory operations
				2995	to local have
				2996	completed before
				2997	performing the
				2998	atomicrmw that is
				2999	being released.
				3000
				3001	2. flat_atomic
				3002	3. s_waitcnt lgkmcnt(0)
				3003
				3004	- If OpenCL, omit
				3005	waitcnt.
				3006	- Must happen before
				3007	any following
				3008	global/generic
				3009	load/load
				3010	atomic/store/store
				3011	atomic/atomicrmw.
				3012	- Ensures any
				3013	following global
				3014	data read is no
				3015	older than the load
				3016	atomic value being
				3017	acquired.
				3018	atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
				3019	- system lgkmcnt(0)
				3020
				3021	- If OpenCL, omit
				3022	lgkmcnt(0).
				3023	- Could be split into
				3024	separate s_waitcnt
				3025	vmcnt(0) and
				3026	s_waitcnt
				3027	lgkmcnt(0) to allow
				3028	them to be
				3029	independently moved
				3030	according to the
				3031	following rules.
				3032	- s_waitcnt vmcnt(0)
				3033	must happen after
				3034	any preceding
				3035	global/generic
				3036	load/store/load
				3037	atomic/store
				3038	atomic/atomicrmw.
				3039	- s_waitcnt lgkmcnt(0)
				3040	must happen after
				3041	any preceding
				3042	local/generic
				3043	load/store/load
				3044	atomic/store
				3045	atomic/atomicrmw.
				3046	- Must happen before
				3047	the following
				3048	atomicrmw.
				3049	- Ensures that all
				3050	memory operations
				3051	to global have
				3052	completed before
				3053	performing the
				3054	atomicrmw that is
				3055	being released.
				3056
				3057	2. buffer/global_atomic
				3058	3. s_waitcnt vmcnt(0)
				3059
				3060	- Must happen before
				3061	following
				3062	buffer_wbinvl1_vol.
				3063	- Ensures the
				3064	atomicrmw has
				3065	completed before
				3066	invalidating the
				3067	cache.
				3068
				3069	4. buffer_wbinvl1_vol
				3070
				3071	- Must happen before
				3072	any following
				3073	global/generic
				3074	load/load
				3075	atomic/atomicrmw.
				3076	- Ensures that
				3077	following loads
				3078	will not see stale
				3079	global data.
				3080
				3081	atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
				3082	- system lgkmcnt(0)
				3083
				3084	- If OpenCL, omit
				3085	lgkmcnt(0).
				3086	- Could be split into
				3087	separate s_waitcnt
				3088	vmcnt(0) and
				3089	s_waitcnt
				3090	lgkmcnt(0) to allow
				3091	them to be
				3092	independently moved
				3093	according to the
				3094	following rules.
				3095	- s_waitcnt vmcnt(0)
				3096	must happen after
				3097	any preceding
				3098	global/generic
				3099	load/store/load
				3100	atomic/store
				3101	atomic/atomicrmw.
				3102	- s_waitcnt lgkmcnt(0)
				3103	must happen after
				3104	any preceding
				3105	local/generic
				3106	load/store/load
				3107	atomic/store
				3108	atomic/atomicrmw.
				3109	- Must happen before
				3110	the following
				3111	atomicrmw.
				3112	- Ensures that all
				3113	memory operations
				3114	to global have
				3115	completed before
				3116	performing the
				3117	atomicrmw that is
				3118	being released.
				3119
				3120	2. flat_atomic
				3121	3. s_waitcnt vmcnt(0) &
				3122	lgkmcnt(0)
				3123
				3124	- If OpenCL, omit
				3125	lgkmcnt(0).
				3126	- Must happen before
				3127	following
				3128	buffer_wbinvl1_vol.
				3129	- Ensures the
				3130	atomicrmw has
				3131	completed before
				3132	invalidating the
				3133	cache.
				3134
				3135	4. buffer_wbinvl1_vol
				3136
				3137	- Must happen before
				3138	any following
				3139	global/generic
				3140	load/load
				3141	atomic/atomicrmw.
				3142	- Ensures that
				3143	following loads
				3144	will not see stale
				3145	global data.
				3146
				3147	fence acq_rel - singlethread none none
				3148	- wavefront
				3149	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3150
				3151	- If OpenCL and
				3152	address space is
				3153	not generic, omit
				3154	waitcnt. However,
				3155	since LLVM
				3156	currently has no
				3157	address space on
				3158	the fence need to
				3159	conservatively
				3160	always generate
				3161	(see comment for
				3162	previous fence).
				3163	- Must happen after
				3164	any preceding
				3165	local/generic
				3166	load/load
				3167	atomic/store/store
				3168	atomic/atomicrmw.
				3169	- Must happen before
				3170	any following
				3171	global/generic
				3172	load/load
				3173	atomic/store/store
				3174	atomic/atomicrmw.
				3175	- Ensures that all
				3176	memory operations
				3177	to local have
				3178	completed before
				3179	performing any
				3180	following global
				3181	memory operations.
				3182	- Ensures that the
				3183	preceding
				3184	local/generic load
				3185	atomic/atomicrmw
				3186	with an equal or
				3187	wider sync scope
				3188	and memory ordering
				3189	stronger than
				3190	unordered (this is
				3191	termed the
				3192	fence-paired-atomic)
				3193	has completed
				3194	before following
				3195	global memory
				3196	operations. This
				3197	satisfies the
				3198	requirements of
				3199	acquire.
				3200	- Ensures that all
				3201	previous memory
				3202	operations have
				3203	completed before a
				3204	following
				3205	local/generic store
				3206	atomic/atomicrmw
				3207	with an equal or
				3208	wider sync scope
				3209	and memory ordering
				3210	stronger than
				3211	unordered (this is
				3212	termed the
				3213	fence-paired-atomic).
				3214	This satisfies the
				3215	requirements of
				3216	release.
				3217
				3218	fence acq_rel - agent none 1. s_waitcnt vmcnt(0) &
				3219	- system lgkmcnt(0)
				3220
				3221	- If OpenCL and
				3222	address space is
				3223	not generic, omit
				3224	lgkmcnt(0).
				3225	However, since LLVM
				3226	currently has no
				3227	address space on
				3228	the fence need to
				3229	conservatively
				3230	always generate
				3231	(see comment for
				3232	previous fence).
				3233	- Could be split into
				3234	separate s_waitcnt
				3235	vmcnt(0) and
				3236	s_waitcnt
				3237	lgkmcnt(0) to allow
				3238	them to be
				3239	independently moved
				3240	according to the
				3241	following rules.
				3242	- s_waitcnt vmcnt(0)
				3243	must happen after
				3244	any preceding
				3245	global/generic
				3246	load/store/load
				3247	atomic/store
				3248	atomic/atomicrmw.
				3249	- s_waitcnt lgkmcnt(0)
				3250	must happen after
				3251	any preceding
				3252	local/generic
				3253	load/store/load
				3254	atomic/store
				3255	atomic/atomicrmw.
				3256	- Must happen before
				3257	the following
				3258	buffer_wbinvl1_vol.
				3259	- Ensures that the
				3260	preceding
				3261	global/local/generic
				3262	load
				3263	atomic/atomicrmw
				3264	with an equal or
				3265	wider sync scope
				3266	and memory ordering
				3267	stronger than
				3268	unordered (this is
				3269	termed the
				3270	fence-paired-atomic)
				3271	has completed
				3272	before invalidating
				3273	the cache. This
				3274	satisfies the
				3275	requirements of
				3276	acquire.
				3277	- Ensures that all
				3278	previous memory
				3279	operations have
				3280	completed before a
				3281	following
				3282	global/local/generic
				3283	store
				3284	atomic/atomicrmw
				3285	with an equal or
				3286	wider sync scope
				3287	and memory ordering
				3288	stronger than
				3289	unordered (this is
				3290	termed the
				3291	fence-paired-atomic).
				3292	This satisfies the
				3293	requirements of
				3294	release.
				3295
				3296	2. buffer_wbinvl1_vol
				3297
				3298	- Must happen before
				3299	any following
				3300	global/generic
				3301	load/load
				3302	atomic/store/store
				3303	atomic/atomicrmw.
				3304	- Ensures that
				3305	following loads
				3306	will not see stale
				3307	global data. This
				3308	satisfies the
				3309	requirements of
				3310	acquire.
				3311
				3312	Sequential Consistent Atomic
				3313	---------------------------------------------------------------------------
				3314	load atomic seq_cst - singlethread - global *Same as corresponding
				3315	- wavefront - local load atomic acquire*.
				3316	- workgroup - generic
				3317	load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
				3318	- system - local
				3319	- generic - Must happen after
				3320	preceding
				3321	global/generic load
				3322	atomic/store
				3323	atomic/atomicrmw
				3324	with memory
				3325	ordering of seq_cst
				3326	and with equal or
				3327	wider sync scope.
				3328	(Note that seq_cst
				3329	fences have their
				3330	own s_waitcnt
				3331	vmcnt(0) and so do
				3332	not need to be
				3333	considered.)
				3334	- Ensures any
				3335	preceding
				3336	sequential
				3337	consistent global
				3338	memory instructions
				3339	have completed
				3340	before executing
				3341	this sequentially
				3342	consistent
				3343	instruction. This
				3344	prevents reordering
				3345	a seq_cst store
				3346	followed by a
				3347	seq_cst load (Note
				3348	that seq_cst is
				3349	stronger than
				3350	acquire/release as
				3351	the reordering of
				3352	load acquire
				3353	followed by a store
				3354	release is
				3355	prevented by the
				3356	waitcnt vmcnt(0) of
				3357	the release, but
				3358	there is nothing
				3359	preventing a store
				3360	release followed by
				3361	load acquire from
				3362	competing out of
				3363	order.)
				3364
				3365	2. *Following
				3366	instructions same as
				3367	corresponding load
				3368	atomic acquire*.
				3369
				3370	store atomic seq_cst - singlethread - global *Same as corresponding
				3371	- wavefront - local store atomic release*.
				3372	- workgroup - generic
				3373	store atomic seq_cst - agent - global *Same as corresponding
				3374	- system - generic store atomic release*.
				3375	atomicrmw seq_cst - singlethread - global *Same as corresponding
				3376	- wavefront - local atomicrmw acq_rel*.
				3377	- workgroup - generic
				3378	atomicrmw seq_cst - agent - global *Same as corresponding
				3379	- system - generic atomicrmw acq_rel*.
				3380	fence seq_cst - singlethread none *Same as corresponding
				3381	- wavefront fence acq_rel*.
				3382	- workgroup
				3383	- agent
				3384	- system
				3385	============ ============ ============== ========== =======================
				3386
				3387	The memory order also adds the single thread optimization constrains defined in
				3388	table
				3389	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3390
				3391	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3392	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3393
				3394	============ ==============================================================
				3395	LLVM Memory Optimization Constraints
				3396	Ordering
				3397	============ ==============================================================
				3398	unordered none
				3399	monotonic none
				3400	acquire - If a load atomic/atomicrmw then no following load/load
				3401	atomic/store/ store atomic/atomicrmw/fence instruction can
				3402	be moved before the acquire.
				3403	- If a fence then same as load atomic, plus no preceding
				3404	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3405	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3406	atomic/store/ store atomic/atomicrmw/fence instruction can
				3407	be moved after the release.
				3408	- If a fence then same as store atomic, plus no following
				3409	associated fence-paired-atomic can be moved before the
				3410	fence.
				3411	acq_rel Same constraints as both acquire and release.
				3412	seq_cst - If a load atomic then same constraints as acquire, plus no
				3413	preceding sequentially consistent load atomic/store
				3414	atomic/atomicrmw/fence instruction can be moved after the
				3415	seq_cst.
				3416	- If a store atomic then the same constraints as release, plus
				3417	no following sequentially consistent load atomic/store
				3418	atomic/atomicrmw/fence instruction can be moved before the
				3419	seq_cst.
				3420	- If an atomicrmw/fence then same constraints as acq_rel.
				3421	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3422
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3423	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3424	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3425
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3426	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3427	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3428	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3429
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3430	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3431	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3432
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3433	=================== =============== =============== =======================
				3434	Usage Code Sequence Trap Handler Description
				3435	Inputs
				3436	=================== =============== =============== =======================
				3437	reserved ``s_trap 0x00`` Reserved by hardware.
				3438	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3439	``queue_ptr`` ``debugtrap``
				3440	``VGPR0``: intrinsic (not
				3441	``arg`` implemented).
				3442	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3443	``queue_ptr`` terminated and its
				3444	associated queue put
				3445	into the error state.
				3446	``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
				3447	``queue_ptr`` installed handled
				3448	same as ``llvm.trap``.
				3449	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
				3450	breakpoints.
				3451	debugger ``s_trap 0x08`` Reserved for debugger.
				3452	debugger ``s_trap 0xfe`` Reserved for debugger.
				3453	debugger ``s_trap 0xff`` Reserved for debugger.
				3454	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3455
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3456	Unspecified OS
				3457	--------------
				3458
				3459	This section provides code conventions used when the target triple OS is
				3460	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3461
				3462	Trap Handler ABI
				3463	~~~~~~~~~~~~~~~~
				3464
				3465	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3466	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3467	instructions are handled as follows:
				3468
				3469	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3470	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3471
				3472	=============== =============== ===========================================
				3473	Usage Code Sequence Description
				3474	=============== =============== ===========================================
				3475	llvm.trap s_endpgm Causes wavefront to be terminated.
				3476	llvm.debugtrap none Compiler warning given that there is no
				3477	trap handler installed.
				3478	=============== =============== ===========================================
				3479
				3480	Source Languages
				3481	================
				3482
				3483	.. _amdgpu-opencl:
				3484
				3485	OpenCL
				3486	------
				3487
				3488	When generating code for the OpenCL language the target triple environment
				3489	should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
				3490
				3491	When the language is OpenCL the following differences occur:
				3492
				3493	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3494	2. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3495	3. Additional metadata is generated
				3496	(:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3497
				3498	.. TODO
				3499	Specify what affect this has. Hidden arguments added. Additional metadata
				3500	generated.
				3501
				3502	.. _amdgpu-hcc:
				3503
				3504	HCC
				3505	---
				3506
				3507	When generating code for the OpenCL language the target triple environment
				3508	should be ``hcc`` (see :ref:`amdgpu-target-triples`).
				3509
				3510	When the language is OpenCL the following differences occur:
				3511
				3512	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3513
				3514	.. TODO
				3515	Specify what affect this has.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	3516
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3517	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3518	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3519
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3520	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3521	It supports AMDGCN GFX6-GFX8.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3522
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3523	This section describes general syntax for instructions and operands. For more
				3524	information about instructions, their semantics and supported combinations of
				3525	operands, refer to one of instruction set architecture manuals
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3526	[AMD-Souther-Islands]_, [AMD-Sea-Islands]_, [AMD-Volcanic-Islands]_ and
				3527	[AMD-Vega]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3528
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3529	An instruction has the following syntax (register operands are normally
				3530	comma-separated while extra operands are space-separated):
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3531
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3532	<opcode> <register_operand0>, ... <extra_operand0> ...
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3533
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3534	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3535	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3536
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3537	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3538
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3539	* SGPR registers: s0, ... or s[0], ...
				3540	* VGPR registers: v0, ... or v[0], ...
				3541	* TTMP registers: ttmp0, ... or ttmp[0], ...
				3542	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				3543	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				3544	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				3545	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				3546	* Register index expressions: v[2*2], s[1-1:2-1]
				3547	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3548
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3549	The following extra operands are supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3550
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3551	* offset, offset0, offset1
				3552	* idxen, offen bits
				3553	* glc, slc, tfe bits
				3554	* waitcnt: integer or combination of counter values
				3555	* VOP3 modifiers:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3556
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3557	- abs (\\| \\|), neg (\-)
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3558
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3559	* DPP modifiers:
				3560
				3561	- row_shl, row_shr, row_ror, row_rol
				3562	- row_mirror, row_half_mirror, row_bcast
				3563	- wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
				3564	- row_mask, bank_mask, bound_ctrl
				3565
				3566	* SDWA modifiers:
				3567
				3568	- dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
				3569	- dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
				3570	- abs, neg, sext
				3571
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3572	Instruction Examples
				3573	~~~~~~~~~~~~~~~~~~~~
				3574
				3575	DS
				3576	~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3577
				3578	.. code-block:: nasm
				3579
				3580	ds_add_u32 v2, v4 offset:16
				3581	ds_write_src2_b64 v2 offset0:4 offset1:8
				3582	ds_cmpst_f32 v2, v4, v6
				3583	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				3584
				3585
				3586	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				3587
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3588	FLAT
				3589	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3590
				3591	.. code-block:: nasm
				3592
				3593	flat_load_dword v1, v[3:4]
				3594	flat_store_dwordx3 v[3:4], v[5:7]
				3595	flat_atomic_swap v1, v[3:4], v5 glc
				3596	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				3597	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				3598
				3599	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				3600
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3601	MUBUF
				3602	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3603
				3604	.. code-block:: nasm
				3605
				3606	buffer_load_dword v1, off, s[4:7], s1
				3607	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				3608	buffer_store_format_xy v[1:2], off, s[4:7], s1
				3609	buffer_wbinvl1
				3610	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				3611
				3612	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				3613
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3614	SMRD/SMEM
				3615	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3616
				3617	.. code-block:: nasm
				3618
				3619	s_load_dword s1, s[2:3], 0xfc
				3620	s_load_dwordx8 s[8:15], s[2:3], s4
				3621	s_load_dwordx16 s[88:103], s[2:3], s4
				3622	s_dcache_inv_vol
				3623	s_memtime s[4:5]
				3624
				3625	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				3626
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3627	SOP1
				3628	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3629
				3630	.. code-block:: nasm
				3631
				3632	s_mov_b32 s1, s2
				3633	s_mov_b64 s[0:1], 0x80000000
				3634	s_cmov_b32 s1, 200
				3635	s_wqm_b64 s[2:3], s[4:5]
				3636	s_bcnt0_i32_b64 s1, s[2:3]
				3637	s_swappc_b64 s[2:3], s[4:5]
				3638	s_cbranch_join s[4:5]
				3639
				3640	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				3641
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3642	SOP2
				3643	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3644
				3645	.. code-block:: nasm
				3646
				3647	s_add_u32 s1, s2, s3
				3648	s_and_b64 s[2:3], s[4:5], s[6:7]
				3649	s_cselect_b32 s1, s2, s3
				3650	s_andn2_b32 s2, s4, s6
				3651	s_lshr_b64 s[2:3], s[4:5], s6
				3652	s_ashr_i32 s2, s4, s6
				3653	s_bfm_b64 s[2:3], s4, s6
				3654	s_bfe_i64 s[2:3], s[4:5], s6
				3655	s_cbranch_g_fork s[4:5], s[6:7]
				3656
				3657	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				3658
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3659	SOPC
				3660	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3661
				3662	.. code-block:: nasm
				3663
				3664	s_cmp_eq_i32 s1, s2
				3665	s_bitcmp1_b32 s1, s2
				3666	s_bitcmp0_b64 s[2:3], s4
				3667	s_setvskip s3, s5
				3668
				3669	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				3670
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3671	SOPP
				3672	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3673
				3674	.. code-block:: nasm
				3675
				3676	s_barrier
				3677	s_nop 2
				3678	s_endpgm
				3679	s_waitcnt 0 ; Wait for all counters to be 0
				3680	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				3681	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				3682	s_sethalt 9
				3683	s_sleep 10
				3684	s_sendmsg 0x1
				3685	s_sendmsg sendmsg(MSG_INTERRUPT)
				3686	s_trap 1
				3687
				3688	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				3689
				3690	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	3691	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3692	range or acceptable values.
				3693
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3694	VALU
				3695	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3696
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3697	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				3698	the assembler will automatically use optimal encoding based on its operands.
				3699	To force specific encoding, one can add a suffix to the opcode of the instruction:
				3700
				3701	* _e32 for 32-bit VOP1/VOP2/VOPC
				3702	* _e64 for 64-bit VOP3
				3703	* _dpp for VOP_DPP
				3704	* _sdwa for VOP_SDWA
				3705
				3706	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3707
				3708	.. code-block:: nasm
				3709
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3710	v_mov_b32 v1, v2
				3711	v_mov_b32_e32 v1, v2
				3712	v_nop
				3713	v_cvt_f64_i32_e32 v[1:2], v2
				3714	v_floor_f32_e32 v1, v2
				3715	v_bfrev_b32_e32 v1, v2
				3716	v_add_f32_e32 v1, v2, v3
				3717	v_mul_i32_i24_e64 v1, v2, 3
				3718	v_mul_i32_i24_e32 v1, -3, v3
				3719	v_mul_i32_i24_e32 v1, -100, v3
				3720	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				3721	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3722
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3723	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3724
				3725	.. code-block:: nasm
				3726
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3727	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				3728	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3729	v_mov_b32 v0, v0 wave_shl:1
				3730	v_mov_b32 v0, v0 row_mirror
				3731	v_mov_b32 v0, v0 row_bcast:31
				3732	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3733	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3734	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3735
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3736	VOP_SDWA examples:
				3737
				3738	.. code-block:: nasm
				3739
				3740	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				3741	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				3742	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				3743	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				3744	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				3745
				3746	For full list of supported instructions, refer to "Vector ALU instructions".
				3747
				3748	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3749	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3750
				3751	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				3752	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3753
				3754	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3755	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3756
				3757	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3758	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3759
				3760	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3761	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				3762
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3763
				3764	major, minor, and stepping are all integers that describe the instruction
				3765	set architecture (ISA) version of the assembly program.
				3766
				3767	vendor and arch are quoted strings. vendor should always be equal to
				3768	"AMD" and arch should always be equal to "AMDGPU".
				3769
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3770	By default, the assembler will derive the ISA version, vendor, and arch
				3771	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3772
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3773	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3774	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3775
				3776	This directives specifies that the symbol with given name is a kernel entry point
				3777	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3778
				3779	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3780	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3781
				3782	This directive marks the beginning of a list of key / value pairs that are used
				3783	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				3784	The list must be terminated by the .end_amd_kernel_code_t directive. For
				3785	any amd_kernel_code_t values that are unspecified a default value will be
				3786	used. The default value for all keys is 0, with the following exceptions:
				3787
				3788	- kernel_code_version_major defaults to 1.
				3789	- machine_kind defaults to 1.
				3790	- machine_version_major, machine_version_minor, and
				3791	machine_version_stepping are derived from the value of the -mcpu option
				3792	that is passed to the assembler.
				3793	- kernel_code_entry_byte_offset defaults to 256.
				3794	- wavefront_size defaults to 6.
				3795	- kernarg_segment_alignment, group_segment_alignment, and
				3796	private_segment_alignment default to 4. Note that alignments are specified
				3797	as a power of two, so a value of n means an alignment of 2^ n.
				3798
				3799	The .amd_kernel_code_t directive must be placed immediately after the
				3800	function label and before any instructions.
				3801
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3802	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				3803	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3804
				3805	Here is an example of a minimal amd_kernel_code_t specification:
				3806
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	3807	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3808
				3809	.hsa_code_object_version 1,0
				3810	.hsa_code_object_isa
				3811
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	3812	.hsatext
				3813	.globl hello_world
				3814	.p2align 8
				3815	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3816
				3817	hello_world:
				3818
				3819	.amd_kernel_code_t
				3820	enable_sgpr_kernarg_segment_ptr = 1
				3821	is_ptr64 = 1
				3822	compute_pgm_rsrc1_vgprs = 0
				3823	compute_pgm_rsrc1_sgprs = 0
				3824	compute_pgm_rsrc2_user_sgpr = 2
				3825	kernarg_segment_byte_size = 8
				3826	wavefront_sgpr_count = 2
				3827	workitem_vgpr_count = 3
				3828	.end_amd_kernel_code_t
				3829
				3830	s_load_dwordx2 s[0:1], s[0:1] 0x0
				3831	v_mov_b32 v0, 3.14159
				3832	s_waitcnt lgkmcnt(0)
				3833	v_mov_b32 v1, s0
				3834	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	3835	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3836	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	3837	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	3838	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3839
				3840	Additional Documentation
				3841	========================
				3842
				3843	.. [AMD-R6xx] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				3844	.. [AMD-R7xx] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				3845	.. [AMD-Evergreen] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				3846	.. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				3847	.. [AMD-Souther-Islands] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				3848	.. [AMD-Sea-Islands] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				3849	.. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3850	.. [AMD-Vega] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3851	.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
				3852	.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
				3853	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				3854	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				3855	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				3856	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				3857	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
				3858	.. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
				3859	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				3860	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
				3861	.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__