Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: ac8c0bd884fb8122f352a1655a9185fb319c54ed [file] [log] [blame]

Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1	=============================
				2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
				26	.. table:: AMDGPU Target Triples
				27	:name: amdgpu-target-triples-table
				28
				29	============ ======== ========= ===========
				30	Architecture Vendor OS Environment
				31	============ ======== ========= ===========
				32	r600 amd <empty> <empty>
				33	amdgcn amd <empty> <empty>
				34	amdgcn amd amdhsa <empty>
				35	amdgcn amd amdhsa opencl
				36	amdgcn amd amdhsa amdgizcl
				37	amdgcn amd amdhsa amdgiz
				38	amdgcn amd amdhsa hcc
				39	============ ======== ========= ===========
				40
				41	``r600-amd--``
				42	Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
				43	the MESA runtime.
				44
				45	``amdgcn-amd--``
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	46	Supports AMD GPUs GCN GFX6 onwards for graphics and compute shaders executed on
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	47	the MESA runtime.
				48
				49	``amdgcn-amd-amdhsa-``
				50	Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
				51	compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
				52
				53	``amdgcn-amd-amdhsa-opencl``
				54	Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
				55	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
				56	:ref:`amdgpu-opencl`.
				57
				58	``amdgcn-amd-amdhsa-amdgizcl``
				59	Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
				60	is used (see :ref:`amdgpu-address-spaces`).
				61
				62	``amdgcn-amd-amdhsa-amdgiz``
				63	Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
				64	used (see :ref:`amdgpu-address-spaces`).
				65
				66	``amdgcn-amd-amdhsa-hcc``
				67	Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
				68	executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
				69	:ref:`amdgpu-hcc`.
				70
				71	.. _amdgpu-processors:
				72
				73	Processors
				74	----------
				75
				76	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				77	names from both the Processor and Alternative Processor can be used.
				78
				79	.. table:: AMDGPU Processors
				80	:name: amdgpu-processors-table
				81
				82	========== =========== ============ ===== ======= ==================
				83	Processor Alternative Target dGPU/ Runtime Example
				84	Processor Triple APU Support Products
				85	Architecture
				86	========== =========== ============ ===== ======= ==================
				87	R600 [AMD-R6xx]_
				88	--------------------------------------------------------------------
				89	r600 r600 dGPU
				90	r630 r600 dGPU
				91	rs880 r600 dGPU
				92	rv670 r600 dGPU
				93	R700 [AMD-R7xx]_
				94	--------------------------------------------------------------------
				95	rv710 r600 dGPU
				96	rv730 r600 dGPU
				97	rv770 r600 dGPU
				98	Evergreen [AMD-Evergreen]_
				99	--------------------------------------------------------------------
				100	cedar r600 dGPU
				101	redwood r600 dGPU
				102	sumo r600 dGPU
				103	juniper r600 dGPU
				104	cypress r600 dGPU
				105	Northern Islands [AMD-Cayman-Trinity]_
				106	--------------------------------------------------------------------
				107	barts r600 dGPU
				108	turks r600 dGPU
				109	caicos r600 dGPU
				110	cayman r600 dGPU
				111	GCN GFX6 (Southern Islands (SI)) [AMD-Souther-Islands]_
				112	--------------------------------------------------------------------
Konstantin Zhuravlyov	6cbcb27	2017-08-08 04:28:31 +0000	[diff] [blame]	113	gfx600 - tahiti amdgcn dGPU
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	114	gfx601 - pitcairn amdgcn dGPU
				115	- verde
				116	- oland
				117	- hainan
				118	GCN GFX7 (Sea Islands (CI)) [AMD-Sea-Islands]_
				119	--------------------------------------------------------------------
				120	gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
				121	- Radeon HD 8770
				122	- R7 260
				123	- R7 260X
				124	\ - kaveri amdgcn APU - A6-7000
				125	- A6 Pro-7050B
				126	- A8-7100
				127	- A8 Pro-7150B
				128	- A10-7300
				129	- A10 Pro-7350B
				130	- FX-7500
				131	- A8-7200P
				132	- A10-7400P
				133	- FX-7600P
				134	gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
				135	- FirePro W9100
				136	- FirePro S9150
				137	- FirePro S9170
Tony Tye	1fd77b2	2017-06-08 01:47:25 +0000	[diff] [blame]	138	gfx702 dGPU ROCm - Radeon R9 290
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	139	- Radeon R9 290x
				140	- Radeon R390
Tony Tye	1fd77b2	2017-06-08 01:47:25 +0000	[diff] [blame]	141	- Radeon R390x
				142	gfx703 - kabini amdgcn APU - E1-2100
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	143	- mullins - E1-2200
				144	- E1-2500
				145	- E2-3000
				146	- E2-3800
				147	- A4-5000
				148	- A4-5100
				149	- A6-5200
				150	- A4 Pro-3340B
				151	GCN GFX8 (Volcanic Islands (VI)) [AMD-Volcanic-Islands]_
				152	--------------------------------------------------------------------
				153	gfx800 - iceland amdgcn dGPU - FirePro S7150
				154	- FirePro S7100
				155	- FirePro W7100
				156	- Radeon R285
				157	- Radeon R9 380
				158	- Radeon R9 385
				159	- Mobile FirePro
				160	M7170
				161	gfx801 - carrizo amdgcn APU - A6-8500P
				162	- Pro A6-8500B
				163	- A8-8600P
				164	- Pro A8-8600B
				165	- FX-8800P
				166	- Pro A12-8800B
				167	\ amdgcn APU ROCm - A10-8700P
				168	- Pro A10-8700B
				169	- A10-8780P
				170	\ amdgcn APU - A10-9600P
				171	- A10-9630P
				172	- A12-9700P
				173	- A12-9730P
				174	- FX-9800P
				175	- FX-9830P
				176	\ amdgcn APU - E2-9010
				177	- A6-9210
				178	- A9-9410
				179	gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
				180	gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
				181	- Radeon R9 Fury
				182	- Radeon R9 FuryX
				183	- Radeon Pro Duo
				184	- FirePro S9300x2
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	185	- Radeon Instinct MI8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	186	\ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
				187	- Radeon RX 480
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	188	- Radeon Instinct MI6
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	189	\ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
				190	gfx804 amdgcn dGPU Same as gfx803
				191	gfx810 - stoney amdgcn APU
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	192	GCN GFX9 [AMD-Vega]_
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	193	--------------------------------------------------------------------
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	194	gfx900 amdgcn dGPU - Radeon Vega
				195	Frontier Edition
				196	- Radeon RX Vega 56
				197	- Radeon RX Vega 64
				198	- Radeon RX Vega 64
				199	Liquid
				200	- Radeon Instinct MI25
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	201	gfx901 amdgcn dGPU ROCm Same as gfx900
				202	except XNACK is
				203	enabled
				204	gfx902 amdgcn APU TBA
				205
				206	.. TODO
				207	Add product
				208	names.
				209	gfx903 amdgcn APU Same as gfx902
				210	except XNACK is
				211	enabled
				212	========== =========== ============ ===== ======= ==================
				213
				214	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	215
				216	Address Spaces
				217	--------------
				218
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	219	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	220
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	221	The memory space names used in the table, aside from the region memory space, is
				222	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	223
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	224	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	225
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	226	.. table:: Address Space Mapping
				227	:name: amdgpu-address-space-mapping-table
				228
				229	================== ================= ================= ================= =================
				230	LLVM Address Space Memory Space
				231	------------------ -----------------------------------------------------------------------
				232	\ Current Default amdgiz/amdgizcl hcc Future Default
				233	================== ================= ================= ================= =================
				234	0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
				235	1 Global Global Global Global
				236	2 Constant Constant Constant Region (GDS)
				237	3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
				238	4 Generic (Flat) Region (GDS) Region (GDS) Constant
				239	5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
				240	================== ================= ================= ================= =================
				241
				242	Current Default
				243	This is the current default address space mapping used for all languages
				244	except hcc. This will shortly be deprecated.
				245
				246	amdgiz/amdgizcl
				247	This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
				248	is specified as the target triple environment value.
				249
				250	hcc
				251	This is the current address space mapping used when ``hcc`` is specified as
				252	the target triple environment value.This will shortly be deprecated.
				253
				254	Future Default
				255	This will shortly be the only address space mapping for all languages using
				256	AMDGPU backend.
				257
				258	.. _amdgpu-memory-scopes:
				259
				260	Memory Scopes
				261	-------------
				262
				263	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				264	backend memory model when the target triple OS is ``amdhsa`` (see
				265	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				266
				267	The memory model supported is based on the HSA memory model [HSA]_ which is
				268	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				269	relation is transitive over the synchonizes-with relation independent of scope,
				270	and synchonizes-with allows the memory scope instances to be inclusive (see
				271	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
				272
				273	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				274	inclusion and requires the memory scopes to exactly match. However, this
				275	is conservatively correct for OpenCL.
				276
				277	.. table:: AMDHSA LLVM Sync Scopes for AMDHSA
				278	:name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
				279
				280	================ ==========================================================
				281	LLVM Sync Scope Description
				282	================ ==========================================================
				283	none The default: ``system``.
				284
				285	Synchronizes with, and participates in modification and
				286	seq_cst total orderings with, other operations (except
				287	image operations) for all address spaces (except private,
				288	or generic that accesses private) provided the other
				289	operation's sync scope is:
				290
				291	- ``system``.
				292	- ``agent`` and executed by a thread on the same agent.
				293	- ``workgroup`` and executed by a thread in the same
				294	workgroup.
				295	- ``wavefront`` and executed by a thread in the same
				296	wavefront.
				297
				298	``agent`` Synchronizes with, and participates in modification and
				299	seq_cst total orderings with, other operations (except
				300	image operations) for all address spaces (except private,
				301	or generic that accesses private) provided the other
				302	operation's sync scope is:
				303
				304	- ``system`` or ``agent`` and executed by a thread on the
				305	same agent.
				306	- ``workgroup`` and executed by a thread in the same
				307	workgroup.
				308	- ``wavefront`` and executed by a thread in the same
				309	wavefront.
				310
				311	``workgroup`` Synchronizes with, and participates in modification and
				312	seq_cst total orderings with, other operations (except
				313	image operations) for all address spaces (except private,
				314	or generic that accesses private) provided the other
				315	operation's sync scope is:
				316
				317	- ``system``, ``agent`` or ``workgroup`` and executed by a
				318	thread in the same workgroup.
				319	- ``wavefront`` and executed by a thread in the same
				320	wavefront.
				321
				322	``wavefront`` Synchronizes with, and participates in modification and
				323	seq_cst total orderings with, other operations (except
				324	image operations) for all address spaces (except private,
				325	or generic that accesses private) provided the other
				326	operation's sync scope is:
				327
				328	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				329	and executed by a thread in the same wavefront.
				330
				331	``singlethread`` Only synchronizes with, and participates in modification
				332	and seq_cst total orderings with, other operations (except
				333	image operations) running in the same thread for all
				334	address spaces (for example, in signal handlers).
				335	================ ==========================================================
				336
				337	AMDGPU Intrinsics
				338	-----------------
				339
				340	The AMDGPU backend implements the following intrinsics.
				341
				342	This section is WIP.
				343
				344	.. TODO
				345	List AMDGPU intrinsics
				346
				347	Code Object
				348	===========
				349
				350	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				351	can be linked by ``lld`` to produce a standard ELF shared code object which can
				352	be loaded and executed on an AMDGPU target.
				353
				354	Header
				355	------
				356
				357	The AMDGPU backend uses the following ELF header:
				358
				359	.. table:: AMDGPU ELF Header
				360	:name: amdgpu-elf-header-table
				361
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	362	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	363	Field Value
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	364	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	365	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				366	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	367	``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA`` or
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame^]	368	``ELFOSABI_AMDGPU_PAL`` or
				369	``ELFOSABI_AMDGPU_MESA3D``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	370	``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA`` or
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame^]	371	``ELFABIVERSION_AMDGPU_PAL`` or
				372	``ELFABIVERSION_AMDGPU_MESA3D``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	373	``e_type`` ``ET_REL`` or ``ET_DYN``
				374	``e_machine`` ``EM_AMDGPU``
				375	``e_entry`` 0
				376	``e_flags`` 0
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	377	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	378
				379	..
				380
				381	.. table:: AMDGPU ELF Header Enumeration Values
				382	:name: amdgpu-elf-header-enumeration-values-table
				383
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame^]	384	=============================== =====
				385	Name Value
				386	=============================== =====
				387	``EM_AMDGPU`` 224
				388	``ELFOSABI_AMDGPU_HSA`` 64
				389	``ELFOSABI_AMDGPU_PAL`` 65
				390	``ELFOSABI_AMDGPU_MESA3D`` 66
				391	``ELFABIVERSION_AMDGPU_HSA`` 1
				392	``ELFABIVERSION_AMDGPU_PAL`` 0
				393	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				394	=============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	395
				396	``e_ident[EI_CLASS]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	397	The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64
				398	bit applications.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	399
				400	``e_ident[EI_DATA]``
				401	All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
				402
				403	``e_ident[EI_OSABI]``
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame^]	404	One of the following AMD GPU architecture specific OS ABIs:
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	405
				406	* ``ELFOSABI_AMDGPU_HSA`` is used to specify that the code object conforms to
				407	the AMD HSA runtime ABI [HSA]_.
				408
				409	* ``ELFOSABI_AMDGPU_PAL`` is used to specify that the code object conforms to
				410	the AMD PAL runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	411
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame^]	412	* ``ELFOSABI_AMDGPU_MESA3D`` is used to specify that the code object conforms
				413	to the AMD MESA runtime ABI.
				414
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	415	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	416	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				417	object conforms:
				418
				419	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				420	runtime ABI.
				421
				422	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				423	runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	424
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame^]	425	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
				426	runtime ABI.
				427
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	428	``e_type``
				429	Can be one of the following values:
				430
				431
				432	``ET_REL``
				433	The type produced by the AMD GPU backend compiler as it is relocatable code
				434	object.
				435
				436	``ET_DYN``
				437	The type produced by the linker as it is a shared code object.
				438
				439	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				440
				441	``e_machine``
				442	The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
				443	architecture family. The specific member is specified in the
				444	``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
				445	:ref:`amdgpu-note-records`).
				446
				447	``e_entry``
				448	The entry point is 0 as the entry points for individual kernels must be
				449	selected in order to invoke them through AQL packets.
				450
				451	``e_flags``
				452	The value is 0 as no flags are used.
				453
				454	Sections
				455	--------
				456
				457	An AMDGPU target ELF code object has the standard ELF sections which include:
				458
				459	.. table:: AMDGPU ELF Sections
				460	:name: amdgpu-elf-sections-table
				461
				462	================== ================ =================================
				463	Name Type Attributes
				464	================== ================ =================================
				465	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				466	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				467	``.debug_``\ \* ``SHT_PROGBITS`` none
				468	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				469	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				470	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				471	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				472	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				473	``.note`` ``SHT_NOTE`` none
				474	``.rela``\ name ``SHT_RELA`` none
				475	``.rela.dyn`` ``SHT_RELA`` none
				476	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				477	``.shstrtab`` ``SHT_STRTAB`` none
				478	``.strtab`` ``SHT_STRTAB`` none
				479	``.symtab`` ``SHT_SYMTAB`` none
				480	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				481	================== ================ =================================
				482
				483	These sections have their standard meanings (see [ELF]_) and are only generated
				484	if needed.
				485
				486	``.debug``\ \*
				487	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				488	DWARF produced by the AMDGPU backend.
				489
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	490	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	491	The standard sections used by a dynamic loader.
				492
				493	``.note``
				494	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				495	backend.
				496
				497	``.rela``\ name, ``.rela.dyn``
				498	For relocatable code objects, name is the name of the section that the
				499	relocation records apply. For example, ``.rela.text`` is the section name for
				500	relocation records associated with the ``.text`` section.
				501
				502	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				503	records from each of the relocatable code object's ``.rela``\ name sections.
				504
				505	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				506	the AMDGPU backend.
				507
				508	``.text``
				509	The executable machine code for the kernels and functions they call. Generated
				510	as position independent code. See :ref:`amdgpu-code-conventions` for
				511	information on conventions used in the isa generation.
				512
				513	.. _amdgpu-note-records:
				514
				515	Note Records
				516	------------
				517
				518	As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
				519	the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
				520	minimal zero byte padding must be generated to ensure the ``desc`` field size is
				521	a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
				522	be at least 4 to indicate at least 8 byte alignment.
				523
				524	The AMDGPU backend code object uses the following ELF note records in the
				525	``.note`` section. The Description column specifies the layout of the note
				526	record’s ``desc`` field. All fields are consecutive bytes. Note records with
				527	variable size strings have a corresponding ``*_size`` field that specifies the
				528	number of bytes, including the terminating null character, in the string. The
				529	string(s) come immediately after the preceding fields.
				530
				531	Additional note records can be present.
				532
				533	.. table:: AMDGPU ELF Note Records
				534	:name: amdgpu-elf-note-records-table
				535
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	536	===== ============================== ======================================
				537	Name Type Description
				538	===== ============================== ======================================
				539	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
				540	"AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
				541	===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	542
				543	..
				544
				545	.. table:: AMDGPU ELF Note Record Enumeration Values
				546	:name: amdgpu-elf-note-record-enumeration-values-table
				547
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	548	============================== =====
				549	Name Value
				550	============================== =====
				551	reserved 0-9
				552	``NT_AMD_AMDGPU_HSA_METADATA`` 10
				553	``NT_AMD_AMDGPU_ISA`` 11
				554	============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	555
				556	``NT_AMD_AMDGPU_ISA``
				557	Specifies the instruction set architecture used by the machine code contained
				558	in the code object.
				559
				560	This note record is required for code objects containing machine code for
				561	processors matching the ``amdgcn`` architecture in table
				562	:ref:`amdgpu-processors`.
				563
				564	The null terminated string has the following syntax:
				565
				566	architecture\ ``-``\ vendor\ ``-``\ os\ ``-``\ environment\ ``-``\ processor
				567
				568	where:
				569
				570	architecture
				571	The architecture from table :ref:`amdgpu-target-triples-table`.
				572
				573	This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
				574	:ref:`amdgpu-target-triples`).
				575
				576	vendor
				577	The vendor from table :ref:`amdgpu-target-triples-table`.
				578
				579	For the AMDGPU backend this is always ``amd``.
				580
				581	os
				582	The OS from table :ref:`amdgpu-target-triples-table`.
				583
				584	environment
				585	An environment from table :ref:`amdgpu-target-triples-table`, or blank if
				586	the environment has no affect on the execution of the code object.
				587
				588	For the AMDGPU backend this is currently always blank.
				589	processor
				590	The processor from table :ref:`amdgpu-processors-table`.
				591
				592	For example:
				593
				594	``amdgcn-amd-amdhsa--gfx901``
				595
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	596	``NT_AMD_AMDGPU_HSA_METADATA``
				597	Specifies extensible metadata associated with the code objects executed on HSA
				598	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				599	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				600	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				601	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	602
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	603	.. _amdgpu-symbols:
				604
				605	Symbols
				606	-------
				607
				608	Symbols include the following:
				609
				610	.. table:: AMDGPU ELF Symbols
				611	:name: amdgpu-elf-symbols-table
				612
				613	===================== ============== ============= ==================
				614	Name Type Section Description
				615	===================== ============== ============= ==================
				616	link-name ``STT_OBJECT`` - ``.data`` Global variable
				617	- ``.rodata``
				618	- ``.bss``
				619	link-name\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
				620	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				621	===================== ============== ============= ==================
				622
				623	Global variable
				624	Global variables both used and defined by the compilation unit.
				625
				626	If the symbol is defined in the compilation unit then it is allocated in the
				627	appropriate section according to if it has initialized data or is readonly.
				628
				629	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				630	will resolve relocations using the definition provided by another code object
				631	or explicitly defined by the runtime.
				632
				633	All global symbols, whether defined in the compilation unit or external, are
				634	accessed by the machine code indirectly through a GOT table entry. This
				635	allows them to be preemptable. The GOT table is only supported when the target
				636	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	637
				638	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	639	Add description of linked shared object symbols. Seems undefined symbols
				640	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	641
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	642	Kernel descriptor
				643	Every HSA kernel has an associated kernel descriptor. It is the address of the
				644	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				645	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				646	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				647
				648	Kernel entry point
				649	Every HSA kernel also has a symbol for its machine code entry point.
				650
				651	.. _amdgpu-relocation-records:
				652
				653	Relocation Records
				654	------------------
				655
				656	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				657	relocatable fields are:
				658
				659	``word32``
				660	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				661	alignment. These values use the same byte order as other word values in the
				662	AMD GPU architecture.
				663
				664	``word64``
				665	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				666	alignment. These values use the same byte order as other word values in the
				667	AMD GPU architecture.
				668
				669	Following notations are used for specifying relocation calculations:
				670
				671	A
				672	Represents the addend used to compute the value of the relocatable field.
				673
				674	G
				675	Represents the offset into the global offset table at which the relocation
				676	entry’s symbol will reside during execution.
				677
				678	GOT
				679	Represents the address of the global offset table.
				680
				681	P
				682	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				683	of the storage unit being relocated (computed using ``r_offset``).
				684
				685	S
				686	Represents the value of the symbol whose index resides in the relocation
				687	entry.
				688
				689	The following relocation types are supported:
				690
				691	.. table:: AMDGPU ELF Relocation Records
				692	:name: amdgpu-elf-relocation-records-table
				693
				694	========================== ===== ========== ==============================
				695	Relocation Type Value Field Calculation
				696	========================== ===== ========== ==============================
				697	``R_AMDGPU_NONE`` 0 none none
				698	``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
				699	``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
				700	``R_AMDGPU_ABS64`` 3 ``word64`` S + A
				701	``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
				702	``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
				703	``R_AMDGPU_ABS32`` 6 ``word32`` S + A
				704	``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
				705	``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				706	``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
				707	``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				708	``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
				709	========================== ===== ========== ==============================
				710
				711	.. _amdgpu-dwarf:
				712
				713	DWARF
				714	-----
				715
				716	Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
				717	information that maps the code object executable code and data to the source
				718	language constructs. It can be used by tools such as debuggers and profilers.
				719
				720	Address Space Mapping
				721	~~~~~~~~~~~~~~~~~~~~~
				722
				723	The following address space mapping is used:
				724
				725	.. table:: AMDGPU DWARF Address Space Mapping
				726	:name: amdgpu-dwarf-address-space-mapping-table
				727
				728	=================== =================
				729	DWARF Address Space Memory Space
				730	=================== =================
				731	1 Private (Scratch)
				732	2 Local (group/LDS)
				733	omitted Global
				734	omitted Constant
				735	omitted Generic (Flat)
				736	not supported Region (GDS)
				737	=================== =================
				738
				739	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				740	used in the table.
				741
				742	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				743	DWARF address space of the value of the pointer when it is in the private or
				744	local address space. Otherwise the attribute is omitted.
				745
				746	An ``XDEREF`` operation is generated in location list expressions for variables
				747	that are allocated in the private and local address space. Otherwise no
				748	``XDREF`` is omitted.
				749
				750	Register Mapping
				751	~~~~~~~~~~~~~~~~
				752
				753	This section is WIP.
				754
				755	.. TODO
				756	Define DWARF register enumeration.
				757
				758	If want to present a wavefront state then should expose vector registers as
				759	64 wide (rather than per work-item view that LLVM uses). Either as separate
				760	registers, or a 64x4 byte single register. In either case use a new LANE op
				761	(akin to XDREF) to select the current lane usage in a location
				762	expression. This would also allow scalar register spilling to vector register
				763	lanes to be expressed (currently no debug information is being generated for
				764	spilling). If choose a wide single register approach then use LANE in
				765	conjunction with PIECE operation to select the dword part of the register for
				766	the current lane. If the separate register approach then use LANE to select
				767	the register.
				768
				769	Source Text
				770	~~~~~~~~~~~
				771
				772	This section is WIP.
				773
				774	.. TODO
				775	DWARF extension to include runtime generated source text.
				776
				777	.. _amdgpu-code-conventions:
				778
				779	Code Conventions
				780	================
				781
				782	This section provides code conventions used for each supported target triple OS
				783	(see :ref:`amdgpu-target-triples`).
				784
				785	AMDHSA
				786	------
				787
				788	This section provides code conventions used when the target triple OS is
				789	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				790
				791	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	792
				793	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	794	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	795
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	796	The code object metadata specifies extensible metadata associated with the code
				797	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
				798	[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
				799	(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
				800	``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
				801	information necessary to support the ROCM kernel queries. For example, the
				802	segment sizes needed in a dispatch packet. In addition, a high level language
				803	runtime may require other information to be included. For example, the AMD
				804	OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	805
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	806	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	807	:doc:`YamlIO`).
				808
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	809	.. TODO
				810	Is the string null terminated? It probably should not if YAML allows it to
				811	contain null characters, otherwise it should be.
				812
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	813	The metadata is represented as a single YAML document comprised of the mapping
				814	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				815	referenced tables.
				816
				817	For boolean values, the string values of ``false`` and ``true`` are used for
				818	false and true respectively.
				819
				820	Additional information can be added to the mappings. To avoid conflicts, any
				821	non-AMD key names should be prefixed by "vendor-name.".
				822
				823	.. table:: AMDHSA Code Object Metadata Mapping
				824	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				825
				826	========== ============== ========= =======================================
				827	String Key Value Type Required? Description
				828	========== ============== ========= =======================================
				829	"Version" sequence of Required - The first integer is the major
				830	2 integers version. Currently 1.
				831	- The second integer is the minor
				832	version. Currently 0.
				833	"Printf" sequence of Each string is encoded information
				834	strings about a printf function call. The
				835	encoded information is organized as
				836	fields separated by colon (':'):
				837
				838	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				839
				840	where:
				841
				842	``ID``
				843	A 32 bit integer as a unique id for
				844	each printf function call
				845
				846	``N``
				847	A 32 bit integer equal to the number
				848	of arguments of printf function call
				849	minus 1
				850
				851	``S[i]`` (where i = 0, 1, ... , N-1)
				852	32 bit integers for the size in bytes
				853	of the i-th FormatString argument of
				854	the printf function call
				855
				856	FormatString
				857	The format string passed to the
				858	printf function call.
				859	"Kernels" sequence of Required Sequence of the mappings for each
				860	mapping kernel in the code object. See
				861	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				862	for the definition of the mapping.
				863	========== ============== ========= =======================================
				864
				865	..
				866
				867	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				868	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				869
				870	================= ============== ========= ================================
				871	String Key Value Type Required? Description
				872	================= ============== ========= ================================
				873	"Name" string Required Source name of the kernel.
				874	"SymbolName" string Required Name of the kernel
				875	descriptor ELF symbol.
				876	"Language" string Source language of the kernel.
				877	Values include:
				878
				879	- "OpenCL C"
				880	- "OpenCL C++"
				881	- "HCC"
				882	- "OpenMP"
				883
				884	"LanguageVersion" sequence of - The first integer is the major
				885	2 integers version.
				886	- The second integer is the
				887	minor version.
				888	"Attrs" mapping Mapping of kernel attributes.
				889	See
				890	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				891	for the mapping definition.
				892	"Arguments" sequence of Sequence of mappings of the
				893	mapping kernel arguments. See
				894	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				895	for the definition of the mapping.
				896	"CodeProps" mapping Mapping of properties related to
				897	the kernel code. See
				898	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				899	for the mapping definition.
				900	"DebugProps" mapping Mapping of properties related to
				901	the kernel debugging. See
				902	:ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
				903	for the mapping definition.
				904	================= ============== ========= ================================
				905
				906	..
				907
				908	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				909	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				910
				911	=================== ============== ========= ==============================
				912	String Key Value Type Required? Description
				913	=================== ============== ========= ==============================
				914	"ReqdWorkGroupSize" sequence of The dispatch work-group size
				915	3 integers X, Y, Z must correspond to the
				916	specified values.
				917
				918	Corresponds to the OpenCL
				919	``reqd_work_group_size``
				920	attribute.
				921	"WorkGroupSizeHint" sequence of The dispatch work-group size
				922	3 integers X, Y, Z is likely to be the
				923	specified values.
				924
				925	Corresponds to the OpenCL
				926	``work_group_size_hint``
				927	attribute.
				928	"VecTypeHint" string The name of a scalar or vector
				929	type.
				930
				931	Corresponds to the OpenCL
				932	``vec_type_hint`` attribute.
				933	=================== ============== ========= ==============================
				934
				935	..
				936
				937	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				938	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				939
				940	================= ============== ========= ================================
				941	String Key Value Type Required? Description
				942	================= ============== ========= ================================
				943	"Name" string Kernel argument name.
				944	"TypeName" string Kernel argument type name.
				945	"Size" integer Required Kernel argument size in bytes.
				946	"Align" integer Required Kernel argument alignment in
				947	bytes. Must be a power of two.
				948	"ValueKind" string Required Kernel argument kind that
				949	specifies how to set up the
				950	corresponding argument.
				951	Values include:
				952
				953	"ByValue"
				954	The argument is copied
				955	directly into the kernarg.
				956
				957	"GlobalBuffer"
				958	A global address space pointer
				959	to the buffer data is passed
				960	in the kernarg.
				961
				962	"DynamicSharedPointer"
				963	A group address space pointer
				964	to dynamically allocated LDS
				965	is passed in the kernarg.
				966
				967	"Sampler"
				968	A global address space
				969	pointer to a S# is passed in
				970	the kernarg.
				971
				972	"Image"
				973	A global address space
				974	pointer to a T# is passed in
				975	the kernarg.
				976
				977	"Pipe"
				978	A global address space pointer
				979	to an OpenCL pipe is passed in
				980	the kernarg.
				981
				982	"Queue"
				983	A global address space pointer
				984	to an OpenCL device enqueue
				985	queue is passed in the
				986	kernarg.
				987
				988	"HiddenGlobalOffsetX"
				989	The OpenCL grid dispatch
				990	global offset for the X
				991	dimension is passed in the
				992	kernarg.
				993
				994	"HiddenGlobalOffsetY"
				995	The OpenCL grid dispatch
				996	global offset for the Y
				997	dimension is passed in the
				998	kernarg.
				999
				1000	"HiddenGlobalOffsetZ"
				1001	The OpenCL grid dispatch
				1002	global offset for the Z
				1003	dimension is passed in the
				1004	kernarg.
				1005
				1006	"HiddenNone"
				1007	An argument that is not used
				1008	by the kernel. Space needs to
				1009	be left for it, but it does
				1010	not need to be set up.
				1011
				1012	"HiddenPrintfBuffer"
				1013	A global address space pointer
				1014	to the runtime printf buffer
				1015	is passed in kernarg.
				1016
				1017	"HiddenDefaultQueue"
				1018	A global address space pointer
				1019	to the OpenCL device enqueue
				1020	queue that should be used by
				1021	the kernel by default is
				1022	passed in the kernarg.
				1023
				1024	"HiddenCompletionAction"
				1025	TBD
				1026
				1027	.. TODO
				1028	Add description.
				1029
				1030	"ValueType" string Required Kernel argument value type. Only
				1031	present if "ValueKind" is
				1032	"ByValue". For vector data
				1033	types, the value is for the
				1034	element type. Values include:
				1035
				1036	- "Struct"
				1037	- "I8"
				1038	- "U8"
				1039	- "I16"
				1040	- "U16"
				1041	- "F16"
				1042	- "I32"
				1043	- "U32"
				1044	- "F32"
				1045	- "I64"
				1046	- "U64"
				1047	- "F64"
				1048
				1049	.. TODO
				1050	How can it be determined if a
				1051	vector type, and what size
				1052	vector?
				1053	"PointeeAlign" integer Alignment in bytes of pointee
				1054	type for pointer type kernel
				1055	argument. Must be a power
				1056	of 2. Only present if
				1057	"ValueKind" is
				1058	"DynamicSharedPointer".
				1059	"AddrSpaceQual" string Kernel argument address space
				1060	qualifier. Only present if
				1061	"ValueKind" is "GlobalBuffer" or
				1062	"DynamicSharedPointer". Values
				1063	are:
				1064
				1065	- "Private"
				1066	- "Global"
				1067	- "Constant"
				1068	- "Local"
				1069	- "Generic"
				1070	- "Region"
				1071
				1072	.. TODO
				1073	Is GlobalBuffer only Global
				1074	or Constant? Is
				1075	DynamicSharedPointer always
				1076	Local? Can HCC allow Generic?
				1077	How can Private or Region
				1078	ever happen?
				1079	"AccQual" string Kernel argument access
				1080	qualifier. Only present if
				1081	"ValueKind" is "Image" or
				1082	"Pipe". Values
				1083	are:
				1084
				1085	- "ReadOnly"
				1086	- "WriteOnly"
				1087	- "ReadWrite"
				1088
				1089	.. TODO
				1090	Does this apply to
				1091	GlobalBuffer?
				1092	"ActualAcc" string The actual memory accesses
				1093	performed by the kernel on the
				1094	kernel argument. Only present if
				1095	"ValueKind" is "GlobalBuffer",
				1096	"Image", or "Pipe". This may be
				1097	more restrictive than indicated
				1098	by "AccQual" to reflect what the
				1099	kernel actual does. If not
				1100	present then the runtime must
				1101	assume what is implied by
				1102	"AccQual" and "IsConst". Values
				1103	are:
				1104
				1105	- "ReadOnly"
				1106	- "WriteOnly"
				1107	- "ReadWrite"
				1108
				1109	"IsConst" boolean Indicates if the kernel argument
				1110	is const qualified. Only present
				1111	if "ValueKind" is
				1112	"GlobalBuffer".
				1113
				1114	"IsRestrict" boolean Indicates if the kernel argument
				1115	is restrict qualified. Only
				1116	present if "ValueKind" is
				1117	"GlobalBuffer".
				1118
				1119	"IsVolatile" boolean Indicates if the kernel argument
				1120	is volatile qualified. Only
				1121	present if "ValueKind" is
				1122	"GlobalBuffer".
				1123
				1124	"IsPipe" boolean Indicates if the kernel argument
				1125	is pipe qualified. Only present
				1126	if "ValueKind" is "Pipe".
				1127
				1128	.. TODO
				1129	Can GlobalBuffer be pipe
				1130	qualified?
				1131	================= ============== ========= ================================
				1132
				1133	..
				1134
				1135	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1136	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1137
				1138	============================ ============== ========= =====================
				1139	String Key Value Type Required? Description
				1140	============================ ============== ========= =====================
				1141	"KernargSegmentSize" integer Required The size in bytes of
				1142	the kernarg segment
				1143	that holds the values
				1144	of the arguments to
				1145	the kernel.
				1146	"GroupSegmentFixedSize" integer Required The amount of group
				1147	segment memory
				1148	required by a
				1149	work-group in
				1150	bytes. This does not
				1151	include any
				1152	dynamically allocated
				1153	group segment memory
				1154	that may be added
				1155	when the kernel is
				1156	dispatched.
				1157	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1158	private address space
				1159	memory required for a
				1160	work-item in
				1161	bytes. If
				1162	IsDynamicCallstack
				1163	is 1 then additional
				1164	space must be added
				1165	to this value for the
				1166	call stack.
				1167	"KernargSegmentAlign" integer Required The maximum byte
				1168	alignment of
				1169	arguments in the
				1170	kernarg segment. Must
				1171	be a power of 2.
				1172	"WavefrontSize" integer Required Wavefront size. Must
				1173	be a power of 2.
				1174	"NumSGPRs" integer Number of scalar
				1175	registers used by a
				1176	wavefront for
				1177	GFX6-GFX9. This
				1178	includes the special
				1179	SGPRs for VCC, Flat
				1180	Scratch (GFX7-GFX9)
				1181	and XNACK (for
				1182	GFX8-GFX9). It does
				1183	not include the 16
				1184	SGPR added if a trap
				1185	handler is
				1186	enabled. It is not
				1187	rounded up to the
				1188	allocation
				1189	granularity.
				1190	"NumVGPRs" integer Number of vector
				1191	registers used by
				1192	each work-item for
				1193	GFX6-GFX9
				1194	"MaxFlatWorkgroupSize" integer Maximum flat
				1195	work-group size
				1196	supported by the
				1197	kernel in work-items.
				1198	"IsDynamicCallStack" boolean Indicates if the
				1199	generated machine
				1200	code is using a
				1201	dynamically sized
				1202	call stack.
				1203	"IsXNACKEnabled" boolean Indicates if the
				1204	generated machine
				1205	code is capable of
				1206	supporting XNACK.
				1207	============================ ============== ========= =====================
				1208
				1209	..
				1210
				1211	.. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
				1212	:name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
				1213
				1214	=================================== ============== ========= ==============
				1215	String Key Value Type Required? Description
				1216	=================================== ============== ========= ==============
				1217	"DebuggerABIVersion" string
				1218	"ReservedNumVGPRs" integer
				1219	"ReservedFirstVGPR" integer
				1220	"PrivateSegmentBufferSGPR" integer
				1221	"WavefrontPrivateSegmentOffsetSGPR" integer
				1222	=================================== ============== ========= ==============
				1223
				1224	.. TODO
				1225	Plan to remove the debug properties metadata.
				1226
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1227	Kernel Dispatch
				1228	~~~~~~~~~~~~~~~
				1229
				1230	The HSA architected queuing language (AQL) defines a user space memory interface
				1231	that can be used to control the dispatch of kernels, in an agent independent
				1232	way. An agent can have zero or more AQL queues created for it using the ROCm
				1233	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1234	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1235	mechanics and packet layouts.
				1236
				1237	The packet processor of a kernel agent is responsible for detecting and
				1238	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1239	packet processor is implemented by the hardware command processor (CP),
				1240	asynchronous dispatch controller (ADC) and shader processor input controller
				1241	(SPI).
				1242
				1243	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1244	mode driver to initialize and register the AQL queue with CP.
				1245
				1246	To dispatch a kernel the following actions are performed. This can occur in the
				1247	CPU host program, or from an HSA kernel executing on a GPU.
				1248
				1249	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1250	executed is obtained.
				1251	2. A pointer to the kernel descriptor (see
				1252	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1253	obtained. It must be for a kernel that is contained in a code object that that
				1254	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1255	associated.
				1256	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1257	for a memory region with the kernarg property for the kernel agent that will
				1258	execute the kernel. It must be at least 16 byte aligned.
				1259	4. Kernel argument values are assigned to the kernel argument memory
				1260	allocation. The layout is defined in the HSA Programmer’s Language Reference
				1261	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1262	memory in the same way constant memory is accessed. (Note that the HSA
				1263	specification allows an implementation to copy the kernel argument contents to
				1264	another location that is accessed by the kernel.)
				1265	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1266	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1267	packet. The packet must be set up, and the final write must use an atomic
				1268	store release to set the packet kind to ensure the packet contents are
				1269	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1270	notify the kernel agent that the AQL queue has been updated. These rules, and
				1271	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1272	System Architecture Specification* [HSA]_.
				1273	6. A kernel dispatch packet includes information about the actual dispatch,
				1274	such as grid and work-group size, together with information from the code
				1275	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1276	the kernel symbol can be used to obtain the code object values which are
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1277	recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1278	7. CP executes micro-code and is responsible for detecting and setting up the
				1279	GPU to execute the wavefronts of a kernel dispatch.
				1280	8. CP ensures that when the a wavefront starts executing the kernel machine
				1281	code, the scalar general purpose registers (SGPR) and vector general purpose
				1282	registers (VGPR) are set up as required by the machine code. The required
				1283	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1284	register state is defined in
				1285	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1286	9. The prolog of the kernel machine code (see
				1287	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1288	before continuing executing the machine code that corresponds to the kernel.
				1289	10. When the kernel dispatch has completed execution, CP signals the completion
				1290	signal specified in the kernel dispatch packet if not 0.
				1291
				1292	.. _amdgpu-amdhsa-memory-spaces:
				1293
				1294	Memory Spaces
				1295	~~~~~~~~~~~~~
				1296
				1297	The memory space properties are:
				1298
				1299	.. table:: AMDHSA Memory Spaces
				1300	:name: amdgpu-amdhsa-memory-spaces-table
				1301
				1302	================= =========== ======== ======= ==================
				1303	Memory Space Name HSA Segment Hardware Address NULL Value
				1304	Name Name Size
				1305	================= =========== ======== ======= ==================
				1306	Private private scratch 32 0x00000000
				1307	Local group LDS 32 0xFFFFFFFF
				1308	Global global global 64 0x0000000000000000
				1309	Constant constant *same as 64 0x0000000000000000
				1310	global*
				1311	Generic flat flat 64 0x0000000000000000
				1312	Region N/A GDS 32 *not implemented
				1313	for AMDHSA*
				1314	================= =========== ======== ======= ==================
				1315
				1316	The global and constant memory spaces both use global virtual addresses, which
				1317	are the same virtual address space used by the CPU. However, some virtual
				1318	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1319	and some by both.
				1320
				1321	Using the constant memory space indicates that the data will not change during
				1322	the execution of the kernel. This allows scalar read instructions to be
				1323	used. The vector and scalar L1 caches are invalidated of volatile data before
				1324	each kernel dispatch execution to allow constant memory to change values between
				1325	kernel dispatches.
				1326
				1327	The local memory space uses the hardware Local Data Store (LDS) which is
				1328	automatically allocated when the hardware creates work-groups of wavefronts, and
				1329	freed when all the wavefronts of a work-group have terminated. The data store
				1330	(DS) instructions can be used to access it.
				1331
				1332	The private memory space uses the hardware scratch memory support. If the kernel
				1333	uses scratch, then the hardware allocates memory that is accessed using
				1334	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1335	address to physical address is:
				1336
				1337	``wavefront-scratch-base +
				1338	(private-address * wavefront-size * 4) +
				1339	(wavefront-lane-id * 4)``
				1340
				1341	There are different ways that the wavefront scratch base address is determined
				1342	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1343	memory can be accessed in an interleaved manner using buffer instruction with
				1344	the scratch buffer descriptor and per wave scratch offset, by the scratch
				1345	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1346	same private address, the interleaving results in adjacent dwords being accessed
				1347	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1348	supported except by flat and scratch instructions in GFX9.
				1349
				1350	The generic address space uses the hardware flat address support available in
				1351	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1352	local appertures), that are outside the range of addressible global memory, to
				1353	map from a flat address to a private or local address.
				1354
				1355	FLAT instructions can take a flat address and access global, private (scratch)
				1356	and group (LDS) memory depending in if the address is within one of the
				1357	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1358	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1359	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1360	(see :ref:`amdgpu-amdhsa-m0`).
				1361
				1362	To convert between a segment address and a flat address the base address of the
				1363	appertures address can be used. For GFX7-GFX8 these are available in the
				1364	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1365	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1366	GFX9 the appature base addresses are directly available as inline constant
				1367	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1368	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1369	which makes it easier to convert from flat to segment or segment to flat.
				1370
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1371	Image and Samplers
				1372	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1373
				1374	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1375	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1376	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1377	enumeration values for the queries that are not trivially deducible from the S#
				1378	representation.
				1379
				1380	HSA Signals
				1381	~~~~~~~~~~~
				1382
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1383	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1384	structure allocated in memory accessible from both the CPU and GPU. The
				1385	structure is defined by the ROCm runtime and subject to change between releases
				1386	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1387
				1388	.. _amdgpu-amdhsa-hsa-aql-queue:
				1389
				1390	HSA AQL Queue
				1391	~~~~~~~~~~~~~
				1392
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1393	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1394	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1395	fields needed to implement certain language features such as the flat address
				1396	aperture bases. It also contains fields used by CP such as managing the
				1397	allocation of scratch memory.
				1398
				1399	.. _amdgpu-amdhsa-kernel-descriptor:
				1400
				1401	Kernel Descriptor
				1402	~~~~~~~~~~~~~~~~~
				1403
				1404	A kernel descriptor consists of the information needed by CP to initiate the
				1405	execution of a kernel, including the entry point address of the machine code
				1406	that implements the kernel.
				1407
				1408	Kernel Descriptor for GFX6-GFX9
				1409	+++++++++++++++++++++++++++++++
				1410
				1411	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1412
				1413	.. table:: Kernel Descriptor for GFX6-GFX9
				1414	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1415
				1416	======= ======= =============================== ===========================
				1417	Bits Size Field Name Description
				1418	======= ======= =============================== ===========================
				1419	31:0 4 bytes group_segment_fixed_size The amount of fixed local
				1420	address space memory
				1421	required for a work-group
				1422	in bytes. This does not
				1423	include any dynamically
				1424	allocated local address
				1425	space memory that may be
				1426	added when the kernel is
				1427	dispatched.
				1428	63:32 4 bytes private_segment_fixed_size The amount of fixed
				1429	private address space
				1430	memory required for a
				1431	work-item in bytes. If
				1432	is_dynamic_callstack is 1
				1433	then additional space must
				1434	be added to this value for
				1435	the call stack.
				1436	95:64 4 bytes max_flat_workgroup_size Maximum flat work-group
				1437	size supported by the
				1438	kernel in work-items.
				1439	96 1 bit is_dynamic_call_stack Indicates if the generated
				1440	machine code is using a
				1441	dynamically sized call
				1442	stack.
				1443	97 1 bit is_xnack_enabled Indicates if the generated
				1444	machine code is capable of
				1445	suppoting XNACK.
				1446	127:98 30 bits Reserved. Must be 0.
				1447	191:128 8 bytes kernel_code_entry_byte_offset Byte offset (possibly
				1448	negative) from base
				1449	address of kernel
				1450	descriptor to kernel's
				1451	entry point instruction
				1452	which must be 256 byte
				1453	aligned.
				1454	383:192 24 Reserved. Must be 0.
				1455	bytes
				1456	415:384 4 bytes compute_pgm_rsrc1 Compute Shader (CS)
				1457	program settings used by
				1458	CP to set up
				1459	``COMPUTE_PGM_RSRC1``
				1460	configuration
				1461	register. See
				1462	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
				1463	447:416 4 bytes compute_pgm_rsrc2 Compute Shader (CS)
				1464	program settings used by
				1465	CP to set up
				1466	``COMPUTE_PGM_RSRC2``
				1467	configuration
				1468	register. See
				1469	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				1470	448 1 bit enable_sgpr_private_segment Enable the setup of the
				1471	_buffer SGPR user data registers
				1472	(see
				1473	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1474
				1475	The total number of SGPR
				1476	user data registers
				1477	requested must not exceed
				1478	16 and match value in
				1479	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1480	Any requests beyond 16
				1481	will be ignored.
				1482	449 1 bit enable_sgpr_dispatch_ptr see above
				1483	450 1 bit enable_sgpr_queue_ptr see above
				1484	451 1 bit enable_sgpr_kernarg_segment_ptr see above
				1485	452 1 bit enable_sgpr_dispatch_id see above
				1486	453 1 bit enable_sgpr_flat_scratch_init see above
				1487	454 1 bit enable_sgpr_private_segment see above
				1488	_size
				1489	455 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
				1490	_count_X should always be 0.
				1491	456 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
				1492	_count_Y should always be 0.
				1493	457 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
				1494	_count_Z should always be 0.
				1495	463:458 6 bits Reserved. Must be 0.
				1496	511:464 4 Reserved. Must be 0.
				1497	bytes
				1498	512 Total size 64 bytes.
				1499	======= ===================================================================
				1500
				1501	..
				1502
				1503	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
				1504	:name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
				1505
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1506	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1507	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1508	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1509	5:0 6 bits granulated_workitem_vgpr_count Number of vector registers
				1510	used by each work-item,
				1511	granularity is device
				1512	specific:
				1513
				1514	GFX6-9
				1515	roundup((max-vgpg + 1)
				1516	/ 4) - 1
				1517
				1518	Used by CP to set up
				1519	``COMPUTE_PGM_RSRC1.VGPRS``.
				1520	9:6 4 bits granulated_wavefront_sgpr_count Number of scalar registers
				1521	used by a wavefront,
				1522	granularity is device
				1523	specific:
				1524
				1525	GFX6-8
				1526	roundup((max-sgpg + 1)
				1527	/ 8) - 1
				1528	GFX9
				1529	roundup((max-sgpg + 1)
				1530	/ 16) - 1
				1531
				1532	Includes the special SGPRs
				1533	for VCC, Flat Scratch (for
				1534	GFX7 onwards) and XNACK
				1535	(for GFX8 onwards). It does
				1536	not include the 16 SGPR
				1537	added if a trap handler is
				1538	enabled.
				1539
				1540	Used by CP to set up
				1541	``COMPUTE_PGM_RSRC1.SGPRS``.
				1542	11:10 2 bits priority Must be 0.
				1543
				1544	Start executing wavefront
				1545	at the specified priority.
				1546
				1547	CP is responsible for
				1548	filling in
				1549	``COMPUTE_PGM_RSRC1.PRIORITY``.
				1550	13:12 2 bits float_mode_round_32 Wavefront starts execution
				1551	with specified rounding
				1552	mode for single (32
				1553	bit) floating point
				1554	precision floating point
				1555	operations.
				1556
				1557	Floating point rounding
				1558	mode values are defined in
				1559	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1560
				1561	Used by CP to set up
				1562	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
				1563	15:14 2 bits float_mode_round_16_64 Wavefront starts execution
				1564	with specified rounding
				1565	denorm mode for half/double (16
				1566	and 64 bit) floating point
				1567	precision floating point
				1568	operations.
				1569
				1570	Floating point rounding
				1571	mode values are defined in
				1572	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1573
				1574	Used by CP to set up
				1575	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
				1576	17:16 2 bits float_mode_denorm_32 Wavefront starts execution
				1577	with specified denorm mode
				1578	for single (32
				1579	bit) floating point
				1580	precision floating point
				1581	operations.
				1582
				1583	Floating point denorm mode
				1584	values are defined in
				1585	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1586
				1587	Used by CP to set up
				1588	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
				1589	19:18 2 bits float_mode_denorm_16_64 Wavefront starts execution
				1590	with specified denorm mode
				1591	for half/double (16
				1592	and 64 bit) floating point
				1593	precision floating point
				1594	operations.
				1595
				1596	Floating point denorm mode
				1597	values are defined in
				1598	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1599
				1600	Used by CP to set up
				1601	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
				1602	20 1 bit priv Must be 0.
				1603
				1604	Start executing wavefront
				1605	in privilege trap handler
				1606	mode.
				1607
				1608	CP is responsible for
				1609	filling in
				1610	``COMPUTE_PGM_RSRC1.PRIV``.
				1611	21 1 bit enable_dx10_clamp Wavefront starts execution
				1612	with DX10 clamp mode
				1613	enabled. Used by the vector
				1614	ALU to force DX-10 style
				1615	treatment of NaN's (when
				1616	set, clamp NaN to zero,
				1617	otherwise pass NaN
				1618	through).
				1619
				1620	Used by CP to set up
				1621	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
				1622	22 1 bit debug_mode Must be 0.
				1623
				1624	Start executing wavefront
				1625	in single step mode.
				1626
				1627	CP is responsible for
				1628	filling in
				1629	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
				1630	23 1 bit enable_ieee_mode Wavefront starts execution
				1631	with IEEE mode
				1632	enabled. Floating point
				1633	opcodes that support
				1634	exception flag gathering
				1635	will quiet and propagate
				1636	signaling-NaN inputs per
				1637	IEEE 754-2008. Min_dx10 and
				1638	max_dx10 become IEEE
				1639	754-2008 compliant due to
				1640	signaling-NaN propagation
				1641	and quieting.
				1642
				1643	Used by CP to set up
				1644	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
				1645	24 1 bit bulky Must be 0.
				1646
				1647	Only one work-group allowed
				1648	to execute on a compute
				1649	unit.
				1650
				1651	CP is responsible for
				1652	filling in
				1653	``COMPUTE_PGM_RSRC1.BULKY``.
				1654	25 1 bit cdbg_user Must be 0.
				1655
				1656	Flag that can be used to
				1657	control debugging code.
				1658
				1659	CP is responsible for
				1660	filling in
				1661	``COMPUTE_PGM_RSRC1.CDBG_USER``.
				1662	31:26 6 bits Reserved. Must be 0.
				1663	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1664	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1665
				1666	..
				1667
				1668	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1669	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1670
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1671	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1672	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1673	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1674	0 1 bit enable_sgpr_private_segment Enable the setup of the
				1675	_wave_offset SGPR wave scratch offset
				1676	system register (see
				1677	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1678
				1679	Used by CP to set up
				1680	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
				1681	5:1 5 bits user_sgpr_count The total number of SGPR
				1682	user data registers
				1683	requested. This number must
				1684	match the number of user
				1685	data registers enabled.
				1686
				1687	Used by CP to set up
				1688	``COMPUTE_PGM_RSRC2.USER_SGPR``.
				1689	6 1 bit enable_trap_handler Set to 1 if code contains a
				1690	TRAP instruction which
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1691	requires a trap handler to
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1692	be enabled.
				1693
				1694	CP sets
				1695	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
				1696	if the runtime has
				1697	installed a trap handler
				1698	regardless of the setting
				1699	of this field.
				1700	7 1 bit enable_sgpr_workgroup_id_x Enable the setup of the
				1701	system SGPR register for
				1702	the work-group id in the X
				1703	dimension (see
				1704	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1705
				1706	Used by CP to set up
				1707	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
				1708	8 1 bit enable_sgpr_workgroup_id_y Enable the setup of the
				1709	system SGPR register for
				1710	the work-group id in the Y
				1711	dimension (see
				1712	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1713
				1714	Used by CP to set up
				1715	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
				1716	9 1 bit enable_sgpr_workgroup_id_z Enable the setup of the
				1717	system SGPR register for
				1718	the work-group id in the Z
				1719	dimension (see
				1720	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1721
				1722	Used by CP to set up
				1723	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
				1724	10 1 bit enable_sgpr_workgroup_info Enable the setup of the
				1725	system SGPR register for
				1726	work-group information (see
				1727	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1728
				1729	Used by CP to set up
				1730	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
				1731	12:11 2 bits enable_vgpr_workitem_id Enable the setup of the
				1732	VGPR system registers used
				1733	for the work-item ID.
				1734	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1735	defines the values.
				1736
				1737	Used by CP to set up
				1738	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
				1739	13 1 bit enable_exception_address_watch Must be 0.
				1740
				1741	Wavefront starts execution
				1742	with address watch
				1743	exceptions enabled which
				1744	are generated when L1 has
				1745	witnessed a thread access
				1746	an *address of
				1747	interest*.
				1748
				1749	CP is responsible for
				1750	filling in the address
				1751	watch bit in
				1752	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1753	according to what the
				1754	runtime requests.
				1755	14 1 bit enable_exception_memory Must be 0.
				1756
				1757	Wavefront starts execution
				1758	with memory violation
				1759	exceptions exceptions
				1760	enabled which are generated
				1761	when a memory violation has
				1762	occurred for this wave from
				1763	L1 or LDS
				1764	(write-to-read-only-memory,
				1765	mis-aligned atomic, LDS
				1766	address out of range,
				1767	illegal address, etc.).
				1768
				1769	CP sets the memory
				1770	violation bit in
				1771	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1772	according to what the
				1773	runtime requests.
				1774	23:15 9 bits granulated_lds_size Must be 0.
				1775
				1776	CP uses the rounded value
				1777	from the dispatch packet,
				1778	not this value, as the
				1779	dispatch may contain
				1780	dynamically allocated group
				1781	segment memory. CP writes
				1782	directly to
				1783	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1784
				1785	Amount of group segment
				1786	(LDS) to allocate for each
				1787	work-group. Granularity is
				1788	device specific:
				1789
				1790	GFX6:
				1791	roundup(lds-size / (64 * 4))
				1792	GFX7-GFX9:
				1793	roundup(lds-size / (128 * 4))
				1794
				1795	24 1 bit enable_exception_ieee_754_fp Wavefront starts execution
				1796	_invalid_operation with specified exceptions
				1797	enabled.
				1798
				1799	Used by CP to set up
				1800	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1801	(set from bits 0..6).
				1802
				1803	IEEE 754 FP Invalid
				1804	Operation
				1805	25 1 bit enable_exception_fp_denormal FP Denormal one or more
				1806	_source input operands is a
				1807	denormal number
				1808	26 1 bit enable_exception_ieee_754_fp IEEE 754 FP Division by
				1809	_division_by_zero Zero
				1810	27 1 bit enable_exception_ieee_754_fp IEEE 754 FP FP Overflow
				1811	_overflow
				1812	28 1 bit enable_exception_ieee_754_fp IEEE 754 FP Underflow
				1813	_underflow
				1814	29 1 bit enable_exception_ieee_754_fp IEEE 754 FP Inexact
				1815	_inexact
				1816	30 1 bit enable_exception_int_divide_by Integer Division by Zero
				1817	_zero (rcp_iflag_f32 instruction
				1818	only)
				1819	31 1 bit Reserved. Must be 0.
				1820	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1821	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1822
				1823	..
				1824
				1825	.. table:: Floating Point Rounding Mode Enumeration Values
				1826	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1827
				1828	===================================== ===== ===============================
				1829	Enumeration Name Value Description
				1830	===================================== ===== ===============================
				1831	AMD_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				1832	AMD_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				1833	AMD_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				1834	AMD_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
				1835	===================================== ===== ===============================
				1836
				1837	..
				1838
				1839	.. table:: Floating Point Denorm Mode Enumeration Values
				1840	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				1841
				1842	===================================== ===== ===============================
				1843	Enumeration Name Value Description
				1844	===================================== ===== ===============================
				1845	AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
				1846	Denorms
				1847	AMD_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				1848	AMD_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				1849	AMD_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
				1850	===================================== ===== ===============================
				1851
				1852	..
				1853
				1854	.. table:: System VGPR Work-Item ID Enumeration Values
				1855	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				1856
				1857	===================================== ===== ===============================
				1858	Enumeration Name Value Description
				1859	===================================== ===== ===============================
				1860	AMD_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension ID.
				1861	AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				1862	dimensions ID.
				1863	AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				1864	dimensions ID.
				1865	AMD_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				1866	===================================== ===== ===============================
				1867
				1868	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				1869
				1870	Initial Kernel Execution State
				1871	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1872
				1873	This section defines the register state that will be set up by the packet
				1874	processor prior to the start of execution of every wavefront. This is limited by
				1875	the constraints of the hardware controllers of CP/ADC/SPI.
				1876
				1877	The order of the SGPR registers is defined, but the compiler can specify which
				1878	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				1879	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				1880	for enabled registers are dense starting at SGPR0: the first enabled register is
				1881	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				1882	an SGPR number.
				1883
				1884	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
				1885	all waves of the grid. It is possible to specify more than 16 User SGPRs using
				1886	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				1887	initialized. These are then immediately followed by the System SGPRs that are
				1888	set up by ADC/SPI and can have different values for each wave of the grid
				1889	dispatch.
				1890
				1891	SGPR register initial state is defined in
				1892	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				1893
				1894	.. table:: SGPR Register Set Up Order
				1895	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				1896
				1897	========== ========================== ====== ==============================
				1898	SGPR Order Name Number Description
				1899	(kernel descriptor enable of
				1900	field) SGPRs
				1901	========== ========================== ====== ==============================
				1902	First Private Segment Buffer 4 V# that can be used, together
				1903	(enable_sgpr_private with Scratch Wave Offset as an
				1904	_segment_buffer) offset, to access the private
				1905	memory space using a segment
				1906	address.
				1907
				1908	CP uses the value provided by
				1909	the runtime.
				1910	then Dispatch Ptr 2 64 bit address of AQL dispatch
				1911	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				1912	actually executing.
				1913	then Queue Ptr 2 64 bit address of amd_queue_t
				1914	(enable_sgpr_queue_ptr) object for AQL queue on which
				1915	the dispatch packet was
				1916	queued.
				1917	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				1918	(enable_sgpr_kernarg segment. This is directly
				1919	_segment_ptr) copied from the
				1920	kernarg_address in the kernel
				1921	dispatch packet.
				1922
				1923	Having CP load it once avoids
				1924	loading it at the beginning of
				1925	every wavefront.
				1926	then Dispatch Id 2 64 bit Dispatch ID of the
				1927	(enable_sgpr_dispatch_id) dispatch packet being
				1928	executed.
				1929	then Flat Scratch Init 2 This is 2 SGPRs:
				1930	(enable_sgpr_flat_scratch
				1931	_init) GFX6
				1932	Not supported.
				1933	GFX7-GFX8
				1934	The first SGPR is a 32 bit
				1935	byte offset from
				1936	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				1937	to per SPI base of memory
				1938	for scratch for the queue
				1939	executing the kernel
				1940	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1941	from the runtime. (The
				1942	Scratch Segment Buffer base
				1943	address is
				1944	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				1945	plus this offset.) The value
				1946	of Scratch Wave Offset must
				1947	be added to this offset by
				1948	the kernel machine code,
				1949	right shifted by 8, and
				1950	moved to the FLAT_SCRATCH_HI
				1951	SGPR register.
				1952	FLAT_SCRATCH_HI corresponds
				1953	to SGPRn-4 on GFX7, and
				1954	SGPRn-6 on GFX8 (where SGPRn
				1955	is the highest numbered SGPR
				1956	allocated to the wave).
				1957	FLAT_SCRATCH_HI is
				1958	multiplied by 256 (as it is
				1959	in units of 256 bytes) and
				1960	added to
				1961	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				1962	to calculate the per wave
				1963	FLAT SCRATCH BASE in flat
				1964	memory instructions that
				1965	access the scratch
				1966	apperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1967
				1968	The second SGPR is 32 bit
				1969	byte size of a single
				1970	work-item’s scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1971	usage. CP obtains this from
				1972	the runtime, and it is
				1973	always a multiple of DWORD.
				1974	CP checks that the value in
				1975	the kernel dispatch packet
				1976	Private Segment Byte Size is
				1977	not larger, and requests the
				1978	runtime to increase the
				1979	queue's scratch size if
				1980	necessary. The kernel code
				1981	must move it to
				1982	FLAT_SCRATCH_LO which is
				1983	SGPRn-3 on GFX7 and SGPRn-5
				1984	on GFX8. FLAT_SCRATCH_LO is
				1985	used as the FLAT SCRATCH
				1986	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1987	instructions. Having CP load
				1988	it once avoids loading it at
				1989	the beginning of every
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1990	wavefront. GFX9 This is the
				1991	64 bit base address of the
				1992	per SPI scratch backing
				1993	memory managed by SPI for
				1994	the queue executing the
				1995	kernel dispatch. CP obtains
				1996	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1997	divides it if there are
				1998	multiple Shader Arrays each
				1999	with its own SPI). The value
				2000	of Scratch Wave Offset must
				2001	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2002	machine code and the result
				2003	moved to the FLAT_SCRATCH
				2004	SGPR which is SGPRn-6 and
				2005	SGPRn-5. It is used as the
				2006	FLAT SCRATCH BASE in flat
				2007	memory instructions. then
				2008	Private Segment Size 1 The
				2009	32 bit byte size of a
				2010	(enable_sgpr_private single
				2011	work-item's
				2012	scratch_segment_size) memory
				2013	allocation. This is the
				2014	value from the kernel
				2015	dispatch packet Private
				2016	Segment Byte Size rounded up
				2017	by CP to a multiple of
				2018	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2019
				2020	Having CP load it once avoids
				2021	loading it at the beginning of
				2022	every wavefront.
				2023
				2024	This is not used for
				2025	GFX7-GFX8 since it is the same
				2026	value as the second SGPR of
				2027	Flat Scratch Init. However, it
				2028	may be needed for GFX9 which
				2029	changes the meaning of the
				2030	Flat Scratch Init value.
				2031	then Grid Work-Group Count X 1 32 bit count of the number of
				2032	(enable_sgpr_grid work-groups in the X dimension
				2033	_workgroup_count_X) for the grid being
				2034	executed. Computed from the
				2035	fields in the kernel dispatch
				2036	packet as ((grid_size.x +
				2037	workgroup_size.x - 1) /
				2038	workgroup_size.x).
				2039	then Grid Work-Group Count Y 1 32 bit count of the number of
				2040	(enable_sgpr_grid work-groups in the Y dimension
				2041	_workgroup_count_Y && for the grid being
				2042	less than 16 previous executed. Computed from the
				2043	SGPRs) fields in the kernel dispatch
				2044	packet as ((grid_size.y +
				2045	workgroup_size.y - 1) /
				2046	workgroupSize.y).
				2047
				2048	Only initialized if <16
				2049	previous SGPRs initialized.
				2050	then Grid Work-Group Count Z 1 32 bit count of the number of
				2051	(enable_sgpr_grid work-groups in the Z dimension
				2052	_workgroup_count_Z && for the grid being
				2053	less than 16 previous executed. Computed from the
				2054	SGPRs) fields in the kernel dispatch
				2055	packet as ((grid_size.z +
				2056	workgroup_size.z - 1) /
				2057	workgroupSize.z).
				2058
				2059	Only initialized if <16
				2060	previous SGPRs initialized.
				2061	then Work-Group Id X 1 32 bit work-group id in X
				2062	(enable_sgpr_workgroup_id dimension of grid for
				2063	_X) wavefront.
				2064	then Work-Group Id Y 1 32 bit work-group id in Y
				2065	(enable_sgpr_workgroup_id dimension of grid for
				2066	_Y) wavefront.
				2067	then Work-Group Id Z 1 32 bit work-group id in Z
				2068	(enable_sgpr_workgroup_id dimension of grid for
				2069	_Z) wavefront.
				2070	then Work-Group Info 1 {first_wave, 14’b0000,
				2071	(enable_sgpr_workgroup ordered_append_term[10:0],
				2072	_info) threadgroup_size_in_waves[5:0]}
				2073	then Scratch Wave Offset 1 32 bit byte offset from base
				2074	(enable_sgpr_private of scratch base of queue
				2075	_segment_wave_offset) executing the kernel
				2076	dispatch. Must be used as an
				2077	offset with Private
				2078	segment address when using
				2079	Scratch Segment Buffer. It
				2080	must be used to set up FLAT
				2081	SCRATCH for flat addressing
				2082	(see
				2083	:ref:`amdgpu-amdhsa-flat-scratch`).
				2084	========== ========================== ====== ==============================
				2085
				2086	The order of the VGPR registers is defined, but the compiler can specify which
				2087	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2088	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2089	for enabled registers are dense starting at VGPR0: the first enabled register is
				2090	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2091	VGPR number.
				2092
				2093	VGPR register initial state is defined in
				2094	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2095
				2096	.. table:: VGPR Register Set Up Order
				2097	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2098
				2099	========== ========================== ====== ==============================
				2100	VGPR Order Name Number Description
				2101	(kernel descriptor enable of
				2102	field) VGPRs
				2103	========== ========================== ====== ==============================
				2104	First Work-Item Id X 1 32 bit work item id in X
				2105	(Always initialized) dimension of work-group for
				2106	wavefront lane.
				2107	then Work-Item Id Y 1 32 bit work item id in Y
				2108	(enable_vgpr_workitem_id dimension of work-group for
				2109	> 0) wavefront lane.
				2110	then Work-Item Id Z 1 32 bit work item id in Z
				2111	(enable_vgpr_workitem_id dimension of work-group for
				2112	> 1) wavefront lane.
				2113	========== ========================== ====== ==============================
				2114
				2115	The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
				2116
				2117	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2118	registers.
				2119	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2120	combination including none.
				2121	3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
				2122	cannot included with the flat scratch init value which is per queue.
				2123	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2124	or (X, Y, Z).
				2125
				2126	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2127	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2128
				2129	The global segment can be accessed either using buffer instructions (GFX6 which
				2130	has V# 64 bit address support), flat instructions (GFX7-9), or global
				2131	instructions (GFX9).
				2132
				2133	If buffer operations are used then the compiler can generate a V# with the
				2134	following properties:
				2135
				2136	* base address of 0
				2137	* no swizzle
				2138	* ATC: 1 if IOMMU present (such as APU)
				2139	* ptr64: 1
				2140	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2141	APU and NC for dGPU).
				2142
				2143	.. _amdgpu-amdhsa-kernel-prolog:
				2144
				2145	Kernel Prolog
				2146	~~~~~~~~~~~~~
				2147
				2148	.. _amdgpu-amdhsa-m0:
				2149
				2150	M0
				2151	++
				2152
				2153	GFX6-GFX8
				2154	The M0 register must be initialized with a value at least the total LDS size
				2155	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2156	available in dispatch packet. For M0, it is also possible to use maximum
				2157	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2158	GFX7-GFX8).
				2159	GFX9
				2160	The M0 register is not used for range checking LDS accesses and so does not
				2161	need to be initialized in the prolog.
				2162
				2163	.. _amdgpu-amdhsa-flat-scratch:
				2164
				2165	Flat Scratch
				2166	++++++++++++
				2167
				2168	If the kernel may use flat operations to access scratch memory, the prolog code
				2169	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
				2170	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
				2171	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2172
				2173	GFX6
				2174	Flat scratch is not supported.
				2175
				2176	GFX7-8
				2177	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2178	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2179	being managed by SPI for the queue executing the kernel dispatch. This is
				2180	the same value used in the Scratch Segment Buffer V# base address. The
				2181	prolog must add the value of Scratch Wave Offset to get the wave's byte
				2182	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2183	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2184	by 8 before moving into FLAT_SCRATCH_LO.
				2185	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2186	work-items scratch memory usage. This is directly loaded from the kernel
				2187	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2188	DWORD. Having CP load it once avoids loading it at the beginning of every
				2189	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2190	SIZE.
				2191	GFX9
				2192	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2193	memory being managed by SPI for the queue executing the kernel dispatch. The
				2194	prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
				2195	pair for use as the flat scratch base in flat memory instructions.
				2196
				2197	.. _amdgpu-amdhsa-memory-model:
				2198
				2199	Memory Model
				2200	~~~~~~~~~~~~
				2201
				2202	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2203	(see :ref:`memmodel`). The implementation is WIP.
				2204
				2205	.. TODO
				2206	Update when implementation complete.
				2207
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2208	Support more relaxed OpenCL memory model to be controlled by environment
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2209	component of target triple.
				2210
				2211	The AMDGPU backend supports the memory synchronization scopes specified in
				2212	:ref:`amdgpu-memory-scopes`.
				2213
				2214	The code sequences used to implement the memory model are defined in table
				2215	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2216
				2217	The sequences specify the order of instructions that a single thread must
				2218	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2219	to other memory instructions executed by the same thread. This allows them to be
				2220	moved earlier or later which can allow them to be combined with other instances
				2221	of the same instruction, or hoisted/sunk out of loops to improve
				2222	performance. Only the instructions related to the memory model are given;
				2223	additional ``s_waitcnt`` instructions are required to ensure registers are
				2224	defined before being used. These may be able to be combined with the memory
				2225	model ``s_waitcnt`` instructions as described above.
				2226
				2227	The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
				2228	OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
				2229	relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
				2230	memory model which has separate happens-before relations for the global and
				2231	local address spaces, and only a fence specifying both global and local address
				2232	space joins the relationships. Since the LLVM ``memfence`` instruction does not
				2233	allow an address space to be specified the OpenCL fence has to convervatively
				2234	assume both local and global address space was specified. However, optimizations
				2235	can often be done to eliminate the additional ``s_waitcnt``instructions when
				2236	there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
				2237	instructions. The code sequences in the table indicate what can be omitted for
				2238	the OpenCL memory. The target triple environment is used to determine if the
				2239	source language is OpenCL (see :ref:`amdgpu-opencl`).
				2240
				2241	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2242	operations.
				2243
				2244	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2245	termed vector memory operations.
				2246
				2247	For GFX6-GFX9:
				2248
				2249	* Each agent has multiple compute units (CU).
				2250	* Each CU has multiple SIMDs that execute wavefronts.
				2251	* The wavefronts for a single work-group are executed in the same CU but may be
				2252	executed by different SIMDs.
				2253	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2254	executing on it.
				2255	* All LDS operations of a CU are performed as wavefront wide operations in a
				2256	global order and involve no caching. Completion is reported to a wavefront in
				2257	execution order.
				2258	* The LDS memory has multiple request queues shared by the SIMDs of a
				2259	CU. Therefore, the LDS operations performed by different waves of a work-group
				2260	can be reordered relative to each other, which can result in reordering the
				2261	visibility of vector memory operations with respect to LDS operations of other
				2262	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2263	ensure synchronization between LDS operations and vector memory operations
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2264	between waves of a work-group, but not between operations performed by the
				2265	same wavefront.
				2266	* The vector memory operations are performed as wavefront wide operations and
				2267	completion is reported to a wavefront in execution order. The exception is
				2268	that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
				2269	vector memory order if they access LDS memory, and out of LDS operation order
				2270	if they access global memory.
				2271	* The vector memory operations access a vector L1 cache shared by all wavefronts
				2272	on a CU. Therefore, no special action is required for coherence between
				2273	wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
				2274	coherence between waves executing in different work-groups as they may be
				2275	executing on different CUs.
				2276	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2277	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2278	scalar operations are used in a restricted way so do not impact the memory
				2279	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2280	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2281	the same agent.
				2282	* The L2 cache has independent channels to service disjoint ranges of virtual
				2283	addresses.
				2284	* Each CU has a separate request queue per channel. Therefore, the vector and
				2285	scalar memory operations performed by waves executing in different work-groups
				2286	(which may be executing on different CUs) of an agent can be reordered
				2287	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2288	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2289	previous vector memory operation has completed before executing a subsequent
				2290	vector memory or LDS operation and so can be used to meet the requirements of
				2291	acquire and release.
				2292	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2293	of virtual addresses can be set up to bypass it to ensure system coherence.
				2294
				2295	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
				2296	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2297	memory, atomic memory orderings are not meaningful and all accesses are treated
				2298	as non-atomic.
				2299
				2300	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2301	scalar memory instructions). Since the constant address space contents do not
				2302	change during the execution of a kernel dispatch it is not legal to perform
				2303	stores, and atomic memory orderings are not meaningful and all access are
				2304	treated as non-atomic.
				2305
				2306	A memory synchronization scope wider than work-group is not meaningful for the
				2307	group (LDS) address space and is treated as work-group.
				2308
				2309	The memory model does not support the region address space which is treated as
				2310	non-atomic.
				2311
				2312	Acquire memory ordering is not meaningful on store atomic instructions and is
				2313	treated as non-atomic.
				2314
				2315	Release memory ordering is not meaningful on load atomic instructions and is
				2316	treated a non-atomic.
				2317
				2318	Acquire-release memory ordering is not meaningful on load or store atomic
				2319	instructions and is treated as acquire and release respectively.
				2320
				2321	AMDGPU backend only uses scalar memory operations to access memory that is
				2322	proven to not change during the execution of the kernel dispatch. This includes
				2323	constant address space and global address space for program scope const
				2324	variables. Therefore the kernel machine code does not have to maintain the
				2325	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2326	and vector L1 caches are invalidated between kernel dispatches by CP since
				2327	constant address space data may change between kernel dispatch executions. See
				2328	:ref:`amdgpu-amdhsa-memory-spaces`.
				2329
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2330	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2331	case the AMDGPU backend ensures the memory location used to spill is never
				2332	accessed by vector memory operations at the same time. If scalar writes are used
				2333	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2334	return since the locations may be used for vector memory instructions by a
				2335	future wave that uses the same scratch area, or a function call that creates a
				2336	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2337	as all scalar writes are write-before-read in the same thread.
				2338
				2339	Scratch backing memory (which is used for the private address space) is accessed
				2340	with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
				2341	is only accessed by a single thread, and is always write-before-read,
				2342	there is never a need to invalidate these entries from the L1 cache. Hence all
				2343	cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
				2344	lines.
				2345
				2346	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
				2347	to invalidate the L2 cache. This also causes it to be treated as non-volatile
				2348	and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
				2349	coherent) and so the L2 cache will coherent with the CPU and other agents.
				2350
				2351	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2352	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2353
				2354	============ ============ ============== ========== =======================
				2355	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2356	Ordering Sync Scope Address
				2357	Space
				2358	============ ============ ============== ========== =======================
				2359	Non-Atomic
				2360	---------------------------------------------------------------------------
				2361	load none none - global non-volatile
				2362	- generic 1. buffer/global/flat_load
				2363	volatile
				2364	1. buffer/global/flat_load
				2365	glc=1
				2366	load none none - local 1. ds_load
				2367	store none none - global 1. buffer/global/flat_store
				2368	- generic
				2369	store none none - local 1. ds_store
				2370	Unordered Atomic
				2371	---------------------------------------------------------------------------
				2372	load atomic unordered any any Same as non-atomic.
				2373	store atomic unordered any any Same as non-atomic.
				2374	atomicrmw unordered any any *Same as monotonic
				2375	atomic*.
				2376	Monotonic Atomic
				2377	---------------------------------------------------------------------------
				2378	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2379	- wavefront - generic
				2380	- workgroup
				2381	load atomic monotonic - singlethread - local 1. ds_load
				2382	- wavefront
				2383	- workgroup
				2384	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2385	- system - generic glc=1
				2386	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2387	- wavefront - generic
				2388	- workgroup
				2389	- agent
				2390	- system
				2391	store atomic monotonic - singlethread - local 1. ds_store
				2392	- wavefront
				2393	- workgroup
				2394	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2395	- wavefront - generic
				2396	- workgroup
				2397	- agent
				2398	- system
				2399	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2400	- wavefront
				2401	- workgroup
				2402	Acquire Atomic
				2403	---------------------------------------------------------------------------
				2404	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2405	- wavefront - local
				2406	- generic
				2407	load atomic acquire - workgroup - global 1. buffer/global_load
				2408	load atomic acquire - workgroup - local 1. ds/flat_load
				2409	- generic 2. s_waitcnt lgkmcnt(0)
				2410
				2411	- If OpenCL, omit
				2412	waitcnt.
				2413	- Must happen before
				2414	any following
				2415	global/generic
				2416	load/load
				2417	atomic/store/store
				2418	atomic/atomicrmw.
				2419	- Ensures any
				2420	following global
				2421	data read is no
				2422	older than the load
				2423	atomic value being
				2424	acquired.
				2425
				2426	load atomic acquire - agent - global 1. buffer/global_load
				2427	- system glc=1
				2428	2. s_waitcnt vmcnt(0)
				2429
				2430	- Must happen before
				2431	following
				2432	buffer_wbinvl1_vol.
				2433	- Ensures the load
				2434	has completed
				2435	before invalidating
				2436	the cache.
				2437
				2438	3. buffer_wbinvl1_vol
				2439
				2440	- Must happen before
				2441	any following
				2442	global/generic
				2443	load/load
				2444	atomic/atomicrmw.
				2445	- Ensures that
				2446	following
				2447	loads will not see
				2448	stale global data.
				2449
				2450	load atomic acquire - agent - generic 1. flat_load glc=1
				2451	- system 2. s_waitcnt vmcnt(0) &
				2452	lgkmcnt(0)
				2453
				2454	- If OpenCL omit
				2455	lgkmcnt(0).
				2456	- Must happen before
				2457	following
				2458	buffer_wbinvl1_vol.
				2459	- Ensures the flat_load
				2460	has completed
				2461	before invalidating
				2462	the cache.
				2463
				2464	3. buffer_wbinvl1_vol
				2465
				2466	- Must happen before
				2467	any following
				2468	global/generic
				2469	load/load
				2470	atomic/atomicrmw.
				2471	- Ensures that
				2472	following loads
				2473	will not see stale
				2474	global data.
				2475
				2476	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2477	- wavefront - local
				2478	- generic
				2479	atomicrmw acquire - workgroup - global 1. buffer/global_atomic
				2480	atomicrmw acquire - workgroup - local 1. ds/flat_atomic
				2481	- generic 2. waitcnt lgkmcnt(0)
				2482
				2483	- If OpenCL, omit
				2484	waitcnt.
				2485	- Must happen before
				2486	any following
				2487	global/generic
				2488	load/load
				2489	atomic/store/store
				2490	atomic/atomicrmw.
				2491	- Ensures any
				2492	following global
				2493	data read is no
				2494	older than the
				2495	atomicrmw value
				2496	being acquired.
				2497
				2498	atomicrmw acquire - agent - global 1. buffer/global_atomic
				2499	- system 2. s_waitcnt vmcnt(0)
				2500
				2501	- Must happen before
				2502	following
				2503	buffer_wbinvl1_vol.
				2504	- Ensures the
				2505	atomicrmw has
				2506	completed before
				2507	invalidating the
				2508	cache.
				2509
				2510	3. buffer_wbinvl1_vol
				2511
				2512	- Must happen before
				2513	any following
				2514	global/generic
				2515	load/load
				2516	atomic/atomicrmw.
				2517	- Ensures that
				2518	following loads
				2519	will not see stale
				2520	global data.
				2521
				2522	atomicrmw acquire - agent - generic 1. flat_atomic
				2523	- system 2. s_waitcnt vmcnt(0) &
				2524	lgkmcnt(0)
				2525
				2526	- If OpenCL, omit
				2527	lgkmcnt(0).
				2528	- Must happen before
				2529	following
				2530	buffer_wbinvl1_vol.
				2531	- Ensures the
				2532	atomicrmw has
				2533	completed before
				2534	invalidating the
				2535	cache.
				2536
				2537	3. buffer_wbinvl1_vol
				2538
				2539	- Must happen before
				2540	any following
				2541	global/generic
				2542	load/load
				2543	atomic/atomicrmw.
				2544	- Ensures that
				2545	following loads
				2546	will not see stale
				2547	global data.
				2548
				2549	fence acquire - singlethread none none
				2550	- wavefront
				2551	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2552
				2553	- If OpenCL and
				2554	address space is
				2555	not generic, omit
				2556	waitcnt. However,
				2557	since LLVM
				2558	currently has no
				2559	address space on
				2560	the fence need to
				2561	conservatively
				2562	always generate. If
				2563	fence had an
				2564	address space then
				2565	set to address
				2566	space of OpenCL
				2567	fence flag, or to
				2568	generic if both
				2569	local and global
				2570	flags are
				2571	specified.
				2572	- Must happen after
				2573	any preceding
				2574	local/generic load
				2575	atomic/atomicrmw
				2576	with an equal or
				2577	wider sync scope
				2578	and memory ordering
				2579	stronger than
				2580	unordered (this is
				2581	termed the
				2582	fence-paired-atomic).
				2583	- Must happen before
				2584	any following
				2585	global/generic
				2586	load/load
				2587	atomic/store/store
				2588	atomic/atomicrmw.
				2589	- Ensures any
				2590	following global
				2591	data read is no
				2592	older than the
				2593	value read by the
				2594	fence-paired-atomic.
				2595
				2596	fence acquire - agent none 1. s_waitcnt vmcnt(0) &
				2597	- system lgkmcnt(0)
				2598
				2599	- If OpenCL and
				2600	address space is
				2601	not generic, omit
				2602	lgkmcnt(0).
				2603	However, since LLVM
				2604	currently has no
				2605	address space on
				2606	the fence need to
				2607	conservatively
				2608	always generate
				2609	(see comment for
				2610	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2611	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2612	separate s_waitcnt
				2613	vmcnt(0) and
				2614	s_waitcnt
				2615	lgkmcnt(0) to allow
				2616	them to be
				2617	independently moved
				2618	according to the
				2619	following rules.
				2620	- s_waitcnt vmcnt(0)
				2621	must happen after
				2622	any preceding
				2623	global/generic load
				2624	atomic/atomicrmw
				2625	with an equal or
				2626	wider sync scope
				2627	and memory ordering
				2628	stronger than
				2629	unordered (this is
				2630	termed the
				2631	fence-paired-atomic).
				2632	- s_waitcnt lgkmcnt(0)
				2633	must happen after
				2634	any preceding
				2635	group/generic load
				2636	atomic/atomicrmw
				2637	with an equal or
				2638	wider sync scope
				2639	and memory ordering
				2640	stronger than
				2641	unordered (this is
				2642	termed the
				2643	fence-paired-atomic).
				2644	- Must happen before
				2645	the following
				2646	buffer_wbinvl1_vol.
				2647	- Ensures that the
				2648	fence-paired atomic
				2649	has completed
				2650	before invalidating
				2651	the
				2652	cache. Therefore
				2653	any following
				2654	locations read must
				2655	be no older than
				2656	the value read by
				2657	the
				2658	fence-paired-atomic.
				2659
				2660	2. buffer_wbinvl1_vol
				2661
				2662	- Must happen before
				2663	any following global/generic
				2664	load/load
				2665	atomic/store/store
				2666	atomic/atomicrmw.
				2667	- Ensures that
				2668	following loads
				2669	will not see stale
				2670	global data.
				2671
				2672	Release Atomic
				2673	---------------------------------------------------------------------------
				2674	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2675	- wavefront - local
				2676	- generic
				2677	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
				2678	- generic
				2679	- If OpenCL, omit
				2680	waitcnt.
				2681	- Must happen after
				2682	any preceding
				2683	local/generic
				2684	load/store/load
				2685	atomic/store
				2686	atomic/atomicrmw.
				2687	- Must happen before
				2688	the following
				2689	store.
				2690	- Ensures that all
				2691	memory operations
				2692	to local have
				2693	completed before
				2694	performing the
				2695	store that is being
				2696	released.
				2697
				2698	2. buffer/global/flat_store
				2699	store atomic release - workgroup - local 1. ds_store
				2700	store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
				2701	- system - generic lgkmcnt(0)
				2702
				2703	- If OpenCL, omit
				2704	lgkmcnt(0).
				2705	- Could be split into
				2706	separate s_waitcnt
				2707	vmcnt(0) and
				2708	s_waitcnt
				2709	lgkmcnt(0) to allow
				2710	them to be
				2711	independently moved
				2712	according to the
				2713	following rules.
				2714	- s_waitcnt vmcnt(0)
				2715	must happen after
				2716	any preceding
				2717	global/generic
				2718	load/store/load
				2719	atomic/store
				2720	atomic/atomicrmw.
				2721	- s_waitcnt lgkmcnt(0)
				2722	must happen after
				2723	any preceding
				2724	local/generic
				2725	load/store/load
				2726	atomic/store
				2727	atomic/atomicrmw.
				2728	- Must happen before
				2729	the following
				2730	store.
				2731	- Ensures that all
				2732	memory operations
				2733	to global have
				2734	completed before
				2735	performing the
				2736	store that is being
				2737	released.
				2738
				2739	2. buffer/global/ds/flat_store
				2740	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2741	- wavefront - local
				2742	- generic
				2743	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
				2744	- generic
				2745	- If OpenCL, omit
				2746	waitcnt.
				2747	- Must happen after
				2748	any preceding
				2749	local/generic
				2750	load/store/load
				2751	atomic/store
				2752	atomic/atomicrmw.
				2753	- Must happen before
				2754	the following
				2755	atomicrmw.
				2756	- Ensures that all
				2757	memory operations
				2758	to local have
				2759	completed before
				2760	performing the
				2761	atomicrmw that is
				2762	being released.
				2763
				2764	2. buffer/global/flat_atomic
				2765	atomicrmw release - workgroup - local 1. ds_atomic
				2766	atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
				2767	- system - generic lgkmcnt(0)
				2768
				2769	- If OpenCL, omit
				2770	lgkmcnt(0).
				2771	- Could be split into
				2772	separate s_waitcnt
				2773	vmcnt(0) and
				2774	s_waitcnt
				2775	lgkmcnt(0) to allow
				2776	them to be
				2777	independently moved
				2778	according to the
				2779	following rules.
				2780	- s_waitcnt vmcnt(0)
				2781	must happen after
				2782	any preceding
				2783	global/generic
				2784	load/store/load
				2785	atomic/store
				2786	atomic/atomicrmw.
				2787	- s_waitcnt lgkmcnt(0)
				2788	must happen after
				2789	any preceding
				2790	local/generic
				2791	load/store/load
				2792	atomic/store
				2793	atomic/atomicrmw.
				2794	- Must happen before
				2795	the following
				2796	atomicrmw.
				2797	- Ensures that all
				2798	memory operations
				2799	to global and local
				2800	have completed
				2801	before performing
				2802	the atomicrmw that
				2803	is being released.
				2804
				2805	2. buffer/global/ds/flat_atomic*
				2806	fence release - singlethread none none
				2807	- wavefront
				2808	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				2809
				2810	- If OpenCL and
				2811	address space is
				2812	not generic, omit
				2813	waitcnt. However,
				2814	since LLVM
				2815	currently has no
				2816	address space on
				2817	the fence need to
				2818	conservatively
				2819	always generate
				2820	(see comment for
				2821	previous fence).
				2822	- Must happen after
				2823	any preceding
				2824	local/generic
				2825	load/load
				2826	atomic/store/store
				2827	atomic/atomicrmw.
				2828	- Must happen before
				2829	any following store
				2830	atomic/atomicrmw
				2831	with an equal or
				2832	wider sync scope
				2833	and memory ordering
				2834	stronger than
				2835	unordered (this is
				2836	termed the
				2837	fence-paired-atomic).
				2838	- Ensures that all
				2839	memory operations
				2840	to local have
				2841	completed before
				2842	performing the
				2843	following
				2844	fence-paired-atomic.
				2845
				2846	fence release - agent none 1. s_waitcnt vmcnt(0) &
				2847	- system lgkmcnt(0)
				2848
				2849	- If OpenCL and
				2850	address space is
				2851	not generic, omit
				2852	lgkmcnt(0).
				2853	However, since LLVM
				2854	currently has no
				2855	address space on
				2856	the fence need to
				2857	conservatively
				2858	always generate
				2859	(see comment for
				2860	previous fence).
				2861	- Could be split into
				2862	separate s_waitcnt
				2863	vmcnt(0) and
				2864	s_waitcnt
				2865	lgkmcnt(0) to allow
				2866	them to be
				2867	independently moved
				2868	according to the
				2869	following rules.
				2870	- s_waitcnt vmcnt(0)
				2871	must happen after
				2872	any preceding
				2873	global/generic
				2874	load/store/load
				2875	atomic/store
				2876	atomic/atomicrmw.
				2877	- s_waitcnt lgkmcnt(0)
				2878	must happen after
				2879	any preceding
				2880	local/generic
				2881	load/store/load
				2882	atomic/store
				2883	atomic/atomicrmw.
				2884	- Must happen before
				2885	any following store
				2886	atomic/atomicrmw
				2887	with an equal or
				2888	wider sync scope
				2889	and memory ordering
				2890	stronger than
				2891	unordered (this is
				2892	termed the
				2893	fence-paired-atomic).
				2894	- Ensures that all
				2895	memory operations
				2896	to global have
				2897	completed before
				2898	performing the
				2899	following
				2900	fence-paired-atomic.
				2901
				2902	Acquire-Release Atomic
				2903	---------------------------------------------------------------------------
				2904	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				2905	- wavefront - local
				2906	- generic
				2907	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				2908
				2909	- If OpenCL, omit
				2910	waitcnt.
				2911	- Must happen after
				2912	any preceding
				2913	local/generic
				2914	load/store/load
				2915	atomic/store
				2916	atomic/atomicrmw.
				2917	- Must happen before
				2918	the following
				2919	atomicrmw.
				2920	- Ensures that all
				2921	memory operations
				2922	to local have
				2923	completed before
				2924	performing the
				2925	atomicrmw that is
				2926	being released.
				2927
				2928	2. buffer/global_atomic
				2929	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				2930	2. s_waitcnt lgkmcnt(0)
				2931
				2932	- If OpenCL, omit
				2933	waitcnt.
				2934	- Must happen before
				2935	any following
				2936	global/generic
				2937	load/load
				2938	atomic/store/store
				2939	atomic/atomicrmw.
				2940	- Ensures any
				2941	following global
				2942	data read is no
				2943	older than the load
				2944	atomic value being
				2945	acquired.
				2946
				2947	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2948
				2949	- If OpenCL, omit
				2950	waitcnt.
				2951	- Must happen after
				2952	any preceding
				2953	local/generic
				2954	load/store/load
				2955	atomic/store
				2956	atomic/atomicrmw.
				2957	- Must happen before
				2958	the following
				2959	atomicrmw.
				2960	- Ensures that all
				2961	memory operations
				2962	to local have
				2963	completed before
				2964	performing the
				2965	atomicrmw that is
				2966	being released.
				2967
				2968	2. flat_atomic
				2969	3. s_waitcnt lgkmcnt(0)
				2970
				2971	- If OpenCL, omit
				2972	waitcnt.
				2973	- Must happen before
				2974	any following
				2975	global/generic
				2976	load/load
				2977	atomic/store/store
				2978	atomic/atomicrmw.
				2979	- Ensures any
				2980	following global
				2981	data read is no
				2982	older than the load
				2983	atomic value being
				2984	acquired.
				2985	atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
				2986	- system lgkmcnt(0)
				2987
				2988	- If OpenCL, omit
				2989	lgkmcnt(0).
				2990	- Could be split into
				2991	separate s_waitcnt
				2992	vmcnt(0) and
				2993	s_waitcnt
				2994	lgkmcnt(0) to allow
				2995	them to be
				2996	independently moved
				2997	according to the
				2998	following rules.
				2999	- s_waitcnt vmcnt(0)
				3000	must happen after
				3001	any preceding
				3002	global/generic
				3003	load/store/load
				3004	atomic/store
				3005	atomic/atomicrmw.
				3006	- s_waitcnt lgkmcnt(0)
				3007	must happen after
				3008	any preceding
				3009	local/generic
				3010	load/store/load
				3011	atomic/store
				3012	atomic/atomicrmw.
				3013	- Must happen before
				3014	the following
				3015	atomicrmw.
				3016	- Ensures that all
				3017	memory operations
				3018	to global have
				3019	completed before
				3020	performing the
				3021	atomicrmw that is
				3022	being released.
				3023
				3024	2. buffer/global_atomic
				3025	3. s_waitcnt vmcnt(0)
				3026
				3027	- Must happen before
				3028	following
				3029	buffer_wbinvl1_vol.
				3030	- Ensures the
				3031	atomicrmw has
				3032	completed before
				3033	invalidating the
				3034	cache.
				3035
				3036	4. buffer_wbinvl1_vol
				3037
				3038	- Must happen before
				3039	any following
				3040	global/generic
				3041	load/load
				3042	atomic/atomicrmw.
				3043	- Ensures that
				3044	following loads
				3045	will not see stale
				3046	global data.
				3047
				3048	atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
				3049	- system lgkmcnt(0)
				3050
				3051	- If OpenCL, omit
				3052	lgkmcnt(0).
				3053	- Could be split into
				3054	separate s_waitcnt
				3055	vmcnt(0) and
				3056	s_waitcnt
				3057	lgkmcnt(0) to allow
				3058	them to be
				3059	independently moved
				3060	according to the
				3061	following rules.
				3062	- s_waitcnt vmcnt(0)
				3063	must happen after
				3064	any preceding
				3065	global/generic
				3066	load/store/load
				3067	atomic/store
				3068	atomic/atomicrmw.
				3069	- s_waitcnt lgkmcnt(0)
				3070	must happen after
				3071	any preceding
				3072	local/generic
				3073	load/store/load
				3074	atomic/store
				3075	atomic/atomicrmw.
				3076	- Must happen before
				3077	the following
				3078	atomicrmw.
				3079	- Ensures that all
				3080	memory operations
				3081	to global have
				3082	completed before
				3083	performing the
				3084	atomicrmw that is
				3085	being released.
				3086
				3087	2. flat_atomic
				3088	3. s_waitcnt vmcnt(0) &
				3089	lgkmcnt(0)
				3090
				3091	- If OpenCL, omit
				3092	lgkmcnt(0).
				3093	- Must happen before
				3094	following
				3095	buffer_wbinvl1_vol.
				3096	- Ensures the
				3097	atomicrmw has
				3098	completed before
				3099	invalidating the
				3100	cache.
				3101
				3102	4. buffer_wbinvl1_vol
				3103
				3104	- Must happen before
				3105	any following
				3106	global/generic
				3107	load/load
				3108	atomic/atomicrmw.
				3109	- Ensures that
				3110	following loads
				3111	will not see stale
				3112	global data.
				3113
				3114	fence acq_rel - singlethread none none
				3115	- wavefront
				3116	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3117
				3118	- If OpenCL and
				3119	address space is
				3120	not generic, omit
				3121	waitcnt. However,
				3122	since LLVM
				3123	currently has no
				3124	address space on
				3125	the fence need to
				3126	conservatively
				3127	always generate
				3128	(see comment for
				3129	previous fence).
				3130	- Must happen after
				3131	any preceding
				3132	local/generic
				3133	load/load
				3134	atomic/store/store
				3135	atomic/atomicrmw.
				3136	- Must happen before
				3137	any following
				3138	global/generic
				3139	load/load
				3140	atomic/store/store
				3141	atomic/atomicrmw.
				3142	- Ensures that all
				3143	memory operations
				3144	to local have
				3145	completed before
				3146	performing any
				3147	following global
				3148	memory operations.
				3149	- Ensures that the
				3150	preceding
				3151	local/generic load
				3152	atomic/atomicrmw
				3153	with an equal or
				3154	wider sync scope
				3155	and memory ordering
				3156	stronger than
				3157	unordered (this is
				3158	termed the
				3159	fence-paired-atomic)
				3160	has completed
				3161	before following
				3162	global memory
				3163	operations. This
				3164	satisfies the
				3165	requirements of
				3166	acquire.
				3167	- Ensures that all
				3168	previous memory
				3169	operations have
				3170	completed before a
				3171	following
				3172	local/generic store
				3173	atomic/atomicrmw
				3174	with an equal or
				3175	wider sync scope
				3176	and memory ordering
				3177	stronger than
				3178	unordered (this is
				3179	termed the
				3180	fence-paired-atomic).
				3181	This satisfies the
				3182	requirements of
				3183	release.
				3184
				3185	fence acq_rel - agent none 1. s_waitcnt vmcnt(0) &
				3186	- system lgkmcnt(0)
				3187
				3188	- If OpenCL and
				3189	address space is
				3190	not generic, omit
				3191	lgkmcnt(0).
				3192	However, since LLVM
				3193	currently has no
				3194	address space on
				3195	the fence need to
				3196	conservatively
				3197	always generate
				3198	(see comment for
				3199	previous fence).
				3200	- Could be split into
				3201	separate s_waitcnt
				3202	vmcnt(0) and
				3203	s_waitcnt
				3204	lgkmcnt(0) to allow
				3205	them to be
				3206	independently moved
				3207	according to the
				3208	following rules.
				3209	- s_waitcnt vmcnt(0)
				3210	must happen after
				3211	any preceding
				3212	global/generic
				3213	load/store/load
				3214	atomic/store
				3215	atomic/atomicrmw.
				3216	- s_waitcnt lgkmcnt(0)
				3217	must happen after
				3218	any preceding
				3219	local/generic
				3220	load/store/load
				3221	atomic/store
				3222	atomic/atomicrmw.
				3223	- Must happen before
				3224	the following
				3225	buffer_wbinvl1_vol.
				3226	- Ensures that the
				3227	preceding
				3228	global/local/generic
				3229	load
				3230	atomic/atomicrmw
				3231	with an equal or
				3232	wider sync scope
				3233	and memory ordering
				3234	stronger than
				3235	unordered (this is
				3236	termed the
				3237	fence-paired-atomic)
				3238	has completed
				3239	before invalidating
				3240	the cache. This
				3241	satisfies the
				3242	requirements of
				3243	acquire.
				3244	- Ensures that all
				3245	previous memory
				3246	operations have
				3247	completed before a
				3248	following
				3249	global/local/generic
				3250	store
				3251	atomic/atomicrmw
				3252	with an equal or
				3253	wider sync scope
				3254	and memory ordering
				3255	stronger than
				3256	unordered (this is
				3257	termed the
				3258	fence-paired-atomic).
				3259	This satisfies the
				3260	requirements of
				3261	release.
				3262
				3263	2. buffer_wbinvl1_vol
				3264
				3265	- Must happen before
				3266	any following
				3267	global/generic
				3268	load/load
				3269	atomic/store/store
				3270	atomic/atomicrmw.
				3271	- Ensures that
				3272	following loads
				3273	will not see stale
				3274	global data. This
				3275	satisfies the
				3276	requirements of
				3277	acquire.
				3278
				3279	Sequential Consistent Atomic
				3280	---------------------------------------------------------------------------
				3281	load atomic seq_cst - singlethread - global *Same as corresponding
				3282	- wavefront - local load atomic acquire*.
				3283	- workgroup - generic
				3284	load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
				3285	- system - local
				3286	- generic - Must happen after
				3287	preceding
				3288	global/generic load
				3289	atomic/store
				3290	atomic/atomicrmw
				3291	with memory
				3292	ordering of seq_cst
				3293	and with equal or
				3294	wider sync scope.
				3295	(Note that seq_cst
				3296	fences have their
				3297	own s_waitcnt
				3298	vmcnt(0) and so do
				3299	not need to be
				3300	considered.)
				3301	- Ensures any
				3302	preceding
				3303	sequential
				3304	consistent global
				3305	memory instructions
				3306	have completed
				3307	before executing
				3308	this sequentially
				3309	consistent
				3310	instruction. This
				3311	prevents reordering
				3312	a seq_cst store
				3313	followed by a
				3314	seq_cst load (Note
				3315	that seq_cst is
				3316	stronger than
				3317	acquire/release as
				3318	the reordering of
				3319	load acquire
				3320	followed by a store
				3321	release is
				3322	prevented by the
				3323	waitcnt vmcnt(0) of
				3324	the release, but
				3325	there is nothing
				3326	preventing a store
				3327	release followed by
				3328	load acquire from
				3329	competing out of
				3330	order.)
				3331
				3332	2. *Following
				3333	instructions same as
				3334	corresponding load
				3335	atomic acquire*.
				3336
				3337	store atomic seq_cst - singlethread - global *Same as corresponding
				3338	- wavefront - local store atomic release*.
				3339	- workgroup - generic
				3340	store atomic seq_cst - agent - global *Same as corresponding
				3341	- system - generic store atomic release*.
				3342	atomicrmw seq_cst - singlethread - global *Same as corresponding
				3343	- wavefront - local atomicrmw acq_rel*.
				3344	- workgroup - generic
				3345	atomicrmw seq_cst - agent - global *Same as corresponding
				3346	- system - generic atomicrmw acq_rel*.
				3347	fence seq_cst - singlethread none *Same as corresponding
				3348	- wavefront fence acq_rel*.
				3349	- workgroup
				3350	- agent
				3351	- system
				3352	============ ============ ============== ========== =======================
				3353
				3354	The memory order also adds the single thread optimization constrains defined in
				3355	table
				3356	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3357
				3358	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3359	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3360
				3361	============ ==============================================================
				3362	LLVM Memory Optimization Constraints
				3363	Ordering
				3364	============ ==============================================================
				3365	unordered none
				3366	monotonic none
				3367	acquire - If a load atomic/atomicrmw then no following load/load
				3368	atomic/store/ store atomic/atomicrmw/fence instruction can
				3369	be moved before the acquire.
				3370	- If a fence then same as load atomic, plus no preceding
				3371	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3372	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3373	atomic/store/ store atomic/atomicrmw/fence instruction can
				3374	be moved after the release.
				3375	- If a fence then same as store atomic, plus no following
				3376	associated fence-paired-atomic can be moved before the
				3377	fence.
				3378	acq_rel Same constraints as both acquire and release.
				3379	seq_cst - If a load atomic then same constraints as acquire, plus no
				3380	preceding sequentially consistent load atomic/store
				3381	atomic/atomicrmw/fence instruction can be moved after the
				3382	seq_cst.
				3383	- If a store atomic then the same constraints as release, plus
				3384	no following sequentially consistent load atomic/store
				3385	atomic/atomicrmw/fence instruction can be moved before the
				3386	seq_cst.
				3387	- If an atomicrmw/fence then same constraints as acq_rel.
				3388	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3389
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3390	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3391	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3392
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3393	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3394	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3395	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3396
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3397	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3398	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3399
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3400	=================== =============== =============== =======================
				3401	Usage Code Sequence Trap Handler Description
				3402	Inputs
				3403	=================== =============== =============== =======================
				3404	reserved ``s_trap 0x00`` Reserved by hardware.
				3405	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3406	``queue_ptr`` ``debugtrap``
				3407	``VGPR0``: intrinsic (not
				3408	``arg`` implemented).
				3409	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3410	``queue_ptr`` terminated and its
				3411	associated queue put
				3412	into the error state.
				3413	``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
				3414	``queue_ptr`` installed handled
				3415	same as ``llvm.trap``.
				3416	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
				3417	breakpoints.
				3418	debugger ``s_trap 0x08`` Reserved for debugger.
				3419	debugger ``s_trap 0xfe`` Reserved for debugger.
				3420	debugger ``s_trap 0xff`` Reserved for debugger.
				3421	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3422
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3423	Unspecified OS
				3424	--------------
				3425
				3426	This section provides code conventions used when the target triple OS is
				3427	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3428
				3429	Trap Handler ABI
				3430	~~~~~~~~~~~~~~~~
				3431
				3432	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3433	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3434	instructions are handled as follows:
				3435
				3436	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3437	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3438
				3439	=============== =============== ===========================================
				3440	Usage Code Sequence Description
				3441	=============== =============== ===========================================
				3442	llvm.trap s_endpgm Causes wavefront to be terminated.
				3443	llvm.debugtrap none Compiler warning given that there is no
				3444	trap handler installed.
				3445	=============== =============== ===========================================
				3446
				3447	Source Languages
				3448	================
				3449
				3450	.. _amdgpu-opencl:
				3451
				3452	OpenCL
				3453	------
				3454
				3455	When generating code for the OpenCL language the target triple environment
				3456	should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
				3457
				3458	When the language is OpenCL the following differences occur:
				3459
				3460	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3461	2. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3462	3. Additional metadata is generated
				3463	(:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3464
				3465	.. TODO
				3466	Specify what affect this has. Hidden arguments added. Additional metadata
				3467	generated.
				3468
				3469	.. _amdgpu-hcc:
				3470
				3471	HCC
				3472	---
				3473
				3474	When generating code for the OpenCL language the target triple environment
				3475	should be ``hcc`` (see :ref:`amdgpu-target-triples`).
				3476
				3477	When the language is OpenCL the following differences occur:
				3478
				3479	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3480
				3481	.. TODO
				3482	Specify what affect this has.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	3483
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3484	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3485	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3486
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3487	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3488	It supports AMDGCN GFX6-GFX8.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3489
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3490	This section describes general syntax for instructions and operands. For more
				3491	information about instructions, their semantics and supported combinations of
				3492	operands, refer to one of instruction set architecture manuals
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3493	[AMD-Souther-Islands]_, [AMD-Sea-Islands]_, [AMD-Volcanic-Islands]_ and
				3494	[AMD-Vega]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3495
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3496	An instruction has the following syntax (register operands are normally
				3497	comma-separated while extra operands are space-separated):
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3498
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3499	<opcode> <register_operand0>, ... <extra_operand0> ...
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3500
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3501	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3502	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3503
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3504	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3505
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3506	* SGPR registers: s0, ... or s[0], ...
				3507	* VGPR registers: v0, ... or v[0], ...
				3508	* TTMP registers: ttmp0, ... or ttmp[0], ...
				3509	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				3510	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				3511	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				3512	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				3513	* Register index expressions: v[2*2], s[1-1:2-1]
				3514	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3515
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3516	The following extra operands are supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3517
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3518	* offset, offset0, offset1
				3519	* idxen, offen bits
				3520	* glc, slc, tfe bits
				3521	* waitcnt: integer or combination of counter values
				3522	* VOP3 modifiers:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3523
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3524	- abs (\\| \\|), neg (\-)
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3525
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3526	* DPP modifiers:
				3527
				3528	- row_shl, row_shr, row_ror, row_rol
				3529	- row_mirror, row_half_mirror, row_bcast
				3530	- wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
				3531	- row_mask, bank_mask, bound_ctrl
				3532
				3533	* SDWA modifiers:
				3534
				3535	- dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
				3536	- dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
				3537	- abs, neg, sext
				3538
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3539	Instruction Examples
				3540	~~~~~~~~~~~~~~~~~~~~
				3541
				3542	DS
				3543	~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3544
				3545	.. code-block:: nasm
				3546
				3547	ds_add_u32 v2, v4 offset:16
				3548	ds_write_src2_b64 v2 offset0:4 offset1:8
				3549	ds_cmpst_f32 v2, v4, v6
				3550	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				3551
				3552
				3553	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				3554
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3555	FLAT
				3556	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3557
				3558	.. code-block:: nasm
				3559
				3560	flat_load_dword v1, v[3:4]
				3561	flat_store_dwordx3 v[3:4], v[5:7]
				3562	flat_atomic_swap v1, v[3:4], v5 glc
				3563	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				3564	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				3565
				3566	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				3567
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3568	MUBUF
				3569	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3570
				3571	.. code-block:: nasm
				3572
				3573	buffer_load_dword v1, off, s[4:7], s1
				3574	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				3575	buffer_store_format_xy v[1:2], off, s[4:7], s1
				3576	buffer_wbinvl1
				3577	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				3578
				3579	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				3580
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3581	SMRD/SMEM
				3582	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3583
				3584	.. code-block:: nasm
				3585
				3586	s_load_dword s1, s[2:3], 0xfc
				3587	s_load_dwordx8 s[8:15], s[2:3], s4
				3588	s_load_dwordx16 s[88:103], s[2:3], s4
				3589	s_dcache_inv_vol
				3590	s_memtime s[4:5]
				3591
				3592	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				3593
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3594	SOP1
				3595	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3596
				3597	.. code-block:: nasm
				3598
				3599	s_mov_b32 s1, s2
				3600	s_mov_b64 s[0:1], 0x80000000
				3601	s_cmov_b32 s1, 200
				3602	s_wqm_b64 s[2:3], s[4:5]
				3603	s_bcnt0_i32_b64 s1, s[2:3]
				3604	s_swappc_b64 s[2:3], s[4:5]
				3605	s_cbranch_join s[4:5]
				3606
				3607	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				3608
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3609	SOP2
				3610	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3611
				3612	.. code-block:: nasm
				3613
				3614	s_add_u32 s1, s2, s3
				3615	s_and_b64 s[2:3], s[4:5], s[6:7]
				3616	s_cselect_b32 s1, s2, s3
				3617	s_andn2_b32 s2, s4, s6
				3618	s_lshr_b64 s[2:3], s[4:5], s6
				3619	s_ashr_i32 s2, s4, s6
				3620	s_bfm_b64 s[2:3], s4, s6
				3621	s_bfe_i64 s[2:3], s[4:5], s6
				3622	s_cbranch_g_fork s[4:5], s[6:7]
				3623
				3624	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				3625
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3626	SOPC
				3627	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3628
				3629	.. code-block:: nasm
				3630
				3631	s_cmp_eq_i32 s1, s2
				3632	s_bitcmp1_b32 s1, s2
				3633	s_bitcmp0_b64 s[2:3], s4
				3634	s_setvskip s3, s5
				3635
				3636	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				3637
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3638	SOPP
				3639	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3640
				3641	.. code-block:: nasm
				3642
				3643	s_barrier
				3644	s_nop 2
				3645	s_endpgm
				3646	s_waitcnt 0 ; Wait for all counters to be 0
				3647	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				3648	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				3649	s_sethalt 9
				3650	s_sleep 10
				3651	s_sendmsg 0x1
				3652	s_sendmsg sendmsg(MSG_INTERRUPT)
				3653	s_trap 1
				3654
				3655	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				3656
				3657	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	3658	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3659	range or acceptable values.
				3660
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3661	VALU
				3662	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3663
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3664	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				3665	the assembler will automatically use optimal encoding based on its operands.
				3666	To force specific encoding, one can add a suffix to the opcode of the instruction:
				3667
				3668	* _e32 for 32-bit VOP1/VOP2/VOPC
				3669	* _e64 for 64-bit VOP3
				3670	* _dpp for VOP_DPP
				3671	* _sdwa for VOP_SDWA
				3672
				3673	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3674
				3675	.. code-block:: nasm
				3676
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3677	v_mov_b32 v1, v2
				3678	v_mov_b32_e32 v1, v2
				3679	v_nop
				3680	v_cvt_f64_i32_e32 v[1:2], v2
				3681	v_floor_f32_e32 v1, v2
				3682	v_bfrev_b32_e32 v1, v2
				3683	v_add_f32_e32 v1, v2, v3
				3684	v_mul_i32_i24_e64 v1, v2, 3
				3685	v_mul_i32_i24_e32 v1, -3, v3
				3686	v_mul_i32_i24_e32 v1, -100, v3
				3687	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				3688	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3689
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3690	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3691
				3692	.. code-block:: nasm
				3693
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3694	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				3695	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3696	v_mov_b32 v0, v0 wave_shl:1
				3697	v_mov_b32 v0, v0 row_mirror
				3698	v_mov_b32 v0, v0 row_bcast:31
				3699	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3700	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3701	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3702
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3703	VOP_SDWA examples:
				3704
				3705	.. code-block:: nasm
				3706
				3707	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				3708	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				3709	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				3710	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				3711	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				3712
				3713	For full list of supported instructions, refer to "Vector ALU instructions".
				3714
				3715	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3716	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3717
				3718	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				3719	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3720
				3721	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3722	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3723
				3724	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3725	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3726
				3727	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3728	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				3729
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3730
				3731	major, minor, and stepping are all integers that describe the instruction
				3732	set architecture (ISA) version of the assembly program.
				3733
				3734	vendor and arch are quoted strings. vendor should always be equal to
				3735	"AMD" and arch should always be equal to "AMDGPU".
				3736
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3737	By default, the assembler will derive the ISA version, vendor, and arch
				3738	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3739
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3740	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3741	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3742
				3743	This directives specifies that the symbol with given name is a kernel entry point
				3744	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3745
				3746	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3747	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3748
				3749	This directive marks the beginning of a list of key / value pairs that are used
				3750	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				3751	The list must be terminated by the .end_amd_kernel_code_t directive. For
				3752	any amd_kernel_code_t values that are unspecified a default value will be
				3753	used. The default value for all keys is 0, with the following exceptions:
				3754
				3755	- kernel_code_version_major defaults to 1.
				3756	- machine_kind defaults to 1.
				3757	- machine_version_major, machine_version_minor, and
				3758	machine_version_stepping are derived from the value of the -mcpu option
				3759	that is passed to the assembler.
				3760	- kernel_code_entry_byte_offset defaults to 256.
				3761	- wavefront_size defaults to 6.
				3762	- kernarg_segment_alignment, group_segment_alignment, and
				3763	private_segment_alignment default to 4. Note that alignments are specified
				3764	as a power of two, so a value of n means an alignment of 2^ n.
				3765
				3766	The .amd_kernel_code_t directive must be placed immediately after the
				3767	function label and before any instructions.
				3768
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3769	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				3770	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3771
				3772	Here is an example of a minimal amd_kernel_code_t specification:
				3773
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	3774	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3775
				3776	.hsa_code_object_version 1,0
				3777	.hsa_code_object_isa
				3778
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	3779	.hsatext
				3780	.globl hello_world
				3781	.p2align 8
				3782	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3783
				3784	hello_world:
				3785
				3786	.amd_kernel_code_t
				3787	enable_sgpr_kernarg_segment_ptr = 1
				3788	is_ptr64 = 1
				3789	compute_pgm_rsrc1_vgprs = 0
				3790	compute_pgm_rsrc1_sgprs = 0
				3791	compute_pgm_rsrc2_user_sgpr = 2
				3792	kernarg_segment_byte_size = 8
				3793	wavefront_sgpr_count = 2
				3794	workitem_vgpr_count = 3
				3795	.end_amd_kernel_code_t
				3796
				3797	s_load_dwordx2 s[0:1], s[0:1] 0x0
				3798	v_mov_b32 v0, 3.14159
				3799	s_waitcnt lgkmcnt(0)
				3800	v_mov_b32 v1, s0
				3801	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	3802	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3803	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	3804	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	3805	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3806
				3807	Additional Documentation
				3808	========================
				3809
				3810	.. [AMD-R6xx] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				3811	.. [AMD-R7xx] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				3812	.. [AMD-Evergreen] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				3813	.. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				3814	.. [AMD-Souther-Islands] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				3815	.. [AMD-Sea-Islands] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				3816	.. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3817	.. [AMD-Vega] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3818	.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
				3819	.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
				3820	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				3821	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				3822	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				3823	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				3824	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
				3825	.. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
				3826	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				3827	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
				3828	.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__