Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: ff22f2c35977213ce4399378120dd68c3b469d78 [file] [log] [blame]

Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1	=============================
				2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	26	.. table:: AMDGPU Architectures
				27	:name: amdgpu-architecture-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	28
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	29	============ ==============================================================
				30	Architecture Description
				31	============ ==============================================================
				32	``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
				33	``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
				34	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	35
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	36	.. table:: AMDGPU Vendors
				37	:name: amdgpu-vendor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	38
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	39	============ ==============================================================
				40	Vendor Description
				41	============ ==============================================================
				42	``amd`` Can be used for all AMD GPU usage.
				43	``mesa3d`` Can be used if the OS is ``mesa3d``.
				44	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	45
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	46	.. table:: AMDGPU Operating Systems
				47	:name: amdgpu-os-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	48
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	49	============== ============================================================
				50	OS Description
				51	============== ============================================================
				52	<empty> Defaults to the unknown OS.
				53	``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
				54	such as AMD's ROCm [AMD-ROCm]_.
				55	``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
				56	runtime.
				57	``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
				58	runtime.
				59	============== ============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	60
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	61	.. table:: AMDGPU Environments
				62	:name: amdgpu-environment-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	63
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	64	============ ==============================================================
				65	Environment Description
				66	============ ==============================================================
				67	<empty> Defaults to ``opencl``.
				68	``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
				69	``amdgizcl`` Same as ``opencl`` except a different address space mapping is
				70	used (see :ref:`amdgpu-address-spaces`).
				71	``amdgiz`` Same as ``opencl`` except a different address space mapping is
				72	used (see :ref:`amdgpu-address-spaces`).
				73	``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
				74	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	75
				76	.. _amdgpu-processors:
				77
				78	Processors
				79	----------
				80
				81	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				82	names from both the Processor and Alternative Processor can be used.
				83
				84	.. table:: AMDGPU Processors
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	85	:name: amdgpu-processor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	86
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	87	=========== =============== ============ ===== ========= ======= ==================
				88	Processor Alternative Target dGPU/ Target ROCm Example
				89	Processor Triple APU Features Support Products
				90	Architecture Supported
				91	[Default]
				92	=========== =============== ============ ===== ========= ======= ==================
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	93	Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	94	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	95	``r600`` ``r600`` dGPU
				96	``r630`` ``r600`` dGPU
				97	``rs880`` ``r600`` dGPU
				98	``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	99	Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	100	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	101	``rv710`` ``r600`` dGPU
				102	``rv730`` ``r600`` dGPU
				103	``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	104	Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	105	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	106	``cedar`` ``r600`` dGPU
				107	``redwood`` ``r600`` dGPU
				108	``sumo`` ``r600`` dGPU
				109	``juniper`` ``r600`` dGPU
				110	``cypress`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	111	Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	112	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	113	``barts`` ``r600`` dGPU
				114	``turks`` ``r600`` dGPU
				115	``caicos`` ``r600`` dGPU
				116	``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	117	GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	118	-----------------------------------------------------------------------------------
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	119	``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
				120	``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU
				121	- ``verde``
				122	- ``oland``
				123	- ``hainan``
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	124	GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	125	-----------------------------------------------------------------------------------
				126	``gfx700`` - ``kaveri`` ``amdgcn`` APU - A6-7000
				127	- A6 Pro-7050B
				128	- A8-7100
				129	- A8 Pro-7150B
				130	- A10-7300
				131	- A10 Pro-7350B
				132	- FX-7500
				133	- A8-7200P
				134	- A10-7400P
				135	- FX-7600P
				136	``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
				137	- FirePro W9100
				138	- FirePro S9150
				139	- FirePro S9170
				140	``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
				141	- Radeon R9 290x
				142	- Radeon R390
				143	- Radeon R390x
				144	``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
				145	- ``mullins`` - E1-2200
				146	- E1-2500
				147	- E2-3000
				148	- E2-3800
				149	- A4-5000
				150	- A4-5100
				151	- A6-5200
				152	- A4 Pro-3340B
				153	``gfx704`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
				154	- Radeon HD 8770
				155	- R7 260
				156	- R7 260X
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	157	GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	158	-----------------------------------------------------------------------------------
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	159	``gfx801`` - ``carrizo`` ``amdgcn`` APU - xnack - A6-8500P
				160	[on] - Pro A6-8500B
				161	- A8-8600P
				162	- Pro A8-8600B
				163	- FX-8800P
				164	- Pro A12-8800B
				165	\ ``amdgcn`` APU - xnack ROCm - A10-8700P
				166	[on] - Pro A10-8700B
				167	- A10-8780P
				168	\ ``amdgcn`` APU - xnack - A10-9600P
				169	[on] - A10-9630P
				170	- A12-9700P
				171	- A12-9730P
				172	- FX-9800P
				173	- FX-9830P
				174	\ ``amdgcn`` APU - xnack - E2-9010
				175	[on] - A6-9210
				176	- A9-9410
				177	``gfx802`` - ``tonga`` ``amdgcn`` dGPU - xnack ROCm - FirePro S7150
Tony Tye	a697880	2017-12-12 05:47:00 +0000	[diff] [blame]	178	- ``iceland`` [off] - FirePro S7100
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	179	- FirePro W7100
				180	- Radeon R285
				181	- Radeon R9 380
				182	- Radeon R9 385
				183	- Mobile FirePro
				184	M7170
				185	``gfx803`` - ``fiji`` ``amdgcn`` dGPU - xnack ROCm - Radeon R9 Nano
				186	[off] - Radeon R9 Fury
				187	- Radeon R9 FuryX
				188	- Radeon Pro Duo
				189	- FirePro S9300x2
				190	- Radeon Instinct MI8
				191	\ - ``polaris10`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 470
				192	[off] - Radeon RX 480
				193	- Radeon Instinct MI6
				194	\ - ``polaris11`` ``amdgcn`` dGPU - xnack ROCm - Radeon RX 460
				195	[off]
				196	``gfx810`` - ``stoney`` ``amdgcn`` APU - xnack
				197	[on]
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	198	GCN GFX9 [AMD-GCN-GFX9]_
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	199	-----------------------------------------------------------------------------------
				200	``gfx900`` ``amdgcn`` dGPU - xnack ROCm - Radeon Vega
				201	[off] Frontier Edition
				202	- Radeon RX Vega 56
				203	- Radeon RX Vega 64
				204	- Radeon RX Vega 64
				205	Liquid
				206	- Radeon Instinct MI25
				207	``gfx902`` ``amdgcn`` APU - xnack TBA
				208	[on]
				209	.. TODO
				210	Add product
				211	names.
				212	=========== =============== ============ ===== ========= ======= ==================
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	213
				214	.. _amdgpu-target-features:
				215
				216	Target Features
				217	---------------
				218
				219	Target features control how code is generated to support certain
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	220	processor specific features. Not all target features are supported by
				221	all processors. The runtime must ensure that the features supported by
				222	the device used to execute the code match the features enabled when
				223	generating the code. A mismatch of features may result in incorrect
				224	execution, or a reduction in performance.
				225
				226	The target features supported by each processor, and the default value
				227	used if not specified explicitly, is listed in
				228	:ref:`amdgpu-processor-table`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	229
				230	Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
				231	target features.
				232
				233	For example:
				234
				235	``-mxnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	236	Enable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	237	``-mno-xnack``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	238	Disable the ``xnack`` feature.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	239
				240	.. table:: AMDGPU Target Features
				241	:name: amdgpu-target-feature-table
				242
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	243	============== ==================================================
				244	Target Feature Description
				245	============== ==================================================
				246	-m[no-]xnack Enable/disable generating code that has
				247	memory clauses that are compatible with
				248	having XNACK replay enabled.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	249
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	250	This is used for demand paging and page
				251	migration. If XNACK replay is enabled in
				252	the device, then if a page fault occurs
				253	the code may execute incorrectly if the
				254	``xnack`` feature is not enabled. Executing
				255	code that has the feature enabled on a
				256	device that does not have XNACK replay
				257	enabled will execute correctly, but may
				258	be less performant than code with the
				259	feature disabled.
				260	============== ==================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	261
				262	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	263
				264	Address Spaces
				265	--------------
				266
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	267	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	268
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	269	The memory space names used in the table, aside from the region memory space, is
				270	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	271
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	272	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	273
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	274	.. table:: Address Space Mapping
				275	:name: amdgpu-address-space-mapping-table
				276
				277	================== ================= ================= ================= =================
				278	LLVM Address Space Memory Space
				279	------------------ -----------------------------------------------------------------------
				280	\ Current Default amdgiz/amdgizcl hcc Future Default
				281	================== ================= ================= ================= =================
				282	0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
				283	1 Global Global Global Global
				284	2 Constant Constant Constant Region (GDS)
				285	3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
				286	4 Generic (Flat) Region (GDS) Region (GDS) Constant
				287	5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
				288	================== ================= ================= ================= =================
				289
				290	Current Default
				291	This is the current default address space mapping used for all languages
				292	except hcc. This will shortly be deprecated.
				293
				294	amdgiz/amdgizcl
				295	This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
				296	is specified as the target triple environment value.
				297
				298	hcc
				299	This is the current address space mapping used when ``hcc`` is specified as
				300	the target triple environment value.This will shortly be deprecated.
				301
				302	Future Default
				303	This will shortly be the only address space mapping for all languages using
				304	AMDGPU backend.
				305
				306	.. _amdgpu-memory-scopes:
				307
				308	Memory Scopes
				309	-------------
				310
				311	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				312	backend memory model when the target triple OS is ``amdhsa`` (see
				313	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				314
				315	The memory model supported is based on the HSA memory model [HSA]_ which is
				316	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				317	relation is transitive over the synchonizes-with relation independent of scope,
				318	and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	319	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	320
				321	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				322	inclusion and requires the memory scopes to exactly match. However, this
				323	is conservatively correct for OpenCL.
				324
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	325	.. table:: AMDHSA LLVM Sync Scopes
				326	:name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	327
				328	================ ==========================================================
				329	LLVM Sync Scope Description
				330	================ ==========================================================
				331	none The default: ``system``.
				332
				333	Synchronizes with, and participates in modification and
				334	seq_cst total orderings with, other operations (except
				335	image operations) for all address spaces (except private,
				336	or generic that accesses private) provided the other
				337	operation's sync scope is:
				338
				339	- ``system``.
				340	- ``agent`` and executed by a thread on the same agent.
				341	- ``workgroup`` and executed by a thread in the same
				342	workgroup.
				343	- ``wavefront`` and executed by a thread in the same
				344	wavefront.
				345
				346	``agent`` Synchronizes with, and participates in modification and
				347	seq_cst total orderings with, other operations (except
				348	image operations) for all address spaces (except private,
				349	or generic that accesses private) provided the other
				350	operation's sync scope is:
				351
				352	- ``system`` or ``agent`` and executed by a thread on the
				353	same agent.
				354	- ``workgroup`` and executed by a thread in the same
				355	workgroup.
				356	- ``wavefront`` and executed by a thread in the same
				357	wavefront.
				358
				359	``workgroup`` Synchronizes with, and participates in modification and
				360	seq_cst total orderings with, other operations (except
				361	image operations) for all address spaces (except private,
				362	or generic that accesses private) provided the other
				363	operation's sync scope is:
				364
				365	- ``system``, ``agent`` or ``workgroup`` and executed by a
				366	thread in the same workgroup.
				367	- ``wavefront`` and executed by a thread in the same
				368	wavefront.
				369
				370	``wavefront`` Synchronizes with, and participates in modification and
				371	seq_cst total orderings with, other operations (except
				372	image operations) for all address spaces (except private,
				373	or generic that accesses private) provided the other
				374	operation's sync scope is:
				375
				376	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				377	and executed by a thread in the same wavefront.
				378
				379	``singlethread`` Only synchronizes with, and participates in modification
				380	and seq_cst total orderings with, other operations (except
				381	image operations) running in the same thread for all
				382	address spaces (for example, in signal handlers).
				383	================ ==========================================================
				384
				385	AMDGPU Intrinsics
				386	-----------------
				387
				388	The AMDGPU backend implements the following intrinsics.
				389
				390	This section is WIP.
				391
				392	.. TODO
				393	List AMDGPU intrinsics
				394
				395	Code Object
				396	===========
				397
				398	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				399	can be linked by ``lld`` to produce a standard ELF shared code object which can
				400	be loaded and executed on an AMDGPU target.
				401
				402	Header
				403	------
				404
				405	The AMDGPU backend uses the following ELF header:
				406
				407	.. table:: AMDGPU ELF Header
				408	:name: amdgpu-elf-header-table
				409
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	410	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	411	Field Value
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	412	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	413	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				414	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	415	``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
				416	- ``ELFOSABI_AMDGPU_HSA``
				417	- ``ELFOSABI_AMDGPU_PAL``
				418	- ``ELFOSABI_AMDGPU_MESA3D``
				419	``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
				420	- ``ELFABIVERSION_AMDGPU_PAL``
				421	- ``ELFABIVERSION_AMDGPU_MESA3D``
				422	``e_type`` - ``ET_REL``
				423	- ``ET_DYN``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	424	``e_machine`` ``EM_AMDGPU``
				425	``e_entry`` 0
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	426	``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	427	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	428
				429	..
				430
				431	.. table:: AMDGPU ELF Header Enumeration Values
				432	:name: amdgpu-elf-header-enumeration-values-table
				433
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	434	=============================== =====
				435	Name Value
				436	=============================== =====
				437	``EM_AMDGPU`` 224
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	438	``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	439	``ELFOSABI_AMDGPU_HSA`` 64
				440	``ELFOSABI_AMDGPU_PAL`` 65
				441	``ELFOSABI_AMDGPU_MESA3D`` 66
				442	``ELFABIVERSION_AMDGPU_HSA`` 1
				443	``ELFABIVERSION_AMDGPU_PAL`` 0
				444	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				445	=============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	446
				447	``e_ident[EI_CLASS]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	448	The ELF class is:
				449
				450	* ``ELFCLASS32`` for ``r600`` architecture.
				451
				452	* ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
				453	bit applications.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	454
				455	``e_ident[EI_DATA]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	456	All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	457
				458	``e_ident[EI_OSABI]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	459	One of the following AMD GPU architecture specific OS ABIs
				460	(see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	461
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	462	* ``ELFOSABI_NONE`` for unknown OS.
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	463
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	464	* ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	465
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	466	* ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
				467
				468	* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	469
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	470	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	471	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				472	object conforms:
				473
				474	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				475	runtime ABI.
				476
				477	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				478	runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	479
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	480	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	481	3D runtime ABI.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	482
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	483	``e_type``
				484	Can be one of the following values:
				485
				486
				487	``ET_REL``
				488	The type produced by the AMD GPU backend compiler as it is relocatable code
				489	object.
				490
				491	``ET_DYN``
				492	The type produced by the linker as it is a shared code object.
				493
				494	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				495
				496	``e_machine``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	497	The value ``EM_AMDGPU`` is used for the machine for all processors supported
				498	by the ``r600`` and ``amdgcn`` architectures (see
				499	:ref:`amdgpu-processor-table`). The specific processor is specified in the
				500	``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
				501	:ref:`amdgpu-elf-header-e_flags-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	502
				503	``e_entry``
				504	The entry point is 0 as the entry points for individual kernels must be
				505	selected in order to invoke them through AQL packets.
				506
				507	``e_flags``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	508	The AMDGPU backend uses the following ELF header flags:
				509
				510	.. table:: AMDGPU ELF Header ``e_flags``
				511	:name: amdgpu-elf-header-e_flags-table
				512
				513	================================= ========== =============================
				514	Name Value Description
				515	================================= ========== =============================
				516	AMDGPU Processor Flag See :ref:`amdgpu-processor-table`.
				517	-------------------------------------------- -----------------------------
				518	``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
				519	mask for
				520	``EF_AMDGPU_MACH_xxx`` values
				521	defined in
				522	:ref:`amdgpu-ef-amdgpu-mach-table`.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	523	``EF_AMDGPU_XNACK`` 0x00000100 Indicates if the ``xnack``
				524	target feature is
				525	enabled for all code
				526	contained in the code object.
				527	See
				528	:ref:`amdgpu-target-features`.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	529	================================= ========== =============================
				530
				531	.. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
				532	:name: amdgpu-ef-amdgpu-mach-table
				533
				534	================================= ========== =============================
				535	Name Value Description (see
				536	:ref:`amdgpu-processor-table`)
				537	================================= ========== =============================
				538	``EF_AMDGPU_MACH_NONE`` 0 not specified
				539	``EF_AMDGPU_MACH_R600_R600`` 1 ``r600``
				540	``EF_AMDGPU_MACH_R600_R630`` 2 ``r630``
				541	``EF_AMDGPU_MACH_R600_RS880`` 3 ``rs880``
				542	``EF_AMDGPU_MACH_R600_RV670`` 4 ``rv670``
				543	``EF_AMDGPU_MACH_R600_RV710`` 5 ``rv710``
				544	``EF_AMDGPU_MACH_R600_RV730`` 6 ``rv730``
				545	``EF_AMDGPU_MACH_R600_RV770`` 7 ``rv770``
				546	``EF_AMDGPU_MACH_R600_CEDAR`` 8 ``cedar``
				547	``EF_AMDGPU_MACH_R600_REDWOOD`` 9 ``redwood``
				548	``EF_AMDGPU_MACH_R600_SUMO`` 10 ``sumo``
				549	``EF_AMDGPU_MACH_R600_JUNIPER`` 11 ``juniper``
				550	``EF_AMDGPU_MACH_R600_CYPRESS`` 12 ``cypress``
				551	``EF_AMDGPU_MACH_R600_BARTS`` 13 ``barts``
				552	``EF_AMDGPU_MACH_R600_TURKS`` 14 ``turks``
				553	``EF_AMDGPU_MACH_R600_CAICOS`` 15 ``caicos``
				554	``EF_AMDGPU_MACH_R600_CAYMAN`` 16 ``cayman``
				555	reserved 17-31 Reserved for ``r600``
				556	architecture processors.
				557	``EF_AMDGPU_MACH_AMDGCN_GFX600`` 32 ``gfx600``
				558	``EF_AMDGPU_MACH_AMDGCN_GFX601`` 33 ``gfx601``
				559	``EF_AMDGPU_MACH_AMDGCN_GFX700`` 34 ``gfx700``
				560	``EF_AMDGPU_MACH_AMDGCN_GFX701`` 35 ``gfx701``
				561	``EF_AMDGPU_MACH_AMDGCN_GFX702`` 36 ``gfx702``
				562	``EF_AMDGPU_MACH_AMDGCN_GFX703`` 37 ``gfx703``
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	563	``EF_AMDGPU_MACH_AMDGCN_GFX704`` 38 ``gfx704``
Tony Tye	db6c993	2018-01-30 23:59:43 +0000	[diff] [blame^]	564	reserved 39 Reserved.
				565	``EF_AMDGPU_MACH_AMDGCN_GFX801`` 40 ``gfx801``
				566	``EF_AMDGPU_MACH_AMDGCN_GFX802`` 41 ``gfx802``
				567	``EF_AMDGPU_MACH_AMDGCN_GFX803`` 42 ``gfx803``
				568	``EF_AMDGPU_MACH_AMDGCN_GFX810`` 43 ``gfx810``
				569	``EF_AMDGPU_MACH_AMDGCN_GFX900`` 44 ``gfx900``
				570	``EF_AMDGPU_MACH_AMDGCN_GFX902`` 45 ``gfx902``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	571	================================= ========== =============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	572
				573	Sections
				574	--------
				575
				576	An AMDGPU target ELF code object has the standard ELF sections which include:
				577
				578	.. table:: AMDGPU ELF Sections
				579	:name: amdgpu-elf-sections-table
				580
				581	================== ================ =================================
				582	Name Type Attributes
				583	================== ================ =================================
				584	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				585	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				586	``.debug_``\ \* ``SHT_PROGBITS`` none
				587	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				588	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				589	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				590	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				591	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				592	``.note`` ``SHT_NOTE`` none
				593	``.rela``\ name ``SHT_RELA`` none
				594	``.rela.dyn`` ``SHT_RELA`` none
				595	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				596	``.shstrtab`` ``SHT_STRTAB`` none
				597	``.strtab`` ``SHT_STRTAB`` none
				598	``.symtab`` ``SHT_SYMTAB`` none
				599	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				600	================== ================ =================================
				601
				602	These sections have their standard meanings (see [ELF]_) and are only generated
				603	if needed.
				604
				605	``.debug``\ \*
				606	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				607	DWARF produced by the AMDGPU backend.
				608
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	609	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	610	The standard sections used by a dynamic loader.
				611
				612	``.note``
				613	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				614	backend.
				615
				616	``.rela``\ name, ``.rela.dyn``
				617	For relocatable code objects, name is the name of the section that the
				618	relocation records apply. For example, ``.rela.text`` is the section name for
				619	relocation records associated with the ``.text`` section.
				620
				621	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				622	records from each of the relocatable code object's ``.rela``\ name sections.
				623
				624	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				625	the AMDGPU backend.
				626
				627	``.text``
				628	The executable machine code for the kernels and functions they call. Generated
				629	as position independent code. See :ref:`amdgpu-code-conventions` for
				630	information on conventions used in the isa generation.
				631
				632	.. _amdgpu-note-records:
				633
				634	Note Records
				635	------------
				636
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	637	As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
				638	be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
				639	aligned. In addition, minimal zero byte padding must be generated to ensure the
				640	``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
				641	``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	642
				643	The AMDGPU backend code object uses the following ELF note records in the
				644	``.note`` section. The Description column specifies the layout of the note
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	645	record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	646	variable size strings have a corresponding ``*_size`` field that specifies the
				647	number of bytes, including the terminating null character, in the string. The
				648	string(s) come immediately after the preceding fields.
				649
				650	Additional note records can be present.
				651
				652	.. table:: AMDGPU ELF Note Records
				653	:name: amdgpu-elf-note-records-table
				654
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	655	===== ============================== ======================================
				656	Name Type Description
				657	===== ============================== ======================================
				658	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	659	===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	660
				661	..
				662
				663	.. table:: AMDGPU ELF Note Record Enumeration Values
				664	:name: amdgpu-elf-note-record-enumeration-values-table
				665
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	666	============================== =====
				667	Name Value
				668	============================== =====
				669	reserved 0-9
				670	``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	671	reserved 11
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	672	============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	673
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	674	``NT_AMD_AMDGPU_HSA_METADATA``
				675	Specifies extensible metadata associated with the code objects executed on HSA
				676	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				677	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				678	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				679	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	680
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	681	.. _amdgpu-symbols:
				682
				683	Symbols
				684	-------
				685
				686	Symbols include the following:
				687
				688	.. table:: AMDGPU ELF Symbols
				689	:name: amdgpu-elf-symbols-table
				690
				691	===================== ============== ============= ==================
				692	Name Type Section Description
				693	===================== ============== ============= ==================
				694	link-name ``STT_OBJECT`` - ``.data`` Global variable
				695	- ``.rodata``
				696	- ``.bss``
				697	link-name\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
				698	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				699	===================== ============== ============= ==================
				700
				701	Global variable
				702	Global variables both used and defined by the compilation unit.
				703
				704	If the symbol is defined in the compilation unit then it is allocated in the
				705	appropriate section according to if it has initialized data or is readonly.
				706
				707	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				708	will resolve relocations using the definition provided by another code object
				709	or explicitly defined by the runtime.
				710
				711	All global symbols, whether defined in the compilation unit or external, are
				712	accessed by the machine code indirectly through a GOT table entry. This
				713	allows them to be preemptable. The GOT table is only supported when the target
				714	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	715
				716	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	717	Add description of linked shared object symbols. Seems undefined symbols
				718	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	719
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	720	Kernel descriptor
				721	Every HSA kernel has an associated kernel descriptor. It is the address of the
				722	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				723	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				724	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				725
				726	Kernel entry point
				727	Every HSA kernel also has a symbol for its machine code entry point.
				728
				729	.. _amdgpu-relocation-records:
				730
				731	Relocation Records
				732	------------------
				733
				734	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				735	relocatable fields are:
				736
				737	``word32``
				738	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				739	alignment. These values use the same byte order as other word values in the
				740	AMD GPU architecture.
				741
				742	``word64``
				743	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				744	alignment. These values use the same byte order as other word values in the
				745	AMD GPU architecture.
				746
				747	Following notations are used for specifying relocation calculations:
				748
				749	A
				750	Represents the addend used to compute the value of the relocatable field.
				751
				752	G
				753	Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	754	entry's symbol will reside during execution.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	755
				756	GOT
				757	Represents the address of the global offset table.
				758
				759	P
				760	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				761	of the storage unit being relocated (computed using ``r_offset``).
				762
				763	S
				764	Represents the value of the symbol whose index resides in the relocation
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	765	entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				766
				767	B
				768	Represents the base address of a loaded executable or shared object which is
				769	the difference between the ELF address and the actual load address. Relocations
				770	using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	771
				772	The following relocation types are supported:
				773
				774	.. table:: AMDGPU ELF Relocation Records
				775	:name: amdgpu-elf-relocation-records-table
				776
Tony Tye	db6c993	2018-01-30 23:59:43 +0000	[diff] [blame^]	777	========================== ======= ===== ========== ==============================
				778	Relocation Type Kind Value Field Calculation
				779	========================== ======= ===== ========== ==============================
				780	``R_AMDGPU_NONE`` 0 none none
				781	``R_AMDGPU_ABS32_LO`` Dynamic 1 ``word32`` (S + A) & 0xFFFFFFFF
				782	``R_AMDGPU_ABS32_HI`` Dynamic 2 ``word32`` (S + A) >> 32
				783	``R_AMDGPU_ABS64`` Dynamic 3 ``word64`` S + A
				784	``R_AMDGPU_REL32`` Static 4 ``word32`` S + A - P
				785	``R_AMDGPU_REL64`` Static 5 ``word64`` S + A - P
				786	``R_AMDGPU_ABS32`` Static 6 ``word32`` S + A
				787	``R_AMDGPU_GOTPCREL`` Static 7 ``word32`` G + GOT + A - P
				788	``R_AMDGPU_GOTPCREL32_LO`` Static 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				789	``R_AMDGPU_GOTPCREL32_HI`` Static 9 ``word32`` (G + GOT + A - P) >> 32
				790	``R_AMDGPU_REL32_LO`` Static 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				791	``R_AMDGPU_REL32_HI`` Static 11 ``word32`` (S + A - P) >> 32
				792	reserved 12
				793	``R_AMDGPU_RELATIVE64`` Dynamic 13 ``word64`` B + A
				794	========================== ======= ===== ========== ==============================
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	795
				796	.. _amdgpu-dwarf:
				797
				798	DWARF
				799	-----
				800
				801	Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
				802	information that maps the code object executable code and data to the source
				803	language constructs. It can be used by tools such as debuggers and profilers.
				804
				805	Address Space Mapping
				806	~~~~~~~~~~~~~~~~~~~~~
				807
				808	The following address space mapping is used:
				809
				810	.. table:: AMDGPU DWARF Address Space Mapping
				811	:name: amdgpu-dwarf-address-space-mapping-table
				812
				813	=================== =================
				814	DWARF Address Space Memory Space
				815	=================== =================
				816	1 Private (Scratch)
				817	2 Local (group/LDS)
				818	omitted Global
				819	omitted Constant
				820	omitted Generic (Flat)
				821	not supported Region (GDS)
				822	=================== =================
				823
				824	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				825	used in the table.
				826
				827	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				828	DWARF address space of the value of the pointer when it is in the private or
				829	local address space. Otherwise the attribute is omitted.
				830
				831	An ``XDEREF`` operation is generated in location list expressions for variables
				832	that are allocated in the private and local address space. Otherwise no
				833	``XDREF`` is omitted.
				834
				835	Register Mapping
				836	~~~~~~~~~~~~~~~~
				837
				838	This section is WIP.
				839
				840	.. TODO
				841	Define DWARF register enumeration.
				842
				843	If want to present a wavefront state then should expose vector registers as
				844	64 wide (rather than per work-item view that LLVM uses). Either as separate
				845	registers, or a 64x4 byte single register. In either case use a new LANE op
				846	(akin to XDREF) to select the current lane usage in a location
				847	expression. This would also allow scalar register spilling to vector register
				848	lanes to be expressed (currently no debug information is being generated for
				849	spilling). If choose a wide single register approach then use LANE in
				850	conjunction with PIECE operation to select the dword part of the register for
				851	the current lane. If the separate register approach then use LANE to select
				852	the register.
				853
				854	Source Text
				855	~~~~~~~~~~~
				856
				857	This section is WIP.
				858
				859	.. TODO
				860	DWARF extension to include runtime generated source text.
				861
				862	.. _amdgpu-code-conventions:
				863
				864	Code Conventions
				865	================
				866
				867	This section provides code conventions used for each supported target triple OS
				868	(see :ref:`amdgpu-target-triples`).
				869
				870	AMDHSA
				871	------
				872
				873	This section provides code conventions used when the target triple OS is
				874	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				875
				876	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	877
				878	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	879	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	880
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	881	The code object metadata specifies extensible metadata associated with the code
				882	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
				883	[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
				884	(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
				885	``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
				886	information necessary to support the ROCM kernel queries. For example, the
				887	segment sizes needed in a dispatch packet. In addition, a high level language
				888	runtime may require other information to be included. For example, the AMD
				889	OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	890
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	891	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	892	:doc:`YamlIO`).
				893
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	894	.. TODO
				895	Is the string null terminated? It probably should not if YAML allows it to
				896	contain null characters, otherwise it should be.
				897
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	898	The metadata is represented as a single YAML document comprised of the mapping
				899	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				900	referenced tables.
				901
				902	For boolean values, the string values of ``false`` and ``true`` are used for
				903	false and true respectively.
				904
				905	Additional information can be added to the mappings. To avoid conflicts, any
				906	non-AMD key names should be prefixed by "vendor-name.".
				907
				908	.. table:: AMDHSA Code Object Metadata Mapping
				909	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				910
				911	========== ============== ========= =======================================
				912	String Key Value Type Required? Description
				913	========== ============== ========= =======================================
				914	"Version" sequence of Required - The first integer is the major
				915	2 integers version. Currently 1.
				916	- The second integer is the minor
				917	version. Currently 0.
				918	"Printf" sequence of Each string is encoded information
				919	strings about a printf function call. The
				920	encoded information is organized as
				921	fields separated by colon (':'):
				922
				923	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				924
				925	where:
				926
				927	``ID``
				928	A 32 bit integer as a unique id for
				929	each printf function call
				930
				931	``N``
				932	A 32 bit integer equal to the number
				933	of arguments of printf function call
				934	minus 1
				935
				936	``S[i]`` (where i = 0, 1, ... , N-1)
				937	32 bit integers for the size in bytes
				938	of the i-th FormatString argument of
				939	the printf function call
				940
				941	FormatString
				942	The format string passed to the
				943	printf function call.
				944	"Kernels" sequence of Required Sequence of the mappings for each
				945	mapping kernel in the code object. See
				946	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				947	for the definition of the mapping.
				948	========== ============== ========= =======================================
				949
				950	..
				951
				952	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				953	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				954
				955	================= ============== ========= ================================
				956	String Key Value Type Required? Description
				957	================= ============== ========= ================================
				958	"Name" string Required Source name of the kernel.
				959	"SymbolName" string Required Name of the kernel
				960	descriptor ELF symbol.
				961	"Language" string Source language of the kernel.
				962	Values include:
				963
				964	- "OpenCL C"
				965	- "OpenCL C++"
				966	- "HCC"
				967	- "OpenMP"
				968
				969	"LanguageVersion" sequence of - The first integer is the major
				970	2 integers version.
				971	- The second integer is the
				972	minor version.
				973	"Attrs" mapping Mapping of kernel attributes.
				974	See
				975	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				976	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	977	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	978	mapping kernel arguments. See
				979	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				980	for the definition of the mapping.
				981	"CodeProps" mapping Mapping of properties related to
				982	the kernel code. See
				983	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				984	for the mapping definition.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	985	================= ============== ========= ================================
				986
				987	..
				988
				989	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				990	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				991
				992	=================== ============== ========= ==============================
				993	String Key Value Type Required? Description
				994	=================== ============== ========= ==============================
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	995	"ReqdWorkGroupSize" sequence of If not 0, 0, 0 then all values
				996	3 integers must be >=1 and the dispatch
				997	work-group size X, Y, Z must
				998	correspond to the specified
				999	values. Defaults to 0, 0, 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1000
				1001	Corresponds to the OpenCL
				1002	``reqd_work_group_size``
				1003	attribute.
				1004	"WorkGroupSizeHint" sequence of The dispatch work-group size
				1005	3 integers X, Y, Z is likely to be the
				1006	specified values.
				1007
				1008	Corresponds to the OpenCL
				1009	``work_group_size_hint``
				1010	attribute.
				1011	"VecTypeHint" string The name of a scalar or vector
				1012	type.
				1013
				1014	Corresponds to the OpenCL
				1015	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	1016
				1017	"RuntimeHandle" string The external symbol name
				1018	associated with a kernel.
				1019	OpenCL runtime allocates a
				1020	global buffer for the symbol
				1021	and saves the kernel's address
				1022	to it, which is used for
				1023	device side enqueueing. Only
				1024	available for device side
				1025	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1026	=================== ============== ========= ==============================
				1027
				1028	..
				1029
				1030	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				1031	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				1032
				1033	================= ============== ========= ================================
				1034	String Key Value Type Required? Description
				1035	================= ============== ========= ================================
				1036	"Name" string Kernel argument name.
				1037	"TypeName" string Kernel argument type name.
				1038	"Size" integer Required Kernel argument size in bytes.
				1039	"Align" integer Required Kernel argument alignment in
				1040	bytes. Must be a power of two.
				1041	"ValueKind" string Required Kernel argument kind that
				1042	specifies how to set up the
				1043	corresponding argument.
				1044	Values include:
				1045
				1046	"ByValue"
				1047	The argument is copied
				1048	directly into the kernarg.
				1049
				1050	"GlobalBuffer"
				1051	A global address space pointer
				1052	to the buffer data is passed
				1053	in the kernarg.
				1054
				1055	"DynamicSharedPointer"
				1056	A group address space pointer
				1057	to dynamically allocated LDS
				1058	is passed in the kernarg.
				1059
				1060	"Sampler"
				1061	A global address space
				1062	pointer to a S# is passed in
				1063	the kernarg.
				1064
				1065	"Image"
				1066	A global address space
				1067	pointer to a T# is passed in
				1068	the kernarg.
				1069
				1070	"Pipe"
				1071	A global address space pointer
				1072	to an OpenCL pipe is passed in
				1073	the kernarg.
				1074
				1075	"Queue"
				1076	A global address space pointer
				1077	to an OpenCL device enqueue
				1078	queue is passed in the
				1079	kernarg.
				1080
				1081	"HiddenGlobalOffsetX"
				1082	The OpenCL grid dispatch
				1083	global offset for the X
				1084	dimension is passed in the
				1085	kernarg.
				1086
				1087	"HiddenGlobalOffsetY"
				1088	The OpenCL grid dispatch
				1089	global offset for the Y
				1090	dimension is passed in the
				1091	kernarg.
				1092
				1093	"HiddenGlobalOffsetZ"
				1094	The OpenCL grid dispatch
				1095	global offset for the Z
				1096	dimension is passed in the
				1097	kernarg.
				1098
				1099	"HiddenNone"
				1100	An argument that is not used
				1101	by the kernel. Space needs to
				1102	be left for it, but it does
				1103	not need to be set up.
				1104
				1105	"HiddenPrintfBuffer"
				1106	A global address space pointer
				1107	to the runtime printf buffer
				1108	is passed in kernarg.
				1109
				1110	"HiddenDefaultQueue"
				1111	A global address space pointer
				1112	to the OpenCL device enqueue
				1113	queue that should be used by
				1114	the kernel by default is
				1115	passed in the kernarg.
				1116
				1117	"HiddenCompletionAction"
Yaxun Liu	c928f2a	2017-10-30 14:30:28 +0000	[diff] [blame]	1118	A global address space pointer
				1119	to help link enqueued kernels into
				1120	the ancestor tree for determining
				1121	when the parent kernel has finished.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1122
				1123	"ValueType" string Required Kernel argument value type. Only
				1124	present if "ValueKind" is
				1125	"ByValue". For vector data
				1126	types, the value is for the
				1127	element type. Values include:
				1128
				1129	- "Struct"
				1130	- "I8"
				1131	- "U8"
				1132	- "I16"
				1133	- "U16"
				1134	- "F16"
				1135	- "I32"
				1136	- "U32"
				1137	- "F32"
				1138	- "I64"
				1139	- "U64"
				1140	- "F64"
				1141
				1142	.. TODO
				1143	How can it be determined if a
				1144	vector type, and what size
				1145	vector?
				1146	"PointeeAlign" integer Alignment in bytes of pointee
				1147	type for pointer type kernel
				1148	argument. Must be a power
				1149	of 2. Only present if
				1150	"ValueKind" is
				1151	"DynamicSharedPointer".
				1152	"AddrSpaceQual" string Kernel argument address space
				1153	qualifier. Only present if
				1154	"ValueKind" is "GlobalBuffer" or
				1155	"DynamicSharedPointer". Values
				1156	are:
				1157
				1158	- "Private"
				1159	- "Global"
				1160	- "Constant"
				1161	- "Local"
				1162	- "Generic"
				1163	- "Region"
				1164
				1165	.. TODO
				1166	Is GlobalBuffer only Global
				1167	or Constant? Is
				1168	DynamicSharedPointer always
				1169	Local? Can HCC allow Generic?
				1170	How can Private or Region
				1171	ever happen?
				1172	"AccQual" string Kernel argument access
				1173	qualifier. Only present if
				1174	"ValueKind" is "Image" or
				1175	"Pipe". Values
				1176	are:
				1177
				1178	- "ReadOnly"
				1179	- "WriteOnly"
				1180	- "ReadWrite"
				1181
				1182	.. TODO
				1183	Does this apply to
				1184	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1185	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1186	performed by the kernel on the
				1187	kernel argument. Only present if
				1188	"ValueKind" is "GlobalBuffer",
				1189	"Image", or "Pipe". This may be
				1190	more restrictive than indicated
				1191	by "AccQual" to reflect what the
				1192	kernel actual does. If not
				1193	present then the runtime must
				1194	assume what is implied by
				1195	"AccQual" and "IsConst". Values
				1196	are:
				1197
				1198	- "ReadOnly"
				1199	- "WriteOnly"
				1200	- "ReadWrite"
				1201
				1202	"IsConst" boolean Indicates if the kernel argument
				1203	is const qualified. Only present
				1204	if "ValueKind" is
				1205	"GlobalBuffer".
				1206
				1207	"IsRestrict" boolean Indicates if the kernel argument
				1208	is restrict qualified. Only
				1209	present if "ValueKind" is
				1210	"GlobalBuffer".
				1211
				1212	"IsVolatile" boolean Indicates if the kernel argument
				1213	is volatile qualified. Only
				1214	present if "ValueKind" is
				1215	"GlobalBuffer".
				1216
				1217	"IsPipe" boolean Indicates if the kernel argument
				1218	is pipe qualified. Only present
				1219	if "ValueKind" is "Pipe".
				1220
				1221	.. TODO
				1222	Can GlobalBuffer be pipe
				1223	qualified?
				1224	================= ============== ========= ================================
				1225
				1226	..
				1227
				1228	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1229	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1230
				1231	============================ ============== ========= =====================
				1232	String Key Value Type Required? Description
				1233	============================ ============== ========= =====================
				1234	"KernargSegmentSize" integer Required The size in bytes of
				1235	the kernarg segment
				1236	that holds the values
				1237	of the arguments to
				1238	the kernel.
				1239	"GroupSegmentFixedSize" integer Required The amount of group
				1240	segment memory
				1241	required by a
				1242	work-group in
				1243	bytes. This does not
				1244	include any
				1245	dynamically allocated
				1246	group segment memory
				1247	that may be added
				1248	when the kernel is
				1249	dispatched.
				1250	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1251	private address space
				1252	memory required for a
				1253	work-item in
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1254	bytes. If the kernel
				1255	uses a dynamic call
				1256	stack then additional
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1257	space must be added
				1258	to this value for the
				1259	call stack.
				1260	"KernargSegmentAlign" integer Required The maximum byte
				1261	alignment of
				1262	arguments in the
				1263	kernarg segment. Must
				1264	be a power of 2.
				1265	"WavefrontSize" integer Required Wavefront size. Must
				1266	be a power of 2.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1267	"NumSGPRs" integer Required Number of scalar
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1268	registers used by a
				1269	wavefront for
				1270	GFX6-GFX9. This
				1271	includes the special
				1272	SGPRs for VCC, Flat
				1273	Scratch (GFX7-GFX9)
				1274	and XNACK (for
				1275	GFX8-GFX9). It does
				1276	not include the 16
				1277	SGPR added if a trap
				1278	handler is
				1279	enabled. It is not
				1280	rounded up to the
				1281	allocation
				1282	granularity.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1283	"NumVGPRs" integer Required Number of vector
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1284	registers used by
				1285	each work-item for
				1286	GFX6-GFX9
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1287	"MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1288	work-group size
				1289	supported by the
				1290	kernel in work-items.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1291	Must be >=1 and
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1292	consistent with
				1293	ReqdWorkGroupSize if
				1294	not 0, 0, 0.
Konstantin Zhuravlyov	06ae4ec	2017-11-28 17:51:08 +0000	[diff] [blame]	1295	"NumSpilledSGPRs" integer Number of stores from
				1296	a scalar register to
				1297	a register allocator
				1298	created spill
				1299	location.
				1300	"NumSpilledVGPRs" integer Number of stores from
				1301	a vector register to
				1302	a register allocator
				1303	created spill
				1304	location.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1305	============================ ============== ========= =====================
				1306
				1307	..
				1308
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1309	Kernel Dispatch
				1310	~~~~~~~~~~~~~~~
				1311
				1312	The HSA architected queuing language (AQL) defines a user space memory interface
				1313	that can be used to control the dispatch of kernels, in an agent independent
				1314	way. An agent can have zero or more AQL queues created for it using the ROCm
				1315	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1316	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1317	mechanics and packet layouts.
				1318
				1319	The packet processor of a kernel agent is responsible for detecting and
				1320	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1321	packet processor is implemented by the hardware command processor (CP),
				1322	asynchronous dispatch controller (ADC) and shader processor input controller
				1323	(SPI).
				1324
				1325	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1326	mode driver to initialize and register the AQL queue with CP.
				1327
				1328	To dispatch a kernel the following actions are performed. This can occur in the
				1329	CPU host program, or from an HSA kernel executing on a GPU.
				1330
				1331	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1332	executed is obtained.
				1333	2. A pointer to the kernel descriptor (see
				1334	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1335	obtained. It must be for a kernel that is contained in a code object that that
				1336	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1337	associated.
				1338	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1339	for a memory region with the kernarg property for the kernel agent that will
				1340	execute the kernel. It must be at least 16 byte aligned.
				1341	4. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	1342	allocation. The layout is defined in the HSA Programmer's Language Reference
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1343	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1344	memory in the same way constant memory is accessed. (Note that the HSA
				1345	specification allows an implementation to copy the kernel argument contents to
				1346	another location that is accessed by the kernel.)
				1347	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1348	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1349	packet. The packet must be set up, and the final write must use an atomic
				1350	store release to set the packet kind to ensure the packet contents are
				1351	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1352	notify the kernel agent that the AQL queue has been updated. These rules, and
				1353	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1354	System Architecture Specification* [HSA]_.
				1355	6. A kernel dispatch packet includes information about the actual dispatch,
				1356	such as grid and work-group size, together with information from the code
				1357	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1358	the kernel symbol can be used to obtain the code object values which are
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1359	recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1360	7. CP executes micro-code and is responsible for detecting and setting up the
				1361	GPU to execute the wavefronts of a kernel dispatch.
				1362	8. CP ensures that when the a wavefront starts executing the kernel machine
				1363	code, the scalar general purpose registers (SGPR) and vector general purpose
				1364	registers (VGPR) are set up as required by the machine code. The required
				1365	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1366	register state is defined in
				1367	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1368	9. The prolog of the kernel machine code (see
				1369	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1370	before continuing executing the machine code that corresponds to the kernel.
				1371	10. When the kernel dispatch has completed execution, CP signals the completion
				1372	signal specified in the kernel dispatch packet if not 0.
				1373
				1374	.. _amdgpu-amdhsa-memory-spaces:
				1375
				1376	Memory Spaces
				1377	~~~~~~~~~~~~~
				1378
				1379	The memory space properties are:
				1380
				1381	.. table:: AMDHSA Memory Spaces
				1382	:name: amdgpu-amdhsa-memory-spaces-table
				1383
				1384	================= =========== ======== ======= ==================
				1385	Memory Space Name HSA Segment Hardware Address NULL Value
				1386	Name Name Size
				1387	================= =========== ======== ======= ==================
				1388	Private private scratch 32 0x00000000
				1389	Local group LDS 32 0xFFFFFFFF
				1390	Global global global 64 0x0000000000000000
				1391	Constant constant *same as 64 0x0000000000000000
				1392	global*
				1393	Generic flat flat 64 0x0000000000000000
				1394	Region N/A GDS 32 *not implemented
				1395	for AMDHSA*
				1396	================= =========== ======== ======= ==================
				1397
				1398	The global and constant memory spaces both use global virtual addresses, which
				1399	are the same virtual address space used by the CPU. However, some virtual
				1400	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1401	and some by both.
				1402
				1403	Using the constant memory space indicates that the data will not change during
				1404	the execution of the kernel. This allows scalar read instructions to be
				1405	used. The vector and scalar L1 caches are invalidated of volatile data before
				1406	each kernel dispatch execution to allow constant memory to change values between
				1407	kernel dispatches.
				1408
				1409	The local memory space uses the hardware Local Data Store (LDS) which is
				1410	automatically allocated when the hardware creates work-groups of wavefronts, and
				1411	freed when all the wavefronts of a work-group have terminated. The data store
				1412	(DS) instructions can be used to access it.
				1413
				1414	The private memory space uses the hardware scratch memory support. If the kernel
				1415	uses scratch, then the hardware allocates memory that is accessed using
				1416	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1417	address to physical address is:
				1418
				1419	``wavefront-scratch-base +
				1420	(private-address * wavefront-size * 4) +
				1421	(wavefront-lane-id * 4)``
				1422
				1423	There are different ways that the wavefront scratch base address is determined
				1424	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1425	memory can be accessed in an interleaved manner using buffer instruction with
				1426	the scratch buffer descriptor and per wave scratch offset, by the scratch
				1427	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1428	same private address, the interleaving results in adjacent dwords being accessed
				1429	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1430	supported except by flat and scratch instructions in GFX9.
				1431
				1432	The generic address space uses the hardware flat address support available in
				1433	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1434	local appertures), that are outside the range of addressible global memory, to
				1435	map from a flat address to a private or local address.
				1436
				1437	FLAT instructions can take a flat address and access global, private (scratch)
				1438	and group (LDS) memory depending in if the address is within one of the
				1439	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1440	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1441	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1442	(see :ref:`amdgpu-amdhsa-m0`).
				1443
				1444	To convert between a segment address and a flat address the base address of the
				1445	appertures address can be used. For GFX7-GFX8 these are available in the
				1446	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1447	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1448	GFX9 the appature base addresses are directly available as inline constant
				1449	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1450	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1451	which makes it easier to convert from flat to segment or segment to flat.
				1452
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1453	Image and Samplers
				1454	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1455
				1456	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1457	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1458	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1459	enumeration values for the queries that are not trivially deducible from the S#
				1460	representation.
				1461
				1462	HSA Signals
				1463	~~~~~~~~~~~
				1464
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1465	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1466	structure allocated in memory accessible from both the CPU and GPU. The
				1467	structure is defined by the ROCm runtime and subject to change between releases
				1468	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1469
				1470	.. _amdgpu-amdhsa-hsa-aql-queue:
				1471
				1472	HSA AQL Queue
				1473	~~~~~~~~~~~~~
				1474
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1475	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1476	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1477	fields needed to implement certain language features such as the flat address
				1478	aperture bases. It also contains fields used by CP such as managing the
				1479	allocation of scratch memory.
				1480
				1481	.. _amdgpu-amdhsa-kernel-descriptor:
				1482
				1483	Kernel Descriptor
				1484	~~~~~~~~~~~~~~~~~
				1485
				1486	A kernel descriptor consists of the information needed by CP to initiate the
				1487	execution of a kernel, including the entry point address of the machine code
				1488	that implements the kernel.
				1489
				1490	Kernel Descriptor for GFX6-GFX9
				1491	+++++++++++++++++++++++++++++++
				1492
				1493	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1494
				1495	.. table:: Kernel Descriptor for GFX6-GFX9
				1496	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1497
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1498	======= ======= =============================== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1499	Bits Size Field Name Description
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1500	======= ======= =============================== ============================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1501	31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1502	address space memory
				1503	required for a work-group
				1504	in bytes. This does not
				1505	include any dynamically
				1506	allocated local address
				1507	space memory that may be
				1508	added when the kernel is
				1509	dispatched.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1510	63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1511	private address space
				1512	memory required for a
				1513	work-item in bytes. If
				1514	is_dynamic_callstack is 1
				1515	then additional space must
				1516	be added to this value for
				1517	the call stack.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1518	127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1519	191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1520	negative) from base
				1521	address of kernel
				1522	descriptor to kernel's
				1523	entry point instruction
				1524	which must be 256 byte
				1525	aligned.
Tony Tye	e039d0e	2018-01-30 23:07:10 +0000	[diff] [blame]	1526	383:192 24 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1527	bytes
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1528	415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1529	program settings used by
				1530	CP to set up
				1531	``COMPUTE_PGM_RSRC1``
				1532	configuration
				1533	register. See
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1534	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1535	447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1536	program settings used by
				1537	CP to set up
				1538	``COMPUTE_PGM_RSRC2``
				1539	configuration
				1540	register. See
				1541	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1542	448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
				1543	SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1544	(see
				1545	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1546
				1547	The total number of SGPR
				1548	user data registers
				1549	requested must not exceed
				1550	16 and match value in
				1551	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1552	Any requests beyond 16
				1553	will be ignored.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1554	449 1 bit EnableSGPRDispatchPtr see above
				1555	450 1 bit EnableSGPRQueuePtr see above
				1556	451 1 bit EnableSGPRKernargSegmentPtr see above
				1557	452 1 bit EnableSGPRDispatchID see above
				1558	453 1 bit EnableSGPRFlatScratchInit see above
				1559	454 1 bit EnableSGPRPrivateSegmentSize see above
				1560	455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
				1561	should always be 0.
				1562	456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
				1563	should always be 0.
				1564	457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
				1565	should always be 0.
Tony Tye	31105cc	2017-12-11 15:35:27 +0000	[diff] [blame]	1566	463:458 6 bits Reserved, must be 0.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1567	511:464 6 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1568	bytes
				1569	512 Total size 64 bytes.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1570	======= ====================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1571
				1572	..
				1573
				1574	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1575	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1576
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1577	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1578	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1579	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1580	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1581	used by each work-item,
				1582	granularity is device
				1583	specific:
				1584
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1585	GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1586	- max_vgpr 1..256
				1587	- roundup((max_vgpg + 1)
				1588	/ 4) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1589
				1590	Used by CP to set up
				1591	``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1592	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1593	used by a wavefront,
				1594	granularity is device
				1595	specific:
				1596
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1597	GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1598	- max_sgpr 1..112
				1599	- roundup((max_sgpg + 1)
				1600	/ 8) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1601	GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1602	- max_sgpr 1..112
				1603	- roundup((max_sgpg + 1)
				1604	/ 16) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1605
				1606	Includes the special SGPRs
				1607	for VCC, Flat Scratch (for
				1608	GFX7 onwards) and XNACK
				1609	(for GFX8 onwards). It does
				1610	not include the 16 SGPR
				1611	added if a trap handler is
				1612	enabled.
				1613
				1614	Used by CP to set up
				1615	``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1616	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1617
				1618	Start executing wavefront
				1619	at the specified priority.
				1620
				1621	CP is responsible for
				1622	filling in
				1623	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1624	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1625	with specified rounding
				1626	mode for single (32
				1627	bit) floating point
				1628	precision floating point
				1629	operations.
				1630
				1631	Floating point rounding
				1632	mode values are defined in
				1633	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1634
				1635	Used by CP to set up
				1636	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1637	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1638	with specified rounding
				1639	denorm mode for half/double (16
				1640	and 64 bit) floating point
				1641	precision floating point
				1642	operations.
				1643
				1644	Floating point rounding
				1645	mode values are defined in
				1646	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1647
				1648	Used by CP to set up
				1649	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1650	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1651	with specified denorm mode
				1652	for single (32
				1653	bit) floating point
				1654	precision floating point
				1655	operations.
				1656
				1657	Floating point denorm mode
				1658	values are defined in
				1659	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1660
				1661	Used by CP to set up
				1662	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1663	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1664	with specified denorm mode
				1665	for half/double (16
				1666	and 64 bit) floating point
				1667	precision floating point
				1668	operations.
				1669
				1670	Floating point denorm mode
				1671	values are defined in
				1672	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1673
				1674	Used by CP to set up
				1675	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1676	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1677
				1678	Start executing wavefront
				1679	in privilege trap handler
				1680	mode.
				1681
				1682	CP is responsible for
				1683	filling in
				1684	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1685	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1686	with DX10 clamp mode
				1687	enabled. Used by the vector
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1688	ALU to force DX10 style
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1689	treatment of NaN's (when
				1690	set, clamp NaN to zero,
				1691	otherwise pass NaN
				1692	through).
				1693
				1694	Used by CP to set up
				1695	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1696	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1697
				1698	Start executing wavefront
				1699	in single step mode.
				1700
				1701	CP is responsible for
				1702	filling in
				1703	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1704	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1705	with IEEE mode
				1706	enabled. Floating point
				1707	opcodes that support
				1708	exception flag gathering
				1709	will quiet and propagate
				1710	signaling-NaN inputs per
				1711	IEEE 754-2008. Min_dx10 and
				1712	max_dx10 become IEEE
				1713	754-2008 compliant due to
				1714	signaling-NaN propagation
				1715	and quieting.
				1716
				1717	Used by CP to set up
				1718	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1719	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1720
				1721	Only one work-group allowed
				1722	to execute on a compute
				1723	unit.
				1724
				1725	CP is responsible for
				1726	filling in
				1727	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1728	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1729
				1730	Flag that can be used to
				1731	control debugging code.
				1732
				1733	CP is responsible for
				1734	filling in
				1735	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1736	26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1737	Reserved, must be 0.
				1738	GFX9
				1739	Wavefront starts execution
				1740	with specified fp16 overflow
				1741	mode.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1742
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1743	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1744	+/-INF values.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1745	- If 1, fp16 overflow that is the
				1746	result of an +/-INF input value
				1747	or divide by 0 produces a +/-INF,
				1748	otherwise clamps computed
				1749	overflow to +/-MAX_FP16 as
				1750	appropriate.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1751
				1752	Used by CP to set up
				1753	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1754	31:27 5 bits Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1755	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1756	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1757
				1758	..
				1759
				1760	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1761	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1762
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1763	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1764	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1765	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1766	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
				1767	_WAVE_OFFSET SGPR wave scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1768	system register (see
				1769	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1770
				1771	Used by CP to set up
				1772	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1773	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1774	user data registers
				1775	requested. This number must
				1776	match the number of user
				1777	data registers enabled.
				1778
				1779	Used by CP to set up
				1780	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1781	6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1782	TRAP instruction which
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1783	requires a trap handler to
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1784	be enabled.
				1785
				1786	CP sets
				1787	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
				1788	if the runtime has
				1789	installed a trap handler
				1790	regardless of the setting
				1791	of this field.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1792	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1793	system SGPR register for
				1794	the work-group id in the X
				1795	dimension (see
				1796	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1797
				1798	Used by CP to set up
				1799	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1800	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1801	system SGPR register for
				1802	the work-group id in the Y
				1803	dimension (see
				1804	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1805
				1806	Used by CP to set up
				1807	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1808	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1809	system SGPR register for
				1810	the work-group id in the Z
				1811	dimension (see
				1812	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1813
				1814	Used by CP to set up
				1815	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1816	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1817	system SGPR register for
				1818	work-group information (see
				1819	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1820
				1821	Used by CP to set up
				1822	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1823	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1824	VGPR system registers used
				1825	for the work-item ID.
				1826	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1827	defines the values.
				1828
				1829	Used by CP to set up
				1830	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1831	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1832
				1833	Wavefront starts execution
				1834	with address watch
				1835	exceptions enabled which
				1836	are generated when L1 has
				1837	witnessed a thread access
				1838	an *address of
				1839	interest*.
				1840
				1841	CP is responsible for
				1842	filling in the address
				1843	watch bit in
				1844	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1845	according to what the
				1846	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1847	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1848
				1849	Wavefront starts execution
				1850	with memory violation
				1851	exceptions exceptions
				1852	enabled which are generated
				1853	when a memory violation has
				1854	occurred for this wave from
				1855	L1 or LDS
				1856	(write-to-read-only-memory,
				1857	mis-aligned atomic, LDS
				1858	address out of range,
				1859	illegal address, etc.).
				1860
				1861	CP sets the memory
				1862	violation bit in
				1863	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1864	according to what the
				1865	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1866	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1867
				1868	CP uses the rounded value
				1869	from the dispatch packet,
				1870	not this value, as the
				1871	dispatch may contain
				1872	dynamically allocated group
				1873	segment memory. CP writes
				1874	directly to
				1875	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1876
				1877	Amount of group segment
				1878	(LDS) to allocate for each
				1879	work-group. Granularity is
				1880	device specific:
				1881
				1882	GFX6:
				1883	roundup(lds-size / (64 * 4))
				1884	GFX7-GFX9:
				1885	roundup(lds-size / (128 * 4))
				1886
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1887	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				1888	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1889	enabled.
				1890
				1891	Used by CP to set up
				1892	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1893	(set from bits 0..6).
				1894
				1895	IEEE 754 FP Invalid
				1896	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1897	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				1898	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1899	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1900	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				1901	_DIVISION_BY_ZERO Zero
				1902	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
				1903	_OVERFLOW
				1904	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				1905	_UNDERFLOW
				1906	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				1907	_INEXACT
				1908	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				1909	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1910	only)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1911	31 1 bit Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1912	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1913	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1914
				1915	..
				1916
				1917	.. table:: Floating Point Rounding Mode Enumeration Values
				1918	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1919
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1920	====================================== ===== ==============================
				1921	Enumeration Name Value Description
				1922	====================================== ===== ==============================
				1923	AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				1924	AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				1925	AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				1926	AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
				1927	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1928
				1929	..
				1930
				1931	.. table:: Floating Point Denorm Mode Enumeration Values
				1932	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				1933
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1934	====================================== ===== ==============================
				1935	Enumeration Name Value Description
				1936	====================================== ===== ==============================
				1937	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
				1938	Denorms
				1939	AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				1940	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				1941	AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
				1942	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1943
				1944	..
				1945
				1946	.. table:: System VGPR Work-Item ID Enumeration Values
				1947	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				1948
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1949	======================================== ===== ============================
				1950	Enumeration Name Value Description
				1951	======================================== ===== ============================
				1952	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
				1953	ID.
				1954	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				1955	dimensions ID.
				1956	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				1957	dimensions ID.
				1958	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				1959	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1960
				1961	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				1962
				1963	Initial Kernel Execution State
				1964	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1965
				1966	This section defines the register state that will be set up by the packet
				1967	processor prior to the start of execution of every wavefront. This is limited by
				1968	the constraints of the hardware controllers of CP/ADC/SPI.
				1969
				1970	The order of the SGPR registers is defined, but the compiler can specify which
				1971	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				1972	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				1973	for enabled registers are dense starting at SGPR0: the first enabled register is
				1974	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				1975	an SGPR number.
				1976
				1977	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
				1978	all waves of the grid. It is possible to specify more than 16 User SGPRs using
				1979	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				1980	initialized. These are then immediately followed by the System SGPRs that are
				1981	set up by ADC/SPI and can have different values for each wave of the grid
				1982	dispatch.
				1983
				1984	SGPR register initial state is defined in
				1985	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				1986
				1987	.. table:: SGPR Register Set Up Order
				1988	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				1989
				1990	========== ========================== ====== ==============================
				1991	SGPR Order Name Number Description
				1992	(kernel descriptor enable of
				1993	field) SGPRs
				1994	========== ========================== ====== ==============================
				1995	First Private Segment Buffer 4 V# that can be used, together
				1996	(enable_sgpr_private with Scratch Wave Offset as an
				1997	_segment_buffer) offset, to access the private
				1998	memory space using a segment
				1999	address.
				2000
				2001	CP uses the value provided by
				2002	the runtime.
				2003	then Dispatch Ptr 2 64 bit address of AQL dispatch
				2004	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				2005	actually executing.
				2006	then Queue Ptr 2 64 bit address of amd_queue_t
				2007	(enable_sgpr_queue_ptr) object for AQL queue on which
				2008	the dispatch packet was
				2009	queued.
				2010	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				2011	(enable_sgpr_kernarg segment. This is directly
				2012	_segment_ptr) copied from the
				2013	kernarg_address in the kernel
				2014	dispatch packet.
				2015
				2016	Having CP load it once avoids
				2017	loading it at the beginning of
				2018	every wavefront.
				2019	then Dispatch Id 2 64 bit Dispatch ID of the
				2020	(enable_sgpr_dispatch_id) dispatch packet being
				2021	executed.
				2022	then Flat Scratch Init 2 This is 2 SGPRs:
				2023	(enable_sgpr_flat_scratch
				2024	_init) GFX6
				2025	Not supported.
				2026	GFX7-GFX8
				2027	The first SGPR is a 32 bit
				2028	byte offset from
				2029	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2030	to per SPI base of memory
				2031	for scratch for the queue
				2032	executing the kernel
				2033	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2034	from the runtime. (The
				2035	Scratch Segment Buffer base
				2036	address is
				2037	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2038	plus this offset.) The value
				2039	of Scratch Wave Offset must
				2040	be added to this offset by
				2041	the kernel machine code,
				2042	right shifted by 8, and
				2043	moved to the FLAT_SCRATCH_HI
				2044	SGPR register.
				2045	FLAT_SCRATCH_HI corresponds
				2046	to SGPRn-4 on GFX7, and
				2047	SGPRn-6 on GFX8 (where SGPRn
				2048	is the highest numbered SGPR
				2049	allocated to the wave).
				2050	FLAT_SCRATCH_HI is
				2051	multiplied by 256 (as it is
				2052	in units of 256 bytes) and
				2053	added to
				2054	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2055	to calculate the per wave
				2056	FLAT SCRATCH BASE in flat
				2057	memory instructions that
				2058	access the scratch
				2059	apperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2060
				2061	The second SGPR is 32 bit
				2062	byte size of a single
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2063	work-item's scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2064	usage. CP obtains this from
				2065	the runtime, and it is
				2066	always a multiple of DWORD.
				2067	CP checks that the value in
				2068	the kernel dispatch packet
				2069	Private Segment Byte Size is
				2070	not larger, and requests the
				2071	runtime to increase the
				2072	queue's scratch size if
				2073	necessary. The kernel code
				2074	must move it to
				2075	FLAT_SCRATCH_LO which is
				2076	SGPRn-3 on GFX7 and SGPRn-5
				2077	on GFX8. FLAT_SCRATCH_LO is
				2078	used as the FLAT SCRATCH
				2079	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2080	instructions. Having CP load
				2081	it once avoids loading it at
				2082	the beginning of every
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2083	wavefront.
				2084	GFX9
				2085	This is the
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2086	64 bit base address of the
				2087	per SPI scratch backing
				2088	memory managed by SPI for
				2089	the queue executing the
				2090	kernel dispatch. CP obtains
				2091	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2092	divides it if there are
				2093	multiple Shader Arrays each
				2094	with its own SPI). The value
				2095	of Scratch Wave Offset must
				2096	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2097	machine code and the result
				2098	moved to the FLAT_SCRATCH
				2099	SGPR which is SGPRn-6 and
				2100	SGPRn-5. It is used as the
				2101	FLAT SCRATCH BASE in flat
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2102	memory instructions.
				2103	then Private Segment Size 1 The 32 bit byte size of a
				2104	(enable_sgpr_private single
				2105	work-item's
				2106	scratch_segment_size) memory
				2107	allocation. This is the
				2108	value from the kernel
				2109	dispatch packet Private
				2110	Segment Byte Size rounded up
				2111	by CP to a multiple of
				2112	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2113
				2114	Having CP load it once avoids
				2115	loading it at the beginning of
				2116	every wavefront.
				2117
				2118	This is not used for
				2119	GFX7-GFX8 since it is the same
				2120	value as the second SGPR of
				2121	Flat Scratch Init. However, it
				2122	may be needed for GFX9 which
				2123	changes the meaning of the
				2124	Flat Scratch Init value.
				2125	then Grid Work-Group Count X 1 32 bit count of the number of
				2126	(enable_sgpr_grid work-groups in the X dimension
				2127	_workgroup_count_X) for the grid being
				2128	executed. Computed from the
				2129	fields in the kernel dispatch
				2130	packet as ((grid_size.x +
				2131	workgroup_size.x - 1) /
				2132	workgroup_size.x).
				2133	then Grid Work-Group Count Y 1 32 bit count of the number of
				2134	(enable_sgpr_grid work-groups in the Y dimension
				2135	_workgroup_count_Y && for the grid being
				2136	less than 16 previous executed. Computed from the
				2137	SGPRs) fields in the kernel dispatch
				2138	packet as ((grid_size.y +
				2139	workgroup_size.y - 1) /
				2140	workgroupSize.y).
				2141
				2142	Only initialized if <16
				2143	previous SGPRs initialized.
				2144	then Grid Work-Group Count Z 1 32 bit count of the number of
				2145	(enable_sgpr_grid work-groups in the Z dimension
				2146	_workgroup_count_Z && for the grid being
				2147	less than 16 previous executed. Computed from the
				2148	SGPRs) fields in the kernel dispatch
				2149	packet as ((grid_size.z +
				2150	workgroup_size.z - 1) /
				2151	workgroupSize.z).
				2152
				2153	Only initialized if <16
				2154	previous SGPRs initialized.
				2155	then Work-Group Id X 1 32 bit work-group id in X
				2156	(enable_sgpr_workgroup_id dimension of grid for
				2157	_X) wavefront.
				2158	then Work-Group Id Y 1 32 bit work-group id in Y
				2159	(enable_sgpr_workgroup_id dimension of grid for
				2160	_Y) wavefront.
				2161	then Work-Group Id Z 1 32 bit work-group id in Z
				2162	(enable_sgpr_workgroup_id dimension of grid for
				2163	_Z) wavefront.
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2164	then Work-Group Info 1 {first_wave, 14'b0000,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2165	(enable_sgpr_workgroup ordered_append_term[10:0],
				2166	_info) threadgroup_size_in_waves[5:0]}
				2167	then Scratch Wave Offset 1 32 bit byte offset from base
				2168	(enable_sgpr_private of scratch base of queue
				2169	_segment_wave_offset) executing the kernel
				2170	dispatch. Must be used as an
				2171	offset with Private
				2172	segment address when using
				2173	Scratch Segment Buffer. It
				2174	must be used to set up FLAT
				2175	SCRATCH for flat addressing
				2176	(see
				2177	:ref:`amdgpu-amdhsa-flat-scratch`).
				2178	========== ========================== ====== ==============================
				2179
				2180	The order of the VGPR registers is defined, but the compiler can specify which
				2181	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2182	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2183	for enabled registers are dense starting at VGPR0: the first enabled register is
				2184	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2185	VGPR number.
				2186
				2187	VGPR register initial state is defined in
				2188	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2189
				2190	.. table:: VGPR Register Set Up Order
				2191	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2192
				2193	========== ========================== ====== ==============================
				2194	VGPR Order Name Number Description
				2195	(kernel descriptor enable of
				2196	field) VGPRs
				2197	========== ========================== ====== ==============================
				2198	First Work-Item Id X 1 32 bit work item id in X
				2199	(Always initialized) dimension of work-group for
				2200	wavefront lane.
				2201	then Work-Item Id Y 1 32 bit work item id in Y
				2202	(enable_vgpr_workitem_id dimension of work-group for
				2203	> 0) wavefront lane.
				2204	then Work-Item Id Z 1 32 bit work item id in Z
				2205	(enable_vgpr_workitem_id dimension of work-group for
				2206	> 1) wavefront lane.
				2207	========== ========================== ====== ==============================
				2208
				2209	The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
				2210
				2211	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2212	registers.
				2213	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2214	combination including none.
				2215	3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
				2216	cannot included with the flat scratch init value which is per queue.
				2217	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2218	or (X, Y, Z).
				2219
				2220	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2221	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2222
				2223	The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2224	has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2225	instructions (GFX9).
				2226
				2227	If buffer operations are used then the compiler can generate a V# with the
				2228	following properties:
				2229
				2230	* base address of 0
				2231	* no swizzle
				2232	* ATC: 1 if IOMMU present (such as APU)
				2233	* ptr64: 1
				2234	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2235	APU and NC for dGPU).
				2236
				2237	.. _amdgpu-amdhsa-kernel-prolog:
				2238
				2239	Kernel Prolog
				2240	~~~~~~~~~~~~~
				2241
				2242	.. _amdgpu-amdhsa-m0:
				2243
				2244	M0
				2245	++
				2246
				2247	GFX6-GFX8
				2248	The M0 register must be initialized with a value at least the total LDS size
				2249	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2250	available in dispatch packet. For M0, it is also possible to use maximum
				2251	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2252	GFX7-GFX8).
				2253	GFX9
				2254	The M0 register is not used for range checking LDS accesses and so does not
				2255	need to be initialized in the prolog.
				2256
				2257	.. _amdgpu-amdhsa-flat-scratch:
				2258
				2259	Flat Scratch
				2260	++++++++++++
				2261
				2262	If the kernel may use flat operations to access scratch memory, the prolog code
				2263	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
				2264	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
				2265	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2266
				2267	GFX6
				2268	Flat scratch is not supported.
				2269
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2270	GFX7-GFX8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2271	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2272	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2273	being managed by SPI for the queue executing the kernel dispatch. This is
				2274	the same value used in the Scratch Segment Buffer V# base address. The
				2275	prolog must add the value of Scratch Wave Offset to get the wave's byte
				2276	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2277	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2278	by 8 before moving into FLAT_SCRATCH_LO.
				2279	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2280	work-items scratch memory usage. This is directly loaded from the kernel
				2281	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2282	DWORD. Having CP load it once avoids loading it at the beginning of every
				2283	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2284	SIZE.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2285
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2286	GFX9
				2287	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2288	memory being managed by SPI for the queue executing the kernel dispatch. The
				2289	prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
				2290	pair for use as the flat scratch base in flat memory instructions.
				2291
				2292	.. _amdgpu-amdhsa-memory-model:
				2293
				2294	Memory Model
				2295	~~~~~~~~~~~~
				2296
				2297	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2298	(see :ref:`memmodel`). The implementation is WIP.
				2299
				2300	.. TODO
				2301	Update when implementation complete.
				2302
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2303	The AMDGPU backend supports the memory synchronization scopes specified in
				2304	:ref:`amdgpu-memory-scopes`.
				2305
				2306	The code sequences used to implement the memory model are defined in table
				2307	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2308
				2309	The sequences specify the order of instructions that a single thread must
				2310	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2311	to other memory instructions executed by the same thread. This allows them to be
				2312	moved earlier or later which can allow them to be combined with other instances
				2313	of the same instruction, or hoisted/sunk out of loops to improve
				2314	performance. Only the instructions related to the memory model are given;
				2315	additional ``s_waitcnt`` instructions are required to ensure registers are
				2316	defined before being used. These may be able to be combined with the memory
				2317	model ``s_waitcnt`` instructions as described above.
				2318
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2319	The AMDGPU backend supports the following memory models:
				2320
				2321	HSA Memory Model [HSA]_
				2322	The HSA memory model uses a single happens-before relation for all address
				2323	spaces (see :ref:`amdgpu-address-spaces`).
				2324	OpenCL Memory Model [OpenCL]_
				2325	The OpenCL memory model which has separate happens-before relations for the
				2326	global and local address spaces. Only a fence specifying both global and
				2327	local address space, and seq_cst instructions join the relationships. Since
				2328	the LLVM ``memfence`` instruction does not allow an address space to be
				2329	specified the OpenCL fence has to convervatively assume both local and
				2330	global address space was specified. However, optimizations can often be
				2331	done to eliminate the additional ``s_waitcnt`` instructions when there are
				2332	no intervening memory instructions which access the corresponding address
				2333	space. The code sequences in the table indicate what can be omitted for the
				2334	OpenCL memory. The target triple environment is used to determine if the
				2335	source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2336
				2337	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2338	operations.
				2339
				2340	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2341	termed vector memory operations.
				2342
				2343	For GFX6-GFX9:
				2344
				2345	* Each agent has multiple compute units (CU).
				2346	* Each CU has multiple SIMDs that execute wavefronts.
				2347	* The wavefronts for a single work-group are executed in the same CU but may be
				2348	executed by different SIMDs.
				2349	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2350	executing on it.
				2351	* All LDS operations of a CU are performed as wavefront wide operations in a
				2352	global order and involve no caching. Completion is reported to a wavefront in
				2353	execution order.
				2354	* The LDS memory has multiple request queues shared by the SIMDs of a
				2355	CU. Therefore, the LDS operations performed by different waves of a work-group
				2356	can be reordered relative to each other, which can result in reordering the
				2357	visibility of vector memory operations with respect to LDS operations of other
				2358	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2359	ensure synchronization between LDS operations and vector memory operations
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2360	between waves of a work-group, but not between operations performed by the
				2361	same wavefront.
				2362	* The vector memory operations are performed as wavefront wide operations and
				2363	completion is reported to a wavefront in execution order. The exception is
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2364	that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2365	vector memory order if they access LDS memory, and out of LDS operation order
				2366	if they access global memory.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2367	* The vector memory operations access a single vector L1 cache shared by all
				2368	SIMDs a CU. Therefore, no special action is required for coherence between the
				2369	lanes of a single wavefront, or for coherence between wavefronts in the same
				2370	work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
				2371	executing in different work-groups as they may be executing on different CUs.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2372	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2373	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2374	scalar operations are used in a restricted way so do not impact the memory
				2375	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2376	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2377	the same agent.
				2378	* The L2 cache has independent channels to service disjoint ranges of virtual
				2379	addresses.
				2380	* Each CU has a separate request queue per channel. Therefore, the vector and
				2381	scalar memory operations performed by waves executing in different work-groups
				2382	(which may be executing on different CUs) of an agent can be reordered
				2383	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2384	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2385	previous vector memory operation has completed before executing a subsequent
				2386	vector memory or LDS operation and so can be used to meet the requirements of
				2387	acquire and release.
				2388	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2389	of virtual addresses can be set up to bypass it to ensure system coherence.
				2390
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2391	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2392	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2393	memory, atomic memory orderings are not meaningful and all accesses are treated
				2394	as non-atomic.
				2395
				2396	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2397	scalar memory instructions). Since the constant address space contents do not
				2398	change during the execution of a kernel dispatch it is not legal to perform
				2399	stores, and atomic memory orderings are not meaningful and all access are
				2400	treated as non-atomic.
				2401
				2402	A memory synchronization scope wider than work-group is not meaningful for the
				2403	group (LDS) address space and is treated as work-group.
				2404
				2405	The memory model does not support the region address space which is treated as
				2406	non-atomic.
				2407
				2408	Acquire memory ordering is not meaningful on store atomic instructions and is
				2409	treated as non-atomic.
				2410
				2411	Release memory ordering is not meaningful on load atomic instructions and is
				2412	treated a non-atomic.
				2413
				2414	Acquire-release memory ordering is not meaningful on load or store atomic
				2415	instructions and is treated as acquire and release respectively.
				2416
				2417	AMDGPU backend only uses scalar memory operations to access memory that is
				2418	proven to not change during the execution of the kernel dispatch. This includes
				2419	constant address space and global address space for program scope const
				2420	variables. Therefore the kernel machine code does not have to maintain the
				2421	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2422	and vector L1 caches are invalidated between kernel dispatches by CP since
				2423	constant address space data may change between kernel dispatch executions. See
				2424	:ref:`amdgpu-amdhsa-memory-spaces`.
				2425
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2426	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2427	case the AMDGPU backend ensures the memory location used to spill is never
				2428	accessed by vector memory operations at the same time. If scalar writes are used
				2429	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2430	return since the locations may be used for vector memory instructions by a
				2431	future wave that uses the same scratch area, or a function call that creates a
				2432	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2433	as all scalar writes are write-before-read in the same thread.
				2434
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2435	Scratch backing memory (which is used for the private address space)
				2436	is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
				2437	address space is only accessed by a single thread, and is always
				2438	write-before-read, there is never a need to invalidate these entries from the L1
				2439	cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
				2440	volatile cache lines.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2441
				2442	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2443	to invalidate the L2 cache. This also causes it to be treated as
				2444	non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
				2445	(cache coherent) and so the L2 cache will coherent with the CPU and other
				2446	agents.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2447
				2448	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2449	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2450
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2451	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2452	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2453	Ordering Sync Scope Address
				2454	Space
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2455	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2456	Non-Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2457	-----------------------------------------------------------------------------------
				2458	load none none - global - !volatile & !nontemporal
				2459	- generic
				2460	- private 1. buffer/global/flat_load
				2461	- constant
				2462	- volatile & !nontemporal
				2463
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2464	1. buffer/global/flat_load
				2465	glc=1
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2466
				2467	- nontemporal
				2468
				2469	1. buffer/global/flat_load
				2470	glc=1 slc=1
				2471
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2472	load none none - local 1. ds_load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2473	store none none - global - !nontemporal
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2474	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2475	- private 1. buffer/global/flat_store
				2476	- constant
				2477	- nontemporal
				2478
				2479	1. buffer/global/flat_stote
				2480	glc=1 slc=1
				2481
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2482	store none none - local 1. ds_store
				2483	Unordered Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2484	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2485	load atomic unordered any any Same as non-atomic.
				2486	store atomic unordered any any Same as non-atomic.
				2487	atomicrmw unordered any any *Same as monotonic
				2488	atomic*.
				2489	Monotonic Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2490	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2491	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2492	- wavefront - generic
				2493	- workgroup
				2494	load atomic monotonic - singlethread - local 1. ds_load
				2495	- wavefront
				2496	- workgroup
				2497	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2498	- system - generic glc=1
				2499	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2500	- wavefront - generic
				2501	- workgroup
				2502	- agent
				2503	- system
				2504	store atomic monotonic - singlethread - local 1. ds_store
				2505	- wavefront
				2506	- workgroup
				2507	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2508	- wavefront - generic
				2509	- workgroup
				2510	- agent
				2511	- system
				2512	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2513	- wavefront
				2514	- workgroup
				2515	Acquire Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2516	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2517	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2518	- wavefront - local
				2519	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2520	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				2521	load atomic acquire - workgroup - local 1. ds_load
				2522	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2523
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2524	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2525	- Must happen before
				2526	any following
				2527	global/generic
				2528	load/load
				2529	atomic/store/store
				2530	atomic/atomicrmw.
				2531	- Ensures any
				2532	following global
				2533	data read is no
				2534	older than the load
				2535	atomic value being
				2536	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2537	load atomic acquire - workgroup - generic 1. flat_load
				2538	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2539
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2540	- If OpenCL, omit.
				2541	- Must happen before
				2542	any following
				2543	global/generic
				2544	load/load
				2545	atomic/store/store
				2546	atomic/atomicrmw.
				2547	- Ensures any
				2548	following global
				2549	data read is no
				2550	older than the load
				2551	atomic value being
				2552	acquired.
				2553	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2554	- system glc=1
				2555	2. s_waitcnt vmcnt(0)
				2556
				2557	- Must happen before
				2558	following
				2559	buffer_wbinvl1_vol.
				2560	- Ensures the load
				2561	has completed
				2562	before invalidating
				2563	the cache.
				2564
				2565	3. buffer_wbinvl1_vol
				2566
				2567	- Must happen before
				2568	any following
				2569	global/generic
				2570	load/load
				2571	atomic/atomicrmw.
				2572	- Ensures that
				2573	following
				2574	loads will not see
				2575	stale global data.
				2576
				2577	load atomic acquire - agent - generic 1. flat_load glc=1
				2578	- system 2. s_waitcnt vmcnt(0) &
				2579	lgkmcnt(0)
				2580
				2581	- If OpenCL omit
				2582	lgkmcnt(0).
				2583	- Must happen before
				2584	following
				2585	buffer_wbinvl1_vol.
				2586	- Ensures the flat_load
				2587	has completed
				2588	before invalidating
				2589	the cache.
				2590
				2591	3. buffer_wbinvl1_vol
				2592
				2593	- Must happen before
				2594	any following
				2595	global/generic
				2596	load/load
				2597	atomic/atomicrmw.
				2598	- Ensures that
				2599	following loads
				2600	will not see stale
				2601	global data.
				2602
				2603	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2604	- wavefront - local
				2605	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2606	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				2607	atomicrmw acquire - workgroup - local 1. ds_atomic
				2608	2. waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2609
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2610	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2611	- Must happen before
				2612	any following
				2613	global/generic
				2614	load/load
				2615	atomic/store/store
				2616	atomic/atomicrmw.
				2617	- Ensures any
				2618	following global
				2619	data read is no
				2620	older than the
				2621	atomicrmw value
				2622	being acquired.
				2623
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2624	atomicrmw acquire - workgroup - generic 1. flat_atomic
				2625	2. waitcnt lgkmcnt(0)
				2626
				2627	- If OpenCL, omit.
				2628	- Must happen before
				2629	any following
				2630	global/generic
				2631	load/load
				2632	atomic/store/store
				2633	atomic/atomicrmw.
				2634	- Ensures any
				2635	following global
				2636	data read is no
				2637	older than the
				2638	atomicrmw value
				2639	being acquired.
				2640
				2641	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2642	- system 2. s_waitcnt vmcnt(0)
				2643
				2644	- Must happen before
				2645	following
				2646	buffer_wbinvl1_vol.
				2647	- Ensures the
				2648	atomicrmw has
				2649	completed before
				2650	invalidating the
				2651	cache.
				2652
				2653	3. buffer_wbinvl1_vol
				2654
				2655	- Must happen before
				2656	any following
				2657	global/generic
				2658	load/load
				2659	atomic/atomicrmw.
				2660	- Ensures that
				2661	following loads
				2662	will not see stale
				2663	global data.
				2664
				2665	atomicrmw acquire - agent - generic 1. flat_atomic
				2666	- system 2. s_waitcnt vmcnt(0) &
				2667	lgkmcnt(0)
				2668
				2669	- If OpenCL, omit
				2670	lgkmcnt(0).
				2671	- Must happen before
				2672	following
				2673	buffer_wbinvl1_vol.
				2674	- Ensures the
				2675	atomicrmw has
				2676	completed before
				2677	invalidating the
				2678	cache.
				2679
				2680	3. buffer_wbinvl1_vol
				2681
				2682	- Must happen before
				2683	any following
				2684	global/generic
				2685	load/load
				2686	atomic/atomicrmw.
				2687	- Ensures that
				2688	following loads
				2689	will not see stale
				2690	global data.
				2691
				2692	fence acquire - singlethread none none
				2693	- wavefront
				2694	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2695
				2696	- If OpenCL and
				2697	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2698	not generic, omit.
				2699	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2700	currently has no
				2701	address space on
				2702	the fence need to
				2703	conservatively
				2704	always generate. If
				2705	fence had an
				2706	address space then
				2707	set to address
				2708	space of OpenCL
				2709	fence flag, or to
				2710	generic if both
				2711	local and global
				2712	flags are
				2713	specified.
				2714	- Must happen after
				2715	any preceding
				2716	local/generic load
				2717	atomic/atomicrmw
				2718	with an equal or
				2719	wider sync scope
				2720	and memory ordering
				2721	stronger than
				2722	unordered (this is
				2723	termed the
				2724	fence-paired-atomic).
				2725	- Must happen before
				2726	any following
				2727	global/generic
				2728	load/load
				2729	atomic/store/store
				2730	atomic/atomicrmw.
				2731	- Ensures any
				2732	following global
				2733	data read is no
				2734	older than the
				2735	value read by the
				2736	fence-paired-atomic.
				2737
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2738	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				2739	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2740
				2741	- If OpenCL and
				2742	address space is
				2743	not generic, omit
				2744	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2745	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2746	currently has no
				2747	address space on
				2748	the fence need to
				2749	conservatively
				2750	always generate
				2751	(see comment for
				2752	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2753	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2754	separate s_waitcnt
				2755	vmcnt(0) and
				2756	s_waitcnt
				2757	lgkmcnt(0) to allow
				2758	them to be
				2759	independently moved
				2760	according to the
				2761	following rules.
				2762	- s_waitcnt vmcnt(0)
				2763	must happen after
				2764	any preceding
				2765	global/generic load
				2766	atomic/atomicrmw
				2767	with an equal or
				2768	wider sync scope
				2769	and memory ordering
				2770	stronger than
				2771	unordered (this is
				2772	termed the
				2773	fence-paired-atomic).
				2774	- s_waitcnt lgkmcnt(0)
				2775	must happen after
				2776	any preceding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2777	local/generic load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2778	atomic/atomicrmw
				2779	with an equal or
				2780	wider sync scope
				2781	and memory ordering
				2782	stronger than
				2783	unordered (this is
				2784	termed the
				2785	fence-paired-atomic).
				2786	- Must happen before
				2787	the following
				2788	buffer_wbinvl1_vol.
				2789	- Ensures that the
				2790	fence-paired atomic
				2791	has completed
				2792	before invalidating
				2793	the
				2794	cache. Therefore
				2795	any following
				2796	locations read must
				2797	be no older than
				2798	the value read by
				2799	the
				2800	fence-paired-atomic.
				2801
				2802	2. buffer_wbinvl1_vol
				2803
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2804	- Must happen before any
				2805	following global/generic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2806	load/load
				2807	atomic/store/store
				2808	atomic/atomicrmw.
				2809	- Ensures that
				2810	following loads
				2811	will not see stale
				2812	global data.
				2813
				2814	Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2815	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2816	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2817	- wavefront - local
				2818	- generic
				2819	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2820
				2821	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2822	- Must happen after
				2823	any preceding
				2824	local/generic
				2825	load/store/load
				2826	atomic/store
				2827	atomic/atomicrmw.
				2828	- Must happen before
				2829	the following
				2830	store.
				2831	- Ensures that all
				2832	memory operations
				2833	to local have
				2834	completed before
				2835	performing the
				2836	store that is being
				2837	released.
				2838
				2839	2. buffer/global/flat_store
				2840	store atomic release - workgroup - local 1. ds_store
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2841	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2842
				2843	- If OpenCL, omit.
				2844	- Must happen after
				2845	any preceding
				2846	local/generic
				2847	load/store/load
				2848	atomic/store
				2849	atomic/atomicrmw.
				2850	- Must happen before
				2851	the following
				2852	store.
				2853	- Ensures that all
				2854	memory operations
				2855	to local have
				2856	completed before
				2857	performing the
				2858	store that is being
				2859	released.
				2860
				2861	2. flat_store
				2862	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2863	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2864
				2865	- If OpenCL, omit
				2866	lgkmcnt(0).
				2867	- Could be split into
				2868	separate s_waitcnt
				2869	vmcnt(0) and
				2870	s_waitcnt
				2871	lgkmcnt(0) to allow
				2872	them to be
				2873	independently moved
				2874	according to the
				2875	following rules.
				2876	- s_waitcnt vmcnt(0)
				2877	must happen after
				2878	any preceding
				2879	global/generic
				2880	load/store/load
				2881	atomic/store
				2882	atomic/atomicrmw.
				2883	- s_waitcnt lgkmcnt(0)
				2884	must happen after
				2885	any preceding
				2886	local/generic
				2887	load/store/load
				2888	atomic/store
				2889	atomic/atomicrmw.
				2890	- Must happen before
				2891	the following
				2892	store.
				2893	- Ensures that all
				2894	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2895	to memory have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2896	completed before
				2897	performing the
				2898	store that is being
				2899	released.
				2900
				2901	2. buffer/global/ds/flat_store
				2902	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2903	- wavefront - local
				2904	- generic
				2905	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2906
				2907	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2908	- Must happen after
				2909	any preceding
				2910	local/generic
				2911	load/store/load
				2912	atomic/store
				2913	atomic/atomicrmw.
				2914	- Must happen before
				2915	the following
				2916	atomicrmw.
				2917	- Ensures that all
				2918	memory operations
				2919	to local have
				2920	completed before
				2921	performing the
				2922	atomicrmw that is
				2923	being released.
				2924
				2925	2. buffer/global/flat_atomic
				2926	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2927	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2928
				2929	- If OpenCL, omit.
				2930	- Must happen after
				2931	any preceding
				2932	local/generic
				2933	load/store/load
				2934	atomic/store
				2935	atomic/atomicrmw.
				2936	- Must happen before
				2937	the following
				2938	atomicrmw.
				2939	- Ensures that all
				2940	memory operations
				2941	to local have
				2942	completed before
				2943	performing the
				2944	atomicrmw that is
				2945	being released.
				2946
				2947	2. flat_atomic
				2948	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2949	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2950
				2951	- If OpenCL, omit
				2952	lgkmcnt(0).
				2953	- Could be split into
				2954	separate s_waitcnt
				2955	vmcnt(0) and
				2956	s_waitcnt
				2957	lgkmcnt(0) to allow
				2958	them to be
				2959	independently moved
				2960	according to the
				2961	following rules.
				2962	- s_waitcnt vmcnt(0)
				2963	must happen after
				2964	any preceding
				2965	global/generic
				2966	load/store/load
				2967	atomic/store
				2968	atomic/atomicrmw.
				2969	- s_waitcnt lgkmcnt(0)
				2970	must happen after
				2971	any preceding
				2972	local/generic
				2973	load/store/load
				2974	atomic/store
				2975	atomic/atomicrmw.
				2976	- Must happen before
				2977	the following
				2978	atomicrmw.
				2979	- Ensures that all
				2980	memory operations
				2981	to global and local
				2982	have completed
				2983	before performing
				2984	the atomicrmw that
				2985	is being released.
				2986
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2987	2. buffer/global/ds/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2988	fence release - singlethread none none
				2989	- wavefront
				2990	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				2991
				2992	- If OpenCL and
				2993	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2994	not generic, omit.
				2995	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2996	currently has no
				2997	address space on
				2998	the fence need to
				2999	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3000	always generate. If
				3001	fence had an
				3002	address space then
				3003	set to address
				3004	space of OpenCL
				3005	fence flag, or to
				3006	generic if both
				3007	local and global
				3008	flags are
				3009	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3010	- Must happen after
				3011	any preceding
				3012	local/generic
				3013	load/load
				3014	atomic/store/store
				3015	atomic/atomicrmw.
				3016	- Must happen before
				3017	any following store
				3018	atomic/atomicrmw
				3019	with an equal or
				3020	wider sync scope
				3021	and memory ordering
				3022	stronger than
				3023	unordered (this is
				3024	termed the
				3025	fence-paired-atomic).
				3026	- Ensures that all
				3027	memory operations
				3028	to local have
				3029	completed before
				3030	performing the
				3031	following
				3032	fence-paired-atomic.
				3033
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3034	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				3035	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3036
				3037	- If OpenCL and
				3038	address space is
				3039	not generic, omit
				3040	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3041	- If OpenCL and
				3042	address space is
				3043	local, omit
				3044	vmcnt(0).
				3045	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3046	currently has no
				3047	address space on
				3048	the fence need to
				3049	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3050	always generate. If
				3051	fence had an
				3052	address space then
				3053	set to address
				3054	space of OpenCL
				3055	fence flag, or to
				3056	generic if both
				3057	local and global
				3058	flags are
				3059	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3060	- Could be split into
				3061	separate s_waitcnt
				3062	vmcnt(0) and
				3063	s_waitcnt
				3064	lgkmcnt(0) to allow
				3065	them to be
				3066	independently moved
				3067	according to the
				3068	following rules.
				3069	- s_waitcnt vmcnt(0)
				3070	must happen after
				3071	any preceding
				3072	global/generic
				3073	load/store/load
				3074	atomic/store
				3075	atomic/atomicrmw.
				3076	- s_waitcnt lgkmcnt(0)
				3077	must happen after
				3078	any preceding
				3079	local/generic
				3080	load/store/load
				3081	atomic/store
				3082	atomic/atomicrmw.
				3083	- Must happen before
				3084	any following store
				3085	atomic/atomicrmw
				3086	with an equal or
				3087	wider sync scope
				3088	and memory ordering
				3089	stronger than
				3090	unordered (this is
				3091	termed the
				3092	fence-paired-atomic).
				3093	- Ensures that all
				3094	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3095	have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3096	completed before
				3097	performing the
				3098	following
				3099	fence-paired-atomic.
				3100
				3101	Acquire-Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3102	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3103	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3104	- wavefront - local
				3105	- generic
				3106	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3107
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3108	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3109	- Must happen after
				3110	any preceding
				3111	local/generic
				3112	load/store/load
				3113	atomic/store
				3114	atomic/atomicrmw.
				3115	- Must happen before
				3116	the following
				3117	atomicrmw.
				3118	- Ensures that all
				3119	memory operations
				3120	to local have
				3121	completed before
				3122	performing the
				3123	atomicrmw that is
				3124	being released.
				3125
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3126	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3127	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3128	2. s_waitcnt lgkmcnt(0)
				3129
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3130	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3131	- Must happen before
				3132	any following
				3133	global/generic
				3134	load/load
				3135	atomic/store/store
				3136	atomic/atomicrmw.
				3137	- Ensures any
				3138	following global
				3139	data read is no
				3140	older than the load
				3141	atomic value being
				3142	acquired.
				3143
				3144	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3145
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3146	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3147	- Must happen after
				3148	any preceding
				3149	local/generic
				3150	load/store/load
				3151	atomic/store
				3152	atomic/atomicrmw.
				3153	- Must happen before
				3154	the following
				3155	atomicrmw.
				3156	- Ensures that all
				3157	memory operations
				3158	to local have
				3159	completed before
				3160	performing the
				3161	atomicrmw that is
				3162	being released.
				3163
				3164	2. flat_atomic
				3165	3. s_waitcnt lgkmcnt(0)
				3166
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3167	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3168	- Must happen before
				3169	any following
				3170	global/generic
				3171	load/load
				3172	atomic/store/store
				3173	atomic/atomicrmw.
				3174	- Ensures any
				3175	following global
				3176	data read is no
				3177	older than the load
				3178	atomic value being
				3179	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3180
				3181	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3182	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3183
				3184	- If OpenCL, omit
				3185	lgkmcnt(0).
				3186	- Could be split into
				3187	separate s_waitcnt
				3188	vmcnt(0) and
				3189	s_waitcnt
				3190	lgkmcnt(0) to allow
				3191	them to be
				3192	independently moved
				3193	according to the
				3194	following rules.
				3195	- s_waitcnt vmcnt(0)
				3196	must happen after
				3197	any preceding
				3198	global/generic
				3199	load/store/load
				3200	atomic/store
				3201	atomic/atomicrmw.
				3202	- s_waitcnt lgkmcnt(0)
				3203	must happen after
				3204	any preceding
				3205	local/generic
				3206	load/store/load
				3207	atomic/store
				3208	atomic/atomicrmw.
				3209	- Must happen before
				3210	the following
				3211	atomicrmw.
				3212	- Ensures that all
				3213	memory operations
				3214	to global have
				3215	completed before
				3216	performing the
				3217	atomicrmw that is
				3218	being released.
				3219
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3220	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3221	3. s_waitcnt vmcnt(0)
				3222
				3223	- Must happen before
				3224	following
				3225	buffer_wbinvl1_vol.
				3226	- Ensures the
				3227	atomicrmw has
				3228	completed before
				3229	invalidating the
				3230	cache.
				3231
				3232	4. buffer_wbinvl1_vol
				3233
				3234	- Must happen before
				3235	any following
				3236	global/generic
				3237	load/load
				3238	atomic/atomicrmw.
				3239	- Ensures that
				3240	following loads
				3241	will not see stale
				3242	global data.
				3243
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3244	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3245	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3246
				3247	- If OpenCL, omit
				3248	lgkmcnt(0).
				3249	- Could be split into
				3250	separate s_waitcnt
				3251	vmcnt(0) and
				3252	s_waitcnt
				3253	lgkmcnt(0) to allow
				3254	them to be
				3255	independently moved
				3256	according to the
				3257	following rules.
				3258	- s_waitcnt vmcnt(0)
				3259	must happen after
				3260	any preceding
				3261	global/generic
				3262	load/store/load
				3263	atomic/store
				3264	atomic/atomicrmw.
				3265	- s_waitcnt lgkmcnt(0)
				3266	must happen after
				3267	any preceding
				3268	local/generic
				3269	load/store/load
				3270	atomic/store
				3271	atomic/atomicrmw.
				3272	- Must happen before
				3273	the following
				3274	atomicrmw.
				3275	- Ensures that all
				3276	memory operations
				3277	to global have
				3278	completed before
				3279	performing the
				3280	atomicrmw that is
				3281	being released.
				3282
				3283	2. flat_atomic
				3284	3. s_waitcnt vmcnt(0) &
				3285	lgkmcnt(0)
				3286
				3287	- If OpenCL, omit
				3288	lgkmcnt(0).
				3289	- Must happen before
				3290	following
				3291	buffer_wbinvl1_vol.
				3292	- Ensures the
				3293	atomicrmw has
				3294	completed before
				3295	invalidating the
				3296	cache.
				3297
				3298	4. buffer_wbinvl1_vol
				3299
				3300	- Must happen before
				3301	any following
				3302	global/generic
				3303	load/load
				3304	atomic/atomicrmw.
				3305	- Ensures that
				3306	following loads
				3307	will not see stale
				3308	global data.
				3309
				3310	fence acq_rel - singlethread none none
				3311	- wavefront
				3312	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3313
				3314	- If OpenCL and
				3315	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3316	not generic, omit.
				3317	- However,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3318	since LLVM
				3319	currently has no
				3320	address space on
				3321	the fence need to
				3322	conservatively
				3323	always generate
				3324	(see comment for
				3325	previous fence).
				3326	- Must happen after
				3327	any preceding
				3328	local/generic
				3329	load/load
				3330	atomic/store/store
				3331	atomic/atomicrmw.
				3332	- Must happen before
				3333	any following
				3334	global/generic
				3335	load/load
				3336	atomic/store/store
				3337	atomic/atomicrmw.
				3338	- Ensures that all
				3339	memory operations
				3340	to local have
				3341	completed before
				3342	performing any
				3343	following global
				3344	memory operations.
				3345	- Ensures that the
				3346	preceding
				3347	local/generic load
				3348	atomic/atomicrmw
				3349	with an equal or
				3350	wider sync scope
				3351	and memory ordering
				3352	stronger than
				3353	unordered (this is
				3354	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3355	acquire-fence-paired-atomic
				3356	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3357	before following
				3358	global memory
				3359	operations. This
				3360	satisfies the
				3361	requirements of
				3362	acquire.
				3363	- Ensures that all
				3364	previous memory
				3365	operations have
				3366	completed before a
				3367	following
				3368	local/generic store
				3369	atomic/atomicrmw
				3370	with an equal or
				3371	wider sync scope
				3372	and memory ordering
				3373	stronger than
				3374	unordered (this is
				3375	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3376	release-fence-paired-atomic
				3377	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3378	requirements of
				3379	release.
				3380
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3381	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				3382	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3383
				3384	- If OpenCL and
				3385	address space is
				3386	not generic, omit
				3387	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3388	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3389	currently has no
				3390	address space on
				3391	the fence need to
				3392	conservatively
				3393	always generate
				3394	(see comment for
				3395	previous fence).
				3396	- Could be split into
				3397	separate s_waitcnt
				3398	vmcnt(0) and
				3399	s_waitcnt
				3400	lgkmcnt(0) to allow
				3401	them to be
				3402	independently moved
				3403	according to the
				3404	following rules.
				3405	- s_waitcnt vmcnt(0)
				3406	must happen after
				3407	any preceding
				3408	global/generic
				3409	load/store/load
				3410	atomic/store
				3411	atomic/atomicrmw.
				3412	- s_waitcnt lgkmcnt(0)
				3413	must happen after
				3414	any preceding
				3415	local/generic
				3416	load/store/load
				3417	atomic/store
				3418	atomic/atomicrmw.
				3419	- Must happen before
				3420	the following
				3421	buffer_wbinvl1_vol.
				3422	- Ensures that the
				3423	preceding
				3424	global/local/generic
				3425	load
				3426	atomic/atomicrmw
				3427	with an equal or
				3428	wider sync scope
				3429	and memory ordering
				3430	stronger than
				3431	unordered (this is
				3432	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3433	acquire-fence-paired-atomic
				3434	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3435	before invalidating
				3436	the cache. This
				3437	satisfies the
				3438	requirements of
				3439	acquire.
				3440	- Ensures that all
				3441	previous memory
				3442	operations have
				3443	completed before a
				3444	following
				3445	global/local/generic
				3446	store
				3447	atomic/atomicrmw
				3448	with an equal or
				3449	wider sync scope
				3450	and memory ordering
				3451	stronger than
				3452	unordered (this is
				3453	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3454	release-fence-paired-atomic
				3455	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3456	requirements of
				3457	release.
				3458
				3459	2. buffer_wbinvl1_vol
				3460
				3461	- Must happen before
				3462	any following
				3463	global/generic
				3464	load/load
				3465	atomic/store/store
				3466	atomic/atomicrmw.
				3467	- Ensures that
				3468	following loads
				3469	will not see stale
				3470	global data. This
				3471	satisfies the
				3472	requirements of
				3473	acquire.
				3474
				3475	Sequential Consistent Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3476	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3477	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3478	- wavefront - local load atomic acquire,
				3479	- generic except must generated
				3480	all instructions even
				3481	for OpenCL.*
				3482	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3483	- generic
				3484	- Must
				3485	happen after
				3486	preceding
				3487	global/generic load
				3488	atomic/store
				3489	atomic/atomicrmw
				3490	with memory
				3491	ordering of seq_cst
				3492	and with equal or
				3493	wider sync scope.
				3494	(Note that seq_cst
				3495	fences have their
				3496	own s_waitcnt
				3497	lgkmcnt(0) and so do
				3498	not need to be
				3499	considered.)
				3500	- Ensures any
				3501	preceding
				3502	sequential
				3503	consistent local
				3504	memory instructions
				3505	have completed
				3506	before executing
				3507	this sequentially
				3508	consistent
				3509	instruction. This
				3510	prevents reordering
				3511	a seq_cst store
				3512	followed by a
				3513	seq_cst load. (Note
				3514	that seq_cst is
				3515	stronger than
				3516	acquire/release as
				3517	the reordering of
				3518	load acquire
				3519	followed by a store
				3520	release is
				3521	prevented by the
				3522	waitcnt of
				3523	the release, but
				3524	there is nothing
				3525	preventing a store
				3526	release followed by
				3527	load acquire from
				3528	competing out of
				3529	order.)
				3530
				3531	2. *Following
				3532	instructions same as
				3533	corresponding load
				3534	atomic acquire,
				3535	except must generated
				3536	all instructions even
				3537	for OpenCL.*
				3538	load atomic seq_cst - workgroup - local *Same as corresponding
				3539	load atomic acquire,
				3540	except must generated
				3541	all instructions even
				3542	for OpenCL.*
				3543	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				3544	- system - generic vmcnt(0)
				3545
				3546	- Could be split into
				3547	separate s_waitcnt
				3548	vmcnt(0)
				3549	and s_waitcnt
				3550	lgkmcnt(0) to allow
				3551	them to be
				3552	independently moved
				3553	according to the
				3554	following rules.
				3555	- waitcnt lgkmcnt(0)
				3556	must happen after
				3557	preceding
				3558	global/generic load
				3559	atomic/store
				3560	atomic/atomicrmw
				3561	with memory
				3562	ordering of seq_cst
				3563	and with equal or
				3564	wider sync scope.
				3565	(Note that seq_cst
				3566	fences have their
				3567	own s_waitcnt
				3568	lgkmcnt(0) and so do
				3569	not need to be
				3570	considered.)
				3571	- waitcnt vmcnt(0)
				3572	must happen after
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3573	preceding
				3574	global/generic load
				3575	atomic/store
				3576	atomic/atomicrmw
				3577	with memory
				3578	ordering of seq_cst
				3579	and with equal or
				3580	wider sync scope.
				3581	(Note that seq_cst
				3582	fences have their
				3583	own s_waitcnt
				3584	vmcnt(0) and so do
				3585	not need to be
				3586	considered.)
				3587	- Ensures any
				3588	preceding
				3589	sequential
				3590	consistent global
				3591	memory instructions
				3592	have completed
				3593	before executing
				3594	this sequentially
				3595	consistent
				3596	instruction. This
				3597	prevents reordering
				3598	a seq_cst store
				3599	followed by a
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3600	seq_cst load. (Note
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3601	that seq_cst is
				3602	stronger than
				3603	acquire/release as
				3604	the reordering of
				3605	load acquire
				3606	followed by a store
				3607	release is
				3608	prevented by the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3609	waitcnt of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3610	the release, but
				3611	there is nothing
				3612	preventing a store
				3613	release followed by
				3614	load acquire from
				3615	competing out of
				3616	order.)
				3617
				3618	2. *Following
				3619	instructions same as
				3620	corresponding load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3621	atomic acquire,
				3622	except must generated
				3623	all instructions even
				3624	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3625	store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3626	- wavefront - local store atomic release,
				3627	- workgroup - generic except must generated
				3628	all instructions even
				3629	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3630	store atomic seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3631	- system - generic store atomic release,
				3632	except must generated
				3633	all instructions even
				3634	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3635	atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3636	- wavefront - local atomicrmw acq_rel,
				3637	- workgroup - generic except must generated
				3638	all instructions even
				3639	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3640	atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3641	- system - generic atomicrmw acq_rel,
				3642	except must generated
				3643	all instructions even
				3644	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3645	fence seq_cst - singlethread none *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3646	- wavefront fence acq_rel,
				3647	- workgroup except must generated
				3648	- agent all instructions even
				3649	- system for OpenCL.*
				3650	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3651
				3652	The memory order also adds the single thread optimization constrains defined in
				3653	table
				3654	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3655
				3656	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3657	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3658
				3659	============ ==============================================================
				3660	LLVM Memory Optimization Constraints
				3661	Ordering
				3662	============ ==============================================================
				3663	unordered none
				3664	monotonic none
				3665	acquire - If a load atomic/atomicrmw then no following load/load
				3666	atomic/store/ store atomic/atomicrmw/fence instruction can
				3667	be moved before the acquire.
				3668	- If a fence then same as load atomic, plus no preceding
				3669	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3670	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3671	atomic/store/ store atomic/atomicrmw/fence instruction can
				3672	be moved after the release.
				3673	- If a fence then same as store atomic, plus no following
				3674	associated fence-paired-atomic can be moved before the
				3675	fence.
				3676	acq_rel Same constraints as both acquire and release.
				3677	seq_cst - If a load atomic then same constraints as acquire, plus no
				3678	preceding sequentially consistent load atomic/store
				3679	atomic/atomicrmw/fence instruction can be moved after the
				3680	seq_cst.
				3681	- If a store atomic then the same constraints as release, plus
				3682	no following sequentially consistent load atomic/store
				3683	atomic/atomicrmw/fence instruction can be moved before the
				3684	seq_cst.
				3685	- If an atomicrmw/fence then same constraints as acq_rel.
				3686	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3687
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3688	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3689	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3690
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3691	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3692	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3693	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3694
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3695	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3696	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3697
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3698	=================== =============== =============== =======================
				3699	Usage Code Sequence Trap Handler Description
				3700	Inputs
				3701	=================== =============== =============== =======================
				3702	reserved ``s_trap 0x00`` Reserved by hardware.
				3703	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3704	``queue_ptr`` ``debugtrap``
				3705	``VGPR0``: intrinsic (not
				3706	``arg`` implemented).
				3707	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3708	``queue_ptr`` terminated and its
				3709	associated queue put
				3710	into the error state.
				3711	``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
				3712	``queue_ptr`` installed handled
				3713	same as ``llvm.trap``.
				3714	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
				3715	breakpoints.
				3716	debugger ``s_trap 0x08`` Reserved for debugger.
				3717	debugger ``s_trap 0xfe`` Reserved for debugger.
				3718	debugger ``s_trap 0xff`` Reserved for debugger.
				3719	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3720
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3721	Unspecified OS
				3722	--------------
				3723
				3724	This section provides code conventions used when the target triple OS is
				3725	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3726
				3727	Trap Handler ABI
				3728	~~~~~~~~~~~~~~~~
				3729
				3730	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3731	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3732	instructions are handled as follows:
				3733
				3734	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3735	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3736
				3737	=============== =============== ===========================================
				3738	Usage Code Sequence Description
				3739	=============== =============== ===========================================
				3740	llvm.trap s_endpgm Causes wavefront to be terminated.
				3741	llvm.debugtrap none Compiler warning given that there is no
				3742	trap handler installed.
				3743	=============== =============== ===========================================
				3744
				3745	Source Languages
				3746	================
				3747
				3748	.. _amdgpu-opencl:
				3749
				3750	OpenCL
				3751	------
				3752
				3753	When generating code for the OpenCL language the target triple environment
				3754	should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
				3755
				3756	When the language is OpenCL the following differences occur:
				3757
				3758	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3759	2. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3760	3. Additional metadata is generated
				3761	(:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3762
				3763	.. TODO
				3764	Specify what affect this has. Hidden arguments added. Additional metadata
				3765	generated.
				3766
				3767	.. _amdgpu-hcc:
				3768
				3769	HCC
				3770	---
				3771
				3772	When generating code for the OpenCL language the target triple environment
				3773	should be ``hcc`` (see :ref:`amdgpu-target-triples`).
				3774
				3775	When the language is OpenCL the following differences occur:
				3776
				3777	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3778
				3779	.. TODO
				3780	Specify what affect this has.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	3781
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3782	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3783	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3784
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3785	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	3786	It supports AMDGCN GFX6-GFX9.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3787
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3788	This section describes general syntax for instructions and operands. For more
				3789	information about instructions, their semantics and supported combinations of
				3790	operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	3791	[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3792
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3793	An instruction has the following syntax (register operands are normally
				3794	comma-separated while extra operands are space-separated):
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3795
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3796	<opcode> <register_operand0>, ... <extra_operand0> ...
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3797
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3798	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3799	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3800
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3801	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3802
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3803	* SGPR registers: s0, ... or s[0], ...
				3804	* VGPR registers: v0, ... or v[0], ...
				3805	* TTMP registers: ttmp0, ... or ttmp[0], ...
				3806	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				3807	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				3808	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				3809	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				3810	* Register index expressions: v[2*2], s[1-1:2-1]
				3811	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3812
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3813	The following extra operands are supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3814
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3815	* offset, offset0, offset1
				3816	* idxen, offen bits
				3817	* glc, slc, tfe bits
				3818	* waitcnt: integer or combination of counter values
				3819	* VOP3 modifiers:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3820
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3821	- abs (\\| \\|), neg (\-)
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3822
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3823	* DPP modifiers:
				3824
				3825	- row_shl, row_shr, row_ror, row_rol
				3826	- row_mirror, row_half_mirror, row_bcast
				3827	- wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
				3828	- row_mask, bank_mask, bound_ctrl
				3829
				3830	* SDWA modifiers:
				3831
				3832	- dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
				3833	- dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
				3834	- abs, neg, sext
				3835
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3836	Instruction Examples
				3837	~~~~~~~~~~~~~~~~~~~~
				3838
				3839	DS
				3840	~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3841
				3842	.. code-block:: nasm
				3843
				3844	ds_add_u32 v2, v4 offset:16
				3845	ds_write_src2_b64 v2 offset0:4 offset1:8
				3846	ds_cmpst_f32 v2, v4, v6
				3847	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				3848
				3849
				3850	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				3851
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3852	FLAT
				3853	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3854
				3855	.. code-block:: nasm
				3856
				3857	flat_load_dword v1, v[3:4]
				3858	flat_store_dwordx3 v[3:4], v[5:7]
				3859	flat_atomic_swap v1, v[3:4], v5 glc
				3860	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				3861	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				3862
				3863	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				3864
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3865	MUBUF
				3866	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3867
				3868	.. code-block:: nasm
				3869
				3870	buffer_load_dword v1, off, s[4:7], s1
				3871	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				3872	buffer_store_format_xy v[1:2], off, s[4:7], s1
				3873	buffer_wbinvl1
				3874	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				3875
				3876	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				3877
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3878	SMRD/SMEM
				3879	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3880
				3881	.. code-block:: nasm
				3882
				3883	s_load_dword s1, s[2:3], 0xfc
				3884	s_load_dwordx8 s[8:15], s[2:3], s4
				3885	s_load_dwordx16 s[88:103], s[2:3], s4
				3886	s_dcache_inv_vol
				3887	s_memtime s[4:5]
				3888
				3889	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				3890
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3891	SOP1
				3892	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3893
				3894	.. code-block:: nasm
				3895
				3896	s_mov_b32 s1, s2
				3897	s_mov_b64 s[0:1], 0x80000000
				3898	s_cmov_b32 s1, 200
				3899	s_wqm_b64 s[2:3], s[4:5]
				3900	s_bcnt0_i32_b64 s1, s[2:3]
				3901	s_swappc_b64 s[2:3], s[4:5]
				3902	s_cbranch_join s[4:5]
				3903
				3904	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				3905
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3906	SOP2
				3907	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3908
				3909	.. code-block:: nasm
				3910
				3911	s_add_u32 s1, s2, s3
				3912	s_and_b64 s[2:3], s[4:5], s[6:7]
				3913	s_cselect_b32 s1, s2, s3
				3914	s_andn2_b32 s2, s4, s6
				3915	s_lshr_b64 s[2:3], s[4:5], s6
				3916	s_ashr_i32 s2, s4, s6
				3917	s_bfm_b64 s[2:3], s4, s6
				3918	s_bfe_i64 s[2:3], s[4:5], s6
				3919	s_cbranch_g_fork s[4:5], s[6:7]
				3920
				3921	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				3922
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3923	SOPC
				3924	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3925
				3926	.. code-block:: nasm
				3927
				3928	s_cmp_eq_i32 s1, s2
				3929	s_bitcmp1_b32 s1, s2
				3930	s_bitcmp0_b64 s[2:3], s4
				3931	s_setvskip s3, s5
				3932
				3933	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				3934
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3935	SOPP
				3936	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3937
				3938	.. code-block:: nasm
				3939
				3940	s_barrier
				3941	s_nop 2
				3942	s_endpgm
				3943	s_waitcnt 0 ; Wait for all counters to be 0
				3944	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				3945	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				3946	s_sethalt 9
				3947	s_sleep 10
				3948	s_sendmsg 0x1
				3949	s_sendmsg sendmsg(MSG_INTERRUPT)
				3950	s_trap 1
				3951
				3952	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				3953
				3954	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	3955	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3956	range or acceptable values.
				3957
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3958	VALU
				3959	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3960
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3961	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				3962	the assembler will automatically use optimal encoding based on its operands.
				3963	To force specific encoding, one can add a suffix to the opcode of the instruction:
				3964
				3965	* _e32 for 32-bit VOP1/VOP2/VOPC
				3966	* _e64 for 64-bit VOP3
				3967	* _dpp for VOP_DPP
				3968	* _sdwa for VOP_SDWA
				3969
				3970	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3971
				3972	.. code-block:: nasm
				3973
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3974	v_mov_b32 v1, v2
				3975	v_mov_b32_e32 v1, v2
				3976	v_nop
				3977	v_cvt_f64_i32_e32 v[1:2], v2
				3978	v_floor_f32_e32 v1, v2
				3979	v_bfrev_b32_e32 v1, v2
				3980	v_add_f32_e32 v1, v2, v3
				3981	v_mul_i32_i24_e64 v1, v2, 3
				3982	v_mul_i32_i24_e32 v1, -3, v3
				3983	v_mul_i32_i24_e32 v1, -100, v3
				3984	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				3985	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3986
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3987	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3988
				3989	.. code-block:: nasm
				3990
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3991	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				3992	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3993	v_mov_b32 v0, v0 wave_shl:1
				3994	v_mov_b32 v0, v0 row_mirror
				3995	v_mov_b32 v0, v0 row_bcast:31
				3996	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3997	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3998	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3999
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4000	VOP_SDWA examples:
				4001
				4002	.. code-block:: nasm
				4003
				4004	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				4005	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				4006	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				4007	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				4008	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				4009
				4010	For full list of supported instructions, refer to "Vector ALU instructions".
				4011
				4012	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4013	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4014
				4015	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				4016	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4017
				4018	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4019	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4020
				4021	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4022	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4023
				4024	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4025	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				4026
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4027
				4028	major, minor, and stepping are all integers that describe the instruction
				4029	set architecture (ISA) version of the assembly program.
				4030
				4031	vendor and arch are quoted strings. vendor should always be equal to
				4032	"AMD" and arch should always be equal to "AMDGPU".
				4033
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4034	By default, the assembler will derive the ISA version, vendor, and arch
				4035	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4036
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4037	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4038	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4039
				4040	This directives specifies that the symbol with given name is a kernel entry point
				4041	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4042
				4043	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4044	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4045
				4046	This directive marks the beginning of a list of key / value pairs that are used
				4047	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				4048	The list must be terminated by the .end_amd_kernel_code_t directive. For
				4049	any amd_kernel_code_t values that are unspecified a default value will be
				4050	used. The default value for all keys is 0, with the following exceptions:
				4051
				4052	- kernel_code_version_major defaults to 1.
				4053	- machine_kind defaults to 1.
				4054	- machine_version_major, machine_version_minor, and
				4055	machine_version_stepping are derived from the value of the -mcpu option
				4056	that is passed to the assembler.
				4057	- kernel_code_entry_byte_offset defaults to 256.
				4058	- wavefront_size defaults to 6.
				4059	- kernarg_segment_alignment, group_segment_alignment, and
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	4060	private_segment_alignment default to 4. Note that alignments are specified
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4061	as a power of two, so a value of n means an alignment of 2^ n.
				4062
				4063	The .amd_kernel_code_t directive must be placed immediately after the
				4064	function label and before any instructions.
				4065
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4066	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				4067	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4068
				4069	Here is an example of a minimal amd_kernel_code_t specification:
				4070
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	4071	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4072
				4073	.hsa_code_object_version 1,0
				4074	.hsa_code_object_isa
				4075
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4076	.hsatext
				4077	.globl hello_world
				4078	.p2align 8
				4079	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4080
				4081	hello_world:
				4082
				4083	.amd_kernel_code_t
				4084	enable_sgpr_kernarg_segment_ptr = 1
				4085	is_ptr64 = 1
				4086	compute_pgm_rsrc1_vgprs = 0
				4087	compute_pgm_rsrc1_sgprs = 0
				4088	compute_pgm_rsrc2_user_sgpr = 2
				4089	kernarg_segment_byte_size = 8
				4090	wavefront_sgpr_count = 2
				4091	workitem_vgpr_count = 3
				4092	.end_amd_kernel_code_t
				4093
				4094	s_load_dwordx2 s[0:1], s[0:1] 0x0
				4095	v_mov_b32 v0, 3.14159
				4096	s_waitcnt lgkmcnt(0)
				4097	v_mov_b32 v1, s0
				4098	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4099	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4100	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	4101	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4102	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4103
				4104	Additional Documentation
				4105	========================
				4106
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4107	.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				4108	.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				4109	.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				4110	.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				4111	.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				4112	.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				4113	.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
				4114	.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4115	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				4116	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				4117	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				4118	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				4119	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	4120	.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4121	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				4122	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__