Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: ecb0c11dbcbb872e8af428e4a48a7b15e65620fd [file] [log] [blame]

Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1	=============================
				2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	26	.. table:: AMDGPU Architectures
				27	:name: amdgpu-architecture-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	28
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	29	============ ==============================================================
				30	Architecture Description
				31	============ ==============================================================
				32	``r600`` AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders.
				33	``amdgcn`` AMD GPUs GCN GFX6 onwards for graphics and compute shaders.
				34	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	35
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	36	.. table:: AMDGPU Vendors
				37	:name: amdgpu-vendor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	38
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	39	============ ==============================================================
				40	Vendor Description
				41	============ ==============================================================
				42	``amd`` Can be used for all AMD GPU usage.
				43	``mesa3d`` Can be used if the OS is ``mesa3d``.
				44	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	45
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	46	.. table:: AMDGPU Operating Systems
				47	:name: amdgpu-os-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	48
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	49	============== ============================================================
				50	OS Description
				51	============== ============================================================
				52	<empty> Defaults to the unknown OS.
				53	``amdhsa`` Compute kernels executed on HSA [HSA]_ compatible runtimes
				54	such as AMD's ROCm [AMD-ROCm]_.
				55	``amdpal`` Graphic shaders and compute kernels executed on AMD PAL
				56	runtime.
				57	``mesa3d`` Graphic shaders and compute kernels executed on Mesa 3D
				58	runtime.
				59	============== ============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	60
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	61	.. table:: AMDGPU Environments
				62	:name: amdgpu-environment-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	63
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	64	============ ==============================================================
				65	Environment Description
				66	============ ==============================================================
				67	<empty> Defaults to ``opencl``.
				68	``opencl`` OpenCL compute kernel (see :ref:`amdgpu-opencl`).
				69	``amdgizcl`` Same as ``opencl`` except a different address space mapping is
				70	used (see :ref:`amdgpu-address-spaces`).
				71	``amdgiz`` Same as ``opencl`` except a different address space mapping is
				72	used (see :ref:`amdgpu-address-spaces`).
				73	``hcc`` AMD HC language compute kernel (see :ref:`amdgpu-hcc`).
				74	============ ==============================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	75
				76	.. _amdgpu-processors:
				77
				78	Processors
				79	----------
				80
				81	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				82	names from both the Processor and Alternative Processor can be used.
				83
				84	.. table:: AMDGPU Processors
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	85	:name: amdgpu-processor-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	86
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	87	=========== =============== ============ ===== ======= ==================
				88	Processor Alternative Target dGPU/ ROCm Example
				89	Processor Triple APU Support Products
				90	Architecture
				91	=========== =============== ============ ===== ======= ==================
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	92	Radeon HD 2000/3000 Series (R600) [AMD-RADEON-HD-2000-3000]_
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	93	-------------------------------------------------------------------------
				94	``r600`` ``r600`` dGPU
				95	``r630`` ``r600`` dGPU
				96	``rs880`` ``r600`` dGPU
				97	``rv670`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	98	Radeon HD 4000 Series (R700) [AMD-RADEON-HD-4000]_
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	99	-------------------------------------------------------------------------
				100	``rv710`` ``r600`` dGPU
				101	``rv730`` ``r600`` dGPU
				102	``rv770`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	103	Radeon HD 5000 Series (Evergreen) [AMD-RADEON-HD-5000]_
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	104	-------------------------------------------------------------------------
				105	``cedar`` ``r600`` dGPU
				106	``redwood`` ``r600`` dGPU
				107	``sumo`` ``r600`` dGPU
				108	``juniper`` ``r600`` dGPU
				109	``cypress`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	110	Radeon HD 6000 Series (Northern Islands) [AMD-RADEON-HD-6000]_
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	111	-------------------------------------------------------------------------
				112	``barts`` ``r600`` dGPU
				113	``turks`` ``r600`` dGPU
				114	``caicos`` ``r600`` dGPU
				115	``cayman`` ``r600`` dGPU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	116	GCN GFX6 (Southern Islands (SI)) [AMD-GCN-GFX6]_
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	117	-------------------------------------------------------------------------
				118	``gfx600`` - ``tahiti`` ``amdgcn`` dGPU
				119	``gfx601`` - ``pitcairn`` ``amdgcn`` dGPU
				120	- ``verde``
				121	- ``oland``
				122	- ``hainan``
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	123	GCN GFX7 (Sea Islands (CI)) [AMD-GCN-GFX7]_
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	124	-------------------------------------------------------------------------
				125	``gfx700`` - ``bonaire`` ``amdgcn`` dGPU - Radeon HD 7790
				126	- Radeon HD 8770
				127	- R7 260
				128	- R7 260X
				129	\ - ``kaveri`` ``amdgcn`` APU - A6-7000
				130	- A6 Pro-7050B
				131	- A8-7100
				132	- A8 Pro-7150B
				133	- A10-7300
				134	- A10 Pro-7350B
				135	- FX-7500
				136	- A8-7200P
				137	- A10-7400P
				138	- FX-7600P
				139	``gfx701`` - ``hawaii`` ``amdgcn`` dGPU ROCm - FirePro W8100
				140	- FirePro W9100
				141	- FirePro S9150
				142	- FirePro S9170
				143	``gfx702`` ``amdgcn`` dGPU ROCm - Radeon R9 290
				144	- Radeon R9 290x
				145	- Radeon R390
				146	- Radeon R390x
				147	``gfx703`` - ``kabini`` ``amdgcn`` APU - E1-2100
				148	- ``mullins`` - E1-2200
				149	- E1-2500
				150	- E2-3000
				151	- E2-3800
				152	- A4-5000
				153	- A4-5100
				154	- A6-5200
				155	- A4 Pro-3340B
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	156	GCN GFX8 (Volcanic Islands (VI)) [AMD-GCN-GFX8]_
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	157	-------------------------------------------------------------------------
				158	``gfx800`` - ``iceland`` ``amdgcn`` dGPU - FirePro S7150
				159	- FirePro S7100
				160	- FirePro W7100
				161	- Radeon R285
				162	- Radeon R9 380
				163	- Radeon R9 385
				164	- Mobile FirePro
				165	M7170
				166	``gfx801`` - ``carrizo`` ``amdgcn`` APU - A6-8500P
				167	- Pro A6-8500B
				168	- A8-8600P
				169	- Pro A8-8600B
				170	- FX-8800P
				171	- Pro A12-8800B
				172	\ ``amdgcn`` APU ROCm - A10-8700P
				173	- Pro A10-8700B
				174	- A10-8780P
				175	\ ``amdgcn`` APU - A10-9600P
				176	- A10-9630P
				177	- A12-9700P
				178	- A12-9730P
				179	- FX-9800P
				180	- FX-9830P
				181	\ ``amdgcn`` APU - E2-9010
				182	- A6-9210
				183	- A9-9410
				184	``gfx802`` - ``tonga`` ``amdgcn`` dGPU ROCm Same as gfx800
				185	``gfx803`` - ``fiji`` ``amdgcn`` dGPU ROCm - Radeon R9 Nano
				186	- Radeon R9 Fury
				187	- Radeon R9 FuryX
				188	- Radeon Pro Duo
				189	- FirePro S9300x2
				190	- Radeon Instinct MI8
				191	\ - ``polaris10`` ``amdgcn`` dGPU ROCm - Radeon RX 470
				192	- Radeon RX 480
				193	- Radeon Instinct MI6
				194	\ - ``polaris11`` ``amdgcn`` dGPU ROCm - Radeon RX 460
				195	``gfx810`` - ``stoney`` ``amdgcn`` APU
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	196	GCN GFX9 [AMD-GCN-GFX9]_
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	197	-------------------------------------------------------------------------
				198	``gfx900`` ``amdgcn`` dGPU ROCm - Radeon Vega
				199	Frontier Edition
				200	- Radeon RX Vega 56
				201	- Radeon RX Vega 64
				202	- Radeon RX Vega 64
				203	Liquid
				204	- Radeon Instinct MI25
				205	``gfx902`` ``amdgcn`` APU TBA
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	206
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	207	.. TODO
				208	Add product
				209	names.
				210	=========== =============== ============ ===== ======= ==================
				211
				212	.. _amdgpu-target-features:
				213
				214	Target Features
				215	---------------
				216
				217	Target features control how code is generated to support certain
				218	features. Not all target features are supported by all processors. The
				219	runtime must ensure that the features supported by the device used to
				220	execute the code match the features enabled when generating the
				221	code. A mismatch of features may result in incorrect execution, or a
				222	reduction in performance.
				223
				224	Use the ``clang -m[no-]<TargetFeature>`` option to specify the AMD GPU
				225	target features.
				226
				227	For example:
				228
				229	``-mxnack``
				230	Enable the XNACK feature.
				231	``-mno-xnack``
				232	Disable the XNACK feature.
				233
				234	.. table:: AMDGPU Target Features
				235	:name: amdgpu-target-feature-table
				236
				237	============== ======== ==================================================
				238	Target Feature Default Description
				239	============== ======== ==================================================
				240	-m[no-]xnack disabled Enable/disable generating code that has
				241	memory clauses that are compatible with
				242	having XNACK replay enabled.
				243
				244	This is used for demand paging and page
				245	migration. If XNACK replay is enabled in
				246	the device, then if a page fault occurs
				247	the code may execute incorrectly if the
				248	XNACK feature is not enabled. Executing
				249	code that has the feature enabled on a
				250	device that does not have XNACK replay
				251	enabled will execute correctly, but may
				252	be less performant than code with the
				253	feature disabled.
				254
				255	This feature is supported by the
Tony Tye	3507750	2017-11-11 00:50:32 +0000	[diff] [blame]	256	``amdgcn`` architecture for GFX8-GFX9.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	257	============== ======== ==================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	258
				259	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	260
				261	Address Spaces
				262	--------------
				263
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	264	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	265
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	266	The memory space names used in the table, aside from the region memory space, is
				267	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	268
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	269	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	270
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	271	.. table:: Address Space Mapping
				272	:name: amdgpu-address-space-mapping-table
				273
				274	================== ================= ================= ================= =================
				275	LLVM Address Space Memory Space
				276	------------------ -----------------------------------------------------------------------
				277	\ Current Default amdgiz/amdgizcl hcc Future Default
				278	================== ================= ================= ================= =================
				279	0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
				280	1 Global Global Global Global
				281	2 Constant Constant Constant Region (GDS)
				282	3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
				283	4 Generic (Flat) Region (GDS) Region (GDS) Constant
				284	5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
				285	================== ================= ================= ================= =================
				286
				287	Current Default
				288	This is the current default address space mapping used for all languages
				289	except hcc. This will shortly be deprecated.
				290
				291	amdgiz/amdgizcl
				292	This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
				293	is specified as the target triple environment value.
				294
				295	hcc
				296	This is the current address space mapping used when ``hcc`` is specified as
				297	the target triple environment value.This will shortly be deprecated.
				298
				299	Future Default
				300	This will shortly be the only address space mapping for all languages using
				301	AMDGPU backend.
				302
				303	.. _amdgpu-memory-scopes:
				304
				305	Memory Scopes
				306	-------------
				307
				308	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				309	backend memory model when the target triple OS is ``amdhsa`` (see
				310	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				311
				312	The memory model supported is based on the HSA memory model [HSA]_ which is
				313	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				314	relation is transitive over the synchonizes-with relation independent of scope,
				315	and synchonizes-with allows the memory scope instances to be inclusive (see
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	316	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	317
				318	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				319	inclusion and requires the memory scopes to exactly match. However, this
				320	is conservatively correct for OpenCL.
				321
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	322	.. table:: AMDHSA LLVM Sync Scopes
				323	:name: amdgpu-amdhsa-llvm-sync-scopes-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	324
				325	================ ==========================================================
				326	LLVM Sync Scope Description
				327	================ ==========================================================
				328	none The default: ``system``.
				329
				330	Synchronizes with, and participates in modification and
				331	seq_cst total orderings with, other operations (except
				332	image operations) for all address spaces (except private,
				333	or generic that accesses private) provided the other
				334	operation's sync scope is:
				335
				336	- ``system``.
				337	- ``agent`` and executed by a thread on the same agent.
				338	- ``workgroup`` and executed by a thread in the same
				339	workgroup.
				340	- ``wavefront`` and executed by a thread in the same
				341	wavefront.
				342
				343	``agent`` Synchronizes with, and participates in modification and
				344	seq_cst total orderings with, other operations (except
				345	image operations) for all address spaces (except private,
				346	or generic that accesses private) provided the other
				347	operation's sync scope is:
				348
				349	- ``system`` or ``agent`` and executed by a thread on the
				350	same agent.
				351	- ``workgroup`` and executed by a thread in the same
				352	workgroup.
				353	- ``wavefront`` and executed by a thread in the same
				354	wavefront.
				355
				356	``workgroup`` Synchronizes with, and participates in modification and
				357	seq_cst total orderings with, other operations (except
				358	image operations) for all address spaces (except private,
				359	or generic that accesses private) provided the other
				360	operation's sync scope is:
				361
				362	- ``system``, ``agent`` or ``workgroup`` and executed by a
				363	thread in the same workgroup.
				364	- ``wavefront`` and executed by a thread in the same
				365	wavefront.
				366
				367	``wavefront`` Synchronizes with, and participates in modification and
				368	seq_cst total orderings with, other operations (except
				369	image operations) for all address spaces (except private,
				370	or generic that accesses private) provided the other
				371	operation's sync scope is:
				372
				373	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				374	and executed by a thread in the same wavefront.
				375
				376	``singlethread`` Only synchronizes with, and participates in modification
				377	and seq_cst total orderings with, other operations (except
				378	image operations) running in the same thread for all
				379	address spaces (for example, in signal handlers).
				380	================ ==========================================================
				381
				382	AMDGPU Intrinsics
				383	-----------------
				384
				385	The AMDGPU backend implements the following intrinsics.
				386
				387	This section is WIP.
				388
				389	.. TODO
				390	List AMDGPU intrinsics
				391
				392	Code Object
				393	===========
				394
				395	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				396	can be linked by ``lld`` to produce a standard ELF shared code object which can
				397	be loaded and executed on an AMDGPU target.
				398
				399	Header
				400	------
				401
				402	The AMDGPU backend uses the following ELF header:
				403
				404	.. table:: AMDGPU ELF Header
				405	:name: amdgpu-elf-header-table
				406
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	407	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	408	Field Value
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	409	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	410	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				411	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	412	``e_ident[EI_OSABI]`` - ``ELFOSABI_NONE``
				413	- ``ELFOSABI_AMDGPU_HSA``
				414	- ``ELFOSABI_AMDGPU_PAL``
				415	- ``ELFOSABI_AMDGPU_MESA3D``
				416	``e_ident[EI_ABIVERSION]`` - ``ELFABIVERSION_AMDGPU_HSA``
				417	- ``ELFABIVERSION_AMDGPU_PAL``
				418	- ``ELFABIVERSION_AMDGPU_MESA3D``
				419	``e_type`` - ``ET_REL``
				420	- ``ET_DYN``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	421	``e_machine`` ``EM_AMDGPU``
				422	``e_entry`` 0
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	423	``e_flags`` See :ref:`amdgpu-elf-header-e_flags-table`
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	424	========================== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	425
				426	..
				427
				428	.. table:: AMDGPU ELF Header Enumeration Values
				429	:name: amdgpu-elf-header-enumeration-values-table
				430
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	431	=============================== =====
				432	Name Value
				433	=============================== =====
				434	``EM_AMDGPU`` 224
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	435	``ELFOSABI_NONE`` 0
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	436	``ELFOSABI_AMDGPU_HSA`` 64
				437	``ELFOSABI_AMDGPU_PAL`` 65
				438	``ELFOSABI_AMDGPU_MESA3D`` 66
				439	``ELFABIVERSION_AMDGPU_HSA`` 1
				440	``ELFABIVERSION_AMDGPU_PAL`` 0
				441	``ELFABIVERSION_AMDGPU_MESA3D`` 0
				442	=============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	443
				444	``e_ident[EI_CLASS]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	445	The ELF class is:
				446
				447	* ``ELFCLASS32`` for ``r600`` architecture.
				448
				449	* ``ELFCLASS64`` for ``amdgcn`` architecture which only supports 64
				450	bit applications.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	451
				452	``e_ident[EI_DATA]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	453	All AMDGPU targets use ``ELFDATA2LSB`` for little-endian byte ordering.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	454
				455	``e_ident[EI_OSABI]``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	456	One of the following AMD GPU architecture specific OS ABIs
				457	(see :ref:`amdgpu-os-table`):
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	458
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	459	* ``ELFOSABI_NONE`` for unknown OS.
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	460
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	461	* ``ELFOSABI_AMDGPU_HSA`` for ``amdhsa`` OS.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	462
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	463	* ``ELFOSABI_AMDGPU_PAL`` for ``amdpal`` OS.
				464
				465	* ``ELFOSABI_AMDGPU_MESA3D`` for ``mesa3D`` OS.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	466
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	467	``e_ident[EI_ABIVERSION]``
Konstantin Zhuravlyov	a952b44	2017-10-03 20:54:07 +0000	[diff] [blame]	468	The ABI version of the AMD GPU architecture specific OS ABI to which the code
				469	object conforms:
				470
				471	* ``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA
				472	runtime ABI.
				473
				474	* ``ELFABIVERSION_AMDGPU_PAL`` is used to specify the version of AMD PAL
				475	runtime ABI.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	476
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	477	* ``ELFABIVERSION_AMDGPU_MESA3D`` is used to specify the version of AMD MESA
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	478	3D runtime ABI.
Konstantin Zhuravlyov	0aa94d3	2017-10-03 21:14:14 +0000	[diff] [blame]	479
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	480	``e_type``
				481	Can be one of the following values:
				482
				483
				484	``ET_REL``
				485	The type produced by the AMD GPU backend compiler as it is relocatable code
				486	object.
				487
				488	``ET_DYN``
				489	The type produced by the linker as it is a shared code object.
				490
				491	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				492
				493	``e_machine``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	494	The value ``EM_AMDGPU`` is used for the machine for all processors supported
				495	by the ``r600`` and ``amdgcn`` architectures (see
				496	:ref:`amdgpu-processor-table`). The specific processor is specified in the
				497	``EF_AMDGPU_MACH`` bit field of the ``e_flags`` (see
				498	:ref:`amdgpu-elf-header-e_flags-table`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	499
				500	``e_entry``
				501	The entry point is 0 as the entry points for individual kernels must be
				502	selected in order to invoke them through AQL packets.
				503
				504	``e_flags``
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	505	The AMDGPU backend uses the following ELF header flags:
				506
				507	.. table:: AMDGPU ELF Header ``e_flags``
				508	:name: amdgpu-elf-header-e_flags-table
				509
				510	================================= ========== =============================
				511	Name Value Description
				512	================================= ========== =============================
				513	AMDGPU Processor Flag See :ref:`amdgpu-processor-table`.
				514	-------------------------------------------- -----------------------------
				515	``EF_AMDGPU_MACH`` 0x000000ff AMDGPU processor selection
				516	mask for
				517	``EF_AMDGPU_MACH_xxx`` values
				518	defined in
				519	:ref:`amdgpu-ef-amdgpu-mach-table`.
				520	================================= ========== =============================
				521
				522	.. table:: AMDGPU ``EF_AMDGPU_MACH`` Values
				523	:name: amdgpu-ef-amdgpu-mach-table
				524
				525	================================= ========== =============================
				526	Name Value Description (see
				527	:ref:`amdgpu-processor-table`)
				528	================================= ========== =============================
				529	``EF_AMDGPU_MACH_NONE`` 0 not specified
				530	``EF_AMDGPU_MACH_R600_R600`` 1 ``r600``
				531	``EF_AMDGPU_MACH_R600_R630`` 2 ``r630``
				532	``EF_AMDGPU_MACH_R600_RS880`` 3 ``rs880``
				533	``EF_AMDGPU_MACH_R600_RV670`` 4 ``rv670``
				534	``EF_AMDGPU_MACH_R600_RV710`` 5 ``rv710``
				535	``EF_AMDGPU_MACH_R600_RV730`` 6 ``rv730``
				536	``EF_AMDGPU_MACH_R600_RV770`` 7 ``rv770``
				537	``EF_AMDGPU_MACH_R600_CEDAR`` 8 ``cedar``
				538	``EF_AMDGPU_MACH_R600_REDWOOD`` 9 ``redwood``
				539	``EF_AMDGPU_MACH_R600_SUMO`` 10 ``sumo``
				540	``EF_AMDGPU_MACH_R600_JUNIPER`` 11 ``juniper``
				541	``EF_AMDGPU_MACH_R600_CYPRESS`` 12 ``cypress``
				542	``EF_AMDGPU_MACH_R600_BARTS`` 13 ``barts``
				543	``EF_AMDGPU_MACH_R600_TURKS`` 14 ``turks``
				544	``EF_AMDGPU_MACH_R600_CAICOS`` 15 ``caicos``
				545	``EF_AMDGPU_MACH_R600_CAYMAN`` 16 ``cayman``
				546	reserved 17-31 Reserved for ``r600``
				547	architecture processors.
				548	``EF_AMDGPU_MACH_AMDGCN_GFX600`` 32 ``gfx600``
				549	``EF_AMDGPU_MACH_AMDGCN_GFX601`` 33 ``gfx601``
				550	``EF_AMDGPU_MACH_AMDGCN_GFX700`` 34 ``gfx700``
				551	``EF_AMDGPU_MACH_AMDGCN_GFX701`` 35 ``gfx701``
				552	``EF_AMDGPU_MACH_AMDGCN_GFX702`` 36 ``gfx702``
				553	``EF_AMDGPU_MACH_AMDGCN_GFX703`` 37 ``gfx703``
				554	``EF_AMDGPU_MACH_AMDGCN_GFX800`` 38 ``gfx800``
				555	``EF_AMDGPU_MACH_AMDGCN_GFX801`` 39 ``gfx801``
				556	``EF_AMDGPU_MACH_AMDGCN_GFX802`` 40 ``gfx802``
				557	``EF_AMDGPU_MACH_AMDGCN_GFX803`` 41 ``gfx803``
				558	``EF_AMDGPU_MACH_AMDGCN_GFX810`` 42 ``gfx810``
				559	``EF_AMDGPU_MACH_AMDGCN_GFX900`` 43 ``gfx900``
				560	``EF_AMDGPU_MACH_AMDGCN_GFX902`` 44 ``gfx902``
				561	================================= ========== =============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	562
				563	Sections
				564	--------
				565
				566	An AMDGPU target ELF code object has the standard ELF sections which include:
				567
				568	.. table:: AMDGPU ELF Sections
				569	:name: amdgpu-elf-sections-table
				570
				571	================== ================ =================================
				572	Name Type Attributes
				573	================== ================ =================================
				574	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				575	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				576	``.debug_``\ \* ``SHT_PROGBITS`` none
				577	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				578	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				579	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				580	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				581	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				582	``.note`` ``SHT_NOTE`` none
				583	``.rela``\ name ``SHT_RELA`` none
				584	``.rela.dyn`` ``SHT_RELA`` none
				585	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				586	``.shstrtab`` ``SHT_STRTAB`` none
				587	``.strtab`` ``SHT_STRTAB`` none
				588	``.symtab`` ``SHT_SYMTAB`` none
				589	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				590	================== ================ =================================
				591
				592	These sections have their standard meanings (see [ELF]_) and are only generated
				593	if needed.
				594
				595	``.debug``\ \*
				596	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				597	DWARF produced by the AMDGPU backend.
				598
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	599	``.dynamic``, ``.dynstr``, ``.dynsym``, ``.hash``
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	600	The standard sections used by a dynamic loader.
				601
				602	``.note``
				603	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				604	backend.
				605
				606	``.rela``\ name, ``.rela.dyn``
				607	For relocatable code objects, name is the name of the section that the
				608	relocation records apply. For example, ``.rela.text`` is the section name for
				609	relocation records associated with the ``.text`` section.
				610
				611	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				612	records from each of the relocatable code object's ``.rela``\ name sections.
				613
				614	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				615	the AMDGPU backend.
				616
				617	``.text``
				618	The executable machine code for the kernels and functions they call. Generated
				619	as position independent code. See :ref:`amdgpu-code-conventions` for
				620	information on conventions used in the isa generation.
				621
				622	.. _amdgpu-note-records:
				623
				624	Note Records
				625	------------
				626
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	627	As required by ``ELFCLASS32`` and ``ELFCLASS64``, minimal zero byte padding must
				628	be generated after the ``name`` field to ensure the ``desc`` field is 4 byte
				629	aligned. In addition, minimal zero byte padding must be generated to ensure the
				630	``desc`` field size is a multiple of 4 bytes. The ``sh_addralign`` field of the
				631	``.note`` section must be at least 4 to indicate at least 8 byte alignment.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	632
				633	The AMDGPU backend code object uses the following ELF note records in the
				634	``.note`` section. The Description column specifies the layout of the note
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	635	record's ``desc`` field. All fields are consecutive bytes. Note records with
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	636	variable size strings have a corresponding ``*_size`` field that specifies the
				637	number of bytes, including the terminating null character, in the string. The
				638	string(s) come immediately after the preceding fields.
				639
				640	Additional note records can be present.
				641
				642	.. table:: AMDGPU ELF Note Records
				643	:name: amdgpu-elf-note-records-table
				644
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	645	===== ============================== ======================================
				646	Name Type Description
				647	===== ============================== ======================================
				648	"AMD" ``NT_AMD_AMDGPU_HSA_METADATA`` <metadata null terminated string>
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	649	===== ============================== ======================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	650
				651	..
				652
				653	.. table:: AMDGPU ELF Note Record Enumeration Values
				654	:name: amdgpu-elf-note-record-enumeration-values-table
				655
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	656	============================== =====
				657	Name Value
				658	============================== =====
				659	reserved 0-9
				660	``NT_AMD_AMDGPU_HSA_METADATA`` 10
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	661	reserved 11
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	662	============================== =====
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	663
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	664	``NT_AMD_AMDGPU_HSA_METADATA``
				665	Specifies extensible metadata associated with the code objects executed on HSA
				666	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. It is required when
				667	the target triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`). See
				668	:ref:`amdgpu-amdhsa-hsa-code-object-metadata` for the syntax of the code
				669	object metadata string.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	670
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	671	.. _amdgpu-symbols:
				672
				673	Symbols
				674	-------
				675
				676	Symbols include the following:
				677
				678	.. table:: AMDGPU ELF Symbols
				679	:name: amdgpu-elf-symbols-table
				680
				681	===================== ============== ============= ==================
				682	Name Type Section Description
				683	===================== ============== ============= ==================
				684	link-name ``STT_OBJECT`` - ``.data`` Global variable
				685	- ``.rodata``
				686	- ``.bss``
				687	link-name\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
				688	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				689	===================== ============== ============= ==================
				690
				691	Global variable
				692	Global variables both used and defined by the compilation unit.
				693
				694	If the symbol is defined in the compilation unit then it is allocated in the
				695	appropriate section according to if it has initialized data or is readonly.
				696
				697	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				698	will resolve relocations using the definition provided by another code object
				699	or explicitly defined by the runtime.
				700
				701	All global symbols, whether defined in the compilation unit or external, are
				702	accessed by the machine code indirectly through a GOT table entry. This
				703	allows them to be preemptable. The GOT table is only supported when the target
				704	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	705
				706	.. TODO
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	707	Add description of linked shared object symbols. Seems undefined symbols
				708	are marked as STT_NOTYPE.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	709
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	710	Kernel descriptor
				711	Every HSA kernel has an associated kernel descriptor. It is the address of the
				712	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				713	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				714	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				715
				716	Kernel entry point
				717	Every HSA kernel also has a symbol for its machine code entry point.
				718
				719	.. _amdgpu-relocation-records:
				720
				721	Relocation Records
				722	------------------
				723
				724	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				725	relocatable fields are:
				726
				727	``word32``
				728	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				729	alignment. These values use the same byte order as other word values in the
				730	AMD GPU architecture.
				731
				732	``word64``
				733	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				734	alignment. These values use the same byte order as other word values in the
				735	AMD GPU architecture.
				736
				737	Following notations are used for specifying relocation calculations:
				738
				739	A
				740	Represents the addend used to compute the value of the relocatable field.
				741
				742	G
				743	Represents the offset into the global offset table at which the relocation
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	744	entry's symbol will reside during execution.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	745
				746	GOT
				747	Represents the address of the global offset table.
				748
				749	P
				750	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				751	of the storage unit being relocated (computed using ``r_offset``).
				752
				753	S
				754	Represents the value of the symbol whose index resides in the relocation
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	755	entry. Relocations not using this must specify a symbol index of ``STN_UNDEF``.
				756
				757	B
				758	Represents the base address of a loaded executable or shared object which is
				759	the difference between the ELF address and the actual load address. Relocations
				760	using this are only valid in executable or shared objects.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	761
				762	The following relocation types are supported:
				763
				764	.. table:: AMDGPU ELF Relocation Records
				765	:name: amdgpu-elf-relocation-records-table
				766
				767	========================== ===== ========== ==============================
				768	Relocation Type Value Field Calculation
				769	========================== ===== ========== ==============================
				770	``R_AMDGPU_NONE`` 0 none none
				771	``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
				772	``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
				773	``R_AMDGPU_ABS64`` 3 ``word64`` S + A
				774	``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
				775	``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
				776	``R_AMDGPU_ABS32`` 6 ``word32`` S + A
				777	``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
				778	``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				779	``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
				780	``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				781	``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
Tony Tye	d288430	2017-10-16 20:44:29 +0000	[diff] [blame]	782	reserved 12
				783	``R_AMDGPU_RELATIVE64`` 13 ``word64`` B + A
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	784	========================== ===== ========== ==============================
				785
				786	.. _amdgpu-dwarf:
				787
				788	DWARF
				789	-----
				790
				791	Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
				792	information that maps the code object executable code and data to the source
				793	language constructs. It can be used by tools such as debuggers and profilers.
				794
				795	Address Space Mapping
				796	~~~~~~~~~~~~~~~~~~~~~
				797
				798	The following address space mapping is used:
				799
				800	.. table:: AMDGPU DWARF Address Space Mapping
				801	:name: amdgpu-dwarf-address-space-mapping-table
				802
				803	=================== =================
				804	DWARF Address Space Memory Space
				805	=================== =================
				806	1 Private (Scratch)
				807	2 Local (group/LDS)
				808	omitted Global
				809	omitted Constant
				810	omitted Generic (Flat)
				811	not supported Region (GDS)
				812	=================== =================
				813
				814	See :ref:`amdgpu-address-spaces` for information on the memory space terminology
				815	used in the table.
				816
				817	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				818	DWARF address space of the value of the pointer when it is in the private or
				819	local address space. Otherwise the attribute is omitted.
				820
				821	An ``XDEREF`` operation is generated in location list expressions for variables
				822	that are allocated in the private and local address space. Otherwise no
				823	``XDREF`` is omitted.
				824
				825	Register Mapping
				826	~~~~~~~~~~~~~~~~
				827
				828	This section is WIP.
				829
				830	.. TODO
				831	Define DWARF register enumeration.
				832
				833	If want to present a wavefront state then should expose vector registers as
				834	64 wide (rather than per work-item view that LLVM uses). Either as separate
				835	registers, or a 64x4 byte single register. In either case use a new LANE op
				836	(akin to XDREF) to select the current lane usage in a location
				837	expression. This would also allow scalar register spilling to vector register
				838	lanes to be expressed (currently no debug information is being generated for
				839	spilling). If choose a wide single register approach then use LANE in
				840	conjunction with PIECE operation to select the dword part of the register for
				841	the current lane. If the separate register approach then use LANE to select
				842	the register.
				843
				844	Source Text
				845	~~~~~~~~~~~
				846
				847	This section is WIP.
				848
				849	.. TODO
				850	DWARF extension to include runtime generated source text.
				851
				852	.. _amdgpu-code-conventions:
				853
				854	Code Conventions
				855	================
				856
				857	This section provides code conventions used for each supported target triple OS
				858	(see :ref:`amdgpu-target-triples`).
				859
				860	AMDHSA
				861	------
				862
				863	This section provides code conventions used when the target triple OS is
				864	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				865
				866	.. _amdgpu-amdhsa-hsa-code-object-metadata:
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	867
				868	Code Object Metadata
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	869	~~~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	870
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	871	The code object metadata specifies extensible metadata associated with the code
				872	objects executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm
				873	[AMD-ROCm]_. It is specified by the ``NT_AMD_AMDGPU_HSA_METADATA`` note record
				874	(see :ref:`amdgpu-note-records`) and is required when the target triple OS is
				875	``amdhsa`` (see :ref:`amdgpu-target-triples`). It must contain the minimum
				876	information necessary to support the ROCM kernel queries. For example, the
				877	segment sizes needed in a dispatch packet. In addition, a high level language
				878	runtime may require other information to be included. For example, the AMD
				879	OpenCL runtime records kernel argument information.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	880
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	881	The metadata is specified as a YAML formatted string (see [YAML]_ and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	882	:doc:`YamlIO`).
				883
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	884	.. TODO
				885	Is the string null terminated? It probably should not if YAML allows it to
				886	contain null characters, otherwise it should be.
				887
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	888	The metadata is represented as a single YAML document comprised of the mapping
				889	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				890	referenced tables.
				891
				892	For boolean values, the string values of ``false`` and ``true`` are used for
				893	false and true respectively.
				894
				895	Additional information can be added to the mappings. To avoid conflicts, any
				896	non-AMD key names should be prefixed by "vendor-name.".
				897
				898	.. table:: AMDHSA Code Object Metadata Mapping
				899	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				900
				901	========== ============== ========= =======================================
				902	String Key Value Type Required? Description
				903	========== ============== ========= =======================================
				904	"Version" sequence of Required - The first integer is the major
				905	2 integers version. Currently 1.
				906	- The second integer is the minor
				907	version. Currently 0.
				908	"Printf" sequence of Each string is encoded information
				909	strings about a printf function call. The
				910	encoded information is organized as
				911	fields separated by colon (':'):
				912
				913	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				914
				915	where:
				916
				917	``ID``
				918	A 32 bit integer as a unique id for
				919	each printf function call
				920
				921	``N``
				922	A 32 bit integer equal to the number
				923	of arguments of printf function call
				924	minus 1
				925
				926	``S[i]`` (where i = 0, 1, ... , N-1)
				927	32 bit integers for the size in bytes
				928	of the i-th FormatString argument of
				929	the printf function call
				930
				931	FormatString
				932	The format string passed to the
				933	printf function call.
				934	"Kernels" sequence of Required Sequence of the mappings for each
				935	mapping kernel in the code object. See
				936	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				937	for the definition of the mapping.
				938	========== ============== ========= =======================================
				939
				940	..
				941
				942	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				943	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				944
				945	================= ============== ========= ================================
				946	String Key Value Type Required? Description
				947	================= ============== ========= ================================
				948	"Name" string Required Source name of the kernel.
				949	"SymbolName" string Required Name of the kernel
				950	descriptor ELF symbol.
				951	"Language" string Source language of the kernel.
				952	Values include:
				953
				954	- "OpenCL C"
				955	- "OpenCL C++"
				956	- "HCC"
				957	- "OpenMP"
				958
				959	"LanguageVersion" sequence of - The first integer is the major
				960	2 integers version.
				961	- The second integer is the
				962	minor version.
				963	"Attrs" mapping Mapping of kernel attributes.
				964	See
				965	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				966	for the mapping definition.
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	967	"Args" sequence of Sequence of mappings of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	968	mapping kernel arguments. See
				969	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				970	for the definition of the mapping.
				971	"CodeProps" mapping Mapping of properties related to
				972	the kernel code. See
				973	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				974	for the mapping definition.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	975	================= ============== ========= ================================
				976
				977	..
				978
				979	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				980	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				981
				982	=================== ============== ========= ==============================
				983	String Key Value Type Required? Description
				984	=================== ============== ========= ==============================
				985	"ReqdWorkGroupSize" sequence of The dispatch work-group size
				986	3 integers X, Y, Z must correspond to the
				987	specified values.
				988
				989	Corresponds to the OpenCL
				990	``reqd_work_group_size``
				991	attribute.
				992	"WorkGroupSizeHint" sequence of The dispatch work-group size
				993	3 integers X, Y, Z is likely to be the
				994	specified values.
				995
				996	Corresponds to the OpenCL
				997	``work_group_size_hint``
				998	attribute.
				999	"VecTypeHint" string The name of a scalar or vector
				1000	type.
				1001
				1002	Corresponds to the OpenCL
				1003	``vec_type_hint`` attribute.
Yaxun Liu	de4b88d	2017-10-10 19:39:48 +0000	[diff] [blame]	1004
				1005	"RuntimeHandle" string The external symbol name
				1006	associated with a kernel.
				1007	OpenCL runtime allocates a
				1008	global buffer for the symbol
				1009	and saves the kernel's address
				1010	to it, which is used for
				1011	device side enqueueing. Only
				1012	available for device side
				1013	enqueued kernels.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1014	=================== ============== ========= ==============================
				1015
				1016	..
				1017
				1018	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				1019	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				1020
				1021	================= ============== ========= ================================
				1022	String Key Value Type Required? Description
				1023	================= ============== ========= ================================
				1024	"Name" string Kernel argument name.
				1025	"TypeName" string Kernel argument type name.
				1026	"Size" integer Required Kernel argument size in bytes.
				1027	"Align" integer Required Kernel argument alignment in
				1028	bytes. Must be a power of two.
				1029	"ValueKind" string Required Kernel argument kind that
				1030	specifies how to set up the
				1031	corresponding argument.
				1032	Values include:
				1033
				1034	"ByValue"
				1035	The argument is copied
				1036	directly into the kernarg.
				1037
				1038	"GlobalBuffer"
				1039	A global address space pointer
				1040	to the buffer data is passed
				1041	in the kernarg.
				1042
				1043	"DynamicSharedPointer"
				1044	A group address space pointer
				1045	to dynamically allocated LDS
				1046	is passed in the kernarg.
				1047
				1048	"Sampler"
				1049	A global address space
				1050	pointer to a S# is passed in
				1051	the kernarg.
				1052
				1053	"Image"
				1054	A global address space
				1055	pointer to a T# is passed in
				1056	the kernarg.
				1057
				1058	"Pipe"
				1059	A global address space pointer
				1060	to an OpenCL pipe is passed in
				1061	the kernarg.
				1062
				1063	"Queue"
				1064	A global address space pointer
				1065	to an OpenCL device enqueue
				1066	queue is passed in the
				1067	kernarg.
				1068
				1069	"HiddenGlobalOffsetX"
				1070	The OpenCL grid dispatch
				1071	global offset for the X
				1072	dimension is passed in the
				1073	kernarg.
				1074
				1075	"HiddenGlobalOffsetY"
				1076	The OpenCL grid dispatch
				1077	global offset for the Y
				1078	dimension is passed in the
				1079	kernarg.
				1080
				1081	"HiddenGlobalOffsetZ"
				1082	The OpenCL grid dispatch
				1083	global offset for the Z
				1084	dimension is passed in the
				1085	kernarg.
				1086
				1087	"HiddenNone"
				1088	An argument that is not used
				1089	by the kernel. Space needs to
				1090	be left for it, but it does
				1091	not need to be set up.
				1092
				1093	"HiddenPrintfBuffer"
				1094	A global address space pointer
				1095	to the runtime printf buffer
				1096	is passed in kernarg.
				1097
				1098	"HiddenDefaultQueue"
				1099	A global address space pointer
				1100	to the OpenCL device enqueue
				1101	queue that should be used by
				1102	the kernel by default is
				1103	passed in the kernarg.
				1104
				1105	"HiddenCompletionAction"
Yaxun Liu	c928f2a	2017-10-30 14:30:28 +0000	[diff] [blame]	1106	A global address space pointer
				1107	to help link enqueued kernels into
				1108	the ancestor tree for determining
				1109	when the parent kernel has finished.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1110
				1111	"ValueType" string Required Kernel argument value type. Only
				1112	present if "ValueKind" is
				1113	"ByValue". For vector data
				1114	types, the value is for the
				1115	element type. Values include:
				1116
				1117	- "Struct"
				1118	- "I8"
				1119	- "U8"
				1120	- "I16"
				1121	- "U16"
				1122	- "F16"
				1123	- "I32"
				1124	- "U32"
				1125	- "F32"
				1126	- "I64"
				1127	- "U64"
				1128	- "F64"
				1129
				1130	.. TODO
				1131	How can it be determined if a
				1132	vector type, and what size
				1133	vector?
				1134	"PointeeAlign" integer Alignment in bytes of pointee
				1135	type for pointer type kernel
				1136	argument. Must be a power
				1137	of 2. Only present if
				1138	"ValueKind" is
				1139	"DynamicSharedPointer".
				1140	"AddrSpaceQual" string Kernel argument address space
				1141	qualifier. Only present if
				1142	"ValueKind" is "GlobalBuffer" or
				1143	"DynamicSharedPointer". Values
				1144	are:
				1145
				1146	- "Private"
				1147	- "Global"
				1148	- "Constant"
				1149	- "Local"
				1150	- "Generic"
				1151	- "Region"
				1152
				1153	.. TODO
				1154	Is GlobalBuffer only Global
				1155	or Constant? Is
				1156	DynamicSharedPointer always
				1157	Local? Can HCC allow Generic?
				1158	How can Private or Region
				1159	ever happen?
				1160	"AccQual" string Kernel argument access
				1161	qualifier. Only present if
				1162	"ValueKind" is "Image" or
				1163	"Pipe". Values
				1164	are:
				1165
				1166	- "ReadOnly"
				1167	- "WriteOnly"
				1168	- "ReadWrite"
				1169
				1170	.. TODO
				1171	Does this apply to
				1172	GlobalBuffer?
Konstantin Zhuravlyov	a01d8b0	2017-10-14 19:03:51 +0000	[diff] [blame]	1173	"ActualAccQual" string The actual memory accesses
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1174	performed by the kernel on the
				1175	kernel argument. Only present if
				1176	"ValueKind" is "GlobalBuffer",
				1177	"Image", or "Pipe". This may be
				1178	more restrictive than indicated
				1179	by "AccQual" to reflect what the
				1180	kernel actual does. If not
				1181	present then the runtime must
				1182	assume what is implied by
				1183	"AccQual" and "IsConst". Values
				1184	are:
				1185
				1186	- "ReadOnly"
				1187	- "WriteOnly"
				1188	- "ReadWrite"
				1189
				1190	"IsConst" boolean Indicates if the kernel argument
				1191	is const qualified. Only present
				1192	if "ValueKind" is
				1193	"GlobalBuffer".
				1194
				1195	"IsRestrict" boolean Indicates if the kernel argument
				1196	is restrict qualified. Only
				1197	present if "ValueKind" is
				1198	"GlobalBuffer".
				1199
				1200	"IsVolatile" boolean Indicates if the kernel argument
				1201	is volatile qualified. Only
				1202	present if "ValueKind" is
				1203	"GlobalBuffer".
				1204
				1205	"IsPipe" boolean Indicates if the kernel argument
				1206	is pipe qualified. Only present
				1207	if "ValueKind" is "Pipe".
				1208
				1209	.. TODO
				1210	Can GlobalBuffer be pipe
				1211	qualified?
				1212	================= ============== ========= ================================
				1213
				1214	..
				1215
				1216	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				1217	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				1218
				1219	============================ ============== ========= =====================
				1220	String Key Value Type Required? Description
				1221	============================ ============== ========= =====================
				1222	"KernargSegmentSize" integer Required The size in bytes of
				1223	the kernarg segment
				1224	that holds the values
				1225	of the arguments to
				1226	the kernel.
				1227	"GroupSegmentFixedSize" integer Required The amount of group
				1228	segment memory
				1229	required by a
				1230	work-group in
				1231	bytes. This does not
				1232	include any
				1233	dynamically allocated
				1234	group segment memory
				1235	that may be added
				1236	when the kernel is
				1237	dispatched.
				1238	"PrivateSegmentFixedSize" integer Required The amount of fixed
				1239	private address space
				1240	memory required for a
				1241	work-item in
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1242	bytes. If the kernel
				1243	uses a dynamic call
				1244	stack then additional
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1245	space must be added
				1246	to this value for the
				1247	call stack.
				1248	"KernargSegmentAlign" integer Required The maximum byte
				1249	alignment of
				1250	arguments in the
				1251	kernarg segment. Must
				1252	be a power of 2.
				1253	"WavefrontSize" integer Required Wavefront size. Must
				1254	be a power of 2.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1255	"NumSGPRs" integer Required Number of scalar
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1256	registers used by a
				1257	wavefront for
				1258	GFX6-GFX9. This
				1259	includes the special
				1260	SGPRs for VCC, Flat
				1261	Scratch (GFX7-GFX9)
				1262	and XNACK (for
				1263	GFX8-GFX9). It does
				1264	not include the 16
				1265	SGPR added if a trap
				1266	handler is
				1267	enabled. It is not
				1268	rounded up to the
				1269	allocation
				1270	granularity.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1271	"NumVGPRs" integer Required Number of vector
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1272	registers used by
				1273	each work-item for
				1274	GFX6-GFX9
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1275	"MaxFlatWorkGroupSize" integer Required Maximum flat
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1276	work-group size
				1277	supported by the
				1278	kernel in work-items.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1279	Must be >=1 and
				1280	consistent with any
				1281	non-0 values in
				1282	FixedWorkGroupSize.
				1283	"FixedWorkGroupSize" sequence of Corresponds to the
				1284	3 integers dispatch work-group
				1285	size X, Y, Z. If
				1286	omitted, defaults to
				1287	0, 0, 0. If an
				1288	element is non-0 then
				1289	the kernel must only
				1290	be launched with a
				1291	matching corresponding
				1292	work-group size.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1293	"IsXNACKEnabled" boolean Indicates if the
				1294	generated machine
				1295	code is capable of
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1296	supporting XNACK. See
				1297	:ref:`amdgpu-target-features`.
Konstantin Zhuravlyov	06ae4ec	2017-11-28 17:51:08 +0000	[diff] [blame^]	1298	"NumSpilledSGPRs" integer Number of stores from
				1299	a scalar register to
				1300	a register allocator
				1301	created spill
				1302	location.
				1303	"NumSpilledVGPRs" integer Number of stores from
				1304	a vector register to
				1305	a register allocator
				1306	created spill
				1307	location.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1308	============================ ============== ========= =====================
				1309
				1310	..
				1311
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1312	Kernel Dispatch
				1313	~~~~~~~~~~~~~~~
				1314
				1315	The HSA architected queuing language (AQL) defines a user space memory interface
				1316	that can be used to control the dispatch of kernels, in an agent independent
				1317	way. An agent can have zero or more AQL queues created for it using the ROCm
				1318	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1319	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1320	mechanics and packet layouts.
				1321
				1322	The packet processor of a kernel agent is responsible for detecting and
				1323	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1324	packet processor is implemented by the hardware command processor (CP),
				1325	asynchronous dispatch controller (ADC) and shader processor input controller
				1326	(SPI).
				1327
				1328	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1329	mode driver to initialize and register the AQL queue with CP.
				1330
				1331	To dispatch a kernel the following actions are performed. This can occur in the
				1332	CPU host program, or from an HSA kernel executing on a GPU.
				1333
				1334	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1335	executed is obtained.
				1336	2. A pointer to the kernel descriptor (see
				1337	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1338	obtained. It must be for a kernel that is contained in a code object that that
				1339	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1340	associated.
				1341	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1342	for a memory region with the kernarg property for the kernel agent that will
				1343	execute the kernel. It must be at least 16 byte aligned.
				1344	4. Kernel argument values are assigned to the kernel argument memory
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	1345	allocation. The layout is defined in the HSA Programmer's Language Reference
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1346	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1347	memory in the same way constant memory is accessed. (Note that the HSA
				1348	specification allows an implementation to copy the kernel argument contents to
				1349	another location that is accessed by the kernel.)
				1350	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1351	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1352	packet. The packet must be set up, and the final write must use an atomic
				1353	store release to set the packet kind to ensure the packet contents are
				1354	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1355	notify the kernel agent that the AQL queue has been updated. These rules, and
				1356	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1357	System Architecture Specification* [HSA]_.
				1358	6. A kernel dispatch packet includes information about the actual dispatch,
				1359	such as grid and work-group size, together with information from the code
				1360	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1361	the kernel symbol can be used to obtain the code object values which are
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1362	recorded in the :ref:`amdgpu-amdhsa-hsa-code-object-metadata`.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1363	7. CP executes micro-code and is responsible for detecting and setting up the
				1364	GPU to execute the wavefronts of a kernel dispatch.
				1365	8. CP ensures that when the a wavefront starts executing the kernel machine
				1366	code, the scalar general purpose registers (SGPR) and vector general purpose
				1367	registers (VGPR) are set up as required by the machine code. The required
				1368	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1369	register state is defined in
				1370	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1371	9. The prolog of the kernel machine code (see
				1372	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1373	before continuing executing the machine code that corresponds to the kernel.
				1374	10. When the kernel dispatch has completed execution, CP signals the completion
				1375	signal specified in the kernel dispatch packet if not 0.
				1376
				1377	.. _amdgpu-amdhsa-memory-spaces:
				1378
				1379	Memory Spaces
				1380	~~~~~~~~~~~~~
				1381
				1382	The memory space properties are:
				1383
				1384	.. table:: AMDHSA Memory Spaces
				1385	:name: amdgpu-amdhsa-memory-spaces-table
				1386
				1387	================= =========== ======== ======= ==================
				1388	Memory Space Name HSA Segment Hardware Address NULL Value
				1389	Name Name Size
				1390	================= =========== ======== ======= ==================
				1391	Private private scratch 32 0x00000000
				1392	Local group LDS 32 0xFFFFFFFF
				1393	Global global global 64 0x0000000000000000
				1394	Constant constant *same as 64 0x0000000000000000
				1395	global*
				1396	Generic flat flat 64 0x0000000000000000
				1397	Region N/A GDS 32 *not implemented
				1398	for AMDHSA*
				1399	================= =========== ======== ======= ==================
				1400
				1401	The global and constant memory spaces both use global virtual addresses, which
				1402	are the same virtual address space used by the CPU. However, some virtual
				1403	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1404	and some by both.
				1405
				1406	Using the constant memory space indicates that the data will not change during
				1407	the execution of the kernel. This allows scalar read instructions to be
				1408	used. The vector and scalar L1 caches are invalidated of volatile data before
				1409	each kernel dispatch execution to allow constant memory to change values between
				1410	kernel dispatches.
				1411
				1412	The local memory space uses the hardware Local Data Store (LDS) which is
				1413	automatically allocated when the hardware creates work-groups of wavefronts, and
				1414	freed when all the wavefronts of a work-group have terminated. The data store
				1415	(DS) instructions can be used to access it.
				1416
				1417	The private memory space uses the hardware scratch memory support. If the kernel
				1418	uses scratch, then the hardware allocates memory that is accessed using
				1419	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1420	address to physical address is:
				1421
				1422	``wavefront-scratch-base +
				1423	(private-address * wavefront-size * 4) +
				1424	(wavefront-lane-id * 4)``
				1425
				1426	There are different ways that the wavefront scratch base address is determined
				1427	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1428	memory can be accessed in an interleaved manner using buffer instruction with
				1429	the scratch buffer descriptor and per wave scratch offset, by the scratch
				1430	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1431	same private address, the interleaving results in adjacent dwords being accessed
				1432	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1433	supported except by flat and scratch instructions in GFX9.
				1434
				1435	The generic address space uses the hardware flat address support available in
				1436	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1437	local appertures), that are outside the range of addressible global memory, to
				1438	map from a flat address to a private or local address.
				1439
				1440	FLAT instructions can take a flat address and access global, private (scratch)
				1441	and group (LDS) memory depending in if the address is within one of the
				1442	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1443	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1444	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1445	(see :ref:`amdgpu-amdhsa-m0`).
				1446
				1447	To convert between a segment address and a flat address the base address of the
				1448	appertures address can be used. For GFX7-GFX8 these are available in the
				1449	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1450	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1451	GFX9 the appature base addresses are directly available as inline constant
				1452	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1453	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1454	which makes it easier to convert from flat to segment or segment to flat.
				1455
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1456	Image and Samplers
				1457	~~~~~~~~~~~~~~~~~~
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1458
				1459	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1460	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1461	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1462	enumeration values for the queries that are not trivially deducible from the S#
				1463	representation.
				1464
				1465	HSA Signals
				1466	~~~~~~~~~~~
				1467
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1468	HSA signal handles created by the ROCm runtime are 64 bit addresses of a
				1469	structure allocated in memory accessible from both the CPU and GPU. The
				1470	structure is defined by the ROCm runtime and subject to change between releases
				1471	(see [AMD-ROCm-github]_).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1472
				1473	.. _amdgpu-amdhsa-hsa-aql-queue:
				1474
				1475	HSA AQL Queue
				1476	~~~~~~~~~~~~~
				1477
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	1478	The HSA AQL queue structure is defined by the ROCm runtime and subject to change
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1479	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1480	fields needed to implement certain language features such as the flat address
				1481	aperture bases. It also contains fields used by CP such as managing the
				1482	allocation of scratch memory.
				1483
				1484	.. _amdgpu-amdhsa-kernel-descriptor:
				1485
				1486	Kernel Descriptor
				1487	~~~~~~~~~~~~~~~~~
				1488
				1489	A kernel descriptor consists of the information needed by CP to initiate the
				1490	execution of a kernel, including the entry point address of the machine code
				1491	that implements the kernel.
				1492
				1493	Kernel Descriptor for GFX6-GFX9
				1494	+++++++++++++++++++++++++++++++
				1495
				1496	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1497
				1498	.. table:: Kernel Descriptor for GFX6-GFX9
				1499	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1500
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1501	======= ======= =============================== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1502	Bits Size Field Name Description
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1503	======= ======= =============================== ============================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1504	31:0 4 bytes GroupSegmentFixedSize The amount of fixed local
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1505	address space memory
				1506	required for a work-group
				1507	in bytes. This does not
				1508	include any dynamically
				1509	allocated local address
				1510	space memory that may be
				1511	added when the kernel is
				1512	dispatched.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1513	63:32 4 bytes PrivateSegmentFixedSize The amount of fixed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1514	private address space
				1515	memory required for a
				1516	work-item in bytes. If
				1517	is_dynamic_callstack is 1
				1518	then additional space must
				1519	be added to this value for
				1520	the call stack.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1521	127:64 8 bytes Reserved, must be 0.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1522	191:128 8 bytes KernelCodeEntryByteOffset Byte offset (possibly
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1523	negative) from base
				1524	address of kernel
				1525	descriptor to kernel's
				1526	entry point instruction
				1527	which must be 256 byte
				1528	aligned.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1529	223:192 4 bytes MaxFlatWorkGroupSize Maximum flat work-group
				1530	size supported by the
				1531	kernel in work-items. If
				1532	an exact work-group size
				1533	is required then must be
				1534	omitted or 0 and
				1535	ReqdWorkGroupSize* must
				1536	be set to non-0.
				1537	239:224 2 bytes ReqdWorkGroupSizeX If present and non-0 then
				1538	the kernel
				1539	must be executed with the
				1540	specified work-group size
				1541	for X.
				1542	255:240 2 bytes ReqdWorkGroupSizeY If present and non-0 then
				1543	the kernel
				1544	must be executed with the
				1545	specified work-group size
				1546	for Y.
				1547	271:256 2 bytes ReqdWorkGroupSizeZ If present and non-0 then
				1548	the kernel
				1549	must be executed with the
				1550	specified work-group size
				1551	for Z.
				1552	383:271 14 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1553	bytes
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1554	415:384 4 bytes ComputePgmRsrc1 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1555	program settings used by
				1556	CP to set up
				1557	``COMPUTE_PGM_RSRC1``
				1558	configuration
				1559	register. See
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1560	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1561	447:416 4 bytes ComputePgmRsrc2 Compute Shader (CS)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1562	program settings used by
				1563	CP to set up
				1564	``COMPUTE_PGM_RSRC2``
				1565	configuration
				1566	register. See
				1567	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1568	448 1 bit EnableSGPRPrivateSegmentBuffer Enable the setup of the
				1569	SGPR user data registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1570	(see
				1571	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1572
				1573	The total number of SGPR
				1574	user data registers
				1575	requested must not exceed
				1576	16 and match value in
				1577	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1578	Any requests beyond 16
				1579	will be ignored.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1580	449 1 bit EnableSGPRDispatchPtr see above
				1581	450 1 bit EnableSGPRQueuePtr see above
				1582	451 1 bit EnableSGPRKernargSegmentPtr see above
				1583	452 1 bit EnableSGPRDispatchID see above
				1584	453 1 bit EnableSGPRFlatScratchInit see above
				1585	454 1 bit EnableSGPRPrivateSegmentSize see above
				1586	455 1 bit EnableSGPRGridWorkgroupCountX Not implemented in CP and
				1587	should always be 0.
				1588	456 1 bit EnableSGPRGridWorkgroupCountY Not implemented in CP and
				1589	should always be 0.
				1590	457 1 bit EnableSGPRGridWorkgroupCountZ Not implemented in CP and
				1591	should always be 0.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1592	462:458 5 bits Reserved, must be 0.
				1593	463 1 bit IsXNACKEnabled Indicates if the generated
				1594	machine code is capable of
				1595	supporting XNACK.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1596	511:464 6 Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1597	bytes
				1598	512 Total size 64 bytes.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1599	======= ====================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1600
				1601	..
				1602
				1603	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1604	:name: amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx9-table
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1605
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1606	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1607	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1608	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1609	5:0 6 bits GRANULATED_WORKITEM_VGPR_COUNT Number of vector registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1610	used by each work-item,
				1611	granularity is device
				1612	specific:
				1613
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1614	GFX6-GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1615	- max_vgpr 1..256
				1616	- roundup((max_vgpg + 1)
				1617	/ 4) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1618
				1619	Used by CP to set up
				1620	``COMPUTE_PGM_RSRC1.VGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1621	9:6 4 bits GRANULATED_WAVEFRONT_SGPR_COUNT Number of scalar registers
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1622	used by a wavefront,
				1623	granularity is device
				1624	specific:
				1625
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1626	GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1627	- max_sgpr 1..112
				1628	- roundup((max_sgpg + 1)
				1629	/ 8) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1630	GFX9
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1631	- max_sgpr 1..112
				1632	- roundup((max_sgpg + 1)
				1633	/ 16) - 1
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1634
				1635	Includes the special SGPRs
				1636	for VCC, Flat Scratch (for
				1637	GFX7 onwards) and XNACK
				1638	(for GFX8 onwards). It does
				1639	not include the 16 SGPR
				1640	added if a trap handler is
				1641	enabled.
				1642
				1643	Used by CP to set up
				1644	``COMPUTE_PGM_RSRC1.SGPRS``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1645	11:10 2 bits PRIORITY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1646
				1647	Start executing wavefront
				1648	at the specified priority.
				1649
				1650	CP is responsible for
				1651	filling in
				1652	``COMPUTE_PGM_RSRC1.PRIORITY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1653	13:12 2 bits FLOAT_ROUND_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1654	with specified rounding
				1655	mode for single (32
				1656	bit) floating point
				1657	precision floating point
				1658	operations.
				1659
				1660	Floating point rounding
				1661	mode values are defined in
				1662	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1663
				1664	Used by CP to set up
				1665	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1666	15:14 2 bits FLOAT_ROUND_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1667	with specified rounding
				1668	denorm mode for half/double (16
				1669	and 64 bit) floating point
				1670	precision floating point
				1671	operations.
				1672
				1673	Floating point rounding
				1674	mode values are defined in
				1675	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1676
				1677	Used by CP to set up
				1678	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1679	17:16 2 bits FLOAT_DENORM_MODE_32 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1680	with specified denorm mode
				1681	for single (32
				1682	bit) floating point
				1683	precision floating point
				1684	operations.
				1685
				1686	Floating point denorm mode
				1687	values are defined in
				1688	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1689
				1690	Used by CP to set up
				1691	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1692	19:18 2 bits FLOAT_DENORM_MODE_16_64 Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1693	with specified denorm mode
				1694	for half/double (16
				1695	and 64 bit) floating point
				1696	precision floating point
				1697	operations.
				1698
				1699	Floating point denorm mode
				1700	values are defined in
				1701	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1702
				1703	Used by CP to set up
				1704	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1705	20 1 bit PRIV Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1706
				1707	Start executing wavefront
				1708	in privilege trap handler
				1709	mode.
				1710
				1711	CP is responsible for
				1712	filling in
				1713	``COMPUTE_PGM_RSRC1.PRIV``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1714	21 1 bit ENABLE_DX10_CLAMP Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1715	with DX10 clamp mode
				1716	enabled. Used by the vector
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1717	ALU to force DX10 style
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1718	treatment of NaN's (when
				1719	set, clamp NaN to zero,
				1720	otherwise pass NaN
				1721	through).
				1722
				1723	Used by CP to set up
				1724	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1725	22 1 bit DEBUG_MODE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1726
				1727	Start executing wavefront
				1728	in single step mode.
				1729
				1730	CP is responsible for
				1731	filling in
				1732	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1733	23 1 bit ENABLE_IEEE_MODE Wavefront starts execution
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1734	with IEEE mode
				1735	enabled. Floating point
				1736	opcodes that support
				1737	exception flag gathering
				1738	will quiet and propagate
				1739	signaling-NaN inputs per
				1740	IEEE 754-2008. Min_dx10 and
				1741	max_dx10 become IEEE
				1742	754-2008 compliant due to
				1743	signaling-NaN propagation
				1744	and quieting.
				1745
				1746	Used by CP to set up
				1747	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1748	24 1 bit BULKY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1749
				1750	Only one work-group allowed
				1751	to execute on a compute
				1752	unit.
				1753
				1754	CP is responsible for
				1755	filling in
				1756	``COMPUTE_PGM_RSRC1.BULKY``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1757	25 1 bit CDBG_USER Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1758
				1759	Flag that can be used to
				1760	control debugging code.
				1761
				1762	CP is responsible for
				1763	filling in
				1764	``COMPUTE_PGM_RSRC1.CDBG_USER``.
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	1765	26 1 bit FP16_OVFL GFX6-GFX8
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1766	Reserved, must be 0.
				1767	GFX9
				1768	Wavefront starts execution
				1769	with specified fp16 overflow
				1770	mode.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1771
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1772	- If 0, fp16 overflow generates
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1773	+/-INF values.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1774	- If 1, fp16 overflow that is the
				1775	result of an +/-INF input value
				1776	or divide by 0 produces a +/-INF,
				1777	otherwise clamps computed
				1778	overflow to +/-MAX_FP16 as
				1779	appropriate.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1780
				1781	Used by CP to set up
				1782	``COMPUTE_PGM_RSRC1.FP16_OVFL``.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1783	31:27 5 bits Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1784	32 Total size 4 bytes
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1785	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1786
				1787	..
				1788
				1789	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1790	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1791
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1792	======= ======= =============================== ===========================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1793	Bits Size Field Name Description
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1794	======= ======= =============================== ===========================================================================
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1795	0 1 bit ENABLE_SGPR_PRIVATE_SEGMENT Enable the setup of the
				1796	_WAVE_OFFSET SGPR wave scratch offset
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1797	system register (see
				1798	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1799
				1800	Used by CP to set up
				1801	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1802	5:1 5 bits USER_SGPR_COUNT The total number of SGPR
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1803	user data registers
				1804	requested. This number must
				1805	match the number of user
				1806	data registers enabled.
				1807
				1808	Used by CP to set up
				1809	``COMPUTE_PGM_RSRC2.USER_SGPR``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1810	6 1 bit ENABLE_TRAP_HANDLER Set to 1 if code contains a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1811	TRAP instruction which
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	1812	requires a trap handler to
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1813	be enabled.
				1814
				1815	CP sets
				1816	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
				1817	if the runtime has
				1818	installed a trap handler
				1819	regardless of the setting
				1820	of this field.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1821	7 1 bit ENABLE_SGPR_WORKGROUP_ID_X Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1822	system SGPR register for
				1823	the work-group id in the X
				1824	dimension (see
				1825	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1826
				1827	Used by CP to set up
				1828	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1829	8 1 bit ENABLE_SGPR_WORKGROUP_ID_Y Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1830	system SGPR register for
				1831	the work-group id in the Y
				1832	dimension (see
				1833	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1834
				1835	Used by CP to set up
				1836	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1837	9 1 bit ENABLE_SGPR_WORKGROUP_ID_Z Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1838	system SGPR register for
				1839	the work-group id in the Z
				1840	dimension (see
				1841	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1842
				1843	Used by CP to set up
				1844	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1845	10 1 bit ENABLE_SGPR_WORKGROUP_INFO Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1846	system SGPR register for
				1847	work-group information (see
				1848	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1849
				1850	Used by CP to set up
				1851	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1852	12:11 2 bits ENABLE_VGPR_WORKITEM_ID Enable the setup of the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1853	VGPR system registers used
				1854	for the work-item ID.
				1855	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1856	defines the values.
				1857
				1858	Used by CP to set up
				1859	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1860	13 1 bit ENABLE_EXCEPTION_ADDRESS_WATCH Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1861
				1862	Wavefront starts execution
				1863	with address watch
				1864	exceptions enabled which
				1865	are generated when L1 has
				1866	witnessed a thread access
				1867	an *address of
				1868	interest*.
				1869
				1870	CP is responsible for
				1871	filling in the address
				1872	watch bit in
				1873	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1874	according to what the
				1875	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1876	14 1 bit ENABLE_EXCEPTION_MEMORY Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1877
				1878	Wavefront starts execution
				1879	with memory violation
				1880	exceptions exceptions
				1881	enabled which are generated
				1882	when a memory violation has
				1883	occurred for this wave from
				1884	L1 or LDS
				1885	(write-to-read-only-memory,
				1886	mis-aligned atomic, LDS
				1887	address out of range,
				1888	illegal address, etc.).
				1889
				1890	CP sets the memory
				1891	violation bit in
				1892	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1893	according to what the
				1894	runtime requests.
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1895	23:15 9 bits GRANULATED_LDS_SIZE Must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1896
				1897	CP uses the rounded value
				1898	from the dispatch packet,
				1899	not this value, as the
				1900	dispatch may contain
				1901	dynamically allocated group
				1902	segment memory. CP writes
				1903	directly to
				1904	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1905
				1906	Amount of group segment
				1907	(LDS) to allocate for each
				1908	work-group. Granularity is
				1909	device specific:
				1910
				1911	GFX6:
				1912	roundup(lds-size / (64 * 4))
				1913	GFX7-GFX9:
				1914	roundup(lds-size / (128 * 4))
				1915
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1916	24 1 bit ENABLE_EXCEPTION_IEEE_754_FP Wavefront starts execution
				1917	_INVALID_OPERATION with specified exceptions
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1918	enabled.
				1919
				1920	Used by CP to set up
				1921	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1922	(set from bits 0..6).
				1923
				1924	IEEE 754 FP Invalid
				1925	Operation
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1926	25 1 bit ENABLE_EXCEPTION_FP_DENORMAL FP Denormal one or more
				1927	_SOURCE input operands is a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1928	denormal number
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1929	26 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Division by
				1930	_DIVISION_BY_ZERO Zero
				1931	27 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP FP Overflow
				1932	_OVERFLOW
				1933	28 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Underflow
				1934	_UNDERFLOW
				1935	29 1 bit ENABLE_EXCEPTION_IEEE_754_FP IEEE 754 FP Inexact
				1936	_INEXACT
				1937	30 1 bit ENABLE_EXCEPTION_INT_DIVIDE_BY Integer Division by Zero
				1938	_ZERO (rcp_iflag_f32 instruction
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1939	only)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	1940	31 1 bit Reserved, must be 0.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1941	32 Total size 4 bytes.
Tony Tye	3b34061	2017-06-07 00:46:08 +0000	[diff] [blame]	1942	======= ===================================================================================================================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1943
				1944	..
				1945
				1946	.. table:: Floating Point Rounding Mode Enumeration Values
				1947	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1948
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1949	====================================== ===== ==============================
				1950	Enumeration Name Value Description
				1951	====================================== ===== ==============================
				1952	AMDGPU_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				1953	AMDGPU_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				1954	AMDGPU_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				1955	AMDGPU_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
				1956	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1957
				1958	..
				1959
				1960	.. table:: Floating Point Denorm Mode Enumeration Values
				1961	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				1962
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1963	====================================== ===== ==============================
				1964	Enumeration Name Value Description
				1965	====================================== ===== ==============================
				1966	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
				1967	Denorms
				1968	AMDGPU_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				1969	AMDGPU_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				1970	AMDGPU_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
				1971	====================================== ===== ==============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1972
				1973	..
				1974
				1975	.. table:: System VGPR Work-Item ID Enumeration Values
				1976	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				1977
Konstantin Zhuravlyov	13376a4	2017-10-14 19:17:08 +0000	[diff] [blame]	1978	======================================== ===== ============================
				1979	Enumeration Name Value Description
				1980	======================================== ===== ============================
				1981	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension
				1982	ID.
				1983	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				1984	dimensions ID.
				1985	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				1986	dimensions ID.
				1987	AMDGPU_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				1988	======================================== ===== ============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1989
				1990	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				1991
				1992	Initial Kernel Execution State
				1993	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1994
				1995	This section defines the register state that will be set up by the packet
				1996	processor prior to the start of execution of every wavefront. This is limited by
				1997	the constraints of the hardware controllers of CP/ADC/SPI.
				1998
				1999	The order of the SGPR registers is defined, but the compiler can specify which
				2000	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				2001	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2002	for enabled registers are dense starting at SGPR0: the first enabled register is
				2003	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				2004	an SGPR number.
				2005
				2006	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
				2007	all waves of the grid. It is possible to specify more than 16 User SGPRs using
				2008	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				2009	initialized. These are then immediately followed by the System SGPRs that are
				2010	set up by ADC/SPI and can have different values for each wave of the grid
				2011	dispatch.
				2012
				2013	SGPR register initial state is defined in
				2014	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				2015
				2016	.. table:: SGPR Register Set Up Order
				2017	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				2018
				2019	========== ========================== ====== ==============================
				2020	SGPR Order Name Number Description
				2021	(kernel descriptor enable of
				2022	field) SGPRs
				2023	========== ========================== ====== ==============================
				2024	First Private Segment Buffer 4 V# that can be used, together
				2025	(enable_sgpr_private with Scratch Wave Offset as an
				2026	_segment_buffer) offset, to access the private
				2027	memory space using a segment
				2028	address.
				2029
				2030	CP uses the value provided by
				2031	the runtime.
				2032	then Dispatch Ptr 2 64 bit address of AQL dispatch
				2033	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				2034	actually executing.
				2035	then Queue Ptr 2 64 bit address of amd_queue_t
				2036	(enable_sgpr_queue_ptr) object for AQL queue on which
				2037	the dispatch packet was
				2038	queued.
				2039	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				2040	(enable_sgpr_kernarg segment. This is directly
				2041	_segment_ptr) copied from the
				2042	kernarg_address in the kernel
				2043	dispatch packet.
				2044
				2045	Having CP load it once avoids
				2046	loading it at the beginning of
				2047	every wavefront.
				2048	then Dispatch Id 2 64 bit Dispatch ID of the
				2049	(enable_sgpr_dispatch_id) dispatch packet being
				2050	executed.
				2051	then Flat Scratch Init 2 This is 2 SGPRs:
				2052	(enable_sgpr_flat_scratch
				2053	_init) GFX6
				2054	Not supported.
				2055	GFX7-GFX8
				2056	The first SGPR is a 32 bit
				2057	byte offset from
				2058	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2059	to per SPI base of memory
				2060	for scratch for the queue
				2061	executing the kernel
				2062	dispatch. CP obtains this
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2063	from the runtime. (The
				2064	Scratch Segment Buffer base
				2065	address is
				2066	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2067	plus this offset.) The value
				2068	of Scratch Wave Offset must
				2069	be added to this offset by
				2070	the kernel machine code,
				2071	right shifted by 8, and
				2072	moved to the FLAT_SCRATCH_HI
				2073	SGPR register.
				2074	FLAT_SCRATCH_HI corresponds
				2075	to SGPRn-4 on GFX7, and
				2076	SGPRn-6 on GFX8 (where SGPRn
				2077	is the highest numbered SGPR
				2078	allocated to the wave).
				2079	FLAT_SCRATCH_HI is
				2080	multiplied by 256 (as it is
				2081	in units of 256 bytes) and
				2082	added to
				2083	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				2084	to calculate the per wave
				2085	FLAT SCRATCH BASE in flat
				2086	memory instructions that
				2087	access the scratch
				2088	apperture.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2089
				2090	The second SGPR is 32 bit
				2091	byte size of a single
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2092	work-item's scratch memory
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2093	usage. CP obtains this from
				2094	the runtime, and it is
				2095	always a multiple of DWORD.
				2096	CP checks that the value in
				2097	the kernel dispatch packet
				2098	Private Segment Byte Size is
				2099	not larger, and requests the
				2100	runtime to increase the
				2101	queue's scratch size if
				2102	necessary. The kernel code
				2103	must move it to
				2104	FLAT_SCRATCH_LO which is
				2105	SGPRn-3 on GFX7 and SGPRn-5
				2106	on GFX8. FLAT_SCRATCH_LO is
				2107	used as the FLAT SCRATCH
				2108	SIZE in flat memory
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2109	instructions. Having CP load
				2110	it once avoids loading it at
				2111	the beginning of every
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2112	wavefront.
				2113	GFX9
				2114	This is the
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2115	64 bit base address of the
				2116	per SPI scratch backing
				2117	memory managed by SPI for
				2118	the queue executing the
				2119	kernel dispatch. CP obtains
				2120	this from the runtime (and
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2121	divides it if there are
				2122	multiple Shader Arrays each
				2123	with its own SPI). The value
				2124	of Scratch Wave Offset must
				2125	be added by the kernel
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	2126	machine code and the result
				2127	moved to the FLAT_SCRATCH
				2128	SGPR which is SGPRn-6 and
				2129	SGPRn-5. It is used as the
				2130	FLAT SCRATCH BASE in flat
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2131	memory instructions.
				2132	then Private Segment Size 1 The 32 bit byte size of a
				2133	(enable_sgpr_private single
				2134	work-item's
				2135	scratch_segment_size) memory
				2136	allocation. This is the
				2137	value from the kernel
				2138	dispatch packet Private
				2139	Segment Byte Size rounded up
				2140	by CP to a multiple of
				2141	DWORD.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2142
				2143	Having CP load it once avoids
				2144	loading it at the beginning of
				2145	every wavefront.
				2146
				2147	This is not used for
				2148	GFX7-GFX8 since it is the same
				2149	value as the second SGPR of
				2150	Flat Scratch Init. However, it
				2151	may be needed for GFX9 which
				2152	changes the meaning of the
				2153	Flat Scratch Init value.
				2154	then Grid Work-Group Count X 1 32 bit count of the number of
				2155	(enable_sgpr_grid work-groups in the X dimension
				2156	_workgroup_count_X) for the grid being
				2157	executed. Computed from the
				2158	fields in the kernel dispatch
				2159	packet as ((grid_size.x +
				2160	workgroup_size.x - 1) /
				2161	workgroup_size.x).
				2162	then Grid Work-Group Count Y 1 32 bit count of the number of
				2163	(enable_sgpr_grid work-groups in the Y dimension
				2164	_workgroup_count_Y && for the grid being
				2165	less than 16 previous executed. Computed from the
				2166	SGPRs) fields in the kernel dispatch
				2167	packet as ((grid_size.y +
				2168	workgroup_size.y - 1) /
				2169	workgroupSize.y).
				2170
				2171	Only initialized if <16
				2172	previous SGPRs initialized.
				2173	then Grid Work-Group Count Z 1 32 bit count of the number of
				2174	(enable_sgpr_grid work-groups in the Z dimension
				2175	_workgroup_count_Z && for the grid being
				2176	less than 16 previous executed. Computed from the
				2177	SGPRs) fields in the kernel dispatch
				2178	packet as ((grid_size.z +
				2179	workgroup_size.z - 1) /
				2180	workgroupSize.z).
				2181
				2182	Only initialized if <16
				2183	previous SGPRs initialized.
				2184	then Work-Group Id X 1 32 bit work-group id in X
				2185	(enable_sgpr_workgroup_id dimension of grid for
				2186	_X) wavefront.
				2187	then Work-Group Id Y 1 32 bit work-group id in Y
				2188	(enable_sgpr_workgroup_id dimension of grid for
				2189	_Y) wavefront.
				2190	then Work-Group Id Z 1 32 bit work-group id in Z
				2191	(enable_sgpr_workgroup_id dimension of grid for
				2192	_Z) wavefront.
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	2193	then Work-Group Info 1 {first_wave, 14'b0000,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2194	(enable_sgpr_workgroup ordered_append_term[10:0],
				2195	_info) threadgroup_size_in_waves[5:0]}
				2196	then Scratch Wave Offset 1 32 bit byte offset from base
				2197	(enable_sgpr_private of scratch base of queue
				2198	_segment_wave_offset) executing the kernel
				2199	dispatch. Must be used as an
				2200	offset with Private
				2201	segment address when using
				2202	Scratch Segment Buffer. It
				2203	must be used to set up FLAT
				2204	SCRATCH for flat addressing
				2205	(see
				2206	:ref:`amdgpu-amdhsa-flat-scratch`).
				2207	========== ========================== ====== ==============================
				2208
				2209	The order of the VGPR registers is defined, but the compiler can specify which
				2210	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2211	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2212	for enabled registers are dense starting at VGPR0: the first enabled register is
				2213	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2214	VGPR number.
				2215
				2216	VGPR register initial state is defined in
				2217	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2218
				2219	.. table:: VGPR Register Set Up Order
				2220	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2221
				2222	========== ========================== ====== ==============================
				2223	VGPR Order Name Number Description
				2224	(kernel descriptor enable of
				2225	field) VGPRs
				2226	========== ========================== ====== ==============================
				2227	First Work-Item Id X 1 32 bit work item id in X
				2228	(Always initialized) dimension of work-group for
				2229	wavefront lane.
				2230	then Work-Item Id Y 1 32 bit work item id in Y
				2231	(enable_vgpr_workitem_id dimension of work-group for
				2232	> 0) wavefront lane.
				2233	then Work-Item Id Z 1 32 bit work item id in Z
				2234	(enable_vgpr_workitem_id dimension of work-group for
				2235	> 1) wavefront lane.
				2236	========== ========================== ====== ==============================
				2237
				2238	The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
				2239
				2240	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2241	registers.
				2242	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2243	combination including none.
				2244	3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
				2245	cannot included with the flat scratch init value which is per queue.
				2246	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2247	or (X, Y, Z).
				2248
				2249	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2250	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2251
				2252	The global segment can be accessed either using buffer instructions (GFX6 which
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2253	has V# 64 bit address support), flat instructions (GFX7-GFX9), or global
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2254	instructions (GFX9).
				2255
				2256	If buffer operations are used then the compiler can generate a V# with the
				2257	following properties:
				2258
				2259	* base address of 0
				2260	* no swizzle
				2261	* ATC: 1 if IOMMU present (such as APU)
				2262	* ptr64: 1
				2263	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2264	APU and NC for dGPU).
				2265
				2266	.. _amdgpu-amdhsa-kernel-prolog:
				2267
				2268	Kernel Prolog
				2269	~~~~~~~~~~~~~
				2270
				2271	.. _amdgpu-amdhsa-m0:
				2272
				2273	M0
				2274	++
				2275
				2276	GFX6-GFX8
				2277	The M0 register must be initialized with a value at least the total LDS size
				2278	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2279	available in dispatch packet. For M0, it is also possible to use maximum
				2280	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2281	GFX7-GFX8).
				2282	GFX9
				2283	The M0 register is not used for range checking LDS accesses and so does not
				2284	need to be initialized in the prolog.
				2285
				2286	.. _amdgpu-amdhsa-flat-scratch:
				2287
				2288	Flat Scratch
				2289	++++++++++++
				2290
				2291	If the kernel may use flat operations to access scratch memory, the prolog code
				2292	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
				2293	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
				2294	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2295
				2296	GFX6
				2297	Flat scratch is not supported.
				2298
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2299	GFX7-GFX8
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2300	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2301	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2302	being managed by SPI for the queue executing the kernel dispatch. This is
				2303	the same value used in the Scratch Segment Buffer V# base address. The
				2304	prolog must add the value of Scratch Wave Offset to get the wave's byte
				2305	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2306	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2307	by 8 before moving into FLAT_SCRATCH_LO.
				2308	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2309	work-items scratch memory usage. This is directly loaded from the kernel
				2310	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2311	DWORD. Having CP load it once avoids loading it at the beginning of every
				2312	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2313	SIZE.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	2314
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2315	GFX9
				2316	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2317	memory being managed by SPI for the queue executing the kernel dispatch. The
				2318	prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
				2319	pair for use as the flat scratch base in flat memory instructions.
				2320
				2321	.. _amdgpu-amdhsa-memory-model:
				2322
				2323	Memory Model
				2324	~~~~~~~~~~~~
				2325
				2326	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2327	(see :ref:`memmodel`). The implementation is WIP.
				2328
				2329	.. TODO
				2330	Update when implementation complete.
				2331
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2332	The AMDGPU backend supports the memory synchronization scopes specified in
				2333	:ref:`amdgpu-memory-scopes`.
				2334
				2335	The code sequences used to implement the memory model are defined in table
				2336	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2337
				2338	The sequences specify the order of instructions that a single thread must
				2339	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2340	to other memory instructions executed by the same thread. This allows them to be
				2341	moved earlier or later which can allow them to be combined with other instances
				2342	of the same instruction, or hoisted/sunk out of loops to improve
				2343	performance. Only the instructions related to the memory model are given;
				2344	additional ``s_waitcnt`` instructions are required to ensure registers are
				2345	defined before being used. These may be able to be combined with the memory
				2346	model ``s_waitcnt`` instructions as described above.
				2347
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2348	The AMDGPU backend supports the following memory models:
				2349
				2350	HSA Memory Model [HSA]_
				2351	The HSA memory model uses a single happens-before relation for all address
				2352	spaces (see :ref:`amdgpu-address-spaces`).
				2353	OpenCL Memory Model [OpenCL]_
				2354	The OpenCL memory model which has separate happens-before relations for the
				2355	global and local address spaces. Only a fence specifying both global and
				2356	local address space, and seq_cst instructions join the relationships. Since
				2357	the LLVM ``memfence`` instruction does not allow an address space to be
				2358	specified the OpenCL fence has to convervatively assume both local and
				2359	global address space was specified. However, optimizations can often be
				2360	done to eliminate the additional ``s_waitcnt`` instructions when there are
				2361	no intervening memory instructions which access the corresponding address
				2362	space. The code sequences in the table indicate what can be omitted for the
				2363	OpenCL memory. The target triple environment is used to determine if the
				2364	source language is OpenCL (see :ref:`amdgpu-opencl`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2365
				2366	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2367	operations.
				2368
				2369	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2370	termed vector memory operations.
				2371
				2372	For GFX6-GFX9:
				2373
				2374	* Each agent has multiple compute units (CU).
				2375	* Each CU has multiple SIMDs that execute wavefronts.
				2376	* The wavefronts for a single work-group are executed in the same CU but may be
				2377	executed by different SIMDs.
				2378	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2379	executing on it.
				2380	* All LDS operations of a CU are performed as wavefront wide operations in a
				2381	global order and involve no caching. Completion is reported to a wavefront in
				2382	execution order.
				2383	* The LDS memory has multiple request queues shared by the SIMDs of a
				2384	CU. Therefore, the LDS operations performed by different waves of a work-group
				2385	can be reordered relative to each other, which can result in reordering the
				2386	visibility of vector memory operations with respect to LDS operations of other
				2387	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2388	ensure synchronization between LDS operations and vector memory operations
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2389	between waves of a work-group, but not between operations performed by the
				2390	same wavefront.
				2391	* The vector memory operations are performed as wavefront wide operations and
				2392	completion is reported to a wavefront in execution order. The exception is
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2393	that for GFX7-GFX9 ``flat_load/store/atomic`` instructions can report out of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2394	vector memory order if they access LDS memory, and out of LDS operation order
				2395	if they access global memory.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2396	* The vector memory operations access a single vector L1 cache shared by all
				2397	SIMDs a CU. Therefore, no special action is required for coherence between the
				2398	lanes of a single wavefront, or for coherence between wavefronts in the same
				2399	work-group. A ``buffer_wbinvl1_vol`` is required for coherence between waves
				2400	executing in different work-groups as they may be executing on different CUs.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2401	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2402	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2403	scalar operations are used in a restricted way so do not impact the memory
				2404	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2405	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2406	the same agent.
				2407	* The L2 cache has independent channels to service disjoint ranges of virtual
				2408	addresses.
				2409	* Each CU has a separate request queue per channel. Therefore, the vector and
				2410	scalar memory operations performed by waves executing in different work-groups
				2411	(which may be executing on different CUs) of an agent can be reordered
				2412	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2413	synchronization between vector memory operations of different CUs. It ensures a
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2414	previous vector memory operation has completed before executing a subsequent
				2415	vector memory or LDS operation and so can be used to meet the requirements of
				2416	acquire and release.
				2417	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2418	of virtual addresses can be set up to bypass it to ensure system coherence.
				2419
Tony Tye	07d9f10	2017-11-10 01:00:54 +0000	[diff] [blame]	2420	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-GFX8),
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2421	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2422	memory, atomic memory orderings are not meaningful and all accesses are treated
				2423	as non-atomic.
				2424
				2425	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2426	scalar memory instructions). Since the constant address space contents do not
				2427	change during the execution of a kernel dispatch it is not legal to perform
				2428	stores, and atomic memory orderings are not meaningful and all access are
				2429	treated as non-atomic.
				2430
				2431	A memory synchronization scope wider than work-group is not meaningful for the
				2432	group (LDS) address space and is treated as work-group.
				2433
				2434	The memory model does not support the region address space which is treated as
				2435	non-atomic.
				2436
				2437	Acquire memory ordering is not meaningful on store atomic instructions and is
				2438	treated as non-atomic.
				2439
				2440	Release memory ordering is not meaningful on load atomic instructions and is
				2441	treated a non-atomic.
				2442
				2443	Acquire-release memory ordering is not meaningful on load or store atomic
				2444	instructions and is treated as acquire and release respectively.
				2445
				2446	AMDGPU backend only uses scalar memory operations to access memory that is
				2447	proven to not change during the execution of the kernel dispatch. This includes
				2448	constant address space and global address space for program scope const
				2449	variables. Therefore the kernel machine code does not have to maintain the
				2450	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2451	and vector L1 caches are invalidated between kernel dispatches by CP since
				2452	constant address space data may change between kernel dispatch executions. See
				2453	:ref:`amdgpu-amdhsa-memory-spaces`.
				2454
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	2455	The one execption is if scalar writes are used to spill SGPR registers. In this
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2456	case the AMDGPU backend ensures the memory location used to spill is never
				2457	accessed by vector memory operations at the same time. If scalar writes are used
				2458	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2459	return since the locations may be used for vector memory instructions by a
				2460	future wave that uses the same scratch area, or a function call that creates a
				2461	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2462	as all scalar writes are write-before-read in the same thread.
				2463
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2464	Scratch backing memory (which is used for the private address space)
				2465	is accessed with MTYPE NC_NV (non-coherenent non-volatile). Since the private
				2466	address space is only accessed by a single thread, and is always
				2467	write-before-read, there is never a need to invalidate these entries from the L1
				2468	cache. Hence all cache invalidates are done as ``*_vol`` to only invalidate the
				2469	volatile cache lines.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2470
				2471	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2472	to invalidate the L2 cache. This also causes it to be treated as
				2473	non-volatile and so is not invalidated by ``*_vol``. On APU it is accessed as CC
				2474	(cache coherent) and so the L2 cache will coherent with the CPU and other
				2475	agents.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2476
				2477	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2478	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2479
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2480	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2481	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2482	Ordering Sync Scope Address
				2483	Space
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2484	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2485	Non-Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2486	-----------------------------------------------------------------------------------
				2487	load none none - global - !volatile & !nontemporal
				2488	- generic
				2489	- private 1. buffer/global/flat_load
				2490	- constant
				2491	- volatile & !nontemporal
				2492
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2493	1. buffer/global/flat_load
				2494	glc=1
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2495
				2496	- nontemporal
				2497
				2498	1. buffer/global/flat_load
				2499	glc=1 slc=1
				2500
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2501	load none none - local 1. ds_load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2502	store none none - global - !nontemporal
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2503	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2504	- private 1. buffer/global/flat_store
				2505	- constant
				2506	- nontemporal
				2507
				2508	1. buffer/global/flat_stote
				2509	glc=1 slc=1
				2510
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2511	store none none - local 1. ds_store
				2512	Unordered Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2513	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2514	load atomic unordered any any Same as non-atomic.
				2515	store atomic unordered any any Same as non-atomic.
				2516	atomicrmw unordered any any *Same as monotonic
				2517	atomic*.
				2518	Monotonic Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2519	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2520	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2521	- wavefront - generic
				2522	- workgroup
				2523	load atomic monotonic - singlethread - local 1. ds_load
				2524	- wavefront
				2525	- workgroup
				2526	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2527	- system - generic glc=1
				2528	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2529	- wavefront - generic
				2530	- workgroup
				2531	- agent
				2532	- system
				2533	store atomic monotonic - singlethread - local 1. ds_store
				2534	- wavefront
				2535	- workgroup
				2536	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2537	- wavefront - generic
				2538	- workgroup
				2539	- agent
				2540	- system
				2541	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2542	- wavefront
				2543	- workgroup
				2544	Acquire Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2545	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2546	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2547	- wavefront - local
				2548	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2549	load atomic acquire - workgroup - global 1. buffer/global/flat_load
				2550	load atomic acquire - workgroup - local 1. ds_load
				2551	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2552
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2553	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2554	- Must happen before
				2555	any following
				2556	global/generic
				2557	load/load
				2558	atomic/store/store
				2559	atomic/atomicrmw.
				2560	- Ensures any
				2561	following global
				2562	data read is no
				2563	older than the load
				2564	atomic value being
				2565	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2566	load atomic acquire - workgroup - generic 1. flat_load
				2567	2. s_waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2568
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2569	- If OpenCL, omit.
				2570	- Must happen before
				2571	any following
				2572	global/generic
				2573	load/load
				2574	atomic/store/store
				2575	atomic/atomicrmw.
				2576	- Ensures any
				2577	following global
				2578	data read is no
				2579	older than the load
				2580	atomic value being
				2581	acquired.
				2582	load atomic acquire - agent - global 1. buffer/global/flat_load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2583	- system glc=1
				2584	2. s_waitcnt vmcnt(0)
				2585
				2586	- Must happen before
				2587	following
				2588	buffer_wbinvl1_vol.
				2589	- Ensures the load
				2590	has completed
				2591	before invalidating
				2592	the cache.
				2593
				2594	3. buffer_wbinvl1_vol
				2595
				2596	- Must happen before
				2597	any following
				2598	global/generic
				2599	load/load
				2600	atomic/atomicrmw.
				2601	- Ensures that
				2602	following
				2603	loads will not see
				2604	stale global data.
				2605
				2606	load atomic acquire - agent - generic 1. flat_load glc=1
				2607	- system 2. s_waitcnt vmcnt(0) &
				2608	lgkmcnt(0)
				2609
				2610	- If OpenCL omit
				2611	lgkmcnt(0).
				2612	- Must happen before
				2613	following
				2614	buffer_wbinvl1_vol.
				2615	- Ensures the flat_load
				2616	has completed
				2617	before invalidating
				2618	the cache.
				2619
				2620	3. buffer_wbinvl1_vol
				2621
				2622	- Must happen before
				2623	any following
				2624	global/generic
				2625	load/load
				2626	atomic/atomicrmw.
				2627	- Ensures that
				2628	following loads
				2629	will not see stale
				2630	global data.
				2631
				2632	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2633	- wavefront - local
				2634	- generic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2635	atomicrmw acquire - workgroup - global 1. buffer/global/flat_atomic
				2636	atomicrmw acquire - workgroup - local 1. ds_atomic
				2637	2. waitcnt lgkmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2638
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2639	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2640	- Must happen before
				2641	any following
				2642	global/generic
				2643	load/load
				2644	atomic/store/store
				2645	atomic/atomicrmw.
				2646	- Ensures any
				2647	following global
				2648	data read is no
				2649	older than the
				2650	atomicrmw value
				2651	being acquired.
				2652
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2653	atomicrmw acquire - workgroup - generic 1. flat_atomic
				2654	2. waitcnt lgkmcnt(0)
				2655
				2656	- If OpenCL, omit.
				2657	- Must happen before
				2658	any following
				2659	global/generic
				2660	load/load
				2661	atomic/store/store
				2662	atomic/atomicrmw.
				2663	- Ensures any
				2664	following global
				2665	data read is no
				2666	older than the
				2667	atomicrmw value
				2668	being acquired.
				2669
				2670	atomicrmw acquire - agent - global 1. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2671	- system 2. s_waitcnt vmcnt(0)
				2672
				2673	- Must happen before
				2674	following
				2675	buffer_wbinvl1_vol.
				2676	- Ensures the
				2677	atomicrmw has
				2678	completed before
				2679	invalidating the
				2680	cache.
				2681
				2682	3. buffer_wbinvl1_vol
				2683
				2684	- Must happen before
				2685	any following
				2686	global/generic
				2687	load/load
				2688	atomic/atomicrmw.
				2689	- Ensures that
				2690	following loads
				2691	will not see stale
				2692	global data.
				2693
				2694	atomicrmw acquire - agent - generic 1. flat_atomic
				2695	- system 2. s_waitcnt vmcnt(0) &
				2696	lgkmcnt(0)
				2697
				2698	- If OpenCL, omit
				2699	lgkmcnt(0).
				2700	- Must happen before
				2701	following
				2702	buffer_wbinvl1_vol.
				2703	- Ensures the
				2704	atomicrmw has
				2705	completed before
				2706	invalidating the
				2707	cache.
				2708
				2709	3. buffer_wbinvl1_vol
				2710
				2711	- Must happen before
				2712	any following
				2713	global/generic
				2714	load/load
				2715	atomic/atomicrmw.
				2716	- Ensures that
				2717	following loads
				2718	will not see stale
				2719	global data.
				2720
				2721	fence acquire - singlethread none none
				2722	- wavefront
				2723	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2724
				2725	- If OpenCL and
				2726	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2727	not generic, omit.
				2728	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2729	currently has no
				2730	address space on
				2731	the fence need to
				2732	conservatively
				2733	always generate. If
				2734	fence had an
				2735	address space then
				2736	set to address
				2737	space of OpenCL
				2738	fence flag, or to
				2739	generic if both
				2740	local and global
				2741	flags are
				2742	specified.
				2743	- Must happen after
				2744	any preceding
				2745	local/generic load
				2746	atomic/atomicrmw
				2747	with an equal or
				2748	wider sync scope
				2749	and memory ordering
				2750	stronger than
				2751	unordered (this is
				2752	termed the
				2753	fence-paired-atomic).
				2754	- Must happen before
				2755	any following
				2756	global/generic
				2757	load/load
				2758	atomic/store/store
				2759	atomic/atomicrmw.
				2760	- Ensures any
				2761	following global
				2762	data read is no
				2763	older than the
				2764	value read by the
				2765	fence-paired-atomic.
				2766
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2767	fence acquire - agent none 1. s_waitcnt lgkmcnt(0) &
				2768	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2769
				2770	- If OpenCL and
				2771	address space is
				2772	not generic, omit
				2773	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2774	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2775	currently has no
				2776	address space on
				2777	the fence need to
				2778	conservatively
				2779	always generate
				2780	(see comment for
				2781	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame]	2782	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2783	separate s_waitcnt
				2784	vmcnt(0) and
				2785	s_waitcnt
				2786	lgkmcnt(0) to allow
				2787	them to be
				2788	independently moved
				2789	according to the
				2790	following rules.
				2791	- s_waitcnt vmcnt(0)
				2792	must happen after
				2793	any preceding
				2794	global/generic load
				2795	atomic/atomicrmw
				2796	with an equal or
				2797	wider sync scope
				2798	and memory ordering
				2799	stronger than
				2800	unordered (this is
				2801	termed the
				2802	fence-paired-atomic).
				2803	- s_waitcnt lgkmcnt(0)
				2804	must happen after
				2805	any preceding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2806	local/generic load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2807	atomic/atomicrmw
				2808	with an equal or
				2809	wider sync scope
				2810	and memory ordering
				2811	stronger than
				2812	unordered (this is
				2813	termed the
				2814	fence-paired-atomic).
				2815	- Must happen before
				2816	the following
				2817	buffer_wbinvl1_vol.
				2818	- Ensures that the
				2819	fence-paired atomic
				2820	has completed
				2821	before invalidating
				2822	the
				2823	cache. Therefore
				2824	any following
				2825	locations read must
				2826	be no older than
				2827	the value read by
				2828	the
				2829	fence-paired-atomic.
				2830
				2831	2. buffer_wbinvl1_vol
				2832
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2833	- Must happen before any
				2834	following global/generic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2835	load/load
				2836	atomic/store/store
				2837	atomic/atomicrmw.
				2838	- Ensures that
				2839	following loads
				2840	will not see stale
				2841	global data.
				2842
				2843	Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2844	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2845	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2846	- wavefront - local
				2847	- generic
				2848	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2849
				2850	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2851	- Must happen after
				2852	any preceding
				2853	local/generic
				2854	load/store/load
				2855	atomic/store
				2856	atomic/atomicrmw.
				2857	- Must happen before
				2858	the following
				2859	store.
				2860	- Ensures that all
				2861	memory operations
				2862	to local have
				2863	completed before
				2864	performing the
				2865	store that is being
				2866	released.
				2867
				2868	2. buffer/global/flat_store
				2869	store atomic release - workgroup - local 1. ds_store
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2870	store atomic release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2871
				2872	- If OpenCL, omit.
				2873	- Must happen after
				2874	any preceding
				2875	local/generic
				2876	load/store/load
				2877	atomic/store
				2878	atomic/atomicrmw.
				2879	- Must happen before
				2880	the following
				2881	store.
				2882	- Ensures that all
				2883	memory operations
				2884	to local have
				2885	completed before
				2886	performing the
				2887	store that is being
				2888	released.
				2889
				2890	2. flat_store
				2891	store atomic release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2892	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2893
				2894	- If OpenCL, omit
				2895	lgkmcnt(0).
				2896	- Could be split into
				2897	separate s_waitcnt
				2898	vmcnt(0) and
				2899	s_waitcnt
				2900	lgkmcnt(0) to allow
				2901	them to be
				2902	independently moved
				2903	according to the
				2904	following rules.
				2905	- s_waitcnt vmcnt(0)
				2906	must happen after
				2907	any preceding
				2908	global/generic
				2909	load/store/load
				2910	atomic/store
				2911	atomic/atomicrmw.
				2912	- s_waitcnt lgkmcnt(0)
				2913	must happen after
				2914	any preceding
				2915	local/generic
				2916	load/store/load
				2917	atomic/store
				2918	atomic/atomicrmw.
				2919	- Must happen before
				2920	the following
				2921	store.
				2922	- Ensures that all
				2923	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2924	to memory have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2925	completed before
				2926	performing the
				2927	store that is being
				2928	released.
				2929
				2930	2. buffer/global/ds/flat_store
				2931	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2932	- wavefront - local
				2933	- generic
				2934	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2935
				2936	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2937	- Must happen after
				2938	any preceding
				2939	local/generic
				2940	load/store/load
				2941	atomic/store
				2942	atomic/atomicrmw.
				2943	- Must happen before
				2944	the following
				2945	atomicrmw.
				2946	- Ensures that all
				2947	memory operations
				2948	to local have
				2949	completed before
				2950	performing the
				2951	atomicrmw that is
				2952	being released.
				2953
				2954	2. buffer/global/flat_atomic
				2955	atomicrmw release - workgroup - local 1. ds_atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	2956	atomicrmw release - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2957
				2958	- If OpenCL, omit.
				2959	- Must happen after
				2960	any preceding
				2961	local/generic
				2962	load/store/load
				2963	atomic/store
				2964	atomic/atomicrmw.
				2965	- Must happen before
				2966	the following
				2967	atomicrmw.
				2968	- Ensures that all
				2969	memory operations
				2970	to local have
				2971	completed before
				2972	performing the
				2973	atomicrmw that is
				2974	being released.
				2975
				2976	2. flat_atomic
				2977	atomicrmw release - agent - global 1. s_waitcnt lgkmcnt(0) &
				2978	- system - generic vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2979
				2980	- If OpenCL, omit
				2981	lgkmcnt(0).
				2982	- Could be split into
				2983	separate s_waitcnt
				2984	vmcnt(0) and
				2985	s_waitcnt
				2986	lgkmcnt(0) to allow
				2987	them to be
				2988	independently moved
				2989	according to the
				2990	following rules.
				2991	- s_waitcnt vmcnt(0)
				2992	must happen after
				2993	any preceding
				2994	global/generic
				2995	load/store/load
				2996	atomic/store
				2997	atomic/atomicrmw.
				2998	- s_waitcnt lgkmcnt(0)
				2999	must happen after
				3000	any preceding
				3001	local/generic
				3002	load/store/load
				3003	atomic/store
				3004	atomic/atomicrmw.
				3005	- Must happen before
				3006	the following
				3007	atomicrmw.
				3008	- Ensures that all
				3009	memory operations
				3010	to global and local
				3011	have completed
				3012	before performing
				3013	the atomicrmw that
				3014	is being released.
				3015
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3016	2. buffer/global/ds/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3017	fence release - singlethread none none
				3018	- wavefront
				3019	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				3020
				3021	- If OpenCL and
				3022	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3023	not generic, omit.
				3024	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3025	currently has no
				3026	address space on
				3027	the fence need to
				3028	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3029	always generate. If
				3030	fence had an
				3031	address space then
				3032	set to address
				3033	space of OpenCL
				3034	fence flag, or to
				3035	generic if both
				3036	local and global
				3037	flags are
				3038	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3039	- Must happen after
				3040	any preceding
				3041	local/generic
				3042	load/load
				3043	atomic/store/store
				3044	atomic/atomicrmw.
				3045	- Must happen before
				3046	any following store
				3047	atomic/atomicrmw
				3048	with an equal or
				3049	wider sync scope
				3050	and memory ordering
				3051	stronger than
				3052	unordered (this is
				3053	termed the
				3054	fence-paired-atomic).
				3055	- Ensures that all
				3056	memory operations
				3057	to local have
				3058	completed before
				3059	performing the
				3060	following
				3061	fence-paired-atomic.
				3062
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3063	fence release - agent none 1. s_waitcnt lgkmcnt(0) &
				3064	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3065
				3066	- If OpenCL and
				3067	address space is
				3068	not generic, omit
				3069	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3070	- If OpenCL and
				3071	address space is
				3072	local, omit
				3073	vmcnt(0).
				3074	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3075	currently has no
				3076	address space on
				3077	the fence need to
				3078	conservatively
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3079	always generate. If
				3080	fence had an
				3081	address space then
				3082	set to address
				3083	space of OpenCL
				3084	fence flag, or to
				3085	generic if both
				3086	local and global
				3087	flags are
				3088	specified.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3089	- Could be split into
				3090	separate s_waitcnt
				3091	vmcnt(0) and
				3092	s_waitcnt
				3093	lgkmcnt(0) to allow
				3094	them to be
				3095	independently moved
				3096	according to the
				3097	following rules.
				3098	- s_waitcnt vmcnt(0)
				3099	must happen after
				3100	any preceding
				3101	global/generic
				3102	load/store/load
				3103	atomic/store
				3104	atomic/atomicrmw.
				3105	- s_waitcnt lgkmcnt(0)
				3106	must happen after
				3107	any preceding
				3108	local/generic
				3109	load/store/load
				3110	atomic/store
				3111	atomic/atomicrmw.
				3112	- Must happen before
				3113	any following store
				3114	atomic/atomicrmw
				3115	with an equal or
				3116	wider sync scope
				3117	and memory ordering
				3118	stronger than
				3119	unordered (this is
				3120	termed the
				3121	fence-paired-atomic).
				3122	- Ensures that all
				3123	memory operations
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3124	have
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3125	completed before
				3126	performing the
				3127	following
				3128	fence-paired-atomic.
				3129
				3130	Acquire-Release Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3131	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3132	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				3133	- wavefront - local
				3134	- generic
				3135	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3136
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3137	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3138	- Must happen after
				3139	any preceding
				3140	local/generic
				3141	load/store/load
				3142	atomic/store
				3143	atomic/atomicrmw.
				3144	- Must happen before
				3145	the following
				3146	atomicrmw.
				3147	- Ensures that all
				3148	memory operations
				3149	to local have
				3150	completed before
				3151	performing the
				3152	atomicrmw that is
				3153	being released.
				3154
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3155	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3156	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				3157	2. s_waitcnt lgkmcnt(0)
				3158
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3159	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3160	- Must happen before
				3161	any following
				3162	global/generic
				3163	load/load
				3164	atomic/store/store
				3165	atomic/atomicrmw.
				3166	- Ensures any
				3167	following global
				3168	data read is no
				3169	older than the load
				3170	atomic value being
				3171	acquired.
				3172
				3173	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				3174
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3175	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3176	- Must happen after
				3177	any preceding
				3178	local/generic
				3179	load/store/load
				3180	atomic/store
				3181	atomic/atomicrmw.
				3182	- Must happen before
				3183	the following
				3184	atomicrmw.
				3185	- Ensures that all
				3186	memory operations
				3187	to local have
				3188	completed before
				3189	performing the
				3190	atomicrmw that is
				3191	being released.
				3192
				3193	2. flat_atomic
				3194	3. s_waitcnt lgkmcnt(0)
				3195
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3196	- If OpenCL, omit.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3197	- Must happen before
				3198	any following
				3199	global/generic
				3200	load/load
				3201	atomic/store/store
				3202	atomic/atomicrmw.
				3203	- Ensures any
				3204	following global
				3205	data read is no
				3206	older than the load
				3207	atomic value being
				3208	acquired.
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3209
				3210	atomicrmw acq_rel - agent - global 1. s_waitcnt lgkmcnt(0) &
				3211	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3212
				3213	- If OpenCL, omit
				3214	lgkmcnt(0).
				3215	- Could be split into
				3216	separate s_waitcnt
				3217	vmcnt(0) and
				3218	s_waitcnt
				3219	lgkmcnt(0) to allow
				3220	them to be
				3221	independently moved
				3222	according to the
				3223	following rules.
				3224	- s_waitcnt vmcnt(0)
				3225	must happen after
				3226	any preceding
				3227	global/generic
				3228	load/store/load
				3229	atomic/store
				3230	atomic/atomicrmw.
				3231	- s_waitcnt lgkmcnt(0)
				3232	must happen after
				3233	any preceding
				3234	local/generic
				3235	load/store/load
				3236	atomic/store
				3237	atomic/atomicrmw.
				3238	- Must happen before
				3239	the following
				3240	atomicrmw.
				3241	- Ensures that all
				3242	memory operations
				3243	to global have
				3244	completed before
				3245	performing the
				3246	atomicrmw that is
				3247	being released.
				3248
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3249	2. buffer/global/flat_atomic
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3250	3. s_waitcnt vmcnt(0)
				3251
				3252	- Must happen before
				3253	following
				3254	buffer_wbinvl1_vol.
				3255	- Ensures the
				3256	atomicrmw has
				3257	completed before
				3258	invalidating the
				3259	cache.
				3260
				3261	4. buffer_wbinvl1_vol
				3262
				3263	- Must happen before
				3264	any following
				3265	global/generic
				3266	load/load
				3267	atomic/atomicrmw.
				3268	- Ensures that
				3269	following loads
				3270	will not see stale
				3271	global data.
				3272
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3273	atomicrmw acq_rel - agent - generic 1. s_waitcnt lgkmcnt(0) &
				3274	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3275
				3276	- If OpenCL, omit
				3277	lgkmcnt(0).
				3278	- Could be split into
				3279	separate s_waitcnt
				3280	vmcnt(0) and
				3281	s_waitcnt
				3282	lgkmcnt(0) to allow
				3283	them to be
				3284	independently moved
				3285	according to the
				3286	following rules.
				3287	- s_waitcnt vmcnt(0)
				3288	must happen after
				3289	any preceding
				3290	global/generic
				3291	load/store/load
				3292	atomic/store
				3293	atomic/atomicrmw.
				3294	- s_waitcnt lgkmcnt(0)
				3295	must happen after
				3296	any preceding
				3297	local/generic
				3298	load/store/load
				3299	atomic/store
				3300	atomic/atomicrmw.
				3301	- Must happen before
				3302	the following
				3303	atomicrmw.
				3304	- Ensures that all
				3305	memory operations
				3306	to global have
				3307	completed before
				3308	performing the
				3309	atomicrmw that is
				3310	being released.
				3311
				3312	2. flat_atomic
				3313	3. s_waitcnt vmcnt(0) &
				3314	lgkmcnt(0)
				3315
				3316	- If OpenCL, omit
				3317	lgkmcnt(0).
				3318	- Must happen before
				3319	following
				3320	buffer_wbinvl1_vol.
				3321	- Ensures the
				3322	atomicrmw has
				3323	completed before
				3324	invalidating the
				3325	cache.
				3326
				3327	4. buffer_wbinvl1_vol
				3328
				3329	- Must happen before
				3330	any following
				3331	global/generic
				3332	load/load
				3333	atomic/atomicrmw.
				3334	- Ensures that
				3335	following loads
				3336	will not see stale
				3337	global data.
				3338
				3339	fence acq_rel - singlethread none none
				3340	- wavefront
				3341	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3342
				3343	- If OpenCL and
				3344	address space is
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3345	not generic, omit.
				3346	- However,
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3347	since LLVM
				3348	currently has no
				3349	address space on
				3350	the fence need to
				3351	conservatively
				3352	always generate
				3353	(see comment for
				3354	previous fence).
				3355	- Must happen after
				3356	any preceding
				3357	local/generic
				3358	load/load
				3359	atomic/store/store
				3360	atomic/atomicrmw.
				3361	- Must happen before
				3362	any following
				3363	global/generic
				3364	load/load
				3365	atomic/store/store
				3366	atomic/atomicrmw.
				3367	- Ensures that all
				3368	memory operations
				3369	to local have
				3370	completed before
				3371	performing any
				3372	following global
				3373	memory operations.
				3374	- Ensures that the
				3375	preceding
				3376	local/generic load
				3377	atomic/atomicrmw
				3378	with an equal or
				3379	wider sync scope
				3380	and memory ordering
				3381	stronger than
				3382	unordered (this is
				3383	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3384	acquire-fence-paired-atomic
				3385	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3386	before following
				3387	global memory
				3388	operations. This
				3389	satisfies the
				3390	requirements of
				3391	acquire.
				3392	- Ensures that all
				3393	previous memory
				3394	operations have
				3395	completed before a
				3396	following
				3397	local/generic store
				3398	atomic/atomicrmw
				3399	with an equal or
				3400	wider sync scope
				3401	and memory ordering
				3402	stronger than
				3403	unordered (this is
				3404	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3405	release-fence-paired-atomic
				3406	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3407	requirements of
				3408	release.
				3409
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3410	fence acq_rel - agent none 1. s_waitcnt lgkmcnt(0) &
				3411	- system vmcnt(0)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3412
				3413	- If OpenCL and
				3414	address space is
				3415	not generic, omit
				3416	lgkmcnt(0).
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3417	- However, since LLVM
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3418	currently has no
				3419	address space on
				3420	the fence need to
				3421	conservatively
				3422	always generate
				3423	(see comment for
				3424	previous fence).
				3425	- Could be split into
				3426	separate s_waitcnt
				3427	vmcnt(0) and
				3428	s_waitcnt
				3429	lgkmcnt(0) to allow
				3430	them to be
				3431	independently moved
				3432	according to the
				3433	following rules.
				3434	- s_waitcnt vmcnt(0)
				3435	must happen after
				3436	any preceding
				3437	global/generic
				3438	load/store/load
				3439	atomic/store
				3440	atomic/atomicrmw.
				3441	- s_waitcnt lgkmcnt(0)
				3442	must happen after
				3443	any preceding
				3444	local/generic
				3445	load/store/load
				3446	atomic/store
				3447	atomic/atomicrmw.
				3448	- Must happen before
				3449	the following
				3450	buffer_wbinvl1_vol.
				3451	- Ensures that the
				3452	preceding
				3453	global/local/generic
				3454	load
				3455	atomic/atomicrmw
				3456	with an equal or
				3457	wider sync scope
				3458	and memory ordering
				3459	stronger than
				3460	unordered (this is
				3461	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3462	acquire-fence-paired-atomic
				3463	) has completed
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3464	before invalidating
				3465	the cache. This
				3466	satisfies the
				3467	requirements of
				3468	acquire.
				3469	- Ensures that all
				3470	previous memory
				3471	operations have
				3472	completed before a
				3473	following
				3474	global/local/generic
				3475	store
				3476	atomic/atomicrmw
				3477	with an equal or
				3478	wider sync scope
				3479	and memory ordering
				3480	stronger than
				3481	unordered (this is
				3482	termed the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3483	release-fence-paired-atomic
				3484	). This satisfies the
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3485	requirements of
				3486	release.
				3487
				3488	2. buffer_wbinvl1_vol
				3489
				3490	- Must happen before
				3491	any following
				3492	global/generic
				3493	load/load
				3494	atomic/store/store
				3495	atomic/atomicrmw.
				3496	- Ensures that
				3497	following loads
				3498	will not see stale
				3499	global data. This
				3500	satisfies the
				3501	requirements of
				3502	acquire.
				3503
				3504	Sequential Consistent Atomic
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3505	-----------------------------------------------------------------------------------
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3506	load atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3507	- wavefront - local load atomic acquire,
				3508	- generic except must generated
				3509	all instructions even
				3510	for OpenCL.*
				3511	load atomic seq_cst - workgroup - global 1. s_waitcnt lgkmcnt(0)
				3512	- generic
				3513	- Must
				3514	happen after
				3515	preceding
				3516	global/generic load
				3517	atomic/store
				3518	atomic/atomicrmw
				3519	with memory
				3520	ordering of seq_cst
				3521	and with equal or
				3522	wider sync scope.
				3523	(Note that seq_cst
				3524	fences have their
				3525	own s_waitcnt
				3526	lgkmcnt(0) and so do
				3527	not need to be
				3528	considered.)
				3529	- Ensures any
				3530	preceding
				3531	sequential
				3532	consistent local
				3533	memory instructions
				3534	have completed
				3535	before executing
				3536	this sequentially
				3537	consistent
				3538	instruction. This
				3539	prevents reordering
				3540	a seq_cst store
				3541	followed by a
				3542	seq_cst load. (Note
				3543	that seq_cst is
				3544	stronger than
				3545	acquire/release as
				3546	the reordering of
				3547	load acquire
				3548	followed by a store
				3549	release is
				3550	prevented by the
				3551	waitcnt of
				3552	the release, but
				3553	there is nothing
				3554	preventing a store
				3555	release followed by
				3556	load acquire from
				3557	competing out of
				3558	order.)
				3559
				3560	2. *Following
				3561	instructions same as
				3562	corresponding load
				3563	atomic acquire,
				3564	except must generated
				3565	all instructions even
				3566	for OpenCL.*
				3567	load atomic seq_cst - workgroup - local *Same as corresponding
				3568	load atomic acquire,
				3569	except must generated
				3570	all instructions even
				3571	for OpenCL.*
				3572	load atomic seq_cst - agent - global 1. s_waitcnt lgkmcnt(0) &
				3573	- system - generic vmcnt(0)
				3574
				3575	- Could be split into
				3576	separate s_waitcnt
				3577	vmcnt(0)
				3578	and s_waitcnt
				3579	lgkmcnt(0) to allow
				3580	them to be
				3581	independently moved
				3582	according to the
				3583	following rules.
				3584	- waitcnt lgkmcnt(0)
				3585	must happen after
				3586	preceding
				3587	global/generic load
				3588	atomic/store
				3589	atomic/atomicrmw
				3590	with memory
				3591	ordering of seq_cst
				3592	and with equal or
				3593	wider sync scope.
				3594	(Note that seq_cst
				3595	fences have their
				3596	own s_waitcnt
				3597	lgkmcnt(0) and so do
				3598	not need to be
				3599	considered.)
				3600	- waitcnt vmcnt(0)
				3601	must happen after
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3602	preceding
				3603	global/generic load
				3604	atomic/store
				3605	atomic/atomicrmw
				3606	with memory
				3607	ordering of seq_cst
				3608	and with equal or
				3609	wider sync scope.
				3610	(Note that seq_cst
				3611	fences have their
				3612	own s_waitcnt
				3613	vmcnt(0) and so do
				3614	not need to be
				3615	considered.)
				3616	- Ensures any
				3617	preceding
				3618	sequential
				3619	consistent global
				3620	memory instructions
				3621	have completed
				3622	before executing
				3623	this sequentially
				3624	consistent
				3625	instruction. This
				3626	prevents reordering
				3627	a seq_cst store
				3628	followed by a
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3629	seq_cst load. (Note
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3630	that seq_cst is
				3631	stronger than
				3632	acquire/release as
				3633	the reordering of
				3634	load acquire
				3635	followed by a store
				3636	release is
				3637	prevented by the
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3638	waitcnt of
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3639	the release, but
				3640	there is nothing
				3641	preventing a store
				3642	release followed by
				3643	load acquire from
				3644	competing out of
				3645	order.)
				3646
				3647	2. *Following
				3648	instructions same as
				3649	corresponding load
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3650	atomic acquire,
				3651	except must generated
				3652	all instructions even
				3653	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3654	store atomic seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3655	- wavefront - local store atomic release,
				3656	- workgroup - generic except must generated
				3657	all instructions even
				3658	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3659	store atomic seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3660	- system - generic store atomic release,
				3661	except must generated
				3662	all instructions even
				3663	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3664	atomicrmw seq_cst - singlethread - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3665	- wavefront - local atomicrmw acq_rel,
				3666	- workgroup - generic except must generated
				3667	all instructions even
				3668	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3669	atomicrmw seq_cst - agent - global *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3670	- system - generic atomicrmw acq_rel,
				3671	except must generated
				3672	all instructions even
				3673	for OpenCL.*
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3674	fence seq_cst - singlethread none *Same as corresponding
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	3675	- wavefront fence acq_rel,
				3676	- workgroup except must generated
				3677	- agent all instructions even
				3678	- system for OpenCL.*
				3679	============ ============ ============== ========== ===============================
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3680
				3681	The memory order also adds the single thread optimization constrains defined in
				3682	table
				3683	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3684
				3685	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3686	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3687
				3688	============ ==============================================================
				3689	LLVM Memory Optimization Constraints
				3690	Ordering
				3691	============ ==============================================================
				3692	unordered none
				3693	monotonic none
				3694	acquire - If a load atomic/atomicrmw then no following load/load
				3695	atomic/store/ store atomic/atomicrmw/fence instruction can
				3696	be moved before the acquire.
				3697	- If a fence then same as load atomic, plus no preceding
				3698	associated fence-paired-atomic can be moved after the fence.
Sylvestre Ledru	e3fdbae	2017-06-26 02:45:39 +0000	[diff] [blame]	3699	release - If a store atomic/atomicrmw then no preceding load/load
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3700	atomic/store/ store atomic/atomicrmw/fence instruction can
				3701	be moved after the release.
				3702	- If a fence then same as store atomic, plus no following
				3703	associated fence-paired-atomic can be moved before the
				3704	fence.
				3705	acq_rel Same constraints as both acquire and release.
				3706	seq_cst - If a load atomic then same constraints as acquire, plus no
				3707	preceding sequentially consistent load atomic/store
				3708	atomic/atomicrmw/fence instruction can be moved after the
				3709	seq_cst.
				3710	- If a store atomic then the same constraints as release, plus
				3711	no following sequentially consistent load atomic/store
				3712	atomic/atomicrmw/fence instruction can be moved before the
				3713	seq_cst.
				3714	- If an atomicrmw/fence then same constraints as acq_rel.
				3715	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3716
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3717	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3718	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3719
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3720	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3721	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3722	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3723
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3724	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3725	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3726
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3727	=================== =============== =============== =======================
				3728	Usage Code Sequence Trap Handler Description
				3729	Inputs
				3730	=================== =============== =============== =======================
				3731	reserved ``s_trap 0x00`` Reserved by hardware.
				3732	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3733	``queue_ptr`` ``debugtrap``
				3734	``VGPR0``: intrinsic (not
				3735	``arg`` implemented).
				3736	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3737	``queue_ptr`` terminated and its
				3738	associated queue put
				3739	into the error state.
				3740	``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
				3741	``queue_ptr`` installed handled
				3742	same as ``llvm.trap``.
				3743	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
				3744	breakpoints.
				3745	debugger ``s_trap 0x08`` Reserved for debugger.
				3746	debugger ``s_trap 0xfe`` Reserved for debugger.
				3747	debugger ``s_trap 0xff`` Reserved for debugger.
				3748	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3749
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3750	Unspecified OS
				3751	--------------
				3752
				3753	This section provides code conventions used when the target triple OS is
				3754	empty (see :ref:`amdgpu-target-triples`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3755
				3756	Trap Handler ABI
				3757	~~~~~~~~~~~~~~~~
				3758
				3759	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3760	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3761	instructions are handled as follows:
				3762
				3763	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3764	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3765
				3766	=============== =============== ===========================================
				3767	Usage Code Sequence Description
				3768	=============== =============== ===========================================
				3769	llvm.trap s_endpgm Causes wavefront to be terminated.
				3770	llvm.debugtrap none Compiler warning given that there is no
				3771	trap handler installed.
				3772	=============== =============== ===========================================
				3773
				3774	Source Languages
				3775	================
				3776
				3777	.. _amdgpu-opencl:
				3778
				3779	OpenCL
				3780	------
				3781
				3782	When generating code for the OpenCL language the target triple environment
				3783	should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
				3784
				3785	When the language is OpenCL the following differences occur:
				3786
				3787	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3788	2. The AMDGPU backend adds additional arguments to the kernel.
Tony Tye	46d3576	2017-08-15 20:47:41 +0000	[diff] [blame]	3789	3. Additional metadata is generated
				3790	(:ref:`amdgpu-amdhsa-hsa-code-object-metadata`).
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3791
				3792	.. TODO
				3793	Specify what affect this has. Hidden arguments added. Additional metadata
				3794	generated.
				3795
				3796	.. _amdgpu-hcc:
				3797
				3798	HCC
				3799	---
				3800
				3801	When generating code for the OpenCL language the target triple environment
				3802	should be ``hcc`` (see :ref:`amdgpu-target-triples`).
				3803
				3804	When the language is OpenCL the following differences occur:
				3805
				3806	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3807
				3808	.. TODO
				3809	Specify what affect this has.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	3810
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3811	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3812	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3813
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3814	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f59d071	2017-11-10 20:51:43 +0000	[diff] [blame]	3815	It supports AMDGCN GFX6-GFX9.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3816
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3817	This section describes general syntax for instructions and operands. For more
				3818	information about instructions, their semantics and supported combinations of
				3819	operands, refer to one of instruction set architecture manuals
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	3820	[AMD-GCN-GFX6]_, [AMD-GCN-GFX7]_, [AMD-GCN-GFX8]_ and [AMD-GCN-GFX9]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3821
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3822	An instruction has the following syntax (register operands are normally
				3823	comma-separated while extra operands are space-separated):
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3824
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3825	<opcode> <register_operand0>, ... <extra_operand0> ...
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3826
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3827	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3828	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3829
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3830	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3831
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3832	* SGPR registers: s0, ... or s[0], ...
				3833	* VGPR registers: v0, ... or v[0], ...
				3834	* TTMP registers: ttmp0, ... or ttmp[0], ...
				3835	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				3836	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				3837	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				3838	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				3839	* Register index expressions: v[2*2], s[1-1:2-1]
				3840	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3841
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3842	The following extra operands are supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3843
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3844	* offset, offset0, offset1
				3845	* idxen, offen bits
				3846	* glc, slc, tfe bits
				3847	* waitcnt: integer or combination of counter values
				3848	* VOP3 modifiers:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3849
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3850	- abs (\\| \\|), neg (\-)
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3851
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3852	* DPP modifiers:
				3853
				3854	- row_shl, row_shr, row_ror, row_rol
				3855	- row_mirror, row_half_mirror, row_bcast
				3856	- wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
				3857	- row_mask, bank_mask, bound_ctrl
				3858
				3859	* SDWA modifiers:
				3860
				3861	- dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
				3862	- dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
				3863	- abs, neg, sext
				3864
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3865	Instruction Examples
				3866	~~~~~~~~~~~~~~~~~~~~
				3867
				3868	DS
				3869	~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3870
				3871	.. code-block:: nasm
				3872
				3873	ds_add_u32 v2, v4 offset:16
				3874	ds_write_src2_b64 v2 offset0:4 offset1:8
				3875	ds_cmpst_f32 v2, v4, v6
				3876	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				3877
				3878
				3879	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				3880
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3881	FLAT
				3882	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3883
				3884	.. code-block:: nasm
				3885
				3886	flat_load_dword v1, v[3:4]
				3887	flat_store_dwordx3 v[3:4], v[5:7]
				3888	flat_atomic_swap v1, v[3:4], v5 glc
				3889	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				3890	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				3891
				3892	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				3893
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3894	MUBUF
				3895	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3896
				3897	.. code-block:: nasm
				3898
				3899	buffer_load_dword v1, off, s[4:7], s1
				3900	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				3901	buffer_store_format_xy v[1:2], off, s[4:7], s1
				3902	buffer_wbinvl1
				3903	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				3904
				3905	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				3906
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3907	SMRD/SMEM
				3908	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3909
				3910	.. code-block:: nasm
				3911
				3912	s_load_dword s1, s[2:3], 0xfc
				3913	s_load_dwordx8 s[8:15], s[2:3], s4
				3914	s_load_dwordx16 s[88:103], s[2:3], s4
				3915	s_dcache_inv_vol
				3916	s_memtime s[4:5]
				3917
				3918	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				3919
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3920	SOP1
				3921	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3922
				3923	.. code-block:: nasm
				3924
				3925	s_mov_b32 s1, s2
				3926	s_mov_b64 s[0:1], 0x80000000
				3927	s_cmov_b32 s1, 200
				3928	s_wqm_b64 s[2:3], s[4:5]
				3929	s_bcnt0_i32_b64 s1, s[2:3]
				3930	s_swappc_b64 s[2:3], s[4:5]
				3931	s_cbranch_join s[4:5]
				3932
				3933	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				3934
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3935	SOP2
				3936	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3937
				3938	.. code-block:: nasm
				3939
				3940	s_add_u32 s1, s2, s3
				3941	s_and_b64 s[2:3], s[4:5], s[6:7]
				3942	s_cselect_b32 s1, s2, s3
				3943	s_andn2_b32 s2, s4, s6
				3944	s_lshr_b64 s[2:3], s[4:5], s6
				3945	s_ashr_i32 s2, s4, s6
				3946	s_bfm_b64 s[2:3], s4, s6
				3947	s_bfe_i64 s[2:3], s[4:5], s6
				3948	s_cbranch_g_fork s[4:5], s[6:7]
				3949
				3950	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				3951
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3952	SOPC
				3953	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3954
				3955	.. code-block:: nasm
				3956
				3957	s_cmp_eq_i32 s1, s2
				3958	s_bitcmp1_b32 s1, s2
				3959	s_bitcmp0_b64 s[2:3], s4
				3960	s_setvskip s3, s5
				3961
				3962	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				3963
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3964	SOPP
				3965	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3966
				3967	.. code-block:: nasm
				3968
				3969	s_barrier
				3970	s_nop 2
				3971	s_endpgm
				3972	s_waitcnt 0 ; Wait for all counters to be 0
				3973	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				3974	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				3975	s_sethalt 9
				3976	s_sleep 10
				3977	s_sendmsg 0x1
				3978	s_sendmsg sendmsg(MSG_INTERRUPT)
				3979	s_trap 1
				3980
				3981	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				3982
				3983	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	3984	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3985	range or acceptable values.
				3986
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3987	VALU
				3988	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3989
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3990	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				3991	the assembler will automatically use optimal encoding based on its operands.
				3992	To force specific encoding, one can add a suffix to the opcode of the instruction:
				3993
				3994	* _e32 for 32-bit VOP1/VOP2/VOPC
				3995	* _e64 for 64-bit VOP3
				3996	* _dpp for VOP_DPP
				3997	* _sdwa for VOP_SDWA
				3998
				3999	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4000
				4001	.. code-block:: nasm
				4002
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4003	v_mov_b32 v1, v2
				4004	v_mov_b32_e32 v1, v2
				4005	v_nop
				4006	v_cvt_f64_i32_e32 v[1:2], v2
				4007	v_floor_f32_e32 v1, v2
				4008	v_bfrev_b32_e32 v1, v2
				4009	v_add_f32_e32 v1, v2, v3
				4010	v_mul_i32_i24_e64 v1, v2, 3
				4011	v_mul_i32_i24_e32 v1, -3, v3
				4012	v_mul_i32_i24_e32 v1, -100, v3
				4013	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				4014	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4015
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4016	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	4017
				4018	.. code-block:: nasm
				4019
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4020	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				4021	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4022	v_mov_b32 v0, v0 wave_shl:1
				4023	v_mov_b32 v0, v0 row_mirror
				4024	v_mov_b32 v0, v0 row_bcast:31
				4025	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4026	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				4027	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4028
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4029	VOP_SDWA examples:
				4030
				4031	.. code-block:: nasm
				4032
				4033	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				4034	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				4035	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				4036	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				4037	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				4038
				4039	For full list of supported instructions, refer to "Vector ALU instructions".
				4040
				4041	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4042	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4043
				4044	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				4045	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4046
				4047	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4048	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4049
				4050	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4051	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4052
				4053	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4054	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				4055
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4056
				4057	major, minor, and stepping are all integers that describe the instruction
				4058	set architecture (ISA) version of the assembly program.
				4059
				4060	vendor and arch are quoted strings. vendor should always be equal to
				4061	"AMD" and arch should always be equal to "AMDGPU".
				4062
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4063	By default, the assembler will derive the ISA version, vendor, and arch
				4064	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	4065
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4066	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4067	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4068
				4069	This directives specifies that the symbol with given name is a kernel entry point
				4070	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4071
				4072	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4073	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4074
				4075	This directive marks the beginning of a list of key / value pairs that are used
				4076	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				4077	The list must be terminated by the .end_amd_kernel_code_t directive. For
				4078	any amd_kernel_code_t values that are unspecified a default value will be
				4079	used. The default value for all keys is 0, with the following exceptions:
				4080
				4081	- kernel_code_version_major defaults to 1.
				4082	- machine_kind defaults to 1.
				4083	- machine_version_major, machine_version_minor, and
				4084	machine_version_stepping are derived from the value of the -mcpu option
				4085	that is passed to the assembler.
				4086	- kernel_code_entry_byte_offset defaults to 256.
				4087	- wavefront_size defaults to 6.
				4088	- kernarg_segment_alignment, group_segment_alignment, and
Tony Tye	6baa6d2	2017-10-18 22:16:55 +0000	[diff] [blame]	4089	private_segment_alignment default to 4. Note that alignments are specified
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4090	as a power of two, so a value of n means an alignment of 2^ n.
				4091
				4092	The .amd_kernel_code_t directive must be placed immediately after the
				4093	function label and before any instructions.
				4094
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	4095	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				4096	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4097
				4098	Here is an example of a minimal amd_kernel_code_t specification:
				4099
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	4100	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4101
				4102	.hsa_code_object_version 1,0
				4103	.hsa_code_object_isa
				4104
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4105	.hsatext
				4106	.globl hello_world
				4107	.p2align 8
				4108	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4109
				4110	hello_world:
				4111
				4112	.amd_kernel_code_t
				4113	enable_sgpr_kernarg_segment_ptr = 1
				4114	is_ptr64 = 1
				4115	compute_pgm_rsrc1_vgprs = 0
				4116	compute_pgm_rsrc1_sgprs = 0
				4117	compute_pgm_rsrc2_user_sgpr = 2
				4118	kernarg_segment_byte_size = 8
				4119	wavefront_sgpr_count = 2
				4120	workitem_vgpr_count = 3
				4121	.end_amd_kernel_code_t
				4122
				4123	s_load_dwordx2 s[0:1], s[0:1] 0x0
				4124	v_mov_b32 v0, 3.14159
				4125	s_waitcnt lgkmcnt(0)
				4126	v_mov_b32 v1, s0
				4127	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4128	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	4129	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	4130	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	4131	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4132
				4133	Additional Documentation
				4134	========================
				4135
Konstantin Zhuravlyov	265d253	2017-10-18 17:59:20 +0000	[diff] [blame]	4136	.. [AMD-RADEON-HD-2000-3000] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				4137	.. [AMD-RADEON-HD-4000] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				4138	.. [AMD-RADEON-HD-5000] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				4139	.. [AMD-RADEON-HD-6000] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				4140	.. [AMD-GCN-GFX6] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				4141	.. [AMD-GCN-GFX7] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				4142	.. [AMD-GCN-GFX8] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
				4143	.. [AMD-GCN-GFX9] `AMD "Vega" Instruction Set Architecture <http://developer.amd.com/wordpress/media/2013/12/Vega_Shader_ISA_28July2017.pdf>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4144	.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
				4145	.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
				4146	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				4147	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				4148	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				4149	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				4150	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
Konstantin Zhuravlyov	ea35e46	2017-10-19 17:12:55 +0000	[diff] [blame]	4151	.. [YAML] `YAML Ain't Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	4152	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				4153	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
				4154	.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__