Blame - llvm/docs/AMDGPUUsage.rst - toolchain/llvm-project - Gitiles

blob: f6dcfc6b579ff08e52c0911b9244f7474bba9ed0 [file] [log] [blame]

Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	1	=============================
				2	User Guide for AMDGPU Backend
				3	=============================
				4
				5	.. contents::
				6	:local:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	7
				8	Introduction
				9	============
				10
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	11	The AMDGPU backend provides ISA code generation for AMD GPUs, starting with the
				12	R600 family up until the current GCN families. It lives in the
				13	``lib/Target/AMDGPU`` directory.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	14
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	15	LLVM
				16	====
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	17
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	18	.. _amdgpu-target-triples:
				19
				20	Target Triples
				21	--------------
				22
				23	Use the ``clang -target <Architecture>-<Vendor>-<OS>-<Environment>`` option to
				24	specify the target triple:
				25
				26	.. table:: AMDGPU Target Triples
				27	:name: amdgpu-target-triples-table
				28
				29	============ ======== ========= ===========
				30	Architecture Vendor OS Environment
				31	============ ======== ========= ===========
				32	r600 amd <empty> <empty>
				33	amdgcn amd <empty> <empty>
				34	amdgcn amd amdhsa <empty>
				35	amdgcn amd amdhsa opencl
				36	amdgcn amd amdhsa amdgizcl
				37	amdgcn amd amdhsa amdgiz
				38	amdgcn amd amdhsa hcc
				39	============ ======== ========= ===========
				40
				41	``r600-amd--``
				42	Supports AMD GPUs HD2XXX-HD6XXX for graphics and compute shaders executed on
				43	the MESA runtime.
				44
				45	``amdgcn-amd--``
				46	Supports AMD GPUs GCN 6 onwards for graphics and compute shaders executed on
				47	the MESA runtime.
				48
				49	``amdgcn-amd-amdhsa-``
				50	Supports AMD GCN GPUs GFX6 onwards for compute kernels executed on HSA [HSA]_
				51	compatible runtimes such as AMD's ROCm [AMD-ROCm]_.
				52
				53	``amdgcn-amd-amdhsa-opencl``
				54	Supports AMD GCN GPUs GFX6 onwards for OpenCL compute kernels executed on HSA
				55	[HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
				56	:ref:`amdgpu-opencl`.
				57
				58	``amdgcn-amd-amdhsa-amdgizcl``
				59	Same as ``amdgcn-amd-amdhsa-opencl`` except a different address space mapping
				60	is used (see :ref:`amdgpu-address-spaces`).
				61
				62	``amdgcn-amd-amdhsa-amdgiz``
				63	Same as ``amdgcn-amd-amdhsa-`` except a different address space mapping is
				64	used (see :ref:`amdgpu-address-spaces`).
				65
				66	``amdgcn-amd-amdhsa-hcc``
				67	Supports AMD GCN GPUs GFX6 onwards for AMD HC language compute kernels
				68	executed on HSA [HSA]_ compatible runtimes such as AMD's ROCm [AMD-ROCm]_. See
				69	:ref:`amdgpu-hcc`.
				70
				71	.. _amdgpu-processors:
				72
				73	Processors
				74	----------
				75
				76	Use the ``clang -mcpu <Processor>`` option to specify the AMD GPU processor. The
				77	names from both the Processor and Alternative Processor can be used.
				78
				79	.. table:: AMDGPU Processors
				80	:name: amdgpu-processors-table
				81
				82	========== =========== ============ ===== ======= ==================
				83	Processor Alternative Target dGPU/ Runtime Example
				84	Processor Triple APU Support Products
				85	Architecture
				86	========== =========== ============ ===== ======= ==================
				87	R600 [AMD-R6xx]_
				88	--------------------------------------------------------------------
				89	r600 r600 dGPU
				90	r630 r600 dGPU
				91	rs880 r600 dGPU
				92	rv670 r600 dGPU
				93	R700 [AMD-R7xx]_
				94	--------------------------------------------------------------------
				95	rv710 r600 dGPU
				96	rv730 r600 dGPU
				97	rv770 r600 dGPU
				98	Evergreen [AMD-Evergreen]_
				99	--------------------------------------------------------------------
				100	cedar r600 dGPU
				101	redwood r600 dGPU
				102	sumo r600 dGPU
				103	juniper r600 dGPU
				104	cypress r600 dGPU
				105	Northern Islands [AMD-Cayman-Trinity]_
				106	--------------------------------------------------------------------
				107	barts r600 dGPU
				108	turks r600 dGPU
				109	caicos r600 dGPU
				110	cayman r600 dGPU
				111	GCN GFX6 (Southern Islands (SI)) [AMD-Souther-Islands]_
				112	--------------------------------------------------------------------
				113	gfx600 - SI amdgcn dGPU
				114	- tahiti
				115	gfx601 - pitcairn amdgcn dGPU
				116	- verde
				117	- oland
				118	- hainan
				119	GCN GFX7 (Sea Islands (CI)) [AMD-Sea-Islands]_
				120	--------------------------------------------------------------------
				121	gfx700 - bonaire amdgcn dGPU - Radeon HD 7790
				122	- Radeon HD 8770
				123	- R7 260
				124	- R7 260X
				125	\ - kaveri amdgcn APU - A6-7000
				126	- A6 Pro-7050B
				127	- A8-7100
				128	- A8 Pro-7150B
				129	- A10-7300
				130	- A10 Pro-7350B
				131	- FX-7500
				132	- A8-7200P
				133	- A10-7400P
				134	- FX-7600P
				135	gfx701 - hawaii amdgcn dGPU ROCm - FirePro W8100
				136	- FirePro W9100
				137	- FirePro S9150
				138	- FirePro S9170
				139	\ dGPU ROCm - Radeon R9 290
				140	- Radeon R9 290x
				141	- Radeon R390
				142	- Radeonb R390x
				143	gfx702 - kabini amdgcn APU - E1-2100
				144	- mullins - E1-2200
				145	- E1-2500
				146	- E2-3000
				147	- E2-3800
				148	- A4-5000
				149	- A4-5100
				150	- A6-5200
				151	- A4 Pro-3340B
				152	GCN GFX8 (Volcanic Islands (VI)) [AMD-Volcanic-Islands]_
				153	--------------------------------------------------------------------
				154	gfx800 - iceland amdgcn dGPU - FirePro S7150
				155	- FirePro S7100
				156	- FirePro W7100
				157	- Radeon R285
				158	- Radeon R9 380
				159	- Radeon R9 385
				160	- Mobile FirePro
				161	M7170
				162	gfx801 - carrizo amdgcn APU - A6-8500P
				163	- Pro A6-8500B
				164	- A8-8600P
				165	- Pro A8-8600B
				166	- FX-8800P
				167	- Pro A12-8800B
				168	\ amdgcn APU ROCm - A10-8700P
				169	- Pro A10-8700B
				170	- A10-8780P
				171	\ amdgcn APU - A10-9600P
				172	- A10-9630P
				173	- A12-9700P
				174	- A12-9730P
				175	- FX-9800P
				176	- FX-9830P
				177	\ amdgcn APU - E2-9010
				178	- A6-9210
				179	- A9-9410
				180	gfx802 - tonga amdgcn dGPU ROCm Same as gfx800
				181	gfx803 - fiji amdgcn dGPU ROCm - Radeon R9 Nano
				182	- Radeon R9 Fury
				183	- Radeon R9 FuryX
				184	- Radeon Pro Duo
				185	- FirePro S9300x2
				186	\ - polaris10 amdgcn dGPU ROCm - Radeon RX 470
				187	- Radeon RX 480
				188	\ - polaris11 amdgcn dGPU ROCm - Radeon RX 460
				189	gfx804 amdgcn dGPU Same as gfx803
				190	gfx810 - stoney amdgcn APU
				191	GCN GFX9
				192	--------------------------------------------------------------------
				193	gfx900 amdgcn dGPU - FirePro W9500
				194	- FirePro S9500
				195	- FirePro S9500x2
				196	gfx901 amdgcn dGPU ROCm Same as gfx900
				197	except XNACK is
				198	enabled
				199	gfx902 amdgcn APU TBA
				200
				201	.. TODO
				202	Add product
				203	names.
				204	gfx903 amdgcn APU Same as gfx902
				205	except XNACK is
				206	enabled
				207	========== =========== ============ ===== ======= ==================
				208
				209	.. _amdgpu-address-spaces:
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	210
				211	Address Spaces
				212	--------------
				213
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	214	The AMDGPU backend uses the following address space mappings.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	215
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	216	The memory space names used in the table, aside from the region memory space, is
				217	from the OpenCL standard.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	218
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	219	LLVM Address Space number is used throughout LLVM (for example, in LLVM IR).
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	220
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	221	.. table:: Address Space Mapping
				222	:name: amdgpu-address-space-mapping-table
				223
				224	================== ================= ================= ================= =================
				225	LLVM Address Space Memory Space
				226	------------------ -----------------------------------------------------------------------
				227	\ Current Default amdgiz/amdgizcl hcc Future Default
				228	================== ================= ================= ================= =================
				229	0 Private (Scratch) Generic (Flat) Generic (Flat) Generic (Flat)
				230	1 Global Global Global Global
				231	2 Constant Constant Constant Region (GDS)
				232	3 Local (group/LDS) Local (group/LDS) Local (group/LDS) Local (group/LDS)
				233	4 Generic (Flat) Region (GDS) Region (GDS) Constant
				234	5 Region (GDS) Private (Scratch) Private (Scratch) Private (Scratch)
				235	================== ================= ================= ================= =================
				236
				237	Current Default
				238	This is the current default address space mapping used for all languages
				239	except hcc. This will shortly be deprecated.
				240
				241	amdgiz/amdgizcl
				242	This is the current address space mapping used when ``amdgiz`` or ``amdgizcl``
				243	is specified as the target triple environment value.
				244
				245	hcc
				246	This is the current address space mapping used when ``hcc`` is specified as
				247	the target triple environment value.This will shortly be deprecated.
				248
				249	Future Default
				250	This will shortly be the only address space mapping for all languages using
				251	AMDGPU backend.
				252
				253	.. _amdgpu-memory-scopes:
				254
				255	Memory Scopes
				256	-------------
				257
				258	This section provides LLVM memory synchronization scopes supported by the AMDGPU
				259	backend memory model when the target triple OS is ``amdhsa`` (see
				260	:ref:`amdgpu-amdhsa-memory-model` and :ref:`amdgpu-target-triples`).
				261
				262	The memory model supported is based on the HSA memory model [HSA]_ which is
				263	based in turn on HRF-indirect with scope inclusion [HRF]_. The happens-before
				264	relation is transitive over the synchonizes-with relation independent of scope,
				265	and synchonizes-with allows the memory scope instances to be inclusive (see
				266	table :ref:`amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table`).
				267
				268	This is different to the OpenCL [OpenCL]_ memory model which does not have scope
				269	inclusion and requires the memory scopes to exactly match. However, this
				270	is conservatively correct for OpenCL.
				271
				272	.. table:: AMDHSA LLVM Sync Scopes for AMDHSA
				273	:name: amdgpu-amdhsa-llvm-sync-scopes-amdhsa-table
				274
				275	================ ==========================================================
				276	LLVM Sync Scope Description
				277	================ ==========================================================
				278	none The default: ``system``.
				279
				280	Synchronizes with, and participates in modification and
				281	seq_cst total orderings with, other operations (except
				282	image operations) for all address spaces (except private,
				283	or generic that accesses private) provided the other
				284	operation's sync scope is:
				285
				286	- ``system``.
				287	- ``agent`` and executed by a thread on the same agent.
				288	- ``workgroup`` and executed by a thread in the same
				289	workgroup.
				290	- ``wavefront`` and executed by a thread in the same
				291	wavefront.
				292
				293	``agent`` Synchronizes with, and participates in modification and
				294	seq_cst total orderings with, other operations (except
				295	image operations) for all address spaces (except private,
				296	or generic that accesses private) provided the other
				297	operation's sync scope is:
				298
				299	- ``system`` or ``agent`` and executed by a thread on the
				300	same agent.
				301	- ``workgroup`` and executed by a thread in the same
				302	workgroup.
				303	- ``wavefront`` and executed by a thread in the same
				304	wavefront.
				305
				306	``workgroup`` Synchronizes with, and participates in modification and
				307	seq_cst total orderings with, other operations (except
				308	image operations) for all address spaces (except private,
				309	or generic that accesses private) provided the other
				310	operation's sync scope is:
				311
				312	- ``system``, ``agent`` or ``workgroup`` and executed by a
				313	thread in the same workgroup.
				314	- ``wavefront`` and executed by a thread in the same
				315	wavefront.
				316
				317	``wavefront`` Synchronizes with, and participates in modification and
				318	seq_cst total orderings with, other operations (except
				319	image operations) for all address spaces (except private,
				320	or generic that accesses private) provided the other
				321	operation's sync scope is:
				322
				323	- ``system``, ``agent``, ``workgroup`` or ``wavefront``
				324	and executed by a thread in the same wavefront.
				325
				326	``singlethread`` Only synchronizes with, and participates in modification
				327	and seq_cst total orderings with, other operations (except
				328	image operations) running in the same thread for all
				329	address spaces (for example, in signal handlers).
				330	================ ==========================================================
				331
				332	AMDGPU Intrinsics
				333	-----------------
				334
				335	The AMDGPU backend implements the following intrinsics.
				336
				337	This section is WIP.
				338
				339	.. TODO
				340	List AMDGPU intrinsics
				341
				342	Code Object
				343	===========
				344
				345	The AMDGPU backend generates a standard ELF [ELF]_ relocatable code object that
				346	can be linked by ``lld`` to produce a standard ELF shared code object which can
				347	be loaded and executed on an AMDGPU target.
				348
				349	Header
				350	------
				351
				352	The AMDGPU backend uses the following ELF header:
				353
				354	.. table:: AMDGPU ELF Header
				355	:name: amdgpu-elf-header-table
				356
				357	========================== =========================
				358	Field Value
				359	========================== =========================
				360	``e_ident[EI_CLASS]`` ``ELFCLASS64``
				361	``e_ident[EI_DATA]`` ``ELFDATA2LSB``
				362	``e_ident[EI_OSABI]`` ``ELFOSABI_AMDGPU_HSA``
				363	``e_ident[EI_ABIVERSION]`` ``ELFABIVERSION_AMDGPU_HSA``
				364	``e_type`` ``ET_REL`` or ``ET_DYN``
				365	``e_machine`` ``EM_AMDGPU``
				366	``e_entry`` 0
				367	``e_flags`` 0
				368	========================== =========================
				369
				370	..
				371
				372	.. table:: AMDGPU ELF Header Enumeration Values
				373	:name: amdgpu-elf-header-enumeration-values-table
				374
				375	============================ =====
				376	Name Value
				377	============================ =====
				378	``EM_AMDGPU`` 224
				379	``ELFOSABI_AMDGPU_HSA`` 64
				380	``ELFABIVERSION_AMDGPU_HSA`` 1
				381	============================ =====
				382
				383	``e_ident[EI_CLASS]``
				384	The ELF class is always ``ELFCLASS64``. The AMDGPU backend only supports 64 bit
				385	applications.
				386
				387	``e_ident[EI_DATA]``
				388	All AMDGPU targets use ELFDATA2LSB for little-endian byte ordering.
				389
				390	``e_ident[EI_OSABI]``
				391	The AMD GPU architecture specific OS ABI of ``ELFOSABI_AMDGPU_HSA`` is used to
				392	specify that the code object conforms to the AMD HSA runtime ABI [HSA]_.
				393
				394	``e_ident[EI_ABIVERSION]``
				395	The AMD GPU architecture specific OS ABI version of
				396	``ELFABIVERSION_AMDGPU_HSA`` is used to specify the version of AMD HSA runtime
				397	ABI to which the code object conforms.
				398
				399	``e_type``
				400	Can be one of the following values:
				401
				402
				403	``ET_REL``
				404	The type produced by the AMD GPU backend compiler as it is relocatable code
				405	object.
				406
				407	``ET_DYN``
				408	The type produced by the linker as it is a shared code object.
				409
				410	The AMD HSA runtime loader requires a ``ET_DYN`` code object.
				411
				412	``e_machine``
				413	The value ``EM_AMDGPU`` is used for the machine for all members of the AMD GPU
				414	architecture family. The specific member is specified in the
				415	``NT_AMD_AMDGPU_ISA`` entry in the ``.note`` section (see
				416	:ref:`amdgpu-note-records`).
				417
				418	``e_entry``
				419	The entry point is 0 as the entry points for individual kernels must be
				420	selected in order to invoke them through AQL packets.
				421
				422	``e_flags``
				423	The value is 0 as no flags are used.
				424
				425	Sections
				426	--------
				427
				428	An AMDGPU target ELF code object has the standard ELF sections which include:
				429
				430	.. table:: AMDGPU ELF Sections
				431	:name: amdgpu-elf-sections-table
				432
				433	================== ================ =================================
				434	Name Type Attributes
				435	================== ================ =================================
				436	``.bss`` ``SHT_NOBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				437	``.data`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				438	``.debug_``\ \* ``SHT_PROGBITS`` none
				439	``.dynamic`` ``SHT_DYNAMIC`` ``SHF_ALLOC``
				440	``.dynstr`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				441	``.dynsym`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				442	``.got`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_WRITE``
				443	``.hash`` ``SHT_HASH`` ``SHF_ALLOC``
				444	``.note`` ``SHT_NOTE`` none
				445	``.rela``\ name ``SHT_RELA`` none
				446	``.rela.dyn`` ``SHT_RELA`` none
				447	``.rodata`` ``SHT_PROGBITS`` ``SHF_ALLOC``
				448	``.shstrtab`` ``SHT_STRTAB`` none
				449	``.strtab`` ``SHT_STRTAB`` none
				450	``.symtab`` ``SHT_SYMTAB`` none
				451	``.text`` ``SHT_PROGBITS`` ``SHF_ALLOC`` + ``SHF_EXECINSTR``
				452	================== ================ =================================
				453
				454	These sections have their standard meanings (see [ELF]_) and are only generated
				455	if needed.
				456
				457	``.debug``\ \*
				458	The standard DWARF sections. See :ref:`amdgpu-dwarf` for information on the
				459	DWARF produced by the AMDGPU backend.
				460
				461	``.dynamic``, ``.dynstr``, ``.dynstr``, ``.hash``
				462	The standard sections used by a dynamic loader.
				463
				464	``.note``
				465	See :ref:`amdgpu-note-records` for the note records supported by the AMDGPU
				466	backend.
				467
				468	``.rela``\ name, ``.rela.dyn``
				469	For relocatable code objects, name is the name of the section that the
				470	relocation records apply. For example, ``.rela.text`` is the section name for
				471	relocation records associated with the ``.text`` section.
				472
				473	For linked shared code objects, ``.rela.dyn`` contains all the relocation
				474	records from each of the relocatable code object's ``.rela``\ name sections.
				475
				476	See :ref:`amdgpu-relocation-records` for the relocation records supported by
				477	the AMDGPU backend.
				478
				479	``.text``
				480	The executable machine code for the kernels and functions they call. Generated
				481	as position independent code. See :ref:`amdgpu-code-conventions` for
				482	information on conventions used in the isa generation.
				483
				484	.. _amdgpu-note-records:
				485
				486	Note Records
				487	------------
				488
				489	As required by ``ELFCLASS64``, minimal zero byte padding must be generated after
				490	the ``name`` field to ensure the ``desc`` field is 4 byte aligned. In addition,
				491	minimal zero byte padding must be generated to ensure the ``desc`` field size is
				492	a multiple of 4 bytes. The ``sh_addralign`` field of the ``.note`` section must
				493	be at least 4 to indicate at least 8 byte alignment.
				494
				495	The AMDGPU backend code object uses the following ELF note records in the
				496	``.note`` section. The Description column specifies the layout of the note
				497	record’s ``desc`` field. All fields are consecutive bytes. Note records with
				498	variable size strings have a corresponding ``*_size`` field that specifies the
				499	number of bytes, including the terminating null character, in the string. The
				500	string(s) come immediately after the preceding fields.
				501
				502	Additional note records can be present.
				503
				504	.. table:: AMDGPU ELF Note Records
				505	:name: amdgpu-elf-note-records-table
				506
				507	===== ========================== ==========================================
				508	Name Type Description
				509	===== ========================== ==========================================
				510	"AMD" ``NT_AMD_AMDGPU_METADATA`` <metadata null terminated string>
				511	"AMD" ``NT_AMD_AMDGPU_ISA`` <isa name null terminated string>
				512	===== ========================== ==========================================
				513
				514	..
				515
				516	.. table:: AMDGPU ELF Note Record Enumeration Values
				517	:name: amdgpu-elf-note-record-enumeration-values-table
				518
				519	============================= =====
				520	Name Value
				521	============================= =====
				522	reserved 0-9
				523	``NT_AMD_AMDGPU_METADATA`` 10
				524	``NT_AMD_AMDGPU_ISA`` 11
				525	============================= =====
				526
				527	``NT_AMD_AMDGPU_ISA``
				528	Specifies the instruction set architecture used by the machine code contained
				529	in the code object.
				530
				531	This note record is required for code objects containing machine code for
				532	processors matching the ``amdgcn`` architecture in table
				533	:ref:`amdgpu-processors`.
				534
				535	The null terminated string has the following syntax:
				536
				537	architecture\ ``-``\ vendor\ ``-``\ os\ ``-``\ environment\ ``-``\ processor
				538
				539	where:
				540
				541	architecture
				542	The architecture from table :ref:`amdgpu-target-triples-table`.
				543
				544	This is always ``amdgcn`` when the target triple OS is ``amdhsa`` (see
				545	:ref:`amdgpu-target-triples`).
				546
				547	vendor
				548	The vendor from table :ref:`amdgpu-target-triples-table`.
				549
				550	For the AMDGPU backend this is always ``amd``.
				551
				552	os
				553	The OS from table :ref:`amdgpu-target-triples-table`.
				554
				555	environment
				556	An environment from table :ref:`amdgpu-target-triples-table`, or blank if
				557	the environment has no affect on the execution of the code object.
				558
				559	For the AMDGPU backend this is currently always blank.
				560	processor
				561	The processor from table :ref:`amdgpu-processors-table`.
				562
				563	For example:
				564
				565	``amdgcn-amd-amdhsa--gfx901``
				566
				567	``NT_AMD_AMDGPU_METADATA``
				568	Specifies extensible metadata associated with the code object. See
				569	:ref:`amdgpu-code-object-metadata` for the syntax of the code object metadata
				570	string.
				571
				572	This note record is required and must contain the minimum information
				573	necessary to support the ROCM kernel queries. For example, the segment sizes
				574	needed in a dispatch packet. In addition, a high level language runtime may
				575	require other information to be included. For example, the AMD OpenCL runtime
				576	records kernel argument information.
				577
				578	.. TODO
				579	Is the string null terminated? It probably should not if YAML allows it to
				580	contain null characters, otherwise it should be.
				581
				582	.. _amdgpu-code-object-metadata:
				583
				584	Code Object Metadata
				585	--------------------
				586
				587	The code object metadata is specified by the ``NT_AMD_AMDHSA_METADATA`` note
				588	record (see :ref:`amdgpu-note-records`).
				589
				590	The metadata is specified as a YAML formated string (see [YAML]_ and
				591	:doc:`YamlIO`).
				592
				593	The metadata is represented as a single YAML document comprised of the mapping
				594	defined in table :ref:`amdgpu-amdhsa-code-object-metadata-mapping-table` and
				595	referenced tables.
				596
				597	For boolean values, the string values of ``false`` and ``true`` are used for
				598	false and true respectively.
				599
				600	Additional information can be added to the mappings. To avoid conflicts, any
				601	non-AMD key names should be prefixed by "vendor-name.".
				602
				603	.. table:: AMDHSA Code Object Metadata Mapping
				604	:name: amdgpu-amdhsa-code-object-metadata-mapping-table
				605
				606	========== ============== ========= =======================================
				607	String Key Value Type Required? Description
				608	========== ============== ========= =======================================
				609	"Version" sequence of Required - The first integer is the major
				610	2 integers version. Currently 1.
				611	- The second integer is the minor
				612	version. Currently 0.
				613	"Printf" sequence of Each string is encoded information
				614	strings about a printf function call. The
				615	encoded information is organized as
				616	fields separated by colon (':'):
				617
				618	``ID:N:S[0]:S[1]:...:S[N-1]:FormatString``
				619
				620	where:
				621
				622	``ID``
				623	A 32 bit integer as a unique id for
				624	each printf function call
				625
				626	``N``
				627	A 32 bit integer equal to the number
				628	of arguments of printf function call
				629	minus 1
				630
				631	``S[i]`` (where i = 0, 1, ... , N-1)
				632	32 bit integers for the size in bytes
				633	of the i-th FormatString argument of
				634	the printf function call
				635
				636	FormatString
				637	The format string passed to the
				638	printf function call.
				639	"Kernels" sequence of Required Sequence of the mappings for each
				640	mapping kernel in the code object. See
				641	:ref:`amdgpu-amdhsa-code-object-kernel-metadata-mapping-table`
				642	for the definition of the mapping.
				643	========== ============== ========= =======================================
				644
				645	..
				646
				647	.. table:: AMDHSA Code Object Kernel Metadata Mapping
				648	:name: amdgpu-amdhsa-code-object-kernel-metadata-mapping-table
				649
				650	================= ============== ========= ================================
				651	String Key Value Type Required? Description
				652	================= ============== ========= ================================
				653	"Name" string Required Source name of the kernel.
				654	"SymbolName" string Required Name of the kernel
				655	descriptor ELF symbol.
				656	"Language" string Source language of the kernel.
				657	Values include:
				658
				659	- "OpenCL C"
				660	- "OpenCL C++"
				661	- "HCC"
				662	- "OpenMP"
				663
				664	"LanguageVersion" sequence of - The first integer is the major
				665	2 integers version.
				666	- The second integer is the
				667	minor version.
				668	"Attrs" mapping Mapping of kernel attributes.
				669	See
				670	:ref:`amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table`
				671	for the mapping definition.
				672	"Arguments" sequence of Sequence of mappings of the
				673	mapping kernel arguments. See
				674	:ref:`amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table`
				675	for the definition of the mapping.
				676	"CodeProps" mapping Mapping of properties related to
				677	the kernel code. See
				678	:ref:`amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table`
				679	for the mapping definition.
				680	"DebugProps" mapping Mapping of properties related to
				681	the kernel debugging. See
				682	:ref:`amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table`
				683	for the mapping definition.
				684	================= ============== ========= ================================
				685
				686	..
				687
				688	.. table:: AMDHSA Code Object Kernel Attribute Metadata Mapping
				689	:name: amdgpu-amdhsa-code-object-kernel-attribute-metadata-mapping-table
				690
				691	=================== ============== ========= ==============================
				692	String Key Value Type Required? Description
				693	=================== ============== ========= ==============================
				694	"ReqdWorkGroupSize" sequence of The dispatch work-group size
				695	3 integers X, Y, Z must correspond to the
				696	specified values.
				697
				698	Corresponds to the OpenCL
				699	``reqd_work_group_size``
				700	attribute.
				701	"WorkGroupSizeHint" sequence of The dispatch work-group size
				702	3 integers X, Y, Z is likely to be the
				703	specified values.
				704
				705	Corresponds to the OpenCL
				706	``work_group_size_hint``
				707	attribute.
				708	"VecTypeHint" string The name of a scalar or vector
				709	type.
				710
				711	Corresponds to the OpenCL
				712	``vec_type_hint`` attribute.
				713	=================== ============== ========= ==============================
				714
				715	..
				716
				717	.. table:: AMDHSA Code Object Kernel Argument Metadata Mapping
				718	:name: amdgpu-amdhsa-code-object-kernel-argument-metadata-mapping-table
				719
				720	================= ============== ========= ================================
				721	String Key Value Type Required? Description
				722	================= ============== ========= ================================
				723	"Name" string Kernel argument name.
				724	"TypeName" string Kernel argument type name.
				725	"Size" integer Required Kernel argument size in bytes.
				726	"Align" integer Required Kernel argument alignment in
				727	bytes. Must be a power of two.
				728	"ValueKind" string Required Kernel argument kind that
				729	specifies how to set up the
				730	corresponding argument.
				731	Values include:
				732
				733	"ByValue"
				734	The argument is copied
				735	directly into the kernarg.
				736
				737	"GlobalBuffer"
				738	A global address space pointer
				739	to the buffer data is passed
				740	in the kernarg.
				741
				742	"DynamicSharedPointer"
				743	A group address space pointer
				744	to dynamically allocated LDS
				745	is passed in the kernarg.
				746
				747	"Sampler"
				748	A global address space
				749	pointer to a S# is passed in
				750	the kernarg.
				751
				752	"Image"
				753	A global address space
				754	pointer to a T# is passed in
				755	the kernarg.
				756
				757	"Pipe"
				758	A global address space pointer
				759	to an OpenCL pipe is passed in
				760	the kernarg.
				761
				762	"Queue"
				763	A global address space pointer
				764	to an OpenCL device enqueue
				765	queue is passed in the
				766	kernarg.
				767
				768	"HiddenGlobalOffsetX"
				769	The OpenCL grid dispatch
				770	global offset for the X
				771	dimension is passed in the
				772	kernarg.
				773
				774	"HiddenGlobalOffsetY"
				775	The OpenCL grid dispatch
				776	global offset for the Y
				777	dimension is passed in the
				778	kernarg.
				779
				780	"HiddenGlobalOffsetZ"
				781	The OpenCL grid dispatch
				782	global offset for the Z
				783	dimension is passed in the
				784	kernarg.
				785
				786	"HiddenNone"
				787	An argument that is not used
				788	by the kernel. Space needs to
				789	be left for it, but it does
				790	not need to be set up.
				791
				792	"HiddenPrintfBuffer"
				793	A global address space pointer
				794	to the runtime printf buffer
				795	is passed in kernarg.
				796
				797	"HiddenDefaultQueue"
				798	A global address space pointer
				799	to the OpenCL device enqueue
				800	queue that should be used by
				801	the kernel by default is
				802	passed in the kernarg.
				803
				804	"HiddenCompletionAction"
				805	TBD
				806
				807	.. TODO
				808	Add description.
				809
				810	"ValueType" string Required Kernel argument value type. Only
				811	present if "ValueKind" is
				812	"ByValue". For vector data
				813	types, the value is for the
				814	element type. Values include:
				815
				816	- "Struct"
				817	- "I8"
				818	- "U8"
				819	- "I16"
				820	- "U16"
				821	- "F16"
				822	- "I32"
				823	- "U32"
				824	- "F32"
				825	- "I64"
				826	- "U64"
				827	- "F64"
				828
				829	.. TODO
				830	How can it be determined if a
				831	vector type, and what size
				832	vector?
				833	"PointeeAlign" integer Alignment in bytes of pointee
				834	type for pointer type kernel
				835	argument. Must be a power
				836	of 2. Only present if
				837	"ValueKind" is
				838	"DynamicSharedPointer".
				839	"AddrSpaceQual" string Kernel argument address space
				840	qualifier. Only present if
				841	"ValueKind" is "GlobalBuffer" or
				842	"DynamicSharedPointer". Values
				843	are:
				844
				845	- "Private"
				846	- "Global"
				847	- "Constant"
				848	- "Local"
				849	- "Generic"
				850	- "Region"
				851
				852	.. TODO
				853	Is GlobalBuffer only Global
				854	or Constant? Is
				855	DynamicSharedPointer always
				856	Local? Can HCC allow Generic?
				857	How can Private or Region
				858	ever happen?
				859	"AccQual" string Kernel argument access
				860	qualifier. Only present if
				861	"ValueKind" is "Image" or
				862	"Pipe". Values
				863	are:
				864
				865	- "ReadOnly"
				866	- "WriteOnly"
				867	- "ReadWrite"
				868
				869	.. TODO
				870	Does this apply to
				871	GlobalBuffer?
				872	"ActualAcc" string The actual memory accesses
				873	performed by the kernel on the
				874	kernel argument. Only present if
				875	"ValueKind" is "GlobalBuffer",
				876	"Image", or "Pipe". This may be
				877	more restrictive than indicated
				878	by "AccQual" to reflect what the
				879	kernel actual does. If not
				880	present then the runtime must
				881	assume what is implied by
				882	"AccQual" and "IsConst". Values
				883	are:
				884
				885	- "ReadOnly"
				886	- "WriteOnly"
				887	- "ReadWrite"
				888
				889	"IsConst" boolean Indicates if the kernel argument
				890	is const qualified. Only present
				891	if "ValueKind" is
				892	"GlobalBuffer".
				893
				894	"IsRestrict" boolean Indicates if the kernel argument
				895	is restrict qualified. Only
				896	present if "ValueKind" is
				897	"GlobalBuffer".
				898
				899	"IsVolatile" boolean Indicates if the kernel argument
				900	is volatile qualified. Only
				901	present if "ValueKind" is
				902	"GlobalBuffer".
				903
				904	"IsPipe" boolean Indicates if the kernel argument
				905	is pipe qualified. Only present
				906	if "ValueKind" is "Pipe".
				907
				908	.. TODO
				909	Can GlobalBuffer be pipe
				910	qualified?
				911	================= ============== ========= ================================
				912
				913	..
				914
				915	.. table:: AMDHSA Code Object Kernel Code Properties Metadata Mapping
				916	:name: amdgpu-amdhsa-code-object-kernel-code-properties-metadata-mapping-table
				917
				918	============================ ============== ========= =====================
				919	String Key Value Type Required? Description
				920	============================ ============== ========= =====================
				921	"KernargSegmentSize" integer Required The size in bytes of
				922	the kernarg segment
				923	that holds the values
				924	of the arguments to
				925	the kernel.
				926	"GroupSegmentFixedSize" integer Required The amount of group
				927	segment memory
				928	required by a
				929	work-group in
				930	bytes. This does not
				931	include any
				932	dynamically allocated
				933	group segment memory
				934	that may be added
				935	when the kernel is
				936	dispatched.
				937	"PrivateSegmentFixedSize" integer Required The amount of fixed
				938	private address space
				939	memory required for a
				940	work-item in
				941	bytes. If
				942	IsDynamicCallstack
				943	is 1 then additional
				944	space must be added
				945	to this value for the
				946	call stack.
				947	"KernargSegmentAlign" integer Required The maximum byte
				948	alignment of
				949	arguments in the
				950	kernarg segment. Must
				951	be a power of 2.
				952	"WavefrontSize" integer Required Wavefront size. Must
				953	be a power of 2.
				954	"NumSGPRs" integer Number of scalar
				955	registers used by a
				956	wavefront for
				957	GFX6-GFX9. This
				958	includes the special
				959	SGPRs for VCC, Flat
				960	Scratch (GFX7-GFX9)
				961	and XNACK (for
				962	GFX8-GFX9). It does
				963	not include the 16
				964	SGPR added if a trap
				965	handler is
				966	enabled. It is not
				967	rounded up to the
				968	allocation
				969	granularity.
				970	"NumVGPRs" integer Number of vector
				971	registers used by
				972	each work-item for
				973	GFX6-GFX9
				974	"MaxFlatWorkgroupSize" integer Maximum flat
				975	work-group size
				976	supported by the
				977	kernel in work-items.
				978	"IsDynamicCallStack" boolean Indicates if the
				979	generated machine
				980	code is using a
				981	dynamically sized
				982	call stack.
				983	"IsXNACKEnabled" boolean Indicates if the
				984	generated machine
				985	code is capable of
				986	supporting XNACK.
				987	============================ ============== ========= =====================
				988
				989	..
				990
				991	.. table:: AMDHSA Code Object Kernel Debug Properties Metadata Mapping
				992	:name: amdgpu-amdhsa-code-object-kernel-debug-properties-metadata-mapping-table
				993
				994	=================================== ============== ========= ==============
				995	String Key Value Type Required? Description
				996	=================================== ============== ========= ==============
				997	"DebuggerABIVersion" string
				998	"ReservedNumVGPRs" integer
				999	"ReservedFirstVGPR" integer
				1000	"PrivateSegmentBufferSGPR" integer
				1001	"WavefrontPrivateSegmentOffsetSGPR" integer
				1002	=================================== ============== ========= ==============
				1003
				1004	.. TODO
				1005	Plan to remove the debug properties metadata.
				1006
				1007	.. _amdgpu-symbols:
				1008
				1009	Symbols
				1010	-------
				1011
				1012	Symbols include the following:
				1013
				1014	.. table:: AMDGPU ELF Symbols
				1015	:name: amdgpu-elf-symbols-table
				1016
				1017	===================== ============== ============= ==================
				1018	Name Type Section Description
				1019	===================== ============== ============= ==================
				1020	link-name ``STT_OBJECT`` - ``.data`` Global variable
				1021	- ``.rodata``
				1022	- ``.bss``
				1023	link-name\ ``@kd`` ``STT_OBJECT`` - ``.rodata`` Kernel descriptor
				1024	link-name ``STT_FUNC`` - ``.text`` Kernel entry point
				1025	===================== ============== ============= ==================
				1026
				1027	Global variable
				1028	Global variables both used and defined by the compilation unit.
				1029
				1030	If the symbol is defined in the compilation unit then it is allocated in the
				1031	appropriate section according to if it has initialized data or is readonly.
				1032
				1033	If the symbol is external then its section is ``STN_UNDEF`` and the loader
				1034	will resolve relocations using the defintion provided by another code object
				1035	or explicitly defined by the runtime.
				1036
				1037	All global symbols, whether defined in the compilation unit or external, are
				1038	accessed by the machine code indirectly throught a GOT table entry. This
				1039	allows them to be preemptable. The GOT table is only supported when the target
				1040	triple OS is ``amdhsa`` (see :ref:`amdgpu-target-triples`).
				1041
				1042	.. TODO
				1043	Add description of linked shared object symbols. Seems undefined symbols
				1044	are marked as STT_NOTYPE.
				1045
				1046	Kernel descriptor
				1047	Every HSA kernel has an associated kernel descriptor. It is the address of the
				1048	kernel descriptor that is used in the AQL dispatch packet used to invoke the
				1049	kernel, not the kernel entry point. The layout of the HSA kernel descriptor is
				1050	defined in :ref:`amdgpu-amdhsa-kernel-descriptor`.
				1051
				1052	Kernel entry point
				1053	Every HSA kernel also has a symbol for its machine code entry point.
				1054
				1055	.. _amdgpu-relocation-records:
				1056
				1057	Relocation Records
				1058	------------------
				1059
				1060	AMDGPU backend generates ``Elf64_Rela`` relocation records. Supported
				1061	relocatable fields are:
				1062
				1063	``word32``
				1064	This specifies a 32-bit field occupying 4 bytes with arbitrary byte
				1065	alignment. These values use the same byte order as other word values in the
				1066	AMD GPU architecture.
				1067
				1068	``word64``
				1069	This specifies a 64-bit field occupying 8 bytes with arbitrary byte
				1070	alignment. These values use the same byte order as other word values in the
				1071	AMD GPU architecture.
				1072
				1073	Following notations are used for specifying relocation calculations:
				1074
				1075	A
				1076	Represents the addend used to compute the value of the relocatable field.
				1077
				1078	G
				1079	Represents the offset into the global offset table at which the relocation
				1080	entry’s symbol will reside during execution.
				1081
				1082	GOT
				1083	Represents the address of the global offset table.
				1084
				1085	P
				1086	Represents the place (section offset for ``et_rel`` or address for ``et_dyn``)
				1087	of the storage unit being relocated (computed using ``r_offset``).
				1088
				1089	S
				1090	Represents the value of the symbol whose index resides in the relocation
				1091	entry.
				1092
				1093	The following relocation types are supported:
				1094
				1095	.. table:: AMDGPU ELF Relocation Records
				1096	:name: amdgpu-elf-relocation-records-table
				1097
				1098	========================== ===== ========== ==============================
				1099	Relocation Type Value Field Calculation
				1100	========================== ===== ========== ==============================
				1101	``R_AMDGPU_NONE`` 0 none none
				1102	``R_AMDGPU_ABS32_LO`` 1 ``word32`` (S + A) & 0xFFFFFFFF
				1103	``R_AMDGPU_ABS32_HI`` 2 ``word32`` (S + A) >> 32
				1104	``R_AMDGPU_ABS64`` 3 ``word64`` S + A
				1105	``R_AMDGPU_REL32`` 4 ``word32`` S + A - P
				1106	``R_AMDGPU_REL64`` 5 ``word64`` S + A - P
				1107	``R_AMDGPU_ABS32`` 6 ``word32`` S + A
				1108	``R_AMDGPU_GOTPCREL`` 7 ``word32`` G + GOT + A - P
				1109	``R_AMDGPU_GOTPCREL32_LO`` 8 ``word32`` (G + GOT + A - P) & 0xFFFFFFFF
				1110	``R_AMDGPU_GOTPCREL32_HI`` 9 ``word32`` (G + GOT + A - P) >> 32
				1111	``R_AMDGPU_REL32_LO`` 10 ``word32`` (S + A - P) & 0xFFFFFFFF
				1112	``R_AMDGPU_REL32_HI`` 11 ``word32`` (S + A - P) >> 32
				1113	========================== ===== ========== ==============================
				1114
				1115	.. _amdgpu-dwarf:
				1116
				1117	DWARF
				1118	-----
				1119
				1120	Standard DWARF [DWARF]_ Version 2 sections can be generated. These contain
				1121	information that maps the code object executable code and data to the source
				1122	language constructs. It can be used by tools such as debuggers and profilers.
				1123
				1124	Address Space Mapping
				1125	~~~~~~~~~~~~~~~~~~~~~
				1126
				1127	The following address space mapping is used:
				1128
				1129	.. table:: AMDGPU DWARF Address Space Mapping
				1130	:name: amdgpu-dwarf-address-space-mapping-table
				1131
				1132	=================== =================
				1133	DWARF Address Space Memory Space
				1134	=================== =================
				1135	1 Private (Scratch)
				1136	2 Local (group/LDS)
				1137	omitted Global
				1138	omitted Constant
				1139	omitted Generic (Flat)
				1140	not supported Region (GDS)
				1141	=================== =================
				1142
				1143	See :ref:`amdgpu-address-spaces` for infomration on the memory space terminology
				1144	used in the table.
				1145
				1146	An ``address_class`` attribute is generated on pointer type DIEs to specify the
				1147	DWARF address space of the value of the pointer when it is in the private or
				1148	local address space. Otherwise the attribute is omitted.
				1149
				1150	An ``XDEREF`` operation is generated in location list expressions for variables
				1151	that are allocated in the private and local address space. Otherwise no
				1152	``XDREF`` is omitted.
				1153
				1154	Register Mapping
				1155	~~~~~~~~~~~~~~~~
				1156
				1157	This section is WIP.
				1158
				1159	.. TODO
				1160	Define DWARF register enumeration.
				1161
				1162	If want to present a wavefront state then should expose vector registers as
				1163	64 wide (rather than per work-item view that LLVM uses). Either as seperate
				1164	registers, or a 64x4 byte single register. In either case use a new LANE op
				1165	(akin to XDREF) to select the current lane usage in a location
				1166	expression. This would also allow scalar register spilling to vector register
				1167	lanes to be expressed (currently no debug information is being generated for
				1168	spilling). If choose a wide single register approach then use LANE in
				1169	conjunction with PIECE operation to select the dword part of the register for
				1170	the current lane. If the separate register approach then use LANE to select
				1171	the register.
				1172
				1173	Source Text
				1174	~~~~~~~~~~~
				1175
				1176	This section is WIP.
				1177
				1178	.. TODO
				1179	DWARF extension to include runtime generated source text.
				1180
				1181	.. _amdgpu-code-conventions:
				1182
				1183	Code Conventions
				1184	================
				1185
				1186	AMDHSA
				1187	------
				1188
				1189	This section provides code conventions used when the target triple OS is
				1190	``amdhsa`` (see :ref:`amdgpu-target-triples`).
				1191
				1192	Kernel Dispatch
				1193	~~~~~~~~~~~~~~~
				1194
				1195	The HSA architected queuing language (AQL) defines a user space memory interface
				1196	that can be used to control the dispatch of kernels, in an agent independent
				1197	way. An agent can have zero or more AQL queues created for it using the ROCm
				1198	runtime, in which AQL packets (all of which are 64 bytes) can be placed. See the
				1199	HSA Platform System Architecture Specification [HSA]_ for the AQL queue
				1200	mechanics and packet layouts.
				1201
				1202	The packet processor of a kernel agent is responsible for detecting and
				1203	dispatching HSA kernels from the AQL queues associated with it. For AMD GPUs the
				1204	packet processor is implemented by the hardware command processor (CP),
				1205	asynchronous dispatch controller (ADC) and shader processor input controller
				1206	(SPI).
				1207
				1208	The ROCm runtime can be used to allocate an AQL queue object. It uses the kernel
				1209	mode driver to initialize and register the AQL queue with CP.
				1210
				1211	To dispatch a kernel the following actions are performed. This can occur in the
				1212	CPU host program, or from an HSA kernel executing on a GPU.
				1213
				1214	1. A pointer to an AQL queue for the kernel agent on which the kernel is to be
				1215	executed is obtained.
				1216	2. A pointer to the kernel descriptor (see
				1217	:ref:`amdgpu-amdhsa-kernel-descriptor`) of the kernel to execute is
				1218	obtained. It must be for a kernel that is contained in a code object that that
				1219	was loaded by the ROCm runtime on the kernel agent with which the AQL queue is
				1220	associated.
				1221	3. Space is allocated for the kernel arguments using the ROCm runtime allocator
				1222	for a memory region with the kernarg property for the kernel agent that will
				1223	execute the kernel. It must be at least 16 byte aligned.
				1224	4. Kernel argument values are assigned to the kernel argument memory
				1225	allocation. The layout is defined in the HSA Programmer’s Language Reference
				1226	[HSA]_. For AMDGPU the kernel execution directly accesses the kernel argument
				1227	memory in the same way constant memory is accessed. (Note that the HSA
				1228	specification allows an implementation to copy the kernel argument contents to
				1229	another location that is accessed by the kernel.)
				1230	5. An AQL kernel dispatch packet is created on the AQL queue. The ROCm runtime
				1231	api uses 64 bit atomic operations to reserve space in the AQL queue for the
				1232	packet. The packet must be set up, and the final write must use an atomic
				1233	store release to set the packet kind to ensure the packet contents are
				1234	visible to the kernel agent. AQL defines a doorbell signal mechanism to
				1235	notify the kernel agent that the AQL queue has been updated. These rules, and
				1236	the layout of the AQL queue and kernel dispatch packet is defined in the *HSA
				1237	System Architecture Specification* [HSA]_.
				1238	6. A kernel dispatch packet includes information about the actual dispatch,
				1239	such as grid and work-group size, together with information from the code
				1240	object about the kernel, such as segment sizes. The ROCm runtime queries on
				1241	the kernel symbol can be used to obtain the code object values which are
				1242	recorded in the :ref:`amdgpu-code-object-metadata`.
				1243	7. CP executes micro-code and is responsible for detecting and setting up the
				1244	GPU to execute the wavefronts of a kernel dispatch.
				1245	8. CP ensures that when the a wavefront starts executing the kernel machine
				1246	code, the scalar general purpose registers (SGPR) and vector general purpose
				1247	registers (VGPR) are set up as required by the machine code. The required
				1248	setup is defined in the :ref:`amdgpu-amdhsa-kernel-descriptor`. The initial
				1249	register state is defined in
				1250	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`.
				1251	9. The prolog of the kernel machine code (see
				1252	:ref:`amdgpu-amdhsa-kernel-prolog`) sets up the machine state as necessary
				1253	before continuing executing the machine code that corresponds to the kernel.
				1254	10. When the kernel dispatch has completed execution, CP signals the completion
				1255	signal specified in the kernel dispatch packet if not 0.
				1256
				1257	.. _amdgpu-amdhsa-memory-spaces:
				1258
				1259	Memory Spaces
				1260	~~~~~~~~~~~~~
				1261
				1262	The memory space properties are:
				1263
				1264	.. table:: AMDHSA Memory Spaces
				1265	:name: amdgpu-amdhsa-memory-spaces-table
				1266
				1267	================= =========== ======== ======= ==================
				1268	Memory Space Name HSA Segment Hardware Address NULL Value
				1269	Name Name Size
				1270	================= =========== ======== ======= ==================
				1271	Private private scratch 32 0x00000000
				1272	Local group LDS 32 0xFFFFFFFF
				1273	Global global global 64 0x0000000000000000
				1274	Constant constant *same as 64 0x0000000000000000
				1275	global*
				1276	Generic flat flat 64 0x0000000000000000
				1277	Region N/A GDS 32 *not implemented
				1278	for AMDHSA*
				1279	================= =========== ======== ======= ==================
				1280
				1281	The global and constant memory spaces both use global virtual addresses, which
				1282	are the same virtual address space used by the CPU. However, some virtual
				1283	addresses may only be accessible to the CPU, some only accessible by the GPU,
				1284	and some by both.
				1285
				1286	Using the constant memory space indicates that the data will not change during
				1287	the execution of the kernel. This allows scalar read instructions to be
				1288	used. The vector and scalar L1 caches are invalidated of volatile data before
				1289	each kernel dispatch execution to allow constant memory to change values between
				1290	kernel dispatches.
				1291
				1292	The local memory space uses the hardware Local Data Store (LDS) which is
				1293	automatically allocated when the hardware creates work-groups of wavefronts, and
				1294	freed when all the wavefronts of a work-group have terminated. The data store
				1295	(DS) instructions can be used to access it.
				1296
				1297	The private memory space uses the hardware scratch memory support. If the kernel
				1298	uses scratch, then the hardware allocates memory that is accessed using
				1299	wavefront lane dword (4 byte) interleaving. The mapping used from private
				1300	address to physical address is:
				1301
				1302	``wavefront-scratch-base +
				1303	(private-address * wavefront-size * 4) +
				1304	(wavefront-lane-id * 4)``
				1305
				1306	There are different ways that the wavefront scratch base address is determined
				1307	by a wavefront (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). This
				1308	memory can be accessed in an interleaved manner using buffer instruction with
				1309	the scratch buffer descriptor and per wave scratch offset, by the scratch
				1310	instructions, or by flat instructions. If each lane of a wavefront accesses the
				1311	same private address, the interleaving results in adjacent dwords being accessed
				1312	and hence requires fewer cache lines to be fetched. Multi-dword access is not
				1313	supported except by flat and scratch instructions in GFX9.
				1314
				1315	The generic address space uses the hardware flat address support available in
				1316	GFX7-GFX9. This uses two fixed ranges of virtual addresses (the private and
				1317	local appertures), that are outside the range of addressible global memory, to
				1318	map from a flat address to a private or local address.
				1319
				1320	FLAT instructions can take a flat address and access global, private (scratch)
				1321	and group (LDS) memory depending in if the address is within one of the
				1322	apperture ranges. Flat access to scratch requires hardware aperture setup and
				1323	setup in the kernel prologue (see :ref:`amdgpu-amdhsa-flat-scratch`). Flat
				1324	access to LDS requires hardware aperture setup and M0 (GFX7-GFX8) register setup
				1325	(see :ref:`amdgpu-amdhsa-m0`).
				1326
				1327	To convert between a segment address and a flat address the base address of the
				1328	appertures address can be used. For GFX7-GFX8 these are available in the
				1329	:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
				1330	Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
				1331	GFX9 the appature base addresses are directly available as inline constant
				1332	registers ``SRC_SHARED_BASE/LIMIT`` and ``SRC_PRIVATE_BASE/LIMIT``. In 64 bit
				1333	address mode the apperture sizes are 2^32 bytes and the base is aligned to 2^32
				1334	which makes it easier to convert from flat to segment or segment to flat.
				1335
				1336	HSA Image and Samplers
				1337	~~~~~~~~~~~~~~~~~~~~~~
				1338
				1339	Image and sample handles created by the ROCm runtime are 64 bit addresses of a
				1340	hardware 32 byte V# and 48 byte S# object respectively. In order to support the
				1341	HSA ``query_sampler`` operations two extra dwords are used to store the HSA BRIG
				1342	enumeration values for the queries that are not trivially deducible from the S#
				1343	representation.
				1344
				1345	HSA Signals
				1346	~~~~~~~~~~~
				1347
				1348	Signal handles created by the ROCm runtime are 64 bit addresses of a structure
				1349	allocated in memory accessible from both the CPU and GPU. The structure is
				1350	defined by the ROCm runtime and subject to change between releases (see
				1351	[AMD-ROCm-github]_).
				1352
				1353	.. _amdgpu-amdhsa-hsa-aql-queue:
				1354
				1355	HSA AQL Queue
				1356	~~~~~~~~~~~~~
				1357
				1358	The AQL queue structure is defined by the ROCm runtime and subject to change
				1359	between releases (see [AMD-ROCm-github]_). For some processors it contains
				1360	fields needed to implement certain language features such as the flat address
				1361	aperture bases. It also contains fields used by CP such as managing the
				1362	allocation of scratch memory.
				1363
				1364	.. _amdgpu-amdhsa-kernel-descriptor:
				1365
				1366	Kernel Descriptor
				1367	~~~~~~~~~~~~~~~~~
				1368
				1369	A kernel descriptor consists of the information needed by CP to initiate the
				1370	execution of a kernel, including the entry point address of the machine code
				1371	that implements the kernel.
				1372
				1373	Kernel Descriptor for GFX6-GFX9
				1374	+++++++++++++++++++++++++++++++
				1375
				1376	CP microcode requires the Kernel descritor to be allocated on 64 byte alignment.
				1377
				1378	.. table:: Kernel Descriptor for GFX6-GFX9
				1379	:name: amdgpu-amdhsa-kernel-descriptor-gfx6-gfx9-table
				1380
				1381	======= ======= =============================== ===========================
				1382	Bits Size Field Name Description
				1383	======= ======= =============================== ===========================
				1384	31:0 4 bytes group_segment_fixed_size The amount of fixed local
				1385	address space memory
				1386	required for a work-group
				1387	in bytes. This does not
				1388	include any dynamically
				1389	allocated local address
				1390	space memory that may be
				1391	added when the kernel is
				1392	dispatched.
				1393	63:32 4 bytes private_segment_fixed_size The amount of fixed
				1394	private address space
				1395	memory required for a
				1396	work-item in bytes. If
				1397	is_dynamic_callstack is 1
				1398	then additional space must
				1399	be added to this value for
				1400	the call stack.
				1401	95:64 4 bytes max_flat_workgroup_size Maximum flat work-group
				1402	size supported by the
				1403	kernel in work-items.
				1404	96 1 bit is_dynamic_call_stack Indicates if the generated
				1405	machine code is using a
				1406	dynamically sized call
				1407	stack.
				1408	97 1 bit is_xnack_enabled Indicates if the generated
				1409	machine code is capable of
				1410	suppoting XNACK.
				1411	127:98 30 bits Reserved. Must be 0.
				1412	191:128 8 bytes kernel_code_entry_byte_offset Byte offset (possibly
				1413	negative) from base
				1414	address of kernel
				1415	descriptor to kernel's
				1416	entry point instruction
				1417	which must be 256 byte
				1418	aligned.
				1419	383:192 24 Reserved. Must be 0.
				1420	bytes
				1421	415:384 4 bytes compute_pgm_rsrc1 Compute Shader (CS)
				1422	program settings used by
				1423	CP to set up
				1424	``COMPUTE_PGM_RSRC1``
				1425	configuration
				1426	register. See
				1427	:ref:`amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table`.
				1428	447:416 4 bytes compute_pgm_rsrc2 Compute Shader (CS)
				1429	program settings used by
				1430	CP to set up
				1431	``COMPUTE_PGM_RSRC2``
				1432	configuration
				1433	register. See
				1434	:ref:`amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table`.
				1435	448 1 bit enable_sgpr_private_segment Enable the setup of the
				1436	_buffer SGPR user data registers
				1437	(see
				1438	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1439
				1440	The total number of SGPR
				1441	user data registers
				1442	requested must not exceed
				1443	16 and match value in
				1444	``compute_pgm_rsrc2.user_sgpr.user_sgpr_count``.
				1445	Any requests beyond 16
				1446	will be ignored.
				1447	449 1 bit enable_sgpr_dispatch_ptr see above
				1448	450 1 bit enable_sgpr_queue_ptr see above
				1449	451 1 bit enable_sgpr_kernarg_segment_ptr see above
				1450	452 1 bit enable_sgpr_dispatch_id see above
				1451	453 1 bit enable_sgpr_flat_scratch_init see above
				1452	454 1 bit enable_sgpr_private_segment see above
				1453	_size
				1454	455 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
				1455	_count_X should always be 0.
				1456	456 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
				1457	_count_Y should always be 0.
				1458	457 1 bit enable_sgpr_grid_workgroup Not implemented in CP and
				1459	_count_Z should always be 0.
				1460	463:458 6 bits Reserved. Must be 0.
				1461	511:464 4 Reserved. Must be 0.
				1462	bytes
				1463	512 Total size 64 bytes.
				1464	======= ===================================================================
				1465
				1466	..
				1467
				1468	.. table:: compute_pgm_rsrc1 for GFX6-GFX9
				1469	:name: amdgpu-amdhsa-compute_pgm_rsrc1_t-gfx6-gfx9-table
				1470
				1471	======= ======= =============================== ===========================
				1472	Bits Size Field Name Description
				1473	======= ======= =============================== ===========================
				1474	5:0 6 bits granulated_workitem_vgpr_count Number of vector registers
				1475	used by each work-item,
				1476	granularity is device
				1477	specific:
				1478
				1479	GFX6-9
				1480	roundup((max-vgpg + 1)
				1481	/ 4) - 1
				1482
				1483	Used by CP to set up
				1484	``COMPUTE_PGM_RSRC1.VGPRS``.
				1485	9:6 4 bits granulated_wavefront_sgpr_count Number of scalar registers
				1486	used by a wavefront,
				1487	granularity is device
				1488	specific:
				1489
				1490	GFX6-8
				1491	roundup((max-sgpg + 1)
				1492	/ 8) - 1
				1493	GFX9
				1494	roundup((max-sgpg + 1)
				1495	/ 16) - 1
				1496
				1497	Includes the special SGPRs
				1498	for VCC, Flat Scratch (for
				1499	GFX7 onwards) and XNACK
				1500	(for GFX8 onwards). It does
				1501	not include the 16 SGPR
				1502	added if a trap handler is
				1503	enabled.
				1504
				1505	Used by CP to set up
				1506	``COMPUTE_PGM_RSRC1.SGPRS``.
				1507	11:10 2 bits priority Must be 0.
				1508
				1509	Start executing wavefront
				1510	at the specified priority.
				1511
				1512	CP is responsible for
				1513	filling in
				1514	``COMPUTE_PGM_RSRC1.PRIORITY``.
				1515	13:12 2 bits float_mode_round_32 Wavefront starts execution
				1516	with specified rounding
				1517	mode for single (32
				1518	bit) floating point
				1519	precision floating point
				1520	operations.
				1521
				1522	Floating point rounding
				1523	mode values are defined in
				1524	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1525
				1526	Used by CP to set up
				1527	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
				1528	15:14 2 bits float_mode_round_16_64 Wavefront starts execution
				1529	with specified rounding
				1530	denorm mode for half/double (16
				1531	and 64 bit) floating point
				1532	precision floating point
				1533	operations.
				1534
				1535	Floating point rounding
				1536	mode values are defined in
				1537	:ref:`amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table`.
				1538
				1539	Used by CP to set up
				1540	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
				1541	17:16 2 bits float_mode_denorm_32 Wavefront starts execution
				1542	with specified denorm mode
				1543	for single (32
				1544	bit) floating point
				1545	precision floating point
				1546	operations.
				1547
				1548	Floating point denorm mode
				1549	values are defined in
				1550	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1551
				1552	Used by CP to set up
				1553	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
				1554	19:18 2 bits float_mode_denorm_16_64 Wavefront starts execution
				1555	with specified denorm mode
				1556	for half/double (16
				1557	and 64 bit) floating point
				1558	precision floating point
				1559	operations.
				1560
				1561	Floating point denorm mode
				1562	values are defined in
				1563	:ref:`amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table`.
				1564
				1565	Used by CP to set up
				1566	``COMPUTE_PGM_RSRC1.FLOAT_MODE``.
				1567	20 1 bit priv Must be 0.
				1568
				1569	Start executing wavefront
				1570	in privilege trap handler
				1571	mode.
				1572
				1573	CP is responsible for
				1574	filling in
				1575	``COMPUTE_PGM_RSRC1.PRIV``.
				1576	21 1 bit enable_dx10_clamp Wavefront starts execution
				1577	with DX10 clamp mode
				1578	enabled. Used by the vector
				1579	ALU to force DX-10 style
				1580	treatment of NaN's (when
				1581	set, clamp NaN to zero,
				1582	otherwise pass NaN
				1583	through).
				1584
				1585	Used by CP to set up
				1586	``COMPUTE_PGM_RSRC1.DX10_CLAMP``.
				1587	22 1 bit debug_mode Must be 0.
				1588
				1589	Start executing wavefront
				1590	in single step mode.
				1591
				1592	CP is responsible for
				1593	filling in
				1594	``COMPUTE_PGM_RSRC1.DEBUG_MODE``.
				1595	23 1 bit enable_ieee_mode Wavefront starts execution
				1596	with IEEE mode
				1597	enabled. Floating point
				1598	opcodes that support
				1599	exception flag gathering
				1600	will quiet and propagate
				1601	signaling-NaN inputs per
				1602	IEEE 754-2008. Min_dx10 and
				1603	max_dx10 become IEEE
				1604	754-2008 compliant due to
				1605	signaling-NaN propagation
				1606	and quieting.
				1607
				1608	Used by CP to set up
				1609	``COMPUTE_PGM_RSRC1.IEEE_MODE``.
				1610	24 1 bit bulky Must be 0.
				1611
				1612	Only one work-group allowed
				1613	to execute on a compute
				1614	unit.
				1615
				1616	CP is responsible for
				1617	filling in
				1618	``COMPUTE_PGM_RSRC1.BULKY``.
				1619	25 1 bit cdbg_user Must be 0.
				1620
				1621	Flag that can be used to
				1622	control debugging code.
				1623
				1624	CP is responsible for
				1625	filling in
				1626	``COMPUTE_PGM_RSRC1.CDBG_USER``.
				1627	31:26 6 bits Reserved. Must be 0.
				1628	32 Total size 4 bytes
				1629	======= ===================================================================
				1630
				1631	..
				1632
				1633	.. table:: compute_pgm_rsrc2 for GFX6-GFX9
				1634	:name: amdgpu-amdhsa-compute_pgm_rsrc2-gfx6-gfx9-table
				1635
				1636	======= ======= =============================== ===========================
				1637	Bits Size Field Name Description
				1638	======= ======= =============================== ===========================
				1639	0 1 bit enable_sgpr_private_segment Enable the setup of the
				1640	_wave_offset SGPR wave scratch offset
				1641	system register (see
				1642	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1643
				1644	Used by CP to set up
				1645	``COMPUTE_PGM_RSRC2.SCRATCH_EN``.
				1646	5:1 5 bits user_sgpr_count The total number of SGPR
				1647	user data registers
				1648	requested. This number must
				1649	match the number of user
				1650	data registers enabled.
				1651
				1652	Used by CP to set up
				1653	``COMPUTE_PGM_RSRC2.USER_SGPR``.
				1654	6 1 bit enable_trap_handler Set to 1 if code contains a
				1655	TRAP instruction which
				1656	requires a trap hander to
				1657	be enabled.
				1658
				1659	CP sets
				1660	``COMPUTE_PGM_RSRC2.TRAP_PRESENT``
				1661	if the runtime has
				1662	installed a trap handler
				1663	regardless of the setting
				1664	of this field.
				1665	7 1 bit enable_sgpr_workgroup_id_x Enable the setup of the
				1666	system SGPR register for
				1667	the work-group id in the X
				1668	dimension (see
				1669	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1670
				1671	Used by CP to set up
				1672	``COMPUTE_PGM_RSRC2.TGID_X_EN``.
				1673	8 1 bit enable_sgpr_workgroup_id_y Enable the setup of the
				1674	system SGPR register for
				1675	the work-group id in the Y
				1676	dimension (see
				1677	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1678
				1679	Used by CP to set up
				1680	``COMPUTE_PGM_RSRC2.TGID_Y_EN``.
				1681	9 1 bit enable_sgpr_workgroup_id_z Enable the setup of the
				1682	system SGPR register for
				1683	the work-group id in the Z
				1684	dimension (see
				1685	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1686
				1687	Used by CP to set up
				1688	``COMPUTE_PGM_RSRC2.TGID_Z_EN``.
				1689	10 1 bit enable_sgpr_workgroup_info Enable the setup of the
				1690	system SGPR register for
				1691	work-group information (see
				1692	:ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
				1693
				1694	Used by CP to set up
				1695	``COMPUTE_PGM_RSRC2.TGID_SIZE_EN``.
				1696	12:11 2 bits enable_vgpr_workitem_id Enable the setup of the
				1697	VGPR system registers used
				1698	for the work-item ID.
				1699	:ref:`amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table`
				1700	defines the values.
				1701
				1702	Used by CP to set up
				1703	``COMPUTE_PGM_RSRC2.TIDIG_CMP_CNT``.
				1704	13 1 bit enable_exception_address_watch Must be 0.
				1705
				1706	Wavefront starts execution
				1707	with address watch
				1708	exceptions enabled which
				1709	are generated when L1 has
				1710	witnessed a thread access
				1711	an *address of
				1712	interest*.
				1713
				1714	CP is responsible for
				1715	filling in the address
				1716	watch bit in
				1717	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1718	according to what the
				1719	runtime requests.
				1720	14 1 bit enable_exception_memory Must be 0.
				1721
				1722	Wavefront starts execution
				1723	with memory violation
				1724	exceptions exceptions
				1725	enabled which are generated
				1726	when a memory violation has
				1727	occurred for this wave from
				1728	L1 or LDS
				1729	(write-to-read-only-memory,
				1730	mis-aligned atomic, LDS
				1731	address out of range,
				1732	illegal address, etc.).
				1733
				1734	CP sets the memory
				1735	violation bit in
				1736	``COMPUTE_PGM_RSRC2.EXCP_EN_MSB``
				1737	according to what the
				1738	runtime requests.
				1739	23:15 9 bits granulated_lds_size Must be 0.
				1740
				1741	CP uses the rounded value
				1742	from the dispatch packet,
				1743	not this value, as the
				1744	dispatch may contain
				1745	dynamically allocated group
				1746	segment memory. CP writes
				1747	directly to
				1748	``COMPUTE_PGM_RSRC2.LDS_SIZE``.
				1749
				1750	Amount of group segment
				1751	(LDS) to allocate for each
				1752	work-group. Granularity is
				1753	device specific:
				1754
				1755	GFX6:
				1756	roundup(lds-size / (64 * 4))
				1757	GFX7-GFX9:
				1758	roundup(lds-size / (128 * 4))
				1759
				1760	24 1 bit enable_exception_ieee_754_fp Wavefront starts execution
				1761	_invalid_operation with specified exceptions
				1762	enabled.
				1763
				1764	Used by CP to set up
				1765	``COMPUTE_PGM_RSRC2.EXCP_EN``
				1766	(set from bits 0..6).
				1767
				1768	IEEE 754 FP Invalid
				1769	Operation
				1770	25 1 bit enable_exception_fp_denormal FP Denormal one or more
				1771	_source input operands is a
				1772	denormal number
				1773	26 1 bit enable_exception_ieee_754_fp IEEE 754 FP Division by
				1774	_division_by_zero Zero
				1775	27 1 bit enable_exception_ieee_754_fp IEEE 754 FP FP Overflow
				1776	_overflow
				1777	28 1 bit enable_exception_ieee_754_fp IEEE 754 FP Underflow
				1778	_underflow
				1779	29 1 bit enable_exception_ieee_754_fp IEEE 754 FP Inexact
				1780	_inexact
				1781	30 1 bit enable_exception_int_divide_by Integer Division by Zero
				1782	_zero (rcp_iflag_f32 instruction
				1783	only)
				1784	31 1 bit Reserved. Must be 0.
				1785	32 Total size 4 bytes.
				1786	======= ===================================================================
				1787
				1788	..
				1789
				1790	.. table:: Floating Point Rounding Mode Enumeration Values
				1791	:name: amdgpu-amdhsa-floating-point-rounding-mode-enumeration-values-table
				1792
				1793	===================================== ===== ===============================
				1794	Enumeration Name Value Description
				1795	===================================== ===== ===============================
				1796	AMD_FLOAT_ROUND_MODE_NEAR_EVEN 0 Round Ties To Even
				1797	AMD_FLOAT_ROUND_MODE_PLUS_INFINITY 1 Round Toward +infinity
				1798	AMD_FLOAT_ROUND_MODE_MINUS_INFINITY 2 Round Toward -infinity
				1799	AMD_FLOAT_ROUND_MODE_ZERO 3 Round Toward 0
				1800	===================================== ===== ===============================
				1801
				1802	..
				1803
				1804	.. table:: Floating Point Denorm Mode Enumeration Values
				1805	:name: amdgpu-amdhsa-floating-point-denorm-mode-enumeration-values-table
				1806
				1807	===================================== ===== ===============================
				1808	Enumeration Name Value Description
				1809	===================================== ===== ===============================
				1810	AMD_FLOAT_DENORM_MODE_FLUSH_SRC_DST 0 Flush Source and Destination
				1811	Denorms
				1812	AMD_FLOAT_DENORM_MODE_FLUSH_DST 1 Flush Output Denorms
				1813	AMD_FLOAT_DENORM_MODE_FLUSH_SRC 2 Flush Source Denorms
				1814	AMD_FLOAT_DENORM_MODE_FLUSH_NONE 3 No Flush
				1815	===================================== ===== ===============================
				1816
				1817	..
				1818
				1819	.. table:: System VGPR Work-Item ID Enumeration Values
				1820	:name: amdgpu-amdhsa-system-vgpr-work-item-id-enumeration-values-table
				1821
				1822	===================================== ===== ===============================
				1823	Enumeration Name Value Description
				1824	===================================== ===== ===============================
				1825	AMD_SYSTEM_VGPR_WORKITEM_ID_X 0 Set work-item X dimension ID.
				1826	AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y 1 Set work-item X and Y
				1827	dimensions ID.
				1828	AMD_SYSTEM_VGPR_WORKITEM_ID_X_Y_Z 2 Set work-item X, Y and Z
				1829	dimensions ID.
				1830	AMD_SYSTEM_VGPR_WORKITEM_ID_UNDEFINED 3 Undefined.
				1831	===================================== ===== ===============================
				1832
				1833	.. _amdgpu-amdhsa-initial-kernel-execution-state:
				1834
				1835	Initial Kernel Execution State
				1836	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1837
				1838	This section defines the register state that will be set up by the packet
				1839	processor prior to the start of execution of every wavefront. This is limited by
				1840	the constraints of the hardware controllers of CP/ADC/SPI.
				1841
				1842	The order of the SGPR registers is defined, but the compiler can specify which
				1843	ones are actually setup in the kernel descriptor using the ``enable_sgpr_*`` bit
				1844	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				1845	for enabled registers are dense starting at SGPR0: the first enabled register is
				1846	SGPR0, the next enabled register is SGPR1 etc.; disabled registers do not have
				1847	an SGPR number.
				1848
				1849	The initial SGPRs comprise up to 16 User SRGPs that are set by CP and apply to
				1850	all waves of the grid. It is possible to specify more than 16 User SGPRs using
				1851	the ``enable_sgpr_*`` bit fields, in which case only the first 16 are actually
				1852	initialized. These are then immediately followed by the System SGPRs that are
				1853	set up by ADC/SPI and can have different values for each wave of the grid
				1854	dispatch.
				1855
				1856	SGPR register initial state is defined in
				1857	:ref:`amdgpu-amdhsa-sgpr-register-set-up-order-table`.
				1858
				1859	.. table:: SGPR Register Set Up Order
				1860	:name: amdgpu-amdhsa-sgpr-register-set-up-order-table
				1861
				1862	========== ========================== ====== ==============================
				1863	SGPR Order Name Number Description
				1864	(kernel descriptor enable of
				1865	field) SGPRs
				1866	========== ========================== ====== ==============================
				1867	First Private Segment Buffer 4 V# that can be used, together
				1868	(enable_sgpr_private with Scratch Wave Offset as an
				1869	_segment_buffer) offset, to access the private
				1870	memory space using a segment
				1871	address.
				1872
				1873	CP uses the value provided by
				1874	the runtime.
				1875	then Dispatch Ptr 2 64 bit address of AQL dispatch
				1876	(enable_sgpr_dispatch_ptr) packet for kernel dispatch
				1877	actually executing.
				1878	then Queue Ptr 2 64 bit address of amd_queue_t
				1879	(enable_sgpr_queue_ptr) object for AQL queue on which
				1880	the dispatch packet was
				1881	queued.
				1882	then Kernarg Segment Ptr 2 64 bit address of Kernarg
				1883	(enable_sgpr_kernarg segment. This is directly
				1884	_segment_ptr) copied from the
				1885	kernarg_address in the kernel
				1886	dispatch packet.
				1887
				1888	Having CP load it once avoids
				1889	loading it at the beginning of
				1890	every wavefront.
				1891	then Dispatch Id 2 64 bit Dispatch ID of the
				1892	(enable_sgpr_dispatch_id) dispatch packet being
				1893	executed.
				1894	then Flat Scratch Init 2 This is 2 SGPRs:
				1895	(enable_sgpr_flat_scratch
				1896	_init) GFX6
				1897	Not supported.
				1898	GFX7-GFX8
				1899	The first SGPR is a 32 bit
				1900	byte offset from
				1901	``SH_HIDDEN_PRIVATE_BASE_VIMID``
				1902	to per SPI base of memory
				1903	for scratch for the queue
				1904	executing the kernel
				1905	dispatch. CP obtains this
				1906	from the runtime.
				1907
				1908	This is the same offset used
				1909	in computing the Scratch
				1910	Segment Buffer base
				1911	address. The value of
				1912	Scratch Wave Offset must be
				1913	added by the kernel machine
				1914	code and moved to SGPRn-4
				1915	for use as the FLAT SCRATCH
				1916	BASE in flat memory
				1917	instructions.
				1918
				1919	The second SGPR is 32 bit
				1920	byte size of a single
				1921	work-item’s scratch memory
				1922	usage. This is directly
				1923	loaded from the kernel
				1924	dispatch packet Private
				1925	Segment Byte Size and
				1926	rounded up to a multiple of
				1927	DWORD.
				1928
				1929	The kernel code must move to
				1930	SGPRn-3 for use as the FLAT
				1931	SCRATCH SIZE in flat memory
				1932	instructions. Having CP load
				1933	it once avoids loading it at
				1934	the beginning of every
				1935	wavefront.
				1936	GFX9
				1937	This is the 64 bit base
				1938	address of the per SPI
				1939	scratch backing memory
				1940	managed by SPI for the queue
				1941	executing the kernel
				1942	dispatch. CP obtains this
				1943	from the runtime (and
				1944	divides it if there are
				1945	multiple Shader Arrays each
				1946	with its own SPI). The value
				1947	of Scratch Wave Offset must
				1948	be added by the kernel
				1949	machine code and moved to
				1950	SGPRn-4 and SGPRn-3 for use
				1951	as the FLAT SCRATCH BASE in
				1952	flat memory instructions.
				1953	then Private Segment Size 1 The 32 bit byte size of a
				1954	(enable_sgpr_private single work-item’s scratch
				1955	_segment_size) memory allocation. This is the
				1956	value from the kernel dispatch
				1957	packet Private Segment Byte
				1958	Size rounded up by CP to a
				1959	multiple of DWORD.
				1960
				1961	Having CP load it once avoids
				1962	loading it at the beginning of
				1963	every wavefront.
				1964
				1965	This is not used for
				1966	GFX7-GFX8 since it is the same
				1967	value as the second SGPR of
				1968	Flat Scratch Init. However, it
				1969	may be needed for GFX9 which
				1970	changes the meaning of the
				1971	Flat Scratch Init value.
				1972	then Grid Work-Group Count X 1 32 bit count of the number of
				1973	(enable_sgpr_grid work-groups in the X dimension
				1974	_workgroup_count_X) for the grid being
				1975	executed. Computed from the
				1976	fields in the kernel dispatch
				1977	packet as ((grid_size.x +
				1978	workgroup_size.x - 1) /
				1979	workgroup_size.x).
				1980	then Grid Work-Group Count Y 1 32 bit count of the number of
				1981	(enable_sgpr_grid work-groups in the Y dimension
				1982	_workgroup_count_Y && for the grid being
				1983	less than 16 previous executed. Computed from the
				1984	SGPRs) fields in the kernel dispatch
				1985	packet as ((grid_size.y +
				1986	workgroup_size.y - 1) /
				1987	workgroupSize.y).
				1988
				1989	Only initialized if <16
				1990	previous SGPRs initialized.
				1991	then Grid Work-Group Count Z 1 32 bit count of the number of
				1992	(enable_sgpr_grid work-groups in the Z dimension
				1993	_workgroup_count_Z && for the grid being
				1994	less than 16 previous executed. Computed from the
				1995	SGPRs) fields in the kernel dispatch
				1996	packet as ((grid_size.z +
				1997	workgroup_size.z - 1) /
				1998	workgroupSize.z).
				1999
				2000	Only initialized if <16
				2001	previous SGPRs initialized.
				2002	then Work-Group Id X 1 32 bit work-group id in X
				2003	(enable_sgpr_workgroup_id dimension of grid for
				2004	_X) wavefront.
				2005	then Work-Group Id Y 1 32 bit work-group id in Y
				2006	(enable_sgpr_workgroup_id dimension of grid for
				2007	_Y) wavefront.
				2008	then Work-Group Id Z 1 32 bit work-group id in Z
				2009	(enable_sgpr_workgroup_id dimension of grid for
				2010	_Z) wavefront.
				2011	then Work-Group Info 1 {first_wave, 14’b0000,
				2012	(enable_sgpr_workgroup ordered_append_term[10:0],
				2013	_info) threadgroup_size_in_waves[5:0]}
				2014	then Scratch Wave Offset 1 32 bit byte offset from base
				2015	(enable_sgpr_private of scratch base of queue
				2016	_segment_wave_offset) executing the kernel
				2017	dispatch. Must be used as an
				2018	offset with Private
				2019	segment address when using
				2020	Scratch Segment Buffer. It
				2021	must be used to set up FLAT
				2022	SCRATCH for flat addressing
				2023	(see
				2024	:ref:`amdgpu-amdhsa-flat-scratch`).
				2025	========== ========================== ====== ==============================
				2026
				2027	The order of the VGPR registers is defined, but the compiler can specify which
				2028	ones are actually setup in the kernel descriptor using the ``enable_vgpr*`` bit
				2029	fields (see :ref:`amdgpu-amdhsa-kernel-descriptor`). The register numbers used
				2030	for enabled registers are dense starting at VGPR0: the first enabled register is
				2031	VGPR0, the next enabled register is VGPR1 etc.; disabled registers do not have a
				2032	VGPR number.
				2033
				2034	VGPR register initial state is defined in
				2035	:ref:`amdgpu-amdhsa-vgpr-register-set-up-order-table`.
				2036
				2037	.. table:: VGPR Register Set Up Order
				2038	:name: amdgpu-amdhsa-vgpr-register-set-up-order-table
				2039
				2040	========== ========================== ====== ==============================
				2041	VGPR Order Name Number Description
				2042	(kernel descriptor enable of
				2043	field) VGPRs
				2044	========== ========================== ====== ==============================
				2045	First Work-Item Id X 1 32 bit work item id in X
				2046	(Always initialized) dimension of work-group for
				2047	wavefront lane.
				2048	then Work-Item Id Y 1 32 bit work item id in Y
				2049	(enable_vgpr_workitem_id dimension of work-group for
				2050	> 0) wavefront lane.
				2051	then Work-Item Id Z 1 32 bit work item id in Z
				2052	(enable_vgpr_workitem_id dimension of work-group for
				2053	> 1) wavefront lane.
				2054	========== ========================== ====== ==============================
				2055
				2056	The setting of registers is is done by GPU CP/ADC/SPI hardware as follows:
				2057
				2058	1. SGPRs before the Work-Group Ids are set by CP using the 16 User Data
				2059	registers.
				2060	2. Work-group Id registers X, Y, Z are set by ADC which supports any
				2061	combination including none.
				2062	3. Scratch Wave Offset is set by SPI in a per wave basis which is why its value
				2063	cannot included with the flat scratch init value which is per queue.
				2064	4. The VGPRs are set by SPI which only supports specifying either (X), (X, Y)
				2065	or (X, Y, Z).
				2066
				2067	Flat Scratch register pair are adjacent SGRRs so they can be moved as a 64 bit
				2068	value to the hardware required SGPRn-3 and SGPRn-4 respectively.
				2069
				2070	The global segment can be accessed either using buffer instructions (GFX6 which
				2071	has V# 64 bit address support), flat instructions (GFX7-9), or global
				2072	instructions (GFX9).
				2073
				2074	If buffer operations are used then the compiler can generate a V# with the
				2075	following properties:
				2076
				2077	* base address of 0
				2078	* no swizzle
				2079	* ATC: 1 if IOMMU present (such as APU)
				2080	* ptr64: 1
				2081	* MTYPE set to support memory coherence that matches the runtime (such as CC for
				2082	APU and NC for dGPU).
				2083
				2084	.. _amdgpu-amdhsa-kernel-prolog:
				2085
				2086	Kernel Prolog
				2087	~~~~~~~~~~~~~
				2088
				2089	.. _amdgpu-amdhsa-m0:
				2090
				2091	M0
				2092	++
				2093
				2094	GFX6-GFX8
				2095	The M0 register must be initialized with a value at least the total LDS size
				2096	if the kernel may access LDS via DS or flat operations. Total LDS size is
				2097	available in dispatch packet. For M0, it is also possible to use maximum
				2098	possible value of LDS for given target (0x7FFF for GFX6 and 0xFFFF for
				2099	GFX7-GFX8).
				2100	GFX9
				2101	The M0 register is not used for range checking LDS accesses and so does not
				2102	need to be initialized in the prolog.
				2103
				2104	.. _amdgpu-amdhsa-flat-scratch:
				2105
				2106	Flat Scratch
				2107	++++++++++++
				2108
				2109	If the kernel may use flat operations to access scratch memory, the prolog code
				2110	must set up FLAT_SCRATCH register pair (FLAT_SCRATCH_LO/FLAT_SCRATCH_HI which
				2111	are in SGPRn-4/SGPRn-3). Initialization uses Flat Scratch Init and Scratch Wave
				2112	Offset SGPR registers (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`):
				2113
				2114	GFX6
				2115	Flat scratch is not supported.
				2116
				2117	GFX7-8
				2118	1. The low word of Flat Scratch Init is 32 bit byte offset from
				2119	``SH_HIDDEN_PRIVATE_BASE_VIMID`` to the base of scratch backing memory
				2120	being managed by SPI for the queue executing the kernel dispatch. This is
				2121	the same value used in the Scratch Segment Buffer V# base address. The
				2122	prolog must add the value of Scratch Wave Offset to get the wave's byte
				2123	scratch backing memory offset from ``SH_HIDDEN_PRIVATE_BASE_VIMID``. Since
				2124	FLAT_SCRATCH_LO is in units of 256 bytes, the offset must be right shifted
				2125	by 8 before moving into FLAT_SCRATCH_LO.
				2126	2. The second word of Flat Scratch Init is 32 bit byte size of a single
				2127	work-items scratch memory usage. This is directly loaded from the kernel
				2128	dispatch packet Private Segment Byte Size and rounded up to a multiple of
				2129	DWORD. Having CP load it once avoids loading it at the beginning of every
				2130	wavefront. The prolog must move it to FLAT_SCRATCH_LO for use as FLAT SCRATCH
				2131	SIZE.
				2132	GFX9
				2133	The Flat Scratch Init is the 64 bit address of the base of scratch backing
				2134	memory being managed by SPI for the queue executing the kernel dispatch. The
				2135	prolog must add the value of Scratch Wave Offset and moved to the FLAT_SCRATCH
				2136	pair for use as the flat scratch base in flat memory instructions.
				2137
				2138	.. _amdgpu-amdhsa-memory-model:
				2139
				2140	Memory Model
				2141	~~~~~~~~~~~~
				2142
				2143	This section describes the mapping of LLVM memory model onto AMDGPU machine code
				2144	(see :ref:`memmodel`). The implementation is WIP.
				2145
				2146	.. TODO
				2147	Update when implementation complete.
				2148
				2149	Support more relaxed OpenCL memory model to be controled by environment
				2150	component of target triple.
				2151
				2152	The AMDGPU backend supports the memory synchronization scopes specified in
				2153	:ref:`amdgpu-memory-scopes`.
				2154
				2155	The code sequences used to implement the memory model are defined in table
				2156	:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
				2157
				2158	The sequences specify the order of instructions that a single thread must
				2159	execute. The ``s_waitcnt`` and ``buffer_wbinvl1_vol`` are defined with respect
				2160	to other memory instructions executed by the same thread. This allows them to be
				2161	moved earlier or later which can allow them to be combined with other instances
				2162	of the same instruction, or hoisted/sunk out of loops to improve
				2163	performance. Only the instructions related to the memory model are given;
				2164	additional ``s_waitcnt`` instructions are required to ensure registers are
				2165	defined before being used. These may be able to be combined with the memory
				2166	model ``s_waitcnt`` instructions as described above.
				2167
				2168	The AMDGPU memory model supports both the HSA [HSA]_ memory model, and the
				2169	OpenCL [OpenCL]_ memory model. The HSA memory model uses a single happens-before
				2170	relation for all address spaces (see :ref:`amdgpu-address-spaces`). The OpenCL
				2171	memory model which has separate happens-before relations for the global and
				2172	local address spaces, and only a fence specifying both global and local address
				2173	space joins the relationships. Since the LLVM ``memfence`` instruction does not
				2174	allow an address space to be specified the OpenCL fence has to convervatively
				2175	assume both local and global address space was specified. However, optimizations
				2176	can often be done to eliminate the additional ``s_waitcnt``instructions when
				2177	there are no intervening corresponding ``ds/flat_load/store/atomic`` memory
				2178	instructions. The code sequences in the table indicate what can be omitted for
				2179	the OpenCL memory. The target triple environment is used to determine if the
				2180	source language is OpenCL (see :ref:`amdgpu-opencl`).
				2181
				2182	``ds/flat_load/store/atomic`` instructions to local memory are termed LDS
				2183	operations.
				2184
				2185	``buffer/global/flat_load/store/atomic`` instructions to global memory are
				2186	termed vector memory operations.
				2187
				2188	For GFX6-GFX9:
				2189
				2190	* Each agent has multiple compute units (CU).
				2191	* Each CU has multiple SIMDs that execute wavefronts.
				2192	* The wavefronts for a single work-group are executed in the same CU but may be
				2193	executed by different SIMDs.
				2194	* Each CU has a single LDS memory shared by the wavefronts of the work-groups
				2195	executing on it.
				2196	* All LDS operations of a CU are performed as wavefront wide operations in a
				2197	global order and involve no caching. Completion is reported to a wavefront in
				2198	execution order.
				2199	* The LDS memory has multiple request queues shared by the SIMDs of a
				2200	CU. Therefore, the LDS operations performed by different waves of a work-group
				2201	can be reordered relative to each other, which can result in reordering the
				2202	visibility of vector memory operations with respect to LDS operations of other
				2203	wavefronts in the same work-group. A ``s_waitcnt lgkmcnt(0)`` is required to
				2204	ensure synchonization between LDS operations and vector memory operations
				2205	between waves of a work-group, but not between operations performed by the
				2206	same wavefront.
				2207	* The vector memory operations are performed as wavefront wide operations and
				2208	completion is reported to a wavefront in execution order. The exception is
				2209	that for GFX7-9 ``flat_load/store/atomic`` instructions can report out of
				2210	vector memory order if they access LDS memory, and out of LDS operation order
				2211	if they access global memory.
				2212	* The vector memory operations access a vector L1 cache shared by all wavefronts
				2213	on a CU. Therefore, no special action is required for coherence between
				2214	wavefronts in the same work-group. A ``buffer_wbinvl1_vol`` is required for
				2215	coherence between waves executing in different work-groups as they may be
				2216	executing on different CUs.
				2217	* The scalar memory operations access a scalar L1 cache shared by all wavefronts
				2218	on a group of CUs. The scalar and vector L1 caches are not coherent. However,
				2219	scalar operations are used in a restricted way so do not impact the memory
				2220	model. See :ref:`amdgpu-amdhsa-memory-spaces`.
				2221	* The vector and scalar memory operations use an L2 cache shared by all CUs on
				2222	the same agent.
				2223	* The L2 cache has independent channels to service disjoint ranges of virtual
				2224	addresses.
				2225	* Each CU has a separate request queue per channel. Therefore, the vector and
				2226	scalar memory operations performed by waves executing in different work-groups
				2227	(which may be executing on different CUs) of an agent can be reordered
				2228	relative to each other. A ``s_waitcnt vmcnt(0)`` is required to ensure
				2229	synchonization between vector memory operations of different CUs. It ensures a
				2230	previous vector memory operation has completed before executing a subsequent
				2231	vector memory or LDS operation and so can be used to meet the requirements of
				2232	acquire and release.
				2233	* The L2 cache can be kept coherent with other agents on some targets, or ranges
				2234	of virtual addresses can be set up to bypass it to ensure system coherence.
				2235
				2236	Private address space uses ``buffer_load/store`` using the scratch V# (GFX6-8),
				2237	or ``scratch_load/store`` (GFX9). Since only a single thread is accessing the
				2238	memory, atomic memory orderings are not meaningful and all accesses are treated
				2239	as non-atomic.
				2240
				2241	Constant address space uses ``buffer/global_load`` instructions (or equivalent
				2242	scalar memory instructions). Since the constant address space contents do not
				2243	change during the execution of a kernel dispatch it is not legal to perform
				2244	stores, and atomic memory orderings are not meaningful and all access are
				2245	treated as non-atomic.
				2246
				2247	A memory synchronization scope wider than work-group is not meaningful for the
				2248	group (LDS) address space and is treated as work-group.
				2249
				2250	The memory model does not support the region address space which is treated as
				2251	non-atomic.
				2252
				2253	Acquire memory ordering is not meaningful on store atomic instructions and is
				2254	treated as non-atomic.
				2255
				2256	Release memory ordering is not meaningful on load atomic instructions and is
				2257	treated a non-atomic.
				2258
				2259	Acquire-release memory ordering is not meaningful on load or store atomic
				2260	instructions and is treated as acquire and release respectively.
				2261
				2262	AMDGPU backend only uses scalar memory operations to access memory that is
				2263	proven to not change during the execution of the kernel dispatch. This includes
				2264	constant address space and global address space for program scope const
				2265	variables. Therefore the kernel machine code does not have to maintain the
				2266	scalar L1 cache to ensure it is coherent with the vector L1 cache. The scalar
				2267	and vector L1 caches are invalidated between kernel dispatches by CP since
				2268	constant address space data may change between kernel dispatch executions. See
				2269	:ref:`amdgpu-amdhsa-memory-spaces`.
				2270
				2271	The one exeception is if scalar writes are used to spill SGPR registers. In this
				2272	case the AMDGPU backend ensures the memory location used to spill is never
				2273	accessed by vector memory operations at the same time. If scalar writes are used
				2274	then a ``s_dcache_wb`` is inserted before the ``s_endpgm`` and before a function
				2275	return since the locations may be used for vector memory instructions by a
				2276	future wave that uses the same scratch area, or a function call that creates a
				2277	frame at the same address, respectively. There is no need for a ``s_dcache_inv``
				2278	as all scalar writes are write-before-read in the same thread.
				2279
				2280	Scratch backing memory (which is used for the private address space) is accessed
				2281	with MTYPE NC_NV (non-coherenent non-volatile). Since the private address space
				2282	is only accessed by a single thread, and is always write-before-read,
				2283	there is never a need to invalidate these entries from the L1 cache. Hence all
				2284	cache invalidates are done as ``*_vol`` to only invalidate the volatile cache
				2285	lines.
				2286
				2287	On dGPU the kernarg backing memory is accessed as UC (uncached) to avoid needing
				2288	to invalidate the L2 cache. This also causes it to be treated as non-volatile
				2289	and so is not invalidated by ``*_vol``. On APU it is accessed as CC (cache
				2290	coherent) and so the L2 cache will coherent with the CPU and other agents.
				2291
				2292	.. table:: AMDHSA Memory Model Code Sequences GFX6-GFX9
				2293	:name: amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table
				2294
				2295	============ ============ ============== ========== =======================
				2296	LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
				2297	Ordering Sync Scope Address
				2298	Space
				2299	============ ============ ============== ========== =======================
				2300	Non-Atomic
				2301	---------------------------------------------------------------------------
				2302	load none none - global non-volatile
				2303	- generic 1. buffer/global/flat_load
				2304	volatile
				2305	1. buffer/global/flat_load
				2306	glc=1
				2307	load none none - local 1. ds_load
				2308	store none none - global 1. buffer/global/flat_store
				2309	- generic
				2310	store none none - local 1. ds_store
				2311	Unordered Atomic
				2312	---------------------------------------------------------------------------
				2313	load atomic unordered any any Same as non-atomic.
				2314	store atomic unordered any any Same as non-atomic.
				2315	atomicrmw unordered any any *Same as monotonic
				2316	atomic*.
				2317	Monotonic Atomic
				2318	---------------------------------------------------------------------------
				2319	load atomic monotonic - singlethread - global 1. buffer/global/flat_load
				2320	- wavefront - generic
				2321	- workgroup
				2322	load atomic monotonic - singlethread - local 1. ds_load
				2323	- wavefront
				2324	- workgroup
				2325	load atomic monotonic - agent - global 1. buffer/global/flat_load
				2326	- system - generic glc=1
				2327	store atomic monotonic - singlethread - global 1. buffer/global/flat_store
				2328	- wavefront - generic
				2329	- workgroup
				2330	- agent
				2331	- system
				2332	store atomic monotonic - singlethread - local 1. ds_store
				2333	- wavefront
				2334	- workgroup
				2335	atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
				2336	- wavefront - generic
				2337	- workgroup
				2338	- agent
				2339	- system
				2340	atomicrmw monotonic - singlethread - local 1. ds_atomic
				2341	- wavefront
				2342	- workgroup
				2343	Acquire Atomic
				2344	---------------------------------------------------------------------------
				2345	load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
				2346	- wavefront - local
				2347	- generic
				2348	load atomic acquire - workgroup - global 1. buffer/global_load
				2349	load atomic acquire - workgroup - local 1. ds/flat_load
				2350	- generic 2. s_waitcnt lgkmcnt(0)
				2351
				2352	- If OpenCL, omit
				2353	waitcnt.
				2354	- Must happen before
				2355	any following
				2356	global/generic
				2357	load/load
				2358	atomic/store/store
				2359	atomic/atomicrmw.
				2360	- Ensures any
				2361	following global
				2362	data read is no
				2363	older than the load
				2364	atomic value being
				2365	acquired.
				2366
				2367	load atomic acquire - agent - global 1. buffer/global_load
				2368	- system glc=1
				2369	2. s_waitcnt vmcnt(0)
				2370
				2371	- Must happen before
				2372	following
				2373	buffer_wbinvl1_vol.
				2374	- Ensures the load
				2375	has completed
				2376	before invalidating
				2377	the cache.
				2378
				2379	3. buffer_wbinvl1_vol
				2380
				2381	- Must happen before
				2382	any following
				2383	global/generic
				2384	load/load
				2385	atomic/atomicrmw.
				2386	- Ensures that
				2387	following
				2388	loads will not see
				2389	stale global data.
				2390
				2391	load atomic acquire - agent - generic 1. flat_load glc=1
				2392	- system 2. s_waitcnt vmcnt(0) &
				2393	lgkmcnt(0)
				2394
				2395	- If OpenCL omit
				2396	lgkmcnt(0).
				2397	- Must happen before
				2398	following
				2399	buffer_wbinvl1_vol.
				2400	- Ensures the flat_load
				2401	has completed
				2402	before invalidating
				2403	the cache.
				2404
				2405	3. buffer_wbinvl1_vol
				2406
				2407	- Must happen before
				2408	any following
				2409	global/generic
				2410	load/load
				2411	atomic/atomicrmw.
				2412	- Ensures that
				2413	following loads
				2414	will not see stale
				2415	global data.
				2416
				2417	atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
				2418	- wavefront - local
				2419	- generic
				2420	atomicrmw acquire - workgroup - global 1. buffer/global_atomic
				2421	atomicrmw acquire - workgroup - local 1. ds/flat_atomic
				2422	- generic 2. waitcnt lgkmcnt(0)
				2423
				2424	- If OpenCL, omit
				2425	waitcnt.
				2426	- Must happen before
				2427	any following
				2428	global/generic
				2429	load/load
				2430	atomic/store/store
				2431	atomic/atomicrmw.
				2432	- Ensures any
				2433	following global
				2434	data read is no
				2435	older than the
				2436	atomicrmw value
				2437	being acquired.
				2438
				2439	atomicrmw acquire - agent - global 1. buffer/global_atomic
				2440	- system 2. s_waitcnt vmcnt(0)
				2441
				2442	- Must happen before
				2443	following
				2444	buffer_wbinvl1_vol.
				2445	- Ensures the
				2446	atomicrmw has
				2447	completed before
				2448	invalidating the
				2449	cache.
				2450
				2451	3. buffer_wbinvl1_vol
				2452
				2453	- Must happen before
				2454	any following
				2455	global/generic
				2456	load/load
				2457	atomic/atomicrmw.
				2458	- Ensures that
				2459	following loads
				2460	will not see stale
				2461	global data.
				2462
				2463	atomicrmw acquire - agent - generic 1. flat_atomic
				2464	- system 2. s_waitcnt vmcnt(0) &
				2465	lgkmcnt(0)
				2466
				2467	- If OpenCL, omit
				2468	lgkmcnt(0).
				2469	- Must happen before
				2470	following
				2471	buffer_wbinvl1_vol.
				2472	- Ensures the
				2473	atomicrmw has
				2474	completed before
				2475	invalidating the
				2476	cache.
				2477
				2478	3. buffer_wbinvl1_vol
				2479
				2480	- Must happen before
				2481	any following
				2482	global/generic
				2483	load/load
				2484	atomic/atomicrmw.
				2485	- Ensures that
				2486	following loads
				2487	will not see stale
				2488	global data.
				2489
				2490	fence acquire - singlethread none none
				2491	- wavefront
				2492	fence acquire - workgroup none 1. s_waitcnt lgkmcnt(0)
				2493
				2494	- If OpenCL and
				2495	address space is
				2496	not generic, omit
				2497	waitcnt. However,
				2498	since LLVM
				2499	currently has no
				2500	address space on
				2501	the fence need to
				2502	conservatively
				2503	always generate. If
				2504	fence had an
				2505	address space then
				2506	set to address
				2507	space of OpenCL
				2508	fence flag, or to
				2509	generic if both
				2510	local and global
				2511	flags are
				2512	specified.
				2513	- Must happen after
				2514	any preceding
				2515	local/generic load
				2516	atomic/atomicrmw
				2517	with an equal or
				2518	wider sync scope
				2519	and memory ordering
				2520	stronger than
				2521	unordered (this is
				2522	termed the
				2523	fence-paired-atomic).
				2524	- Must happen before
				2525	any following
				2526	global/generic
				2527	load/load
				2528	atomic/store/store
				2529	atomic/atomicrmw.
				2530	- Ensures any
				2531	following global
				2532	data read is no
				2533	older than the
				2534	value read by the
				2535	fence-paired-atomic.
				2536
				2537	fence acquire - agent none 1. s_waitcnt vmcnt(0) &
				2538	- system lgkmcnt(0)
				2539
				2540	- If OpenCL and
				2541	address space is
				2542	not generic, omit
				2543	lgkmcnt(0).
				2544	However, since LLVM
				2545	currently has no
				2546	address space on
				2547	the fence need to
				2548	conservatively
				2549	always generate
				2550	(see comment for
				2551	previous fence).
Tony Tye	d9c251f	2017-06-07 00:08:35 +0000	[diff] [blame^]	2552	- Could be split into
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	2553	separate s_waitcnt
				2554	vmcnt(0) and
				2555	s_waitcnt
				2556	lgkmcnt(0) to allow
				2557	them to be
				2558	independently moved
				2559	according to the
				2560	following rules.
				2561	- s_waitcnt vmcnt(0)
				2562	must happen after
				2563	any preceding
				2564	global/generic load
				2565	atomic/atomicrmw
				2566	with an equal or
				2567	wider sync scope
				2568	and memory ordering
				2569	stronger than
				2570	unordered (this is
				2571	termed the
				2572	fence-paired-atomic).
				2573	- s_waitcnt lgkmcnt(0)
				2574	must happen after
				2575	any preceding
				2576	group/generic load
				2577	atomic/atomicrmw
				2578	with an equal or
				2579	wider sync scope
				2580	and memory ordering
				2581	stronger than
				2582	unordered (this is
				2583	termed the
				2584	fence-paired-atomic).
				2585	- Must happen before
				2586	the following
				2587	buffer_wbinvl1_vol.
				2588	- Ensures that the
				2589	fence-paired atomic
				2590	has completed
				2591	before invalidating
				2592	the
				2593	cache. Therefore
				2594	any following
				2595	locations read must
				2596	be no older than
				2597	the value read by
				2598	the
				2599	fence-paired-atomic.
				2600
				2601	2. buffer_wbinvl1_vol
				2602
				2603	- Must happen before
				2604	any following global/generic
				2605	load/load
				2606	atomic/store/store
				2607	atomic/atomicrmw.
				2608	- Ensures that
				2609	following loads
				2610	will not see stale
				2611	global data.
				2612
				2613	Release Atomic
				2614	---------------------------------------------------------------------------
				2615	store atomic release - singlethread - global 1. buffer/global/ds/flat_store
				2616	- wavefront - local
				2617	- generic
				2618	store atomic release - workgroup - global 1. s_waitcnt lgkmcnt(0)
				2619	- generic
				2620	- If OpenCL, omit
				2621	waitcnt.
				2622	- Must happen after
				2623	any preceding
				2624	local/generic
				2625	load/store/load
				2626	atomic/store
				2627	atomic/atomicrmw.
				2628	- Must happen before
				2629	the following
				2630	store.
				2631	- Ensures that all
				2632	memory operations
				2633	to local have
				2634	completed before
				2635	performing the
				2636	store that is being
				2637	released.
				2638
				2639	2. buffer/global/flat_store
				2640	store atomic release - workgroup - local 1. ds_store
				2641	store atomic release - agent - global 1. s_waitcnt vmcnt(0) &
				2642	- system - generic lgkmcnt(0)
				2643
				2644	- If OpenCL, omit
				2645	lgkmcnt(0).
				2646	- Could be split into
				2647	separate s_waitcnt
				2648	vmcnt(0) and
				2649	s_waitcnt
				2650	lgkmcnt(0) to allow
				2651	them to be
				2652	independently moved
				2653	according to the
				2654	following rules.
				2655	- s_waitcnt vmcnt(0)
				2656	must happen after
				2657	any preceding
				2658	global/generic
				2659	load/store/load
				2660	atomic/store
				2661	atomic/atomicrmw.
				2662	- s_waitcnt lgkmcnt(0)
				2663	must happen after
				2664	any preceding
				2665	local/generic
				2666	load/store/load
				2667	atomic/store
				2668	atomic/atomicrmw.
				2669	- Must happen before
				2670	the following
				2671	store.
				2672	- Ensures that all
				2673	memory operations
				2674	to global have
				2675	completed before
				2676	performing the
				2677	store that is being
				2678	released.
				2679
				2680	2. buffer/global/ds/flat_store
				2681	atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
				2682	- wavefront - local
				2683	- generic
				2684	atomicrmw release - workgroup - global 1. s_waitcnt lgkmcnt(0)
				2685	- generic
				2686	- If OpenCL, omit
				2687	waitcnt.
				2688	- Must happen after
				2689	any preceding
				2690	local/generic
				2691	load/store/load
				2692	atomic/store
				2693	atomic/atomicrmw.
				2694	- Must happen before
				2695	the following
				2696	atomicrmw.
				2697	- Ensures that all
				2698	memory operations
				2699	to local have
				2700	completed before
				2701	performing the
				2702	atomicrmw that is
				2703	being released.
				2704
				2705	2. buffer/global/flat_atomic
				2706	atomicrmw release - workgroup - local 1. ds_atomic
				2707	atomicrmw release - agent - global 1. s_waitcnt vmcnt(0) &
				2708	- system - generic lgkmcnt(0)
				2709
				2710	- If OpenCL, omit
				2711	lgkmcnt(0).
				2712	- Could be split into
				2713	separate s_waitcnt
				2714	vmcnt(0) and
				2715	s_waitcnt
				2716	lgkmcnt(0) to allow
				2717	them to be
				2718	independently moved
				2719	according to the
				2720	following rules.
				2721	- s_waitcnt vmcnt(0)
				2722	must happen after
				2723	any preceding
				2724	global/generic
				2725	load/store/load
				2726	atomic/store
				2727	atomic/atomicrmw.
				2728	- s_waitcnt lgkmcnt(0)
				2729	must happen after
				2730	any preceding
				2731	local/generic
				2732	load/store/load
				2733	atomic/store
				2734	atomic/atomicrmw.
				2735	- Must happen before
				2736	the following
				2737	atomicrmw.
				2738	- Ensures that all
				2739	memory operations
				2740	to global and local
				2741	have completed
				2742	before performing
				2743	the atomicrmw that
				2744	is being released.
				2745
				2746	2. buffer/global/ds/flat_atomic*
				2747	fence release - singlethread none none
				2748	- wavefront
				2749	fence release - workgroup none 1. s_waitcnt lgkmcnt(0)
				2750
				2751	- If OpenCL and
				2752	address space is
				2753	not generic, omit
				2754	waitcnt. However,
				2755	since LLVM
				2756	currently has no
				2757	address space on
				2758	the fence need to
				2759	conservatively
				2760	always generate
				2761	(see comment for
				2762	previous fence).
				2763	- Must happen after
				2764	any preceding
				2765	local/generic
				2766	load/load
				2767	atomic/store/store
				2768	atomic/atomicrmw.
				2769	- Must happen before
				2770	any following store
				2771	atomic/atomicrmw
				2772	with an equal or
				2773	wider sync scope
				2774	and memory ordering
				2775	stronger than
				2776	unordered (this is
				2777	termed the
				2778	fence-paired-atomic).
				2779	- Ensures that all
				2780	memory operations
				2781	to local have
				2782	completed before
				2783	performing the
				2784	following
				2785	fence-paired-atomic.
				2786
				2787	fence release - agent none 1. s_waitcnt vmcnt(0) &
				2788	- system lgkmcnt(0)
				2789
				2790	- If OpenCL and
				2791	address space is
				2792	not generic, omit
				2793	lgkmcnt(0).
				2794	However, since LLVM
				2795	currently has no
				2796	address space on
				2797	the fence need to
				2798	conservatively
				2799	always generate
				2800	(see comment for
				2801	previous fence).
				2802	- Could be split into
				2803	separate s_waitcnt
				2804	vmcnt(0) and
				2805	s_waitcnt
				2806	lgkmcnt(0) to allow
				2807	them to be
				2808	independently moved
				2809	according to the
				2810	following rules.
				2811	- s_waitcnt vmcnt(0)
				2812	must happen after
				2813	any preceding
				2814	global/generic
				2815	load/store/load
				2816	atomic/store
				2817	atomic/atomicrmw.
				2818	- s_waitcnt lgkmcnt(0)
				2819	must happen after
				2820	any preceding
				2821	local/generic
				2822	load/store/load
				2823	atomic/store
				2824	atomic/atomicrmw.
				2825	- Must happen before
				2826	any following store
				2827	atomic/atomicrmw
				2828	with an equal or
				2829	wider sync scope
				2830	and memory ordering
				2831	stronger than
				2832	unordered (this is
				2833	termed the
				2834	fence-paired-atomic).
				2835	- Ensures that all
				2836	memory operations
				2837	to global have
				2838	completed before
				2839	performing the
				2840	following
				2841	fence-paired-atomic.
				2842
				2843	Acquire-Release Atomic
				2844	---------------------------------------------------------------------------
				2845	atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
				2846	- wavefront - local
				2847	- generic
				2848	atomicrmw acq_rel - workgroup - global 1. s_waitcnt lgkmcnt(0)
				2849
				2850	- If OpenCL, omit
				2851	waitcnt.
				2852	- Must happen after
				2853	any preceding
				2854	local/generic
				2855	load/store/load
				2856	atomic/store
				2857	atomic/atomicrmw.
				2858	- Must happen before
				2859	the following
				2860	atomicrmw.
				2861	- Ensures that all
				2862	memory operations
				2863	to local have
				2864	completed before
				2865	performing the
				2866	atomicrmw that is
				2867	being released.
				2868
				2869	2. buffer/global_atomic
				2870	atomicrmw acq_rel - workgroup - local 1. ds_atomic
				2871	2. s_waitcnt lgkmcnt(0)
				2872
				2873	- If OpenCL, omit
				2874	waitcnt.
				2875	- Must happen before
				2876	any following
				2877	global/generic
				2878	load/load
				2879	atomic/store/store
				2880	atomic/atomicrmw.
				2881	- Ensures any
				2882	following global
				2883	data read is no
				2884	older than the load
				2885	atomic value being
				2886	acquired.
				2887
				2888	atomicrmw acq_rel - workgroup - generic 1. s_waitcnt lgkmcnt(0)
				2889
				2890	- If OpenCL, omit
				2891	waitcnt.
				2892	- Must happen after
				2893	any preceding
				2894	local/generic
				2895	load/store/load
				2896	atomic/store
				2897	atomic/atomicrmw.
				2898	- Must happen before
				2899	the following
				2900	atomicrmw.
				2901	- Ensures that all
				2902	memory operations
				2903	to local have
				2904	completed before
				2905	performing the
				2906	atomicrmw that is
				2907	being released.
				2908
				2909	2. flat_atomic
				2910	3. s_waitcnt lgkmcnt(0)
				2911
				2912	- If OpenCL, omit
				2913	waitcnt.
				2914	- Must happen before
				2915	any following
				2916	global/generic
				2917	load/load
				2918	atomic/store/store
				2919	atomic/atomicrmw.
				2920	- Ensures any
				2921	following global
				2922	data read is no
				2923	older than the load
				2924	atomic value being
				2925	acquired.
				2926	atomicrmw acq_rel - agent - global 1. s_waitcnt vmcnt(0) &
				2927	- system lgkmcnt(0)
				2928
				2929	- If OpenCL, omit
				2930	lgkmcnt(0).
				2931	- Could be split into
				2932	separate s_waitcnt
				2933	vmcnt(0) and
				2934	s_waitcnt
				2935	lgkmcnt(0) to allow
				2936	them to be
				2937	independently moved
				2938	according to the
				2939	following rules.
				2940	- s_waitcnt vmcnt(0)
				2941	must happen after
				2942	any preceding
				2943	global/generic
				2944	load/store/load
				2945	atomic/store
				2946	atomic/atomicrmw.
				2947	- s_waitcnt lgkmcnt(0)
				2948	must happen after
				2949	any preceding
				2950	local/generic
				2951	load/store/load
				2952	atomic/store
				2953	atomic/atomicrmw.
				2954	- Must happen before
				2955	the following
				2956	atomicrmw.
				2957	- Ensures that all
				2958	memory operations
				2959	to global have
				2960	completed before
				2961	performing the
				2962	atomicrmw that is
				2963	being released.
				2964
				2965	2. buffer/global_atomic
				2966	3. s_waitcnt vmcnt(0)
				2967
				2968	- Must happen before
				2969	following
				2970	buffer_wbinvl1_vol.
				2971	- Ensures the
				2972	atomicrmw has
				2973	completed before
				2974	invalidating the
				2975	cache.
				2976
				2977	4. buffer_wbinvl1_vol
				2978
				2979	- Must happen before
				2980	any following
				2981	global/generic
				2982	load/load
				2983	atomic/atomicrmw.
				2984	- Ensures that
				2985	following loads
				2986	will not see stale
				2987	global data.
				2988
				2989	atomicrmw acq_rel - agent - generic 1. s_waitcnt vmcnt(0) &
				2990	- system lgkmcnt(0)
				2991
				2992	- If OpenCL, omit
				2993	lgkmcnt(0).
				2994	- Could be split into
				2995	separate s_waitcnt
				2996	vmcnt(0) and
				2997	s_waitcnt
				2998	lgkmcnt(0) to allow
				2999	them to be
				3000	independently moved
				3001	according to the
				3002	following rules.
				3003	- s_waitcnt vmcnt(0)
				3004	must happen after
				3005	any preceding
				3006	global/generic
				3007	load/store/load
				3008	atomic/store
				3009	atomic/atomicrmw.
				3010	- s_waitcnt lgkmcnt(0)
				3011	must happen after
				3012	any preceding
				3013	local/generic
				3014	load/store/load
				3015	atomic/store
				3016	atomic/atomicrmw.
				3017	- Must happen before
				3018	the following
				3019	atomicrmw.
				3020	- Ensures that all
				3021	memory operations
				3022	to global have
				3023	completed before
				3024	performing the
				3025	atomicrmw that is
				3026	being released.
				3027
				3028	2. flat_atomic
				3029	3. s_waitcnt vmcnt(0) &
				3030	lgkmcnt(0)
				3031
				3032	- If OpenCL, omit
				3033	lgkmcnt(0).
				3034	- Must happen before
				3035	following
				3036	buffer_wbinvl1_vol.
				3037	- Ensures the
				3038	atomicrmw has
				3039	completed before
				3040	invalidating the
				3041	cache.
				3042
				3043	4. buffer_wbinvl1_vol
				3044
				3045	- Must happen before
				3046	any following
				3047	global/generic
				3048	load/load
				3049	atomic/atomicrmw.
				3050	- Ensures that
				3051	following loads
				3052	will not see stale
				3053	global data.
				3054
				3055	fence acq_rel - singlethread none none
				3056	- wavefront
				3057	fence acq_rel - workgroup none 1. s_waitcnt lgkmcnt(0)
				3058
				3059	- If OpenCL and
				3060	address space is
				3061	not generic, omit
				3062	waitcnt. However,
				3063	since LLVM
				3064	currently has no
				3065	address space on
				3066	the fence need to
				3067	conservatively
				3068	always generate
				3069	(see comment for
				3070	previous fence).
				3071	- Must happen after
				3072	any preceding
				3073	local/generic
				3074	load/load
				3075	atomic/store/store
				3076	atomic/atomicrmw.
				3077	- Must happen before
				3078	any following
				3079	global/generic
				3080	load/load
				3081	atomic/store/store
				3082	atomic/atomicrmw.
				3083	- Ensures that all
				3084	memory operations
				3085	to local have
				3086	completed before
				3087	performing any
				3088	following global
				3089	memory operations.
				3090	- Ensures that the
				3091	preceding
				3092	local/generic load
				3093	atomic/atomicrmw
				3094	with an equal or
				3095	wider sync scope
				3096	and memory ordering
				3097	stronger than
				3098	unordered (this is
				3099	termed the
				3100	fence-paired-atomic)
				3101	has completed
				3102	before following
				3103	global memory
				3104	operations. This
				3105	satisfies the
				3106	requirements of
				3107	acquire.
				3108	- Ensures that all
				3109	previous memory
				3110	operations have
				3111	completed before a
				3112	following
				3113	local/generic store
				3114	atomic/atomicrmw
				3115	with an equal or
				3116	wider sync scope
				3117	and memory ordering
				3118	stronger than
				3119	unordered (this is
				3120	termed the
				3121	fence-paired-atomic).
				3122	This satisfies the
				3123	requirements of
				3124	release.
				3125
				3126	fence acq_rel - agent none 1. s_waitcnt vmcnt(0) &
				3127	- system lgkmcnt(0)
				3128
				3129	- If OpenCL and
				3130	address space is
				3131	not generic, omit
				3132	lgkmcnt(0).
				3133	However, since LLVM
				3134	currently has no
				3135	address space on
				3136	the fence need to
				3137	conservatively
				3138	always generate
				3139	(see comment for
				3140	previous fence).
				3141	- Could be split into
				3142	separate s_waitcnt
				3143	vmcnt(0) and
				3144	s_waitcnt
				3145	lgkmcnt(0) to allow
				3146	them to be
				3147	independently moved
				3148	according to the
				3149	following rules.
				3150	- s_waitcnt vmcnt(0)
				3151	must happen after
				3152	any preceding
				3153	global/generic
				3154	load/store/load
				3155	atomic/store
				3156	atomic/atomicrmw.
				3157	- s_waitcnt lgkmcnt(0)
				3158	must happen after
				3159	any preceding
				3160	local/generic
				3161	load/store/load
				3162	atomic/store
				3163	atomic/atomicrmw.
				3164	- Must happen before
				3165	the following
				3166	buffer_wbinvl1_vol.
				3167	- Ensures that the
				3168	preceding
				3169	global/local/generic
				3170	load
				3171	atomic/atomicrmw
				3172	with an equal or
				3173	wider sync scope
				3174	and memory ordering
				3175	stronger than
				3176	unordered (this is
				3177	termed the
				3178	fence-paired-atomic)
				3179	has completed
				3180	before invalidating
				3181	the cache. This
				3182	satisfies the
				3183	requirements of
				3184	acquire.
				3185	- Ensures that all
				3186	previous memory
				3187	operations have
				3188	completed before a
				3189	following
				3190	global/local/generic
				3191	store
				3192	atomic/atomicrmw
				3193	with an equal or
				3194	wider sync scope
				3195	and memory ordering
				3196	stronger than
				3197	unordered (this is
				3198	termed the
				3199	fence-paired-atomic).
				3200	This satisfies the
				3201	requirements of
				3202	release.
				3203
				3204	2. buffer_wbinvl1_vol
				3205
				3206	- Must happen before
				3207	any following
				3208	global/generic
				3209	load/load
				3210	atomic/store/store
				3211	atomic/atomicrmw.
				3212	- Ensures that
				3213	following loads
				3214	will not see stale
				3215	global data. This
				3216	satisfies the
				3217	requirements of
				3218	acquire.
				3219
				3220	Sequential Consistent Atomic
				3221	---------------------------------------------------------------------------
				3222	load atomic seq_cst - singlethread - global *Same as corresponding
				3223	- wavefront - local load atomic acquire*.
				3224	- workgroup - generic
				3225	load atomic seq_cst - agent - global 1. s_waitcnt vmcnt(0)
				3226	- system - local
				3227	- generic - Must happen after
				3228	preceding
				3229	global/generic load
				3230	atomic/store
				3231	atomic/atomicrmw
				3232	with memory
				3233	ordering of seq_cst
				3234	and with equal or
				3235	wider sync scope.
				3236	(Note that seq_cst
				3237	fences have their
				3238	own s_waitcnt
				3239	vmcnt(0) and so do
				3240	not need to be
				3241	considered.)
				3242	- Ensures any
				3243	preceding
				3244	sequential
				3245	consistent global
				3246	memory instructions
				3247	have completed
				3248	before executing
				3249	this sequentially
				3250	consistent
				3251	instruction. This
				3252	prevents reordering
				3253	a seq_cst store
				3254	followed by a
				3255	seq_cst load (Note
				3256	that seq_cst is
				3257	stronger than
				3258	acquire/release as
				3259	the reordering of
				3260	load acquire
				3261	followed by a store
				3262	release is
				3263	prevented by the
				3264	waitcnt vmcnt(0) of
				3265	the release, but
				3266	there is nothing
				3267	preventing a store
				3268	release followed by
				3269	load acquire from
				3270	competing out of
				3271	order.)
				3272
				3273	2. *Following
				3274	instructions same as
				3275	corresponding load
				3276	atomic acquire*.
				3277
				3278	store atomic seq_cst - singlethread - global *Same as corresponding
				3279	- wavefront - local store atomic release*.
				3280	- workgroup - generic
				3281	store atomic seq_cst - agent - global *Same as corresponding
				3282	- system - generic store atomic release*.
				3283	atomicrmw seq_cst - singlethread - global *Same as corresponding
				3284	- wavefront - local atomicrmw acq_rel*.
				3285	- workgroup - generic
				3286	atomicrmw seq_cst - agent - global *Same as corresponding
				3287	- system - generic atomicrmw acq_rel*.
				3288	fence seq_cst - singlethread none *Same as corresponding
				3289	- wavefront fence acq_rel*.
				3290	- workgroup
				3291	- agent
				3292	- system
				3293	============ ============ ============== ========== =======================
				3294
				3295	The memory order also adds the single thread optimization constrains defined in
				3296	table
				3297	:ref:`amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table`.
				3298
				3299	.. table:: AMDHSA Memory Model Single Thread Optimization Constraints GFX6-GFX9
				3300	:name: amdgpu-amdhsa-memory-model-single-thread-optimization-constraints-gfx6-gfx9-table
				3301
				3302	============ ==============================================================
				3303	LLVM Memory Optimization Constraints
				3304	Ordering
				3305	============ ==============================================================
				3306	unordered none
				3307	monotonic none
				3308	acquire - If a load atomic/atomicrmw then no following load/load
				3309	atomic/store/ store atomic/atomicrmw/fence instruction can
				3310	be moved before the acquire.
				3311	- If a fence then same as load atomic, plus no preceding
				3312	associated fence-paired-atomic can be moved after the fence.
				3313	release - If a store atomic/atomicrmw then no preceeding load/load
				3314	atomic/store/ store atomic/atomicrmw/fence instruction can
				3315	be moved after the release.
				3316	- If a fence then same as store atomic, plus no following
				3317	associated fence-paired-atomic can be moved before the
				3318	fence.
				3319	acq_rel Same constraints as both acquire and release.
				3320	seq_cst - If a load atomic then same constraints as acquire, plus no
				3321	preceding sequentially consistent load atomic/store
				3322	atomic/atomicrmw/fence instruction can be moved after the
				3323	seq_cst.
				3324	- If a store atomic then the same constraints as release, plus
				3325	no following sequentially consistent load atomic/store
				3326	atomic/atomicrmw/fence instruction can be moved before the
				3327	seq_cst.
				3328	- If an atomicrmw/fence then same constraints as acq_rel.
				3329	============ ==============================================================
Konstantin Zhuravlyov	d5561e0	2017-03-08 23:55:44 +0000	[diff] [blame]	3330
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3331	Trap Handler ABI
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3332	~~~~~~~~~~~~~~~~
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3333
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3334	For code objects generated by AMDGPU backend for HSA [HSA]_ compatible runtimes
				3335	(such as ROCm [AMD-ROCm]_), the runtime installs a trap handler that supports
				3336	the ``s_trap`` instruction with the following usage:
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3337
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3338	.. table:: AMDGPU Trap Handler for AMDHSA OS
				3339	:name: amdgpu-trap-handler-for-amdhsa-os-table
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3340
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3341	=================== =============== =============== =======================
				3342	Usage Code Sequence Trap Handler Description
				3343	Inputs
				3344	=================== =============== =============== =======================
				3345	reserved ``s_trap 0x00`` Reserved by hardware.
				3346	``debugtrap(arg)`` ``s_trap 0x01`` ``SGPR0-1``: Reserved for HSA
				3347	``queue_ptr`` ``debugtrap``
				3348	``VGPR0``: intrinsic (not
				3349	``arg`` implemented).
				3350	``llvm.trap`` ``s_trap 0x02`` ``SGPR0-1``: Causes dispatch to be
				3351	``queue_ptr`` terminated and its
				3352	associated queue put
				3353	into the error state.
				3354	``llvm.debugtrap`` ``s_trap 0x03`` ``SGPR0-1``: If debugger not
				3355	``queue_ptr`` installed handled
				3356	same as ``llvm.trap``.
				3357	debugger breakpoint ``s_trap 0x07`` Reserved for debugger
				3358	breakpoints.
				3359	debugger ``s_trap 0x08`` Reserved for debugger.
				3360	debugger ``s_trap 0xfe`` Reserved for debugger.
				3361	debugger ``s_trap 0xff`` Reserved for debugger.
				3362	=================== =============== =============== =======================
Wei Ding	16289cf	2017-02-21 18:48:01 +0000	[diff] [blame]	3363
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3364	Non-AMDHSA
				3365	----------
				3366
				3367	Trap Handler ABI
				3368	~~~~~~~~~~~~~~~~
				3369
				3370	For code objects generated by AMDGPU backend for non-amdhsa OS, the runtime does
				3371	not install a trap handler. The ``llvm.trap`` and ``llvm.debugtrap``
				3372	instructions are handled as follows:
				3373
				3374	.. table:: AMDGPU Trap Handler for Non-AMDHSA OS
				3375	:name: amdgpu-trap-handler-for-non-amdhsa-os-table
				3376
				3377	=============== =============== ===========================================
				3378	Usage Code Sequence Description
				3379	=============== =============== ===========================================
				3380	llvm.trap s_endpgm Causes wavefront to be terminated.
				3381	llvm.debugtrap none Compiler warning given that there is no
				3382	trap handler installed.
				3383	=============== =============== ===========================================
				3384
				3385	Source Languages
				3386	================
				3387
				3388	.. _amdgpu-opencl:
				3389
				3390	OpenCL
				3391	------
				3392
				3393	When generating code for the OpenCL language the target triple environment
				3394	should be ``opencl`` or ``amdgizcl`` (see :ref:`amdgpu-target-triples`).
				3395
				3396	When the language is OpenCL the following differences occur:
				3397
				3398	1. The OpenCL memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3399	2. The AMDGPU backend adds additional arguments to the kernel.
				3400	3. Additional metadata is generated (:ref:`amdgpu-code-object-metadata`).
				3401
				3402	.. TODO
				3403	Specify what affect this has. Hidden arguments added. Additional metadata
				3404	generated.
				3405
				3406	.. _amdgpu-hcc:
				3407
				3408	HCC
				3409	---
				3410
				3411	When generating code for the OpenCL language the target triple environment
				3412	should be ``hcc`` (see :ref:`amdgpu-target-triples`).
				3413
				3414	When the language is OpenCL the following differences occur:
				3415
				3416	1. The HSA memory model is used (see :ref:`amdgpu-amdhsa-memory-model`).
				3417
				3418	.. TODO
				3419	Specify what affect this has.
Tom Stellard	3ec09e6	2016-04-06 01:29:19 +0000	[diff] [blame]	3420
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3421	Assembler
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3422	---------
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3423
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3424	AMDGPU backend has LLVM-MC based assembler which is currently in development.
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3425	It supports AMDGCN GFX6-GFX8.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3426
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3427	This section describes general syntax for instructions and operands. For more
				3428	information about instructions, their semantics and supported combinations of
				3429	operands, refer to one of instruction set architecture manuals
				3430	[AMD-Souther-Islands]_ [AMD-Sea-Islands]_ [AMD-Volcanic-Islands]_.
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3431
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3432	An instruction has the following syntax (register operands are normally
				3433	comma-separated while extra operands are space-separated):
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3434
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3435	<opcode> <register_operand0>, ... <extra_operand0> ...
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3436
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3437	Operands
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3438	~~~~~~~~
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3439
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3440	The following syntax for register operands is supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3441
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3442	* SGPR registers: s0, ... or s[0], ...
				3443	* VGPR registers: v0, ... or v[0], ...
				3444	* TTMP registers: ttmp0, ... or ttmp[0], ...
				3445	* Special registers: exec (exec_lo, exec_hi), vcc (vcc_lo, vcc_hi), flat_scratch (flat_scratch_lo, flat_scratch_hi)
				3446	* Special trap registers: tba (tba_lo, tba_hi), tma (tma_lo, tma_hi)
				3447	* Register pairs, quads, etc: s[2:3], v[10:11], ttmp[5:6], s[4:7], v[12:15], ttmp[4:7], s[8:15], ...
				3448	* Register lists: [s0, s1], [ttmp0, ttmp1, ttmp2, ttmp3]
				3449	* Register index expressions: v[2*2], s[1-1:2-1]
				3450	* 'off' indicates that an operand is not enabled
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3451
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3452	The following extra operands are supported:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3453
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3454	* offset, offset0, offset1
				3455	* idxen, offen bits
				3456	* glc, slc, tfe bits
				3457	* waitcnt: integer or combination of counter values
				3458	* VOP3 modifiers:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3459
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3460	- abs (\\| \\|), neg (\-)
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3461
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3462	* DPP modifiers:
				3463
				3464	- row_shl, row_shr, row_ror, row_rol
				3465	- row_mirror, row_half_mirror, row_bcast
				3466	- wave_shl, wave_shr, wave_ror, wave_rol, quad_perm
				3467	- row_mask, bank_mask, bound_ctrl
				3468
				3469	* SDWA modifiers:
				3470
				3471	- dst_sel, src0_sel, src1_sel (BYTE_N, WORD_M, DWORD)
				3472	- dst_unused (UNUSED_PAD, UNUSED_SEXT, UNUSED_PRESERVE)
				3473	- abs, neg, sext
				3474
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3475	Instruction Examples
				3476	~~~~~~~~~~~~~~~~~~~~
				3477
				3478	DS
				3479	~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3480
				3481	.. code-block:: nasm
				3482
				3483	ds_add_u32 v2, v4 offset:16
				3484	ds_write_src2_b64 v2 offset0:4 offset1:8
				3485	ds_cmpst_f32 v2, v4, v6
				3486	ds_min_rtn_f64 v[8:9], v2, v[4:5]
				3487
				3488
				3489	For full list of supported instructions, refer to "LDS/GDS instructions" in ISA Manual.
				3490
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3491	FLAT
				3492	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3493
				3494	.. code-block:: nasm
				3495
				3496	flat_load_dword v1, v[3:4]
				3497	flat_store_dwordx3 v[3:4], v[5:7]
				3498	flat_atomic_swap v1, v[3:4], v5 glc
				3499	flat_atomic_cmpswap v1, v[3:4], v[5:6] glc slc
				3500	flat_atomic_fmax_x2 v[1:2], v[3:4], v[5:6] glc
				3501
				3502	For full list of supported instructions, refer to "FLAT instructions" in ISA Manual.
				3503
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3504	MUBUF
				3505	+++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3506
				3507	.. code-block:: nasm
				3508
				3509	buffer_load_dword v1, off, s[4:7], s1
				3510	buffer_store_dwordx4 v[1:4], v2, ttmp[4:7], s1 offen offset:4 glc tfe
				3511	buffer_store_format_xy v[1:2], off, s[4:7], s1
				3512	buffer_wbinvl1
				3513	buffer_atomic_inc v1, v2, s[8:11], s4 idxen offset:4 slc
				3514
				3515	For full list of supported instructions, refer to "MUBUF Instructions" in ISA Manual.
				3516
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3517	SMRD/SMEM
				3518	+++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3519
				3520	.. code-block:: nasm
				3521
				3522	s_load_dword s1, s[2:3], 0xfc
				3523	s_load_dwordx8 s[8:15], s[2:3], s4
				3524	s_load_dwordx16 s[88:103], s[2:3], s4
				3525	s_dcache_inv_vol
				3526	s_memtime s[4:5]
				3527
				3528	For full list of supported instructions, refer to "Scalar Memory Operations" in ISA Manual.
				3529
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3530	SOP1
				3531	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3532
				3533	.. code-block:: nasm
				3534
				3535	s_mov_b32 s1, s2
				3536	s_mov_b64 s[0:1], 0x80000000
				3537	s_cmov_b32 s1, 200
				3538	s_wqm_b64 s[2:3], s[4:5]
				3539	s_bcnt0_i32_b64 s1, s[2:3]
				3540	s_swappc_b64 s[2:3], s[4:5]
				3541	s_cbranch_join s[4:5]
				3542
				3543	For full list of supported instructions, refer to "SOP1 Instructions" in ISA Manual.
				3544
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3545	SOP2
				3546	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3547
				3548	.. code-block:: nasm
				3549
				3550	s_add_u32 s1, s2, s3
				3551	s_and_b64 s[2:3], s[4:5], s[6:7]
				3552	s_cselect_b32 s1, s2, s3
				3553	s_andn2_b32 s2, s4, s6
				3554	s_lshr_b64 s[2:3], s[4:5], s6
				3555	s_ashr_i32 s2, s4, s6
				3556	s_bfm_b64 s[2:3], s4, s6
				3557	s_bfe_i64 s[2:3], s[4:5], s6
				3558	s_cbranch_g_fork s[4:5], s[6:7]
				3559
				3560	For full list of supported instructions, refer to "SOP2 Instructions" in ISA Manual.
				3561
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3562	SOPC
				3563	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3564
				3565	.. code-block:: nasm
				3566
				3567	s_cmp_eq_i32 s1, s2
				3568	s_bitcmp1_b32 s1, s2
				3569	s_bitcmp0_b64 s[2:3], s4
				3570	s_setvskip s3, s5
				3571
				3572	For full list of supported instructions, refer to "SOPC Instructions" in ISA Manual.
				3573
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3574	SOPP
				3575	++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3576
				3577	.. code-block:: nasm
				3578
				3579	s_barrier
				3580	s_nop 2
				3581	s_endpgm
				3582	s_waitcnt 0 ; Wait for all counters to be 0
				3583	s_waitcnt vmcnt(0) & expcnt(0) & lgkmcnt(0) ; Equivalent to above
				3584	s_waitcnt vmcnt(1) ; Wait for vmcnt counter to be 1.
				3585	s_sethalt 9
				3586	s_sleep 10
				3587	s_sendmsg 0x1
				3588	s_sendmsg sendmsg(MSG_INTERRUPT)
				3589	s_trap 1
				3590
				3591	For full list of supported instructions, refer to "SOPP Instructions" in ISA Manual.
				3592
				3593	Unless otherwise mentioned, little verification is performed on the operands
Sylvestre Ledru	e6ec441	2017-01-14 11:37:01 +0000	[diff] [blame]	3594	of SOPP Instructions, so it is up to the programmer to be familiar with the
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3595	range or acceptable values.
				3596
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3597	VALU
				3598	++++
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3599
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3600	For vector ALU instruction opcodes (VOP1, VOP2, VOP3, VOPC, VOP_DPP, VOP_SDWA),
				3601	the assembler will automatically use optimal encoding based on its operands.
				3602	To force specific encoding, one can add a suffix to the opcode of the instruction:
				3603
				3604	* _e32 for 32-bit VOP1/VOP2/VOPC
				3605	* _e64 for 64-bit VOP3
				3606	* _dpp for VOP_DPP
				3607	* _sdwa for VOP_SDWA
				3608
				3609	VOP1/VOP2/VOP3/VOPC examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3610
				3611	.. code-block:: nasm
				3612
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3613	v_mov_b32 v1, v2
				3614	v_mov_b32_e32 v1, v2
				3615	v_nop
				3616	v_cvt_f64_i32_e32 v[1:2], v2
				3617	v_floor_f32_e32 v1, v2
				3618	v_bfrev_b32_e32 v1, v2
				3619	v_add_f32_e32 v1, v2, v3
				3620	v_mul_i32_i24_e64 v1, v2, 3
				3621	v_mul_i32_i24_e32 v1, -3, v3
				3622	v_mul_i32_i24_e32 v1, -100, v3
				3623	v_addc_u32 v1, s[0:1], v2, v3, s[2:3]
				3624	v_max_f16_e32 v1, v2, v3
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3625
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3626	VOP_DPP examples:
Tom Stellard	45bb48e	2015-06-13 03:28:10 +0000	[diff] [blame]	3627
				3628	.. code-block:: nasm
				3629
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3630	v_mov_b32 v0, v0 quad_perm:[0,2,1,1]
				3631	v_sin_f32 v0, v0 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3632	v_mov_b32 v0, v0 wave_shl:1
				3633	v_mov_b32 v0, v0 row_mirror
				3634	v_mov_b32 v0, v0 row_bcast:31
				3635	v_mov_b32 v0, v0 quad_perm:[1,3,0,1] row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3636	v_add_f32 v0, v0, \|v0\| row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
				3637	v_max_f16 v1, v2, v3 row_shl:1 row_mask:0xa bank_mask:0x1 bound_ctrl:0
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3638
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3639	VOP_SDWA examples:
				3640
				3641	.. code-block:: nasm
				3642
				3643	v_mov_b32 v1, v2 dst_sel:BYTE_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD
				3644	v_min_u32 v200, v200, v1 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
				3645	v_sin_f32 v0, v0 dst_unused:UNUSED_PAD src0_sel:WORD_1
				3646	v_fract_f32 v0, \|v0\| dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
				3647	v_cmpx_le_u32 vcc, v1, v2 src0_sel:BYTE_2 src1_sel:WORD_0
				3648
				3649	For full list of supported instructions, refer to "Vector ALU instructions".
				3650
				3651	HSA Code Object Directives
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3652	~~~~~~~~~~~~~~~~~~~~~~~~~~
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3653
				3654	AMDGPU ABI defines auxiliary data in output code object. In assembly source,
				3655	one can specify them with assembler directives.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3656
				3657	.hsa_code_object_version major, minor
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3658	+++++++++++++++++++++++++++++++++++++
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3659
				3660	major and minor are integers that specify the version of the HSA code
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3661	object that will be generated by the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3662
				3663	.hsa_code_object_isa [major, minor, stepping, vendor, arch]
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3664	+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
				3665
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3666
				3667	major, minor, and stepping are all integers that describe the instruction
				3668	set architecture (ISA) version of the assembly program.
				3669
				3670	vendor and arch are quoted strings. vendor should always be equal to
				3671	"AMD" and arch should always be equal to "AMDGPU".
				3672
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3673	By default, the assembler will derive the ISA version, vendor, and arch
				3674	from the value of the -mcpu option that is passed to the assembler.
Tom Stellard	347ac79	2015-06-26 21:15:07 +0000	[diff] [blame]	3675
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3676	.amdgpu_hsa_kernel (name)
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3677	+++++++++++++++++++++++++
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3678
				3679	This directives specifies that the symbol with given name is a kernel entry point
				3680	(label) and the object should contain corresponding symbol of type STT_AMDGPU_HSA_KERNEL.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3681
				3682	.amd_kernel_code_t
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3683	++++++++++++++++++
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3684
				3685	This directive marks the beginning of a list of key / value pairs that are used
				3686	to specify the amd_kernel_code_t object that will be emitted by the assembler.
				3687	The list must be terminated by the .end_amd_kernel_code_t directive. For
				3688	any amd_kernel_code_t values that are unspecified a default value will be
				3689	used. The default value for all keys is 0, with the following exceptions:
				3690
				3691	- kernel_code_version_major defaults to 1.
				3692	- machine_kind defaults to 1.
				3693	- machine_version_major, machine_version_minor, and
				3694	machine_version_stepping are derived from the value of the -mcpu option
				3695	that is passed to the assembler.
				3696	- kernel_code_entry_byte_offset defaults to 256.
				3697	- wavefront_size defaults to 6.
				3698	- kernarg_segment_alignment, group_segment_alignment, and
				3699	private_segment_alignment default to 4. Note that alignments are specified
				3700	as a power of two, so a value of n means an alignment of 2^ n.
				3701
				3702	The .amd_kernel_code_t directive must be placed immediately after the
				3703	function label and before any instructions.
				3704
Nikolay Haustov	96a56bd	2016-09-20 09:04:51 +0000	[diff] [blame]	3705	For a full list of amd_kernel_code_t keys, refer to AMDGPU ABI document,
				3706	comments in lib/Target/AMDGPU/AmdKernelCodeT.h and test/CodeGen/AMDGPU/hsa.s.
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3707
				3708	Here is an example of a minimal amd_kernel_code_t specification:
				3709
Aaron Ballman	887ad0e	2016-07-19 17:46:55 +0000	[diff] [blame]	3710	.. code-block:: none
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3711
				3712	.hsa_code_object_version 1,0
				3713	.hsa_code_object_isa
				3714
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	3715	.hsatext
				3716	.globl hello_world
				3717	.p2align 8
				3718	.amdgpu_hsa_kernel hello_world
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3719
				3720	hello_world:
				3721
				3722	.amd_kernel_code_t
				3723	enable_sgpr_kernarg_segment_ptr = 1
				3724	is_ptr64 = 1
				3725	compute_pgm_rsrc1_vgprs = 0
				3726	compute_pgm_rsrc1_sgprs = 0
				3727	compute_pgm_rsrc2_user_sgpr = 2
				3728	kernarg_segment_byte_size = 8
				3729	wavefront_sgpr_count = 2
				3730	workitem_vgpr_count = 3
				3731	.end_amd_kernel_code_t
				3732
				3733	s_load_dwordx2 s[0:1], s[0:1] 0x0
				3734	v_mov_b32 v0, 3.14159
				3735	s_waitcnt lgkmcnt(0)
				3736	v_mov_b32 v1, s0
				3737	v_mov_b32 v2, s1
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	3738	flat_store_dword v[1:2], v0
Tom Stellard	ff7416b	2015-06-26 21:58:31 +0000	[diff] [blame]	3739	s_endpgm
Sylvestre Ledru	a7de982	2016-02-23 11:17:27 +0000	[diff] [blame]	3740	.Lfunc_end0:
Tom Stellard	b8a91bb	2016-02-22 18:36:00 +0000	[diff] [blame]	3741	.size hello_world, .Lfunc_end0-hello_world
Tony Tye	f16a45e	2017-06-06 20:31:59 +0000	[diff] [blame]	3742
				3743	Additional Documentation
				3744	========================
				3745
				3746	.. [AMD-R6xx] `AMD R6xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R600_Instruction_Set_Architecture.pdf>`__
				3747	.. [AMD-R7xx] `AMD R7xx shader ISA <http://developer.amd.com/wordpress/media/2012/10/R700-Family_Instruction_Set_Architecture.pdf>`__
				3748	.. [AMD-Evergreen] `AMD Evergreen shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_Evergreen-Family_Instruction_Set_Architecture.pdf>`__
				3749	.. [AMD-Cayman-Trinity] `AMD Cayman/Trinity shader ISA <http://developer.amd.com/wordpress/media/2012/10/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf>`__
				3750	.. [AMD-Souther-Islands] `AMD Southern Islands Series ISA <http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf>`__
				3751	.. [AMD-Sea-Islands] `AMD Sea Islands Series ISA <http://developer.amd.com/wordpress/media/2013/07/AMD_Sea_Islands_Instruction_Set_Architecture.pdf>`_
				3752	.. [AMD-Volcanic-Islands] `AMD GCN3 Instruction Set Architecture <http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf>`__
				3753	.. [AMD-OpenCL_Programming-Guide] `AMD Accelerated Parallel Processing OpenCL Programming Guide <http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf>`_
				3754	.. [AMD-APP-SDK] `AMD Accelerated Parallel Processing APP SDK Documentation <http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/documentation/>`__
				3755	.. [AMD-ROCm] `ROCm: Open Platform for Development, Discovery and Education Around GPU Computing <http://gpuopen.com/compute-product/rocm/>`__
				3756	.. [AMD-ROCm-github] `ROCm github <http://github.com/RadeonOpenCompute>`__
				3757	.. [HSA] `Heterogeneous System Architecture (HSA) Foundation <http://www.hsafoundation.com/>`__
				3758	.. [ELF] `Executable and Linkable Format (ELF) <http://www.sco.com/developers/gabi/>`__
				3759	.. [DWARF] `DWARF Debugging Information Format <http://dwarfstd.org/>`__
				3760	.. [YAML] `YAML Ain’t Markup Language (YAML™) Version 1.2 <http://www.yaml.org/spec/1.2/spec.html>`__
				3761	.. [OpenCL] `The OpenCL Specification Version 2.0 <http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf>`__
				3762	.. [HRF] `Heterogeneous-race-free Memory Models <http://benedictgaster.org/wp-content/uploads/2014/01/asplos269-FINAL.pdf>`__
				3763	.. [AMD-AMDGPU-Compute-Application-Binary-Interface] `AMDGPU Compute Application Binary Interface <https://github.com/RadeonOpenCompute/ROCm-ComputeABI-Doc/blob/master/AMDGPU-ABI.md>`__