=============================
User Guide for NVPTX Back-end
=============================

.. contents::
   :local:
   :depth: 3


Introduction
============

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

.. note::

   This document assumes a basic familiarity with CUDA and the PTX
   assembly language. Information about the CUDA Driver API and the PTX assembly
   language can be found in the `CUDA documentation
   <http://docs.nvidia.com/cuda/index.html>`_.


Conventions
===========

Marking Functions as Kernels
----------------------------

In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: llvm

  !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}

The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

  define float @my_fmad(float %x, float %y, float %z) {
    %mul = fmul float %x, %y
    %add = fadd float %mul, %z
    ret float %add
  }

  define void @my_kernel(float* %ptr) {
    %val = load float* %ptr
    %ret = call float @my_fmad(float %val, float %val, float %val)
    store float %ret, float* %ptr
    ret void
  }

  !nvvm.annotations = !{!1}
  !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.


Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:

============= ======================
Address Space Memory Space
============= ======================
0             Generic
1             Global
2             Internal Use
3             Shared
4             Constant
5             Local
============= ======================

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory.

.. code-block:: llvm

  @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code can
copy data to it by name with the CUDA Driver API.

Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0. Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.
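
Similarly, a variable can be placed in shared memory by using address space 3.
A minimal sketch (the name ``@scratch`` is hypothetical):

.. code-block:: llvm

  ; A per-block scratch buffer in shared memory (address space 3)
  @scratch = internal addrspace(3) global [32 x i32] zeroinitializer

Each thread block receives its own copy of such a variable, and it is only
accessible from device code.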


NVPTX Intrinsics
================

Address Space Conversion
------------------------

'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

  declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
  declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
  declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
  declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid generic address space
pointer.
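
For example, a pointer known to reside in global memory can be converted to a
generic pointer before being passed to code that operates on generic pointers.
A minimal sketch, assuming ``%gptr`` holds an ``i8 addrspace(1)*`` value:

.. code-block:: llvm

  ; Convert a global-space pointer to a generic pointer
  %generic = call i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)* %gptr)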


'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

  declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
  declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
  declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
  declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space. Note that these
intrinsics are only useful if the target address space of the pointer is
known. It is not legal to use address space conversion intrinsics to convert
a pointer from one non-generic address space to another non-generic address
space.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.
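
For example, a generic pointer known to point into global memory can be
converted back to a global-space pointer. A minimal sketch, assuming
``%generic`` holds an ``i8*`` value that points into the global space:

.. code-block:: llvm

  ; Convert a generic pointer back to a global-space pointer
  %gptr = call i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8* %generic)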


Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

Overview:
"""""""""

The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the kernel launch bounds. These registers
map in the following way to CUDA builtins:

============= =====================================
CUDA Builtin  PTX Special Register Intrinsic
============= =====================================
``threadIdx`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
``blockIdx``  ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
``blockDim``  ``@llvm.nvvm.read.ptx.sreg.ntid.*``
``gridDim``   ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
============= =====================================
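
As an illustration, the common global thread index computation along the X
dimension (``blockIdx.x * blockDim.x + threadIdx.x`` in CUDA) can be
expressed with these intrinsics:

.. code-block:: llvm

  ; index = ctaid.x * ntid.x + tid.x
  %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %base  = mul i32 %ctaid, %ntid
  %index = add i32 %base, %tid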


Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.barrier0()

Overview:
"""""""""

The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.
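
A typical use is to separate writes and reads of shared memory so that all
threads in a block observe each other's results. A minimal sketch, where
``%out`` and ``%in`` are hypothetical shared-memory element pointers:

.. code-block:: llvm

  ; Stage a value in shared memory, then synchronize before reading
  ; a value written by another thread in the same block
  store i32 %val, i32 addrspace(3)* %out
  call void @llvm.nvvm.barrier0()
  %neighbor = load i32 addrspace(3)* %in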


Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

.. code-block:: c++

  CUdevice device;
  CUcontext context;

  // Initialize the driver API
  cuInit(0);
  // Get a handle to the first compute device
  cuDeviceGet(&device, 0);
  // Create a compute device context
  cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

.. code-block:: c++

  CUmodule module;
  CUfunction function;

  // JIT compile a null-terminated PTX string
  cuModuleLoadData(&module, (void*)PTXString);

  // Get a handle to the "myfunction" kernel function
  cuModuleGetFunction(&function, module, "myfunction");
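
Once a function handle has been obtained, the kernel can be launched with
``cuLaunchKernel``. A minimal sketch, assuming ``devPtr`` is a ``CUdeviceptr``
previously allocated with ``cuMemAlloc``; the grid and block sizes are
arbitrary, and error checking is omitted for brevity:

.. code-block:: c++

  void *params[] = { &devPtr };

  // Launch a 16x1x1 grid of 128x1x1 thread blocks
  cuLaunchKernel(function,
                 16, 1, 1,    // grid dimensions
                 128, 1, 1,   // block dimensions
                 0,           // dynamic shared memory in bytes
                 NULL,        // stream (NULL = default stream)
                 params,      // kernel parameters
                 NULL);       // extra options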

For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.