Blame - llvm/docs/NVPTXUsage.rst - toolchain/llvm-project

blob: e1c401df877af50146194a16a7770f7f2e1e7b4c [file] [log] [blame]

Justin Holewinski	45df882	2013-03-30 16:41:14 +0000	[diff] [blame]	1	=============================
				2	User Guide for NVPTX Back-end
				3	=============================
				4
				5	.. contents::
				6	:local:
				7	:depth: 3
				8
				9
				10	Introduction
				11	============
				12
				13	To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
				14	along with a defined set of conventions used to represent GPU programming
				15	concepts. This document provides an overview of the general usage of the back-
				16	end, including a description of the conventions used and the set of accepted
				17	LLVM IR.
				18
				19	.. note::
				20
				21	This document assumes a basic familiarity with CUDA and the PTX
				22	assembly language. Information about the CUDA Driver API and the PTX assembly
				23	language can be found in the `CUDA documentation
				24	<http://docs.nvidia.com/cuda/index.html>`_.
				25
				26
				27
				28	Conventions
				29	===========
				30
				31	Marking Functions as Kernels
				32	----------------------------
				33
				34	In PTX, there are two types of functions: device functions, which are only
				35	callable by device code, and kernel functions, which are callable by host
				36	code. By default, the back-end will emit device functions. Metadata is used to
				37	declare a function as a kernel function. This metadata is attached to the
				38	``nvvm.annotations`` named metadata object, and has the following format:
				39
				40	.. code-block:: llvm
				41
				42	!0 = metadata !{<function-ref>, metadata !"kernel", i32 1}
				43
				44	The first parameter is a reference to the kernel function. The following
				45	example shows a kernel function calling a device function in LLVM IR. The
				46	function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.
				47
				48	.. code-block:: llvm
				49
				50	define float @my_fmad(float %x, float %y, float %z) {
				51	%mul = fmul float %x, %y
				52	%add = fadd float %mul, %z
				53	ret float %add
				54	}
				55
				56	define void @my_kernel(float* %ptr) {
				57	%val = load float* %ptr
				58	%ret = call float @my_fmad(float %val, float %val, float %val)
				59	store float %ret, float* %ptr
				60	ret void
				61	}
				62
				63	!nvvm.annotations = !{!1}
				64	!1 = metadata !{void (float) @my_kernel, metadata !"kernel", i32 1}
				65
				66	When compiled, the PTX kernel functions are callable by host-side code.
				67
				68
Justin Holewinski	fb9bd75	2013-11-15 13:02:10 +0000	[diff] [blame]	69	.. _address_spaces:
				70
Justin Holewinski	45df882	2013-03-30 16:41:14 +0000	[diff] [blame]	71	Address Spaces
				72	--------------
				73
				74	The NVPTX back-end uses the following address space mapping:
				75
				76	============= ======================
				77	Address Space Memory Space
				78	============= ======================
				79	0 Generic
				80	1 Global
				81	2 Internal Use
				82	3 Shared
				83	4 Constant
				84	5 Local
				85	============= ======================
				86
				87	Every global variable and pointer type is assigned to one of these address
				88	spaces, with 0 being the default address space. Intrinsics are provided which
				89	can be used to convert pointers between the generic and non-generic address
				90	spaces.
				91
				92	As an example, the following IR will define an array ``@g`` that resides in
				93	global device memory.
				94
				95	.. code-block:: llvm
				96
				97	@g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]
				98
				99	LLVM IR functions can read and write to this array, and host-side code can
				100	copy data to it by name with the CUDA Driver API.
				101
				102	Note that since address space 0 is the generic space, it is illegal to have
				103	global variables in address space 0. Address space 0 is the default address
				104	space in LLVM, so the ``addrspace(N)`` annotation is required for global
				105	variables.
				106
				107
Justin Holewinski	fb9bd75	2013-11-15 13:02:10 +0000	[diff] [blame]	108	Triples
				109	-------
				110
				111	The NVPTX target uses the module triple to select between 32/64-bit code
				112	generation and the driver-compiler interface to use. The triple architecture
				113	can be one of ``nvptx`` (32-bit PTX) or ``nvptx64`` (64-bit PTX). The
				114	operating system should be one of ``cuda`` or ``nvcl``, which determines the
				115	interface used by the generated code to communicate with the driver. Most
				116	users will want to use ``cuda`` as the operating system, which makes the
				117	generated PTX compatible with the CUDA Driver API.
				118
				119	Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``
				120
				121	Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``
				122
				123
				124
				125	.. _nvptx_intrinsics:
				126
Justin Holewinski	45df882	2013-03-30 16:41:14 +0000	[diff] [blame]	127	NVPTX Intrinsics
				128	================
				129
				130	Address Space Conversion
				131	------------------------
				132
				133	'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
				134	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				135
				136	Syntax:
				137	"""""""
				138
				139	These are overloaded intrinsics. You can use these on any pointer types.
				140
				141	.. code-block:: llvm
				142
				143	declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
				144	declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
				145	declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
				146	declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)
				147
				148	Overview:
				149	"""""""""
				150
				151	The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
				152	address space to a generic address space pointer.
				153
				154	Semantics:
				155	""""""""""
				156
				157	These intrinsics modify the pointer value to be a valid generic address space
				158	pointer.
				159
				160
				161	'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
				162	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				163
				164	Syntax:
				165	"""""""
				166
				167	These are overloaded intrinsics. You can use these on any pointer types.
				168
				169	.. code-block:: llvm
				170
				171	declare i8* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8 addrspace(1)*)
				172	declare i8* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8 addrspace(3)*)
				173	declare i8* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8 addrspace(4)*)
				174	declare i8* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8 addrspace(5)*)
				175
				176	Overview:
				177	"""""""""
				178
				179	The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
				180	address space to a pointer in the target address space. Note that these
				181	intrinsics are only useful if the address space of the target address space of
				182	the pointer is known. It is not legal to use address space conversion
				183	intrinsics to convert a pointer from one non-generic address space to another
				184	non-generic address space.
				185
				186	Semantics:
				187	""""""""""
				188
				189	These intrinsics modify the pointer value to be a valid pointer in the target
				190	non-generic address space.
				191
				192
				193	Reading PTX Special Registers
				194	-----------------------------
				195
				196	'``llvm.nvvm.read.ptx.sreg.*``'
				197	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				198
				199	Syntax:
				200	"""""""
				201
				202	.. code-block:: llvm
				203
				204	declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
				205	declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
				206	declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
				207	declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
				208	declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
				209	declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
				210	declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
				211	declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
				212	declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
				213	declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
				214	declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
				215	declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
				216	declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()
				217
				218	Overview:
				219	"""""""""
				220
				221	The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
				222	special registers, in particular the kernel launch bounds. These registers
				223	map in the following way to CUDA builtins:
				224
				225	============ =====================================
				226	CUDA Builtin PTX Special Register Intrinsic
				227	============ =====================================
				228	``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
				229	``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
				230	``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*``
				231	``gridDim`` ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
				232	============ =====================================
				233
				234
				235	Barriers
				236	--------
				237
				238	'``llvm.nvvm.barrier0``'
				239	^^^^^^^^^^^^^^^^^^^^^^^^^^^
				240
				241	Syntax:
				242	"""""""
				243
				244	.. code-block:: llvm
				245
				246	declare void @llvm.nvvm.barrier0()
				247
				248	Overview:
				249	"""""""""
				250
				251	The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
				252	instruction, equivalent to the ``__syncthreads()`` call in CUDA.
				253
				254
				255	Other Intrinsics
				256	----------------
				257
				258	For the full set of NVPTX intrinsics, please see the
				259	``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.
				260
				261
Justin Holewinski	fb9bd75	2013-11-15 13:02:10 +0000	[diff] [blame]	262	.. _libdevice:
				263
				264	Linking with Libdevice
				265	======================
				266
				267	The CUDA Toolkit comes with an LLVM bitcode library called ``libdevice`` that
				268	implements many common mathematical functions. This library can be used as a
				269	high-performance math library for any compilers using the LLVM NVPTX target.
				270	The library can be found under ``nvvm/libdevice/`` in the CUDA Toolkit and
				271	there is a separate version for each compute architecture.
				272
				273	For a list of all math functions implemented in libdevice, see
				274	`libdevice Users Guide <http://docs.nvidia.com/cuda/libdevice-users-guide/index.html>`_.
				275
Alp Toker	171b0c3	2013-12-20 00:33:39 +0000	[diff] [blame]	276	To accommodate various math-related compiler flags that can affect code
Justin Holewinski	fb9bd75	2013-11-15 13:02:10 +0000	[diff] [blame]	277	generation of libdevice code, the library code depends on a special LLVM IR
				278	pass (``NVVMReflect``) to handle conditional compilation within LLVM IR. This
				279	pass looks for calls to the ``@__nvvm_reflect`` function and replaces them
				280	with constants based on the defined reflection parameters. Such conditional
				281	code often follows a pattern:
				282
				283	.. code-block:: c++
				284
				285	float my_function(float a) {
				286	if (__nvvm_reflect("FASTMATH"))
				287	return my_function_fast(a);
				288	else
				289	return my_function_precise(a);
				290	}
				291
				292	The default value for all unspecified reflection parameters is zero.
				293
				294	The ``NVVMReflect`` pass should be executed early in the optimization
				295	pipeline, immediately after the link stage. The ``internalize`` pass is also
				296	recommended to remove unused math functions from the resulting PTX. For an
				297	input IR module ``module.bc``, the following compilation flow is recommended:
				298
				299	1. Save list of external functions in ``module.bc``
				300	2. Link ``module.bc`` with ``libdevice.compute_XX.YY.bc``
				301	3. Internalize all functions not in list from (1)
				302	4. Eliminate all unused internal functions
				303	5. Run ``NVVMReflect`` pass
				304	6. Run standard optimization pipeline
				305
				306	.. note::
				307
				308	``linkonce`` and ``linkonce_odr`` linkage types are not suitable for the
				309	libdevice functions. It is possible to link two IR modules that have been
				310	linked against libdevice using different reflection variables.
				311
				312	Since the ``NVVMReflect`` pass replaces conditionals with constants, it will
				313	often leave behind dead code of the form:
				314
				315	.. code-block:: llvm
				316
				317	entry:
				318	..
				319	br i1 true, label %foo, label %bar
				320	foo:
				321	..
				322	bar:
				323	; Dead code
				324	..
				325
				326	Therefore, it is recommended that ``NVVMReflect`` is executed early in the
				327	optimization pipeline before dead-code elimination.
				328
				329
				330	Reflection Parameters
				331	---------------------
				332
				333	The libdevice library currently uses the following reflection parameters to
				334	control code generation:
				335
				336	==================== ======================================================
				337	Flag Description
				338	==================== ======================================================
				339	``__CUDA_FTZ=[0,1]`` Use optimized code paths that flush subnormals to zero
				340	==================== ======================================================
				341
				342
				343	Invoking NVVMReflect
				344	--------------------
				345
				346	To ensure that all dead code caused by the reflection pass is eliminated, it
				347	is recommended that the reflection pass is executed early in the LLVM IR
				348	optimization pipeline. The pass takes an optional mapping of reflection
				349	parameter name to an integer value. This mapping can be specified as either a
				350	command-line option to ``opt`` or as an LLVM ``StringMap<int>`` object when
				351	programmatically creating a pass pipeline.
				352
				353	With ``opt``:
				354
				355	.. code-block:: text
				356
				357	# opt -nvvm-reflect -nvvm-reflect-list=<var>=<value>,<var>=<value> module.bc -o module.reflect.bc
				358
				359
				360	With programmatic pass pipeline:
				361
				362	.. code-block:: c++
				363
				364	extern ModulePass *llvm::createNVVMReflectPass(const StringMap<int>& Mapping);
				365
				366	StringMap<int> ReflectParams;
				367	ReflectParams["__CUDA_FTZ"] = 1;
				368	Passes.add(createNVVMReflectPass(ReflectParams));
				369
				370
				371
Justin Holewinski	45df882	2013-03-30 16:41:14 +0000	[diff] [blame]	372	Executing PTX
				373	=============
				374
				375	The most common way to execute PTX assembly on a GPU device is to use the CUDA
				376	Driver API. This API is a low-level interface to the GPU driver and allows for
				377	JIT compilation of PTX code to native GPU machine code.
				378
				379	Initializing the Driver API:
				380
				381	.. code-block:: c++
				382
				383	CUdevice device;
				384	CUcontext context;
				385
				386	// Initialize the driver API
				387	cuInit(0);
				388	// Get a handle to the first compute device
				389	cuDeviceGet(&device, 0);
				390	// Create a compute device context
				391	cuCtxCreate(&context, 0, device);
				392
				393	JIT compiling a PTX string to a device binary:
				394
				395	.. code-block:: c++
				396
				397	CUmodule module;
				398	CUfunction funcion;
				399
				400	// JIT compile a null-terminated PTX string
				401	cuModuleLoadData(&module, (void*)PTXString);
				402
				403	// Get a handle to the "myfunction" kernel function
				404	cuModuleGetFunction(&function, module, "myfunction");
				405
				406	For full examples of executing PTX assembly, please see the `CUDA Samples
				407	<https://developer.nvidia.com/cuda-downloads>`_ distribution.
Justin Holewinski	fb9bd75	2013-11-15 13:02:10 +0000	[diff] [blame]	408
				409
				410	Common Issues
				411	=============
				412
				413	ptxas complains of undefined function: __nvvm_reflect
				414	-----------------------------------------------------
				415
				416	When linking with libdevice, the ``NVVMReflect`` pass must be used. See
				417	:ref:`libdevice` for more information.
				418
				419
				420	Tutorial: A Simple Compute Kernel
				421	=================================
				422
				423	To start, let us take a look at a simple compute kernel written directly in
				424	LLVM IR. The kernel implements vector addition, where each thread computes one
				425	element of the output vector C from the input vectors A and B. To make this
				426	easier, we also assume that only a single CTA (thread block) will be launched,
				427	and that it will be one dimensional.
				428
				429
				430	The Kernel
				431	----------
				432
				433	.. code-block:: llvm
				434
				435	target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
				436	target triple = "nvptx64-nvidia-cuda"
				437
				438	; Intrinsic to read X component of thread ID
				439	declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
				440
				441	define void @kernel(float addrspace(1)* %A,
				442	float addrspace(1)* %B,
				443	float addrspace(1)* %C) {
				444	entry:
				445	; What is my ID?
				446	%id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
				447
				448	; Compute pointers into A, B, and C
				449	%ptrA = getelementptr float addrspace(1)* %A, i32 %id
				450	%ptrB = getelementptr float addrspace(1)* %B, i32 %id
				451	%ptrC = getelementptr float addrspace(1)* %C, i32 %id
				452
				453	; Read A, B
				454	%valA = load float addrspace(1)* %ptrA, align 4
				455	%valB = load float addrspace(1)* %ptrB, align 4
				456
				457	; Compute C = A + B
				458	%valC = fadd float %valA, %valB
				459
				460	; Store back to C
				461	store float %valC, float addrspace(1)* %ptrC, align 4
				462
				463	ret void
				464	}
				465
				466	!nvvm.annotations = !{!0}
				467	!0 = metadata !{void (float addrspace(1)*,
				468	float addrspace(1)*,
				469	float addrspace(1)) @kernel, metadata !"kernel", i32 1}
				470
				471
				472	We can use the LLVM ``llc`` tool to directly run the NVPTX code generator:
				473
				474	.. code-block:: text
				475
				476	# llc -mcpu=sm_20 kernel.ll -o kernel.ptx
				477
				478
				479	.. note::
				480
				481	If you want to generate 32-bit code, change ``p:64:64:64`` to ``p:32:32:32``
Justin Holewinski	f4919c5	2013-11-15 16:08:49 +0000	[diff] [blame]	482	in the module data layout string and use ``nvptx-nvidia-cuda`` as the
Justin Holewinski	fb9bd75	2013-11-15 13:02:10 +0000	[diff] [blame]	483	target triple.
				484
				485
				486	The output we get from ``llc`` (as of LLVM 3.4):
				487
				488	.. code-block:: text
				489
				490	//
				491	// Generated by LLVM NVPTX Back-End
				492	//
				493
				494	.version 3.1
				495	.target sm_20
				496	.address_size 64
				497
				498	// .globl kernel
				499	// @kernel
				500	.visible .entry kernel(
				501	.param .u64 kernel_param_0,
				502	.param .u64 kernel_param_1,
				503	.param .u64 kernel_param_2
				504	)
				505	{
				506	.reg .f32 %f<4>;
				507	.reg .s32 %r<2>;
				508	.reg .s64 %rl<8>;
				509
				510	// BB#0: // %entry
				511	ld.param.u64 %rl1, [kernel_param_0];
				512	mov.u32 %r1, %tid.x;
				513	mul.wide.s32 %rl2, %r1, 4;
				514	add.s64 %rl3, %rl1, %rl2;
				515	ld.param.u64 %rl4, [kernel_param_1];
				516	add.s64 %rl5, %rl4, %rl2;
				517	ld.param.u64 %rl6, [kernel_param_2];
				518	add.s64 %rl7, %rl6, %rl2;
				519	ld.global.f32 %f1, [%rl3];
				520	ld.global.f32 %f2, [%rl5];
				521	add.f32 %f3, %f1, %f2;
				522	st.global.f32 [%rl7], %f3;
				523	ret;
				524	}
				525
				526
				527	Dissecting the Kernel
				528	---------------------
				529
				530	Now let us dissect the LLVM IR that makes up this kernel.
				531
				532	Data Layout
				533	^^^^^^^^^^^
				534
				535	The data layout string determines the size in bits of common data types, their
				536	ABI alignment, and their storage size. For NVPTX, you should use one of the
				537	following:
				538
				539	32-bit PTX:
				540
				541	.. code-block:: llvm
				542
				543	target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
				544
				545	64-bit PTX:
				546
				547	.. code-block:: llvm
				548
				549	target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
				550
				551
				552	Target Intrinsics
				553	^^^^^^^^^^^^^^^^^
				554
				555	In this example, we use the ``@llvm.nvvm.read.ptx.sreg.tid.x`` intrinsic to
				556	read the X component of the current thread's ID, which corresponds to a read
				557	of register ``%tid.x`` in PTX. The NVPTX back-end supports a large set of
				558	intrinsics. A short list is shown below; please see
				559	``include/llvm/IR/IntrinsicsNVVM.td`` for the full list.
				560
				561
				562	================================================ ====================
				563	Intrinsic CUDA Equivalent
				564	================================================ ====================
				565	``i32 @llvm.nvvm.read.ptx.sreg.tid.{x,y,z}`` threadIdx.{x,y,z}
				566	``i32 @llvm.nvvm.read.ptx.sreg.ctaid.{x,y,z}`` blockIdx.{x,y,z}
				567	``i32 @llvm.nvvm.read.ptx.sreg.ntid.{x,y,z}`` blockDim.{x,y,z}
				568	``i32 @llvm.nvvm.read.ptx.sreg.nctaid.{x,y,z}`` gridDim.{x,y,z}
				569	``void @llvm.cuda.syncthreads()`` __syncthreads()
				570	================================================ ====================
				571
				572
				573	Address Spaces
				574	^^^^^^^^^^^^^^
				575
				576	You may have noticed that all of the pointer types in the LLVM IR example had
				577	an explicit address space specifier. What is address space 1? NVIDIA GPU
				578	devices (generally) have four types of memory:
				579
				580	- Global: Large, off-chip memory
				581	- Shared: Small, on-chip memory shared among all threads in a CTA
				582	- Local: Per-thread, private memory
				583	- Constant: Read-only memory shared across all threads
				584
				585	These different types of memory are represented in LLVM IR as address spaces.
				586	There is also a fifth address space used by the NVPTX code generator that
				587	corresponds to the "generic" address space. This address space can represent
				588	addresses in any other address space (with a few exceptions). This allows
				589	users to write IR functions that can load/store memory using the same
				590	instructions. Intrinsics are provided to convert pointers between the generic
				591	and non-generic address spaces.
				592
				593	See :ref:`address_spaces` and :ref:`nvptx_intrinsics` for more information.
				594
				595
				596	Kernel Metadata
				597	^^^^^^^^^^^^^^^
				598
				599	In PTX, a function can be either a `kernel` function (callable from the host
				600	program), or a `device` function (callable only from GPU code). You can think
				601	of `kernel` functions as entry-points in the GPU program. To mark an LLVM IR
				602	function as a `kernel` function, we make use of special LLVM metadata. The
				603	NVPTX back-end will look for a named metadata node called
				604	``nvvm.annotations``. This named metadata must contain a list of metadata that
				605	describe the IR. For our purposes, we need to declare a metadata node that
				606	assigns the "kernel" attribute to the LLVM IR function that should be emitted
				607	as a PTX `kernel` function. These metadata nodes take the form:
				608
				609	.. code-block:: text
				610
				611	metadata !{<function ref>, metadata !"kernel", i32 1}
				612
				613	For the previous example, we have:
				614
				615	.. code-block:: llvm
				616
				617	!nvvm.annotations = !{!0}
				618	!0 = metadata !{void (float addrspace(1)*,
				619	float addrspace(1)*,
				620	float addrspace(1)) @kernel, metadata !"kernel", i32 1}
				621
				622	Here, we have a single metadata declaration in ``nvvm.annotations``. This
				623	metadata annotates our ``@kernel`` function with the ``kernel`` attribute.
				624
				625
				626	Running the Kernel
				627	------------------
				628
				629	Generating PTX from LLVM IR is all well and good, but how do we execute it on
				630	a real GPU device? The CUDA Driver API provides a convenient mechanism for
				631	loading and JIT compiling PTX to a native GPU device, and launching a kernel.
				632	The API is similar to OpenCL. A simple example showing how to load and
				633	execute our vector addition code is shown below. Note that for brevity this
				634	code does not perform much error checking!
				635
				636	.. note::
				637
				638	You can also use the ``ptxas`` tool provided by the CUDA Toolkit to offline
				639	compile PTX to machine code (SASS) for a specific GPU architecture. Such
				640	binaries can be loaded by the CUDA Driver API in the same way as PTX. This
				641	can be useful for reducing startup time by precompiling the PTX kernels.
				642
				643
				644	.. code-block:: c++
				645
				646	#include <iostream>
				647	#include <fstream>
				648	#include <cassert>
				649	#include "cuda.h"
				650
				651
				652	void checkCudaErrors(CUresult err) {
				653	assert(err == CUDA_SUCCESS);
				654	}
				655
				656	/// main - Program entry point
				657	int main(int argc, char **argv) {
				658	CUdevice device;
				659	CUmodule cudaModule;
				660	CUcontext context;
				661	CUfunction function;
				662	CUlinkState linker;
				663	int devCount;
				664
				665	// CUDA initialization
				666	checkCudaErrors(cuInit(0));
				667	checkCudaErrors(cuDeviceGetCount(&devCount));
				668	checkCudaErrors(cuDeviceGet(&device, 0));
				669
				670	char name[128];
				671	checkCudaErrors(cuDeviceGetName(name, 128, device));
				672	std::cout << "Using CUDA Device [0]: " << name << "\n";
				673
				674	int devMajor, devMinor;
				675	checkCudaErrors(cuDeviceComputeCapability(&devMajor, &devMinor, device));
				676	std::cout << "Device Compute Capability: "
				677	<< devMajor << "." << devMinor << "\n";
				678	if (devMajor < 2) {
				679	std::cerr << "ERROR: Device 0 is not SM 2.0 or greater\n";
				680	return 1;
				681	}
				682
				683	std::ifstream t("kernel.ptx");
				684	if (!t.is_open()) {
				685	std::cerr << "kernel.ptx not found\n";
				686	return 1;
				687	}
				688	std::string str((std::istreambuf_iterator<char>(t)),
				689	std::istreambuf_iterator<char>());
				690
				691	// Create driver context
				692	checkCudaErrors(cuCtxCreate(&context, 0, device));
				693
				694	// Create module for object
				695	checkCudaErrors(cuModuleLoadDataEx(&cudaModule, str.c_str(), 0, 0, 0));
				696
				697	// Get kernel function
				698	checkCudaErrors(cuModuleGetFunction(&function, cudaModule, "kernel"));
				699
				700	// Device data
				701	CUdeviceptr devBufferA;
				702	CUdeviceptr devBufferB;
				703	CUdeviceptr devBufferC;
				704
				705	checkCudaErrors(cuMemAlloc(&devBufferA, sizeof(float)*16));
				706	checkCudaErrors(cuMemAlloc(&devBufferB, sizeof(float)*16));
				707	checkCudaErrors(cuMemAlloc(&devBufferC, sizeof(float)*16));
				708
				709	float* hostA = new float[16];
				710	float* hostB = new float[16];
				711	float* hostC = new float[16];
				712
				713	// Populate input
				714	for (unsigned i = 0; i != 16; ++i) {
				715	hostA[i] = (float)i;
				716	hostB[i] = (float)(2*i);
				717	hostC[i] = 0.0f;
				718	}
				719
				720	checkCudaErrors(cuMemcpyHtoD(devBufferA, &hostA[0], sizeof(float)*16));
				721	checkCudaErrors(cuMemcpyHtoD(devBufferB, &hostB[0], sizeof(float)*16));
				722
				723
				724	unsigned blockSizeX = 16;
				725	unsigned blockSizeY = 1;
				726	unsigned blockSizeZ = 1;
				727	unsigned gridSizeX = 1;
				728	unsigned gridSizeY = 1;
				729	unsigned gridSizeZ = 1;
				730
				731	// Kernel parameters
				732	void *KernelParams[] = { &devBufferA, &devBufferB, &devBufferC };
				733
				734	std::cout << "Launching kernel\n";
				735
				736	// Kernel launch
				737	checkCudaErrors(cuLaunchKernel(function, gridSizeX, gridSizeY, gridSizeZ,
				738	blockSizeX, blockSizeY, blockSizeZ,
				739	0, NULL, KernelParams, NULL));
				740
				741	// Retrieve device data
				742	checkCudaErrors(cuMemcpyDtoH(&hostC[0], devBufferC, sizeof(float)*16));
				743
				744
				745	std::cout << "Results:\n";
				746	for (unsigned i = 0; i != 16; ++i) {
				747	std::cout << hostA[i] << " + " << hostB[i] << " = " << hostC[i] << "\n";
				748	}
				749
				750
				751	// Clean up after ourselves
				752	delete [] hostA;
				753	delete [] hostB;
				754	delete [] hostC;
				755
				756	// Clean-up
				757	checkCudaErrors(cuMemFree(devBufferA));
				758	checkCudaErrors(cuMemFree(devBufferB));
				759	checkCudaErrors(cuMemFree(devBufferC));
				760	checkCudaErrors(cuModuleUnload(cudaModule));
				761	checkCudaErrors(cuCtxDestroy(context));
				762
				763	return 0;
				764	}
				765
				766
				767	You will need to link with the CUDA driver and specify the path to cuda.h.
				768
				769	.. code-block:: text
				770
				771	# clang++ sample.cpp -o sample -O2 -g -I/usr/local/cuda-5.5/include -lcuda
				772
				773	We don't need to specify a path to ``libcuda.so`` since this is installed in a
				774	system location by the driver, not the CUDA toolkit.
				775
				776	If everything goes as planned, you should see the following output when
				777	running the compiled program:
				778
				779	.. code-block:: text
				780
				781	Using CUDA Device [0]: GeForce GTX 680
				782	Device Compute Capability: 3.0
				783	Launching kernel
				784	Results:
				785	0 + 0 = 0
				786	1 + 2 = 3
				787	2 + 4 = 6
				788	3 + 6 = 9
				789	4 + 8 = 12
				790	5 + 10 = 15
				791	6 + 12 = 18
				792	7 + 14 = 21
				793	8 + 16 = 24
				794	9 + 18 = 27
				795	10 + 20 = 30
				796	11 + 22 = 33
				797	12 + 24 = 36
				798	13 + 26 = 39
				799	14 + 28 = 42
				800	15 + 30 = 45
				801
				802	.. note::
				803
				804	You will likely see a different device identifier based on your hardware
				805
				806
				807	Tutorial: Linking with Libdevice
				808	================================
				809
				810	In this tutorial, we show a simple example of linking LLVM IR with the
				811	libdevice library. We will use the same kernel as the previous tutorial,
				812	except that we will compute ``C = pow(A, B)`` instead of ``C = A + B``.
				813	Libdevice provides an ``__nv_powf`` function that we will use.
				814
				815	.. code-block:: llvm
				816
				817	target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
				818	target triple = "nvptx64-nvidia-cuda"
				819
				820	; Intrinsic to read X component of thread ID
				821	declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
				822	; libdevice function
				823	declare float @__nv_powf(float, float)
				824
				825	define void @kernel(float addrspace(1)* %A,
				826	float addrspace(1)* %B,
				827	float addrspace(1)* %C) {
				828	entry:
				829	; What is my ID?
				830	%id = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() readnone nounwind
				831
				832	; Compute pointers into A, B, and C
				833	%ptrA = getelementptr float addrspace(1)* %A, i32 %id
				834	%ptrB = getelementptr float addrspace(1)* %B, i32 %id
				835	%ptrC = getelementptr float addrspace(1)* %C, i32 %id
				836
				837	; Read A, B
				838	%valA = load float addrspace(1)* %ptrA, align 4
				839	%valB = load float addrspace(1)* %ptrB, align 4
				840
				841	; Compute C = pow(A, B)
Eli Bendersky	b512b96	2013-12-17 14:14:15 +0000	[diff] [blame]	842	%valC = call float @__nv_powf(float %valA, float %valB)
Justin Holewinski	fb9bd75	2013-11-15 13:02:10 +0000	[diff] [blame]	843
				844	; Store back to C
				845	store float %valC, float addrspace(1)* %ptrC, align 4
				846
				847	ret void
				848	}
				849
				850	!nvvm.annotations = !{!0}
				851	!0 = metadata !{void (float addrspace(1)*,
				852	float addrspace(1)*,
Eli Bendersky	b512b96	2013-12-17 14:14:15 +0000	[diff] [blame]	853	float addrspace(1)) @kernel, metadata !"kernel", i32 1}
Justin Holewinski	fb9bd75	2013-11-15 13:02:10 +0000	[diff] [blame]	854
				855
				856	To compile this kernel, we perform the following steps:
				857
				858	1. Link with libdevice
				859	2. Internalize all but the public kernel function
				860	3. Run ``NVVMReflect`` and set ``__CUDA_FTZ`` to 0
				861	4. Optimize the linked module
				862	5. Codegen the module
				863
				864
				865	These steps can be performed by the LLVM ``llvm-link``, ``opt``, and ``llc``
				866	tools. In a complete compiler, these steps can also be performed entirely
				867	programmatically by setting up an appropriate pass configuration (see
				868	:ref:`libdevice`).
				869
				870	.. code-block:: text
				871
				872	# llvm-link t2.bc libdevice.compute_20.10.bc -o t2.linked.bc
				873	# opt -internalize -internalize-public-api-list=kernel -nvvm-reflect-list=__CUDA_FTZ=0 -nvvm-reflect -O3 t2.linked.bc -o t2.opt.bc
				874	# llc -mcpu=sm_20 t2.opt.bc -o t2.ptx
				875
				876	.. note::
				877
				878	The ``-nvvm-reflect-list=_CUDA_FTZ=0`` is not strictly required, as any
				879	undefined variables will default to zero. It is shown here for evaluation
				880	purposes.
				881
				882
				883	This gives us the following PTX (excerpt):
				884
				885	.. code-block:: text
				886
				887	//
				888	// Generated by LLVM NVPTX Back-End
				889	//
				890
				891	.version 3.1
				892	.target sm_20
				893	.address_size 64
				894
				895	// .globl kernel
				896	// @kernel
				897	.visible .entry kernel(
				898	.param .u64 kernel_param_0,
				899	.param .u64 kernel_param_1,
				900	.param .u64 kernel_param_2
				901	)
				902	{
				903	.reg .pred %p<30>;
				904	.reg .f32 %f<111>;
				905	.reg .s32 %r<21>;
				906	.reg .s64 %rl<8>;
				907
				908	// BB#0: // %entry
				909	ld.param.u64 %rl2, [kernel_param_0];
				910	mov.u32 %r3, %tid.x;
				911	ld.param.u64 %rl3, [kernel_param_1];
				912	mul.wide.s32 %rl4, %r3, 4;
				913	add.s64 %rl5, %rl2, %rl4;
				914	ld.param.u64 %rl6, [kernel_param_2];
				915	add.s64 %rl7, %rl3, %rl4;
				916	add.s64 %rl1, %rl6, %rl4;
				917	ld.global.f32 %f1, [%rl5];
				918	ld.global.f32 %f2, [%rl7];
				919	setp.eq.f32 %p1, %f1, 0f3F800000;
				920	setp.eq.f32 %p2, %f2, 0f00000000;
				921	or.pred %p3, %p1, %p2;
				922	@%p3 bra BB0_1;
				923	bra.uni BB0_2;
				924	BB0_1:
				925	mov.f32 %f110, 0f3F800000;
				926	st.global.f32 [%rl1], %f110;
				927	ret;
				928	BB0_2: // %__nv_isnanf.exit.i
				929	abs.f32 %f4, %f1;
				930	setp.gtu.f32 %p4, %f4, 0f7F800000;
				931	@%p4 bra BB0_4;
				932	// BB#3: // %__nv_isnanf.exit5.i
				933	abs.f32 %f5, %f2;
				934	setp.le.f32 %p5, %f5, 0f7F800000;
				935	@%p5 bra BB0_5;
				936	BB0_4: // %.critedge1.i
				937	add.f32 %f110, %f1, %f2;
				938	st.global.f32 [%rl1], %f110;
				939	ret;
				940	BB0_5: // %__nv_isinff.exit.i
				941
				942	...
				943
				944	BB0_26: // %__nv_truncf.exit.i.i.i.i.i
				945	mul.f32 %f90, %f107, 0f3FB8AA3B;
				946	cvt.rzi.f32.f32 %f91, %f90;
				947	mov.f32 %f92, 0fBF317200;
				948	fma.rn.f32 %f93, %f91, %f92, %f107;
				949	mov.f32 %f94, 0fB5BFBE8E;
				950	fma.rn.f32 %f95, %f91, %f94, %f93;
				951	mul.f32 %f89, %f95, 0f3FB8AA3B;
				952	// inline asm
				953	ex2.approx.ftz.f32 %f88,%f89;
				954	// inline asm
				955	add.f32 %f96, %f91, 0f00000000;
				956	ex2.approx.f32 %f97, %f96;
				957	mul.f32 %f98, %f88, %f97;
				958	setp.lt.f32 %p15, %f107, 0fC2D20000;
				959	selp.f32 %f99, 0f00000000, %f98, %p15;
				960	setp.gt.f32 %p16, %f107, 0f42D20000;
				961	selp.f32 %f110, 0f7F800000, %f99, %p16;
				962	setp.eq.f32 %p17, %f110, 0f7F800000;
				963	@%p17 bra BB0_28;
				964	// BB#27:
				965	fma.rn.f32 %f110, %f110, %f108, %f110;
				966	BB0_28: // %__internal_accurate_powf.exit.i
				967	setp.lt.f32 %p18, %f1, 0f00000000;
				968	setp.eq.f32 %p19, %f3, 0f3F800000;
				969	and.pred %p20, %p18, %p19;
				970	@!%p20 bra BB0_30;
				971	bra.uni BB0_29;
				972	BB0_29:
				973	mov.b32 %r9, %f110;
				974	xor.b32 %r10, %r9, -2147483648;
				975	mov.b32 %f110, %r10;
				976	BB0_30: // %__nv_powf.exit
				977	st.global.f32 [%rl1], %f110;
				978	ret;
				979	}
				980