docs/NVPTXUsage.rst - fp2-dev/platform/external/llvm - Gitiles

 =============================
 User Guide for NVPTX Back-end
 =============================

 .. contents::
    :local:
    :depth: 3


 Introduction
 ============

 To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
 along with a defined set of conventions used to represent GPU programming
 concepts. This document provides an overview of the general usage of the back-
 end, including a description of the conventions used and the set of accepted
 LLVM IR.

 .. note::

    This document assumes a basic familiarity with CUDA and the PTX
    assembly language. Information about the CUDA Driver API and the PTX assembly
    language can be found in the `CUDA documentation
    <http://docs.nvidia.com/cuda/index.html>`_.


 Conventions
 ===========

 Marking Functions as Kernels
 ----------------------------

 In PTX, there are two types of functions: *device functions*, which are only
 callable by device code, and *kernel functions*, which are callable by host
 code. By default, the back-end will emit device functions. Metadata is used to
 declare a function as a kernel function. This metadata is attached to the
 ``nvvm.annotations`` named metadata object, and has the following format:

 .. code-block:: llvm

    !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}

 The first parameter is a reference to the kernel function. The following
 example shows a kernel function calling a device function in LLVM IR. The
 function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

 .. code-block:: llvm

     define float @my_fmad(float %x, float %y, float %z) {
       %mul = fmul float %x, %y
       %add = fadd float %mul, %z
       ret float %add
     }

     define void @my_kernel(float* %ptr) {
       %val = load float* %ptr
       %ret = call float @my_fmad(float %val, float %val, float %val)
       store float %ret, float* %ptr
       ret void
     }

     !nvvm.annotations = !{!1}
     !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}

 When compiled, the PTX kernel functions are callable by host-side code.


 Address Spaces
 --------------

 The NVPTX back-end uses the following address space mapping:

    ============= ======================
    Address Space Memory Space
    ============= ======================
    0             Generic
    1             Global
    2             Internal Use
    3             Shared
    4             Constant
    5             Local
    ============= ======================

 Every global variable and pointer type is assigned to one of these address
 spaces, with 0 being the default address space. Intrinsics are provided which
 can be used to convert pointers between the generic and non-generic address
 spaces.

 As an example, the following IR will define an array ``@g`` that resides in
 global device memory.

 .. code-block:: llvm

     @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

 LLVM IR functions can read and write to this array, and host-side code can
 copy data to it by name with the CUDA Driver API.

 Note that since address space 0 is the generic space, it is illegal to have
 global variables in address space 0.  Address space 0 is the default address
 space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
 variables.


 NVPTX Intrinsics
 ================

 Address Space Conversion
 ------------------------

 '``llvm.nvvm.ptr.*.to.gen``' Intrinsics
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Syntax:
 """""""

 These are overloaded intrinsics.  You can use these on any pointer types.

 .. code-block:: llvm

     declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
     declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
     declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
     declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

 Overview:
 """""""""

 The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
 address space to a generic address space pointer.

 Semantics:
 """"""""""

 These intrinsics modify the pointer value to be a valid generic address space
 pointer.


 '``llvm.nvvm.ptr.gen.to.*``' Intrinsics
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Syntax:
 """""""

 These are overloaded intrinsics.  You can use these on any pointer types.

 .. code-block:: llvm

     declare i8* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8 addrspace(1)*)
     declare i8* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8 addrspace(3)*)
     declare i8* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8 addrspace(4)*)
     declare i8* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8 addrspace(5)*)

 Overview:
 """""""""

 The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
 address space to a pointer in the target address space.  Note that these
 intrinsics are only useful if the address space of the target address space of
 the pointer is known.  It is not legal to use address space conversion
 intrinsics to convert a pointer from one non-generic address space to another
 non-generic address space.

 Semantics:
 """"""""""

 These intrinsics modify the pointer value to be a valid pointer in the target
 non-generic address space.


 Reading PTX Special Registers
 -----------------------------

 '``llvm.nvvm.read.ptx.sreg.*``'
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Syntax:
 """""""

 .. code-block:: llvm

     declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
     declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
     declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
     declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
     declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
     declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
     declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
     declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
     declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
     declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
     declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
     declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
     declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

 Overview:
 """""""""

 The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
 special registers, in particular the kernel launch bounds.  These registers
 map in the following way to CUDA builtins:

    ============ =====================================
    CUDA Builtin PTX Special Register Intrinsic
    ============ =====================================
    ``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
    ``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
    ``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*``
    ``gridDim``  ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
    ============ =====================================


 Barriers
 --------

 '``llvm.nvvm.barrier0``'
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Syntax:
 """""""

 .. code-block:: llvm

   declare void @llvm.nvvm.barrier0()

 Overview:
 """""""""

 The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
 instruction, equivalent to the ``__syncthreads()`` call in CUDA.


 Other Intrinsics
 ----------------

 For the full set of NVPTX intrinsics, please see the
 ``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


 Executing PTX
 =============

 The most common way to execute PTX assembly on a GPU device is to use the CUDA
 Driver API. This API is a low-level interface to the GPU driver and allows for
 JIT compilation of PTX code to native GPU machine code.

 Initializing the Driver API:

 .. code-block:: c++

     CUdevice device;
     CUcontext context;

     // Initialize the driver API
     cuInit(0);
     // Get a handle to the first compute device
     cuDeviceGet(&device, 0);
     // Create a compute device context
     cuCtxCreate(&context, 0, device);

 JIT compiling a PTX string to a device binary:

 .. code-block:: c++

     CUmodule module;
     CUfunction funcion;

     // JIT compile a null-terminated PTX string
     cuModuleLoadData(&module, (void*)PTXString);

     // Get a handle to the "myfunction" kernel function
     cuModuleGetFunction(&function, module, "myfunction");

 For full examples of executing PTX assembly, please see the `CUDA Samples
 <https://developer.nvidia.com/cuda-downloads>`_ distribution.
	=============================
	User Guide for NVPTX Back-end
	=============================

	.. contents::
	:local:
	:depth: 3


	Introduction
	============

	To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
	along with a defined set of conventions used to represent GPU programming
	concepts. This document provides an overview of the general usage of the back-
	end, including a description of the conventions used and the set of accepted
	LLVM IR.

	.. note::

	This document assumes a basic familiarity with CUDA and the PTX
	assembly language. Information about the CUDA Driver API and the PTX assembly
	language can be found in the `CUDA documentation
	<http://docs.nvidia.com/cuda/index.html>`_.



	Conventions
	===========

	Marking Functions as Kernels
	----------------------------

	In PTX, there are two types of functions: device functions, which are only
	callable by device code, and kernel functions, which are callable by host
	code. By default, the back-end will emit device functions. Metadata is used to
	declare a function as a kernel function. This metadata is attached to the
	``nvvm.annotations`` named metadata object, and has the following format:

	.. code-block:: llvm

	!0 = metadata !{<function-ref>, metadata !"kernel", i32 1}

	The first parameter is a reference to the kernel function. The following
	example shows a kernel function calling a device function in LLVM IR. The
	function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

	.. code-block:: llvm

	define float @my_fmad(float %x, float %y, float %z) {
	%mul = fmul float %x, %y
	%add = fadd float %mul, %z
	ret float %add
	}

	define void @my_kernel(float* %ptr) {
	%val = load float* %ptr
	%ret = call float @my_fmad(float %val, float %val, float %val)
	store float %ret, float* %ptr
	ret void
	}

	!nvvm.annotations = !{!1}
	!1 = metadata !{void (float) @my_kernel, metadata !"kernel", i32 1}

	When compiled, the PTX kernel functions are callable by host-side code.


	Address Spaces
	--------------

	The NVPTX back-end uses the following address space mapping:

	============= ======================
	Address Space Memory Space
	============= ======================
	0 Generic
	1 Global
	2 Internal Use
	3 Shared
	4 Constant
	5 Local
	============= ======================

	Every global variable and pointer type is assigned to one of these address
	spaces, with 0 being the default address space. Intrinsics are provided which
	can be used to convert pointers between the generic and non-generic address
	spaces.

	As an example, the following IR will define an array ``@g`` that resides in
	global device memory.

	.. code-block:: llvm

	@g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

	LLVM IR functions can read and write to this array, and host-side code can
	copy data to it by name with the CUDA Driver API.

	Note that since address space 0 is the generic space, it is illegal to have
	global variables in address space 0. Address space 0 is the default address
	space in LLVM, so the ``addrspace(N)`` annotation is required for global
	variables.


	NVPTX Intrinsics
	================

	Address Space Conversion
	------------------------

	'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	Syntax:
	"""""""

	These are overloaded intrinsics. You can use these on any pointer types.

	.. code-block:: llvm

	declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
	declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
	declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
	declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

	Overview:
	"""""""""

	The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
	address space to a generic address space pointer.

	Semantics:
	""""""""""

	These intrinsics modify the pointer value to be a valid generic address space
	pointer.


	'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	Syntax:
	"""""""

	These are overloaded intrinsics. You can use these on any pointer types.

	.. code-block:: llvm

	declare i8* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8 addrspace(1)*)
	declare i8* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8 addrspace(3)*)
	declare i8* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8 addrspace(4)*)
	declare i8* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8 addrspace(5)*)

	Overview:
	"""""""""

	The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
	address space to a pointer in the target address space. Note that these
	intrinsics are only useful if the address space of the target address space of
	the pointer is known. It is not legal to use address space conversion
	intrinsics to convert a pointer from one non-generic address space to another
	non-generic address space.

	Semantics:
	""""""""""

	These intrinsics modify the pointer value to be a valid pointer in the target
	non-generic address space.


	Reading PTX Special Registers
	-----------------------------

	'``llvm.nvvm.read.ptx.sreg.*``'
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	Syntax:
	"""""""

	.. code-block:: llvm

	declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
	declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
	declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
	declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
	declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
	declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
	declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
	declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
	declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
	declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
	declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
	declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
	declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

	Overview:
	"""""""""

	The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
	special registers, in particular the kernel launch bounds. These registers
	map in the following way to CUDA builtins:

	============ =====================================
	CUDA Builtin PTX Special Register Intrinsic
	============ =====================================
	``threadId`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
	``blockIdx`` ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
	``blockDim`` ``@llvm.nvvm.read.ptx.sreg.ntid.*``
	``gridDim`` ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
	============ =====================================


	Barriers
	--------

	'``llvm.nvvm.barrier0``'
	^^^^^^^^^^^^^^^^^^^^^^^^^^^

	Syntax:
	"""""""

	.. code-block:: llvm

	declare void @llvm.nvvm.barrier0()

	Overview:
	"""""""""

	The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
	instruction, equivalent to the ``__syncthreads()`` call in CUDA.


	Other Intrinsics
	----------------

	For the full set of NVPTX intrinsics, please see the
	``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


	Executing PTX
	=============

	The most common way to execute PTX assembly on a GPU device is to use the CUDA
	Driver API. This API is a low-level interface to the GPU driver and allows for
	JIT compilation of PTX code to native GPU machine code.

	Initializing the Driver API:

	.. code-block:: c++

	CUdevice device;
	CUcontext context;

	// Initialize the driver API
	cuInit(0);
	// Get a handle to the first compute device
	cuDeviceGet(&device, 0);
	// Create a compute device context
	cuCtxCreate(&context, 0, device);

	JIT compiling a PTX string to a device binary:

	.. code-block:: c++

	CUmodule module;
	CUfunction funcion;

	// JIT compile a null-terminated PTX string
	cuModuleLoadData(&module, (void*)PTXString);

	// Get a handle to the "myfunction" kernel function
	cuModuleGetFunction(&function, module, "myfunction");

	For full examples of executing PTX assembly, please see the `CUDA Samples
	<https://developer.nvidia.com/cuda-downloads>`_ distribution.