=============================
User Guide for NVPTX Back-end
=============================

.. contents::
   :local:
   :depth: 3


Introduction
============

To support GPU programming, the NVPTX back-end supports a subset of LLVM IR
along with a defined set of conventions used to represent GPU programming
concepts. This document provides an overview of the general usage of the
back-end, including a description of the conventions used and the set of
accepted LLVM IR.

.. note::

   This document assumes a basic familiarity with CUDA and the PTX
   assembly language. Information about the CUDA Driver API and the PTX assembly
   language can be found in the `CUDA documentation
   <http://docs.nvidia.com/cuda/index.html>`_.


Conventions
===========

Marking Functions as Kernels
----------------------------

In PTX, there are two types of functions: *device functions*, which are only
callable by device code, and *kernel functions*, which are callable by host
code. By default, the back-end will emit device functions. Metadata is used to
declare a function as a kernel function. This metadata is attached to the
``nvvm.annotations`` named metadata object, and has the following format:

.. code-block:: llvm

  !0 = metadata !{<function-ref>, metadata !"kernel", i32 1}

The first parameter is a reference to the kernel function. The following
example shows a kernel function calling a device function in LLVM IR. The
function ``@my_kernel`` is callable from host code, but ``@my_fmad`` is not.

.. code-block:: llvm

  define float @my_fmad(float %x, float %y, float %z) {
    %mul = fmul float %x, %y
    %add = fadd float %mul, %z
    ret float %add
  }

  define void @my_kernel(float* %ptr) {
    %val = load float* %ptr
    %ret = call float @my_fmad(float %val, float %val, float %val)
    store float %ret, float* %ptr
    ret void
  }

  !nvvm.annotations = !{!1}
  !1 = metadata !{void (float*)* @my_kernel, metadata !"kernel", i32 1}

When compiled, the PTX kernel functions are callable by host-side code.


Address Spaces
--------------

The NVPTX back-end uses the following address space mapping:

============= ======================
Address Space Memory Space
============= ======================
0             Generic
1             Global
2             Internal Use
3             Shared
4             Constant
5             Local
============= ======================

Every global variable and pointer type is assigned to one of these address
spaces, with 0 being the default address space. Intrinsics are provided which
can be used to convert pointers between the generic and non-generic address
spaces.

As an example, the following IR will define an array ``@g`` that resides in
global device memory.

.. code-block:: llvm

  @g = internal addrspace(1) global [4 x i32] [ i32 0, i32 1, i32 2, i32 3 ]

LLVM IR functions can read and write to this array, and host-side code can
copy data to it by name with the CUDA Driver API.

Note that since address space 0 is the generic space, it is illegal to have
global variables in address space 0. Address space 0 is the default address
space in LLVM, so the ``addrspace(N)`` annotation is *required* for global
variables.
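
Similarly, a variable can be placed in shared memory by using address space 3.
A minimal sketch (the name ``@scratch`` is hypothetical):

.. code-block:: llvm

  ; A per-block scratch buffer in shared memory (address space 3)
  @scratch = internal addrspace(3) global [32 x i32] zeroinitializer

Each thread block receives its own copy of such a variable, and it is only
accessible from device code.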


NVPTX Intrinsics
================

Address Space Conversion
------------------------

'``llvm.nvvm.ptr.*.to.gen``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

  declare i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)*)
  declare i8* @llvm.nvvm.ptr.shared.to.gen.p0i8.p3i8(i8 addrspace(3)*)
  declare i8* @llvm.nvvm.ptr.constant.to.gen.p0i8.p4i8(i8 addrspace(4)*)
  declare i8* @llvm.nvvm.ptr.local.to.gen.p0i8.p5i8(i8 addrspace(5)*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.*.to.gen``' intrinsics convert a pointer in a non-generic
address space to a generic address space pointer.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid generic address space
pointer.
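
For example, a pointer known to reside in global memory can be converted to a
generic pointer before being passed to code that operates on generic pointers.
A minimal sketch, assuming ``%gptr`` holds an ``i8 addrspace(1)*`` value:

.. code-block:: llvm

  ; Convert a global-space pointer to a generic pointer
  %generic = call i8* @llvm.nvvm.ptr.global.to.gen.p0i8.p1i8(i8 addrspace(1)* %gptr)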


'``llvm.nvvm.ptr.gen.to.*``' Intrinsics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

These are overloaded intrinsics. You can use these on any pointer types.

.. code-block:: llvm

  declare i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8*)
  declare i8 addrspace(3)* @llvm.nvvm.ptr.gen.to.shared.p3i8.p0i8(i8*)
  declare i8 addrspace(4)* @llvm.nvvm.ptr.gen.to.constant.p4i8.p0i8(i8*)
  declare i8 addrspace(5)* @llvm.nvvm.ptr.gen.to.local.p5i8.p0i8(i8*)

Overview:
"""""""""

The '``llvm.nvvm.ptr.gen.to.*``' intrinsics convert a pointer in the generic
address space to a pointer in the target address space. Note that these
intrinsics are only useful if the target address space of the pointer is
known. It is not legal to use address space conversion intrinsics to convert
a pointer from one non-generic address space to another non-generic address
space.

Semantics:
""""""""""

These intrinsics modify the pointer value to be a valid pointer in the target
non-generic address space.
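
For example, a generic pointer known to point into global memory can be
converted back to a global-space pointer. A minimal sketch, assuming
``%generic`` holds an ``i8*`` value that points into the global space:

.. code-block:: llvm

  ; Convert a generic pointer back to a global-space pointer
  %gptr = call i8 addrspace(1)* @llvm.nvvm.ptr.gen.to.global.p1i8.p0i8(i8* %generic)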


Reading PTX Special Registers
-----------------------------

'``llvm.nvvm.read.ptx.sreg.*``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.ntid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.y()
  declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.z()
  declare i32 @llvm.nvvm.read.ptx.sreg.warpsize()

Overview:
"""""""""

The '``@llvm.nvvm.read.ptx.sreg.*``' intrinsics provide access to the PTX
special registers, in particular the kernel launch bounds. These registers
map in the following way to CUDA builtins:

============= =====================================
CUDA Builtin  PTX Special Register Intrinsic
============= =====================================
``threadIdx`` ``@llvm.nvvm.read.ptx.sreg.tid.*``
``blockIdx``  ``@llvm.nvvm.read.ptx.sreg.ctaid.*``
``blockDim``  ``@llvm.nvvm.read.ptx.sreg.ntid.*``
``gridDim``   ``@llvm.nvvm.read.ptx.sreg.nctaid.*``
============= =====================================
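
As an illustration, the common global thread index computation along the X
dimension (``blockIdx.x * blockDim.x + threadIdx.x`` in CUDA) can be
expressed with these intrinsics:

.. code-block:: llvm

  ; index = ctaid.x * ntid.x + tid.x
  %tid   = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %ntid  = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %base  = mul i32 %ctaid, %ntid
  %index = add i32 %base, %tid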


Barriers
--------

'``llvm.nvvm.barrier0``'
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Syntax:
"""""""

.. code-block:: llvm

  declare void @llvm.nvvm.barrier0()

Overview:
"""""""""

The '``@llvm.nvvm.barrier0()``' intrinsic emits a PTX ``bar.sync 0``
instruction, equivalent to the ``__syncthreads()`` call in CUDA.
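
A typical use is to separate writes and reads of shared memory so that all
threads in a block observe each other's results. A minimal sketch, where
``%out`` and ``%in`` are hypothetical shared-memory element pointers:

.. code-block:: llvm

  ; Stage a value in shared memory, then synchronize before reading
  ; a value written by another thread in the same block
  store i32 %val, i32 addrspace(3)* %out
  call void @llvm.nvvm.barrier0()
  %neighbor = load i32 addrspace(3)* %in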


Other Intrinsics
----------------

For the full set of NVPTX intrinsics, please see the
``include/llvm/IR/IntrinsicsNVVM.td`` file in the LLVM source tree.


Executing PTX
=============

The most common way to execute PTX assembly on a GPU device is to use the CUDA
Driver API. This API is a low-level interface to the GPU driver and allows for
JIT compilation of PTX code to native GPU machine code.

Initializing the Driver API:

.. code-block:: c++

  CUdevice device;
  CUcontext context;

  // Initialize the driver API
  cuInit(0);
  // Get a handle to the first compute device
  cuDeviceGet(&device, 0);
  // Create a compute device context
  cuCtxCreate(&context, 0, device);

JIT compiling a PTX string to a device binary:

.. code-block:: c++

  CUmodule module;
  CUfunction function;

  // JIT compile a null-terminated PTX string
  cuModuleLoadData(&module, (void*)PTXString);

  // Get a handle to the "myfunction" kernel function
  cuModuleGetFunction(&function, module, "myfunction");
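
Once a function handle has been obtained, the kernel can be launched with
``cuLaunchKernel``. A minimal sketch, assuming ``devPtr`` is a ``CUdeviceptr``
previously allocated with ``cuMemAlloc``; the grid and block sizes are
arbitrary, and error checking is omitted for brevity:

.. code-block:: c++

  void *params[] = { &devPtr };

  // Launch a 16x1x1 grid of 128x1x1 thread blocks
  cuLaunchKernel(function,
                 16, 1, 1,    // grid dimensions
                 128, 1, 1,   // block dimensions
                 0,           // dynamic shared memory in bytes
                 NULL,        // stream (NULL = default stream)
                 params,      // kernel parameters
                 NULL);       // extra options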

For full examples of executing PTX assembly, please see the `CUDA Samples
<https://developer.nvidia.com/cuda-downloads>`_ distribution.