=========================
Compiling CUDA with clang
=========================

.. contents::
   :local:

Introduction
============

This document describes how to compile CUDA code with clang, and gives some
details about LLVM and clang's CUDA implementations.

This document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

Compiling CUDA Code
===================

Prerequisites
-------------

CUDA is supported in LLVM 3.9, but support is still under active development,
so we recommend that you `compile clang/LLVM from HEAD
<http://llvm.org/docs/GettingStarted.html>`_.

Before you build CUDA code, you'll need to have installed the appropriate
driver for your NVIDIA GPU and the CUDA SDK. See `NVIDIA's CUDA installation
guide <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_
for details. Note that clang `does not support
<https://llvm.org/bugs/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed
by many Linux package managers; you probably need to install NVIDIA's own
package.

You will need CUDA 7.0 or 7.5 to compile with clang. CUDA 8 support is in the
works.

Invoking clang
--------------

Invoking clang for CUDA compilation works similarly to compiling regular C++.
You just need to be aware of a few additional flags.

You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
program as a toy example. Save it as ``axpy.cu``. To build and run, use the
following commands:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

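The exact program lives in the gist linked above; a minimal ``axpy.cu`` that
produces the output shown here might look roughly like the following (a sketch
for illustration, not the verbatim gist contents):

.. code-block:: c++

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x] + y[threadIdx.x];
  }

  int main() {
    const int kDataLen = 4;
    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen] = {0.0f, 0.0f, 0.0f, 0.0f};

    // Copy input data to the device.
    float* device_x;
    float* device_y;
    cudaMalloc(&device_x, kDataLen * sizeof(float));
    cudaMalloc(&device_y, kDataLen * sizeof(float));
    cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemcpy(device_y, host_y, kDataLen * sizeof(float),
               cudaMemcpyHostToDevice);

    // Launch one block of kDataLen threads; each thread handles one element.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Wait for the kernel, copy the result back, and print it.
    cudaDeviceSynchronize();
    cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < kDataLen; ++i) {
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    }

    cudaFree(device_x);
    cudaFree(device_y);
    return 0;
  }
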
* clang detects that you're compiling CUDA by noticing that your source file
  ends with ``.cu``. (Alternatively, you can pass ``-x cuda``.)

* ``<CUDA install path>`` is the root directory where you installed the CUDA
  SDK, typically ``/usr/local/cuda``.

  Pass e.g. ``/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
  pass ``/usr/local/cuda/lib``. (In CUDA, the device code and host code always
  have the same pointer widths, so if you're compiling 64-bit code for the
  host, you're also compiling 64-bit code for the device.)

* ``<GPU arch>`` is `the compute capability of your GPU
  <https://developer.nvidia.com/cuda-gpus>`_. For example, if you want to run
  your program on a GPU with compute capability 3.5, you should specify
  ``--cuda-gpu-arch=sm_35``.

  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
  only ``sm_XX`` is currently supported. However, clang always includes PTX in
  its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would
  be forwards-compatible with e.g. ``sm_35`` GPUs.

  You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple
  archs; see the example below.

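As an illustration (the architectures and paths here are only examples), an
invocation that targets both ``sm_35`` and ``sm_50`` GPUs might look like:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_50 \
      -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread
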
Flags that control numerical code
---------------------------------

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
but this results in more compiler options for you to juggle.

Flags you may wish to tweak include:

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics). Prevent ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements. Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code; see the example after this list.

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same operations
  on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions. For
  example, this flag allows clang to emit the ptx ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.

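To make the fp-contraction point concrete, consider a kernel like the one
below (a hypothetical snippet, not part of the toy program above). With the
default ``-ffp-contract=fast``, the multiply and add may be fused into a
single fma whose intermediate product is not rounded, so the result can differ
in the last bits from a build using ``-ffp-contract=off``:

.. code-block:: c++

  // a[i] * b[i] + c[i] is a candidate for fusion into a single fma
  // instruction; whether that happens is controlled by -ffp-contract.
  __global__ void mad(const float* a, const float* b, const float* c,
                      float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
      out[i] = a[i] * b[i] + c[i];
  }

These are ordinary clang flags, so they can simply be appended to the compile
command shown earlier, e.g. ``-ffp-contract=off -fcuda-flush-denormals-to-zero``.
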
Detecting clang vs NVCC from code
=================================

Although clang's CUDA implementation is largely compatible with NVCC's, you
may still want to detect when you're compiling CUDA code specifically with
clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process! For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor). So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

Both clang and NVCC define ``__CUDACC__`` during CUDA compilation. You can
detect NVCC specifically by looking for ``__NVCC__``.

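For completeness, the corresponding checks using only the macros mentioned
above might look like this (a sketch; adapt it to your project's needs):

.. code-block:: c++

  #if defined(__CUDACC__)
  // Some compiler -- clang or NVCC -- is performing a CUDA compilation.
  #endif

  #if defined(__NVCC__)
  // NVCC is driving the compilation (its host-side pass may still invoke
  // clang or another host compiler underneath).
  #endif
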
Optimizations
=============

CPUs and GPUs have different design philosophies and architectures. For
example, a typical CPU has branch prediction, out-of-order execution, and is
superscalar, whereas a typical GPU has none of these. Due to such differences,
an optimization pipeline well-tuned for CPUs may not be suitable for GPUs.

LLVM performs several general and CUDA-specific optimizations for GPUs. The
list below shows some of the more important optimizations for GPUs. Most of
them have been upstreamed to ``lib/Transforms/Scalar`` and
``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
customizable target-independent optimization pipeline.

* **Straight-line scalar optimizations**. These optimizations reduce redundancy
  in straight-line code. Details can be found in the `design document for
  straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.

* **Inferring memory spaces**. `This optimization
  <https://github.com/llvm-mirror/llvm/blob/master/lib/Target/NVPTX/NVPTXInferAddressSpaces.cpp>`_
  infers the memory space of an address so that the backend can emit faster
  special loads and stores from it.

* **Aggressive loop unrolling and function inlining**. Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. They also promote other
  optimizations, such as constant propagation and SROA, which sometimes speed
  up code by over 10x. An empirical inline threshold for GPUs is 1100. This
  configuration has yet to be upstreamed with a target-specific optimization
  pipeline. LLVM also provides `loop unrolling pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))`` for programmers to force unrolling and
  inlining.

* **Aggressive speculative execution**. `This transformation
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ is
  mainly for promoting straight-line scalar optimizations, which are most
  effective on code along dominator paths.

* **Memory-space alias analysis**. `This alias analysis
  <http://reviews.llvm.org/D12414>`_ infers that two pointers in different
  special memory spaces do not alias. It has yet to be integrated into the new
  alias analysis infrastructure; the new infrastructure does not run
  target-specific alias analysis.

* **Bypassing 64-bit divides**. `An existing optimization
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_
  enabled in the NVPTX backend. 64-bit integer divides are much slower than
  32-bit ones on NVIDIA GPUs due to the lack of a divide unit. Many of the
  64-bit divides in our benchmarks have a divisor and dividend which fit in 32
  bits at runtime. This optimization provides a fast path for this common case,
  conceptually similar to the sketch below.

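A hand-written equivalent of that fast path would look roughly like the
following (a conceptual sketch of the transformation, not the actual code LLVM
generates):

.. code-block:: c++

  // If both operands fit in 32 bits at runtime, use the much cheaper
  // 32-bit divide; otherwise fall back to the slow 64-bit divide.
  __device__ unsigned long long div64(unsigned long long a,
                                      unsigned long long b) {
    if ((a >> 32) == 0 && (b >> 32) == 0) {
      return static_cast<unsigned int>(a) / static_cast<unsigned int>(b);
    }
    return a / b;
  }
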
Publication
===========

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
| `Slides for the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_

Tutorial
========

`CGO 2016 gpucc tutorial <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_

Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <http://llvm.org/docs/#mailing-lists>`_.