===================================
Compiling CUDA C/C++ with LLVM
===================================

.. contents::
   :local:

Introduction
============

This document describes how to compile CUDA C/C++ with LLVM and outlines some
of the internals involved. It is aimed both at users who want to compile CUDA
with LLVM and at developers who want to improve LLVM for GPUs. It assumes a
basic familiarity with CUDA; background on CUDA programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

How to Build LLVM with CUDA Support
===================================

CUDA support is still under active development and works best with the trunk
version of LLVM. Below is a quick summary of how to download and build it.
Consult the `Getting Started
<http://llvm.org/docs/GettingStarted.html>`_ page for more details on setting
up LLVM.

#. Check out LLVM

   .. code-block:: console

      $ cd where-you-want-llvm-to-live
      $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm

#. Check out Clang

   .. code-block:: console

      $ cd where-you-want-llvm-to-live
      $ cd llvm/tools
      $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang

#. Configure and build LLVM and Clang

   .. code-block:: console

      $ cd where-you-want-llvm-to-live
      $ mkdir build
      $ cd build
      $ cmake [options] ..
      $ make
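
   For example, a typical Release configuration could look like the following
   (these options are common choices, not requirements):

   .. code-block:: console

      $ cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=ON ..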

How to Compile CUDA C/C++ with LLVM
===================================

We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
CUDA Installation Guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
you have not.

Suppose you want to compile and run the following CUDA program (``axpy.cu``),
which multiplies a ``float`` array by a ``float`` scalar (AXPY).

.. code-block:: c++

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main(int argc, char* argv[]) {
    const int kDataLen = 4;

    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen];

    // Copy input data to device.
    float* device_x;
    float* device_y;
    cudaMalloc(&device_x, kDataLen * sizeof(float));
    cudaMalloc(&device_y, kDataLen * sizeof(float));
    cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
               cudaMemcpyHostToDevice);

    // Launch the kernel.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy output data to host.
    cudaDeviceSynchronize();
    cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Print the results.
    for (int i = 0; i < kDataLen; ++i) {
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    }

    cudaDeviceReset();
    return 0;
  }

The command line for compilation is similar to what you would use for C++.

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

``<CUDA install path>`` is the root directory where you installed the CUDA SDK,
typically ``/usr/local/cuda``. ``<GPU arch>`` is `the compute capability of
your GPU <https://developer.nvidia.com/cuda-gpus>`_. For example, if you want
to run your program on a GPU with compute capability 3.5, specify
``--cuda-gpu-arch=sm_35``.
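
For example, on a typical 64-bit Linux installation with the CUDA SDK in
``/usr/local/cuda`` and a GPU with compute capability 3.5, the command becomes:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 \
      -L/usr/local/cuda/lib64 \
      -lcudart_static -ldl -lrt -pthread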

Optimizations
=============

CPUs and GPUs have different design philosophies and architectures. For
example, a typical CPU has branch prediction, out-of-order execution, and
superscalar issue, whereas a typical GPU has none of these. Because of such
differences, an optimization pipeline well-tuned for CPUs may not be suitable
for GPUs.

LLVM performs several general and CUDA-specific optimizations for GPUs. The
list below shows some of the more important ones. Most of them have been
upstreamed to ``lib/Transforms/Scalar`` and ``lib/Target/NVPTX``. A few of them
have not been upstreamed yet because LLVM lacks a customizable
target-independent optimization pipeline.

* **Straight-line scalar optimizations**. These optimizations reduce redundancy
  in straight-line code. Details can be found in the `design document for
  straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.
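
  A hypothetical kernel with the kind of redundancy these passes target: the
  three addresses below differ only by constant offsets, so most of the address
  arithmetic can be shared rather than recomputed for every access.

  .. code-block:: c++

    __global__ void stencil3(const float* in, float* out) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      // in[i], in[i + 1], and in[i + 2] share a common base address; the
      // straight-line passes rewrite the offset computations so the GPU does
      // not redo the full base-plus-index arithmetic for each load.
      out[i] = (in[i] + in[i + 1] + in[i + 2]) / 3.0f;
    }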

* **Inferring memory spaces**. `This optimization
  <http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html>`_
  infers the memory space of an address so that the backend can emit faster
  special loads and stores from it. Details can be found in the `design
  document for memory space inference <https://goo.gl/5wH2Ct>`_.
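
  A hypothetical case where the inference pays off: ``tile`` is declared
  ``__shared__``, but it reaches the load through a generic pointer parameter.

  .. code-block:: c++

    __shared__ float tile[64];

    __device__ float load_elem(const float* p, int i) { return p[i]; }

    __global__ void square_shared(float* out) {
      tile[threadIdx.x] = static_cast<float>(threadIdx.x);
      __syncthreads();
      // After inlining, the load inside load_elem is known to address shared
      // memory, so the backend can emit a specialized shared-memory load
      // (e.g. ld.shared) instead of a slower generic load.
      out[threadIdx.x] = load_elem(tile, threadIdx.x) + 1.0f;
    }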

* **Aggressive loop unrolling and function inlining**. Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. They also enable other
  optimizations, such as constant propagation and SROA, which sometimes speed
  up code by over 10x. An empirical inline threshold for GPUs is 1100. This
  configuration has yet to be upstreamed, pending a customizable
  target-specific optimization pipeline. LLVM also provides `loop unrolling
  pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))`` for programmers to force unrolling and
  inlining.
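
  A minimal sketch using both of these hints (the kernel and helper names are
  made up for illustration):

  .. code-block:: c++

    __device__ __attribute__((always_inline)) float axpy_elem(float a, float x,
                                                              float y) {
      return a * x + y;
    }

    __global__ void axpy4(float a, const float* x, float* y) {
      // The trip count is a compile-time constant, so forcing full unrolling
      // (and inlining axpy_elem) leaves straight-line code that constant
      // propagation and SROA can simplify further.
      #pragma unroll
      for (int i = 0; i < 4; ++i) {
        int idx = 4 * threadIdx.x + i;
        y[idx] = axpy_elem(a, x[idx], y[idx]);
      }
    }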

* **Aggressive speculative execution**. `This transformation
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ is
  mainly for promoting straight-line scalar optimizations, which are most
  effective on code along dominator paths.
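
  A made-up example of code that benefits: the conditional arithmetic is cheap
  and has no side effects, so it can be hoisted above the branch.

  .. code-block:: c++

    __device__ float blend(float a, float b, bool flip) {
      float v = a;
      if (flip)
        // Side-effect-free arithmetic like this can be speculated (hoisted
        // above the branch), merging it into the surrounding straight-line
        // code where the scalar optimizations above can see it.
        v = a * 0.5f + b * 0.5f;
      return v;
    }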

* **Memory-space alias analysis**. `This alias analysis
  <http://reviews.llvm.org/D12414>`_ infers that two pointers in different
  special memory spaces do not alias. It has yet to be integrated into the new
  alias analysis infrastructure; the new infrastructure does not run
  target-specific alias analysis.
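
  An illustrative kernel in which the analysis helps: ``buf`` lives in shared
  memory and ``coeff`` in constant memory, so accesses to them cannot alias.

  .. code-block:: c++

    __constant__ float coeff[64];

    __global__ void scale(const float* in, float* out) {
      __shared__ float buf[64];
      buf[threadIdx.x] = in[threadIdx.x];
      __syncthreads();
      // buf (shared) and coeff (constant) are in different special memory
      // spaces, so the two loads below cannot alias; later passes may reorder
      // or combine the memory operations accordingly.
      out[threadIdx.x] = buf[threadIdx.x] * coeff[threadIdx.x];
    }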

* **Bypassing 64-bit divides**. `An existing optimization
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ is
  enabled in the NVPTX backend. 64-bit integer divides are much slower than
  32-bit ones on NVIDIA GPUs due to the lack of a divide unit. Many of the
  64-bit divides in our benchmarks have a divisor and dividend which fit in 32
  bits at run time. This optimization provides a fast path for this common case.
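
  A sketch of that common case (the kernel is illustrative):

  .. code-block:: c++

    __global__ void div_elems(const unsigned long long* in,
                              unsigned long long* out,
                              unsigned long long d, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
        // A 64-bit divide in the source. When both operands happen to fit in
        // 32 bits at run time, the bypassed code performs a much cheaper
        // 32-bit divide instead of the full 64-bit sequence.
        out[i] = in[i] / d;
    }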