=========================
Compiling CUDA with clang
=========================

.. contents::
  :local:

Introduction
============

This document describes how to compile CUDA code with clang, and gives some
details about LLVM and clang's CUDA implementations.

This document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

Compiling CUDA Code
===================

Prerequisites
-------------

CUDA is supported in LLVM 3.9, but it's still in active development, so we
recommend you `compile clang/LLVM from HEAD
<http://llvm.org/docs/GettingStarted.html>`_.

Before you build CUDA code, you'll need to have installed the appropriate
driver for your NVIDIA GPU and the CUDA SDK. See `NVIDIA's CUDA installation
guide <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_
for details. Note that clang `does not support
<https://llvm.org/bugs/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed
by many Linux package managers; you probably need to install NVIDIA's package.

You will need CUDA 7.0, 7.5, or 8.0 to compile with clang.

CUDA compilation is supported on Linux, and on MacOS as of XXXX-XX-XX. Windows
support is planned but not yet in place.

Invoking clang
--------------

Invoking clang for CUDA compilation works similarly to compiling regular C++.
You just need to be aware of a few additional flags.

You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
program as a toy example. Save it as ``axpy.cu``. (Clang detects that you're
compiling CUDA code by noticing that your filename ends with ``.cu``.
Alternatively, you can pass ``-x cuda``.)

To build and run, run the following commands, filling in the parts in angle
brackets as described below:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

* ``<CUDA install path>`` -- the directory where you installed the CUDA SDK.
  Typically, ``/usr/local/cuda``.

  Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
  pass e.g. ``-L/usr/local/cuda/lib``. (In CUDA, the device code and host code
  always have the same pointer widths, so if you're compiling 64-bit code for
  the host, you're also compiling 64-bit code for the device.)

* ``<GPU arch>`` -- the `compute capability
  <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
  want to run your program on a GPU with compute capability of 3.5, specify
  ``--cuda-gpu-arch=sm_35``.

  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
  only ``sm_XX`` is currently supported. However, clang always includes PTX in
  its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would
  be forwards-compatible with e.g. ``sm_35`` GPUs.

  You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple
  archs.

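For example, to compile the toy program above for both ``sm_35`` and ``sm_60``
GPUs in one binary (paths illustrative; ``sm_60`` requires CUDA 8.0):

```console
$ clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_60 \
    -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread
```

The resulting fat binary contains device code for both architectures, and the
CUDA runtime picks the appropriate one at load time.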
The ``-L`` and ``-l`` flags only need to be passed when linking. When
compiling, you may also need to pass ``--cuda-path=/path/to/cuda`` if you
didn't install the CUDA SDK into ``/usr/local/cuda``, ``/usr/local/cuda-7.0``,
or ``/usr/local/cuda-7.5``.

Flags that control numerical code
---------------------------------

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
but this results in more compiler options for you to juggle.

Flags you may wish to tweak include:

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics). Prevent ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements. Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code.

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same operations
  on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions. For
  example, this flag allows clang to emit the PTX ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.

Standard library support
========================

In clang and nvcc, most of the C++ standard library is not supported on the
device side.

``<math.h>`` and ``<cmath>``
----------------------------

In clang, ``math.h`` and ``cmath`` are available and `pass
<https://github.com/llvm-mirror/test-suite/blob/master/External/CUDA/math_h.cu>`_
`tests
<https://github.com/llvm-mirror/test-suite/blob/master/External/CUDA/cmath.cu>`_
adapted from libc++'s test suite.

In nvcc, ``math.h`` and ``cmath`` are mostly available. Versions of ``::foof``
in namespace std (e.g. ``std::sinf``) are not available, and where the standard
calls for overloads that take integral arguments, these are usually not
available.

.. code-block:: c++

  #include <math.h>
  #include <cmath>

  // clang is OK with everything in this function.
  __device__ void test() {
    std::sin(0.); // nvcc - ok
    std::sin(0);  // nvcc - error, because no std::sin(int) overload is available.
    sin(0);       // nvcc - same as above.

    sinf(0.);      // nvcc - ok
    std::sinf(0.); // nvcc - no such function
  }

``<complex>``
-------------

nvcc does not officially support ``std::complex``. It's an error to use
``std::complex`` in ``__device__`` code, but it often works in ``__host__
__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
below). However, we have heard from implementers that it's possible to get
into situations where nvcc will omit a call to an ``std::complex`` function,
especially when compiling without optimizations.

As of 2016-11-16, clang supports ``std::complex`` without these caveats. It is
tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
newer than 2016-11-16.

``<algorithm>``
---------------

In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min``
and ``std::max``) become constexpr. You can therefore use these in device
code, when compiling with clang.

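For instance, this sketch (ours, not from the CUDA SDK; compile with
``-std=c++14``) clamps a value in device code using the now-constexpr
overloads; ``clamp01`` is a hypothetical helper:

```c++
#include <algorithm>

// C++14 constexpr std::min/std::max are callable from __device__ code in clang.
__device__ float clamp01(float x) {
  return std::min(std::max(x, 0.0f), 1.0f);
}
```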
Detecting clang vs NVCC from code
=================================

Although clang's CUDA implementation is largely compatible with NVCC's, you may
still want to detect when you're compiling CUDA code specifically with clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process! For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor). So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can
detect NVCC specifically by looking for ``__NVCC__``.

Dialect Differences Between clang and nvcc
==========================================

There is no formal CUDA spec, and clang and nvcc speak slightly different
dialects of the language. Below, we describe some of the differences.

This section is painful; hopefully you can skip this section and live your life
blissfully unaware.

Compilation Models
------------------

Most of the differences between clang and nvcc stem from the different
compilation models used by clang and nvcc. nvcc uses *split compilation*,
which works roughly as follows:

 * Run a preprocessor over the input ``.cu`` file to split it into two source
   files: ``H``, containing source code for the host, and ``D``, containing
   source code for the device.

 * For each GPU architecture ``arch`` that we're compiling for, do:

   * Compile ``D`` using nvcc proper. The result of this is a ``ptx`` file,
     ``P_arch``.

   * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file,
     ``S_arch``, containing GPU machine code (SASS) for ``arch``.

 * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
   single "fat binary" file, ``F``.

 * Compile ``H`` using an external host compiler (gcc, clang, or whatever you
   like). ``F`` is packaged up into a header file which is force-included into
   ``H``; nvcc generates code that calls into this header to e.g. launch
   kernels.

clang uses *merged parsing*. This is similar to split compilation, except all
of the host and device code is present and must be semantically correct in both
compilation steps.

 * For each GPU architecture ``arch`` that we're compiling for, do:

   * Compile the input ``.cu`` file for device, using clang. ``__host__`` code
     is parsed and must be semantically correct, even though we're not
     generating code for the host at this time.

     The output of this step is a ``ptx`` file ``P_arch``.

   * Invoke ``ptxas`` to generate a SASS file, ``S_arch``. Note that, unlike
     nvcc, clang always generates SASS code.

 * Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
   single fat binary file, ``F``.

 * Compile the input ``.cu`` file for host, using clang. ``__device__`` code
   is parsed and must be semantically correct, even though we're not generating
   code for the device at this time.

   ``F`` is passed to this compilation, and clang includes it in a special ELF
   section, where it can be found by tools like ``cuobjdump``.

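You can watch the device half of this pipeline in isolation: clang can stop
after the device-side compile and emit the PTX (``P_arch``) directly. A sketch
(flag spellings as of clang 3.9; ``axpy.cu`` is the toy program from above):

```console
$ clang++ --cuda-device-only --cuda-gpu-arch=sm_35 -S axpy.cu -o axpy-sm_35.ptx
```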
(You may ask at this point, why does clang need to parse the input file
multiple times? Why not parse it just once, and then use the AST to generate
code for the host and each device architecture?

Unfortunately this can't work because we have to define different macros during
host compilation and during device compilation for each GPU architecture.)

clang's approach allows it to be highly robust to C++ edge cases, as it doesn't
need to decide at an early stage which declarations to keep and which to throw
away. But it has some consequences you should be aware of.

Overloading Based on ``__host__`` and ``__device__`` Attributes
---------------------------------------------------------------

Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__``
functions", and "``__host__ __device__`` functions", respectively. Functions
with no attributes behave the same as H.

nvcc does not allow you to create H and D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  __host__ void foo() {}
  __device__ void foo() {}

However, nvcc allows you to "overload" H and D functions with different
signatures:

.. code-block:: c++

  // nvcc: no error
  __host__ void foo(int) {}
  __device__ void foo() {}

In clang, the ``__host__`` and ``__device__`` attributes are part of a
function's signature, and so it's legal to have H and D functions with
(otherwise) the same signature:

.. code-block:: c++

  // clang: no error
  __host__ void foo() {}
  __device__ void foo() {}

HD functions cannot be overloaded by H or D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  // clang: error - redefinition of 'foo'
  __host__ __device__ void foo() {}
  __device__ void foo() {}

  // nvcc: no error
  // clang: no error
  __host__ __device__ void bar(int) {}
  __device__ void bar() {}

When resolving an overloaded function, clang considers the host/device
attributes of the caller and callee. These are used as a tiebreaker during
overload resolution. See `IdentifyCUDAPreference
<http://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
but at a high level they are:

 * D functions prefer to call other Ds. HDs are given lower priority.

 * Similarly, H functions prefer to call other Hs, or ``__global__`` functions
   (with equal priority). HDs are given lower priority.

 * HD functions prefer to call other HDs.

   When compiling for device, HDs will call Ds with lower priority than HD,
   and will call Hs with still lower priority. If it's forced to call an H,
   the program is malformed if we emit code for this HD function. We call
   this the "wrong-side rule"; see the example below.

   The rules are symmetrical when compiling for host.

Some examples:

.. code-block:: c++

  __host__ void foo();
  __device__ void foo();

  __host__ void bar();
  __host__ __device__ void bar();

  __host__ void test_host() {
    foo(); // calls H overload
    bar(); // calls H overload
  }

  __device__ void test_device() {
    foo(); // calls D overload
    bar(); // calls HD overload
  }

  __host__ __device__ void test_hd() {
    foo(); // calls H overload when compiling for host, otherwise D overload
    bar(); // always calls HD overload
  }

Wrong-side rule example:

.. code-block:: c++

  __host__ void host_only();

  // We don't codegen inline functions unless they're referenced by a
  // non-inline function. inline_hd1() is called only from the host side, so
  // does not generate an error. inline_hd2() is called from the device side,
  // so it generates an error.
  inline __host__ __device__ void inline_hd1() { host_only(); } // no error
  inline __host__ __device__ void inline_hd2() { host_only(); } // error

  __host__ void host_fn() { inline_hd1(); }
  __device__ void device_fn() { inline_hd2(); }

  // This function is not inline, so it's always codegen'ed on both the host
  // and the device. Therefore, it generates an error.
  __host__ __device__ void not_inline_hd() { host_only(); }

For the purposes of the wrong-side rule, templated functions also behave like
``inline`` functions: they aren't codegen'ed unless they're instantiated
(usually as part of the process of invoking them).

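A sketch of that template behavior (hypothetical names):

```c++
__host__ void host_only();

template <typename T>
__host__ __device__ void template_hd() { host_only(); }

// No error during device compilation: template_hd<int> is referenced only
// from host code, so it's never codegen'ed for the device, and the
// wrong-side call to host_only() is never diagnosed.
__host__ void host_caller() { template_hd<int>(); }
```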
clang's behavior with respect to the wrong-side rule matches nvcc's, except
nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call
``not_inline_hd``. In its generated code, nvcc may omit ``not_inline_hd``'s
call to ``host_only`` entirely, or it may try to generate code for
``host_only`` on the device. What you get seems to depend on whether or not
the compiler chooses to inline ``host_only``.

Member functions, including constructors, may be overloaded using H and D
attributes. However, destructors cannot be overloaded.

Using a Different Class on Host/Device
--------------------------------------

Occasionally you may want to have a class with different host/device versions.

If all of the class's members are the same on the host and device, you can just
provide overloads for the class's member functions.

However, if you want your class to have different members on host/device, you
won't be able to provide working H and D overloads in both classes. In this
case, clang is likely to be unhappy with you.

.. code-block:: c++

  #ifdef __CUDA_ARCH__
  struct S {
    __device__ void foo() { /* use device_only */ }
    int device_only;
  };
  #else
  struct S {
    __host__ void foo() { /* use host_only */ }
    double host_only;
  };

  __device__ void test() {
    S s;
    // clang generates an error here, because during host compilation, we
    // have ifdef'ed away the __device__ overload of S::foo(). The __device__
    // overload must be present *even during host compilation*.
    s.foo();
  }
  #endif

We posit that you don't really want to have classes with different members on H
and D. For example, if you were to pass one of these as a parameter to a
kernel, it would have a different layout on H and D, so it would not work
properly.

To make code like this compatible with clang, we recommend you separate it out
into two classes. If you need to write code that works on both host and
device, consider writing an overloaded wrapper function that returns different
types on host and device.

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  __host__ HostS MakeStruct() { return HostS(); }
  __device__ DeviceS MakeStruct() { return DeviceS(); }

  // Now host and device code can call MakeStruct().

Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow
you to overload based on the H/D attributes. Here's an idiom that works with
both clang and nvcc:

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  #ifdef __NVCC__
    #ifndef __CUDA_ARCH__
      __host__ HostS MakeStruct() { return HostS(); }
    #else
      __device__ DeviceS MakeStruct() { return DeviceS(); }
    #endif
  #else
    __host__ HostS MakeStruct() { return HostS(); }
    __device__ DeviceS MakeStruct() { return DeviceS(); }
  #endif

  // Now host and device code can call MakeStruct().

Hopefully you don't have to do this sort of thing often.

Optimizations
=============

Modern CPUs and GPUs are architecturally quite different, so code that's fast
on a CPU isn't necessarily fast on a GPU. We've made a number of changes to
LLVM to make it generate good GPU code. Among these changes are:

* `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These
  reduce redundancy within straight-line code.

* `Aggressive speculative execution
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  -- This is mainly for promoting straight-line scalar optimizations, which are
  most effective on code along dominator paths.

* `Memory space inference
  <http://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ --
  In PTX, we can operate on pointers that are in a particular "address space"
  (global, shared, constant, or local), or we can operate on pointers in the
  "generic" address space, which can point to anything. Operations in a
  non-generic address space are faster, but pointers in CUDA are not explicitly
  annotated with their address space, so it's up to LLVM to infer it where
  possible.

* `Bypassing 64-bit divides
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
  This was an existing optimization that we enabled for the PTX backend.

  64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
  Many of the 64-bit divides in our benchmarks have a divisor and dividend
  which fit in 32 bits at runtime. This optimization provides a fast path for
  this common case.

* Aggressive loop unrolling and function inlining -- Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. More aggressive unrolling
  and inlining also promote other optimizations, such as constant propagation
  and SROA, which sometimes speed up code by over 10x.

  (Programmers can force unrolling and inlining using clang's `loop unrolling
  pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))``.)

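For example (our sketch, with hypothetical helper names), both knobs applied
in device code:

```c++
// Force inlining of a small helper into every caller.
__device__ inline __attribute__((always_inline))
float lerp(float a, float b, float t) { return a + t * (b - a); }

__global__ void smooth4(float *out, const float *in) {
  // Force full unrolling of this fixed-trip-count loop.
  #pragma unroll
  for (int i = 0; i < 4; ++i)
    out[i] = lerp(in[i], in[i + 1], 0.5f);
}
```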
Publication
===========

The team at Google published a paper in CGO 2016 detailing the optimizations
they'd made to clang/LLVM. Note that "gpucc" is no longer a meaningful name:
the relevant tools are now just vanilla clang/LLVM.

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
|
| `Slides from the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_
|
| `Tutorial given at CGO <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_

Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <http://llvm.org/docs/#mailing-lists>`_.