===================================
Compiling CUDA C/C++ with LLVM
===================================

.. contents::
   :local:

Introduction
============

This document contains the user guides and the internals of compiling CUDA
C/C++ with LLVM. It is aimed at both users who want to compile CUDA with LLVM
and developers who want to improve LLVM for GPUs. This document assumes a basic
familiarity with CUDA. Information about CUDA programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

How to Build LLVM with CUDA Support
===================================

CUDA support is still in development and works best in the trunk version
of LLVM. Below is a quick summary of downloading and building the trunk
version. Consult the `Getting Started
<http://llvm.org/docs/GettingStarted.html>`_ page for more details on setting
up LLVM.

#. Check out LLVM

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm

#. Check out Clang

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ cd llvm/tools
     $ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang

#. Configure and build LLVM and Clang

   .. code-block:: console

     $ cd where-you-want-llvm-to-live
     $ mkdir build
     $ cd build
     $ cmake [options] ..
     $ make

How to Compile CUDA C/C++ with LLVM
===================================

We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
CUDA installation guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
you have not.

Suppose you want to compile and run the following CUDA program (``axpy.cu``)
which multiplies a ``float`` array by a ``float`` scalar (AXPY).

.. code-block:: c++

  #include <iostream>

  __global__ void axpy(float a, float* x, float* y) {
    y[threadIdx.x] = a * x[threadIdx.x];
  }

  int main(int argc, char* argv[]) {
    const int kDataLen = 4;

    float a = 2.0f;
    float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
    float host_y[kDataLen];

    // Copy input data to device.
    float* device_x;
    float* device_y;
    cudaMalloc(&device_x, kDataLen * sizeof(float));
    cudaMalloc(&device_y, kDataLen * sizeof(float));
    cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
               cudaMemcpyHostToDevice);

    // Launch the kernel.
    axpy<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy output data to host.
    cudaDeviceSynchronize();
    cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Print the results.
    for (int i = 0; i < kDataLen; ++i) {
      std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    }

    cudaDeviceReset();
    return 0;
  }

The command line for compilation is similar to what you would use for C++.

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib> \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

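To check the device output independently, the same computation can be written
as a plain host function. The sketch below is only an illustration and not part
of ``axpy.cu``; the name ``axpy_reference`` is hypothetical, and the function
simply mirrors the kernel above, which computes ``y[i] = a * x[i]``.

.. code-block:: c++

  #include <cstddef>
  #include <vector>

  // Hypothetical host-side reference for the kernel above: y[i] = a * x[i].
  std::vector<float> axpy_reference(float a, const std::vector<float>& x) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
      y[i] = a * x[i];
    }
    return y;
  }

Calling ``axpy_reference(2.0f, {1.0f, 2.0f, 3.0f, 4.0f})`` yields 2, 4, 6 and
8, the same values as the device output shown above.
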
``<CUDA install path>`` is the root directory where you installed the CUDA SDK,
typically ``/usr/local/cuda``. ``<GPU arch>`` is `the compute capability of
your GPU <https://developer.nvidia.com/cuda-gpus>`_. For example, if you want
to run your program on a GPU with compute capability of 3.5, you should specify
``--cuda-gpu-arch=sm_35``.

Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
only ``sm_XX`` is currently supported. However, clang always includes PTX in
its binaries, so a binary compiled with ``--cuda-gpu-arch=sm_30`` would be
forward-compatible with, for example, ``sm_35`` GPUs.

You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple
architectures.

Detecting clang vs NVCC
=======================

Although clang's CUDA implementation is largely compatible with NVCC's, you may
still want to detect when you're compiling CUDA code specifically with clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process! For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor). So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
    // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
    // clang compiling CUDA code, device mode.
  #endif

Both clang and NVCC define ``__CUDACC__`` during CUDA compilation. You can
detect NVCC specifically by looking for ``__NVCC__``.

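The macro checks above can be folded into one helper that reports the active
mode. This is a minimal sketch: the function name ``cuda_compile_mode`` and the
returned strings are hypothetical, and only the macros named in this section
are assumed to exist.

.. code-block:: c++

  #include <string>

  // Hypothetical helper: reports which CUDA compilation mode, if any, is
  // active, using only the macros described in this section.
  inline std::string cuda_compile_mode() {
  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
    return "clang, CUDA device mode";
  #elif defined(__clang__) && defined(__CUDA__)
    return "clang, CUDA host mode";
  #elif defined(__NVCC__)
    // Reached even when NVCC uses clang as its host compiler, because clang
    // defines __CUDA__ only when it compiles the CUDA code itself.
    return "NVCC";
  #else
    return "not a CUDA compilation";
  #endif
  }

When the file is built as ordinary C++, none of the CUDA macros are defined and
the helper returns ``"not a CUDA compilation"``.
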
Flags that control numerical code
=================================

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
but this results in more compiler options for you to juggle.

Flags you may wish to tweak include:

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics). Prevent ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements. Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code.

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same operations
  on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions. For
  example, this flag allows clang to emit the ptx ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.

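The effect of fusing on rounding can be reproduced on the host with standard
C++. The sketch below is an illustration, not clang's implementation: the
function names are hypothetical, ``std::fma`` stands in for a fused
instruction, and the ``volatile`` store is an assumption used to keep the host
compiler from contracting the unfused version.

.. code-block:: c++

  #include <cmath>

  // Unfused: the product is rounded to float before the addition.
  // The volatile store forces that intermediate rounding to happen.
  float mul_add_unfused(float a, float b, float c) {
    volatile float product = a * b;
    return product + c;
  }

  // Fused: std::fma computes a * b + c exactly and rounds only once.
  float mul_add_fused(float a, float b, float c) {
    return std::fma(a, b, c);
  }

With ``a = b = 1 + 2^-12`` and ``c = -(1 + 2^-11)``, the exact product is
``1 + 2^-11 + 2^-24``. Rounding it to ``float`` drops the last term, so the
unfused version returns 0 while the fused version returns ``2^-24``.
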
Optimizations
=============

CPUs and GPUs have different design philosophies and architectures. For
example, a typical CPU has branch prediction, out-of-order execution, and is
superscalar, whereas a typical GPU has none of these. Due to such differences,
an optimization pipeline well-tuned for CPUs may not be suitable for GPUs.

LLVM performs several general and CUDA-specific optimizations for GPUs. The
list below shows some of the more important optimizations for GPUs. Most of
them have been upstreamed to ``lib/Transforms/Scalar`` and
``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
customizable target-independent optimization pipeline.

* Straight-line scalar optimizations. These optimizations reduce redundancy
  in straight-line code. Details can be found in the `design document for
  straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.

* Inferring memory spaces. `This optimization
  <https://github.com/llvm-mirror/llvm/blob/master/lib/Target/NVPTX/NVPTXInferAddressSpaces.cpp>`_
  infers the memory space of an address so that the backend can emit faster
  special loads and stores from it.

* Aggressive loop unrolling and function inlining. Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfers on GPUs are more expensive. They also promote other
  optimizations such as constant propagation and SROA, which sometimes speed up
  code by over 10x. An empirical inline threshold for GPUs is 1100. This
  configuration has yet to be upstreamed with a target-specific optimization
  pipeline. LLVM also provides `loop unrolling pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))`` for programmers to force unrolling and
  inlining.

* Aggressive speculative execution. `This transformation
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ is
  mainly for promoting straight-line scalar optimizations, which are most
  effective on code along dominator paths.

* Memory-space alias analysis. `This alias analysis
  <http://reviews.llvm.org/D12414>`_ infers that two pointers in different
  special memory spaces do not alias. It has yet to be integrated into the new
  alias analysis infrastructure; the new infrastructure does not run
  target-specific alias analysis.

* Bypassing 64-bit divides. `An existing optimization
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_
  is enabled in the NVPTX backend. 64-bit integer divides are much slower than
  32-bit ones on NVIDIA GPUs due to lack of a divide unit. Many of the 64-bit
  divides in our benchmarks have a divisor and dividend which fit in 32 bits at
  runtime. This optimization provides a fast path for this common case.

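The idea behind bypassing 64-bit divides can be sketched in source form. This
is a hand-written illustration under stated assumptions, not the pass itself:
the real transformation operates on LLVM IR and its exact checks may differ,
the function name ``divide_with_bypass`` is hypothetical, and only unsigned
operands are considered.

.. code-block:: c++

  #include <cstdint>

  // Sketch of the fast path: when neither operand uses its upper 32 bits,
  // a 32-bit divide produces the same quotient as the 64-bit one.
  // The divisor must be nonzero, as for any integer division.
  std::uint64_t divide_with_bypass(std::uint64_t dividend,
                                   std::uint64_t divisor) {
    if (((dividend | divisor) >> 32) == 0) {
      // Fast path: both values fit in 32 bits at runtime.
      return static_cast<std::uint32_t>(dividend) /
             static_cast<std::uint32_t>(divisor);
    }
    // Slow path: full 64-bit divide.
    return dividend / divisor;
  }

A single OR followed by a shift tests both operands at once, so the common
case pays for one cheap check and a 32-bit divide instead of a 64-bit divide.
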
Publication
===========

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)
| `Slides for the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_

Tutorial
========

`CGO 2016 gpucc tutorial <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_

Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <http://llvm.org/docs/#mailing-lists>`_.