Blame - llvm/docs/CompileCudaWithLLVM.rst - toolchain/llvm-project

blob: 5ed3f14005adfcf33f8c76008fa0289b6bd293ac [file] [log] [blame]

Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	1	===================================
				2	Compiling CUDA C/C++ with LLVM
				3	===================================
				4
				5	.. contents::
				6	:local:
				7
				8	Introduction
				9	============
				10
				11	This document contains the user guides and the internals of compiling CUDA
				12	C/C++ with LLVM. It is aimed at both users who want to compile CUDA with LLVM
				13	and developers who want to improve LLVM for GPUs. This document assumes a basic
				14	familiarity with CUDA. Information about CUDA programming can be found in the
				15	`CUDA programming guide
				16	<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.
				17
				18	How to Build LLVM with CUDA Support
				19	===================================
				20
Jingyue Wu	313496b	2016-01-30 23:48:47 +0000	[diff] [blame]	21	CUDA support is still in development and works the best in the trunk version
				22	of LLVM. Below is a quick summary of downloading and building the trunk
				23	version. Consult the `Getting Started
				24	<http://llvm.org/docs/GettingStarted.html>`_ page for more details on setting
				25	up LLVM.
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	26
				27	#. Checkout LLVM
				28
				29	.. code-block:: console
				30
				31	$ cd where-you-want-llvm-to-live
				32	$ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
				33
				34	#. Checkout Clang
				35
				36	.. code-block:: console
				37
				38	$ cd where-you-want-llvm-to-live
				39	$ cd llvm/tools
				40	$ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
				41
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	42	#. Configure and build LLVM and Clang
				43
				44	.. code-block:: console
				45
				46	$ cd where-you-want-llvm-to-live
				47	$ mkdir build
				48	$ cd build
				49	$ cmake [options] ..
				50	$ make
				51
				52	How to Compile CUDA C/C++ with LLVM
				53	===================================
				54
				55	We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
Jingyue Wu	f190ed4	2016-03-30 05:05:40 +0000	[diff] [blame^]	56	CUDA installation guide
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	57	<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
				58	you have not.
				59
				60	Suppose you want to compile and run the following CUDA program (``axpy.cu``)
				61	which multiplies a ``float`` array by a ``float`` scalar (AXPY).
				62
				63	.. code-block:: c++
				64
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	65	#include <iostream>
				66
				67	__global__ void axpy(float a, float* x, float* y) {
				68	y[threadIdx.x] = a * x[threadIdx.x];
				69	}
				70
				71	int main(int argc, char* argv[]) {
				72	const int kDataLen = 4;
				73
				74	float a = 2.0f;
				75	float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
				76	float host_y[kDataLen];
				77
				78	// Copy input data to device.
				79	float* device_x;
				80	float* device_y;
Jingyue Wu	313496b	2016-01-30 23:48:47 +0000	[diff] [blame]	81	cudaMalloc(&device_x, kDataLen * sizeof(float));
				82	cudaMalloc(&device_y, kDataLen * sizeof(float));
				83	cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
				84	cudaMemcpyHostToDevice);
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	85
				86	// Launch the kernel.
				87	axpy<<<1, kDataLen>>>(a, device_x, device_y);
				88
				89	// Copy output data to host.
Jingyue Wu	313496b	2016-01-30 23:48:47 +0000	[diff] [blame]	90	cudaDeviceSynchronize();
				91	cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
				92	cudaMemcpyDeviceToHost);
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	93
				94	// Print the results.
				95	for (int i = 0; i < kDataLen; ++i) {
				96	std::cout << "y[" << i << "] = " << host_y[i] << "\n";
				97	}
				98
Jingyue Wu	313496b	2016-01-30 23:48:47 +0000	[diff] [blame]	99	cudaDeviceReset();
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	100	return 0;
				101	}
				102
				103	The command line for compilation is similar to what you would use for C++.
				104
				105	.. code-block:: console
				106
Jingyue Wu	313496b	2016-01-30 23:48:47 +0000	[diff] [blame]	107	$ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
				108	-L<CUDA install path>/<lib64 or lib> \
				109	-lcudart_static -ldl -lrt -pthread
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	110	$ ./axpy
				111	y[0] = 2
				112	y[1] = 4
				113	y[2] = 6
				114	y[3] = 8
				115
Jingyue Wu	313496b	2016-01-30 23:48:47 +0000	[diff] [blame]	116	``<CUDA install path>`` is the root directory where you installed CUDA SDK,
				117	typically ``/usr/local/cuda``. ``<GPU arch>`` is `the compute capability of
				118	your GPU <https://developer.nvidia.com/cuda-gpus>`_. For example, if you want
				119	to run your program on a GPU with compute capability of 3.5, you should specify
				120	``--cuda-gpu-arch=sm_35``.
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	121
Justin Lebar	32835c8	2016-03-21 23:05:15 +0000	[diff] [blame]	122	Detecting clang vs NVCC
				123	=======================
				124
				125	Although clang's CUDA implementation is largely compatible with NVCC's, you may
				126	still want to detect when you're compiling CUDA code specifically with clang.
				127
Justin Lebar	068a794	2016-03-23 22:43:10 +0000	[diff] [blame]	128	This is tricky, because NVCC may invoke clang as part of its own compilation
				129	process! For example, NVCC uses the host compiler's preprocessor when
				130	compiling for device code, and that host compiler may in fact be clang.
Justin Lebar	32835c8	2016-03-21 23:05:15 +0000	[diff] [blame]	131
				132	When clang is actually compiling CUDA code -- rather than being used as a
				133	subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
				134	defined only in device mode (but will be defined if NVCC is using clang as a
				135	preprocessor). So you can use the following incantations to detect clang CUDA
				136	compilation, in host and device modes:
				137
				138	.. code-block:: c++
				139
				140	#if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
				141	// clang compiling CUDA code, host mode.
				142	#endif
				143
				144	#if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
				145	// clang compiling CUDA code, device mode.
				146	#endif
				147
Justin Lebar	068a794	2016-03-23 22:43:10 +0000	[diff] [blame]	148	Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can
				149	detect NVCC specifically by looking for ``__NVCC__``.
Justin Lebar	32835c8	2016-03-21 23:05:15 +0000	[diff] [blame]	150
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	151	Optimizations
				152	=============
				153
				154	CPU and GPU have different design philosophies and architectures. For example, a
				155	typical CPU has branch prediction, out-of-order execution, and is superscalar,
				156	whereas a typical GPU has none of these. Due to such differences, an
				157	optimization pipeline well-tuned for CPUs may be not suitable for GPUs.
				158
				159	LLVM performs several general and CUDA-specific optimizations for GPUs. The
				160	list below shows some of the more important optimizations for GPUs. Most of
				161	them have been upstreamed to ``lib/Transforms/Scalar`` and
				162	``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
				163	customizable target-independent optimization pipeline.
				164
				165	* Straight-line scalar optimizations. These optimizations reduce redundancy
				166	in straight-line code. Details can be found in the `design document for
				167	straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.
				168
				169	* Inferring memory spaces. `This optimization
Jingyue Wu	f190ed4	2016-03-30 05:05:40 +0000	[diff] [blame^]	170	<https://github.com/llvm-mirror/llvm/blob/master/lib/Target/NVPTX/NVPTXInferAddressSpaces.cpp>`_
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	171	infers the memory space of an address so that the backend can emit faster
Jingyue Wu	f190ed4	2016-03-30 05:05:40 +0000	[diff] [blame^]	172	special loads and stores from it.
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	173
				174	* Aggressive loop unrooling and function inlining. Loop unrolling and
				175	function inlining need to be more aggressive for GPUs than for CPUs because
				176	control flow transfer in GPU is more expensive. They also promote other
				177	optimizations such as constant propagation and SROA which sometimes speed up
				178	code by over 10x. An empirical inline threshold for GPUs is 1100. This
				179	configuration has yet to be upstreamed with a target-specific optimization
				180	pipeline. LLVM also provides `loop unrolling pragmas
				181	<http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
				182	and ``__attribute__((always_inline))`` for programmers to force unrolling and
				183	inling.
				184
				185	* Aggressive speculative execution. `This transformation
				186	<http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ is
				187	mainly for promoting straight-line scalar optimizations which are most
				188	effective on code along dominator paths.
				189
				190	* Memory-space alias analysis. `This alias analysis
Jingyue Wu	03d90e5	2015-11-18 22:01:44 +0000	[diff] [blame]	191	<http://reviews.llvm.org/D12414>`_ infers that two pointers in different
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	192	special memory spaces do not alias. It has yet to be integrated to the new
				193	alias analysis infrastructure; the new infrastructure does not run
				194	target-specific alias analysis.
				195
				196	* Bypassing 64-bit divides. `An existing optimization
				197	<http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_
				198	enabled in the NVPTX backend. 64-bit integer divides are much slower than
				199	32-bit ones on NVIDIA GPUs due to lack of a divide unit. Many of the 64-bit
				200	divides in our benchmarks have a divisor and dividend which fit in 32-bits at
				201	runtime. This optimization provides a fast path for this common case.
Jingyue Wu	bec7818	2016-02-23 23:34:49 +0000	[diff] [blame]	202
Jingyue Wu	f190ed4	2016-03-30 05:05:40 +0000	[diff] [blame^]	203	Publication
				204	===========
				205
				206	\| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
				207	\| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
				208	\| Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)
				209	\| `Slides for the CGO talk <http://wujingyue.com/docs/gpucc-talk.pdf>`_
				210
				211	Tutorial
				212	========
				213
				214	`CGO 2016 gpucc tutorial <http://wujingyue.com/docs/gpucc-tutorial.pdf>`_
				215
Jingyue Wu	bec7818	2016-02-23 23:34:49 +0000	[diff] [blame]	216	Obtaining Help
				217	==============
				218
				219	To obtain help on LLVM in general and its CUDA support, see `the LLVM
				220	community <http://llvm.org/docs/#mailing-lists>`_.