Blame - llvm/docs/CompileCudaWithLLVM.rst - toolchain/llvm-project

blob: 7c1c8f48f7ba1d54ee1e0498ca18a35bb8cd9da1 [file] [log] [blame]

Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	1	===================================
				2	Compiling CUDA C/C++ with LLVM
				3	===================================
				4
				5	.. contents::
				6	:local:
				7
				8	Introduction
				9	============
				10
				11	This document contains the user guides and the internals of compiling CUDA
				12	C/C++ with LLVM. It is aimed at both users who want to compile CUDA with LLVM
				13	and developers who want to improve LLVM for GPUs. This document assumes a basic
				14	familiarity with CUDA. Information about CUDA programming can be found in the
				15	`CUDA programming guide
				16	<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.
				17
				18	How to Build LLVM with CUDA Support
				19	===================================
				20
Jingyue Wu	1cbe0e2	2015-11-18 01:59:05 +0000	[diff] [blame]	21	Below is a quick summary of downloading and building LLVM. Consult the `Getting
				22	Started <http://llvm.org/docs/GettingStarted.html>`_ page for more details on
				23	setting up LLVM.
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	24
				25	#. Checkout LLVM
				26
				27	.. code-block:: console
				28
				29	$ cd where-you-want-llvm-to-live
				30	$ svn co http://llvm.org/svn/llvm-project/llvm/trunk llvm
				31
				32	#. Checkout Clang
				33
				34	.. code-block:: console
				35
				36	$ cd where-you-want-llvm-to-live
				37	$ cd llvm/tools
				38	$ svn co http://llvm.org/svn/llvm-project/cfe/trunk clang
				39
Jingyue Wu	4f2a6cb	2015-11-10 22:35:47 +0000	[diff] [blame]	40	#. Configure and build LLVM and Clang
				41
				42	.. code-block:: console
				43
				44	$ cd where-you-want-llvm-to-live
				45	$ mkdir build
				46	$ cd build
				47	$ cmake [options] ..
				48	$ make
				49
				50	How to Compile CUDA C/C++ with LLVM
				51	===================================
				52
				53	We assume you have installed the CUDA driver and runtime. Consult the `NVIDIA
				54	CUDA installation Guide
				55	<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ if
				56	you have not.
				57
				58	Suppose you want to compile and run the following CUDA program (``axpy.cu``)
				59	which multiplies a ``float`` array by a ``float`` scalar (AXPY).
				60
				61	.. code-block:: c++
				62
				63	#include <helper_cuda.h> // for checkCudaErrors
				64
				65	#include <iostream>
				66
				67	__global__ void axpy(float a, float* x, float* y) {
				68	y[threadIdx.x] = a * x[threadIdx.x];
				69	}
				70
				71	int main(int argc, char* argv[]) {
				72	const int kDataLen = 4;
				73
				74	float a = 2.0f;
				75	float host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
				76	float host_y[kDataLen];
				77
				78	// Copy input data to device.
				79	float* device_x;
				80	float* device_y;
				81	checkCudaErrors(cudaMalloc(&device_x, kDataLen * sizeof(float)));
				82	checkCudaErrors(cudaMalloc(&device_y, kDataLen * sizeof(float)));
				83	checkCudaErrors(cudaMemcpy(device_x, host_x, kDataLen * sizeof(float),
				84	cudaMemcpyHostToDevice));
				85
				86	// Launch the kernel.
				87	axpy<<<1, kDataLen>>>(a, device_x, device_y);
				88
				89	// Copy output data to host.
				90	checkCudaErrors(cudaDeviceSynchronize());
				91	checkCudaErrors(cudaMemcpy(host_y, device_y, kDataLen * sizeof(float),
				92	cudaMemcpyDeviceToHost));
				93
				94	// Print the results.
				95	for (int i = 0; i < kDataLen; ++i) {
				96	std::cout << "y[" << i << "] = " << host_y[i] << "\n";
				97	}
				98
				99	checkCudaErrors(cudaDeviceReset());
				100	return 0;
				101	}
				102
				103	The command line for compilation is similar to what you would use for C++.
				104
				105	.. code-block:: console
				106
				107	$ clang++ -o axpy -I<CUDA install path>/samples/common/inc -L<CUDA install path>/<lib64 or lib> axpy.cu -lcudart_static -lcuda -ldl -lrt -pthread
				108	$ ./axpy
				109	y[0] = 2
				110	y[1] = 4
				111	y[2] = 6
				112	y[3] = 8
				113
				114	Note that ``helper_cuda.h`` comes from the CUDA samples, so you need the
				115	samples installed for this example. ``<CUDA install path>`` is the root
				116	directory where you installed CUDA SDK, typically ``/usr/local/cuda``.
				117
				118	Optimizations
				119	=============
				120
				121	CPU and GPU have different design philosophies and architectures. For example, a
				122	typical CPU has branch prediction, out-of-order execution, and is superscalar,
				123	whereas a typical GPU has none of these. Due to such differences, an
				124	optimization pipeline well-tuned for CPUs may be not suitable for GPUs.
				125
				126	LLVM performs several general and CUDA-specific optimizations for GPUs. The
				127	list below shows some of the more important optimizations for GPUs. Most of
				128	them have been upstreamed to ``lib/Transforms/Scalar`` and
				129	``lib/Target/NVPTX``. A few of them have not been upstreamed due to lack of a
				130	customizable target-independent optimization pipeline.
				131
				132	* Straight-line scalar optimizations. These optimizations reduce redundancy
				133	in straight-line code. Details can be found in the `design document for
				134	straight-line scalar optimizations <https://goo.gl/4Rb9As>`_.
				135
				136	* Inferring memory spaces. `This optimization
				137	<http://www.llvm.org/docs/doxygen/html/NVPTXFavorNonGenericAddrSpaces_8cpp_source.html>`_
				138	infers the memory space of an address so that the backend can emit faster
				139	special loads and stores from it. Details can be found in the `design
				140	document for memory space inference <https://goo.gl/5wH2Ct>`_.
				141
				142	* Aggressive loop unrooling and function inlining. Loop unrolling and
				143	function inlining need to be more aggressive for GPUs than for CPUs because
				144	control flow transfer in GPU is more expensive. They also promote other
				145	optimizations such as constant propagation and SROA which sometimes speed up
				146	code by over 10x. An empirical inline threshold for GPUs is 1100. This
				147	configuration has yet to be upstreamed with a target-specific optimization
				148	pipeline. LLVM also provides `loop unrolling pragmas
				149	<http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
				150	and ``__attribute__((always_inline))`` for programmers to force unrolling and
				151	inling.
				152
				153	* Aggressive speculative execution. `This transformation
				154	<http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_ is
				155	mainly for promoting straight-line scalar optimizations which are most
				156	effective on code along dominator paths.
				157
				158	* Memory-space alias analysis. `This alias analysis
				159	<http://llvm.org/docs/NVPTXUsage.html>`_ infers that two pointers in different
				160	special memory spaces do not alias. It has yet to be integrated to the new
				161	alias analysis infrastructure; the new infrastructure does not run
				162	target-specific alias analysis.
				163
				164	* Bypassing 64-bit divides. `An existing optimization
				165	<http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_
				166	enabled in the NVPTX backend. 64-bit integer divides are much slower than
				167	32-bit ones on NVIDIA GPUs due to lack of a divide unit. Many of the 64-bit
				168	divides in our benchmarks have a divisor and dividend which fit in 32-bits at
				169	runtime. This optimization provides a fast path for this common case.