<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
<title>Clang - Features and Goals</title>
<link type="text/css" rel="stylesheet" href="menu.css" />
<link type="text/css" rel="stylesheet" href="content.css" />
<style type="text/css">
</style>
</head>
<body>
<!--#include virtual="menu.html.incl"-->
<div id="content">
<!--*************************************************************************-->
<h1>Clang - Features and Goals</h1>
<!--*************************************************************************-->
<p>
This page describes the <a href="index.html#goals">features and goals</a> of
Clang in more detail and gives a broader explanation of what we mean.
These features are:
</p>
<p>End-User Features:</p>
<ul>
<li><a href="#performance">Fast compiles and low memory use</a></li>
<li><a href="#expressivediags">Expressive diagnostics</a></li>
<li><a href="#gcccompat">GCC compatibility</a></li>
</ul>
<p>Utility and Applications:</p>
<ul>
<li><a href="#libraryarch">Library based architecture</a></li>
<li><a href="#diverseclients">Support diverse clients</a></li>
<li><a href="#ideintegration">Integration with IDEs</a></li>
<li><a href="#license">Use the LLVM 'BSD' License</a></li>
</ul>
<p>Internal Design and Implementation:</p>
<ul>
<li><a href="#real">A real-world, production quality compiler</a></li>
<li><a href="#simplecode">A simple and hackable code base</a></li>
<li><a href="#unifiedparser">A single unified parser for C, Objective C, C++,
and Objective C++</a></li>
<li><a href="#conformance">Conformance with C/C++/ObjC and their
variants</a></li>
</ul>
<!--*************************************************************************-->
<h2><a name="enduser">End-User Features</a></h2>
<!--*************************************************************************-->
<!--=======================================================================-->
<h3><a name="performance">Fast compiles and Low Memory Use</a></h3>
<!--=======================================================================-->
<p>A major focus of our work on clang is to make it fast, light and scalable.
The library-based architecture of clang makes it straightforward to time and
profile the cost of each layer of the stack, and the driver has a number of
options for performance analysis.</p>
<p>While there is still much that can be done, we find that the clang front-end
is significantly quicker than gcc and uses less memory. For example, when
compiling "Carbon.h" on Mac OS X, we see that clang is 2.5x faster than GCC:</p>
<img class="img_slide" src="feature-compile1.png" width="400" height="300" />
<p>Carbon.h is a monster: it transitively includes 558 files, 12.3M of code,
declares 10000 functions, has 2000 struct definitions, 8000 fields, 20000 enum
constants, etc. (see slide 25+ of the <a href="clang_video-07-25-2007.html">clang
talk</a> for more information). It is also #include'd into almost every C file
in a GUI app on the Mac, so its compile time is very important.</p>
<p>From the slide above, you can see that we can measure the time to preprocess
the file independently from the time to parse it, and independently from the
time to build the ASTs for the code. GCC doesn't provide a way to measure the
parser without AST building (it only provides -fsyntax-only). In our
measurements, we find that clang's preprocessor is consistently 40% faster than
GCC's, and the parser + AST builder is ~4x faster than GCC's. If you have
sources that do not depend as heavily on the preprocessor (or if you
use Precompiled Headers) you may see a much bigger speedup from clang.
</p>
<p>Compile time performance is important, but when using clang as an API, memory
use is often even more so: the less memory the code takes, the more code you can
fit into memory at a time (useful for whole program analysis tools, for
example).</p>
<img class="img_slide" src="feature-memory1.png" width="400" height="300" />
<p>Here we see a huge advantage of clang: its ASTs take <b>5x less memory</b>
than GCC's syntax trees, despite the fact that clang's ASTs capture far more
source-level information than GCC's trees do. This feat is accomplished through
the use of carefully designed APIs and efficient representations.</p>
<p>In addition to being efficient when pitted head-to-head against GCC in batch
mode, clang is built with a <a href="#libraryarch">library based
architecture</a> that makes it relatively easy to adapt it and build new tools
with it. This means that it is often possible to apply out-of-the-box thinking
and novel techniques to improve compilation in various ways.</p>
<img class="img_slide" src="feature-compile2.png" width="400" height="300" />
<p>This slide shows how the clang preprocessor can be used to make "distcc"
parallelization <b>3x</b> more scalable than when using the GCC preprocessor.
"distcc" quickly bottlenecks on the preprocessor running on the central driver
machine, so a fast preprocessor is very useful. Comparing the first two bars
of each group shows how a ~40% faster preprocessor can reduce preprocessing time
of these large C++ apps by about 40% (shocking!).</p>
<p>The third bar on the slide is the interesting part: it shows how trivial
caching of file system accesses across invocations of the preprocessor allows
clang to reduce time spent in the kernel by 10x, making distcc over 3x more
scalable. This is obviously just one simple hack; doing more interesting things
(like caching tokens across preprocessed files) would yield another substantial
speedup.</p>
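<p>The file system caching idea can be sketched in a few lines of C++. This is
only an illustration under stated assumptions, not the code behind the slide:
the class name "FileExistenceCache" and the injected "probe" callback are
hypothetical, and a real cache would also have to deal with files that change
between invocations.</p>

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical helper, not a Clang class: remembers the answer to
// "does this path exist?" so repeated header lookups skip the kernel.
class FileExistenceCache {
public:
  using Probe = std::function<bool(const std::string &)>;

  // `probe` performs the real (expensive) file system query.
  explicit FileExistenceCache(Probe probe) : probe_(std::move(probe)) {
    assert(probe_ && "a probe function is required");
  }

  bool exists(const std::string &path) {
    auto it = cache_.find(path);
    if (it != cache_.end())
      return it->second;          // hit: no system call needed
    bool found = probe_(path);    // miss: ask the file system once
    cache_.emplace(path, found);  // negative answers are cached as well
    return found;
  }

private:
  Probe probe_;
  std::unordered_map<std::string, bool> cache_;
};
```

<p>Caching negative answers matters here because an include search typically
probes several directories that do not contain the header before it finds the
one that does.</p>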
<p>The clean framework-based design of clang means that many things are possible
that would be very difficult in other systems, for example incremental
compilation, multithreading, intelligent caching, etc. We are only starting
to tap the full potential of the clang design.</p>
<!--=======================================================================-->
<h3><a name="expressivediags">Expressive Diagnostics</a></h3>
<!--=======================================================================-->
<p>In addition to being fast and functional, we aim to make Clang extremely user
friendly. As far as a command-line compiler goes, this basically boils down to
making the diagnostics (error and warning messages) generated by the compiler
be as useful as possible. There are several ways that we do this. This section
talks about the experience provided by the command line compiler, contrasting
Clang output to GCC 4.2's output in several examples.
<!--
Other clients
that embed Clang and extract equivalent information through internal APIs.-->
</p>
<h4>Column Numbers and Caret Diagnostics</h4>
<p>First, all diagnostics produced by clang include full column number
information, and use this to print "caret diagnostics". This is a feature
provided by many commercial compilers, but is generally missing from open source
compilers. This is nice because it makes it very easy to understand exactly
what is wrong in a particular piece of code. For example:</p>
<pre>
$ <b>gcc-4.2 -fsyntax-only -Wformat format-strings.c</b>
format-strings.c:91: warning: too few arguments for format
$ <b>clang -fsyntax-only format-strings.c</b>
format-strings.c:91:13: warning: '.*' specified field precision is missing a matching 'int' argument
<font color="darkgreen"> printf("%.*d");</font>
<font color="blue"> ^</font>
</pre>
<p>The caret (the blue "^" character) exactly shows where the problem is, even
inside of the string. This makes it really easy to jump to the problem and
helps when multiple instances of the same character occur on a line. We'll
revisit this more in the following examples.</p>
<h4>Range Highlighting for Related Text</h4>
<p>Clang captures and accurately tracks range information for expressions,
statements, and other constructs in your program and uses this to make
diagnostics highlight related information. Here's a somewhat
nonsensical example to illustrate this:</p>
<pre>
$ <b>gcc-4.2 -fsyntax-only t.c</b>
t.c:7: error: invalid operands to binary + (have 'int' and 'struct A')
$ <b>clang -fsyntax-only t.c</b>
t.c:7:39: error: invalid operands to binary expression ('int' and 'struct A')
<font color="darkgreen"> return y + func(y ? ((SomeA.X + 40) + SomeA) / 42 + SomeA.X : SomeA.X);</font>
<font color="blue"> ~~~~~~~~~~~~~~ ^ ~~~~~</font>
</pre>
<p>Here you can see that you don't even need to see the original source code to
understand what is wrong based on the Clang error: because clang prints a
caret, you know exactly <em>which</em> plus it is complaining about. The range
information highlights the left and right sides of the plus, which makes it
immediately obvious what the compiler is talking about. This is very useful for
cases involving precedence issues and many other cases.</p>
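<p>The caret and range markers can be thought of as a second line of text
aligned under the source line. The following is a minimal C++ sketch of that
idea, not Clang's actual implementation: the function name "markerLine" and its
parameters are hypothetical, and it assumes 1-based columns with no tab
expansion.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: builds the marker line printed under a source line.
// Columns are 1-based; each range is an inclusive [first, last] column pair.
std::string markerLine(
    std::size_t caretColumn,
    const std::vector<std::pair<std::size_t, std::size_t>> &ranges) {
  assert(caretColumn >= 1 && "columns are 1-based");

  // The line must be wide enough for the caret and for every range.
  std::size_t width = caretColumn;
  for (const auto &r : ranges) {
    assert(r.first >= 1 && r.first <= r.second);
    if (r.second > width)
      width = r.second;
  }

  std::string line(width, ' ');
  for (const auto &r : ranges)
    for (std::size_t c = r.first; c <= r.second; ++c)
      line[c - 1] = '~';            // underline each related range
  line[caretColumn - 1] = '^';      // the caret takes priority over a range
  return line;
}
```

<p>For a source line such as "x + y", a caret at column 3 with ranges covering
columns 1 and 5 yields "~ ^ ~", which is the same shape as the Clang output
shown above.</p>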
<h4>Precision in Wording</h4>
<p>We have tried hard to make the diagnostics that come
out of clang contain exactly the pertinent information about what is wrong and
why. In the example above, we tell you what the inferred types are for
the left and right hand sides, and we don't repeat what is obvious from the
caret (that this is a "binary +"). Many other examples abound; here is a simple
one:</p>
<pre>
$ <b>gcc-4.2 -fsyntax-only t.c</b>
t.c:5: error: invalid type argument of 'unary *'
$ <b>clang -fsyntax-only t.c</b>
t.c:5:11: error: indirection requires pointer operand ('int' invalid)
<font color="darkgreen"> int y = *SomeA.X;</font>
<font color="blue"> ^~~~~~~~</font>
</pre>
<p>In this example, not only do we tell you that there is a problem with the *
and point to it, we say exactly why and tell you what the type is (in case it is
a complicated subexpression, such as a call to an overloaded function). This
sort of attention to detail makes it much easier to understand and fix problems
quickly.</p>
<h4>No Pretty Printing of Expressions in Diagnostics</h4>
<p>Since Clang has range highlighting, it never needs to pretty print your code
back out to you. Pretty printing is particularly bad in G++ (which often emits
errors containing lowered vtable references), but even GCC can produce
inscrutable error messages in some cases when it tries to do this. In this
example P and Q have type "int*":</p>
<pre>
$ <b>gcc-4.2 -fsyntax-only t.c</b>
#'exact_div_expr' not supported by pp_c_expression#'t.c:12: error: called object is not a function
$ <b>clang -fsyntax-only t.c</b>
t.c:12:8: error: called object type 'int' is not a function or function pointer
<font color="darkgreen"> (P-Q)();</font>
<font color="blue"> ~~~~~^</font>
</pre>
<h4>Typedef Preservation and Selective Unwrapping</h4>
<p>Many programmers use high-level user defined types, typedefs, and other
syntactic sugar to refer to types in their program. This is useful because they
can abbreviate otherwise very long types and it is useful to preserve the
typename in diagnostics. However, sometimes very simple typedefs can wrap
trivial types and it is important to strip off the typedef to understand what
is going on. Clang aims to handle both cases well.</p>
<p>Here is an example that shows where it is important to preserve
a typedef in C:</p>
<pre>
$ <b>gcc-4.2 -fsyntax-only t.c</b>
t.c:15: error: invalid operands to binary / (have 'float __vector__' and 'const int *')
$ <b>clang -fsyntax-only t.c</b>
t.c:15:11: error: can't convert between vector values of different size ('__m128' and 'int const *')
<font color="darkgreen"> myvec[1]/P;</font>
<font color="blue"> ~~~~~~~~^~</font>
</pre>
<p>Here the type printed by GCC isn't even valid, but if the error were about a
very long and complicated type (as often happens in C++) the error message would
be ugly just because it was long and hard to read. Here's an example where it
is useful for the compiler to expose underlying details of a typedef:</p>
<pre>
$ <b>gcc-4.2 -fsyntax-only t.c</b>
t.c:13: error: request for member 'x' in something not a structure or union
$ <b>clang -fsyntax-only t.c</b>
t.c:13:9: error: member reference base type 'pid_t' (aka 'int') is not a structure or union
<font color="darkgreen"> myvar = myvar.x;</font>
<font color="blue"> ~~~~~ ^</font>
</pre>
<p>If the user was somehow confused about how the system "pid_t" typedef is
defined, Clang helpfully displays it with "aka".</p>
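<p>The "aka" formatting can be sketched as follows. This is a simplified C++
illustration with hypothetical names ("DisplayType", "formatForDiagnostic").
The rule Clang uses to decide when a typedef is worth unwrapping is not
described on this page, so the sketch simply takes that decision as an input
flag.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified type record for diagnostics.
struct DisplayType {
  std::string spelled;     // the name as written by the user, e.g. "pid_t"
  std::string underlying;  // the type behind the typedef, e.g. "int"
  bool showUnderlying;     // assumed to be decided elsewhere by a heuristic
};

// Quotes the spelled name and appends "(aka '...')" only when requested
// and when the underlying type actually differs from the spelling.
std::string formatForDiagnostic(const DisplayType &t) {
  assert(!t.spelled.empty() && "a type needs a spelling");
  std::string out = "'" + t.spelled + "'";
  if (t.showUnderlying && !t.underlying.empty() && t.underlying != t.spelled)
    out += " (aka '" + t.underlying + "')";
  return out;
}
```

<p>With the flag set, "pid_t" is rendered as 'pid_t' (aka 'int'); with it
cleared, a typedef such as "__m128" is printed under its own name only.</p>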
<h4>Automatic Macro Expansion</h4>
<p>Many errors happen in macros that are sometimes deeply nested. With
traditional compilers, you need to dig deep into the definition of the macro to
understand how you got into trouble. Here's a simple example that shows how
Clang helps you out:</p>
<pre>
$ <b>gcc-4.2 -fsyntax-only t.c</b>
t.c: In function 'test':
t.c:80: error: invalid operands to binary &lt; (have 'struct mystruct' and 'float')
$ <b>clang -fsyntax-only t.c</b>
t.c:80:3: error: invalid operands to binary expression ('typeof(P)' (aka 'struct mystruct') and 'typeof(F)' (aka 'float'))
<font color="darkgreen"> X = MYMAX(P, F);</font>
<font color="blue"> ^~~~~~~~~~~</font>
t.c:76:94: note: instantiated from:
<font color="darkgreen">#define MYMAX(A,B) __extension__ ({ __typeof__(A) __a = (A); __typeof__(B) __b = (B); __a &lt; __b ? __b : __a; })</font>
<font color="blue"> ~~~ ^ ~~~</font>
</pre>
<p>This shows how clang automatically prints instantiation information and
nested range information for diagnostics as they are instantiated through macros
and also shows how some of the other pieces work in a bigger example. Here's
another real world warning that occurs in the "window" Unix package (which
implements the "wwopen" class of APIs):</p>
<pre>
$ <b>clang -fsyntax-only t.c</b>
t.c:22:2: warning: type specifier missing, defaults to 'int'
<font color="darkgreen"> ILPAD();</font>
<font color="blue"> ^</font>
t.c:17:17: note: instantiated from:
<font color="darkgreen">#define ILPAD() PAD((NROW - tt.tt_row) * 10) /* 1 ms per char */</font>
<font color="blue"> ^</font>
t.c:14:2: note: instantiated from:
<font color="darkgreen"> register i; \</font>
<font color="blue"> ^</font>
</pre>
<p>In practice, we've found that this is actually more useful in multiply nested
macros than in simple ones.</p>
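<p>One way to picture how the chain of notes is produced: each location inside
a macro body remembers the location it was expanded from, and the diagnostic
printer walks that chain. The C++ sketch below is an assumption-laden
illustration, not Clang's source location machinery; "Loc", "expandedFrom" and
"renderWithMacroNotes" are hypothetical names, and the chain is assumed to be
acyclic.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical location record: where a token was spelled, plus the index
// of the location it was expanded from (-1 when it is not in a macro body).
struct Loc {
  std::string file;
  int line;
  int column;
  int expandedFrom;  // index into the location table, or -1
};

// Emits the primary diagnostic at the outermost macro use, then one
// "instantiated from" note per level, ending at the spelling location.
std::vector<std::string> renderWithMacroNotes(const std::vector<Loc> &table,
                                              int spelling,
                                              const std::string &message) {
  assert(spelling >= 0 && spelling < static_cast<int>(table.size()));

  // Collect the chain innermost first by following the expansion links.
  std::vector<int> chain;
  for (int i = spelling; i != -1; i = table[i].expandedFrom)
    chain.push_back(i);

  // Print it outermost first, as in the output above.
  std::vector<std::string> out;
  for (auto it = chain.rbegin(); it != chain.rend(); ++it) {
    const Loc &l = table[*it];
    std::string where = l.file + ":" + std::to_string(l.line) + ":" +
                        std::to_string(l.column) + ": ";
    if (it == chain.rbegin())
      out.push_back(where + message);
    else
      out.push_back(where + "note: instantiated from:");
  }
  return out;
}
```

<p>Feeding it the three locations from the ILPAD example (22:2, 17:17 and 14:2,
each expanded from the previous one) reproduces the three header lines of that
diagnostic.</p>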
<h4>Fix-it Hints</h4>
<p>simple example + template&lt;&gt; example</p>
<h4>C++ Fun Examples</h4>
<p>...</p>
<!--=======================================================================-->
<h3><a name="gcccompat">GCC Compatibility</a></h3>
<!--=======================================================================-->
<p>GCC is currently the de facto standard open source compiler, and it
routinely compiles a huge volume of code. GCC supports a huge number of
extensions and features (many of which are undocumented) and a lot of
code and header files depend on these features in order to build.</p>
<p>While it would be nice to be able to ignore these extensions and focus on
implementing the language standards to the letter, pragmatics force us to
support the GCC extensions that see the most use. Many users just want their
code to compile; they don't care to argue about whether it is pedantically C99
or not.</p>
<p>As described in the <a href="#conformance">conformance</a> section, all
extensions are explicitly recognized as such and marked with extension
diagnostics, which can be mapped to warnings, errors, or just ignored.
</p>
<!--*************************************************************************-->
<h2><a name="applications">Utility and Applications</a></h2>
<!--*************************************************************************-->
<!--=======================================================================-->
<h3><a name="libraryarch">Library Based Architecture</a></h3>
<!--=======================================================================-->
<p>A major design concept for clang is its use of a library-based
architecture. In this design, various parts of the front-end can be cleanly
divided into separate libraries which can then be mixed up for different needs
and uses. In addition, the library-based approach encourages good interfaces
and makes it easier for new developers to get involved (because they only need
to understand small pieces of the big picture).</p>
<blockquote>
"The world needs better compiler tools, tools which are built as libraries.
This design point allows reuse of the tools in new and novel ways. However,
building the tools as libraries isn't enough: they must have clean APIs, be as
decoupled from each other as possible, and be easy to modify/extend. This
requires clean layering, decent design, and keeping the libraries independent of
any specific client."</blockquote>
<p>
Currently, clang is divided into the following libraries and tool:
</p>
<ul>
<li><b>libsupport</b> - Basic support library, from LLVM.</li>
<li><b>libsystem</b> - System abstraction library, from LLVM.</li>
<li><b>libbasic</b> - Diagnostics, SourceLocations, SourceBuffer abstraction,
file system caching for input source files.</li>
<li><b>libast</b> - Provides classes to represent the C AST, the C type system,
builtin functions, and various helpers for analyzing and manipulating the
AST (visitors, pretty printers, etc).</li>
<li><b>liblex</b> - Lexing and preprocessing, identifier hash table, pragma
handling, tokens, and macro expansion.</li>
<li><b>libparse</b> - Parsing. This library invokes coarse-grained 'Actions'
provided by the client (e.g. libsema builds ASTs) but knows nothing about
ASTs or other client-specific data structures.</li>
<li><b>libsema</b> - Semantic Analysis. This provides a set of parser actions
to build a standardized AST for programs.</li>
<li><b>libcodegen</b> - Lower the AST to LLVM IR for optimization &amp; code
generation.</li>
<li><b>librewrite</b> - Editing of text buffers (important for code rewriting
transformation, like refactoring).</li>
<li><b>libanalysis</b> - Static analysis support.</li>
<li><b>clang</b> - A driver program, client of the libraries at various
levels.</li>
</ul>
<p>As an example of the power of this library based design: if you wanted to
build a preprocessor, you would take the Basic and Lexer libraries. If you want
an indexer, you would take the previous two and add the Parser library and
some actions for indexing. If you want a refactoring, static analysis, or
source-to-source compiler tool, you would then add the AST building and
semantic analyzer libraries.</p>
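<p>The split between a parser and client-provided actions can be sketched in
C++ as below. This is a toy under stated assumptions, not the libparse API: the
"Actions" interface, "parseDecls" and "Indexer" are hypothetical, and the
"language" is just a sequence of declarations of the form "type name;".</p>

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical callback interface: the parser reports what it recognized
// and knows nothing about the client's data structures.
class Actions {
public:
  virtual ~Actions() = default;
  virtual void onVariableDecl(const std::string &type,
                              const std::string &name) = 0;
};

// Toy parser for a sequence of "<type> <name>;" declarations.
// Returns false at the first syntax error.
bool parseDecls(const std::string &src, Actions &actions) {
  std::size_t pos = 0;
  auto isWordChar = [](char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
  };
  auto skipSpace = [&] {
    while (pos < src.size() &&
           std::isspace(static_cast<unsigned char>(src[pos])))
      ++pos;
  };
  auto word = [&] {
    skipSpace();
    std::size_t start = pos;
    while (pos < src.size() && isWordChar(src[pos]))
      ++pos;
    return src.substr(start, pos - start);
  };

  for (;;) {
    skipSpace();
    if (pos == src.size())
      return true;                        // clean end of input
    std::string type = word();
    std::string name = word();
    skipSpace();
    if (type.empty() || name.empty() || pos == src.size() || src[pos] != ';')
      return false;                       // syntax error
    ++pos;                                // consume ';'
    actions.onVariableDecl(type, name);   // hand the result to the client
  }
}

// One possible client: an indexer that records names and builds no AST.
class Indexer : public Actions {
public:
  std::vector<std::string> names;
  void onVariableDecl(const std::string &,
                      const std::string &name) override {
    names.push_back(name);
  }
};

// Usage sketch.
inline void indexerExample() {
  Indexer indexer;
  bool ok = parseDecls("int x; float y_2;", indexer);
  assert(ok && indexer.names.size() == 2);
}
```

<p>A different client could implement the same interface to build an AST or to
emit code directly, without any change to the parsing function.</p>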
<p>For more information about the low-level implementation details of the
various clang libraries, please see the <a href="docs/InternalsManual.html">
clang Internals Manual</a>.</p>
<!--=======================================================================-->
<h3><a name="diverseclients">Support Diverse Clients</a></h3>
<!--=======================================================================-->
<p>Clang is designed and built with many grand plans for how we can use it. The
driving force is the fact that we use C and C++ daily, and have to suffer due to
a lack of good tools available for them. We believe that the C and C++ tools
ecosystem has been significantly limited by how difficult it is to parse and
represent the source code for these languages, and we aim to rectify this
problem in clang.</p>
<p>The problem with this goal is that different clients have very different
requirements. Consider code generation, for example: a simple front-end that
parses for code generation must analyze the code for validity and emit code
in some intermediate form to pass off to an optimizer or backend. Because
validity analysis and code generation can largely be done on the fly, there is
no hard requirement that the front-end actually build up a full AST for all
the expressions and statements in the code. TCC and GCC are examples of
compilers that either build no real AST (in the former case) or build a stripped
down and simplified AST (in the latter case) because they focus primarily on
codegen.</p>
<p>On the opposite side of the spectrum, some clients (like refactoring) want
highly detailed information about the original source code and want a complete
AST to describe it with. Refactoring wants to have information about macro
expansions, the location of every paren expression '(((x)))' vs 'x', full
position information, and much more. Further, refactoring wants to look
<em>across the whole program</em> to ensure that it is making transformations
that are safe. Making this efficient and getting this right requires a
significant amount of engineering and algorithmic work that is simply
unnecessary for a simple static compiler.</p>
<p>The beauty of the clang approach is that it does not restrict how you use it.
In particular, it is possible to use the clang preprocessor and parser to build
an extremely quick and light-weight on-the-fly code generator (similar to TCC)
that does not build an AST at all. As an intermediate step, clang supports
using the current AST generation and semantic analysis code and having a code
generation client free the AST for each function after code generation. Finally,
clang provides support for building and retaining fully-fledged ASTs, and even
supports writing them out to disk.</p>
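<p>The "free the AST after each function" versus "retain everything" choice is
an ownership policy that the client can make. Here is a minimal C++ sketch of
that idea, with hypothetical types ("FunctionAST", "ASTClient", "CodeGenClient",
"RetainingClient") standing in for the real ones:</p>

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the AST of one function.
struct FunctionAST {
  std::string name;
};

// Hypothetical client interface: the client receives ownership of each
// function's AST and decides how long it lives.
class ASTClient {
public:
  virtual ~ASTClient() = default;
  virtual void handleFunction(std::unique_ptr<FunctionAST> fn) = 0;
};

// Code generation client: emits code, then lets the AST be destroyed.
class CodeGenClient : public ASTClient {
public:
  std::vector<std::string> emitted;
  void handleFunction(std::unique_ptr<FunctionAST> fn) override {
    assert(fn != nullptr);
    emitted.push_back("define " + fn->name);
  }  // fn goes out of scope here, which frees this function's AST
};

// Tooling client: retains every AST for later whole-program queries.
class RetainingClient : public ASTClient {
public:
  std::vector<std::unique_ptr<FunctionAST>> functions;
  void handleFunction(std::unique_ptr<FunctionAST> fn) override {
    assert(fn != nullptr);
    functions.push_back(std::move(fn));
  }
};
```

<p>Because the producer only hands over ownership, the memory behavior is
decided entirely by which client is plugged in.</p>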
<p>Designing the libraries with clean and simple APIs allows these high-level
policy decisions to be determined in the client, instead of forcing "one true
way" in the implementation of any of these libraries. Getting this right is
hard, and we don't always get it right the first time, but we fix any problems
when we realize we made a mistake.</p>
<!--=======================================================================-->
<h3><a name="ideintegration">Integration with IDEs</a></h3>
<!--=======================================================================-->
<p>
We believe that Integrated Development Environments (IDEs) are a great way
to pull together various pieces of the development puzzle, and aim to make clang
work well in such an environment. The chief advantage of IDEs is that they
typically have visibility across your entire project and are long-lived
processes, whereas stand-alone compiler tools are typically invoked on each
individual file in the project, and thus have limited scope.</p>
<p>There are many implications of this difference, but a significant one has to
do with efficiency and caching: sharing an address space across different files
in a project means that you can use intelligent caching and other techniques to
dramatically reduce analysis/compilation time.</p>
<p>A further difference between IDEs and batch compilers is that IDEs often
impose very different requirements on the front-end: they depend on high
performance in order to provide a "snappy" experience, and thus really want
techniques like "incremental compilation", "fuzzy parsing", etc. Finally, IDEs
often have very different requirements than code generation, often requiring
information that a codegen-only frontend can throw away. Clang is
specifically designed and built to capture this information.
</p>
<!--=======================================================================-->
<h3><a name="license">Use the LLVM 'BSD' License</a></h3>
<!--=======================================================================-->
<p>We actively intend for clang (and LLVM as a whole) to be used for
commercial projects, and the BSD license is the simplest way to allow this. We
feel that the license encourages contributors to pick up the source and work
with it, and believe that those individuals and organizations will contribute
back their work if they do not want to have to maintain a fork forever (which is
time consuming and expensive when merges are involved). Further, nobody makes
money on compilers these days, but many people need them to get bigger goals
accomplished: it makes sense for everyone to work together.</p>
<p>For more information about the LLVM/clang license, please see the <a
href="http://llvm.org/docs/DeveloperPolicy.html#license">LLVM License
Description</a>.</p>
<!--*************************************************************************-->
<h2><a name="design">Internal Design and Implementation</a></h2>
<!--*************************************************************************-->
<!--=======================================================================-->
<h3><a name="real">A real-world, production quality compiler</a></h3>
<!--=======================================================================-->
<p>
Clang is designed and built by experienced compiler developers who
are increasingly frustrated with the problems that <a
href="comparison.html">existing open source compilers</a> have. Clang is
carefully and thoughtfully designed and built to provide the foundation of a
whole new generation of C/C++/Objective C development tools, and we intend for
it to be production quality.</p>
<p>Being a production quality compiler means many things: it means being high
performance, being solid and (relatively) bug free, and it means eventually
being used and depended on by a broad range of people. While we are still in
the early development stages, we strongly believe that this will become a
reality.</p>
<!--=======================================================================-->
<h3><a name="simplecode">A simple and hackable code base</a></h3>
<!--=======================================================================-->
<p>Our goal is to make it possible for anyone with a basic understanding
of compilers and working knowledge of the C/C++/ObjC languages to understand and
extend the clang source base. A large part of this falls out of our decision to
make the AST mirror the languages as closely as possible: you have your friendly
if statement, for statement, parenthesis expression, structs, unions, etc, all
represented in a simple and explicit way.</p>
<p>In addition to a simple design, we work to make the source base approachable
by commenting it well, including citations of the language standards where
appropriate, and designing the code for simplicity. Beyond that, clang offers
a set of AST dumpers, printers, and visualizers that make it easy to put code in
and see how it is represented.</p>
<!--=======================================================================-->
<h3><a name="unifiedparser">A single unified parser for C, Objective C, C++,
and Objective C++</a></h3>
<!--=======================================================================-->
<p>Clang is the "C Language Family Front-end", which means we intend to support
the most popular members of the C family. We are convinced that the right
parsing technology for this class of languages is a hand-built recursive-descent
parser. Because it is plain C++ code, recursive descent makes it very easy for
new developers to understand the code, easily supports ad-hoc rules and other
strange hacks required by C/C++, and makes it straightforward to implement
excellent diagnostics and error recovery.</p>
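<p>To show what "plain C++ code" means for a recursive-descent parser, here is
a toy evaluator for arithmetic expressions. It is unrelated to Clang's real
parser; the class "ExprParser" and its grammar are made up for illustration.
Each grammar rule becomes one function, and the position of the first error
falls out of the cursor, which is what makes column-accurate diagnostics easy
to produce.</p>

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Toy recursive-descent evaluator (illustrative only). Grammar:
//   expr   := term   ('+' term)*
//   term   := factor ('*' factor)*
//   factor := NUMBER | '(' expr ')'
class ExprParser {
public:
  explicit ExprParser(std::string text) : text_(std::move(text)) {}

  // Returns true and sets `value` on success. On failure, returns false
  // and sets `errorColumn` (1-based) to where the first error occurred.
  bool parse(long &value, std::size_t &errorColumn) {
    pos_ = 0;
    ok_ = true;
    value = expr();
    skipSpace();
    if (ok_ && pos_ != text_.size())
      fail();                       // trailing input after the expression
    errorColumn = ok_ ? 0 : errorPos_ + 1;
    return ok_;
  }

private:
  long expr() {
    long v = term();
    while (ok_ && peek() == '+') { ++pos_; v += term(); }
    return v;
  }
  long term() {
    long v = factor();
    while (ok_ && peek() == '*') { ++pos_; v *= factor(); }
    return v;
  }
  long factor() {
    char c = peek();
    if (c == '(') {
      ++pos_;
      long v = expr();
      if (ok_ && peek() == ')') ++pos_; else fail();
      return v;
    }
    if (std::isdigit(static_cast<unsigned char>(c))) {
      long v = 0;
      while (pos_ < text_.size() &&
             std::isdigit(static_cast<unsigned char>(text_[pos_])))
        v = v * 10 + (text_[pos_++] - '0');
      return v;
    }
    fail();                         // neither a number nor '('
    return 0;
  }

  char peek() {
    skipSpace();
    return pos_ < text_.size() ? text_[pos_] : '\0';
  }
  void skipSpace() {
    while (pos_ < text_.size() &&
           std::isspace(static_cast<unsigned char>(text_[pos_])))
      ++pos_;
  }
  void fail() {                     // only the first error is recorded
    assert(pos_ <= text_.size());
    if (ok_) { ok_ = false; errorPos_ = pos_; }
  }

  std::string text_;
  std::size_t pos_ = 0;
  std::size_t errorPos_ = 0;
  bool ok_ = true;
};
```

<p>Parsing "2 + 3 * (4 + 1)" yields 17, while "1 + * 2" fails with the error
column pointing at the stray "*" (column 5).</p>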
<p>We believe that implementing C/C++/ObjC in a single unified parser makes the
end result easier to maintain and evolve than maintaining separate C and C++
parsers which must be bugfixed and maintained independently of each other.</p>
<!--=======================================================================-->
<h3><a name="conformance">Conformance with C/C++/ObjC and their
variants</a></h3>
<!--=======================================================================-->
<p>When you start work on implementing a language, you find out that there is a
huge gap between how the language works and how most people understand it to
work. This gap is the difference between a normal programmer and a (scary?
super-natural?) "language lawyer", who knows the ins and outs of the language
and can grok standardese with ease.</p>
<p>In practice, being conformant with the languages means that we aim to support
the full language, including the dark and dusty corners (like trigraphs,
preprocessor arcana, C99 VLAs, etc). Where we support extensions above and
beyond what the standard officially allows, we make an effort to explicitly call
this out in the code and emit warnings about it (which are disabled by default,
but can optionally be mapped to either warnings or errors), allowing you to use
clang in "strict" mode if you desire.</p>
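<p>The mapping of extension diagnostics to "ignored", "warning" or "error" can
be pictured as a small policy table. The C++ sketch below is hypothetical and
is not Clang's diagnostic engine: "ExtensionPolicy", "Severity" and the
extension names used with it are invented for the example.</p>

```cpp
#include <cassert>
#include <map>
#include <string>

enum class Severity { Ignored, Warning, Error };

// Hypothetical policy table. Extension names used with it (for example
// "statement-expression") are made up for this sketch.
class ExtensionPolicy {
public:
  // `fallback` applies to every extension without an explicit mapping.
  explicit ExtensionPolicy(Severity fallback = Severity::Ignored)
      : fallback_(fallback) {}

  void setSeverity(const std::string &extension, Severity severity) {
    overrides_[extension] = severity;
  }

  Severity severityOf(const std::string &extension) const {
    auto it = overrides_.find(extension);
    return it != overrides_.end() ? it->second : fallback_;
  }

private:
  Severity fallback_;
  std::map<std::string, Severity> overrides_;
};

// Usage sketch: a "strict" configuration rejects every extension except
// one that the project has chosen to tolerate with a warning.
inline void strictModeExample() {
  ExtensionPolicy strict(Severity::Error);
  strict.setSeverity("statement-expression", Severity::Warning);
  assert(strict.severityOf("statement-expression") == Severity::Warning);
  assert(strict.severityOf("some-other-extension") == Severity::Error);
}
```

<p>The default-constructed policy ignores extensions, matching the "disabled by
default" behavior described above, and a strict mode is just a different
fallback value.</p>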
<p>We also intend to support "dialects" of these languages, such as C89, K&amp;R
C, C++03, Objective-C 2, etc.</p>
</div>
</body>
</html>