blob: aa96c0df40c347d1c9e510aa170914aef1193729 [file] [log] [blame]
Ted Kremenek17a295d2008-06-11 06:19:49 +00001<html>
2<head>
Chris Lattner552de0a2008-11-23 08:16:56 +00003<title>"Clang" CFE Internals Manual</title>
Ted Kremenek17a295d2008-06-11 06:19:49 +00004<link type="text/css" rel="stylesheet" href="../menu.css" />
5<link type="text/css" rel="stylesheet" href="../content.css" />
Sebastian Redl68168562008-11-22 22:16:45 +00006<style type="text/css">
7td {
8 vertical-align: top;
9}
10</style>
Ted Kremenek17a295d2008-06-11 06:19:49 +000011</head>
12<body>
13
14<!--#include virtual="../menu.html.incl"-->
15
16<div id="content">
Chris Lattner86920d32007-07-31 05:42:17 +000017
Chris Lattner552de0a2008-11-23 08:16:56 +000018<h1>"Clang" CFE Internals Manual</h1>
Chris Lattner86920d32007-07-31 05:42:17 +000019
20<ul>
21<li><a href="#intro">Introduction</a></li>
22<li><a href="#libsystem">LLVM System and Support Libraries</a></li>
Chris Lattner552de0a2008-11-23 08:16:56 +000023<li><a href="#libbasic">The Clang 'Basic' Library</a>
Chris Lattner86920d32007-07-31 05:42:17 +000024 <ul>
Chris Lattner62fd2782008-11-22 21:41:31 +000025 <li><a href="#Diagnostics">The Diagnostics Subsystem</a></li>
Chris Lattner86920d32007-07-31 05:42:17 +000026 <li><a href="#SourceLocation">The SourceLocation and SourceManager
27 classes</a></li>
28 </ul>
29</li>
30<li><a href="#liblex">The Lexer and Preprocessor Library</a>
31 <ul>
32 <li><a href="#Token">The Token class</a></li>
33 <li><a href="#Lexer">The Lexer class</a></li>
Chris Lattner79281252008-03-09 02:27:26 +000034 <li><a href="#TokenLexer">The TokenLexer class</a></li>
Chris Lattner86920d32007-07-31 05:42:17 +000035 <li><a href="#MultipleIncludeOpt">The MultipleIncludeOpt class</a></li>
36 </ul>
37</li>
38<li><a href="#libparse">The Parser Library</a>
39 <ul>
40 </ul>
41</li>
42<li><a href="#libast">The AST Library</a>
43 <ul>
44 <li><a href="#Type">The Type class and its subclasses</a></li>
45 <li><a href="#QualType">The QualType class</a></li>
Douglas Gregor2e1cd422008-11-17 14:58:09 +000046 <li><a href="#DeclarationName">Declaration names</a></li>
Douglas Gregor074149e2009-01-05 19:45:36 +000047 <li><a href="#DeclContext">Declaration contexts</a>
48 <ul>
49 <li><a href="#Redeclarations">Redeclarations and Overloads</a></li>
50 <li><a href="#LexicalAndSemanticContexts">Lexical and Semantic
51 Contexts</a></li>
52 <li><a href="#TransparentContexts">Transparent Declaration Contexts</a></li>
53 <li><a href="#MultiDeclContext">Multiply-Defined Declaration Contexts</a></li>
54 </ul>
55 </li>
Ted Kremenek8bc05712007-10-10 23:01:43 +000056 <li><a href="#CFG">The CFG class</a></li>
Chris Lattner7bad1992008-11-16 21:48:07 +000057 <li><a href="#Constants">Constant Folding in the Clang AST</a></li>
Chris Lattner86920d32007-07-31 05:42:17 +000058 </ul>
59</li>
60</ul>
61
62
63<!-- ======================================================================= -->
64<h2 id="intro">Introduction</h2>
65<!-- ======================================================================= -->
66
67<p>This document describes some of the more important APIs and internal design
Chris Lattner552de0a2008-11-23 08:16:56 +000068decisions made in the Clang C front-end. The purpose of this document is to
Chris Lattner86920d32007-07-31 05:42:17 +000069both capture some of this high level information and also describe some of the
70design decisions behind it. This is meant for people interested in hacking on
Chris Lattner552de0a2008-11-23 08:16:56 +000071Clang, not for end-users. The description below is categorized by
Chris Lattner86920d32007-07-31 05:42:17 +000072libraries, and does not describe any of the clients of the libraries.</p>
73
74<!-- ======================================================================= -->
75<h2 id="libsystem">LLVM System and Support Libraries</h2>
76<!-- ======================================================================= -->
77
Chris Lattner552de0a2008-11-23 08:16:56 +000078<p>The LLVM libsystem library provides the basic Clang system abstraction layer,
Chris Lattner86920d32007-07-31 05:42:17 +000079which is used for file system access. The LLVM libsupport library provides many
80underlying libraries and <a
81href="http://llvm.org/docs/ProgrammersManual.html">data-structures</a>,
82 including command line option
83processing and various containers.</p>
84
85<!-- ======================================================================= -->
Chris Lattner552de0a2008-11-23 08:16:56 +000086<h2 id="libbasic">The Clang 'Basic' Library</h2>
Chris Lattner86920d32007-07-31 05:42:17 +000087<!-- ======================================================================= -->
88
89<p>This library certainly needs a better name. The 'basic' library contains a
90number of low-level utilities for tracking and manipulating source buffers,
91locations within the source buffers, diagnostics, tokens, target abstraction,
92and information about the subset of the language being compiled for.</p>
93
94<p>Part of this infrastructure is specific to C (such as the TargetInfo class),
95other parts could be reused for other non-C-based languages (SourceLocation,
96SourceManager, Diagnostics, FileManager). When and if there is future demand
97we can figure out if it makes sense to introduce a new library, move the general
98classes somewhere else, or introduce some other solution.</p>
99
100<p>We describe the roles of these classes in order of their dependencies.</p>
101
Chris Lattner62fd2782008-11-22 21:41:31 +0000102
103<!-- ======================================================================= -->
104<h3 id="Diagnostics">The Diagnostics Subsystem</h3>
105<!-- ======================================================================= -->
106
107<p>The Clang Diagnostics subsystem is an important part of how the compiler
108communicates with the human. Diagnostics are the warnings and errors produced
109when the code is incorrect or dubious. In Clang, each diagnostic produced has
110(at the minimum) a unique ID, a <a href="#SourceLocation">SourceLocation</a> to
111"put the caret", an English translation associated with it, and a severity (e.g.
112<tt>WARNING</tt> or <tt>ERROR</tt>). They can also optionally include a number
113of arguments to the dianostic (which fill in "%0"'s in the string) as well as a
114number of source ranges that related to the diagnostic.</p>
115
Chris Lattner552de0a2008-11-23 08:16:56 +0000116<p>In this section, we'll be giving examples produced by the Clang command line
Chris Lattner62fd2782008-11-22 21:41:31 +0000117driver, but diagnostics can be <a href="#DiagnosticClient">rendered in many
118different ways</a> depending on how the DiagnosticClient interface is
119implemented. A representative example of a diagonstic is:</p>
120
121<pre>
122t.c:38:15: error: invalid operands to binary expression ('int *' and '_Complex float')
123 <font color="darkgreen">P = (P-42) + Gamma*4;</font>
124 <font color="blue">~~~~~~ ^ ~~~~~~~</font>
125</pre>
126
127<p>In this example, you can see the English translation, the severity (error),
128you can see the source location (the caret ("^") and file/line/column info),
129the source ranges "~~~~", arguments to the diagnostic ("int*" and "_Complex
130float"). You'll have to believe me that there is a unique ID backing the
131diagnostic :).</p>
132
133<p>Getting all of this to happen has several steps and involves many moving
134pieces, this section describes them and talks about best practices when adding
135a new diagnostic.</p>
136
137<!-- ============================ -->
138<h4>The DiagnosticKinds.def file</h4>
139<!-- ============================ -->
140
141<p>Diagnostics are created by adding an entry to the <tt><a
142href="http://llvm.org/svn/llvm-project/cfe/trunk/include/clang/Basic/DiagnosticKinds.def"
143>DiagnosticKinds.def</a></tt> file. This file encodes the unique ID of the
144diagnostic (as an enum, the first argument), the severity of the diagnostic
145(second argument) and the English translation + format string.</p>
146
147<p>There is little sanity with the naming of the unique ID's right now. Some
148start with err_, warn_, ext_ to encode the severity into the name. Since the
149enum is referenced in the C++ code that produces the diagnostic, it is somewhat
150useful for it to be reasonably short.</p>
151
152<p>The severity of the diagnostic comes from the set {<tt>NOTE</tt>,
153<tt>WARNING</tt>, <tt>EXTENSION</tt>, <tt>EXTWARN</tt>, <tt>ERROR</tt>}. The
154<tt>ERROR</tt> severity is used for diagnostics indicating the program is never
155acceptable under any circumstances. When an error is emitted, the AST for the
156input code may not be fully built. The <tt>EXTENSION</tt> and <tt>EXTWARN</tt>
157severities are used for extensions to the language that Clang accepts. This
158means that Clang fully understands and can represent them in the AST, but we
159produce diagnostics to tell the user their code is non-portable. The difference
160is that the former are ignored by default, and the later warn by default. The
161<tt>WARNING</tt> severity is used for constructs that are valid in the currently
162selected source language but that are dubious in some way. The <tt>NOTE</tt>
163level is used to staple more information onto a previous diagnostics.</p>
164
165<p>These <em>severities</em> are mapped into a smaller set (the
166Diagnostic::Level enum, {<tt>Ignored</tt>, <tt>Note</tt>, <tt>Warning</tt>,
167<tt>Error</tt> }) of output <em>levels</em> by the diagnostics subsystem based
168on various configuration options. For example, if the user specifies
169<tt>-pedantic</tt>, <tt>EXTENSION</tt> maps to <tt>Warning</tt>, if they specify
170<tt>-pedantic-errors</tt>, it turns into <tt>Error</tt>. Clang also internally
171supports a fully fine grained mapping mechanism that allows you to map any
172diagnostic that doesn't have <tt>ERRROR</tt> severity to any output level that
173you want. This is used to implement options like <tt>-Wunused_macros</tt>,
174<tt>-Wundef</tt> etc.</p>
175
176<!-- ================= -->
177<h4>The Format String</h4>
178<!-- ================= -->
179
180<p>The format string for the diagnostic is very simple, but it has some power.
181It takes the form of a string in English with markers that indicate where and
182how arguments to the diagnostic are inserted and formatted. For example, here
183are some simple format strings:</p>
184
185<pre>
186 "binary integer literals are an extension"
187 "format string contains '\\0' within the string body"
188 "more '<b>%%</b>' conversions than data arguments"
Chris Lattner545b3682008-11-23 20:27:13 +0000189 "invalid operands to binary expression (<b>%0</b> and <b>%1</b>)"
Chris Lattner62fd2782008-11-22 21:41:31 +0000190 "overloaded '<b>%0</b>' must be a <b>%select{unary|binary|unary or binary}2</b> operator"
191 " (has <b>%1</b> parameter<b>%s1</b>)"
192</pre>
193
194<p>These examples show some important points of format strings. You can use any
195 plain ASCII character in the diagnostic string except "%" without a problem,
196 but these are C strings, so you have to use and be aware of all the C escape
197 sequences (as in the second example). If you want to produce a "%" in the
198 output, use the "%%" escape sequence, like the third diagnostic. Finally,
Chris Lattner552de0a2008-11-23 08:16:56 +0000199 Clang uses the "%...[digit]" sequences to specify where and how arguments to
Chris Lattner62fd2782008-11-22 21:41:31 +0000200 the diagnostic are formatted.</p>
201
202<p>Arguments to the diagnostic are numbered according to how they are specified
203 by the C++ code that <a href="#producingdiag">produces them</a>, and are
204 referenced by <tt>%0</tt> .. <tt>%9</tt>. If you have more than 10 arguments
Chris Lattner552de0a2008-11-23 08:16:56 +0000205 to your diagnostic, you are doing something wrong :). Unlike printf, there
Chris Lattner62fd2782008-11-22 21:41:31 +0000206 is no requirement that arguments to the diagnostic end up in the output in
207 the same order as they are specified, you could have a format string with
208 <tt>"%1 %0"</tt> that swaps them, for example. The text in between the
209 percent and digit are formatting instructions. If there are no instructions,
210 the argument is just turned into a string and substituted in.</p>
211
212<p>Here are some "best practices" for writing the English format string:</p>
213
214<ul>
215<li>Keep the string short. It should ideally fit in the 80 column limit of the
216 <tt>DiagnosticKinds.def</tt> file. This avoids the diagnostic wrapping when
217 printed, and forces you to think about the important point you are conveying
218 with the diagnostic.</li>
219<li>Take advantage of location information. The user will be able to see the
220 line and location of the caret, so you don't need to tell them that the
221 problem is with the 4th argument to the function: just point to it.</li>
222<li>Do not capitalize the diagnostic string, and do not end it with a
223 period.</li>
224<li>If you need to quote something in the diagnostic string, use single
225 quotes.</li>
226</ul>
227
228<p>Diagnostics should never take random English strings as arguments: you
229shouldn't use <tt>"you have a problem with %0"</tt> and pass in things like
230<tt>"your argument"</tt> or <tt>"your return value"</tt> as arguments. Doing
231this prevents <a href="translation">translating</a> the Clang diagnostics to
232other languages (because they'll get random English words in their otherwise
233localized diagnostic). The exceptions to this are C/C++ language keywords
234(e.g. auto, const, mutable, etc) and C/C++ operators (<tt>/=</tt>). Note
235that things like "pointer" and "reference" are not keywords. On the other
236hand, you <em>can</em> include anything that comes from the user's source code,
Chris Lattner552de0a2008-11-23 08:16:56 +0000237including variable names, types, labels, etc. The 'select' format can be
238used to achieve this sort of thing in a localizable way, see below.</p>
Chris Lattner62fd2782008-11-22 21:41:31 +0000239
240<!-- ==================================== -->
241<h4>Formatting a Diagnostic Argument</a></h4>
242<!-- ==================================== -->
243
244<p>Arguments to diagnostics are fully typed internally, and come from a couple
245different classes: integers, types, names, and random strings. Depending on
246the class of the argument, it can be optionally formatted in different ways.
247This gives the DiagnosticClient information about what the argument means
248without requiring it to use a specific presentation (consider this MVC for
249Clang :).</p>
250
251<p>Here are the different diagnostic argument formats currently supported by
252Clang:</p>
253
254<table>
255<tr><td colspan="2"><b>"s" format</b></td></tr>
256<tr><td>Example:</td><td><tt>"requires %1 parameter%s1"</tt></td></tr>
Chris Lattner552de0a2008-11-23 08:16:56 +0000257<tr><td>Class:</td><td>Integers</td></tr>
Chris Lattner62fd2782008-11-22 21:41:31 +0000258<tr><td>Description:</td><td>This is a simple formatter for integers that is
259 useful when producing English diagnostics. When the integer is 1, it prints
260 as nothing. When the integer is not 1, it prints as "s". This allows some
Chris Lattner627b7052008-11-23 00:28:33 +0000261 simple grammatical forms to be to be handled correctly, and eliminates the
262 need to use gross things like <tt>"requires %1 parameter(s)"</tt>.</td></tr>
Chris Lattner62fd2782008-11-22 21:41:31 +0000263
264<tr><td colspan="2"><b>"select" format</b></td></tr>
265<tr><td>Example:</td><td><tt>"must be a %select{unary|binary|unary or binary}2
266 operator"</tt></td></tr>
Chris Lattner552de0a2008-11-23 08:16:56 +0000267<tr><td>Class:</td><td>Integers</td></tr>
Chris Lattnercc543342008-11-22 23:50:47 +0000268<tr><td>Description:</td><td>This format specifier is used to merge multiple
269 related diagnostics together into one common one, without requiring the
Chris Lattner552de0a2008-11-23 08:16:56 +0000270 difference to be specified as an English string argument. Instead of
Chris Lattnercc543342008-11-22 23:50:47 +0000271 specifying the string, the diagnostic gets an integer argument and the
272 format string selects the numbered option. In this case, the "%2" value
273 must be an integer in the range [0..2]. If it is 0, it prints 'unary', if
274 it is 1 it prints 'binary' if it is 2, it prints 'unary or binary'. This
275 allows other language translations to substitute reasonable words (or entire
276 phrases) based on the semantics of the diagnostic instead of having to do
277 things textually.</td></tr>
Chris Lattner62fd2782008-11-22 21:41:31 +0000278
279<tr><td colspan="2"><b>"plural" format</b></td></tr>
Sebastian Redl68168562008-11-22 22:16:45 +0000280<tr><td>Example:</td><td><tt>"you have %1 %plural{1:mouse|:mice}1 connected to
281 your computer"</tt></td></tr>
Chris Lattner552de0a2008-11-23 08:16:56 +0000282<tr><td>Class:</td><td>Integers</td></tr>
Sebastian Redl68168562008-11-22 22:16:45 +0000283<tr><td>Description:</td><td><p>This is a formatter for complex plural forms.
284 It is designed to handle even the requirements of languages with very
285 complex plural forms, as many Baltic languages have. The argument consists
286 of a series of expression/form pairs, separated by ':', where the first form
287 whose expression evaluates to true is the result of the modifier.</p>
288 <p>An expression can be empty, in which case it is always true. See the
289 example at the top. Otherwise, it is a series of one or more numeric
290 conditions, separated by ','. If any condition matches, the expression
291 matches. Each numeric condition can take one of three forms.</p>
292 <ul>
293 <li>number: A simple decimal number matches if the argument is the same
Chris Lattner627b7052008-11-23 00:28:33 +0000294 as the number. Example: <tt>"%plural{1:mouse|:mice}4"</tt></li>
Sebastian Redl68168562008-11-22 22:16:45 +0000295 <li>range: A range in square brackets matches if the argument is within
Chris Lattner552de0a2008-11-23 08:16:56 +0000296 the range. Then range is inclusive on both ends. Example:
Chris Lattner627b7052008-11-23 00:28:33 +0000297 <tt>"%plural{0:none|1:one|[2,5]:some|:many}2"</tt></li>
298 <li>modulo: A modulo operator is followed by a number, and
299 equals sign and either a number or a range. The tests are the
300 same as for plain
Sebastian Redl68168562008-11-22 22:16:45 +0000301 numbers and ranges, but the argument is taken modulo the number first.
Chris Lattner627b7052008-11-23 00:28:33 +0000302 Example: <tt>"%plural{%100=0:even hundred|%100=[1,50]:lower half|:everything
303 else}1"</tt></li>
Sebastian Redl68168562008-11-22 22:16:45 +0000304 </ul>
305 <p>The parser is very unforgiving. A syntax error, even whitespace, will
306 abort, as will a failure to match the argument against any
307 expression.</p></td></tr>
Chris Lattner62fd2782008-11-22 21:41:31 +0000308
Chris Lattner077bf5e2008-11-24 03:33:13 +0000309<tr><td colspan="2"><b>"objcclass" format</b></td></tr>
310<tr><td>Example:</td><td><tt>"method %objcclass0 not found"</tt></td></tr>
311<tr><td>Class:</td><td>DeclarationName</td></tr>
312<tr><td>Description:</td><td><p>This is a simple formatter that indicates the
313 DeclarationName corresponds to an Objective-C class method selector. As
314 such, it prints the selector with a leading '+'.</p></td></tr>
315
316<tr><td colspan="2"><b>"objcinstance" format</b></td></tr>
317<tr><td>Example:</td><td><tt>"method %objcinstance0 not found"</tt></td></tr>
318<tr><td>Class:</td><td>DeclarationName</td></tr>
319<tr><td>Description:</td><td><p>This is a simple formatter that indicates the
320 DeclarationName corresponds to an Objective-C instance method selector. As
321 such, it prints the selector with a leading '-'.</p></td></tr>
322
Chris Lattner62fd2782008-11-22 21:41:31 +0000323</table>
324
Chris Lattnercc543342008-11-22 23:50:47 +0000325<p>It is really easy to add format specifiers to the Clang diagnostics system,
Chris Lattner552de0a2008-11-23 08:16:56 +0000326but they should be discussed before they are added. If you are creating a lot
327of repetitive diagnostics and/or have an idea for a useful formatter, please
328bring it up on the cfe-dev mailing list.</p>
Chris Lattnercc543342008-11-22 23:50:47 +0000329
Chris Lattner62fd2782008-11-22 21:41:31 +0000330<!-- ===================================================== -->
331<h4><a name="#producingdiag">Producing the Diagnostic</a></h4>
332<!-- ===================================================== -->
333
Chris Lattner627b7052008-11-23 00:28:33 +0000334<p>Now that you've created the diagnostic in the DiagnosticKinds.def file, you
Chris Lattner552de0a2008-11-23 08:16:56 +0000335need to write the code that detects the condition in question and emits the
336new diagnostic. Various components of Clang (e.g. the preprocessor, Sema,
Chris Lattner627b7052008-11-23 00:28:33 +0000337etc) provide a helper function named "Diag". It creates a diagnostic and
338accepts the arguments, ranges, and other information that goes along with
339it.</p>
Chris Lattner62fd2782008-11-22 21:41:31 +0000340
Chris Lattner552de0a2008-11-23 08:16:56 +0000341<p>For example, the binary expression error comes from code like this:</p>
Chris Lattner627b7052008-11-23 00:28:33 +0000342
343<pre>
344 if (various things that are bad)
345 Diag(Loc, diag::err_typecheck_invalid_operands)
346 &lt;&lt; lex-&gt;getType() &lt;&lt; rex-&gt;getType()
347 &lt;&lt; lex-&gt;getSourceRange() &lt;&lt; rex-&gt;getSourceRange();
348</pre>
349
350<p>This shows that use of the Diag method: they take a location (a <a
351href="#SourceLocation">SourceLocation</a> object) and a diagnostic enum value
352(which matches the name from DiagnosticKinds.def). If the diagnostic takes
353arguments, they are specified with the &lt;&lt; operator: the first argument
354becomes %0, the second becomes %1, etc. The diagnostic interface allows you to
Chris Lattnerfd408ea2008-11-23 00:42:53 +0000355specify arguments of many different types, including <tt>int</tt> and
356<tt>unsigned</tt> for integer arguments, <tt>const char*</tt> and
357<tt>std::string</tt> for string arguments, <tt>DeclarationName</tt> and
358<tt>const IdentifierInfo*</tt> for names, <tt>QualType</tt> for types, etc.
359SourceRanges are also specified with the &lt;&lt; operator, but do not have a
360specific ordering requirement.</p>
Chris Lattner627b7052008-11-23 00:28:33 +0000361
362<p>As you can see, adding and producing a diagnostic is pretty straightforward.
363The hard part is deciding exactly what you need to say to help the user, picking
364a suitable wording, and providing the information needed to format it correctly.
Chris Lattnerfd408ea2008-11-23 00:42:53 +0000365The good news is that the call site that issues a diagnostic should be
366completely independent of how the diagnostic is formatted and in what language
367it is rendered.
Chris Lattner627b7052008-11-23 00:28:33 +0000368</p>
Chris Lattner62fd2782008-11-22 21:41:31 +0000369
370<!-- ============================================================= -->
371<h4><a name="DiagnosticClient">The DiagnosticClient Interface</a></h4>
372<!-- ============================================================= -->
373
Chris Lattner627b7052008-11-23 00:28:33 +0000374<p>Once code generates a diagnostic with all of the arguments and the rest of
375the relevant information, Clang needs to know what to do with it. As previously
376mentioned, the diagnostic machinery goes through some filtering to map a
377severity onto a diagnostic level, then (assuming the diagnostic is not mapped to
378"<tt>Ignore</tt>") it invokes an object that implements the DiagnosticClient
379interface with the information.</p>
380
381<p>It is possible to implement this interface in many different ways. For
382example, the normal Clang DiagnosticClient (named 'TextDiagnosticPrinter') turns
383the arguments into strings (according to the various formatting rules), prints
384out the file/line/column information and the string, then prints out the line of
385code, the source ranges, and the caret. However, this behavior isn't required.
386</p>
387
388<p>Another implementation of the DiagnosticClient interface is the
Chris Lattner552de0a2008-11-23 08:16:56 +0000389'TextDiagnosticBuffer' class, which is used when Clang is in -verify mode.
Chris Lattnerfd408ea2008-11-23 00:42:53 +0000390Instead of formatting and printing out the diagnostics, this implementation just
391captures and remembers the diagnostics as they fly by. Then -verify compares
Chris Lattner552de0a2008-11-23 08:16:56 +0000392the list of produced diagnostics to the list of expected ones. If they disagree,
Chris Lattnerfd408ea2008-11-23 00:42:53 +0000393it prints out its own output.
Chris Lattner627b7052008-11-23 00:28:33 +0000394</p>
395
Chris Lattnerfd408ea2008-11-23 00:42:53 +0000396<p>There are many other possible implementations of this interface, and this is
397why we prefer diagnostics to pass down rich structured information in arguments.
398For example, an HTML output might want declaration names be linkified to where
399they come from in the source. Another example is that a GUI might let you click
400on typedefs to expand them. This application would want to pass significantly
401more information about types through to the GUI than a simple flat string. The
402interface allows this to happen.</p>
Chris Lattner62fd2782008-11-22 21:41:31 +0000403
404<!-- ====================================================== -->
405<h4><a name="translation">Adding Translations to Clang</a></h4>
406<!-- ====================================================== -->
407
Chris Lattner627b7052008-11-23 00:28:33 +0000408<p>Not possible yet! Diagnostic strings should be written in UTF-8, the client
Chris Lattnerfd408ea2008-11-23 00:42:53 +0000409can translate to the relevant code page if needed. Each translation completely
410replaces the format string for the diagnostic.</p>
Chris Lattner62fd2782008-11-22 21:41:31 +0000411
412
Chris Lattner86920d32007-07-31 05:42:17 +0000413<!-- ======================================================================= -->
414<h3 id="SourceLocation">The SourceLocation and SourceManager classes</h3>
415<!-- ======================================================================= -->
416
417<p>Strangely enough, the SourceLocation class represents a location within the
418source code of the program. Important design points include:</p>
419
420<ol>
421<li>sizeof(SourceLocation) must be extremely small, as these are embedded into
422 many AST nodes and are passed around often. Currently it is 32 bits.</li>
423<li>SourceLocation must be a simple value object that can be efficiently
424 copied.</li>
425<li>We should be able to represent a source location for any byte of any input
426 file. This includes in the middle of tokens, in whitespace, in trigraphs,
427 etc.</li>
428<li>A SourceLocation must encode the current #include stack that was active when
429 the location was processed. For example, if the location corresponds to a
430 token, it should contain the set of #includes active when the token was
431 lexed. This allows us to print the #include stack for a diagnostic.</li>
432<li>SourceLocation must be able to describe macro expansions, capturing both
433 the ultimate instantiation point and the source of the original character
434 data.</li>
435</ol>
436
437<p>In practice, the SourceLocation works together with the SourceManager class
438to encode two pieces of information about a location: it's physical location
439and it's virtual location. For most tokens, these will be the same. However,
440for a macro expansion (or tokens that came from a _Pragma directive) these will
441describe the location of the characters corresponding to the token and the
442location where the token was used (i.e. the macro instantiation point or the
443location of the _Pragma itself).</p>
444
Chris Lattner3fcbb892008-11-23 08:32:53 +0000445<p>For efficiency, we only track one level of macro instantiations: if a token was
Chris Lattner86920d32007-07-31 05:42:17 +0000446produced by multiple instantiations, we only track the source and ultimate
447destination. Though we could track the intermediate instantiation points, this
448would require extra bookkeeping and no known client would benefit substantially
449from this.</p>
450
Chris Lattner552de0a2008-11-23 08:16:56 +0000451<p>The Clang front-end inherently depends on the location of a token being
Chris Lattner86920d32007-07-31 05:42:17 +0000452tracked correctly. If it is ever incorrect, the front-end may get confused and
453die. The reason for this is that the notion of the 'spelling' of a Token in
Chris Lattner552de0a2008-11-23 08:16:56 +0000454Clang depends on being able to find the original input characters for the token.
Chris Lattner86920d32007-07-31 05:42:17 +0000455This concept maps directly to the "physical" location for the token.</p>
456
457<!-- ======================================================================= -->
458<h2 id="liblex">The Lexer and Preprocessor Library</h2>
459<!-- ======================================================================= -->
460
461<p>The Lexer library contains several tightly-connected classes that are involved
462with the nasty process of lexing and preprocessing C source code. The main
463interface to this library for outside clients is the large <a
464href="#Preprocessor">Preprocessor</a> class.
465It contains the various pieces of state that are required to coherently read
466tokens out of a translation unit.</p>
467
468<p>The core interface to the Preprocessor object (once it is set up) is the
469Preprocessor::Lex method, which returns the next <a href="#Token">Token</a> from
470the preprocessor stream. There are two types of token providers that the
471preprocessor is capable of reading from: a buffer lexer (provided by the <a
472href="#Lexer">Lexer</a> class) and a buffered token stream (provided by the <a
Chris Lattner79281252008-03-09 02:27:26 +0000473href="#TokenLexer">TokenLexer</a> class).
Chris Lattner86920d32007-07-31 05:42:17 +0000474
475
476<!-- ======================================================================= -->
477<h3 id="Token">The Token class</h3>
478<!-- ======================================================================= -->
479
480<p>The Token class is used to represent a single lexed token. Tokens are
481intended to be used by the lexer/preprocess and parser libraries, but are not
482intended to live beyond them (for example, they should not live in the ASTs).<p>
483
484<p>Tokens most often live on the stack (or some other location that is efficient
485to access) as the parser is running, but occasionally do get buffered up. For
486example, macro definitions are stored as a series of tokens, and the C++
Chris Lattner3fcbb892008-11-23 08:32:53 +0000487front-end periodically needs to buffer tokens up for tentative parsing and
Chris Lattner86920d32007-07-31 05:42:17 +0000488various pieces of look-ahead. As such, the size of a Token matter. On a 32-bit
489system, sizeof(Token) is currently 16 bytes.</p>
490
491<p>Tokens contain the following information:</p>
492
493<ul>
494<li><b>A SourceLocation</b> - This indicates the location of the start of the
495token.</li>
496
497<li><b>A length</b> - This stores the length of the token as stored in the
498SourceBuffer. For tokens that include them, this length includes trigraphs and
499escaped newlines which are ignored by later phases of the compiler. By pointing
500into the original source buffer, it is always possible to get the original
501spelling of a token completely accurately.</li>
502
503<li><b>IdentifierInfo</b> - If a token takes the form of an identifier, and if
504identifier lookup was enabled when the token was lexed (e.g. the lexer was not
505reading in 'raw' mode) this contains a pointer to the unique hash value for the
506identifier. Because the lookup happens before keyword identification, this
507field is set even for language keywords like 'for'.</li>
508
509<li><b>TokenKind</b> - This indicates the kind of token as classified by the
510lexer. This includes things like <tt>tok::starequal</tt> (for the "*="
511operator), <tt>tok::ampamp</tt> for the "&amp;&amp;" token, and keyword values
512(e.g. <tt>tok::kw_for</tt>) for identifiers that correspond to keywords. Note
513that some tokens can be spelled multiple ways. For example, C++ supports
514"operator keywords", where things like "and" are treated exactly like the
515"&amp;&amp;" operator. In these cases, the kind value is set to
516<tt>tok::ampamp</tt>, which is good for the parser, which doesn't have to
517consider both forms. For something that cares about which form is used (e.g.
518the preprocessor 'stringize' operator) the spelling indicates the original
519form.</li>
520
521<li><b>Flags</b> - There are currently four flags tracked by the
522lexer/preprocessor system on a per-token basis:
523
524 <ol>
525 <li><b>StartOfLine</b> - This was the first token that occurred on its input
526 source line.</li>
527 <li><b>LeadingSpace</b> - There was a space character either immediately
528 before the token or transitively before the token as it was expanded
529 through a macro. The definition of this flag is very closely defined by
530 the stringizing requirements of the preprocessor.</li>
531 <li><b>DisableExpand</b> - This flag is used internally to the preprocessor to
532 represent identifier tokens which have macro expansion disabled. This
533 prevents them from being considered as candidates for macro expansion ever
534 in the future.</li>
535 <li><b>NeedsCleaning</b> - This flag is set if the original spelling for the
536 token includes a trigraph or escaped newline. Since this is uncommon,
537 many pieces of code can fast-path on tokens that did not need cleaning.
538 </p>
539 </ol>
540</li>
541</ul>
542
543<p>One interesting (and somewhat unusual) aspect of tokens is that they don't
544contain any semantic information about the lexed value. For example, if the
545token was a pp-number token, we do not represent the value of the number that
546was lexed (this is left for later pieces of code to decide). Additionally, the
547lexer library has no notion of typedef names vs variable names: both are
548returned as identifiers, and the parser is left to decide whether a specific
549identifier is a typedef or a variable (tracking this requires scope information
550among other things).</p>
551
552<!-- ======================================================================= -->
553<h3 id="Lexer">The Lexer class</h3>
554<!-- ======================================================================= -->
555
556<p>The Lexer class provides the mechanics of lexing tokens out of a source
557buffer and deciding what they mean. The Lexer is complicated by the fact that
558it operates on raw buffers that have not had spelling eliminated (this is a
559necessity to get decent performance), but this is countered with careful coding
560as well as standard performance techniques (for example, the comment handling
561code is vectorized on X86 and PowerPC hosts).</p>
562
563<p>The lexer has a couple of interesting modal features:</p>
564
565<ul>
566<li>The lexer can operate in 'raw' mode. This mode has several features that
567 make it possible to quickly lex the file (e.g. it stops identifier lookup,
568 doesn't specially handle preprocessor tokens, handles EOF differently, etc).
569 This mode is used for lexing within an "<tt>#if 0</tt>" block, for
570 example.</li>
571<li>The lexer can capture and return comments as tokens. This is required to
572 support the -C preprocessor mode, which passes comments through, and is
573 used by the diagnostic checker to identifier expect-error annotations.</li>
574<li>The lexer can be in ParsingFilename mode, which happens when preprocessing
Chris Lattner84386242007-09-16 19:25:23 +0000575 after reading a #include directive. This mode changes the parsing of '&lt;'
Chris Lattner86920d32007-07-31 05:42:17 +0000576 to return an "angled string" instead of a bunch of tokens for each thing
577 within the filename.</li>
578<li>When parsing a preprocessor directive (after "<tt>#</tt>") the
579 ParsingPreprocessorDirective mode is entered. This changes the parser to
580 return EOM at a newline.</li>
581<li>The Lexer uses a LangOptions object to know whether trigraphs are enabled,
582 whether C++ or ObjC keywords are recognized, etc.</li>
583</ul>
584
585<p>In addition to these modes, the lexer keeps track of a couple of other
586 features that are local to a lexed buffer, which change as the buffer is
587 lexed:</p>
588
589<ul>
590<li>The Lexer uses BufferPtr to keep track of the current character being
591 lexed.</li>
592<li>The Lexer uses IsAtStartOfLine to keep track of whether the next lexed token
593 will start with its "start of line" bit set.</li>
594<li>The Lexer keeps track of the current #if directives that are active (which
595 can be nested).</li>
596<li>The Lexer keeps track of an <a href="#MultipleIncludeOpt">
597 MultipleIncludeOpt</a> object, which is used to
598 detect whether the buffer uses the standard "<tt>#ifndef XX</tt> /
599 <tt>#define XX</tt>" idiom to prevent multiple inclusion. If a buffer does,
600 subsequent includes can be ignored if the XX macro is defined.</li>
601</ul>
602
603<!-- ======================================================================= -->
Chris Lattner79281252008-03-09 02:27:26 +0000604<h3 id="TokenLexer">The TokenLexer class</h3>
Chris Lattner86920d32007-07-31 05:42:17 +0000605<!-- ======================================================================= -->
606
Chris Lattner79281252008-03-09 02:27:26 +0000607<p>The TokenLexer class is a token provider that returns tokens from a list
Chris Lattner86920d32007-07-31 05:42:17 +0000608of tokens that came from somewhere else. It typically used for two things: 1)
609returning tokens from a macro definition as it is being expanded 2) returning
610tokens from an arbitrary buffer of tokens. The later use is used by _Pragma and
611will most likely be used to handle unbounded look-ahead for the C++ parser.</p>
612
613<!-- ======================================================================= -->
614<h3 id="MultipleIncludeOpt">The MultipleIncludeOpt class</h3>
615<!-- ======================================================================= -->
616
617<p>The MultipleIncludeOpt class implements a really simple little state machine
618that is used to detect the standard "<tt>#ifndef XX</tt> / <tt>#define XX</tt>"
619idiom that people typically use to prevent multiple inclusion of headers. If a
620buffer uses this idiom and is subsequently #include'd, the preprocessor can
621simply check to see whether the guarding condition is defined or not. If so,
622the preprocessor can completely ignore the include of the header.</p>
623
624
625
626<!-- ======================================================================= -->
627<h2 id="libparse">The Parser Library</h2>
628<!-- ======================================================================= -->
629
630<!-- ======================================================================= -->
631<h2 id="libast">The AST Library</h2>
632<!-- ======================================================================= -->
633
634<!-- ======================================================================= -->
635<h3 id="Type">The Type class and its subclasses</h3>
636<!-- ======================================================================= -->
637
638<p>The Type class (and its subclasses) are an important part of the AST. Types
639are accessed through the ASTContext class, which implicitly creates and uniques
640them as they are needed. Types have a couple of non-obvious features: 1) they
641do not capture type qualifiers like const or volatile (See
642<a href="#QualType">QualType</a>), and 2) they implicitly capture typedef
Chris Lattner8a2bc622007-07-31 06:37:39 +0000643information. Once created, types are immutable (unlike decls).</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000644
645<p>Typedefs in C make semantic analysis a bit more complex than it would
646be without them. The issue is that we want to capture typedef information
647and represent it in the AST perfectly, but the semantics of operations need to
648"see through" typedefs. For example, consider this code:</p>
649
650<code>
651void func() {<br>
Bill Wendling30d17752007-10-06 01:56:01 +0000652&nbsp;&nbsp;typedef int foo;<br>
653&nbsp;&nbsp;foo X, *Y;<br>
654&nbsp;&nbsp;typedef foo* bar;<br>
655&nbsp;&nbsp;bar Z;<br>
656&nbsp;&nbsp;*X; <i>// error</i><br>
657&nbsp;&nbsp;**Y; <i>// error</i><br>
658&nbsp;&nbsp;**Z; <i>// error</i><br>
Chris Lattner86920d32007-07-31 05:42:17 +0000659}<br>
660</code>
661
662<p>The code above is illegal, and thus we expect there to be diagnostics emitted
663on the annotated lines. In this example, we expect to get:</p>
664
665<pre>
Chris Lattner8a2bc622007-07-31 06:37:39 +0000666<b>test.c:6:1: error: indirection requires pointer operand ('foo' invalid)</b>
Chris Lattner86920d32007-07-31 05:42:17 +0000667*X; // error
668<font color="blue">^~</font>
Chris Lattner8a2bc622007-07-31 06:37:39 +0000669<b>test.c:7:1: error: indirection requires pointer operand ('foo' invalid)</b>
Chris Lattner86920d32007-07-31 05:42:17 +0000670**Y; // error
671<font color="blue">^~~</font>
Chris Lattner8a2bc622007-07-31 06:37:39 +0000672<b>test.c:8:1: error: indirection requires pointer operand ('foo' invalid)</b>
673**Z; // error
674<font color="blue">^~~</font>
Chris Lattner86920d32007-07-31 05:42:17 +0000675</pre>
676
677<p>While this example is somewhat silly, it illustrates the point: we want to
678retain typedef information where possible, so that we can emit errors about
679"<tt>std::string</tt>" instead of "<tt>std::basic_string&lt;char, std:...</tt>".
680Doing this requires properly keeping typedef information (for example, the type
681of "X" is "foo", not "int"), and requires properly propagating it through the
Chris Lattner8a2bc622007-07-31 06:37:39 +0000682various operators (for example, the type of *Y is "foo", not "int"). In order
683to retain this information, the type of these expressions is an instance of the
684TypedefType class, which indicates that the type of these expressions is a
685typedef for foo.
686</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000687
Chris Lattner8a2bc622007-07-31 06:37:39 +0000688<p>Representing types like this is great for diagnostics, because the
689user-specified type is always immediately available. There are two problems
690with this: first, various semantic checks need to make judgements about the
Chris Lattner33fc68a2007-07-31 18:54:50 +0000691<em>actual structure</em> of a type, ignoring typdefs. Second, we need an
692efficient way to query whether two types are structurally identical to each
693other, ignoring typedefs. The solution to both of these problems is the idea of
Chris Lattner8a2bc622007-07-31 06:37:39 +0000694canonical types.</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000695
Chris Lattner62fd2782008-11-22 21:41:31 +0000696<!-- =============== -->
Chris Lattner8a2bc622007-07-31 06:37:39 +0000697<h4>Canonical Types</h4>
Chris Lattner62fd2782008-11-22 21:41:31 +0000698<!-- =============== -->
Chris Lattner86920d32007-07-31 05:42:17 +0000699
Chris Lattner8a2bc622007-07-31 06:37:39 +0000700<p>Every instance of the Type class contains a canonical type pointer. For
701simple types with no typedefs involved (e.g. "<tt>int</tt>", "<tt>int*</tt>",
702"<tt>int**</tt>"), the type just points to itself. For types that have a
703typedef somewhere in their structure (e.g. "<tt>foo</tt>", "<tt>foo*</tt>",
704"<tt>foo**</tt>", "<tt>bar</tt>"), the canonical type pointer points to their
705structurally equivalent type without any typedefs (e.g. "<tt>int</tt>",
706"<tt>int*</tt>", "<tt>int**</tt>", and "<tt>int*</tt>" respectively).</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000707
Chris Lattner8a2bc622007-07-31 06:37:39 +0000708<p>This design provides a constant time operation (dereferencing the canonical
709type pointer) that gives us access to the structure of types. For example,
710we can trivially tell that "bar" and "foo*" are the same type by dereferencing
711their canonical type pointers and doing a pointer comparison (they both point
712to the single "<tt>int*</tt>" type).</p>
713
714<p>Canonical types and typedef types bring up some complexities that must be
715carefully managed. Specifically, the "isa/cast/dyncast" operators generally
716shouldn't be used in code that is inspecting the AST. For example, when type
717checking the indirection operator (unary '*' on a pointer), the type checker
718must verify that the operand has a pointer type. It would not be correct to
719check that with "<tt>isa&lt;PointerType&gt;(SubExpr-&gt;getType())</tt>",
720because this predicate would fail if the subexpression had a typedef type.</p>
721
722<p>The solution to this problem are a set of helper methods on Type, used to
723check their properties. In this case, it would be correct to use
724"<tt>SubExpr-&gt;getType()-&gt;isPointerType()</tt>" to do the check. This
725predicate will return true if the <em>canonical type is a pointer</em>, which is
726true any time the type is structurally a pointer type. The only hard part here
727is remembering not to use the <tt>isa/cast/dyncast</tt> operations.</p>
728
729<p>The second problem we face is how to get access to the pointer type once we
730know it exists. To continue the example, the result type of the indirection
731operator is the pointee type of the subexpression. In order to determine the
732type, we need to get the instance of PointerType that best captures the typedef
733information in the program. If the type of the expression is literally a
734PointerType, we can return that, otherwise we have to dig through the
735typedefs to find the pointer type. For example, if the subexpression had type
736"<tt>foo*</tt>", we could return that type as the result. If the subexpression
737had type "<tt>bar</tt>", we want to return "<tt>foo*</tt>" (note that we do
738<em>not</em> want "<tt>int*</tt>"). In order to provide all of this, Type has
Chris Lattner11406c12007-07-31 16:50:51 +0000739a getAsPointerType() method that checks whether the type is structurally a
Chris Lattner8a2bc622007-07-31 06:37:39 +0000740PointerType and, if so, returns the best one. If not, it returns a null
741pointer.</p>
742
743<p>This structure is somewhat mystical, but after meditating on it, it will
744make sense to you :).</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000745
746<!-- ======================================================================= -->
747<h3 id="QualType">The QualType class</h3>
748<!-- ======================================================================= -->
749
750<p>The QualType class is designed as a trivial value class that is small,
751passed by-value and is efficient to query. The idea of QualType is that it
752stores the type qualifiers (const, volatile, restrict) separately from the types
753themselves: QualType is conceptually a pair of "Type*" and bits for the type
754qualifiers.</p>
755
756<p>By storing the type qualifiers as bits in the conceptual pair, it is
757extremely efficient to get the set of qualifiers on a QualType (just return the
758field of the pair), add a type qualifier (which is a trivial constant-time
759operation that sets a bit), and remove one or more type qualifiers (just return
760a QualType with the bitfield set to empty).</p>
761
762<p>Further, because the bits are stored outside of the type itself, we do not
763need to create duplicates of types with different sets of qualifiers (i.e. there
764is only a single heap allocated "int" type: "const int" and "volatile const int"
765both point to the same heap allocated "int" type). This reduces the heap size
766used to represent bits and also means we do not have to consider qualifiers when
767uniquing types (<a href="#Type">Type</a> does not even contain qualifiers).</p>
768
769<p>In practice, on hosts where it is safe, the 3 type qualifiers are stored in
770the low bit of the pointer to the Type object. This means that QualType is
771exactly the same size as a pointer, and this works fine on any system where
772malloc'd objects are at least 8 byte aligned.</p>
Ted Kremenek8bc05712007-10-10 23:01:43 +0000773
774<!-- ======================================================================= -->
Douglas Gregor2e1cd422008-11-17 14:58:09 +0000775<h3 id="DeclarationName">Declaration names</h3>
776<!-- ======================================================================= -->
777
778<p>The <tt>DeclarationName</tt> class represents the name of a
779 declaration in Clang. Declarations in the C family of languages can
Chris Lattner3fcbb892008-11-23 08:32:53 +0000780 take several different forms. Most declarations are named by
Douglas Gregor2e1cd422008-11-17 14:58:09 +0000781 simple identifiers, e.g., "<code>f</code>" and "<code>x</code>" in
782 the function declaration <code>f(int x)</code>. In C++, declaration
783 names can also name class constructors ("<code>Class</code>"
784 in <code>struct Class { Class(); }</code>), class destructors
785 ("<code>~Class</code>"), overloaded operator names ("operator+"),
786 and conversion functions ("<code>operator void const *</code>"). In
787 Objective-C, declaration names can refer to the names of Objective-C
788 methods, which involve the method name and the parameters,
Chris Lattner3fcbb892008-11-23 08:32:53 +0000789 collectively called a <i>selector</i>, e.g.,
Douglas Gregor2e1cd422008-11-17 14:58:09 +0000790 "<code>setWidth:height:</code>". Since all of these kinds of
Chris Lattner3fcbb892008-11-23 08:32:53 +0000791 entities - variables, functions, Objective-C methods, C++
792 constructors, destructors, and operators - are represented as
Douglas Gregor2e1cd422008-11-17 14:58:09 +0000793 subclasses of Clang's common <code>NamedDecl</code>
794 class, <code>DeclarationName</code> is designed to efficiently
795 represent any kind of name.</p>
796
797<p>Given
798 a <code>DeclarationName</code> <code>N</code>, <code>N.getNameKind()</code>
Douglas Gregor2def4832008-11-17 20:34:05 +0000799 will produce a value that describes what kind of name <code>N</code>
Douglas Gregore94ca9e42008-11-18 14:39:36 +0000800 stores. There are 8 options (all of the names are inside
Douglas Gregor2e1cd422008-11-17 14:58:09 +0000801 the <code>DeclarationName</code> class)</p>
802<dl>
803 <dt>Identifier</dt>
804 <dd>The name is a simple
805 identifier. Use <code>N.getAsIdentifierInfo()</code> to retrieve the
806 corresponding <code>IdentifierInfo*</code> pointing to the actual
807 identifier. Note that C++ overloaded operators (e.g.,
808 "<code>operator+</code>") are represented as special kinds of
809 identifiers. Use <code>IdentifierInfo</code>'s <code>getOverloadedOperatorID</code>
810 function to determine whether an identifier is an overloaded
811 operator name.</dd>
812
813 <dt>ObjCZeroArgSelector, ObjCOneArgSelector,
814 ObjCMultiArgSelector</dt>
815 <dd>The name is an Objective-C selector, which can be retrieved as a
816 <code>Selector</code> instance
817 via <code>N.getObjCSelector()</code>. The three possible name
818 kinds for Objective-C reflect an optimization within
819 the <code>DeclarationName</code> class: both zero- and
820 one-argument selectors are stored as a
821 masked <code>IdentifierInfo</code> pointer, and therefore require
822 very little space, since zero- and one-argument selectors are far
823 more common than multi-argument selectors (which use a different
824 structure).</dd>
825
826 <dt>CXXConstructorName</dt>
827 <dd>The name is a C++ constructor
828 name. Use <code>N.getCXXNameType()</code> to retrieve
829 the <a href="#QualType">type</a> that this constructor is meant to
830 construct. The type is always the canonical type, since all
831 constructors for a given type have the same name.</dd>
832
833 <dt>CXXDestructorName</dt>
834 <dd>The name is a C++ destructor
835 name. Use <code>N.getCXXNameType()</code> to retrieve
836 the <a href="#QualType">type</a> whose destructor is being
837 named. This type is always a canonical type.</dd>
838
839 <dt>CXXConversionFunctionName</dt>
840 <dd>The name is a C++ conversion function. Conversion functions are
841 named according to the type they convert to, e.g., "<code>operator void
842 const *</code>". Use <code>N.getCXXNameType()</code> to retrieve
843 the type that this conversion function converts to. This type is
844 always a canonical type.</dd>
Douglas Gregore94ca9e42008-11-18 14:39:36 +0000845
846 <dt>CXXOperatorName</dt>
847 <dd>The name is a C++ overloaded operator name. Overloaded operators
848 are named according to their spelling, e.g.,
849 "<code>operator+</code>" or "<code>operator new
850 []</code>". Use <code>N.getCXXOverloadedOperator()</code> to
851 retrieve the overloaded operator (a value of
852 type <code>OverloadedOperatorKind</code>).</dd>
Douglas Gregor2e1cd422008-11-17 14:58:09 +0000853</dl>
854
855<p><code>DeclarationName</code>s are cheap to create, copy, and
856 compare. They require only a single pointer's worth of storage in
Douglas Gregore94ca9e42008-11-18 14:39:36 +0000857 the common cases (identifiers, zero-
Douglas Gregor2e1cd422008-11-17 14:58:09 +0000858 and one-argument Objective-C selectors) and use dense, uniqued
859 storage for the other kinds of
860 names. Two <code>DeclarationName</code>s can be compared for
861 equality (<code>==</code>, <code>!=</code>) using a simple bitwise
862 comparison, can be ordered
863 with <code>&lt;</code>, <code>&gt;</code>, <code>&lt;=</code>,
864 and <code>&gt;=</code> (which provide a lexicographical ordering for
865 normal identifiers but an unspecified ordering for other kinds of
866 names), and can be placed into LLVM <code>DenseMap</code>s
867 and <code>DenseSet</code>s.</p>
868
869<p><code>DeclarationName</code> instances can be created in different
870 ways depending on what kind of name the instance will store. Normal
Douglas Gregore94ca9e42008-11-18 14:39:36 +0000871 identifiers (<code>IdentifierInfo</code> pointers) and Objective-C selectors
Douglas Gregor2e1cd422008-11-17 14:58:09 +0000872 (<code>Selector</code>) can be implicitly converted
873 to <code>DeclarationName</code>s. Names for C++ constructors,
Douglas Gregore94ca9e42008-11-18 14:39:36 +0000874 destructors, conversion functions, and overloaded operators can be retrieved from
Douglas Gregor2e1cd422008-11-17 14:58:09 +0000875 the <code>DeclarationNameTable</code>, an instance of which is
876 available as <code>ASTContext::DeclarationNames</code>. The member
877 functions <code>getCXXConstructorName</code>, <code>getCXXDestructorName</code>,
Douglas Gregore94ca9e42008-11-18 14:39:36 +0000878 <code>getCXXConversionFunctionName</code>, and <code>getCXXOperatorName</code>, respectively,
879 return <code>DeclarationName</code> instances for the four kinds of
Douglas Gregor2e1cd422008-11-17 14:58:09 +0000880 C++ special function names.</p>
881
882<!-- ======================================================================= -->
Douglas Gregor074149e2009-01-05 19:45:36 +0000883<h3 id="DeclContext">Declaration contexts</h3>
884<!-- ======================================================================= -->
885<p>Every declaration in a program exists within some <i>declaration
886 context</i>, such as a translation unit, namespace, class, or
887 function. Declaration contexts in Clang are represented by
888 the <code>DeclContext</code> class, from which the various
889 declaration-context AST nodes
890 (<code>TranslationUnitDecl</code>, <code>NamespaceDecl</code>, <code>RecordDecl</code>, <code>FunctionDecl</code>,
891 etc.) will derive. The <code>DeclContext</code> class provides
892 several facilities common to each declaration context:</p>
893<dl>
894 <dt>Source-centric vs. Semantics-centric View of Declarations</dt>
895 <dd><code>DeclContext</code> provides two views of the declarations
896 stored within a declaration context. The source-centric view
897 accurately represents the program source code as written, including
898 multiple declarations of entities where present (see the
899 section <a href="#Redeclarations">Redeclarations and
900 Overloads</a>), while the semantics-centric view represents the
901 program semantics. The two views are kept synchronized by semantic
902 analysis while the ASTs are being constructed.</dd>
903
904 <dt>Storage of declarations within that context</dt>
905 <dd>Every declaration context can contain some number of
906 declarations. For example, a C++ class (represented
907 by <code>RecordDecl</code>) contains various member functions,
908 fields, nested types, and so on. All of these declarations will be
909 stored within the <code>DeclContext</code>, and one can iterate
910 over the declarations via
911 [<code>DeclContext::decls_begin()</code>,
912 <code>DeclContext::decls_end()</code>). This mechanism provides
913 the source-centric view of declarations in the context.</dd>
914
915 <dt>Lookup of declarations within that context</dt>
916 <dd>The <code>DeclContext</code> structure provides efficient name
917 lookup for names within that declaration context. For example,
918 if <code>N</code> is a namespace we can look for the
919 name <code>N::f</code>
920 using <code>DeclContext::lookup</code>. The lookup itself is
921 based on a lazily-constructed array (for declaration contexts
922 with a small number of declarations) or hash table (for
923 declaration contexts with more declarations). The lookup
924 operation provides the semantics-centric view of the declarations
925 in the context.</dd>
926
927 <dt>Ownership of declarations</dt>
928 <dd>The <code>DeclContext</code> owns all of the declarations that
929 were declared within its declaration context, and is responsible
930 for the management of their memory as well as their
931 (de-)serialization.</dd>
932</dl>
933
934<p>The declarations stored within each declaration context are
935 called <i>scoped declarations</i> and the AST nodes for each of
936 these declarations are
937 derived from the <code>ScopedDecl</code> class, which provides
938 information about the context in which that declaration lives. One
939 can retrieve the <code>DeclContext</code> that contains a
940 particular <code>ScopedDecl</code>
941 using <code>ScopedDecl::getDeclContext</code>. However, see the
942 section <a href="#LexicalAndSemanticContexts">Lexical and Semantic
943 Contexts</a> for more information about how to interpret this
944 context information.</p>
945
946<h4 id="Redeclarations">Redeclarations and Overloads</h4>
947<p>Within a translation unit, it is common for an entity to be
948declared several times. For example, we might declare a function "f"
949 and then later re-declare it as part of an inlined definition:</p>
950
951<pre>
952void f(int x, int y, int z = 1);
953
954inline void f(int x, int y, int z) { /* ... */ }
955</pre>
956
957<p>The representation of "f" differs in the source-centric and
958 semantics-centric views of a declaration context. In the
959 source-centric view, all redeclarations will be present, in the
960 order they occurred in the source code, making
961 this view suitable for clients that wish to see the structure of
962 the source code. In the semantics-centric view, only the most recent "f"
963 will be found by the lookup, since it effectively replaces the first
964 declaration of "f".</p>
965
966<p>In the semantics-centric view, overloading of functions is
967 represented explicitly. For example, given two declarations of a
968 function "g" that are overloaded, e.g.,</p>
969<pre>
970void g();
971void g(int);
972</pre>
973<p>the <code>DeclContext::lookup</code> operation will return
974 an <code>OverloadedFunctionDecl</code> that contains both
975 declarations of "g". Clients that perform semantic analysis on a
976 program that is not concerned with the actual source code will
977 primarily use this semantics-centric view.</p>
978
979<h4 id="LexicalAndSemanticContexts">Lexical and Semantic Contexts</h4>
980<p>Each scoped declaration (whose AST node derived
981 from <code>ScopedDecl</code>) has two potentially different
982 declaration contexts: a <i>lexical</i> context, which corresponds to
983 the source-centric view of the declaration context, and
984 a <i>semantic</i> context, which corresponds to the
985 semantics-centric view. The lexical context is accessible
986 via <code>ScopedDecl::getLexicalDeclContext</code> while the
987 semantic context is accessible
988 via <code>ScopedDecl::getDeclContext</code>, both of which return
989 <code>DeclContext</code> pointers. For most declarations, the two
990 contexts are identical. For example:</p>
991
992<pre>
993class X {
994public:
995 void f(int x);
996};
997</pre>
998
999<p>Here, the semantic and lexical contexts of <code>X::f</code> are
1000 the <code>DeclContext</code> associated with the
1001 class <code>X</code> (itself stored as a <code>RecordDecl</code> AST
1002 node). However, we can now define <code>X::f</code> out-of-line:</p>
1003
1004<pre>
1005void X::f(int x = 17) { /* ... */ }
1006</pre>
1007
1008<p>This definition of has different lexical and semantic
1009 contexts. The lexical context corresponds to the declaration
1010 context in which the actual declaration occurred in the source
1011 code, e.g., the translation unit containing <code>X</code>. Thus,
1012 this declaration of <code>X::f</code> can be found by traversing
1013 the declarations provided by
1014 [<code>decls_begin()</code>, <code>decls_end()</code>) in the
1015 translation unit.</p>
1016
1017<p>The semantic context of <code>X::f</code> corresponds to the
1018 class <code>X</code>, since this member function is (semantically) a
1019 member of <code>X</code>. Lookup of the name <code>f</code> into
1020 the <code>DeclContext</code> associated with <code>X</code> will
1021 then return the definition of <code>X::f</code> (including
1022 information about the default argument).</p>
1023
1024<h4 id="TransparentContexts">Transparent Declaration Contexts</h4>
1025<p>In C and C++, there are several contexts in which names that are
1026 logically declared inside another declaration will actually "leak"
1027 out into the enclosing scope from the perspective of name
1028 lookup. The most obvious instance of this behavior is in
1029 enumeration types, e.g.,</p>
1030<pre>
1031enum Color {
1032 Red,
1033 Green,
1034 Blue
1035};
1036</pre>
1037
1038<p>Here, <code>Color</code> is an enumeration, which is a declaration
1039 context that contains the
1040 enumerators <code>Red</code>, <code>Green</code>,
1041 and <code>Blue</code>. Thus, traversing the list of declarations
1042 contained in the enumeration <code>Color</code> will
1043 yield <code>Red</code>, <code>Green</code>,
1044 and <code>Blue</code>. However, outside of the scope
1045 of <code>Color</code> one can name the enumerator <code>Red</code>
1046 without qualifying the name, e.g.,</p>
1047
1048<pre>
1049Color c = Red;
1050</pre>
1051
1052<p>There are other entities in C++ that provide similar behavior. For
1053 example, linkage specifications that use curly braces:</p>
1054
1055<pre>
1056extern "C" {
1057 void f(int);
1058 void g(int);
1059}
1060// f and g are visible here
1061</pre>
1062
1063<p>For source-level accuracy, we treat the linkage specification and
1064 enumeration type as a
1065 declaration context in which its enclosed declarations ("Red",
1066 "Green", and "Blue"; "f" and "g")
1067 are declared. However, these declarations are visible outside of the
1068 scope of the declaration context.</p>
1069
1070<p>These language features (and several others, described below) have
1071 roughly the same set of
1072 requirements: declarations are declared within a particular lexical
1073 context, but the declarations are also found via name lookup in
1074 scopes enclosing the declaration itself. This feature is implemented
1075 via <i>transparent</i> declaration contexts
1076 (see <code>DeclContext::isTransparentContext()</code>), whose
1077 declarations are visible in the nearest enclosing non-transparent
1078 declaration context. This means that the lexical context of the
1079 declaration (e.g., an enumerator) will be the
1080 transparent <code>DeclContext</code> itself, as will the semantic
1081 context, but the declaration will be visible in every outer context
1082 up to and including the first non-transparent declaration context (since
1083 transparent declaration contexts can be nested).</p>
1084
1085<p>The transparent <code>DeclContexts</code> are:</p>
1086<ul>
1087 <li>Enumerations (but not C++0x "scoped enumerations"):
1088 <pre>
1089enum Color {
1090 Red,
1091 Green,
1092 Blue
1093};
1094// Red, Green, and Blue are in scope
1095 </pre></li>
1096 <li>C++ linkage specifications:
1097 <pre>
1098extern "C" {
1099 void f(int);
1100 void g(int);
1101}
1102// f and g are in scope
1103 </pre></li>
1104 <li>Anonymous unions and structs:
1105 <pre>
1106struct LookupTable {
1107 bool IsVector;
1108 union {
1109 std::vector&lt;Item&gt; *Vector;
1110 std::set&lt;Item&gt; *Set;
1111 };
1112};
1113
1114LookupTable LT;
1115LT.Vector = 0; // Okay: finds Vector inside the unnamed union
1116 </pre>
1117 </li>
1118 <li>C++0x inline namespaces:
1119<pre>
1120namespace mylib {
1121 inline namespace debug {
1122 class X;
1123 }
1124}
1125mylib::X *xp; // okay: mylib::X refers to mylib::debug::X
1126</pre>
1127</li>
1128</ul>
1129
1130
1131<h4 id="MultiDeclContext">Multiply-Defined Declaration Contexts</h4>
1132<p>C++ namespaces have the interesting--and, so far, unique--property that
1133the namespace can be defined multiple times, and the declarations
1134provided by each namespace definition are effectively merged (from
1135the semantic point of view). For example, the following two code
1136snippets are semantically indistinguishable:</p>
1137<pre>
1138// Snippet #1:
1139namespace N {
1140 void f();
1141}
1142namespace N {
1143 void f(int);
1144}
1145
1146// Snippet #2:
1147namespace N {
1148 void f();
1149 void f(int);
1150}
1151</pre>
1152
1153<p>In Clang's representation, the source-centric view of declaration
1154 contexts will actually have two separate <code>NamespaceDecl</code>
1155 nodes in Snippet #1, each of which is a declaration context that
1156 contains a single declaration of "f". However, the semantics-centric
1157 view provided by name lookup into the namespace <code>N</code> for
1158 "f" will return an <code>OverloadedFunctionDecl</code> that contains
1159 both declarations of "f".</p>
1160
1161<p><code>DeclContext</code> manages multiply-defined declaration
1162 contexts internally. The
1163 function <code>DeclContext::getPrimaryContext</code> retrieves the
1164 "primary" context for a given <code>DeclContext</code> instance,
1165 which is the <code>DeclContext</code> responsible for maintaining
1166 the lookup table used for the semantics-centric view. Given the
1167 primary context, one can follow the chain
1168 of <code>DeclContext</code> nodes that define additional
1169 declarations via <code>DeclContext::getNextContext</code>. Note that
1170 these functions are used internally within the lookup and insertion
1171 methods of the <code>DeclContext</code>, so the vast majority of
1172 clients can ignore them.</p>
1173
1174<!-- ======================================================================= -->
Ted Kremenek8bc05712007-10-10 23:01:43 +00001175<h3 id="CFG">The <tt>CFG</tt> class</h3>
1176<!-- ======================================================================= -->
1177
1178<p>The <tt>CFG</tt> class is designed to represent a source-level
1179control-flow graph for a single statement (<tt>Stmt*</tt>). Typically
1180instances of <tt>CFG</tt> are constructed for function bodies (usually
1181an instance of <tt>CompoundStmt</tt>), but can also be instantiated to
1182represent the control-flow of any class that subclasses <tt>Stmt</tt>,
1183which includes simple expressions. Control-flow graphs are especially
1184useful for performing
1185<a href="http://en.wikipedia.org/wiki/Data_flow_analysis#Sensitivities">flow-
1186or path-sensitive</a> program analyses on a given function.</p>
1187
Chris Lattner62fd2782008-11-22 21:41:31 +00001188<!-- ============ -->
Ted Kremenek8bc05712007-10-10 23:01:43 +00001189<h4>Basic Blocks</h4>
Chris Lattner62fd2782008-11-22 21:41:31 +00001190<!-- ============ -->
Ted Kremenek8bc05712007-10-10 23:01:43 +00001191
1192<p>Concretely, an instance of <tt>CFG</tt> is a collection of basic
1193blocks. Each basic block is an instance of <tt>CFGBlock</tt>, which
1194simply contains an ordered sequence of <tt>Stmt*</tt> (each referring
1195to statements in the AST). The ordering of statements within a block
1196indicates unconditional flow of control from one statement to the
1197next. <a href="#ConditionalControlFlow">Conditional control-flow</a>
1198is represented using edges between basic blocks. The statements
1199within a given <tt>CFGBlock</tt> can be traversed using
1200the <tt>CFGBlock::*iterator</tt> interface.</p>
1201
1202<p>
Ted Kremenek18e17e72007-10-18 22:50:52 +00001203A <tt>CFG</tt> object owns the instances of <tt>CFGBlock</tt> within
Ted Kremenek8bc05712007-10-10 23:01:43 +00001204the control-flow graph it represents. Each <tt>CFGBlock</tt> within a
1205CFG is also uniquely numbered (accessible
1206via <tt>CFGBlock::getBlockID()</tt>). Currently the number is
1207based on the ordering the blocks were created, but no assumptions
1208should be made on how <tt>CFGBlock</tt>s are numbered other than their
1209numbers are unique and that they are numbered from 0..N-1 (where N is
1210the number of basic blocks in the CFG).</p>
1211
Chris Lattner62fd2782008-11-22 21:41:31 +00001212<!-- ===================== -->
Ted Kremenek8bc05712007-10-10 23:01:43 +00001213<h4>Entry and Exit Blocks</h4>
Chris Lattner62fd2782008-11-22 21:41:31 +00001214<!-- ===================== -->
Ted Kremenek8bc05712007-10-10 23:01:43 +00001215
1216Each instance of <tt>CFG</tt> contains two special blocks:
1217an <i>entry</i> block (accessible via <tt>CFG::getEntry()</tt>), which
1218has no incoming edges, and an <i>exit</i> block (accessible
1219via <tt>CFG::getExit()</tt>), which has no outgoing edges. Neither
1220block contains any statements, and they serve the role of providing a
1221clear entrance and exit for a body of code such as a function body.
1222The presence of these empty blocks greatly simplifies the
1223implementation of many analyses built on top of CFGs.
1224
Chris Lattner62fd2782008-11-22 21:41:31 +00001225<!-- ===================================================== -->
Ted Kremenek8bc05712007-10-10 23:01:43 +00001226<h4 id ="ConditionalControlFlow">Conditional Control-Flow</h4>
Chris Lattner62fd2782008-11-22 21:41:31 +00001227<!-- ===================================================== -->
Ted Kremenek8bc05712007-10-10 23:01:43 +00001228
1229<p>Conditional control-flow (such as those induced by if-statements
1230and loops) is represented as edges between <tt>CFGBlock</tt>s.
1231Because different C language constructs can induce control-flow,
1232each <tt>CFGBlock</tt> also records an extra <tt>Stmt*</tt> that
1233represents the <i>terminator</i> of the block. A terminator is simply
1234the statement that caused the control-flow, and is used to identify
1235the nature of the conditional control-flow between blocks. For
1236example, in the case of an if-statement, the terminator refers to
1237the <tt>IfStmt</tt> object in the AST that represented the given
1238branch.</p>
1239
1240<p>To illustrate, consider the following code example:</p>
1241
1242<code>
1243int foo(int x) {<br>
1244&nbsp;&nbsp;x = x + 1;<br>
1245<br>
1246&nbsp;&nbsp;if (x > 2) x++;<br>
1247&nbsp;&nbsp;else {<br>
1248&nbsp;&nbsp;&nbsp;&nbsp;x += 2;<br>
1249&nbsp;&nbsp;&nbsp;&nbsp;x *= 2;<br>
1250&nbsp;&nbsp;}<br>
1251<br>
1252&nbsp;&nbsp;return x;<br>
1253}
1254</code>
1255
1256<p>After invoking the parser+semantic analyzer on this code fragment,
1257the AST of the body of <tt>foo</tt> is referenced by a
1258single <tt>Stmt*</tt>. We can then construct an instance
1259of <tt>CFG</tt> representing the control-flow graph of this function
1260body by single call to a static class method:</p>
1261
1262<code>
1263&nbsp;&nbsp;Stmt* FooBody = ...<br>
1264&nbsp;&nbsp;CFG* FooCFG = <b>CFG::buildCFG</b>(FooBody);
1265</code>
1266
1267<p>It is the responsibility of the caller of <tt>CFG::buildCFG</tt>
1268to <tt>delete</tt> the returned <tt>CFG*</tt> when the CFG is no
1269longer needed.</p>
1270
1271<p>Along with providing an interface to iterate over
1272its <tt>CFGBlock</tt>s, the <tt>CFG</tt> class also provides methods
1273that are useful for debugging and visualizing CFGs. For example, the
1274method
1275<tt>CFG::dump()</tt> dumps a pretty-printed version of the CFG to
1276standard error. This is especially useful when one is using a
1277debugger such as gdb. For example, here is the output
1278of <tt>FooCFG->dump()</tt>:</p>
1279
1280<code>
1281&nbsp;[ B5 (ENTRY) ]<br>
1282&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (0):<br>
1283&nbsp;&nbsp;&nbsp;&nbsp;Successors (1): B4<br>
1284<br>
1285&nbsp;[ B4 ]<br>
1286&nbsp;&nbsp;&nbsp;&nbsp;1: x = x + 1<br>
1287&nbsp;&nbsp;&nbsp;&nbsp;2: (x > 2)<br>
1288&nbsp;&nbsp;&nbsp;&nbsp;<b>T: if [B4.2]</b><br>
1289&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (1): B5<br>
1290&nbsp;&nbsp;&nbsp;&nbsp;Successors (2): B3 B2<br>
1291<br>
1292&nbsp;[ B3 ]<br>
1293&nbsp;&nbsp;&nbsp;&nbsp;1: x++<br>
1294&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (1): B4<br>
1295&nbsp;&nbsp;&nbsp;&nbsp;Successors (1): B1<br>
1296<br>
1297&nbsp;[ B2 ]<br>
1298&nbsp;&nbsp;&nbsp;&nbsp;1: x += 2<br>
1299&nbsp;&nbsp;&nbsp;&nbsp;2: x *= 2<br>
1300&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (1): B4<br>
1301&nbsp;&nbsp;&nbsp;&nbsp;Successors (1): B1<br>
1302<br>
1303&nbsp;[ B1 ]<br>
1304&nbsp;&nbsp;&nbsp;&nbsp;1: return x;<br>
1305&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (2): B2 B3<br>
1306&nbsp;&nbsp;&nbsp;&nbsp;Successors (1): B0<br>
1307<br>
1308&nbsp;[ B0 (EXIT) ]<br>
1309&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (1): B1<br>
1310&nbsp;&nbsp;&nbsp;&nbsp;Successors (0):
1311</code>
1312
1313<p>For each block, the pretty-printed output displays for each block
1314the number of <i>predecessor</i> blocks (blocks that have outgoing
1315control-flow to the given block) and <i>successor</i> blocks (blocks
1316that have control-flow that have incoming control-flow from the given
1317block). We can also clearly see the special entry and exit blocks at
1318the beginning and end of the pretty-printed output. For the entry
1319block (block B5), the number of predecessor blocks is 0, while for the
1320exit block (block B0) the number of successor blocks is 0.</p>
1321
1322<p>The most interesting block here is B4, whose outgoing control-flow
1323represents the branching caused by the sole if-statement
1324in <tt>foo</tt>. Of particular interest is the second statement in
1325the block, <b><tt>(x > 2)</tt></b>, and the terminator, printed
1326as <b><tt>if [B4.2]</tt></b>. The second statement represents the
1327evaluation of the condition of the if-statement, which occurs before
1328the actual branching of control-flow. Within the <tt>CFGBlock</tt>
1329for B4, the <tt>Stmt*</tt> for the second statement refers to the
1330actual expression in the AST for <b><tt>(x > 2)</tt></b>. Thus
1331pointers to subclasses of <tt>Expr</tt> can appear in the list of
1332statements in a block, and not just subclasses of <tt>Stmt</tt> that
1333refer to proper C statements.</p>
1334
1335<p>The terminator of block B4 is a pointer to the <tt>IfStmt</tt>
1336object in the AST. The pretty-printer outputs <b><tt>if
1337[B4.2]</tt></b> because the condition expression of the if-statement
1338has an actual place in the basic block, and thus the terminator is
1339essentially
1340<i>referring</i> to the expression that is the second statement of
1341block B4 (i.e., B4.2). In this manner, conditions for control-flow
1342(which also includes conditions for loops and switch statements) are
1343hoisted into the actual basic block.</p>
1344
Chris Lattner62fd2782008-11-22 21:41:31 +00001345<!-- ===================== -->
1346<!-- <h4>Implicit Control-Flow</h4> -->
1347<!-- ===================== -->
Ted Kremenek8bc05712007-10-10 23:01:43 +00001348
1349<!--
1350<p>A key design principle of the <tt>CFG</tt> class was to not require
1351any transformations to the AST in order to represent control-flow.
1352Thus the <tt>CFG</tt> does not perform any "lowering" of the
1353statements in an AST: loops are not transformed into guarded gotos,
1354short-circuit operations are not converted to a set of if-statements,
1355and so on.</p>
1356-->
Ted Kremenek17a295d2008-06-11 06:19:49 +00001357
Chris Lattner7bad1992008-11-16 21:48:07 +00001358
1359<!-- ======================================================================= -->
1360<h3 id="Constants">Constant Folding in the Clang AST</h3>
1361<!-- ======================================================================= -->
1362
1363<p>There are several places where constants and constant folding matter a lot to
1364the Clang front-end. First, in general, we prefer the AST to retain the source
1365code as close to how the user wrote it as possible. This means that if they
1366wrote "5+4", we want to keep the addition and two constants in the AST, we don't
1367want to fold to "9". This means that constant folding in various ways turns
1368into a tree walk that needs to handle the various cases.</p>
1369
1370<p>However, there are places in both C and C++ that require constants to be
1371folded. For example, the C standard defines what an "integer constant
1372expression" (i-c-e) is with very precise and specific requirements. The
1373language then requires i-c-e's in a lot of places (for example, the size of a
1374bitfield, the value for a case statement, etc). For these, we have to be able
1375to constant fold the constants, to do semantic checks (e.g. verify bitfield size
1376is non-negative and that case statements aren't duplicated). We aim for Clang
1377to be very pedantic about this, diagnosing cases when the code does not use an
1378i-c-e where one is required, but accepting the code unless running with
1379<tt>-pedantic-errors</tt>.</p>
1380
1381<p>Things get a little bit more tricky when it comes to compatibility with
1382real-world source code. Specifically, GCC has historically accepted a huge
1383superset of expressions as i-c-e's, and a lot of real world code depends on this
1384unfortuate accident of history (including, e.g., the glibc system headers). GCC
1385accepts anything its "fold" optimizer is capable of reducing to an integer
1386constant, which means that the definition of what it accepts changes as its
1387optimizer does. One example is that GCC accepts things like "case X-X:" even
1388when X is a variable, because it can fold this to 0.</p>
1389
1390<p>Another issue are how constants interact with the extensions we support, such
1391as __builtin_constant_p, __builtin_inf, __extension__ and many others. C99
1392obviously does not specify the semantics of any of these extensions, and the
1393definition of i-c-e does not include them. However, these extensions are often
1394used in real code, and we have to have a way to reason about them.</p>
1395
1396<p>Finally, this is not just a problem for semantic analysis. The code
1397generator and other clients have to be able to fold constants (e.g. to
1398initialize global variables) and has to handle a superset of what C99 allows.
1399Further, these clients can benefit from extended information. For example, we
1400know that "foo()||1" always evaluates to true, but we can't replace the
1401expression with true because it has side effects.</p>
1402
1403<!-- ======================= -->
1404<h4>Implementation Approach</h4>
1405<!-- ======================= -->
1406
1407<p>After trying several different approaches, we've finally converged on a
1408design (Note, at the time of this writing, not all of this has been implemented,
1409consider this a design goal!). Our basic approach is to define a single
1410recursive method evaluation method (<tt>Expr::Evaluate</tt>), which is
1411implemented in <tt>AST/ExprConstant.cpp</tt>. Given an expression with 'scalar'
1412type (integer, fp, complex, or pointer) this method returns the following
1413information:</p>
1414
1415<ul>
1416<li>Whether the expression is an integer constant expression, a general
1417 constant that was folded but has no side effects, a general constant that
1418 was folded but that does have side effects, or an uncomputable/unfoldable
1419 value.
1420</li>
1421<li>If the expression was computable in any way, this method returns the APValue
1422 for the result of the expression.</li>
1423<li>If the expression is not evaluatable at all, this method returns
1424 information on one of the problems with the expression. This includes a
1425 SourceLocation for where the problem is, and a diagnostic ID that explains
1426 the problem. The diagnostic should be have ERROR type.</li>
1427<li>If the expression is not an integer constant expression, this method returns
1428 information on one of the problems with the expression. This includes a
1429 SourceLocation for where the problem is, and a diagnostic ID that explains
1430 the problem. The diagnostic should be have EXTENSION type.</li>
1431</ul>
1432
1433<p>This information gives various clients the flexibility that they want, and we
1434will eventually have some helper methods for various extensions. For example,
1435Sema should have a <tt>Sema::VerifyIntegerConstantExpression</tt> method, which
1436calls Evaluate on the expression. If the expression is not foldable, the error
1437is emitted, and it would return true. If the expression is not an i-c-e, the
1438EXTENSION diagnostic is emitted. Finally it would return false to indicate that
1439the AST is ok.</p>
1440
1441<p>Other clients can use the information in other ways, for example, codegen can
1442just use expressions that are foldable in any way.</p>
1443
1444<!-- ========== -->
1445<h4>Extensions</h4>
1446<!-- ========== -->
1447
Chris Lattner552de0a2008-11-23 08:16:56 +00001448<p>This section describes how some of the various extensions Clang supports
Chris Lattner7bad1992008-11-16 21:48:07 +00001449interacts with constant evaluation:</p>
1450
1451<ul>
1452<li><b><tt>__extension__</tt></b>: The expression form of this extension causes
1453 any evaluatable subexpression to be accepted as an integer constant
1454 expression.</li>
1455<li><b><tt>__builtin_constant_p</tt></b>: This returns true (as a integer
Chris Lattner28daa532008-12-12 06:55:44 +00001456 constant expression) if the operand is any evaluatable constant. As a
1457 special case, if <tt>__builtin_constant_p</tt> is the (potentially
1458 parenthesized) condition of a conditional operator expression ("?:"), only
Chris Lattner42b83dd2008-12-12 18:00:51 +00001459 the true side of the conditional operator is considered, and it is evaluated
1460 with full constant folding.</li>
Chris Lattner7bad1992008-11-16 21:48:07 +00001461<li><b><tt>__builtin_choose_expr</tt></b>: The condition is required to be an
1462 integer constant expression, but we accept any constant as an "extension of
1463 an extension". This only evaluates one operand depending on which way the
1464 condition evaluates.</li>
1465<li><b><tt>__builtin_classify_type</tt></b>: This always returns an integer
1466 constant expression.</li>
1467<li><b><tt>__builtin_inf,nan,..</tt></b>: These are treated just like a
1468 floating-point literal.</li>
1469<li><b><tt>__builtin_abs,copysign,..</tt></b>: These are constant folded as
1470 general constant expressions.</li>
1471</ul>
1472
1473
1474
1475
Ted Kremenek17a295d2008-06-11 06:19:49 +00001476</div>
1477</body>
Douglas Gregor2e1cd422008-11-17 14:58:09 +00001478</html>