blob: 1e1fc990adc57e025f27c4147756a5ef093f30c2 [file] [log] [blame]
Ted Kremenek17a295d2008-06-11 06:19:49 +00001<html>
2<head>
Chris Lattner86920d32007-07-31 05:42:17 +00003<title>"clang" CFE Internals Manual</title>
Ted Kremenek17a295d2008-06-11 06:19:49 +00004<link type="text/css" rel="stylesheet" href="../menu.css" />
5<link type="text/css" rel="stylesheet" href="../content.css" />
6</head>
7<body>
8
9<!--#include virtual="../menu.html.incl"-->
10
11<div id="content">
Chris Lattner86920d32007-07-31 05:42:17 +000012
13<h1>"clang" CFE Internals Manual</h1>
14
15<ul>
16<li><a href="#intro">Introduction</a></li>
17<li><a href="#libsystem">LLVM System and Support Libraries</a></li>
18<li><a href="#libbasic">The clang 'Basic' Library</a>
19 <ul>
20 <li><a href="#SourceLocation">The SourceLocation and SourceManager
21 classes</a></li>
22 </ul>
23</li>
24<li><a href="#liblex">The Lexer and Preprocessor Library</a>
25 <ul>
26 <li><a href="#Token">The Token class</a></li>
27 <li><a href="#Lexer">The Lexer class</a></li>
Chris Lattner79281252008-03-09 02:27:26 +000028 <li><a href="#TokenLexer">The TokenLexer class</a></li>
Chris Lattner86920d32007-07-31 05:42:17 +000029 <li><a href="#MultipleIncludeOpt">The MultipleIncludeOpt class</a></li>
30 </ul>
31</li>
32<li><a href="#libparse">The Parser Library</a>
33 <ul>
34 </ul>
35</li>
36<li><a href="#libast">The AST Library</a>
37 <ul>
38 <li><a href="#Type">The Type class and its subclasses</a></li>
39 <li><a href="#QualType">The QualType class</a></li>
Ted Kremenek8bc05712007-10-10 23:01:43 +000040 <li><a href="#CFG">The CFG class</a></li>
Chris Lattner7bad1992008-11-16 21:48:07 +000041 <li><a href="#Constants">Constant Folding in the Clang AST</a></li>
Chris Lattner86920d32007-07-31 05:42:17 +000042 </ul>
43</li>
44</ul>
45
46
47<!-- ======================================================================= -->
48<h2 id="intro">Introduction</h2>
49<!-- ======================================================================= -->
50
51<p>This document describes some of the more important APIs and internal design
52decisions made in the clang C front-end. The purpose of this document is to
53both capture some of this high level information and also describe some of the
54design decisions behind it. This is meant for people interested in hacking on
55clang, not for end-users. The description below is categorized by
56libraries, and does not describe any of the clients of the libraries.</p>
57
58<!-- ======================================================================= -->
59<h2 id="libsystem">LLVM System and Support Libraries</h2>
60<!-- ======================================================================= -->
61
62<p>The LLVM libsystem library provides the basic clang system abstraction layer,
63which is used for file system access. The LLVM libsupport library provides many
64underlying libraries and <a
65href="http://llvm.org/docs/ProgrammersManual.html">data-structures</a>,
66 including command line option
67processing and various containers.</p>
68
69<!-- ======================================================================= -->
70<h2 id="libbasic">The clang 'Basic' Library</h2>
71<!-- ======================================================================= -->
72
73<p>This library certainly needs a better name. The 'basic' library contains a
74number of low-level utilities for tracking and manipulating source buffers,
75locations within the source buffers, diagnostics, tokens, target abstraction,
76and information about the subset of the language being compiled for.</p>
77
78<p>Part of this infrastructure is specific to C (such as the TargetInfo class),
79other parts could be reused for other non-C-based languages (SourceLocation,
80SourceManager, Diagnostics, FileManager). When and if there is future demand
81we can figure out if it makes sense to introduce a new library, move the general
82classes somewhere else, or introduce some other solution.</p>
83
84<p>We describe the roles of these classes in order of their dependencies.</p>
85
86<!-- ======================================================================= -->
87<h3 id="SourceLocation">The SourceLocation and SourceManager classes</h3>
88<!-- ======================================================================= -->
89
90<p>Strangely enough, the SourceLocation class represents a location within the
91source code of the program. Important design points include:</p>
92
93<ol>
94<li>sizeof(SourceLocation) must be extremely small, as these are embedded into
95 many AST nodes and are passed around often. Currently it is 32 bits.</li>
96<li>SourceLocation must be a simple value object that can be efficiently
97 copied.</li>
98<li>We should be able to represent a source location for any byte of any input
99 file. This includes in the middle of tokens, in whitespace, in trigraphs,
100 etc.</li>
101<li>A SourceLocation must encode the current #include stack that was active when
102 the location was processed. For example, if the location corresponds to a
103 token, it should contain the set of #includes active when the token was
104 lexed. This allows us to print the #include stack for a diagnostic.</li>
105<li>SourceLocation must be able to describe macro expansions, capturing both
106 the ultimate instantiation point and the source of the original character
107 data.</li>
108</ol>
109
110<p>In practice, the SourceLocation works together with the SourceManager class
111to encode two pieces of information about a location: it's physical location
112and it's virtual location. For most tokens, these will be the same. However,
113for a macro expansion (or tokens that came from a _Pragma directive) these will
114describe the location of the characters corresponding to the token and the
115location where the token was used (i.e. the macro instantiation point or the
116location of the _Pragma itself).</p>
117
118<p>For efficiency, we only track one level of macro instantions: if a token was
119produced by multiple instantiations, we only track the source and ultimate
120destination. Though we could track the intermediate instantiation points, this
121would require extra bookkeeping and no known client would benefit substantially
122from this.</p>
123
124<p>The clang front-end inherently depends on the location of a token being
125tracked correctly. If it is ever incorrect, the front-end may get confused and
126die. The reason for this is that the notion of the 'spelling' of a Token in
127clang depends on being able to find the original input characters for the token.
128This concept maps directly to the "physical" location for the token.</p>
129
130<!-- ======================================================================= -->
131<h2 id="liblex">The Lexer and Preprocessor Library</h2>
132<!-- ======================================================================= -->
133
134<p>The Lexer library contains several tightly-connected classes that are involved
135with the nasty process of lexing and preprocessing C source code. The main
136interface to this library for outside clients is the large <a
137href="#Preprocessor">Preprocessor</a> class.
138It contains the various pieces of state that are required to coherently read
139tokens out of a translation unit.</p>
140
141<p>The core interface to the Preprocessor object (once it is set up) is the
142Preprocessor::Lex method, which returns the next <a href="#Token">Token</a> from
143the preprocessor stream. There are two types of token providers that the
144preprocessor is capable of reading from: a buffer lexer (provided by the <a
145href="#Lexer">Lexer</a> class) and a buffered token stream (provided by the <a
Chris Lattner79281252008-03-09 02:27:26 +0000146href="#TokenLexer">TokenLexer</a> class).
Chris Lattner86920d32007-07-31 05:42:17 +0000147
148
149<!-- ======================================================================= -->
150<h3 id="Token">The Token class</h3>
151<!-- ======================================================================= -->
152
153<p>The Token class is used to represent a single lexed token. Tokens are
154intended to be used by the lexer/preprocess and parser libraries, but are not
155intended to live beyond them (for example, they should not live in the ASTs).<p>
156
157<p>Tokens most often live on the stack (or some other location that is efficient
158to access) as the parser is running, but occasionally do get buffered up. For
159example, macro definitions are stored as a series of tokens, and the C++
160front-end will eventually need to buffer tokens up for tentative parsing and
161various pieces of look-ahead. As such, the size of a Token matter. On a 32-bit
162system, sizeof(Token) is currently 16 bytes.</p>
163
164<p>Tokens contain the following information:</p>
165
166<ul>
167<li><b>A SourceLocation</b> - This indicates the location of the start of the
168token.</li>
169
170<li><b>A length</b> - This stores the length of the token as stored in the
171SourceBuffer. For tokens that include them, this length includes trigraphs and
172escaped newlines which are ignored by later phases of the compiler. By pointing
173into the original source buffer, it is always possible to get the original
174spelling of a token completely accurately.</li>
175
176<li><b>IdentifierInfo</b> - If a token takes the form of an identifier, and if
177identifier lookup was enabled when the token was lexed (e.g. the lexer was not
178reading in 'raw' mode) this contains a pointer to the unique hash value for the
179identifier. Because the lookup happens before keyword identification, this
180field is set even for language keywords like 'for'.</li>
181
182<li><b>TokenKind</b> - This indicates the kind of token as classified by the
183lexer. This includes things like <tt>tok::starequal</tt> (for the "*="
184operator), <tt>tok::ampamp</tt> for the "&amp;&amp;" token, and keyword values
185(e.g. <tt>tok::kw_for</tt>) for identifiers that correspond to keywords. Note
186that some tokens can be spelled multiple ways. For example, C++ supports
187"operator keywords", where things like "and" are treated exactly like the
188"&amp;&amp;" operator. In these cases, the kind value is set to
189<tt>tok::ampamp</tt>, which is good for the parser, which doesn't have to
190consider both forms. For something that cares about which form is used (e.g.
191the preprocessor 'stringize' operator) the spelling indicates the original
192form.</li>
193
194<li><b>Flags</b> - There are currently four flags tracked by the
195lexer/preprocessor system on a per-token basis:
196
197 <ol>
198 <li><b>StartOfLine</b> - This was the first token that occurred on its input
199 source line.</li>
200 <li><b>LeadingSpace</b> - There was a space character either immediately
201 before the token or transitively before the token as it was expanded
202 through a macro. The definition of this flag is very closely defined by
203 the stringizing requirements of the preprocessor.</li>
204 <li><b>DisableExpand</b> - This flag is used internally to the preprocessor to
205 represent identifier tokens which have macro expansion disabled. This
206 prevents them from being considered as candidates for macro expansion ever
207 in the future.</li>
208 <li><b>NeedsCleaning</b> - This flag is set if the original spelling for the
209 token includes a trigraph or escaped newline. Since this is uncommon,
210 many pieces of code can fast-path on tokens that did not need cleaning.
211 </p>
212 </ol>
213</li>
214</ul>
215
216<p>One interesting (and somewhat unusual) aspect of tokens is that they don't
217contain any semantic information about the lexed value. For example, if the
218token was a pp-number token, we do not represent the value of the number that
219was lexed (this is left for later pieces of code to decide). Additionally, the
220lexer library has no notion of typedef names vs variable names: both are
221returned as identifiers, and the parser is left to decide whether a specific
222identifier is a typedef or a variable (tracking this requires scope information
223among other things).</p>
224
225<!-- ======================================================================= -->
226<h3 id="Lexer">The Lexer class</h3>
227<!-- ======================================================================= -->
228
229<p>The Lexer class provides the mechanics of lexing tokens out of a source
230buffer and deciding what they mean. The Lexer is complicated by the fact that
231it operates on raw buffers that have not had spelling eliminated (this is a
232necessity to get decent performance), but this is countered with careful coding
233as well as standard performance techniques (for example, the comment handling
234code is vectorized on X86 and PowerPC hosts).</p>
235
236<p>The lexer has a couple of interesting modal features:</p>
237
238<ul>
239<li>The lexer can operate in 'raw' mode. This mode has several features that
240 make it possible to quickly lex the file (e.g. it stops identifier lookup,
241 doesn't specially handle preprocessor tokens, handles EOF differently, etc).
242 This mode is used for lexing within an "<tt>#if 0</tt>" block, for
243 example.</li>
244<li>The lexer can capture and return comments as tokens. This is required to
245 support the -C preprocessor mode, which passes comments through, and is
246 used by the diagnostic checker to identifier expect-error annotations.</li>
247<li>The lexer can be in ParsingFilename mode, which happens when preprocessing
Chris Lattner84386242007-09-16 19:25:23 +0000248 after reading a #include directive. This mode changes the parsing of '&lt;'
Chris Lattner86920d32007-07-31 05:42:17 +0000249 to return an "angled string" instead of a bunch of tokens for each thing
250 within the filename.</li>
251<li>When parsing a preprocessor directive (after "<tt>#</tt>") the
252 ParsingPreprocessorDirective mode is entered. This changes the parser to
253 return EOM at a newline.</li>
254<li>The Lexer uses a LangOptions object to know whether trigraphs are enabled,
255 whether C++ or ObjC keywords are recognized, etc.</li>
256</ul>
257
258<p>In addition to these modes, the lexer keeps track of a couple of other
259 features that are local to a lexed buffer, which change as the buffer is
260 lexed:</p>
261
262<ul>
263<li>The Lexer uses BufferPtr to keep track of the current character being
264 lexed.</li>
265<li>The Lexer uses IsAtStartOfLine to keep track of whether the next lexed token
266 will start with its "start of line" bit set.</li>
267<li>The Lexer keeps track of the current #if directives that are active (which
268 can be nested).</li>
269<li>The Lexer keeps track of an <a href="#MultipleIncludeOpt">
270 MultipleIncludeOpt</a> object, which is used to
271 detect whether the buffer uses the standard "<tt>#ifndef XX</tt> /
272 <tt>#define XX</tt>" idiom to prevent multiple inclusion. If a buffer does,
273 subsequent includes can be ignored if the XX macro is defined.</li>
274</ul>
275
276<!-- ======================================================================= -->
Chris Lattner79281252008-03-09 02:27:26 +0000277<h3 id="TokenLexer">The TokenLexer class</h3>
Chris Lattner86920d32007-07-31 05:42:17 +0000278<!-- ======================================================================= -->
279
Chris Lattner79281252008-03-09 02:27:26 +0000280<p>The TokenLexer class is a token provider that returns tokens from a list
Chris Lattner86920d32007-07-31 05:42:17 +0000281of tokens that came from somewhere else. It typically used for two things: 1)
282returning tokens from a macro definition as it is being expanded 2) returning
283tokens from an arbitrary buffer of tokens. The later use is used by _Pragma and
284will most likely be used to handle unbounded look-ahead for the C++ parser.</p>
285
286<!-- ======================================================================= -->
287<h3 id="MultipleIncludeOpt">The MultipleIncludeOpt class</h3>
288<!-- ======================================================================= -->
289
290<p>The MultipleIncludeOpt class implements a really simple little state machine
291that is used to detect the standard "<tt>#ifndef XX</tt> / <tt>#define XX</tt>"
292idiom that people typically use to prevent multiple inclusion of headers. If a
293buffer uses this idiom and is subsequently #include'd, the preprocessor can
294simply check to see whether the guarding condition is defined or not. If so,
295the preprocessor can completely ignore the include of the header.</p>
296
297
298
299<!-- ======================================================================= -->
300<h2 id="libparse">The Parser Library</h2>
301<!-- ======================================================================= -->
302
303<!-- ======================================================================= -->
304<h2 id="libast">The AST Library</h2>
305<!-- ======================================================================= -->
306
307<!-- ======================================================================= -->
308<h3 id="Type">The Type class and its subclasses</h3>
309<!-- ======================================================================= -->
310
311<p>The Type class (and its subclasses) are an important part of the AST. Types
312are accessed through the ASTContext class, which implicitly creates and uniques
313them as they are needed. Types have a couple of non-obvious features: 1) they
314do not capture type qualifiers like const or volatile (See
315<a href="#QualType">QualType</a>), and 2) they implicitly capture typedef
Chris Lattner8a2bc622007-07-31 06:37:39 +0000316information. Once created, types are immutable (unlike decls).</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000317
318<p>Typedefs in C make semantic analysis a bit more complex than it would
319be without them. The issue is that we want to capture typedef information
320and represent it in the AST perfectly, but the semantics of operations need to
321"see through" typedefs. For example, consider this code:</p>
322
323<code>
324void func() {<br>
Bill Wendling30d17752007-10-06 01:56:01 +0000325&nbsp;&nbsp;typedef int foo;<br>
326&nbsp;&nbsp;foo X, *Y;<br>
327&nbsp;&nbsp;typedef foo* bar;<br>
328&nbsp;&nbsp;bar Z;<br>
329&nbsp;&nbsp;*X; <i>// error</i><br>
330&nbsp;&nbsp;**Y; <i>// error</i><br>
331&nbsp;&nbsp;**Z; <i>// error</i><br>
Chris Lattner86920d32007-07-31 05:42:17 +0000332}<br>
333</code>
334
335<p>The code above is illegal, and thus we expect there to be diagnostics emitted
336on the annotated lines. In this example, we expect to get:</p>
337
338<pre>
Chris Lattner8a2bc622007-07-31 06:37:39 +0000339<b>test.c:6:1: error: indirection requires pointer operand ('foo' invalid)</b>
Chris Lattner86920d32007-07-31 05:42:17 +0000340*X; // error
341<font color="blue">^~</font>
Chris Lattner8a2bc622007-07-31 06:37:39 +0000342<b>test.c:7:1: error: indirection requires pointer operand ('foo' invalid)</b>
Chris Lattner86920d32007-07-31 05:42:17 +0000343**Y; // error
344<font color="blue">^~~</font>
Chris Lattner8a2bc622007-07-31 06:37:39 +0000345<b>test.c:8:1: error: indirection requires pointer operand ('foo' invalid)</b>
346**Z; // error
347<font color="blue">^~~</font>
Chris Lattner86920d32007-07-31 05:42:17 +0000348</pre>
349
350<p>While this example is somewhat silly, it illustrates the point: we want to
351retain typedef information where possible, so that we can emit errors about
352"<tt>std::string</tt>" instead of "<tt>std::basic_string&lt;char, std:...</tt>".
353Doing this requires properly keeping typedef information (for example, the type
354of "X" is "foo", not "int"), and requires properly propagating it through the
Chris Lattner8a2bc622007-07-31 06:37:39 +0000355various operators (for example, the type of *Y is "foo", not "int"). In order
356to retain this information, the type of these expressions is an instance of the
357TypedefType class, which indicates that the type of these expressions is a
358typedef for foo.
359</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000360
Chris Lattner8a2bc622007-07-31 06:37:39 +0000361<p>Representing types like this is great for diagnostics, because the
362user-specified type is always immediately available. There are two problems
363with this: first, various semantic checks need to make judgements about the
Chris Lattner33fc68a2007-07-31 18:54:50 +0000364<em>actual structure</em> of a type, ignoring typdefs. Second, we need an
365efficient way to query whether two types are structurally identical to each
366other, ignoring typedefs. The solution to both of these problems is the idea of
Chris Lattner8a2bc622007-07-31 06:37:39 +0000367canonical types.</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000368
Chris Lattner8a2bc622007-07-31 06:37:39 +0000369<h4>Canonical Types</h4>
Chris Lattner86920d32007-07-31 05:42:17 +0000370
Chris Lattner8a2bc622007-07-31 06:37:39 +0000371<p>Every instance of the Type class contains a canonical type pointer. For
372simple types with no typedefs involved (e.g. "<tt>int</tt>", "<tt>int*</tt>",
373"<tt>int**</tt>"), the type just points to itself. For types that have a
374typedef somewhere in their structure (e.g. "<tt>foo</tt>", "<tt>foo*</tt>",
375"<tt>foo**</tt>", "<tt>bar</tt>"), the canonical type pointer points to their
376structurally equivalent type without any typedefs (e.g. "<tt>int</tt>",
377"<tt>int*</tt>", "<tt>int**</tt>", and "<tt>int*</tt>" respectively).</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000378
Chris Lattner8a2bc622007-07-31 06:37:39 +0000379<p>This design provides a constant time operation (dereferencing the canonical
380type pointer) that gives us access to the structure of types. For example,
381we can trivially tell that "bar" and "foo*" are the same type by dereferencing
382their canonical type pointers and doing a pointer comparison (they both point
383to the single "<tt>int*</tt>" type).</p>
384
385<p>Canonical types and typedef types bring up some complexities that must be
386carefully managed. Specifically, the "isa/cast/dyncast" operators generally
387shouldn't be used in code that is inspecting the AST. For example, when type
388checking the indirection operator (unary '*' on a pointer), the type checker
389must verify that the operand has a pointer type. It would not be correct to
390check that with "<tt>isa&lt;PointerType&gt;(SubExpr-&gt;getType())</tt>",
391because this predicate would fail if the subexpression had a typedef type.</p>
392
393<p>The solution to this problem are a set of helper methods on Type, used to
394check their properties. In this case, it would be correct to use
395"<tt>SubExpr-&gt;getType()-&gt;isPointerType()</tt>" to do the check. This
396predicate will return true if the <em>canonical type is a pointer</em>, which is
397true any time the type is structurally a pointer type. The only hard part here
398is remembering not to use the <tt>isa/cast/dyncast</tt> operations.</p>
399
400<p>The second problem we face is how to get access to the pointer type once we
401know it exists. To continue the example, the result type of the indirection
402operator is the pointee type of the subexpression. In order to determine the
403type, we need to get the instance of PointerType that best captures the typedef
404information in the program. If the type of the expression is literally a
405PointerType, we can return that, otherwise we have to dig through the
406typedefs to find the pointer type. For example, if the subexpression had type
407"<tt>foo*</tt>", we could return that type as the result. If the subexpression
408had type "<tt>bar</tt>", we want to return "<tt>foo*</tt>" (note that we do
409<em>not</em> want "<tt>int*</tt>"). In order to provide all of this, Type has
Chris Lattner11406c12007-07-31 16:50:51 +0000410a getAsPointerType() method that checks whether the type is structurally a
Chris Lattner8a2bc622007-07-31 06:37:39 +0000411PointerType and, if so, returns the best one. If not, it returns a null
412pointer.</p>
413
414<p>This structure is somewhat mystical, but after meditating on it, it will
415make sense to you :).</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000416
417<!-- ======================================================================= -->
418<h3 id="QualType">The QualType class</h3>
419<!-- ======================================================================= -->
420
421<p>The QualType class is designed as a trivial value class that is small,
422passed by-value and is efficient to query. The idea of QualType is that it
423stores the type qualifiers (const, volatile, restrict) separately from the types
424themselves: QualType is conceptually a pair of "Type*" and bits for the type
425qualifiers.</p>
426
427<p>By storing the type qualifiers as bits in the conceptual pair, it is
428extremely efficient to get the set of qualifiers on a QualType (just return the
429field of the pair), add a type qualifier (which is a trivial constant-time
430operation that sets a bit), and remove one or more type qualifiers (just return
431a QualType with the bitfield set to empty).</p>
432
433<p>Further, because the bits are stored outside of the type itself, we do not
434need to create duplicates of types with different sets of qualifiers (i.e. there
435is only a single heap allocated "int" type: "const int" and "volatile const int"
436both point to the same heap allocated "int" type). This reduces the heap size
437used to represent bits and also means we do not have to consider qualifiers when
438uniquing types (<a href="#Type">Type</a> does not even contain qualifiers).</p>
439
440<p>In practice, on hosts where it is safe, the 3 type qualifiers are stored in
441the low bit of the pointer to the Type object. This means that QualType is
442exactly the same size as a pointer, and this works fine on any system where
443malloc'd objects are at least 8 byte aligned.</p>
Ted Kremenek8bc05712007-10-10 23:01:43 +0000444
445<!-- ======================================================================= -->
446<h3 id="CFG">The <tt>CFG</tt> class</h3>
447<!-- ======================================================================= -->
448
449<p>The <tt>CFG</tt> class is designed to represent a source-level
450control-flow graph for a single statement (<tt>Stmt*</tt>). Typically
451instances of <tt>CFG</tt> are constructed for function bodies (usually
452an instance of <tt>CompoundStmt</tt>), but can also be instantiated to
453represent the control-flow of any class that subclasses <tt>Stmt</tt>,
454which includes simple expressions. Control-flow graphs are especially
455useful for performing
456<a href="http://en.wikipedia.org/wiki/Data_flow_analysis#Sensitivities">flow-
457or path-sensitive</a> program analyses on a given function.</p>
458
459<h4>Basic Blocks</h4>
460
461<p>Concretely, an instance of <tt>CFG</tt> is a collection of basic
462blocks. Each basic block is an instance of <tt>CFGBlock</tt>, which
463simply contains an ordered sequence of <tt>Stmt*</tt> (each referring
464to statements in the AST). The ordering of statements within a block
465indicates unconditional flow of control from one statement to the
466next. <a href="#ConditionalControlFlow">Conditional control-flow</a>
467is represented using edges between basic blocks. The statements
468within a given <tt>CFGBlock</tt> can be traversed using
469the <tt>CFGBlock::*iterator</tt> interface.</p>
470
471<p>
Ted Kremenek18e17e72007-10-18 22:50:52 +0000472A <tt>CFG</tt> object owns the instances of <tt>CFGBlock</tt> within
Ted Kremenek8bc05712007-10-10 23:01:43 +0000473the control-flow graph it represents. Each <tt>CFGBlock</tt> within a
474CFG is also uniquely numbered (accessible
475via <tt>CFGBlock::getBlockID()</tt>). Currently the number is
476based on the ordering the blocks were created, but no assumptions
477should be made on how <tt>CFGBlock</tt>s are numbered other than their
478numbers are unique and that they are numbered from 0..N-1 (where N is
479the number of basic blocks in the CFG).</p>
480
481<h4>Entry and Exit Blocks</h4>
482
483Each instance of <tt>CFG</tt> contains two special blocks:
484an <i>entry</i> block (accessible via <tt>CFG::getEntry()</tt>), which
485has no incoming edges, and an <i>exit</i> block (accessible
486via <tt>CFG::getExit()</tt>), which has no outgoing edges. Neither
487block contains any statements, and they serve the role of providing a
488clear entrance and exit for a body of code such as a function body.
489The presence of these empty blocks greatly simplifies the
490implementation of many analyses built on top of CFGs.
491
492<h4 id ="ConditionalControlFlow">Conditional Control-Flow</h4>
493
494<p>Conditional control-flow (such as those induced by if-statements
495and loops) is represented as edges between <tt>CFGBlock</tt>s.
496Because different C language constructs can induce control-flow,
497each <tt>CFGBlock</tt> also records an extra <tt>Stmt*</tt> that
498represents the <i>terminator</i> of the block. A terminator is simply
499the statement that caused the control-flow, and is used to identify
500the nature of the conditional control-flow between blocks. For
501example, in the case of an if-statement, the terminator refers to
502the <tt>IfStmt</tt> object in the AST that represented the given
503branch.</p>
504
505<p>To illustrate, consider the following code example:</p>
506
507<code>
508int foo(int x) {<br>
509&nbsp;&nbsp;x = x + 1;<br>
510<br>
511&nbsp;&nbsp;if (x > 2) x++;<br>
512&nbsp;&nbsp;else {<br>
513&nbsp;&nbsp;&nbsp;&nbsp;x += 2;<br>
514&nbsp;&nbsp;&nbsp;&nbsp;x *= 2;<br>
515&nbsp;&nbsp;}<br>
516<br>
517&nbsp;&nbsp;return x;<br>
518}
519</code>
520
521<p>After invoking the parser+semantic analyzer on this code fragment,
522the AST of the body of <tt>foo</tt> is referenced by a
523single <tt>Stmt*</tt>. We can then construct an instance
524of <tt>CFG</tt> representing the control-flow graph of this function
525body by single call to a static class method:</p>
526
527<code>
528&nbsp;&nbsp;Stmt* FooBody = ...<br>
529&nbsp;&nbsp;CFG* FooCFG = <b>CFG::buildCFG</b>(FooBody);
530</code>
531
532<p>It is the responsibility of the caller of <tt>CFG::buildCFG</tt>
533to <tt>delete</tt> the returned <tt>CFG*</tt> when the CFG is no
534longer needed.</p>
535
536<p>Along with providing an interface to iterate over
537its <tt>CFGBlock</tt>s, the <tt>CFG</tt> class also provides methods
538that are useful for debugging and visualizing CFGs. For example, the
539method
540<tt>CFG::dump()</tt> dumps a pretty-printed version of the CFG to
541standard error. This is especially useful when one is using a
542debugger such as gdb. For example, here is the output
543of <tt>FooCFG->dump()</tt>:</p>
544
545<code>
546&nbsp;[ B5 (ENTRY) ]<br>
547&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (0):<br>
548&nbsp;&nbsp;&nbsp;&nbsp;Successors (1): B4<br>
549<br>
550&nbsp;[ B4 ]<br>
551&nbsp;&nbsp;&nbsp;&nbsp;1: x = x + 1<br>
552&nbsp;&nbsp;&nbsp;&nbsp;2: (x > 2)<br>
553&nbsp;&nbsp;&nbsp;&nbsp;<b>T: if [B4.2]</b><br>
554&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (1): B5<br>
555&nbsp;&nbsp;&nbsp;&nbsp;Successors (2): B3 B2<br>
556<br>
557&nbsp;[ B3 ]<br>
558&nbsp;&nbsp;&nbsp;&nbsp;1: x++<br>
559&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (1): B4<br>
560&nbsp;&nbsp;&nbsp;&nbsp;Successors (1): B1<br>
561<br>
562&nbsp;[ B2 ]<br>
563&nbsp;&nbsp;&nbsp;&nbsp;1: x += 2<br>
564&nbsp;&nbsp;&nbsp;&nbsp;2: x *= 2<br>
565&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (1): B4<br>
566&nbsp;&nbsp;&nbsp;&nbsp;Successors (1): B1<br>
567<br>
568&nbsp;[ B1 ]<br>
569&nbsp;&nbsp;&nbsp;&nbsp;1: return x;<br>
570&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (2): B2 B3<br>
571&nbsp;&nbsp;&nbsp;&nbsp;Successors (1): B0<br>
572<br>
573&nbsp;[ B0 (EXIT) ]<br>
574&nbsp;&nbsp;&nbsp;&nbsp;Predecessors (1): B1<br>
575&nbsp;&nbsp;&nbsp;&nbsp;Successors (0):
576</code>
577
578<p>For each block, the pretty-printed output displays for each block
579the number of <i>predecessor</i> blocks (blocks that have outgoing
580control-flow to the given block) and <i>successor</i> blocks (blocks
581that have control-flow that have incoming control-flow from the given
582block). We can also clearly see the special entry and exit blocks at
583the beginning and end of the pretty-printed output. For the entry
584block (block B5), the number of predecessor blocks is 0, while for the
585exit block (block B0) the number of successor blocks is 0.</p>
586
587<p>The most interesting block here is B4, whose outgoing control-flow
588represents the branching caused by the sole if-statement
589in <tt>foo</tt>. Of particular interest is the second statement in
590the block, <b><tt>(x > 2)</tt></b>, and the terminator, printed
591as <b><tt>if [B4.2]</tt></b>. The second statement represents the
592evaluation of the condition of the if-statement, which occurs before
593the actual branching of control-flow. Within the <tt>CFGBlock</tt>
594for B4, the <tt>Stmt*</tt> for the second statement refers to the
595actual expression in the AST for <b><tt>(x > 2)</tt></b>. Thus
596pointers to subclasses of <tt>Expr</tt> can appear in the list of
597statements in a block, and not just subclasses of <tt>Stmt</tt> that
598refer to proper C statements.</p>
599
600<p>The terminator of block B4 is a pointer to the <tt>IfStmt</tt>
601object in the AST. The pretty-printer outputs <b><tt>if
602[B4.2]</tt></b> because the condition expression of the if-statement
603has an actual place in the basic block, and thus the terminator is
604essentially
605<i>referring</i> to the expression that is the second statement of
606block B4 (i.e., B4.2). In this manner, conditions for control-flow
607(which also includes conditions for loops and switch statements) are
608hoisted into the actual basic block.</p>
609
Ted Kremenek98f19b62007-10-10 23:22:00 +0000610<!--
Ted Kremenek8bc05712007-10-10 23:01:43 +0000611<h4>Implicit Control-Flow</h4>
Ted Kremenek98f19b62007-10-10 23:22:00 +0000612-->
Ted Kremenek8bc05712007-10-10 23:01:43 +0000613
614<!--
615<p>A key design principle of the <tt>CFG</tt> class was to not require
616any transformations to the AST in order to represent control-flow.
617Thus the <tt>CFG</tt> does not perform any "lowering" of the
618statements in an AST: loops are not transformed into guarded gotos,
619short-circuit operations are not converted to a set of if-statements,
620and so on.</p>
621-->
Ted Kremenek17a295d2008-06-11 06:19:49 +0000622
Chris Lattner7bad1992008-11-16 21:48:07 +0000623
624<!-- ======================================================================= -->
625<h3 id="Constants">Constant Folding in the Clang AST</h3>
626<!-- ======================================================================= -->
627
628<p>There are several places where constants and constant folding matter a lot to
629the Clang front-end. First, in general, we prefer the AST to retain the source
630code as close to how the user wrote it as possible. This means that if they
631wrote "5+4", we want to keep the addition and two constants in the AST, we don't
632want to fold to "9". This means that constant folding in various ways turns
633into a tree walk that needs to handle the various cases.</p>
634
635<p>However, there are places in both C and C++ that require constants to be
636folded. For example, the C standard defines what an "integer constant
637expression" (i-c-e) is with very precise and specific requirements. The
638language then requires i-c-e's in a lot of places (for example, the size of a
639bitfield, the value for a case statement, etc). For these, we have to be able
640to constant fold the constants, to do semantic checks (e.g. verify bitfield size
641is non-negative and that case statements aren't duplicated). We aim for Clang
642to be very pedantic about this, diagnosing cases when the code does not use an
643i-c-e where one is required, but accepting the code unless running with
644<tt>-pedantic-errors</tt>.</p>
645
646<p>Things get a little bit more tricky when it comes to compatibility with
647real-world source code. Specifically, GCC has historically accepted a huge
648superset of expressions as i-c-e's, and a lot of real world code depends on this
649unfortuate accident of history (including, e.g., the glibc system headers). GCC
650accepts anything its "fold" optimizer is capable of reducing to an integer
651constant, which means that the definition of what it accepts changes as its
652optimizer does. One example is that GCC accepts things like "case X-X:" even
653when X is a variable, because it can fold this to 0.</p>
654
655<p>Another issue are how constants interact with the extensions we support, such
656as __builtin_constant_p, __builtin_inf, __extension__ and many others. C99
657obviously does not specify the semantics of any of these extensions, and the
658definition of i-c-e does not include them. However, these extensions are often
659used in real code, and we have to have a way to reason about them.</p>
660
661<p>Finally, this is not just a problem for semantic analysis. The code
662generator and other clients have to be able to fold constants (e.g. to
663initialize global variables) and has to handle a superset of what C99 allows.
664Further, these clients can benefit from extended information. For example, we
665know that "foo()||1" always evaluates to true, but we can't replace the
666expression with true because it has side effects.</p>
667
668<!-- ======================= -->
669<h4>Implementation Approach</h4>
670<!-- ======================= -->
671
672<p>After trying several different approaches, we've finally converged on a
673design (Note, at the time of this writing, not all of this has been implemented,
674consider this a design goal!). Our basic approach is to define a single
675recursive method evaluation method (<tt>Expr::Evaluate</tt>), which is
676implemented in <tt>AST/ExprConstant.cpp</tt>. Given an expression with 'scalar'
677type (integer, fp, complex, or pointer) this method returns the following
678information:</p>
679
680<ul>
681<li>Whether the expression is an integer constant expression, a general
682 constant that was folded but has no side effects, a general constant that
683 was folded but that does have side effects, or an uncomputable/unfoldable
684 value.
685</li>
686<li>If the expression was computable in any way, this method returns the APValue
687 for the result of the expression.</li>
688<li>If the expression is not evaluatable at all, this method returns
689 information on one of the problems with the expression. This includes a
690 SourceLocation for where the problem is, and a diagnostic ID that explains
691 the problem. The diagnostic should be have ERROR type.</li>
692<li>If the expression is not an integer constant expression, this method returns
693 information on one of the problems with the expression. This includes a
694 SourceLocation for where the problem is, and a diagnostic ID that explains
695 the problem. The diagnostic should be have EXTENSION type.</li>
696</ul>
697
698<p>This information gives various clients the flexibility that they want, and we
699will eventually have some helper methods for various extensions. For example,
700Sema should have a <tt>Sema::VerifyIntegerConstantExpression</tt> method, which
701calls Evaluate on the expression. If the expression is not foldable, the error
702is emitted, and it would return true. If the expression is not an i-c-e, the
703EXTENSION diagnostic is emitted. Finally it would return false to indicate that
704the AST is ok.</p>
705
706<p>Other clients can use the information in other ways, for example, codegen can
707just use expressions that are foldable in any way.</p>
708
709<!-- ========== -->
710<h4>Extensions</h4>
711<!-- ========== -->
712
713<p>This section describes how some of the various extensions clang supports
714interacts with constant evaluation:</p>
715
716<ul>
717<li><b><tt>__extension__</tt></b>: The expression form of this extension causes
718 any evaluatable subexpression to be accepted as an integer constant
719 expression.</li>
720<li><b><tt>__builtin_constant_p</tt></b>: This returns true (as a integer
721 constant expression) if the operand is any evaluatable constant.</li>
722<li><b><tt>__builtin_choose_expr</tt></b>: The condition is required to be an
723 integer constant expression, but we accept any constant as an "extension of
724 an extension". This only evaluates one operand depending on which way the
725 condition evaluates.</li>
726<li><b><tt>__builtin_classify_type</tt></b>: This always returns an integer
727 constant expression.</li>
728<li><b><tt>__builtin_inf,nan,..</tt></b>: These are treated just like a
729 floating-point literal.</li>
730<li><b><tt>__builtin_abs,copysign,..</tt></b>: These are constant folded as
731 general constant expressions.</li>
732</ul>
733
734
735
736
Ted Kremenek17a295d2008-06-11 06:19:49 +0000737</div>
738</body>
739</html>