blob: eb36a790d3f8a2851987be10f1f56e28a95ca8f7 [file] [log] [blame]
Chris Lattner86920d32007-07-31 05:42:17 +00001<title>"clang" CFE Internals Manual</title>
2
3<h1>"clang" CFE Internals Manual</h1>
4
5<ul>
6<li><a href="#intro">Introduction</a></li>
7<li><a href="#libsystem">LLVM System and Support Libraries</a></li>
8<li><a href="#libbasic">The clang 'Basic' Library</a>
9 <ul>
10 <li><a href="#SourceLocation">The SourceLocation and SourceManager
11 classes</a></li>
12 </ul>
13</li>
14<li><a href="#liblex">The Lexer and Preprocessor Library</a>
15 <ul>
16 <li><a href="#Token">The Token class</a></li>
17 <li><a href="#Lexer">The Lexer class</a></li>
18 <li><a href="#MacroExpander">The MacroExpander class</a></li>
19 <li><a href="#MultipleIncludeOpt">The MultipleIncludeOpt class</a></li>
20 </ul>
21</li>
22<li><a href="#libparse">The Parser Library</a>
23 <ul>
24 </ul>
25</li>
26<li><a href="#libast">The AST Library</a>
27 <ul>
28 <li><a href="#Type">The Type class and its subclasses</a></li>
29 <li><a href="#QualType">The QualType class</a></li>
30 </ul>
31</li>
32</ul>
33
34
35<!-- ======================================================================= -->
36<h2 id="intro">Introduction</h2>
37<!-- ======================================================================= -->
38
39<p>This document describes some of the more important APIs and internal design
40decisions made in the clang C front-end. The purpose of this document is to
41both capture some of this high level information and also describe some of the
42design decisions behind it. This is meant for people interested in hacking on
43clang, not for end-users. The description below is categorized by
44libraries, and does not describe any of the clients of the libraries.</p>
45
46<!-- ======================================================================= -->
47<h2 id="libsystem">LLVM System and Support Libraries</h2>
48<!-- ======================================================================= -->
49
50<p>The LLVM libsystem library provides the basic clang system abstraction layer,
51which is used for file system access. The LLVM libsupport library provides many
52underlying libraries and <a
53href="http://llvm.org/docs/ProgrammersManual.html">data-structures</a>,
54 including command line option
55processing and various containers.</p>
56
57<!-- ======================================================================= -->
58<h2 id="libbasic">The clang 'Basic' Library</h2>
59<!-- ======================================================================= -->
60
61<p>This library certainly needs a better name. The 'basic' library contains a
62number of low-level utilities for tracking and manipulating source buffers,
63locations within the source buffers, diagnostics, tokens, target abstraction,
64and information about the subset of the language being compiled for.</p>
65
66<p>Part of this infrastructure is specific to C (such as the TargetInfo class),
67other parts could be reused for other non-C-based languages (SourceLocation,
68SourceManager, Diagnostics, FileManager). When and if there is future demand
69we can figure out if it makes sense to introduce a new library, move the general
70classes somewhere else, or introduce some other solution.</p>
71
72<p>We describe the roles of these classes in order of their dependencies.</p>
73
74<!-- ======================================================================= -->
75<h3 id="SourceLocation">The SourceLocation and SourceManager classes</h3>
76<!-- ======================================================================= -->
77
78<p>Strangely enough, the SourceLocation class represents a location within the
79source code of the program. Important design points include:</p>
80
81<ol>
82<li>sizeof(SourceLocation) must be extremely small, as these are embedded into
83 many AST nodes and are passed around often. Currently it is 32 bits.</li>
84<li>SourceLocation must be a simple value object that can be efficiently
85 copied.</li>
86<li>We should be able to represent a source location for any byte of any input
87 file. This includes in the middle of tokens, in whitespace, in trigraphs,
88 etc.</li>
89<li>A SourceLocation must encode the current #include stack that was active when
90 the location was processed. For example, if the location corresponds to a
91 token, it should contain the set of #includes active when the token was
92 lexed. This allows us to print the #include stack for a diagnostic.</li>
93<li>SourceLocation must be able to describe macro expansions, capturing both
94 the ultimate instantiation point and the source of the original character
95 data.</li>
96</ol>
97
98<p>In practice, the SourceLocation works together with the SourceManager class
99to encode two pieces of information about a location: it's physical location
100and it's virtual location. For most tokens, these will be the same. However,
101for a macro expansion (or tokens that came from a _Pragma directive) these will
102describe the location of the characters corresponding to the token and the
103location where the token was used (i.e. the macro instantiation point or the
104location of the _Pragma itself).</p>
105
106<p>For efficiency, we only track one level of macro instantions: if a token was
107produced by multiple instantiations, we only track the source and ultimate
108destination. Though we could track the intermediate instantiation points, this
109would require extra bookkeeping and no known client would benefit substantially
110from this.</p>
111
112<p>The clang front-end inherently depends on the location of a token being
113tracked correctly. If it is ever incorrect, the front-end may get confused and
114die. The reason for this is that the notion of the 'spelling' of a Token in
115clang depends on being able to find the original input characters for the token.
116This concept maps directly to the "physical" location for the token.</p>
117
118<!-- ======================================================================= -->
119<h2 id="liblex">The Lexer and Preprocessor Library</h2>
120<!-- ======================================================================= -->
121
122<p>The Lexer library contains several tightly-connected classes that are involved
123with the nasty process of lexing and preprocessing C source code. The main
124interface to this library for outside clients is the large <a
125href="#Preprocessor">Preprocessor</a> class.
126It contains the various pieces of state that are required to coherently read
127tokens out of a translation unit.</p>
128
129<p>The core interface to the Preprocessor object (once it is set up) is the
130Preprocessor::Lex method, which returns the next <a href="#Token">Token</a> from
131the preprocessor stream. There are two types of token providers that the
132preprocessor is capable of reading from: a buffer lexer (provided by the <a
133href="#Lexer">Lexer</a> class) and a buffered token stream (provided by the <a
134href="#MacroExpander">MacroExpander</a> class).
135
136
137<!-- ======================================================================= -->
138<h3 id="Token">The Token class</h3>
139<!-- ======================================================================= -->
140
141<p>The Token class is used to represent a single lexed token. Tokens are
142intended to be used by the lexer/preprocess and parser libraries, but are not
143intended to live beyond them (for example, they should not live in the ASTs).<p>
144
145<p>Tokens most often live on the stack (or some other location that is efficient
146to access) as the parser is running, but occasionally do get buffered up. For
147example, macro definitions are stored as a series of tokens, and the C++
148front-end will eventually need to buffer tokens up for tentative parsing and
149various pieces of look-ahead. As such, the size of a Token matter. On a 32-bit
150system, sizeof(Token) is currently 16 bytes.</p>
151
152<p>Tokens contain the following information:</p>
153
154<ul>
155<li><b>A SourceLocation</b> - This indicates the location of the start of the
156token.</li>
157
158<li><b>A length</b> - This stores the length of the token as stored in the
159SourceBuffer. For tokens that include them, this length includes trigraphs and
160escaped newlines which are ignored by later phases of the compiler. By pointing
161into the original source buffer, it is always possible to get the original
162spelling of a token completely accurately.</li>
163
164<li><b>IdentifierInfo</b> - If a token takes the form of an identifier, and if
165identifier lookup was enabled when the token was lexed (e.g. the lexer was not
166reading in 'raw' mode) this contains a pointer to the unique hash value for the
167identifier. Because the lookup happens before keyword identification, this
168field is set even for language keywords like 'for'.</li>
169
170<li><b>TokenKind</b> - This indicates the kind of token as classified by the
171lexer. This includes things like <tt>tok::starequal</tt> (for the "*="
172operator), <tt>tok::ampamp</tt> for the "&amp;&amp;" token, and keyword values
173(e.g. <tt>tok::kw_for</tt>) for identifiers that correspond to keywords. Note
174that some tokens can be spelled multiple ways. For example, C++ supports
175"operator keywords", where things like "and" are treated exactly like the
176"&amp;&amp;" operator. In these cases, the kind value is set to
177<tt>tok::ampamp</tt>, which is good for the parser, which doesn't have to
178consider both forms. For something that cares about which form is used (e.g.
179the preprocessor 'stringize' operator) the spelling indicates the original
180form.</li>
181
182<li><b>Flags</b> - There are currently four flags tracked by the
183lexer/preprocessor system on a per-token basis:
184
185 <ol>
186 <li><b>StartOfLine</b> - This was the first token that occurred on its input
187 source line.</li>
188 <li><b>LeadingSpace</b> - There was a space character either immediately
189 before the token or transitively before the token as it was expanded
190 through a macro. The definition of this flag is very closely defined by
191 the stringizing requirements of the preprocessor.</li>
192 <li><b>DisableExpand</b> - This flag is used internally to the preprocessor to
193 represent identifier tokens which have macro expansion disabled. This
194 prevents them from being considered as candidates for macro expansion ever
195 in the future.</li>
196 <li><b>NeedsCleaning</b> - This flag is set if the original spelling for the
197 token includes a trigraph or escaped newline. Since this is uncommon,
198 many pieces of code can fast-path on tokens that did not need cleaning.
199 </p>
200 </ol>
201</li>
202</ul>
203
204<p>One interesting (and somewhat unusual) aspect of tokens is that they don't
205contain any semantic information about the lexed value. For example, if the
206token was a pp-number token, we do not represent the value of the number that
207was lexed (this is left for later pieces of code to decide). Additionally, the
208lexer library has no notion of typedef names vs variable names: both are
209returned as identifiers, and the parser is left to decide whether a specific
210identifier is a typedef or a variable (tracking this requires scope information
211among other things).</p>
212
213<!-- ======================================================================= -->
214<h3 id="Lexer">The Lexer class</h3>
215<!-- ======================================================================= -->
216
217<p>The Lexer class provides the mechanics of lexing tokens out of a source
218buffer and deciding what they mean. The Lexer is complicated by the fact that
219it operates on raw buffers that have not had spelling eliminated (this is a
220necessity to get decent performance), but this is countered with careful coding
221as well as standard performance techniques (for example, the comment handling
222code is vectorized on X86 and PowerPC hosts).</p>
223
224<p>The lexer has a couple of interesting modal features:</p>
225
226<ul>
227<li>The lexer can operate in 'raw' mode. This mode has several features that
228 make it possible to quickly lex the file (e.g. it stops identifier lookup,
229 doesn't specially handle preprocessor tokens, handles EOF differently, etc).
230 This mode is used for lexing within an "<tt>#if 0</tt>" block, for
231 example.</li>
232<li>The lexer can capture and return comments as tokens. This is required to
233 support the -C preprocessor mode, which passes comments through, and is
234 used by the diagnostic checker to identifier expect-error annotations.</li>
235<li>The lexer can be in ParsingFilename mode, which happens when preprocessing
236 after reading a #include directive. This mode changes the parsing of '<'
237 to return an "angled string" instead of a bunch of tokens for each thing
238 within the filename.</li>
239<li>When parsing a preprocessor directive (after "<tt>#</tt>") the
240 ParsingPreprocessorDirective mode is entered. This changes the parser to
241 return EOM at a newline.</li>
242<li>The Lexer uses a LangOptions object to know whether trigraphs are enabled,
243 whether C++ or ObjC keywords are recognized, etc.</li>
244</ul>
245
246<p>In addition to these modes, the lexer keeps track of a couple of other
247 features that are local to a lexed buffer, which change as the buffer is
248 lexed:</p>
249
250<ul>
251<li>The Lexer uses BufferPtr to keep track of the current character being
252 lexed.</li>
253<li>The Lexer uses IsAtStartOfLine to keep track of whether the next lexed token
254 will start with its "start of line" bit set.</li>
255<li>The Lexer keeps track of the current #if directives that are active (which
256 can be nested).</li>
257<li>The Lexer keeps track of an <a href="#MultipleIncludeOpt">
258 MultipleIncludeOpt</a> object, which is used to
259 detect whether the buffer uses the standard "<tt>#ifndef XX</tt> /
260 <tt>#define XX</tt>" idiom to prevent multiple inclusion. If a buffer does,
261 subsequent includes can be ignored if the XX macro is defined.</li>
262</ul>
263
264<!-- ======================================================================= -->
265<h3 id="MacroExpander">The MacroExpander class</h3>
266<!-- ======================================================================= -->
267
268<p>The MacroExpander class is a token provider that returns tokens from a list
269of tokens that came from somewhere else. It typically used for two things: 1)
270returning tokens from a macro definition as it is being expanded 2) returning
271tokens from an arbitrary buffer of tokens. The later use is used by _Pragma and
272will most likely be used to handle unbounded look-ahead for the C++ parser.</p>
273
274<!-- ======================================================================= -->
275<h3 id="MultipleIncludeOpt">The MultipleIncludeOpt class</h3>
276<!-- ======================================================================= -->
277
278<p>The MultipleIncludeOpt class implements a really simple little state machine
279that is used to detect the standard "<tt>#ifndef XX</tt> / <tt>#define XX</tt>"
280idiom that people typically use to prevent multiple inclusion of headers. If a
281buffer uses this idiom and is subsequently #include'd, the preprocessor can
282simply check to see whether the guarding condition is defined or not. If so,
283the preprocessor can completely ignore the include of the header.</p>
284
285
286
287<!-- ======================================================================= -->
288<h2 id="libparse">The Parser Library</h2>
289<!-- ======================================================================= -->
290
291<!-- ======================================================================= -->
292<h2 id="libast">The AST Library</h2>
293<!-- ======================================================================= -->
294
295<!-- ======================================================================= -->
296<h3 id="Type">The Type class and its subclasses</h3>
297<!-- ======================================================================= -->
298
299<p>The Type class (and its subclasses) are an important part of the AST. Types
300are accessed through the ASTContext class, which implicitly creates and uniques
301them as they are needed. Types have a couple of non-obvious features: 1) they
302do not capture type qualifiers like const or volatile (See
303<a href="#QualType">QualType</a>), and 2) they implicitly capture typedef
Chris Lattner8a2bc622007-07-31 06:37:39 +0000304information. Once created, types are immutable (unlike decls).</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000305
306<p>Typedefs in C make semantic analysis a bit more complex than it would
307be without them. The issue is that we want to capture typedef information
308and represent it in the AST perfectly, but the semantics of operations need to
309"see through" typedefs. For example, consider this code:</p>
310
311<code>
312void func() {<br>
313 typedef int foo;<br>
314 foo X, *Y;<br>
Chris Lattner8a2bc622007-07-31 06:37:39 +0000315 typedef foo* bar;<br>
316 bar Z;<br>
Chris Lattner86920d32007-07-31 05:42:17 +0000317 *X; <i>// error</i><br>
318 **Y; <i>// error</i><br>
Chris Lattner8a2bc622007-07-31 06:37:39 +0000319 **Z; <i>// error</i><br>
Chris Lattner86920d32007-07-31 05:42:17 +0000320}<br>
321</code>
322
323<p>The code above is illegal, and thus we expect there to be diagnostics emitted
324on the annotated lines. In this example, we expect to get:</p>
325
326<pre>
Chris Lattner8a2bc622007-07-31 06:37:39 +0000327<b>test.c:6:1: error: indirection requires pointer operand ('foo' invalid)</b>
Chris Lattner86920d32007-07-31 05:42:17 +0000328*X; // error
329<font color="blue">^~</font>
Chris Lattner8a2bc622007-07-31 06:37:39 +0000330<b>test.c:7:1: error: indirection requires pointer operand ('foo' invalid)</b>
Chris Lattner86920d32007-07-31 05:42:17 +0000331**Y; // error
332<font color="blue">^~~</font>
Chris Lattner8a2bc622007-07-31 06:37:39 +0000333<b>test.c:8:1: error: indirection requires pointer operand ('foo' invalid)</b>
334**Z; // error
335<font color="blue">^~~</font>
Chris Lattner86920d32007-07-31 05:42:17 +0000336</pre>
337
338<p>While this example is somewhat silly, it illustrates the point: we want to
339retain typedef information where possible, so that we can emit errors about
340"<tt>std::string</tt>" instead of "<tt>std::basic_string&lt;char, std:...</tt>".
341Doing this requires properly keeping typedef information (for example, the type
342of "X" is "foo", not "int"), and requires properly propagating it through the
Chris Lattner8a2bc622007-07-31 06:37:39 +0000343various operators (for example, the type of *Y is "foo", not "int"). In order
344to retain this information, the type of these expressions is an instance of the
345TypedefType class, which indicates that the type of these expressions is a
346typedef for foo.
347</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000348
Chris Lattner8a2bc622007-07-31 06:37:39 +0000349<p>Representing types like this is great for diagnostics, because the
350user-specified type is always immediately available. There are two problems
351with this: first, various semantic checks need to make judgements about the
Chris Lattner33fc68a2007-07-31 18:54:50 +0000352<em>actual structure</em> of a type, ignoring typdefs. Second, we need an
353efficient way to query whether two types are structurally identical to each
354other, ignoring typedefs. The solution to both of these problems is the idea of
Chris Lattner8a2bc622007-07-31 06:37:39 +0000355canonical types.</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000356
Chris Lattner8a2bc622007-07-31 06:37:39 +0000357<h4>Canonical Types</h4>
Chris Lattner86920d32007-07-31 05:42:17 +0000358
Chris Lattner8a2bc622007-07-31 06:37:39 +0000359<p>Every instance of the Type class contains a canonical type pointer. For
360simple types with no typedefs involved (e.g. "<tt>int</tt>", "<tt>int*</tt>",
361"<tt>int**</tt>"), the type just points to itself. For types that have a
362typedef somewhere in their structure (e.g. "<tt>foo</tt>", "<tt>foo*</tt>",
363"<tt>foo**</tt>", "<tt>bar</tt>"), the canonical type pointer points to their
364structurally equivalent type without any typedefs (e.g. "<tt>int</tt>",
365"<tt>int*</tt>", "<tt>int**</tt>", and "<tt>int*</tt>" respectively).</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000366
Chris Lattner8a2bc622007-07-31 06:37:39 +0000367<p>This design provides a constant time operation (dereferencing the canonical
368type pointer) that gives us access to the structure of types. For example,
369we can trivially tell that "bar" and "foo*" are the same type by dereferencing
370their canonical type pointers and doing a pointer comparison (they both point
371to the single "<tt>int*</tt>" type).</p>
372
373<p>Canonical types and typedef types bring up some complexities that must be
374carefully managed. Specifically, the "isa/cast/dyncast" operators generally
375shouldn't be used in code that is inspecting the AST. For example, when type
376checking the indirection operator (unary '*' on a pointer), the type checker
377must verify that the operand has a pointer type. It would not be correct to
378check that with "<tt>isa&lt;PointerType&gt;(SubExpr-&gt;getType())</tt>",
379because this predicate would fail if the subexpression had a typedef type.</p>
380
381<p>The solution to this problem are a set of helper methods on Type, used to
382check their properties. In this case, it would be correct to use
383"<tt>SubExpr-&gt;getType()-&gt;isPointerType()</tt>" to do the check. This
384predicate will return true if the <em>canonical type is a pointer</em>, which is
385true any time the type is structurally a pointer type. The only hard part here
386is remembering not to use the <tt>isa/cast/dyncast</tt> operations.</p>
387
388<p>The second problem we face is how to get access to the pointer type once we
389know it exists. To continue the example, the result type of the indirection
390operator is the pointee type of the subexpression. In order to determine the
391type, we need to get the instance of PointerType that best captures the typedef
392information in the program. If the type of the expression is literally a
393PointerType, we can return that, otherwise we have to dig through the
394typedefs to find the pointer type. For example, if the subexpression had type
395"<tt>foo*</tt>", we could return that type as the result. If the subexpression
396had type "<tt>bar</tt>", we want to return "<tt>foo*</tt>" (note that we do
397<em>not</em> want "<tt>int*</tt>"). In order to provide all of this, Type has
Chris Lattner11406c12007-07-31 16:50:51 +0000398a getAsPointerType() method that checks whether the type is structurally a
Chris Lattner8a2bc622007-07-31 06:37:39 +0000399PointerType and, if so, returns the best one. If not, it returns a null
400pointer.</p>
401
402<p>This structure is somewhat mystical, but after meditating on it, it will
403make sense to you :).</p>
Chris Lattner86920d32007-07-31 05:42:17 +0000404
405<!-- ======================================================================= -->
406<h3 id="QualType">The QualType class</h3>
407<!-- ======================================================================= -->
408
409<p>The QualType class is designed as a trivial value class that is small,
410passed by-value and is efficient to query. The idea of QualType is that it
411stores the type qualifiers (const, volatile, restrict) separately from the types
412themselves: QualType is conceptually a pair of "Type*" and bits for the type
413qualifiers.</p>
414
415<p>By storing the type qualifiers as bits in the conceptual pair, it is
416extremely efficient to get the set of qualifiers on a QualType (just return the
417field of the pair), add a type qualifier (which is a trivial constant-time
418operation that sets a bit), and remove one or more type qualifiers (just return
419a QualType with the bitfield set to empty).</p>
420
421<p>Further, because the bits are stored outside of the type itself, we do not
422need to create duplicates of types with different sets of qualifiers (i.e. there
423is only a single heap allocated "int" type: "const int" and "volatile const int"
424both point to the same heap allocated "int" type). This reduces the heap size
425used to represent bits and also means we do not have to consider qualifiers when
426uniquing types (<a href="#Type">Type</a> does not even contain qualifiers).</p>
427
428<p>In practice, on hosts where it is safe, the 3 type qualifiers are stored in
429the low bit of the pointer to the Type object. This means that QualType is
430exactly the same size as a pointer, and this works fine on any system where
431malloc'd objects are at least 8 byte aligned.</p>