blob: 05278f37e0a5b2b288032905f2129bf0404642cc [file] [log] [blame]
Benjamin Kramer665a8dc2012-01-15 15:26:07 +00001<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
Douglas Gregor29dde392009-06-03 21:57:43 +00003<html>
4<head>
Douglas Gregor72e4f252012-09-16 01:44:02 +00005 <title>Precompiled Header and Modules Internals</title>
Benjamin Kramer665a8dc2012-01-15 15:26:07 +00006 <link type="text/css" rel="stylesheet" href="../menu.css">
7 <link type="text/css" rel="stylesheet" href="../content.css">
Douglas Gregor29dde392009-06-03 21:57:43 +00008 <style type="text/css">
9 td {
10 vertical-align: top;
11 }
12 </style>
Douglas Gregor32110df2009-05-20 00:16:32 +000013</head>
14
15<body>
16
17<!--#include virtual="../menu.html.incl"-->
18
19<div id="content">
20
Douglas Gregor72e4f252012-09-16 01:44:02 +000021<h1>Precompiled Header and Modules Internals</h1>
Douglas Gregor32110df2009-05-20 00:16:32 +000022
23 <p>This document describes the design and implementation of Clang's
Douglas Gregor72e4f252012-09-16 01:44:02 +000024 precompiled headers (PCH) and modules. If you are interested in the end-user
Douglas Gregor32110df2009-05-20 00:16:32 +000025 view, please see the <a
26 href="UsersManual.html#precompiledheaders">User's Manual</a>.</p>
27
Douglas Gregor923cb232009-06-03 18:35:59 +000028 <p><b>Table of Contents</b></p>
29 <ul>
30 <li><a href="#usage">Using Precompiled Headers with
Daniel Dunbar69cfd862009-12-11 23:17:03 +000031 <tt>clang</tt></a></li>
Douglas Gregor923cb232009-06-03 18:35:59 +000032 <li><a href="#philosophy">Design Philosophy</a></li>
Douglas Gregor72e4f252012-09-16 01:44:02 +000033 <li><a href="#contents">Serialized AST File Contents</a>
Douglas Gregor923cb232009-06-03 18:35:59 +000034 <ul>
35 <li><a href="#metadata">Metadata Block</a></li>
36 <li><a href="#sourcemgr">Source Manager Block</a></li>
37 <li><a href="#preprocessor">Preprocessor Block</a></li>
38 <li><a href="#types">Types Block</a></li>
39 <li><a href="#decls">Declarations Block</a></li>
40 <li><a href="#stmt">Statements and Expressions</a></li>
41 <li><a href="#idtable">Identifier Table Block</a></li>
42 <li><a href="#method-pool">Method Pool Block</a></li>
43 </ul>
44 </li>
Douglas Gregor72e4f252012-09-16 01:44:02 +000045 <li><a href="#tendrils">AST Reader Integration Points</a></li>
46 <li><a href="#chained">Chained precompiled headers</a></li>
47 <li><a href="#modules">Modules</a></li>
Douglas Gregor0084ead2009-06-03 21:41:31 +000048</ul>
Douglas Gregor923cb232009-06-03 18:35:59 +000049
Daniel Dunbar69cfd862009-12-11 23:17:03 +000050<h2 id="usage">Using Precompiled Headers with <tt>clang</tt></h2>
Douglas Gregor32110df2009-05-20 00:16:32 +000051
Daniel Dunbar69cfd862009-12-11 23:17:03 +000052<p>The Clang compiler frontend, <tt>clang -cc1</tt>, supports two command line
53options for generating and using PCH files.<p>
Douglas Gregor32110df2009-05-20 00:16:32 +000054
Daniel Dunbar69cfd862009-12-11 23:17:03 +000055<p>To generate PCH files using <tt>clang -cc1</tt>, use the option
Douglas Gregor32110df2009-05-20 00:16:32 +000056<b><tt>-emit-pch</tt></b>:
57
Daniel Dunbar69cfd862009-12-11 23:17:03 +000058<pre> $ clang -cc1 test.h -emit-pch -o test.h.pch </pre>
Douglas Gregor32110df2009-05-20 00:16:32 +000059
60<p>This option is transparently used by <tt>clang</tt> when generating
61PCH files. The resulting PCH file contains the serialized form of the
62compiler's internal representation after it has completed parsing and
63semantic analysis. The PCH file can then be used as a prefix header
64with the <b><tt>-include-pch</tt></b> option:</p>
65
66<pre>
Daniel Dunbar69cfd862009-12-11 23:17:03 +000067 $ clang -cc1 -include-pch test.h.pch test.c -o test.s
Douglas Gregor32110df2009-05-20 00:16:32 +000068</pre>
69
Douglas Gregor923cb232009-06-03 18:35:59 +000070<h2 id="philosophy">Design Philosophy</h2>
Douglas Gregor32110df2009-05-20 00:16:32 +000071
72<p>Precompiled headers are meant to improve overall compile times for
73 projects, so the design of precompiled headers is entirely driven by
74 performance concerns. The use case for precompiled headers is
75 relatively simple: when there is a common set of headers that is
76 included in nearly every source file in the project, we
77 <i>precompile</i> that bundle of headers into a single precompiled
78 header (PCH file). Then, when compiling the source files in the
79 project, we load the PCH file first (as a prefix header), which acts
80 as a stand-in for that bundle of headers.</p>
81
82<p>A precompiled header implementation improves performance when:</p>
83<ul>
84 <li>Loading the PCH file is significantly faster than re-parsing the
85 bundle of headers stored within the PCH file. Thus, a precompiled
86 header design attempts to minimize the cost of reading the PCH
87 file. Ideally, this cost should not vary with the size of the
88 precompiled header file.</li>
89
90 <li>The cost of generating the PCH file initially is not so large
91 that it counters the per-source-file performance improvement due to
92 eliminating the need to parse the bundled headers in the first
93 place. This is particularly important on multi-core systems, because
94 PCH file generation serializes the build when all compilations
95 require the PCH file to be up-to-date.</li>
96</ul>
Douglas Gregor2cc390e2009-06-02 22:08:07 +000097
Douglas Gregor72e4f252012-09-16 01:44:02 +000098<p>Modules, as implemented in Clang, use the same mechanisms as
99precompiled headers to save a serialized AST file (one per module) and
100use those AST modules. From an implementation standpoint, modules are
101a generalization of precompiled headers, lifting a number of
102restrictions placed on precompiled headers. In particular, there can
103only be one precompiled header and it must be included at the
104beginning of the translation unit. The extensions to the AST file
105format required for modules are discussed in the section on <a href="#modules">modules</a>.</p>
106
107<p>Clang's AST files are designed with a compact on-disk
108representation, which minimizes both creation time and the time
109required to initially load the AST file. The AST file itself contains
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000110a serialized representation of Clang's abstract syntax trees and
111supporting data structures, stored using the same compressed bitstream
112as <a href="http://llvm.org/docs/BitCodeFormat.html">LLVM's bitcode
113file format</a>.</p>
114
Douglas Gregor72e4f252012-09-16 01:44:02 +0000115<p>Clang's AST files are loaded "lazily" from disk. When an
116AST file is initially loaded, Clang reads only a small amount of data
117from the AST file to establish where certain important data structures
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000118are stored. The amount of data read in this initial load is
Douglas Gregor72e4f252012-09-16 01:44:02 +0000119independent of the size of the AST file, such that a larger AST file
120does not lead to longer AST load times. The actual header data in the
121AST file--macros, functions, variables, types, etc.--is loaded only
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000122when it is referenced from the user's code, at which point only that
123entity (and those entities it depends on) are deserialized from the
Douglas Gregor72e4f252012-09-16 01:44:02 +0000124AST file. With this approach, the cost of using an AST file
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000125for a translation unit is proportional to the amount of code actually
Douglas Gregor72e4f252012-09-16 01:44:02 +0000126used from the AST file, rather than being proportional to the size of
127the AST file itself.</p>
Douglas Gregor4c0397f2009-06-03 21:55:35 +0000128
129<p>When given the <code>-print-stats</code> option, Clang produces
Douglas Gregor72e4f252012-09-16 01:44:02 +0000130statistics describing how much of the AST file was actually
Douglas Gregor4c0397f2009-06-03 21:55:35 +0000131loaded from disk. For a simple "Hello, World!" program that includes
132the Apple <code>Cocoa.h</code> header (which is built as a precompiled
133header), this option illustrates how little of the actual precompiled
134header is required:</p>
135
136<pre>
137*** PCH Statistics:
138 933 stat cache hits
139 4 stat cache misses
140 895/39981 source location entries read (2.238563%)
141 19/15315 types read (0.124061%)
142 20/82685 declarations read (0.024188%)
143 154/58070 identifiers read (0.265197%)
144 0/7260 selectors read (0.000000%)
145 0/30842 statements read (0.000000%)
146 4/8400 macros read (0.047619%)
147 1/4995 lexical declcontexts read (0.020020%)
148 0/4413 visible declcontexts read (0.000000%)
149 0/7230 method pool entries read (0.000000%)
150 0 method pool misses
151</pre>
152
153<p>For this small program, only a tiny fraction of the source
154locations, types, declarations, identifiers, and macros were actually
155deserialized from the precompiled header. These statistics can be
Douglas Gregor72e4f252012-09-16 01:44:02 +0000156useful to determine whether the AST file implementation can
Douglas Gregor4c0397f2009-06-03 21:55:35 +0000157be improved by making more of the implementation lazy.</p>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000158
Sebastian Redla93e3b52010-07-08 22:01:51 +0000159<p>Precompiled headers can be chained. When you create a PCH while
160including an existing PCH, Clang can create the new PCH by referencing
161the original file and only writing the new data to the new file. For
162example, you could create a PCH out of all the headers that are very
163commonly used throughout your project, and then create a PCH for every
164single source file in the project that includes the code that is
165specific to that file, so that recompiling the file itself is very fast,
Douglas Gregor72e4f252012-09-16 01:44:02 +0000166without duplicating the data from the common headers for every
167file. The mechanisms behind chained precompiled headers are discussed
168in a <a href="#chained">later section</a>.
Sebastian Redla93e3b52010-07-08 22:01:51 +0000169
Douglas Gregor72e4f252012-09-16 01:44:02 +0000170<h2 id="contents">AST File Contents</h2>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000171
Benjamin Kramer665a8dc2012-01-15 15:26:07 +0000172<img src="PCHLayout.png" style="float:right" alt="Precompiled header layout">
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000173
Douglas Gregor72e4f252012-09-16 01:44:02 +0000174<p>Clang's AST files are organized into several different
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000175blocks, each of which contains the serialized representation of a part
176of Clang's internal representation. Each of the blocks corresponds to
177either a block or a record within <a
178 href="http://llvm.org/docs/BitCodeFormat.html">LLVM's bitstream
179format</a>. The contents of each of these logical blocks are described
180below.</p>
181
Douglas Gregor72e4f252012-09-16 01:44:02 +0000182<p>For a given AST file, the <a
Douglas Gregor4c0397f2009-06-03 21:55:35 +0000183href="http://llvm.org/cmds/llvm-bcanalyzer.html"><code>llvm-bcanalyzer</code></a>
184utility can be used to examine the actual structure of the bitstream
Douglas Gregor72e4f252012-09-16 01:44:02 +0000185for the AST file. This information can be used both to help
186understand the structure of the AST file and to isolate
187areas where AST files can still be optimized, e.g., through
Douglas Gregor4c0397f2009-06-03 21:55:35 +0000188the introduction of abbreviations.</p>
189
Douglas Gregor923cb232009-06-03 18:35:59 +0000190<h3 id="metadata">Metadata Block</h3>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000191
192<p>The metadata block contains several records that provide
Douglas Gregor72e4f252012-09-16 01:44:02 +0000193information about how the AST file was built. This metadata
194is primarily used to validate the use of an AST file. For
Douglas Gregorfe3f2232009-06-03 18:26:16 +0000195example, a precompiled header built for a 32-bit x86 target cannot be used
196when compiling for a 64-bit x86 target. The metadata block contains
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000197information about:</p>
198
199<dl>
200 <dt>Language options</dt>
201 <dd>Describes the particular language dialect used to compile the
Douglas Gregor72e4f252012-09-16 01:44:02 +0000202AST file, including major options (e.g., Objective-C support) and more
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000203minor options (e.g., support for "//" comments). The contents of this
204record correspond to the <code>LangOptions</code> class.</dd>
Douglas Gregor32110df2009-05-20 00:16:32 +0000205
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000206 <dt>Target architecture</dt>
207 <dd>The target triple that describes the architecture, platform, and
Douglas Gregor72e4f252012-09-16 01:44:02 +0000208ABI for which the AST file was generated, e.g.,
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000209<code>i386-apple-darwin9</code>.</dd>
210
Douglas Gregor72e4f252012-09-16 01:44:02 +0000211 <dt>AST version</dt>
212 <dd>The major and minor version numbers of the AST file
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000213format. Changes in the minor version number should not affect backward
214compatibility, while changes in the major version number imply that a
215newer compiler cannot read an older precompiled header (and
216vice-versa).</dd>
217
218 <dt>Original file name</dt>
219 <dd>The full path of the header that was used to generate the
Douglas Gregor72e4f252012-09-16 01:44:02 +0000220AST file.</dd>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000221
222 <dt>Predefines buffer</dt>
223 <dd>Although not explicitly stored as part of the metadata, the
Douglas Gregor72e4f252012-09-16 01:44:02 +0000224predefines buffer is used in the validation of the AST file.
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000225The predefines buffer itself contains code generated by the compiler
226to initialize the preprocessor state according to the current target,
227platform, and command-line options. For example, the predefines buffer
228will contain "<code>#define __STDC__ 1</code>" when we are compiling C
229without Microsoft extensions. The predefines buffer itself is stored
230within the <a href="#sourcemgr">source manager block</a>, but its
Douglas Gregor5accbb92009-06-03 16:06:22 +0000231contents are verified along with the rest of the metadata.</dd>
232
233</dl>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000234
Douglas Gregor72e4f252012-09-16 01:44:02 +0000235<p>A chained PCH file (that is, one that references another PCH) and a
236module (which may import other modules) have additional metadata
237containing the list of all AST files that this AST file depends
238on. Each of those files will be loaded along with this AST file.</p>
Sebastian Redla93e3b52010-07-08 22:01:51 +0000239
Douglas Gregor72e4f252012-09-16 01:44:02 +0000240<p>For chained precompiled headers, the language options, target
241architecture and predefines buffer data is taken from the end of the
242chain, since they have to match anyway.</p>
Sebastian Redla93e3b52010-07-08 22:01:51 +0000243
Douglas Gregor923cb232009-06-03 18:35:59 +0000244<h3 id="sourcemgr">Source Manager Block</h3>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000245
246<p>The source manager block contains the serialized representation of
247Clang's <a
248 href="InternalsManual.html#SourceLocation">SourceManager</a> class,
249which handles the mapping from source locations (as represented in
250Clang's abstract syntax tree) into actual column/line positions within
Douglas Gregor72e4f252012-09-16 01:44:02 +0000251a source file or macro instantiation. The AST file's
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000252representation of the source manager also includes information about
253all of the headers that were (transitively) included when building the
Douglas Gregor72e4f252012-09-16 01:44:02 +0000254AST file.</p>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000255
256<p>The bulk of the source manager block is dedicated to information
257about the various files, buffers, and macro instantiations into which
258a source location can refer. Each of these is referenced by a numeric
259"file ID", which is a unique number (allocated starting at 1) stored
260in the source location. Clang serializes the information for each kind
261of file ID, along with an index that maps file IDs to the position
Douglas Gregor72e4f252012-09-16 01:44:02 +0000262within the AST file where the information about that file ID is
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000263stored. The data associated with a file ID is loaded only when
264required by the front end, e.g., to emit a diagnostic that includes a
265macro instantiation history inside the header itself.</p>
266
267<p>The source manager block also contains information about all of the
Douglas Gregor72e4f252012-09-16 01:44:02 +0000268headers that were included when building the AST file. This
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000269includes information about the controlling macro for the header (e.g.,
270when the preprocessor identified that the contents of the header
271dependent on a macro like <code>LLVM_CLANG_SOURCEMANAGER_H</code>)
272along with a cached version of the results of the <code>stat()</code>
Douglas Gregor72e4f252012-09-16 01:44:02 +0000273system calls performed when building the AST file. The
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000274latter is particularly useful in reducing system time when searching
275for include files.</p>
276
Douglas Gregor923cb232009-06-03 18:35:59 +0000277<h3 id="preprocessor">Preprocessor Block</h3>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000278
279<p>The preprocessor block contains the serialized representation of
280the preprocessor. Specifically, it contains all of the macros that
281have been defined by the end of the header used to build the
Douglas Gregor72e4f252012-09-16 01:44:02 +0000282AST file, along with the token sequences that comprise each
283macro. The macro definitions are only read from the AST file when the
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000284name of the macro first occurs in the program. This lazy loading of
Chris Lattner57eccbe2009-06-13 18:11:10 +0000285macro definitions is triggered by lookups into the <a
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000286 href="#idtable">identifier table</a>.</p>
287
Douglas Gregor923cb232009-06-03 18:35:59 +0000288<h3 id="types">Types Block</h3>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000289
290<p>The types block contains the serialized representation of all of
291the types referenced in the translation unit. Each Clang type node
292(<code>PointerType</code>, <code>FunctionProtoType</code>, etc.) has a
Douglas Gregor72e4f252012-09-16 01:44:02 +0000293corresponding record type in the AST file. When types are deserialized
294from the AST file, the data within the record is used to
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000295reconstruct the appropriate type node using the AST context.</p>
296
297<p>Each type has a unique type ID, which is an integer that uniquely
298identifies that type. Type ID 0 represents the NULL type, type IDs
299less than <code>NUM_PREDEF_TYPE_IDS</code> represent predefined types
300(<code>void</code>, <code>float</code>, etc.), while other
301"user-defined" type IDs are assigned consecutively from
302<code>NUM_PREDEF_TYPE_IDS</code> upward as the types are encountered.
Douglas Gregor72e4f252012-09-16 01:44:02 +0000303The AST file has an associated mapping from the user-defined types
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000304block to the location within the types block where the serialized
305representation of that type resides, enabling lazy deserialization of
Douglas Gregor72e4f252012-09-16 01:44:02 +0000306types. When a type is referenced from within the AST file, that
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000307reference is encoded using the type ID shifted left by 3 bits. The
308lower three bits are used to represent the <code>const</code>,
309<code>volatile</code>, and <code>restrict</code> qualifiers, as in
310Clang's <a
311 href="http://clang.llvm.org/docs/InternalsManual.html#Type">QualType</a>
312class.</p>
313
Douglas Gregor923cb232009-06-03 18:35:59 +0000314<h3 id="decls">Declarations Block</h3>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000315
316<p>The declarations block contains the serialized representation of
317all of the declarations referenced in the translation unit. Each Clang
318declaration node (<code>VarDecl</code>, <code>FunctionDecl</code>,
Douglas Gregor72e4f252012-09-16 01:44:02 +0000319etc.) has a corresponding record type in the AST file. When
320declarations are deserialized from the AST file, the data
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000321within the record is used to build and populate a new instance of the
322corresponding <code>Decl</code> node. As with types, each declaration
323node has a numeric ID that is used to refer to that declaration within
Douglas Gregor72e4f252012-09-16 01:44:02 +0000324the AST file. In addition, a lookup table provides a mapping from that
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000325numeric ID to the offset within the precompiled header where that
326declaration is described.</p>
327
328<p>Declarations in Clang's abstract syntax trees are stored
329hierarchically. At the top of the hierarchy is the translation unit
330(<code>TranslationUnitDecl</code>), which contains all of the
Douglas Gregor72e4f252012-09-16 01:44:02 +0000331declarations in the translation unit but is not actually written as a
332specific declaration node. Its child declarations (such as
Chris Lattner57eccbe2009-06-13 18:11:10 +0000333functions or struct types) may also contain other declarations inside
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000334them, and so on. Within Clang, each declaration is stored within a <a
335href="http://clang.llvm.org/docs/InternalsManual.html#DeclContext">declaration
336context</a>, as represented by the <code>DeclContext</code> class.
337Declaration contexts provide the mechanism to perform name lookup
338within a given declaration (e.g., find the member named <code>x</code>
339in a structure) and iterate over the declarations stored within a
340context (e.g., iterate over all of the fields of a structure for
341structure layout).</p>
342
Douglas Gregor72e4f252012-09-16 01:44:02 +0000343<p>In Clang's AST file format, deserializing a declaration
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000344that is a <code>DeclContext</code> is a separate operation from
345deserializing all of the declarations stored within that declaration
346context. Therefore, Clang will deserialize the translation unit
347declaration without deserializing the declarations within that
348translation unit. When required, the declarations stored within a
Chris Lattner57eccbe2009-06-13 18:11:10 +0000349declaration context will be deserialized. There are two representations
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000350of the declarations within a declaration context, which correspond to
351the name-lookup and iteration behavior described above:</p>
352
353<ul>
354 <li>When the front end performs name lookup to find a name
355 <code>x</code> within a given declaration context (for example,
356 during semantic analysis of the expression <code>p-&gt;x</code>,
357 where <code>p</code>'s type is defined in the precompiled header),
Douglas Gregor72e4f252012-09-16 01:44:02 +0000358 Clang refers to an on-disk hash table that maps from the names
359 within that declaration context to the declaration IDs that
360 represent each visible declaration with that name. The actual
361 declarations will then be deserialized to provide the results of
362 name lookup.</li>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000363
364 <li>When the front end performs iteration over all of the
365 declarations within a declaration context, all of those declarations
366 are immediately de-serialized. For large declaration contexts (e.g.,
367 the translation unit), this operation is expensive; however, large
368 declaration contexts are not traversed in normal compilation, since
369 such a traversal is unnecessary. However, it is common for the code
370 generator and semantic analysis to traverse declaration contexts for
371 structs, classes, unions, and enumerations, although those contexts
372 contain relatively few declarations in the common case.</li>
373</ul>
374
Douglas Gregor923cb232009-06-03 18:35:59 +0000375<h3 id="stmt">Statements and Expressions</h3>
Douglas Gregor5accbb92009-06-03 16:06:22 +0000376
Douglas Gregor72e4f252012-09-16 01:44:02 +0000377<p>Statements and expressions are stored in the AST file in
Douglas Gregor5accbb92009-06-03 16:06:22 +0000378both the <a href="#types">types</a> and the <a
379 href="#decls">declarations</a> blocks, because every statement or
380expression will be associated with either a type or declaration. The
381actual statement and expression records are stored immediately
382following the declaration or type that owns the statement or
383expression. For example, the statement representing the body of a
384function will be stored directly following the declaration of the
385function.</p>
386
387<p>As with types and declarations, each statement and expression kind
388in Clang's abstract syntax tree (<code>ForStmt</code>,
389<code>CallExpr</code>, etc.) has a corresponding record type in the
Douglas Gregor72e4f252012-09-16 01:44:02 +0000390AST file, which contains the serialized representation of
Douglas Gregorfe3f2232009-06-03 18:26:16 +0000391that statement or expression. Each substatement or subexpression
392within an expression is stored as a separate record (which keeps most
Douglas Gregor72e4f252012-09-16 01:44:02 +0000393records to a fixed size). Within the AST file, the
Argyrios Kyrtzidis86d3ca52010-09-13 17:48:02 +0000394subexpressions of an expression are stored, in reverse order, prior to the expression
Douglas Gregorfe3f2232009-06-03 18:26:16 +0000395that owns those expression, using a form of <a
396href="http://en.wikipedia.org/wiki/Reverse_Polish_notation">Reverse
397Polish Notation</a>. For example, an expression <code>3 - 4 + 5</code>
398would be represented as follows:</p>
399
400<table border="1">
Douglas Gregorfe3f2232009-06-03 18:26:16 +0000401 <tr><td><code>IntegerLiteral(5)</code></td></tr>
Argyrios Kyrtzidis86d3ca52010-09-13 17:48:02 +0000402 <tr><td><code>IntegerLiteral(4)</code></td></tr>
403 <tr><td><code>IntegerLiteral(3)</code></td></tr>
404 <tr><td><code>BinaryOperator(-)</code></td></tr>
Douglas Gregorfe3f2232009-06-03 18:26:16 +0000405 <tr><td><code>BinaryOperator(+)</code></td></tr>
406 <tr><td>STOP</td></tr>
407</table>
408
409<p>When reading this representation, Clang evaluates each expression
Argyrios Kyrtzidis86d3ca52010-09-13 17:48:02 +0000410record it encounters, builds the appropriate abstract syntax tree node,
Douglas Gregorfe3f2232009-06-03 18:26:16 +0000411and then pushes that expression on to a stack. When a record contains <i>N</i>
412subexpressions--<code>BinaryOperator</code> has two of them--those
413expressions are popped from the top of the stack. The special STOP
414code indicates that we have reached the end of a serialized expression
415or statement; other expression or statement records may follow, but
416they are part of a different expression.</p>
Douglas Gregor5accbb92009-06-03 16:06:22 +0000417
Douglas Gregor923cb232009-06-03 18:35:59 +0000418<h3 id="idtable">Identifier Table Block</h3>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000419
420<p>The identifier table block contains an on-disk hash table that maps
Douglas Gregor72e4f252012-09-16 01:44:02 +0000421each identifier mentioned within the AST file to the
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000422serialized representation of the identifier's information (e.g, the
423<code>IdentifierInfo</code> structure). The serialized representation
424contains:</p>
425
426<ul>
427 <li>The actual identifier string.</li>
428 <li>Flags that describe whether this identifier is the name of a
429 built-in, a poisoned identifier, an extension token, or a
430 macro.</li>
431 <li>If the identifier names a macro, the offset of the macro
432 definition within the <a href="#preprocessor">preprocessor
433 block</a>.</li>
434 <li>If the identifier names one or more declarations visible from
435 translation unit scope, the <a href="#decls">declaration IDs</a> of these
436 declarations.</li>
437</ul>
438
Douglas Gregor72e4f252012-09-16 01:44:02 +0000439<p>When an AST file is loaded, the AST file reader
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000440mechanism introduces itself into the identifier table as an external
441lookup source. Thus, when the user program refers to an identifier
442that has not yet been seen, Clang will perform a lookup into the
Chris Lattner57eccbe2009-06-13 18:11:10 +0000443identifier table. If an identifier is found, its contents (macro
Douglas Gregor72e4f252012-09-16 01:44:02 +0000444definitions, flags, top-level declarations, etc.) will be
445deserialized, at which point the corresponding
446<code>IdentifierInfo</code> structure will have the same contents it
447would have after parsing the headers in the AST file.</p>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000448
Douglas Gregor72e4f252012-09-16 01:44:02 +0000449<p>Within the AST file, the identifiers used to name declarations are represented with an integral value. A separate table provides a mapping from this integral value (the identifier ID) to the location within the on-disk
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000450hash table where that identifier is stored. This mapping is used when
451deserializing the name of a declaration, the identifier of a token, or
Douglas Gregor72e4f252012-09-16 01:44:02 +0000452any other construct in the AST file that refers to a name.</p>
Douglas Gregor2cc390e2009-06-02 22:08:07 +0000453
Douglas Gregor923cb232009-06-03 18:35:59 +0000454<h3 id="method-pool">Method Pool Block</h3>
Douglas Gregor5accbb92009-06-03 16:06:22 +0000455
456<p>The method pool block is represented as an on-disk hash table that
457serves two purposes: it provides a mapping from the names of
458Objective-C selectors to the set of Objective-C instance and class
459methods that have that particular selector (which is required for
460semantic analysis in Objective-C) and also stores all of the selectors
Douglas Gregor72e4f252012-09-16 01:44:02 +0000461used by entities within the AST file. The design of the
Douglas Gregor5accbb92009-06-03 16:06:22 +0000462method pool is similar to that of the <a href="#idtable">identifier
463table</a>: the first time a particular selector is formed during the
464compilation of the program, Clang will search in the on-disk hash
465table of selectors; if found, Clang will read the Objective-C methods
466associated with that selector into the appropriate front-end data
467structure (<code>Sema::InstanceMethodPool</code> and
468<code>Sema::FactoryMethodPool</code> for instance and class methods,
469respectively).</p>
470
471<p>As with identifiers, selectors are represented by numeric values
Douglas Gregor72e4f252012-09-16 01:44:02 +0000472within the AST file. A separate index maps these numeric selector
Douglas Gregor5accbb92009-06-03 16:06:22 +0000473values to the offset of the selector within the on-disk hash table,
474and will be used when de-serializing an Objective-C method declaration
475(or other Objective-C construct) that refers to the selector.</p>
476
Douglas Gregor72e4f252012-09-16 01:44:02 +0000477<h2 id="tendrils">AST Reader Integration Points</h2>
Douglas Gregor0084ead2009-06-03 21:41:31 +0000478
Douglas Gregor72e4f252012-09-16 01:44:02 +0000479<p>The "lazy" deserialization behavior of AST files requires
Douglas Gregor0084ead2009-06-03 21:41:31 +0000480their integration into several completely different submodules of
481Clang. For example, lazily deserializing the declarations during name
482lookup requires that the name-lookup routines be able to query the
Douglas Gregor72e4f252012-09-16 01:44:02 +0000483AST file to find entities stored there.</p>
Douglas Gregor0084ead2009-06-03 21:41:31 +0000484
485<p>For each Clang data structure that requires direct interaction with
Douglas Gregor72e4f252012-09-16 01:44:02 +0000486the AST reader logic, there is an abstract class that provides
487the interface between the two modules. The <code>ASTReader</code>
488class, which handles the loading of an AST file, inherits
Douglas Gregor0084ead2009-06-03 21:41:31 +0000489from all of these abstract classes to provide lazy deserialization of
Douglas Gregor72e4f252012-09-16 01:44:02 +0000490Clang's data structures. <code>ASTReader</code> implements the
Douglas Gregor0084ead2009-06-03 21:41:31 +0000491following abstract classes:</p>
492
493<dl>
494 <dt><code>StatSysCallCache</code></dt>
495 <dd>This abstract interface is associated with the
496 <code>FileManager</code> class, and is used whenever the file
497 manager is going to perform a <code>stat()</code> system call.</dd>
498
499 <dt><code>ExternalSLocEntrySource</code></dt>
500 <dd>This abstract interface is associated with the
501 <code>SourceManager</code> class, and is used whenever the
502 <a href="#sourcemgr">source manager</a> needs to load the details
503 of a file, buffer, or macro instantiation.</dd>
504
505 <dt><code>IdentifierInfoLookup</code></dt>
506 <dd>This abstract interface is associated with the
507 <code>IdentifierTable</code> class, and is used whenever the
508 program source refers to an identifier that has not yet been seen.
Douglas Gregor72e4f252012-09-16 01:44:02 +0000509 In this case, the AST reader searches for
Douglas Gregor0084ead2009-06-03 21:41:31 +0000510 this identifier within its <a href="#idtable">identifier table</a>
511 to load any top-level declarations or macros associated with that
512 identifier.</dd>
513
514 <dt><code>ExternalASTSource</code></dt>
515 <dd>This abstract interface is associated with the
516 <code>ASTContext</code> class, and is used whenever the abstract
Douglas Gregor72e4f252012-09-16 01:44:02 +0000517 syntax tree nodes need to loaded from the AST file. It
Douglas Gregor0084ead2009-06-03 21:41:31 +0000518 provides the ability to de-serialize declarations and types
519 identified by their numeric values, read the bodies of functions
520 when required, and read the declarations stored within a
521 declaration context (either for iteration or for name lookup).</dd>
522
523 <dt><code>ExternalSemaSource</code></dt>
524 <dd>This abstract interface is associated with the <code>Sema</code>
525 class, and is used whenever semantic analysis needs to read
526 information from the <a href="#methodpool">global method
527 pool</a>.</dd>
528</dl>
529
Douglas Gregor72e4f252012-09-16 01:44:02 +0000530<h2 id="chained">Chained precompiled headers</h2>
531
532<p>Chained precompiled headers were initially intended to improve the
533performance of IDE-centric operations such as syntax highlighting and
534code completion while a particular source file is being edited by the
535user. To minimize the amount of reparsing required after a change to
536the file, a form of precompiled header--called a precompiled
537<i>preamble</i>--is automatically generated by parsing all of the
538headers in the source file, up to and including the last
539#include. When only the source file changes (and none of the headers
540it depends on), reparsing of that source file can use the precompiled
541preamble and start parsing after the #includes, so parsing time is
542proportional to the size of the source file (rather than all of its
543includes). However, the compilation of that translation unit
544may already uses a precompiled header: in this case, Clang will create
545the precompiled preamble as a chained precompiled header that refers
546to the original precompiled header. This drastically reduces the time
547needed to serialize the precompiled preamble for use in reparsing.</p>
548
549<p>Chained precompiled headers get their name because each precompiled header
550can depend on one other precompiled header, forming a chain of
551dependencies. A translation unit will then include the precompiled
552header that starts the chain (i.e., nothing depends on it). This
553linearity of dependencies is important for the semantic model of
554chained precompiled headers, because the most-recent precompiled
555header can provide information that overrides the information provided
556by the precompiled headers it depends on, just like a header file
557<code>B.h</code> that includes another header <code>A.h</code> can
558modify the state produced by parsing <code>A.h</code>, e.g., by
559<code>#undef</code>'ing a macro defined in <code>A.h</code>.</p>
560
561<p>There are several ways in which chained precompiled headers
562generalize the AST file model:</p>
563
564<dl>
565 <dt>Numbering of IDs</dt>
566 <dd>Many different kinds of entities--identifiers, declarations,
567 types, etc.---have ID numbers that start at 1 or some other
568 predefined constant and grow upward. Each precompiled header records
569 the maximum ID number it has assigned in each category. Then, when a
570 new precompiled header is generated that depends on (chains to)
571 another precompiled header, it will start counting at the next
572 available ID number. This way, one can determine, given an ID
573 number, which AST file actually contains the entity.</dd>
574
575 <dt>Name lookup</dt>
576 <dd>When writing a chained precompiled header, Clang attempts to
577 write only information that has changed from the precompiled header
578 on which it is based. This changes the lookup algorithm for the
579 various tables, such as the <a href="#idtable">identifier table</a>:
580 the search starts at the most-recent precompiled header. If no entry
581 is found, lookup then proceeds to the identifier table in the
582 precompiled header it depends on, and so one. Once a lookup
583 succeeds, that result is considered definitive, overriding any
584 results from earlier precompiled headers.</dd>
585
586 <dt>Update records</dt>
587 <dd>There are various ways in which a later precompiled header can
588 modify the entities described in an earlier precompiled header. For
589 example, later precompiled headers can add entries into the various
590 name-lookup tables for the translation unit or namespaces, or add
591 new categories to an Objective-C class. Each of these updates is
592 captured in an "update record" that is stored in the chained
593 precompiled header file and will be loaded along with the original
594 entity.</dd>
595</dl>
596
597<h2 id="modules">Modules</h2>
598
599<p>Modules generalize the chained precompiled header model yet
600further, from a linear chain of precompiled headers to an arbitrary
601directed acyclic graph (DAG) of AST files. All of the same techniques
602used to make chained precompiled headers work---ID number, name
603lookup, update records---are shared with modules. However, the DAG
604nature of modules introduce a number of additional complications to
605the model:
606
607<dl>
608 <dt>Numbering of IDs</dt>
609 <dd>The simple, linear numbering scheme used in chained precompiled
610 headers falls apart with the module DAG, because different modules
611 may end up with different numbering schemes for entities they
612 imported from common shared modules. To account for this, each
613 module file provides information about which modules it depends on
614 and which ID numbers it assigned to the entities in those modules,
615 as well as which ID numbers it took for its own new entities. The
616 AST reader then maps these "local" ID numbers into a "global" ID
617 number space for the current translation unit, providing a 1-1
618 mapping between entities (in whatever AST file they inhabit) and
619 global ID numbers. If that translation unit is then serialized into
620 an AST file, this mapping will be stored for use when the AST file
621 is imported.</dd>
622
623 <dt>Declaration merging</dt>
624 <dd>It is possible for a given entity (from the language's
625 perspective) to be declared multiple times in different places. For
626 example, two different headers can have the declaration of
627 <tt>printf</tt> or could forward-declare <tt>struct stat</tt>. If
628 each of those headers is included in a module, and some third party
629 imports both of those modules, there is a potentially serious
630 problem: name lookup for <tt>printf</tt> or <tt>struct stat</tt> will
631 find both declarations, but the AST nodes are unrelated. This would
632 result in a compilation error, due to an ambiguity in name
633 lookup. Therefore, the AST reader performs declaration merging
Douglas Gregor0e6b1552012-09-21 20:16:09 +0000634 according to the appropriate language semantics, ensuring that the
Douglas Gregor72e4f252012-09-16 01:44:02 +0000635 two disjoint declarations are merged into a single redeclaration
636 chain (with a common canonical declaration), so that it is as if one
637 of the headers had been included before the other.</dd>
638
639 <dt>Name Visibility</dt>
640 <dd>Modules allow certain names that occur during module creation to
641 be "hidden", so that they are not part of the public interface of
642 the module and are not visible to its clients. The AST reader
643 maintains a "visible" bit on various AST nodes (declarations, macros,
644 etc.) to indicate whether that particular AST node is currently
645 visible; the various name lookup mechanisms in Clang inspect the
646 visible bit to determine whether that entity, which is still in the
647 AST (because other, visible AST nodes may depend on it), can
648 actually be found by name lookup. When a new (sub)module is
649 imported, it may make existing, non-visible, already-deserialized
650 AST nodes visible; it is the responsibility of the AST reader to
651 find and update these AST nodes when it is notified of the import.</dd>
652
653</dl>
654
Douglas Gregor32110df2009-05-20 00:16:32 +0000655</div>
656
Douglas Gregor4c0397f2009-06-03 21:55:35 +0000657</body>
Douglas Gregor32110df2009-05-20 00:16:32 +0000658</html>