More PCH documentation
git-svn-id: https://llvm.org/svn/llvm-project/cfe/trunk@72743 91177308-0d34-0410-b5e6-96231b3b80d8
diff --git a/docs/PCHInternals.html b/docs/PCHInternals.html
index d90c446..ef1dd37 100644
--- a/docs/PCHInternals.html
+++ b/docs/PCHInternals.html
@@ -63,9 +63,250 @@
PCH file generation serializes the build when all compilations
require the PCH file to be up-to-date.</li>
</ul>
+
+<p>Clang's precompiled headers are designed with a compact on-disk
+representation, which minimizes both PCH creation time and the time
+required to initially load the PCH file. The PCH file itself contains
+a serialized representation of Clang's abstract syntax trees and
+supporting data structures, stored using the same compressed bitstream
+as <a href="http://llvm.org/docs/BitCodeFormat.html">LLVM's bitcode
+file format</a>.</p>
+
+<p>Clang's precompiled headers are loaded "lazily" from disk. When a
+PCH file is initially loaded, Clang reads only a small amount of data
+from the PCH file to establish where certain important data structures
+are stored. The amount of data read in this initial load is
+independent of the size of the PCH file, such that a larger PCH file
+does not lead to longer PCH load times. The actual header data in the
+PCH file--macros, functions, variables, types, etc.--is loaded only
+when it is referenced from the user's code, at which point only that
+entity (and those entities it depends on) is deserialized from the
+PCH file. With this approach, the cost of using a precompiled header
+for a translation unit is proportional to the amount of code actually
+used from the header, rather than being proportional to the size of
+the header itself.</p> </body>
+
+<h2>Precompiled Header Contents</h2>
+
+<img src="PCHLayout.png" align="right" alt="Precompiled header layout">
+
+<p>Clang's precompiled headers are organized into several different
+blocks, each of which contains the serialized representation of a part
+of Clang's internal representation. Each of the blocks corresponds to
+either a block or a record within <a
+ href="http://llvm.org/docs/BitCodeFormat.html">LLVM's bitstream
+format</a>. The contents of each of these logical blocks are described
+below.</p>
+
+<h3 name="metadata">Metadata Block</h3>
+
+<p>The metadata block contains several records that provide
+information about how the precompiled header was built. This metadata
+is primarily used to validate the use of a precompiled header. For
+example, a precompiled header built for x86 (32-bit) cannot be used
+when compiling for x86-64 (64-bit). The metadata block contains
+information about:</p>
+
+<dl>
+ <dt>Language options</dt>
+ <dd>Describes the particular language dialect used to compile the
+PCH file, including major options (e.g., Objective-C support) and more
+minor options (e.g., support for "//" comments). The contents of this
+record correspond to the <code>LangOptions</code> class.</dd>
-<p>More to be written...</p>
+ <dt>Target architecture</dt>
+ <dd>The target triple that describes the architecture, platform, and
+ABI for which the PCH file was generated, e.g.,
+<code>i386-apple-darwin9</code>.</dd>
+
+ <dt>PCH version</dt>
+ <dd>The major and minor version numbers of the precompiled header
+format. Changes in the minor version number should not affect backward
+compatibility, while changes in the major version number imply that a
+newer compiler cannot read an older precompiled header (and
+vice-versa).</dd>
+
+ <dt>Original file name</dt>
+ <dd>The full path of the header that was used to generate the
+precompiled header.</dd>
+
+ <dt>Predefines buffer</dt>
+ <dd>Although not explicitly stored as part of the metadata, the
+predefines buffer is used in the validation of the precompiled header.
+The predefines buffer itself contains code generated by the compiler
+to initialize the preprocessor state according to the current target,
+platform, and command-line options. For example, the predefines buffer
+will contain "<code>#define __STDC__ 1</code>" when we are compiling C
+without Microsoft extensions. The predefines buffer itself is stored
+within the <a href="#sourcemgr">source manager block</a>, but its
+contents are verified along with the rest of the metadata.</dd> </dl>
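+<p>The validation described above can be sketched as a field-by-field
+comparison between the metadata recorded in the PCH file and the state
+of the current compilation. This is a minimal illustration, not Clang's
+actual reader: <code>PCHMetadata</code>, <code>PCHCheck</code>, and
+<code>validatePCH</code> are hypothetical names, and only two language
+options are modeled.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for the records in the metadata block.
struct PCHMetadata {
  bool ObjC = false;         // a "major" language option
  bool BCPLComment = false;  // a "minor" language option ("//" comments)
  std::string TargetTriple;  // e.g., "i386-apple-darwin9"
  unsigned VersionMajor = 0;
  unsigned VersionMinor = 0;
  std::string Predefines;    // stored in the source manager block
};

enum class PCHCheck {
  OK,
  VersionMismatch,
  TargetMismatch,
  LanguageMismatch,
  PredefinesMismatch
};

// Compares the metadata read from a PCH file against the current
// compilation. The minor version is deliberately ignored, since minor
// changes should not affect compatibility.
inline PCHCheck validatePCH(const PCHMetadata &PCH,
                            const PCHMetadata &Current) {
  if (PCH.VersionMajor != Current.VersionMajor)
    return PCHCheck::VersionMismatch;
  if (PCH.TargetTriple != Current.TargetTriple)
    return PCHCheck::TargetMismatch;
  if (PCH.ObjC != Current.ObjC || PCH.BCPLComment != Current.BCPLComment)
    return PCHCheck::LanguageMismatch;
  if (PCH.Predefines != Current.Predefines)
    return PCHCheck::PredefinesMismatch;
  return PCHCheck::OK;
}
```

+<p>With this sketch, a PCH file built for
+<code>i386-apple-darwin9</code> is rejected with
+<code>TargetMismatch</code> when the current compilation targets an
+x86-64 triple, while a difference in the minor version alone is
+accepted.</p>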
+
+<h3 name="sourcemgr">Source Manager Block</h3>
+
+<p>The source manager block contains the serialized representation of
+Clang's <a
+ href="InternalsManual.html#SourceLocation">SourceManager</a> class,
+which handles the mapping from source locations (as represented in
+Clang's abstract syntax tree) into actual column/line positions within
+a source file or macro instantiation. The precompiled header's
+representation of the source manager also includes information about
+all of the headers that were (transitively) included when building the
+precompiled header.</p>
+
+<p>The bulk of the source manager block is dedicated to information
+about the various files, buffers, and macro instantiations to which
+a source location can refer. Each of these is referenced by a numeric
+"file ID", which is a unique number (allocated starting at 1) stored
+in the source location. Clang serializes the information for each kind
+of file ID, along with an index that maps file IDs to the position
+within the PCH file where the information about that file ID is
+stored. The data associated with a file ID is loaded only when
+required by the front end, e.g., to emit a diagnostic that includes a
+macro instantiation history inside the header itself.</p>
+
+<p>The source manager block also contains information about all of the
+headers that were included when building the precompiled header. This
+includes information about the controlling macro for the header (e.g.,
+when the preprocessor identified that the contents of the header
+depend on a macro like <code>LLVM_CLANG_SOURCEMANAGER_H</code>)
+along with a cached version of the results of the <code>stat()</code>
+system calls performed when building the precompiled header. The
+latter is particularly useful in reducing system time when searching
+for include files.</p>
+
+<h3 name="preprocessor">Preprocessor Block</h3>
+
+<p>The preprocessor block contains the serialized representation of
+the preprocessor. Specifically, it contains all of the macros that
+have been defined by the end of the header used to build the
+precompiled header, along with the token sequences that comprise each
+macro. The macro definitions are only read from the PCH file when the
+name of the macro first occurs in the program. This lazy loading of
+macro definitions is triggered by lookups into the <a
+ href="#idtable">identifier table</a>.</p>
+
+<h3 name="types">Types Block</h3>
+
+<p>The types block contains the serialized representation of all of
+the types referenced in the translation unit. Each Clang type node
+(<code>PointerType</code>, <code>FunctionProtoType</code>, etc.) has a
+corresponding record type in the PCH file. When types are deserialized
+from the precompiled header, the data within the record is used to
+reconstruct the appropriate type node using the AST context.</p>
+
+<p>Each type has a unique type ID, which is an integer that uniquely
+identifies that type. Type ID 0 represents the NULL type, type IDs
+less than <code>NUM_PREDEF_TYPE_IDS</code> represent predefined types
+(<code>void</code>, <code>float</code>, etc.), while other
+"user-defined" type IDs are assigned consecutively from
+<code>NUM_PREDEF_TYPE_IDS</code> upward as the types are encountered.
+The PCH file has an associated mapping from each user-defined type
+ID to the location within the types block where the serialized
+representation of that type resides, enabling lazy deserialization of
+types. When a type is referenced from within the PCH file, that
+reference is encoded using the type ID shifted left by 3 bits. The
+lower three bits are used to represent the <code>const</code>,
+<code>volatile</code>, and <code>restrict</code> qualifiers, as in
+Clang's <a
+ href="http://clang.llvm.org/docs/InternalsManual.html#Type">QualType</a>
+class.</p>
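+<p>The encoding of a type reference can be illustrated with a few
+lines of bit arithmetic. This is a sketch rather than the reader's
+actual code: the helper names are hypothetical, and the assignment of
+individual bits to <code>const</code>, <code>restrict</code>, and
+<code>volatile</code> is an assumption made for the example.</p>

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit assignment for the three qualifiers (illustrative only).
enum Qualifier : uint32_t { Const = 0x1, Restrict = 0x2, Volatile = 0x4 };

constexpr unsigned NumQualifierBits = 3;
constexpr uint32_t QualifierMask = (1u << NumQualifierBits) - 1;

// Packs a type ID and its qualifiers into a single type reference:
// the ID occupies the upper bits, the qualifiers the lower three.
inline uint32_t encodeTypeRef(uint32_t TypeID, uint32_t Quals) {
  return (TypeID << NumQualifierBits) | (Quals & QualifierMask);
}

// Recovers the type ID from a type reference.
inline uint32_t typeIDOf(uint32_t Ref) { return Ref >> NumQualifierBits; }

// Recovers the qualifier bits from a type reference.
inline uint32_t qualifiersOf(uint32_t Ref) { return Ref & QualifierMask; }
```

+<p>For example, under these assumptions a <code>const volatile</code>
+reference to type ID 42 is encoded as <code>(42 &lt;&lt; 3) | 5</code>,
+which is 341.</p>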
+
+<h3 name="decls">Declarations Block</h3>
+
+<p>The declarations block contains the serialized representation of
+all of the declarations referenced in the translation unit. Each Clang
+declaration node (<code>VarDecl</code>, <code>FunctionDecl</code>,
+etc.) has a corresponding record type in the PCH file. When
+declarations are deserialized from the precompiled header, the data
+within the record is used to build and populate a new instance of the
+corresponding <code>Decl</code> node. As with types, each declaration
+node has a numeric ID that is used to refer to that declaration within
+the PCH file. In addition, a lookup table provides a mapping from that
+numeric ID to the offset within the precompiled header where that
+declaration is described.</p>
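+<p>The ID-to-offset lookup table makes lazy deserialization
+straightforward, as the following sketch suggests. Everything here is
+hypothetical: <code>LazyDeclReader</code> is not a Clang class, the
+"file" is an in-memory string, each record is just a NUL-terminated
+name, and treating ID 0 as a null declaration is an assumption.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy declaration node: a real Decl carries far more state.
struct Decl {
  std::string Name;
};

class LazyDeclReader {
  std::string Blob;                           // stand-in for the PCH file
  std::vector<uint32_t> Offsets;              // Offsets[ID - 1] = record position
  std::vector<std::unique_ptr<Decl>> Loaded;  // filled in on first use

public:
  unsigned NumDeserialized = 0;

  LazyDeclReader(std::string B, std::vector<uint32_t> O)
      : Blob(std::move(B)), Offsets(std::move(O)), Loaded(Offsets.size()) {}

  // Returns the declaration with the given ID, deserializing it from
  // the blob only the first time it is requested.
  Decl *getDecl(uint32_t ID) {
    if (ID == 0 || ID > Offsets.size())
      return nullptr;  // assumption: ID 0 means "no declaration"
    std::unique_ptr<Decl> &Slot = Loaded[ID - 1];
    if (!Slot) {
      // Toy record format: a NUL-terminated name at the recorded offset.
      const char *Record = Blob.c_str() + Offsets[ID - 1];
      Slot.reset(new Decl{std::string(Record)});
      ++NumDeserialized;
    }
    return Slot.get();
  }
};
```

+<p>Requesting the same ID twice touches the blob only once; IDs that
+are never requested are never deserialized.</p>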
+
+<p>Declarations in Clang's abstract syntax trees are stored
+hierarchically. At the top of the hierarchy is the translation unit
+(<code>TranslationUnitDecl</code>), which contains all of the
+declarations in the translation unit. These declarations (such as
+functions or struct types) may also contain other declarations inside
+them, and so on. Within Clang, each declaration is stored within a <a
+href="http://clang.llvm.org/docs/InternalsManual.html#DeclContext">declaration
+context</a>, as represented by the <code>DeclContext</code> class.
+Declaration contexts provide the mechanism to perform name lookup
+within a given declaration (e.g., find the member named <code>x</code>
+in a structure) and iterate over the declarations stored within a
+context (e.g., iterate over all of the fields of a structure for
+structure layout).</p>
+
+<p>In Clang's precompiled header format, deserializing a declaration
+that is a <code>DeclContext</code> is a separate operation from
+deserializing all of the declarations stored within that declaration
+context. Therefore, Clang will deserialize the translation unit
+declaration without deserializing the declarations within that
+translation unit. When required, the declarations stored within a
+declaration context will be deserialized. There are two representations
+of the declarations within a declaration context, which correspond to
+the name-lookup and iteration behavior described above:</p>
+
+<ul>
+ <li>When the front end performs name lookup to find a name
+ <code>x</code> within a given declaration context (for example,
+ during semantic analysis of the expression <code>p->x</code>,
+ where <code>p</code>'s type is defined in the precompiled header),
+ Clang deserializes a hash table mapping from the names within that
+ declaration context to the declaration IDs that represent each
+ visible declaration with that name. The entire hash table is
+ deserialized at this point (into the <code>llvm::DenseMap</code>
+ stored within each <code>DeclContext</code> object), but the actual
+ declarations are not yet deserialized. In a second step, those
+ declarations with the name <code>x</code> will be deserialized and
+ will be used as the result of name lookup.</li>
+
+ <li>When the front end performs iteration over all of the
+ declarations within a declaration context, all of those declarations
+ are immediately deserialized. For large declaration contexts (e.g.,
+ the translation unit), this operation is expensive; however, large
+ declaration contexts are not traversed in normal compilation, since
+ such a traversal is unnecessary. The code generator and semantic
+ analysis do commonly traverse the declaration contexts for
+ structs, classes, unions, and enumerations, but those contexts
+ contain relatively few declarations in the common case.</li>
+</ul>
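+<p>The two-step name lookup in the first item above can be sketched as
+follows. This is an illustration under simplifying assumptions, not
+Clang's implementation: <code>LazyDeclContext</code> is a hypothetical
+type, the serialized table is modeled as a vector of pairs,
+<code>std::unordered_map</code> stands in for
+<code>llvm::DenseMap</code>, and a "deserialized" declaration is just
+its name.</p>

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using DeclID = unsigned;

struct LazyDeclContext {
  // Stand-in for the serialized name-to-ID table stored in the PCH file.
  std::vector<std::pair<std::string, DeclID>> StoredTable;

  // In-memory lookup table, populated in full on the first lookup.
  std::unordered_map<std::string, std::vector<DeclID>> Names;
  bool TableLoaded = false;

  // Declarations that have actually been deserialized (ID -> name).
  std::map<DeclID, std::string> Deserialized;

  std::vector<DeclID> lookup(const std::string &Name) {
    // Step 1: load the entire hash table, but no declarations.
    if (!TableLoaded) {
      for (const auto &Entry : StoredTable)
        Names[Entry.first].push_back(Entry.second);
      TableLoaded = true;
    }
    auto It = Names.find(Name);
    if (It == Names.end())
      return {};
    // Step 2: deserialize only the declarations with the requested name.
    for (DeclID ID : It->second)
      Deserialized.emplace(ID, Name);
    return It->second;
  }
};
```

+<p>After a lookup of <code>x</code>, the table knows about every name
+in the context, but only the declarations named <code>x</code> have
+been materialized.</p>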
+
+<h3 name="idtable">Identifier Table Block</h3>
+
+<p>The identifier table block contains an on-disk hash table that maps
+each identifier mentioned within the precompiled header to the
+serialized representation of the identifier's information (e.g., the
+<code>IdentifierInfo</code> structure). The serialized representation
+contains:</p>
+
+<ul>
+ <li>The actual identifier string.</li>
+ <li>Flags that describe whether this identifier is the name of a
+ built-in, a poisoned identifier, an extension token, or a
+ macro.</li>
+ <li>If the identifier names a macro, the offset of the macro
+ definition within the <a href="#preprocessor">preprocessor
+ block</a>.</li>
+ <li>If the identifier names one or more declarations visible from
+ translation unit scope, the <a href="#decls">declaration IDs</a> of these
+ declarations.</li>
+</ul>
+
+<p>When a precompiled header is loaded, the precompiled header
+mechanism introduces itself into the identifier table as an external
+lookup source. Thus, when the user program refers to an identifier
+that has not yet been seen, Clang will perform a lookup into the
+on-disk hash table ... FINISH THIS!</p>
+
+<p>A separate table provides a mapping from the numeric representation
+of identifiers used in the PCH file to the location within the on-disk
+hash table where that identifier is stored. This mapping is used when
+deserializing the name of a declaration, the identifier of a token, or
+any other construct in the PCH file that refers to a name.</p>
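+<p>The "external lookup source" arrangement can be sketched as an
+identifier table that consults a second source on a miss. This covers
+only the behavior stated above (an unseen identifier triggers a lookup
+into the PCH data). The class names are hypothetical rather than
+Clang's, an in-memory map stands in for the on-disk hash table, and
+caching the result in the table is an assumption.</p>

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Toy identifier record: only the name and the "is a macro" flag.
struct IdentifierInfo {
  std::string Name;
  bool IsMacro = false;
};

// Hypothetical interface for anything that can supply identifier data.
struct ExternalIdentifierSource {
  virtual ~ExternalIdentifierSource() = default;
  virtual bool find(const std::string &Name, IdentifierInfo &Out) = 0;
};

// Stand-in for the PCH reader; OnDisk models the on-disk hash table.
struct FakePCHSource : ExternalIdentifierSource {
  std::unordered_map<std::string, IdentifierInfo> OnDisk;
  unsigned NumLookups = 0;

  bool find(const std::string &Name, IdentifierInfo &Out) override {
    ++NumLookups;
    auto It = OnDisk.find(Name);
    if (It == OnDisk.end())
      return false;
    Out = It->second;
    return true;
  }
};

class IdentifierTable {
  std::unordered_map<std::string, IdentifierInfo> Seen;
  ExternalIdentifierSource *External = nullptr;

public:
  void setExternalSource(ExternalIdentifierSource *S) { External = S; }

  // Returns the entry for Name, asking the external source the first
  // time the identifier is seen.
  IdentifierInfo &get(const std::string &Name) {
    auto It = Seen.find(Name);
    if (It != Seen.end())
      return It->second;
    IdentifierInfo Info;
    Info.Name = Name;
    if (External)
      External->find(Name, Info);
    return Seen.emplace(Name, Info).first->second;
  }
};
```

+<p>In this sketch the external source is queried once per distinct
+identifier; later references are answered from the in-memory
+table.</p>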
+
</div>
-</body>
</html>