
 <title>"clang" CFE Internals Manual</title>

 <h1>"clang" CFE Internals Manual</h1>

 <ul>
 <li><a href="#intro">Introduction</a></li>
 <li><a href="#libsystem">LLVM System and Support Libraries</a></li>
 <li><a href="#libbasic">The clang 'Basic' Library</a>
   <ul>
   <li><a href="#SourceLocation">The SourceLocation and SourceManager
       classes</a></li>
   </ul>
 </li>
 <li><a href="#liblex">The Lexer and Preprocessor Library</a>
   <ul>
   <li><a href="#Token">The Token class</a></li>
   <li><a href="#Lexer">The Lexer class</a></li>
   <li><a href="#MacroExpander">The MacroExpander class</a></li>
   <li><a href="#MultipleIncludeOpt">The MultipleIncludeOpt class</a></li>
   </ul>
 </li>
 <li><a href="#libparse">The Parser Library</a>
   <ul>
   </ul>
 </li>
 <li><a href="#libast">The AST Library</a>
   <ul>
   <li><a href="#Type">The Type class and its subclasses</a></li>
   <li><a href="#QualType">The QualType class</a></li>
   </ul>
 </li>
 </ul>


 <!-- ======================================================================= -->
 <h2 id="intro">Introduction</h2>
 <!-- ======================================================================= -->

 <p>This document describes some of the more important APIs and internal design
 decisions made in the clang C front-end.  The purpose of this document is to
 both capture some of this high level information and also describe some of the
 design decisions behind it.  This is meant for people interested in hacking on
 clang, not for end-users.  The description below is categorized by
 libraries, and does not describe any of the clients of the libraries.</p>

 <!-- ======================================================================= -->
 <h2 id="libsystem">LLVM System and Support Libraries</h2>
 <!-- ======================================================================= -->

 <p>The LLVM libsystem library provides the basic clang system abstraction layer,
 which is used for file system access.  The LLVM libsupport library provides many
 underlying libraries and <a
 href="http://llvm.org/docs/ProgrammersManual.html">data-structures</a>,
  including command line option
 processing and various containers.</p>

 <!-- ======================================================================= -->
 <h2 id="libbasic">The clang 'Basic' Library</h2>
 <!-- ======================================================================= -->

 <p>This library certainly needs a better name.  The 'basic' library contains a
 number of low-level utilities for tracking and manipulating source buffers,
 locations within the source buffers, diagnostics, tokens, target abstraction,
 and information about the subset of the language being compiled for.</p>

 <p>Part of this infrastructure is specific to C (such as the TargetInfo class),
 while other parts could be reused for other non-C-based languages (SourceLocation,
 SourceManager, Diagnostics, FileManager).  When and if there is future demand
 we can figure out if it makes sense to introduce a new library, move the general
 classes somewhere else, or introduce some other solution.</p>

 <p>We describe the roles of these classes in order of their dependencies.</p>

 <!-- ======================================================================= -->
 <h3 id="SourceLocation">The SourceLocation and SourceManager classes</h3>
 <!-- ======================================================================= -->

 <p>Strangely enough, the SourceLocation class represents a location within the
 source code of the program.  Important design points include:</p>

 <ol>
 <li>sizeof(SourceLocation) must be extremely small, as these are embedded into
     many AST nodes and are passed around often.  Currently it is 32 bits.</li>
 <li>SourceLocation must be a simple value object that can be efficiently
     copied.</li>
 <li>We should be able to represent a source location for any byte of any input
     file.  This includes in the middle of tokens, in whitespace, in trigraphs,
     etc.</li>
 <li>A SourceLocation must encode the current #include stack that was active when
     the location was processed.  For example, if the location corresponds to a
     token, it should contain the set of #includes active when the token was
     lexed.  This allows us to print the #include stack for a diagnostic.</li>
 <li>SourceLocation must be able to describe macro expansions, capturing both
     the ultimate instantiation point and the source of the original character
     data.</li>
 </ol>

 <p>In practice, the SourceLocation works together with the SourceManager class
 to encode two pieces of information about a location: its physical location
 and its virtual location.  For most tokens, these will be the same.  However,
 for a macro expansion (or tokens that came from a _Pragma directive) these will
 describe the location of the characters corresponding to the token and the
 location where the token was used (i.e. the macro instantiation point or the
 location of the _Pragma itself).</p>

 <p>For efficiency, we only track one level of macro instantiations: if a token was
 produced by multiple instantiations, we only track the source and ultimate
 destination.  Though we could track the intermediate instantiation points, this
 would require extra bookkeeping and no known client would benefit substantially
 from this.</p>
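
 <p>The design points above can be illustrated with a small sketch.  It is not
 clang's actual encoding: the names <tt>SketchLoc</tt>, <tt>LocRecord</tt> and
 <tt>SketchSourceManager</tt>, and the idea of a side table indexed by the
 32-bit handle, are assumptions made purely for illustration.  The sketch shows
 a small value-type handle, a manager that resolves it to a physical and a
 virtual offset, and the "one level only" rule for nested instantiations.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical 32-bit location handle: small and trivially copyable.
struct SketchLoc {
  std::uint32_t Raw;  // 0 means "invalid"; otherwise index + 1 into a table
  bool isValid() const { return Raw != 0; }
};
static_assert(sizeof(SketchLoc) == 4, "location handle must stay 32 bits");

// Side information kept by the manager for each handle.
struct LocRecord {
  std::uint32_t PhysicalOffset;  // where the token's characters live
  std::uint32_t VirtualOffset;   // where the token was ultimately used
};

class SketchSourceManager {
  std::vector<LocRecord> Records;

  SketchLoc addRecord(std::uint32_t Physical, std::uint32_t Virtual) {
    Records.push_back({Physical, Virtual});
    return SketchLoc{static_cast<std::uint32_t>(Records.size())};
  }

public:
  // Ordinary token: physical and virtual locations coincide.
  SketchLoc getFileLoc(std::uint32_t Offset) {
    return addRecord(Offset, Offset);
  }

  // Token produced by a macro: keep only the original source of the
  // characters and the ultimate point of use, dropping intermediate levels.
  SketchLoc getInstantiationLoc(SketchLoc Spelling, SketchLoc Instantiation) {
    return addRecord(getPhysical(Spelling), getVirtual(Instantiation));
  }

  std::uint32_t getPhysical(SketchLoc L) const {
    assert(L.isValid() && "cannot resolve an invalid location");
    return Records[L.Raw - 1].PhysicalOffset;
  }

  std::uint32_t getVirtual(SketchLoc L) const {
    assert(L.isValid() && "cannot resolve an invalid location");
    return Records[L.Raw - 1].VirtualOffset;
  }
};
```

 <p>In this sketch, building a location from an already-instantiated location
 still reports the physical offset of the original characters, which mirrors
 the decision to track only the source and the ultimate destination.</p>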

 <p>The clang front-end inherently depends on the location of a token being
 tracked correctly.  If it is ever incorrect, the front-end may get confused and
 die.  The reason for this is that the notion of the 'spelling' of a Token in
 clang depends on being able to find the original input characters for the token.
 This concept maps directly to the "physical" location for the token.</p>

 <!-- ======================================================================= -->
 <h2 id="liblex">The Lexer and Preprocessor Library</h2>
 <!-- ======================================================================= -->

 <p>The Lexer library contains several tightly-connected classes that are involved
 with the nasty process of lexing and preprocessing C source code.  The main
 interface to this library for outside clients is the large <a
 href="#Preprocessor">Preprocessor</a> class.
 It contains the various pieces of state that are required to coherently read
 tokens out of a translation unit.</p>

 <p>The core interface to the Preprocessor object (once it is set up) is the
 Preprocessor::Lex method, which returns the next <a href="#Token">Token</a> from
 the preprocessor stream.  There are two types of token providers that the
 preprocessor is capable of reading from: a buffer lexer (provided by the <a
 href="#Lexer">Lexer</a> class) and a buffered token stream (provided by the <a
 href="#MacroExpander">MacroExpander</a> class).</p>


 <!-- ======================================================================= -->
 <h3 id="Token">The Token class</h3>
 <!-- ======================================================================= -->

 <p>The Token class is used to represent a single lexed token.  Tokens are
 intended to be used by the lexer/preprocessor and parser libraries, but are not
 intended to live beyond them (for example, they should not live in the ASTs).</p>

 <p>Tokens most often live on the stack (or some other location that is efficient
 to access) as the parser is running, but occasionally do get buffered up.  For
 example, macro definitions are stored as a series of tokens, and the C++
 front-end will eventually need to buffer tokens up for tentative parsing and
 various pieces of look-ahead.  As such, the size of a Token matters.  On a 32-bit
 system, sizeof(Token) is currently 16 bytes.</p>

 <p>Tokens contain the following information:</p>

 <ul>
 <li><b>A SourceLocation</b> - This indicates the location of the start of the
 token.</li>

 <li><b>A length</b> - This stores the length of the token as stored in the
 SourceBuffer.  For tokens that include them, this length includes trigraphs and
 escaped newlines which are ignored by later phases of the compiler.  By pointing
 into the original source buffer, it is always possible to get the original
 spelling of a token completely accurately.</li>

 <li><b>IdentifierInfo</b> - If a token takes the form of an identifier, and if
 identifier lookup was enabled when the token was lexed (e.g. the lexer was not
 reading in 'raw' mode) this contains a pointer to the unique hash value for the
 identifier.  Because the lookup happens before keyword identification, this
 field is set even for language keywords like 'for'.</li>

 <li><b>TokenKind</b> - This indicates the kind of token as classified by the
 lexer.  This includes things like <tt>tok::starequal</tt> (for the "*="
 operator), <tt>tok::ampamp</tt> for the "&amp;&amp;" token, and keyword values
 (e.g. <tt>tok::kw_for</tt>) for identifiers that correspond to keywords.  Note
 that some tokens can be spelled multiple ways.  For example, C++ supports
 "operator keywords", where things like "and" are treated exactly like the
 "&amp;&amp;" operator.  In these cases, the kind value is set to
 <tt>tok::ampamp</tt>, which is good for the parser, which doesn't have to
 consider both forms.  For something that cares about which form is used (e.g.
 the preprocessor 'stringize' operator) the spelling indicates the original
 form.</li>

 <li><b>Flags</b> - There are currently four flags tracked by the
 lexer/preprocessor system on a per-token basis:

   <ol>
   <li><b>StartOfLine</b> - This was the first token that occurred on its input
        source line.</li>
   <li><b>LeadingSpace</b> - There was a space character either immediately
        before the token or transitively before the token as it was expanded
        through a macro.  The definition of this flag is dictated by
        the stringizing requirements of the preprocessor.</li>
   <li><b>DisableExpand</b> - This flag is used internally to the preprocessor to
       represent identifier tokens which have macro expansion disabled.  This
       prevents them from being considered as candidates for macro expansion ever
       in the future.</li>
   <li><b>NeedsCleaning</b> - This flag is set if the original spelling for the
       token includes a trigraph or escaped newline.  Since this is uncommon,
       many pieces of code can fast-path on tokens that did not need cleaning.
       </li>
    </ol>
 </li>
 </ul>
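
 <p>The fields listed above could be laid out roughly as in the following
 sketch.  This is an illustration only: the <tt>sketch</tt> namespace, the
 field names, the flag values and the tiny <tt>TokenKind</tt> enumeration are
 assumptions, and no claim is made that this layout matches the real class or
 its size.</p>

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// A tiny, hypothetical subset of token kinds.
enum TokenKind : std::uint8_t { identifier, kw_for, starequal, ampamp };

struct IdentifierInfo;  // opaque here; owned by an identifier table

class Token {
public:
  enum TokenFlags : std::uint8_t {
    StartOfLine   = 0x01,  // first token on its source line
    LeadingSpace  = 0x02,  // whitespace appeared before the token
    DisableExpand = 0x04,  // identifier must never be macro expanded
    NeedsCleaning = 0x08   // spelling contains a trigraph or escaped newline
  };

  std::uint32_t Loc = 0;            // start of the token (stand-in for SourceLocation)
  std::uint32_t Length = 0;         // length as spelled in the source buffer
  IdentifierInfo *Ident = nullptr;  // null when lexed without identifier lookup
  TokenKind Kind = identifier;
  std::uint8_t Flags = 0;

  void setFlag(TokenFlags F) {
    Flags = static_cast<std::uint8_t>(Flags | F);
  }
  void clearFlag(TokenFlags F) {
    Flags = static_cast<std::uint8_t>(Flags & ~F);
  }
  bool hasFlag(TokenFlags F) const { return (Flags & F) != 0; }

  // Clients can take a fast path when the spelling needs no cleaning.
  bool needsCleaning() const { return hasFlag(NeedsCleaning); }
};

}  // namespace sketch
```

 <p>Note that a token spelled "and" in C++ would carry
 <tt>Kind == ampamp</tt> with <tt>Length == 3</tt> in this sketch: the kind
 says what the parser sees, while the location and length still recover the
 original spelling.</p>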

 <p>One interesting (and somewhat unusual) aspect of tokens is that they don't
 contain any semantic information about the lexed value.  For example, if the
 token was a pp-number token, we do not represent the value of the number that
 was lexed (this is left for later pieces of code to decide).  Additionally, the
 lexer library has no notion of typedef names vs variable names: both are
 returned as identifiers, and the parser is left to decide whether a specific
 identifier is a typedef or a variable (tracking this requires scope information
 among other things).</p>

 <!-- ======================================================================= -->
 <h3 id="Lexer">The Lexer class</h3>
 <!-- ======================================================================= -->

 <p>The Lexer class provides the mechanics of lexing tokens out of a source
 buffer and deciding what they mean.  The Lexer is complicated by the fact that
 it operates on raw buffers that have not had spelling eliminated (this is a
 necessity to get decent performance), but this is countered with careful coding
 as well as standard performance techniques (for example, the comment handling
 code is vectorized on X86 and PowerPC hosts).</p>

 <p>The lexer has a couple of interesting modal features:</p>

 <ul>
 <li>The lexer can operate in 'raw' mode.  This mode has several features that
     make it possible to quickly lex the file (e.g. it stops identifier lookup,
     doesn't specially handle preprocessor tokens, handles EOF differently, etc).
     This mode is used for lexing within an "<tt>#if 0</tt>" block, for
     example.</li>
 <li>The lexer can capture and return comments as tokens.  This is required to
     support the -C preprocessor mode, which passes comments through, and is
     used by the diagnostic checker to identify expect-error annotations.</li>
 <li>The lexer can be in ParsingFilename mode, which happens when preprocessing
     after reading a #include directive.  This mode changes the parsing of '&lt;'
     to return an "angled string" instead of a bunch of tokens for each thing
     within the filename.</li>
 <li>When parsing a preprocessor directive (after "<tt>#</tt>") the
     ParsingPreprocessorDirective mode is entered.  This changes the lexer to
     return EOM at a newline.</li>
 <li>The Lexer uses a LangOptions object to know whether trigraphs are enabled,
     whether C++ or ObjC keywords are recognized, etc.</li>
 </ul>

 <p>In addition to these modes, the lexer keeps track of a couple of other
    features that are local to a lexed buffer, which change as the buffer is
    lexed:</p>

 <ul>
 <li>The Lexer uses BufferPtr to keep track of the current character being
     lexed.</li>
 <li>The Lexer uses IsAtStartOfLine to keep track of whether the next lexed token
     will start with its "start of line" bit set.</li>
 <li>The Lexer keeps track of the current #if directives that are active (which
     can be nested).</li>
 <li>The Lexer keeps track of an <a href="#MultipleIncludeOpt">
     MultipleIncludeOpt</a> object, which is used to
     detect whether the buffer uses the standard "<tt>#ifndef XX</tt> /
     <tt>#define XX</tt>" idiom to prevent multiple inclusion.  If a buffer does,
     subsequent includes can be ignored if the XX macro is defined.</li>
 </ul>

 <!-- ======================================================================= -->
 <h3 id="MacroExpander">The MacroExpander class</h3>
 <!-- ======================================================================= -->

 <p>The MacroExpander class is a token provider that returns tokens from a list
 of tokens that came from somewhere else.  It is typically used for two things: 1)
 returning tokens from a macro definition as it is being expanded, and 2) returning
 tokens from an arbitrary buffer of tokens.  The latter use is employed by _Pragma and
 will most likely be used to handle unbounded look-ahead for the C++ parser.</p>
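
 <p>A token provider that replays a stored list can be sketched as follows.
 The names <tt>SimpleToken</tt> and <tt>TokenListProvider</tt> are hypothetical
 and the token is reduced to a spelling string; the real class deals with
 argument substitution and other details that are left out here.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for a lexed token.
struct SimpleToken {
  std::string Spelling;
  bool IsEOF;
};

// Hands out tokens from a previously recorded list, one at a time.
class TokenListProvider {
  std::vector<SimpleToken> Tokens;
  std::size_t Next;

public:
  explicit TokenListProvider(std::vector<SimpleToken> Toks)
      : Tokens(std::move(Toks)), Next(0) {}

  bool isDone() const { return Next == Tokens.size(); }

  // Look at the next token without consuming it.
  const SimpleToken &peek() const {
    assert(!isDone() && "no token left to peek at");
    return Tokens[Next];
  }

  // Return the next token, or an end marker once the list is exhausted.
  SimpleToken lex() {
    if (isDone())
      return SimpleToken{"", true};
    return Tokens[Next++];
  }
};
```

 <p>The same provider serves both uses described above: the list can hold the
 body of a macro definition or any other buffered run of tokens.</p>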

 <!-- ======================================================================= -->
 <h3 id="MultipleIncludeOpt">The MultipleIncludeOpt class</h3>
 <!-- ======================================================================= -->

 <p>The MultipleIncludeOpt class implements a really simple little state machine
 that is used to detect the standard "<tt>#ifndef XX</tt> / <tt>#define XX</tt>"
 idiom that people typically use to prevent multiple inclusion of headers.  If a
 buffer uses this idiom and is subsequently #include'd, the preprocessor can
 simply check to see whether the guarding condition is defined or not.  If so,
 the preprocessor can completely ignore the include of the header.</p>
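
 <p>A state machine of this kind might look like the sketch below.  The class
 name <tt>GuardDetector</tt>, its callbacks and the helper
 <tt>canSkipInclude</tt> are assumptions for illustration, not the actual
 interface.  For brevity the sketch only checks that the whole file is wrapped
 in a single top-level <tt>#ifndef</tt> and does not verify the matching
 <tt>#define</tt>.</p>

```cpp
#include <cassert>
#include <set>
#include <string>

class GuardDetector {
  bool SawTokenOutsideGuard = false;  // something lives outside the #ifndef
  bool InsideGuard = false;           // currently between #ifndef and #endif
  bool Invalid = false;               // the idiom was definitely not followed
  std::string Macro;                  // candidate guard macro, if any

public:
  // Called for every token or directive that is not the guard itself.
  void onToken() {
    if (!InsideGuard)
      SawTokenOutsideGuard = true;
  }

  // Called when a top-level "#ifndef Name" is seen.
  void onTopLevelIfndef(const std::string &Name) {
    if (SawTokenOutsideGuard || !Macro.empty()) {
      Invalid = true;  // tokens came first, or this is a second top-level block
      return;
    }
    Macro = Name;
    InsideGuard = true;
  }

  // Called when the #endif matching a top-level conditional is seen.
  void onTopLevelEndif() { InsideGuard = false; }

  // At end of file: the guard macro, or an empty string if there is none.
  std::string guardMacroAtEOF() const {
    return (Invalid || SawTokenOutsideGuard) ? std::string() : Macro;
  }
};

// A later #include of the same file can be skipped when its guard is defined.
inline bool canSkipInclude(const std::string &Guard,
                           const std::set<std::string> &DefinedMacros) {
  return !Guard.empty() && DefinedMacros.count(Guard) != 0;
}
```

 <p>Any token before the <tt>#ifndef</tt> or after the closing
 <tt>#endif</tt> makes the detector report no guard, so the file would be
 lexed normally on the next include.</p>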


 <!-- ======================================================================= -->
 <h2 id="libparse">The Parser Library</h2>
 <!-- ======================================================================= -->

 <!-- ======================================================================= -->
 <h2 id="libast">The AST Library</h2>
 <!-- ======================================================================= -->

 <!-- ======================================================================= -->
 <h3 id="Type">The Type class and its subclasses</h3>
 <!-- ======================================================================= -->

 <p>The Type class (and its subclasses) are an important part of the AST.  Types
 are accessed through the ASTContext class, which implicitly creates and uniques
 them as they are needed.  Types have a couple of non-obvious features: 1) they
 do not capture type qualifiers like const or volatile (See
 <a href="#QualType">QualType</a>), and 2) they implicitly capture typedef
 information.</p>

 <p>Typedefs in C make semantic analysis a bit more complex than it would
 be without them.  The issue is that we want to capture typedef information
 and represent it in the AST perfectly, but the semantics of operations need to
 "see through" typedefs.  For example, consider this code:</p>

 <code>
 void func() {<br>
   typedef int foo;<br>
   foo X, *Y;<br>
   *X;   <i>// error</i><br>
   **Y;  <i>// error</i><br>
 }<br>
 </code>

 <p>The code above is illegal, and thus we expect there to be diagnostics emitted
 on the annotated lines.  In this example, we expect to get:</p>

 <pre>
 <b>../t.c:4:1: error: indirection requires pointer operand ('foo' invalid)</b>
 *X; // error
 <font color="blue">^~</font>
 <b>../t.c:5:1: error: indirection requires pointer operand ('foo' invalid)</b>
 **Y; // error
 <font color="blue">^~~</font>
 </pre>

 <p>While this example is somewhat silly, it illustrates the point: we want to
 retain typedef information where possible, so that we can emit errors about
 "<tt>std::string</tt>" instead of "<tt>std::basic_string&lt;char, std:...</tt>".
 Doing this requires properly keeping typedef information (for example, the type
 of "X" is "foo", not "int"), and requires properly propagating it through the
 various operators (for example, the type of *Y is "foo", not "int").</p>


 <p>Type is the base class of the type hierarchy.  A central concept with types
 is that each type always has a canonical type.  A canonical type is the type
 with any typedef names stripped out of it or the types it references.  For
 example, consider:</p>

 <pre>
 typedef int  foo;
 typedef foo* bar;
 </pre>

 <p>Consider the types 'int *', 'foo *' and 'bar'.  There will be a Type object
 created for 'int'.  Since int is canonical, its CanonicalType pointer points to
 itself.  There is also a Type for 'foo' (a TypeNameType).  Its CanonicalType
 pointer points to the 'int' Type.  Next there is a PointerType that represents
 'int*', which, like 'int', is canonical.  Finally, there is a PointerType for
 'foo*' whose canonical type is 'int*', and there is a TypeNameType for 'bar',
 whose canonical type is also 'int*'.</p>

 <p>Non-canonical types are useful for emitting diagnostics, without losing
 information about typedefs being used.  Canonical types are useful for type
 comparisons (they allow by-pointer equality tests) and useful for reasoning
 about whether something has a particular form (e.g. is a function type),
 because they implicitly, recursively, strip all typedefs out of a type.</p>

 <p>Types, once created, are immutable.</p>
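
 <p>The relationship between these five types can be modelled with a small
 sketch.  The <tt>SketchType</tt> struct, its fields and the helper functions
 are hypothetical stand-ins, not the real class hierarchy; the only idea taken
 from the description above is that every type carries a pointer to its
 canonical type and that canonical types point to themselves.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical, flattened stand-in for a node in the type hierarchy.
struct SketchType {
  std::string Name;             // how the type was written
  const SketchType *Pointee;    // non-null only for pointer types
  const SketchType *Canonical;  // points to itself when canonical
  bool isCanonical() const { return Canonical == this; }
};

// Two types are the same if their canonical types are the same object.
inline bool sameType(const SketchType &A, const SketchType &B) {
  return A.Canonical == B.Canonical;
}

// Asking about the form of a type goes through the canonical type,
// which implicitly looks through every typedef.
inline bool isPointer(const SketchType &T) {
  return T.Canonical->Pointee != nullptr;
}

// The types from the example: typedef int foo; typedef foo* bar;
struct TypedefExample {
  SketchType Int, Foo, IntPtr, FooPtr, Bar;

  TypedefExample()
      : Int{"int", nullptr, &Int},          // canonical
        Foo{"foo", nullptr, &Int},          // typedef of int
        IntPtr{"int *", &Int, &IntPtr},     // canonical
        FooPtr{"foo *", &Foo, &IntPtr},     // pointer to a typedef
        Bar{"bar", nullptr, &IntPtr} {}     // typedef of foo*

  // The nodes point at each other, so copying would leave dangling links.
  TypedefExample(const TypedefExample &) = delete;
  TypedefExample &operator=(const TypedefExample &) = delete;
};
```

 <p>In this model 'bar' and 'foo *' compare equal by pointer through their
 canonical types, while each node keeps the name it was written with for use
 in diagnostics.</p>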


 <!-- ======================================================================= -->
 <h3 id="QualType">The QualType class</h3>
 <!-- ======================================================================= -->

 <p>The QualType class is designed as a trivial value class that is small,
 passed by-value and is efficient to query.  The idea of QualType is that it
 stores the type qualifiers (const, volatile, restrict) separately from the types
 themselves: QualType is conceptually a pair of "Type*" and bits for the type
 qualifiers.</p>

 <p>By storing the type qualifiers as bits in the conceptual pair, it is
 extremely efficient to get the set of qualifiers on a QualType (just return the
 field of the pair), add a type qualifier (which is a trivial constant-time
 operation that sets a bit), and remove one or more type qualifiers (just return
 a QualType with the bitfield set to empty).</p>

 <p>Further, because the bits are stored outside of the type itself, we do not
 need to create duplicates of types with different sets of qualifiers (i.e. there
 is only a single heap allocated "int" type: "const int" and "volatile const int"
 both point to the same heap allocated "int" type).  This reduces the heap memory
 needed to represent types and also means we do not have to consider qualifiers when
 uniquing types (<a href="#Type">Type</a> does not even contain qualifiers).</p>

 <p>In practice, on hosts where it is safe, the 3 type qualifiers are stored in
 the low bits of the pointer to the Type object.  This means that QualType is
 exactly the same size as a pointer, and this works fine on any system where
 malloc'd objects are at least 8 byte aligned, since such alignment leaves the
 three low bits of every object pointer zero.</p>
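
 <p>The pointer-plus-bits idea can be sketched as below.  The names
 <tt>AlignedType</tt> and <tt>SketchQualType</tt>, the member functions and the
 particular bit assigned to each qualifier are assumptions chosen for
 illustration; only the technique of packing three qualifier bits into the low
 bits of an 8-byte-aligned pointer comes from the text.</p>

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a heap-allocated type node; 8-byte alignment frees 3 low bits.
struct alignas(8) AlignedType {
  int Dummy;
};

class SketchQualType {
  std::uintptr_t Value;  // pointer bits ORed with qualifier bits

public:
  enum Qualifier : unsigned {
    Const = 0x1,
    Restrict = 0x2,
    Volatile = 0x4,
    Mask = 0x7
  };

  explicit SketchQualType(const AlignedType *T, unsigned Quals = 0)
      : Value(reinterpret_cast<std::uintptr_t>(T) | Quals) {
    assert((reinterpret_cast<std::uintptr_t>(T) & Mask) == 0 &&
           "type object must be at least 8-byte aligned");
    assert((Quals & ~Mask) == 0 && "unknown qualifier bits");
  }

  // Strip the qualifier bits to recover the shared type object.
  const AlignedType *getTypePtr() const {
    return reinterpret_cast<const AlignedType *>(
        Value & ~static_cast<std::uintptr_t>(Mask));
  }

  // Reading the qualifiers is a single mask operation.
  unsigned getQualifiers() const { return static_cast<unsigned>(Value & Mask); }

  bool isConstQualified() const { return (Value & Const) != 0; }

  // Adding a qualifier sets a bit; the type object is never copied.
  SketchQualType withQualifier(Qualifier Q) const {
    return SketchQualType(getTypePtr(), getQualifiers() | Q);
  }

  // Removing all qualifiers returns the bare pointer.
  SketchQualType getUnqualified() const {
    return SketchQualType(getTypePtr());
  }
};

static_assert(sizeof(SketchQualType) == sizeof(void *),
              "the qualified handle should be pointer sized");
```

 <p>With this layout "const int" and "volatile const int" are two different
 handle values that both resolve to the same underlying type object.</p>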
	<title>"clang" CFE Internals Manual</title>

	<h1>"clang" CFE Internals Manual</h1>

	<ul>
	<li><a href="#intro">Introduction</a></li>
	<li><a href="#libsystem">LLVM System and Support Libraries</a></li>
	<li><a href="#libbasic">The clang 'Basic' Library</a>
	<ul>
	<li><a href="#SourceLocation">The SourceLocation and SourceManager
	classes</a></li>
	</ul>
	</li>
	<li><a href="#liblex">The Lexer and Preprocessor Library</a>
	<ul>
	<li><a href="#Token">The Token class</a></li>
	<li><a href="#Lexer">The Lexer class</a></li>
	<li><a href="#MacroExpander">The MacroExpander class</a></li>
	<li><a href="#MultipleIncludeOpt">The MultipleIncludeOpt class</a></li>
	</ul>
	</li>
	<li><a href="#libparse">The Parser Library</a>
	<ul>
	</ul>
	</li>
	<li><a href="#libast">The AST Library</a>
	<ul>
	<li><a href="#Type">The Type class and its subclasses</a></li>
	<li><a href="#QualType">The QualType class</a></li>
	</ul>
	</li>
	</ul>


	<!-- ======================================================================= -->
	<h2 id="intro">Introduction</h2>
	<!-- ======================================================================= -->

	<p>This document describes some of the more important APIs and internal design
	decisions made in the clang C front-end. The purpose of this document is to
	both capture some of this high level information and also describe some of the
	design decisions behind it. This is meant for people interested in hacking on
	clang, not for end-users. The description below is categorized by
	libraries, and does not describe any of the clients of the libraries.</p>

	<!-- ======================================================================= -->
	<h2 id="libsystem">LLVM System and Support Libraries</h2>
	<!-- ======================================================================= -->

	<p>The LLVM libsystem library provides the basic clang system abstraction layer,
	which is used for file system access. The LLVM libsupport library provides many
	underlying libraries and <a
	href="http://llvm.org/docs/ProgrammersManual.html">data-structures</a>,
	including command line option
	processing and various containers.</p>

	<!-- ======================================================================= -->
	<h2 id="libbasic">The clang 'Basic' Library</h2>
	<!-- ======================================================================= -->

	<p>This library certainly needs a better name. The 'basic' library contains a
	number of low-level utilities for tracking and manipulating source buffers,
	locations within the source buffers, diagnostics, tokens, target abstraction,
	and information about the subset of the language being compiled for.</p>

	<p>Part of this infrastructure is specific to C (such as the TargetInfo class),
	other parts could be reused for other non-C-based languages (SourceLocation,
	SourceManager, Diagnostics, FileManager). When and if there is future demand
	we can figure out if it makes sense to introduce a new library, move the general
	classes somewhere else, or introduce some other solution.</p>

	<p>We describe the roles of these classes in order of their dependencies.</p>

	<!-- ======================================================================= -->
	<h3 id="SourceLocation">The SourceLocation and SourceManager classes</h3>
	<!-- ======================================================================= -->

	<p>Strangely enough, the SourceLocation class represents a location within the
	source code of the program. Important design points include:</p>

	<ol>
	<li>sizeof(SourceLocation) must be extremely small, as these are embedded into
	many AST nodes and are passed around often. Currently it is 32 bits.</li>
	<li>SourceLocation must be a simple value object that can be efficiently
	copied.</li>
	<li>We should be able to represent a source location for any byte of any input
	file. This includes in the middle of tokens, in whitespace, in trigraphs,
	etc.</li>
	<li>A SourceLocation must encode the current #include stack that was active when
	the location was processed. For example, if the location corresponds to a
	token, it should contain the set of #includes active when the token was
	lexed. This allows us to print the #include stack for a diagnostic.</li>
	<li>SourceLocation must be able to describe macro expansions, capturing both
	the ultimate instantiation point and the source of the original character
	data.</li>
	</ol>

	<p>In practice, the SourceLocation works together with the SourceManager class
	to encode two pieces of information about a location: it's physical location
	and it's virtual location. For most tokens, these will be the same. However,
	for a macro expansion (or tokens that came from a _Pragma directive) these will
	describe the location of the characters corresponding to the token and the
	location where the token was used (i.e. the macro instantiation point or the
	location of the _Pragma itself).</p>

	<p>For efficiency, we only track one level of macro instantions: if a token was
	produced by multiple instantiations, we only track the source and ultimate
	destination. Though we could track the intermediate instantiation points, this
	would require extra bookkeeping and no known client would benefit substantially
	from this.</p>

	<p>The clang front-end inherently depends on the location of a token being
	tracked correctly. If it is ever incorrect, the front-end may get confused and
	die. The reason for this is that the notion of the 'spelling' of a Token in
	clang depends on being able to find the original input characters for the token.
	This concept maps directly to the "physical" location for the token.</p>

	<!-- ======================================================================= -->
	<h2 id="liblex">The Lexer and Preprocessor Library</h2>
	<!-- ======================================================================= -->

	<p>The Lexer library contains several tightly-connected classes that are involved
	with the nasty process of lexing and preprocessing C source code. The main
	interface to this library for outside clients is the large <a
	href="#Preprocessor">Preprocessor</a> class.
	It contains the various pieces of state that are required to coherently read
	tokens out of a translation unit.</p>

	<p>The core interface to the Preprocessor object (once it is set up) is the
	Preprocessor::Lex method, which returns the next <a href="#Token">Token</a> from
	the preprocessor stream. There are two types of token providers that the
	preprocessor is capable of reading from: a buffer lexer (provided by the <a
	href="#Lexer">Lexer</a> class) and a buffered token stream (provided by the <a
	href="#MacroExpander">MacroExpander</a> class).


	<!-- ======================================================================= -->
	<h3 id="Token">The Token class</h3>
	<!-- ======================================================================= -->

	<p>The Token class is used to represent a single lexed token. Tokens are
	intended to be used by the lexer/preprocess and parser libraries, but are not
	intended to live beyond them (for example, they should not live in the ASTs).<p>

	<p>Tokens most often live on the stack (or some other location that is efficient
	to access) as the parser is running, but occasionally do get buffered up. For
	example, macro definitions are stored as a series of tokens, and the C++
	front-end will eventually need to buffer tokens up for tentative parsing and
	various pieces of look-ahead. As such, the size of a Token matter. On a 32-bit
	system, sizeof(Token) is currently 16 bytes.</p>

	<p>Tokens contain the following information:</p>

	<ul>
	<li><b>A SourceLocation</b> - This indicates the location of the start of the
	token.</li>

	<li><b>A length</b> - This stores the length of the token as stored in the
	SourceBuffer. For tokens that include them, this length includes trigraphs and
	escaped newlines which are ignored by later phases of the compiler. By pointing
	into the original source buffer, it is always possible to get the original
	spelling of a token completely accurately.</li>

	<li><b>IdentifierInfo</b> - If a token takes the form of an identifier, and if
	identifier lookup was enabled when the token was lexed (e.g. the lexer was not
	reading in 'raw' mode) this contains a pointer to the unique hash value for the
	identifier. Because the lookup happens before keyword identification, this
	field is set even for language keywords like 'for'.</li>

	<li><b>TokenKind</b> - This indicates the kind of token as classified by the
	lexer. This includes things like <tt>tok::starequal</tt> (for the "*="
	operator), <tt>tok::ampamp</tt> for the "&&" token, and keyword values
	(e.g. <tt>tok::kw_for</tt>) for identifiers that correspond to keywords. Note
	that some tokens can be spelled multiple ways. For example, C++ supports
	"operator keywords", where things like "and" are treated exactly like the
	"&&" operator. In these cases, the kind value is set to
	<tt>tok::ampamp</tt>, which is good for the parser, which doesn't have to
	consider both forms. For something that cares about which form is used (e.g.
	the preprocessor 'stringize' operator) the spelling indicates the original
	form.</li>

	<li><b>Flags</b> - There are currently four flags tracked by the
	lexer/preprocessor system on a per-token basis:

	<ol>
	<li><b>StartOfLine</b> - This was the first token that occurred on its input
	source line.</li>
	<li><b>LeadingSpace</b> - There was a space character either immediately
	before the token or transitively before the token as it was expanded
	through a macro. The definition of this flag is very closely defined by
	the stringizing requirements of the preprocessor.</li>
	<li><b>DisableExpand</b> - This flag is used internally to the preprocessor to
	represent identifier tokens which have macro expansion disabled. This
	prevents them from being considered as candidates for macro expansion ever
	in the future.</li>
	<li><b>NeedsCleaning</b> - This flag is set if the original spelling for the
	token includes a trigraph or escaped newline. Since this is uncommon,
	many pieces of code can fast-path on tokens that did not need cleaning.
	</p>
	</ol>
	</li>
	</ul>

	<p>One interesting (and somewhat unusual) aspect of tokens is that they don't
	contain any semantic information about the lexed value. For example, if the
	token was a pp-number token, we do not represent the value of the number that
	was lexed (this is left for later pieces of code to decide). Additionally, the
	lexer library has no notion of typedef names vs variable names: both are
	returned as identifiers, and the parser is left to decide whether a specific
	identifier is a typedef or a variable (tracking this requires scope information
	among other things).</p>

	<!-- ======================================================================= -->
	<h3 id="Lexer">The Lexer class</h3>
	<!-- ======================================================================= -->

	<p>The Lexer class provides the mechanics of lexing tokens out of a source
	buffer and deciding what they mean. The Lexer is complicated by the fact that
	it operates on raw buffers that have not had spelling eliminated (this is a
	necessity to get decent performance), but this is countered with careful coding
	as well as standard performance techniques (for example, the comment handling
	code is vectorized on X86 and PowerPC hosts).</p>

	<p>The lexer has a couple of interesting modal features:</p>

	<ul>
	<li>The lexer can operate in 'raw' mode. This mode has several features that
	make it possible to quickly lex the file (e.g. it stops identifier lookup,
	doesn't specially handle preprocessor tokens, handles EOF differently, etc).
	This mode is used for lexing within an "<tt>#if 0</tt>" block, for
	example.</li>
	<li>The lexer can capture and return comments as tokens. This is required to
	support the -C preprocessor mode, which passes comments through, and is
	used by the diagnostic checker to identify expect-error annotations.</li>
	<li>The lexer can be in ParsingFilename mode, which happens when preprocessing
	after reading a #include directive. This mode changes the parsing of '<'
	to return an "angled string" instead of a bunch of tokens for each thing
	within the filename.</li>
	<li>When parsing a preprocessor directive (after "<tt>#</tt>") the
	ParsingPreprocessorDirective mode is entered. This changes the lexer to
	return an EOM token at a newline.</li>
	<li>The Lexer uses a LangOptions object to know whether trigraphs are enabled,
	whether C++ or ObjC keywords are recognized, etc.</li>
	</ul>

	<p>In addition to these modes, the lexer keeps track of a couple of other
	features that are local to a lexed buffer, which change as the buffer is
	lexed:</p>

	<ul>
	<li>The Lexer uses BufferPtr to keep track of the current character being
	lexed.</li>
	<li>The Lexer uses IsAtStartOfLine to keep track of whether the next lexed token
	will start with its "start of line" bit set.</li>
	<li>The Lexer keeps track of the current #if directives that are active (which
	can be nested).</li>
	<li>The Lexer keeps track of an <a href="#MultipleIncludeOpt">
	MultipleIncludeOpt</a> object, which is used to
	detect whether the buffer uses the standard "<tt>#ifndef XX</tt> /
	<tt>#define XX</tt>" idiom to prevent multiple inclusion. If a buffer does,
	subsequent includes can be ignored if the XX macro is defined.</li>
	</ul>

	<!-- ======================================================================= -->
	<h3 id="MacroExpander">The MacroExpander class</h3>
	<!-- ======================================================================= -->

	<p>The MacroExpander class is a token provider that returns tokens from a list
	of tokens that came from somewhere else. It is typically used for two things:
	1) returning tokens from a macro definition as it is being expanded, and 2)
	returning tokens from an arbitrary buffer of tokens. The latter use is employed
	by _Pragma and will most likely be used to handle unbounded look-ahead for the
	C++ parser.</p>
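
	<p>The "token provider" idea can be sketched as a class that replays a
	pre-lexed list one token at a time. This is only an illustration under
	simplifying assumptions: <tt>ReplayToken</tt> and <tt>TokenListProvider</tt>
	are hypothetical names, and the real MacroExpander also has to deal with macro
	arguments, stringizing and token pasting, none of which is shown here.</p>

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal token: just a kind tag and a spelling.
struct ReplayToken {
  enum Kind { Identifier, Number, Punctuator, EndOfStream };
  Kind K;
  std::string Spelling;
};

// Replays a list of tokens that were produced somewhere else, for example
// the body of a macro definition or the tokens of a _Pragma string.
class TokenListProvider {
  std::vector<ReplayToken> Tokens;
  std::size_t Next = 0;

public:
  explicit TokenListProvider(std::vector<ReplayToken> Toks)
      : Tokens(std::move(Toks)) {}

  // Returns the next stored token, or an EndOfStream marker once the
  // list is exhausted so the caller can switch back to its previous source.
  ReplayToken lex() {
    if (Next == Tokens.size())
      return {ReplayToken::EndOfStream, ""};
    return Tokens[Next++];
  }

  bool isExhausted() const { return Next == Tokens.size(); }
};
```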

	<!-- ======================================================================= -->
	<h3 id="MultipleIncludeOpt">The MultipleIncludeOpt class</h3>
	<!-- ======================================================================= -->

	<p>The MultipleIncludeOpt class implements a really simple little state machine
	that is used to detect the standard "<tt>#ifndef XX</tt> / <tt>#define XX</tt>"
	idiom that people typically use to prevent multiple inclusion of headers. If a
	buffer uses this idiom and is subsequently #include'd, the preprocessor can
	simply check to see whether the guarding condition is defined or not. If so,
	the preprocessor can completely ignore the include of the header.</p>
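
	<p>A state machine of this kind might look roughly like the sketch below. It
	is not the actual MultipleIncludeOpt code: the class name
	<tt>IncludeGuardDetector</tt>, its states and its method names are assumptions,
	and it deliberately ignores details such as an "<tt>#else</tt>" attached to
	the guard.</p>

```cpp
#include <string>

// Hypothetical include-guard detector; not clang's actual class.
class IncludeGuardDetector {
  enum State {
    NoTokensYet, // only whitespace and comments seen so far
    InsideGuard, // inside the top-level "#ifndef XX" block
    AfterGuard,  // the matching #endif was seen, nothing after it yet
    Invalid      // something appeared outside the guard
  };
  State S = NoTokensYet;
  std::string Macro;

public:
  // Call when an unnested "#ifndef Name" is seen.
  void enterTopLevelIfndef(const std::string &Name) {
    if (S == NoTokensYet) {
      S = InsideGuard;
      Macro = Name;
    } else {
      S = Invalid; // the #ifndef was not the first thing in the file
    }
  }

  // Call when the #endif that closes a top-level conditional is seen.
  void exitTopLevelConditional() {
    S = (S == InsideGuard) ? AfterGuard : Invalid;
  }

  // Call for every ordinary token or other directive that is read.
  void sawToken() {
    if (S != InsideGuard)
      S = Invalid; // content before or after the guard breaks the idiom
  }

  // At end of file: the guard macro name, or empty if the idiom is absent.
  std::string controllingMacroAtEOF() const {
    return S == AfterGuard ? Macro : std::string();
  }
};
```

	<p>A preprocessor using such a detector would remember the returned macro
	name per file and, on a later #include of that file, skip it whenever the
	macro is still defined.</p>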



	<!-- ======================================================================= -->
	<h2 id="libparse">The Parser Library</h2>
	<!-- ======================================================================= -->

	<!-- ======================================================================= -->
	<h2 id="libast">The AST Library</h2>
	<!-- ======================================================================= -->

	<!-- ======================================================================= -->
	<h3 id="Type">The Type class and its subclasses</h3>
	<!-- ======================================================================= -->

	<p>The Type class (and its subclasses) are an important part of the AST. Types
	are accessed through the ASTContext class, which implicitly creates and uniques
	them as they are needed. Types have a couple of non-obvious features: 1) they
	do not capture type qualifiers like const or volatile (See
	<a href="#QualType">QualType</a>), and 2) they implicitly capture typedef
	information.</p>

	<p>Typedefs in C make semantic analysis a bit more complex than it would
	be without them. The issue is that we want to capture typedef information
	and represent it in the AST perfectly, but the semantics of operations need to
	"see through" typedefs. For example, consider this code:</p>

	<code>
	void func() {<br>
	typedef int foo;<br>
	foo X, *Y;<br>
	*X; <i>// error</i><br>
	**Y; <i>// error</i><br>
	}<br>
	</code>

	<p>The code above is illegal, and thus we expect there to be diagnostics emitted
	on the annotated lines. In this example, we expect to get:</p>

	<pre>
	<b>../t.c:4:1: error: indirection requires pointer operand ('foo' invalid)</b>
	*X; // error
	<font color="blue">^~</font>
	<b>../t.c:5:1: error: indirection requires pointer operand ('foo' invalid)</b>
	**Y; // error
	<font color="blue">^~~</font>
	</pre>

	<p>While this example is somewhat silly, it illustrates the point: we want to
	retain typedef information where possible, so that we can emit errors about
	"<tt>std::string</tt>" instead of "<tt>std::basic_string<char, std:...</tt>".
	Doing this requires properly keeping typedef information (for example, the type
	of "X" is "foo", not "int"), and requires properly propagating it through the
	various operators (for example, the type of *Y is "foo", not "int").</p>



	<p>Type is the base class of the type hierarchy. A central concept with types
	is that each type always has a canonical type. A canonical type is the type
	with any typedef names stripped out of it or the types it references. For
	example, consider the types 'int*', 'foo*' and 'bar' given these
	declarations:</p>

	<code>
	typedef int foo;<br>
	typedef foo* bar;<br>
	</code>

	<p>There will be a Type object created for 'int'. Since int is canonical, its
	CanonicalType pointer points to itself. There is also a Type for 'foo' (a
	TypeNameType). Its CanonicalType pointer points to the 'int' Type. Next
	there is a PointerType that represents 'int*', which, like 'int', is
	canonical. Finally, there is a PointerType for 'foo*' whose canonical
	type is 'int*', and there is a TypeNameType for 'bar', whose canonical type
	is also 'int*'.</p>

	<p>Non-canonical types are useful for emitting diagnostics without losing
	information about the typedefs being used. Canonical types are useful for type
	comparisons (they allow by-pointer equality tests) and for reasoning
	about whether something has a particular form (e.g. is a function type),
	because they implicitly, recursively, strip all typedefs out of a type.</p>

	<p>Types, once created, are immutable.</p>
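
	<p>The canonical-type relationship described above can be modelled in a few
	lines. The sketch below is a simplification under stated assumptions: the
	<tt>SketchType</tt> struct and the free functions are hypothetical, there is
	no ASTContext, and uniquing is done by hand by creating each canonical node
	exactly once.</p>

```cpp
// Hypothetical, heavily simplified type node (no ASTContext, no uniquing).
struct SketchType {
  enum Kind { Builtin, Pointer, TypeName };

  Kind TheKind;
  const SketchType *Pointee;   // used by Pointer nodes only
  const SketchType *Canonical; // canonical nodes point at themselves

  // Creates a canonical node, e.g. 'int' or 'int*'.
  explicit SketchType(Kind K, const SketchType *Ptee = nullptr)
      : TheKind(K), Pointee(Ptee), Canonical(this) {}

  // Creates a non-canonical node, e.g. 'foo', 'foo*' or 'bar'.
  SketchType(Kind K, const SketchType *Ptee, const SketchType *Canon)
      : TheKind(K), Pointee(Ptee), Canonical(Canon) {}

  // Copying would break the "points at itself" invariant.
  SketchType(const SketchType &) = delete;
  SketchType &operator=(const SketchType &) = delete;

  bool isCanonical() const { return Canonical == this; }
};

// Each canonical node exists once here, so pointer equality is enough.
inline bool isSameType(const SketchType &A, const SketchType &B) {
  return A.Canonical == B.Canonical;
}

// Asking about the form of a type looks through typedefs automatically.
inline bool isPointerType(const SketchType &T) {
  return T.Canonical->TheKind == SketchType::Pointer;
}
```

	<p>With nodes for 'int', 'foo', 'int*', 'foo*' and 'bar' built this way,
	'foo*' and 'bar' compare equal through their shared canonical 'int*' node,
	while each node still remembers the typedef name needed for diagnostics.</p>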


	<!-- ======================================================================= -->
	<h3 id="QualType">The QualType class</h3>
	<!-- ======================================================================= -->

	<p>The QualType class is designed as a trivial value class that is small,
	passed by-value and is efficient to query. The idea of QualType is that it
	stores the type qualifiers (const, volatile, restrict) separately from the types
	themselves: QualType is conceptually a pair of "Type*" and bits for the type
	qualifiers.</p>

	<p>By storing the type qualifiers as bits in the conceptual pair, it is
	extremely efficient to get the set of qualifiers on a QualType (just return the
	field of the pair), add a type qualifier (which is a trivial constant-time
	operation that sets a bit), and remove one or more type qualifiers (just return
	a QualType with the bitfield set to empty).</p>

	<p>Further, because the bits are stored outside of the type itself, we do not
	need to create duplicates of types with different sets of qualifiers (i.e. there
	is only a single heap allocated "int" type: "const int" and "volatile const int"
	both point to the same heap allocated "int" type). This reduces the heap size
	used to represent bits and also means we do not have to consider qualifiers when
	uniquing types (<a href="#Type">Type</a> does not even contain qualifiers).</p>

	<p>In practice, on hosts where it is safe, the 3 type qualifiers are stored in
	the low bits of the pointer to the Type object. This means that QualType is
	exactly the same size as a pointer, and this works fine on any system where
	malloc'd objects are at least 8-byte aligned.</p>
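
	<p>The pointer-plus-bits packing can be sketched as follows. This is an
	illustration rather than the real class: <tt>SketchQualType</tt>,
	<tt>AlignedType</tt>, the method names and the assignment of qualifiers to
	particular bits are all assumptions, and the sketch relies on the type object
	being at least 8-byte aligned, which it checks with an assertion.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a Type object with at least 8-byte alignment.
struct alignas(8) AlignedType {
  int Dummy;
};

class SketchQualType {
  std::uintptr_t Value = 0; // pointer bits with qualifiers in the low 3 bits
  static constexpr std::uintptr_t QualMask = 0x7;

public:
  // The bit chosen for each qualifier is an assumption of this sketch.
  enum Qualifier : unsigned { Const = 0x1, Restrict = 0x2, Volatile = 0x4 };

  SketchQualType(const AlignedType *T, unsigned Quals) {
    std::uintptr_t Bits = reinterpret_cast<std::uintptr_t>(T);
    assert((Bits & QualMask) == 0 && "type pointer must be 8-byte aligned");
    assert((Quals & ~7u) == 0 && "only three qualifier bits are available");
    Value = Bits | Quals;
  }

  const AlignedType *getTypePtr() const {
    return reinterpret_cast<const AlignedType *>(Value & ~QualMask);
  }
  unsigned getQualifiers() const {
    return static_cast<unsigned>(Value & QualMask);
  }
  bool isConstQualified() const { return (Value & Const) != 0; }

  // Adding a qualifier sets one bit; the Type object itself is shared.
  SketchQualType withQualifier(Qualifier Q) const {
    return SketchQualType(getTypePtr(), getQualifiers() | Q);
  }
  // Removing all qualifiers just clears the low bits.
  SketchQualType getUnqualifiedType() const {
    return SketchQualType(getTypePtr(), 0);
  }
};

static_assert(sizeof(SketchQualType) == sizeof(void *),
              "the packed value is the same size as a pointer");
```

	<p>Note that "const int" and "volatile const int" built this way both carry
	the same <tt>AlignedType</tt> pointer; only the low bits differ.</p>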