Blame - docs/InternalsManual.html - fp2-dev/platform/external/clang

blob: 640fd0bd7c5a25768c091ff5c5771d0934cdb174 [file] [log] [blame]

Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	1	<title>"clang" CFE Internals Manual</title>
				2
				3	<h1>"clang" CFE Internals Manual</h1>
				4
				5	<ul>
				6	<li><a href="#intro">Introduction</a></li>
				7	<li><a href="#libsystem">LLVM System and Support Libraries</a></li>
				8	<li><a href="#libbasic">The clang 'Basic' Library</a>
				9	<ul>
				10	<li><a href="#SourceLocation">The SourceLocation and SourceManager
				11	classes</a></li>
				12	</ul>
				13	</li>
				14	<li><a href="#liblex">The Lexer and Preprocessor Library</a>
				15	<ul>
				16	<li><a href="#Token">The Token class</a></li>
				17	<li><a href="#Lexer">The Lexer class</a></li>
				18	<li><a href="#MacroExpander">The MacroExpander class</a></li>
				19	<li><a href="#MultipleIncludeOpt">The MultipleIncludeOpt class</a></li>
				20	</ul>
				21	</li>
				22	<li><a href="#libparse">The Parser Library</a>
				23	<ul>
				24	</ul>
				25	</li>
				26	<li><a href="#libast">The AST Library</a>
				27	<ul>
				28	<li><a href="#Type">The Type class and its subclasses</a></li>
				29	<li><a href="#QualType">The QualType class</a></li>
				30	</ul>
				31	</li>
				32	</ul>
				33
				34
				35	<!-- ======================================================================= -->
				36	<h2 id="intro">Introduction</h2>
				37	<!-- ======================================================================= -->
				38
				39	<p>This document describes some of the more important APIs and internal design
				40	decisions made in the clang C front-end. The purpose of this document is to
				41	both capture some of this high level information and also describe some of the
				42	design decisions behind it. This is meant for people interested in hacking on
				43	clang, not for end-users. The description below is categorized by
				44	libraries, and does not describe any of the clients of the libraries.</p>
				45
				46	<!-- ======================================================================= -->
				47	<h2 id="libsystem">LLVM System and Support Libraries</h2>
				48	<!-- ======================================================================= -->
				49
				50	<p>The LLVM libsystem library provides the basic clang system abstraction layer,
				51	which is used for file system access. The LLVM libsupport library provides many
				52	underlying libraries and <a
				53	href="http://llvm.org/docs/ProgrammersManual.html">data-structures</a>,
				54	including command line option
				55	processing and various containers.</p>
				56
				57	<!-- ======================================================================= -->
				58	<h2 id="libbasic">The clang 'Basic' Library</h2>
				59	<!-- ======================================================================= -->
				60
				61	<p>This library certainly needs a better name. The 'basic' library contains a
				62	number of low-level utilities for tracking and manipulating source buffers,
				63	locations within the source buffers, diagnostics, tokens, target abstraction,
				64	and information about the subset of the language being compiled for.</p>
				65
				66	<p>Part of this infrastructure is specific to C (such as the TargetInfo class),
				67	other parts could be reused for other non-C-based languages (SourceLocation,
				68	SourceManager, Diagnostics, FileManager). When and if there is future demand
				69	we can figure out if it makes sense to introduce a new library, move the general
				70	classes somewhere else, or introduce some other solution.</p>
				71
				72	<p>We describe the roles of these classes in order of their dependencies.</p>
				73
				74	<!-- ======================================================================= -->
				75	<h3 id="SourceLocation">The SourceLocation and SourceManager classes</h3>
				76	<!-- ======================================================================= -->
				77
				78	<p>Strangely enough, the SourceLocation class represents a location within the
				79	source code of the program. Important design points include:</p>
				80
				81	<ol>
				82	<li>sizeof(SourceLocation) must be extremely small, as these are embedded into
				83	many AST nodes and are passed around often. Currently it is 32 bits.</li>
				84	<li>SourceLocation must be a simple value object that can be efficiently
				85	copied.</li>
				86	<li>We should be able to represent a source location for any byte of any input
				87	file. This includes in the middle of tokens, in whitespace, in trigraphs,
				88	etc.</li>
				89	<li>A SourceLocation must encode the current #include stack that was active when
				90	the location was processed. For example, if the location corresponds to a
				91	token, it should contain the set of #includes active when the token was
				92	lexed. This allows us to print the #include stack for a diagnostic.</li>
				93	<li>SourceLocation must be able to describe macro expansions, capturing both
				94	the ultimate instantiation point and the source of the original character
				95	data.</li>
				96	</ol>
				97
				98	<p>In practice, the SourceLocation works together with the SourceManager class
				99	to encode two pieces of information about a location: it's physical location
				100	and it's virtual location. For most tokens, these will be the same. However,
				101	for a macro expansion (or tokens that came from a _Pragma directive) these will
				102	describe the location of the characters corresponding to the token and the
				103	location where the token was used (i.e. the macro instantiation point or the
				104	location of the _Pragma itself).</p>
				105
				106	<p>For efficiency, we only track one level of macro instantions: if a token was
				107	produced by multiple instantiations, we only track the source and ultimate
				108	destination. Though we could track the intermediate instantiation points, this
				109	would require extra bookkeeping and no known client would benefit substantially
				110	from this.</p>
				111
				112	<p>The clang front-end inherently depends on the location of a token being
				113	tracked correctly. If it is ever incorrect, the front-end may get confused and
				114	die. The reason for this is that the notion of the 'spelling' of a Token in
				115	clang depends on being able to find the original input characters for the token.
				116	This concept maps directly to the "physical" location for the token.</p>
				117
				118	<!-- ======================================================================= -->
				119	<h2 id="liblex">The Lexer and Preprocessor Library</h2>
				120	<!-- ======================================================================= -->
				121
				122	<p>The Lexer library contains several tightly-connected classes that are involved
				123	with the nasty process of lexing and preprocessing C source code. The main
				124	interface to this library for outside clients is the large <a
				125	href="#Preprocessor">Preprocessor</a> class.
				126	It contains the various pieces of state that are required to coherently read
				127	tokens out of a translation unit.</p>
				128
				129	<p>The core interface to the Preprocessor object (once it is set up) is the
				130	Preprocessor::Lex method, which returns the next <a href="#Token">Token</a> from
				131	the preprocessor stream. There are two types of token providers that the
				132	preprocessor is capable of reading from: a buffer lexer (provided by the <a
				133	href="#Lexer">Lexer</a> class) and a buffered token stream (provided by the <a
				134	href="#MacroExpander">MacroExpander</a> class).
				135
				136
				137	<!-- ======================================================================= -->
				138	<h3 id="Token">The Token class</h3>
				139	<!-- ======================================================================= -->
				140
				141	<p>The Token class is used to represent a single lexed token. Tokens are
				142	intended to be used by the lexer/preprocess and parser libraries, but are not
				143	intended to live beyond them (for example, they should not live in the ASTs).<p>
				144
				145	<p>Tokens most often live on the stack (or some other location that is efficient
				146	to access) as the parser is running, but occasionally do get buffered up. For
				147	example, macro definitions are stored as a series of tokens, and the C++
				148	front-end will eventually need to buffer tokens up for tentative parsing and
				149	various pieces of look-ahead. As such, the size of a Token matter. On a 32-bit
				150	system, sizeof(Token) is currently 16 bytes.</p>
				151
				152	<p>Tokens contain the following information:</p>
				153
				154	<ul>
				155	<li><b>A SourceLocation</b> - This indicates the location of the start of the
				156	token.</li>
				157
				158	<li><b>A length</b> - This stores the length of the token as stored in the
				159	SourceBuffer. For tokens that include them, this length includes trigraphs and
				160	escaped newlines which are ignored by later phases of the compiler. By pointing
				161	into the original source buffer, it is always possible to get the original
				162	spelling of a token completely accurately.</li>
				163
				164	<li><b>IdentifierInfo</b> - If a token takes the form of an identifier, and if
				165	identifier lookup was enabled when the token was lexed (e.g. the lexer was not
				166	reading in 'raw' mode) this contains a pointer to the unique hash value for the
				167	identifier. Because the lookup happens before keyword identification, this
				168	field is set even for language keywords like 'for'.</li>
				169
				170	<li><b>TokenKind</b> - This indicates the kind of token as classified by the
				171	lexer. This includes things like <tt>tok::starequal</tt> (for the "*="
				172	operator), <tt>tok::ampamp</tt> for the "&&" token, and keyword values
				173	(e.g. <tt>tok::kw_for</tt>) for identifiers that correspond to keywords. Note
				174	that some tokens can be spelled multiple ways. For example, C++ supports
				175	"operator keywords", where things like "and" are treated exactly like the
				176	"&&" operator. In these cases, the kind value is set to
				177	<tt>tok::ampamp</tt>, which is good for the parser, which doesn't have to
				178	consider both forms. For something that cares about which form is used (e.g.
				179	the preprocessor 'stringize' operator) the spelling indicates the original
				180	form.</li>
				181
				182	<li><b>Flags</b> - There are currently four flags tracked by the
				183	lexer/preprocessor system on a per-token basis:
				184
				185	<ol>
				186	<li><b>StartOfLine</b> - This was the first token that occurred on its input
				187	source line.</li>
				188	<li><b>LeadingSpace</b> - There was a space character either immediately
				189	before the token or transitively before the token as it was expanded
				190	through a macro. The definition of this flag is very closely defined by
				191	the stringizing requirements of the preprocessor.</li>
				192	<li><b>DisableExpand</b> - This flag is used internally to the preprocessor to
				193	represent identifier tokens which have macro expansion disabled. This
				194	prevents them from being considered as candidates for macro expansion ever
				195	in the future.</li>
				196	<li><b>NeedsCleaning</b> - This flag is set if the original spelling for the
				197	token includes a trigraph or escaped newline. Since this is uncommon,
				198	many pieces of code can fast-path on tokens that did not need cleaning.
				199	</p>
				200	</ol>
				201	</li>
				202	</ul>
				203
				204	<p>One interesting (and somewhat unusual) aspect of tokens is that they don't
				205	contain any semantic information about the lexed value. For example, if the
				206	token was a pp-number token, we do not represent the value of the number that
				207	was lexed (this is left for later pieces of code to decide). Additionally, the
				208	lexer library has no notion of typedef names vs variable names: both are
				209	returned as identifiers, and the parser is left to decide whether a specific
				210	identifier is a typedef or a variable (tracking this requires scope information
				211	among other things).</p>
				212
				213	<!-- ======================================================================= -->
				214	<h3 id="Lexer">The Lexer class</h3>
				215	<!-- ======================================================================= -->
				216
				217	<p>The Lexer class provides the mechanics of lexing tokens out of a source
				218	buffer and deciding what they mean. The Lexer is complicated by the fact that
				219	it operates on raw buffers that have not had spelling eliminated (this is a
				220	necessity to get decent performance), but this is countered with careful coding
				221	as well as standard performance techniques (for example, the comment handling
				222	code is vectorized on X86 and PowerPC hosts).</p>
				223
				224	<p>The lexer has a couple of interesting modal features:</p>
				225
				226	<ul>
				227	<li>The lexer can operate in 'raw' mode. This mode has several features that
				228	make it possible to quickly lex the file (e.g. it stops identifier lookup,
				229	doesn't specially handle preprocessor tokens, handles EOF differently, etc).
				230	This mode is used for lexing within an "<tt>#if 0</tt>" block, for
				231	example.</li>
				232	<li>The lexer can capture and return comments as tokens. This is required to
				233	support the -C preprocessor mode, which passes comments through, and is
				234	used by the diagnostic checker to identifier expect-error annotations.</li>
				235	<li>The lexer can be in ParsingFilename mode, which happens when preprocessing
Chris Lattner	8438624	2007-09-16 19:25:23 +0000	[diff] [blame]	236	after reading a #include directive. This mode changes the parsing of '<'
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	237	to return an "angled string" instead of a bunch of tokens for each thing
				238	within the filename.</li>
				239	<li>When parsing a preprocessor directive (after "<tt>#</tt>") the
				240	ParsingPreprocessorDirective mode is entered. This changes the parser to
				241	return EOM at a newline.</li>
				242	<li>The Lexer uses a LangOptions object to know whether trigraphs are enabled,
				243	whether C++ or ObjC keywords are recognized, etc.</li>
				244	</ul>
				245
				246	<p>In addition to these modes, the lexer keeps track of a couple of other
				247	features that are local to a lexed buffer, which change as the buffer is
				248	lexed:</p>
				249
				250	<ul>
				251	<li>The Lexer uses BufferPtr to keep track of the current character being
				252	lexed.</li>
				253	<li>The Lexer uses IsAtStartOfLine to keep track of whether the next lexed token
				254	will start with its "start of line" bit set.</li>
				255	<li>The Lexer keeps track of the current #if directives that are active (which
				256	can be nested).</li>
				257	<li>The Lexer keeps track of an <a href="#MultipleIncludeOpt">
				258	MultipleIncludeOpt</a> object, which is used to
				259	detect whether the buffer uses the standard "<tt>#ifndef XX</tt> /
				260	<tt>#define XX</tt>" idiom to prevent multiple inclusion. If a buffer does,
				261	subsequent includes can be ignored if the XX macro is defined.</li>
				262	</ul>
				263
				264	<!-- ======================================================================= -->
				265	<h3 id="MacroExpander">The MacroExpander class</h3>
				266	<!-- ======================================================================= -->
				267
				268	<p>The MacroExpander class is a token provider that returns tokens from a list
				269	of tokens that came from somewhere else. It typically used for two things: 1)
				270	returning tokens from a macro definition as it is being expanded 2) returning
				271	tokens from an arbitrary buffer of tokens. The later use is used by _Pragma and
				272	will most likely be used to handle unbounded look-ahead for the C++ parser.</p>
				273
				274	<!-- ======================================================================= -->
				275	<h3 id="MultipleIncludeOpt">The MultipleIncludeOpt class</h3>
				276	<!-- ======================================================================= -->
				277
				278	<p>The MultipleIncludeOpt class implements a really simple little state machine
				279	that is used to detect the standard "<tt>#ifndef XX</tt> / <tt>#define XX</tt>"
				280	idiom that people typically use to prevent multiple inclusion of headers. If a
				281	buffer uses this idiom and is subsequently #include'd, the preprocessor can
				282	simply check to see whether the guarding condition is defined or not. If so,
				283	the preprocessor can completely ignore the include of the header.</p>
				284
				285
				286
				287	<!-- ======================================================================= -->
				288	<h2 id="libparse">The Parser Library</h2>
				289	<!-- ======================================================================= -->
				290
				291	<!-- ======================================================================= -->
				292	<h2 id="libast">The AST Library</h2>
				293	<!-- ======================================================================= -->
				294
				295	<!-- ======================================================================= -->
				296	<h3 id="Type">The Type class and its subclasses</h3>
				297	<!-- ======================================================================= -->
				298
				299	<p>The Type class (and its subclasses) are an important part of the AST. Types
				300	are accessed through the ASTContext class, which implicitly creates and uniques
				301	them as they are needed. Types have a couple of non-obvious features: 1) they
				302	do not capture type qualifiers like const or volatile (See
				303	<a href="#QualType">QualType</a>), and 2) they implicitly capture typedef
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	304	information. Once created, types are immutable (unlike decls).</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	305
				306	<p>Typedefs in C make semantic analysis a bit more complex than it would
				307	be without them. The issue is that we want to capture typedef information
				308	and represent it in the AST perfectly, but the semantics of operations need to
				309	"see through" typedefs. For example, consider this code:</p>
				310
				311	<code>
				312	void func() {<br>
				313	typedef int foo;<br>
				314	foo X, *Y;<br>
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	315	typedef foo* bar;<br>
				316	bar Z;<br>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	317	*X; <i>// error</i><br>
				318	**Y; <i>// error</i><br>
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	319	**Z; <i>// error</i><br>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	320	}<br>
				321	</code>
				322
				323	<p>The code above is illegal, and thus we expect there to be diagnostics emitted
				324	on the annotated lines. In this example, we expect to get:</p>
				325
				326	<pre>
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	327	<b>test.c:6:1: error: indirection requires pointer operand ('foo' invalid)</b>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	328	*X; // error
				329	<font color="blue">^~</font>
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	330	<b>test.c:7:1: error: indirection requires pointer operand ('foo' invalid)</b>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	331	**Y; // error
				332	<font color="blue">^~~</font>
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	333	<b>test.c:8:1: error: indirection requires pointer operand ('foo' invalid)</b>
				334	**Z; // error
				335	<font color="blue">^~~</font>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	336	</pre>
				337
				338	<p>While this example is somewhat silly, it illustrates the point: we want to
				339	retain typedef information where possible, so that we can emit errors about
				340	"<tt>std::string</tt>" instead of "<tt>std::basic_string<char, std:...</tt>".
				341	Doing this requires properly keeping typedef information (for example, the type
				342	of "X" is "foo", not "int"), and requires properly propagating it through the
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	343	various operators (for example, the type of *Y is "foo", not "int"). In order
				344	to retain this information, the type of these expressions is an instance of the
				345	TypedefType class, which indicates that the type of these expressions is a
				346	typedef for foo.
				347	</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	348
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	349	<p>Representing types like this is great for diagnostics, because the
				350	user-specified type is always immediately available. There are two problems
				351	with this: first, various semantic checks need to make judgements about the
Chris Lattner	33fc68a	2007-07-31 18:54:50 +0000	[diff] [blame]	352	<em>actual structure</em> of a type, ignoring typdefs. Second, we need an
				353	efficient way to query whether two types are structurally identical to each
				354	other, ignoring typedefs. The solution to both of these problems is the idea of
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	355	canonical types.</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	356
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	357	<h4>Canonical Types</h4>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	358
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	359	<p>Every instance of the Type class contains a canonical type pointer. For
				360	simple types with no typedefs involved (e.g. "<tt>int</tt>", "<tt>int*</tt>",
				361	"<tt>int**</tt>"), the type just points to itself. For types that have a
				362	typedef somewhere in their structure (e.g. "<tt>foo</tt>", "<tt>foo*</tt>",
				363	"<tt>foo**</tt>", "<tt>bar</tt>"), the canonical type pointer points to their
				364	structurally equivalent type without any typedefs (e.g. "<tt>int</tt>",
				365	"<tt>int</tt>", "<tt>int</tt>", and "<tt>int</tt>" respectively).</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	366
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	367	<p>This design provides a constant time operation (dereferencing the canonical
				368	type pointer) that gives us access to the structure of types. For example,
				369	we can trivially tell that "bar" and "foo*" are the same type by dereferencing
				370	their canonical type pointers and doing a pointer comparison (they both point
				371	to the single "<tt>int*</tt>" type).</p>
				372
				373	<p>Canonical types and typedef types bring up some complexities that must be
				374	carefully managed. Specifically, the "isa/cast/dyncast" operators generally
				375	shouldn't be used in code that is inspecting the AST. For example, when type
				376	checking the indirection operator (unary '*' on a pointer), the type checker
				377	must verify that the operand has a pointer type. It would not be correct to
				378	check that with "<tt>isa<PointerType>(SubExpr->getType())</tt>",
				379	because this predicate would fail if the subexpression had a typedef type.</p>
				380
				381	<p>The solution to this problem are a set of helper methods on Type, used to
				382	check their properties. In this case, it would be correct to use
				383	"<tt>SubExpr->getType()->isPointerType()</tt>" to do the check. This
				384	predicate will return true if the <em>canonical type is a pointer</em>, which is
				385	true any time the type is structurally a pointer type. The only hard part here
				386	is remembering not to use the <tt>isa/cast/dyncast</tt> operations.</p>
				387
				388	<p>The second problem we face is how to get access to the pointer type once we
				389	know it exists. To continue the example, the result type of the indirection
				390	operator is the pointee type of the subexpression. In order to determine the
				391	type, we need to get the instance of PointerType that best captures the typedef
				392	information in the program. If the type of the expression is literally a
				393	PointerType, we can return that, otherwise we have to dig through the
				394	typedefs to find the pointer type. For example, if the subexpression had type
				395	"<tt>foo*</tt>", we could return that type as the result. If the subexpression
				396	had type "<tt>bar</tt>", we want to return "<tt>foo*</tt>" (note that we do
				397	<em>not</em> want "<tt>int*</tt>"). In order to provide all of this, Type has
Chris Lattner	11406c1	2007-07-31 16:50:51 +0000	[diff] [blame]	398	a getAsPointerType() method that checks whether the type is structurally a
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	399	PointerType and, if so, returns the best one. If not, it returns a null
				400	pointer.</p>
				401
				402	<p>This structure is somewhat mystical, but after meditating on it, it will
				403	make sense to you :).</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	404
				405	<!-- ======================================================================= -->
				406	<h3 id="QualType">The QualType class</h3>
				407	<!-- ======================================================================= -->
				408
				409	<p>The QualType class is designed as a trivial value class that is small,
				410	passed by-value and is efficient to query. The idea of QualType is that it
				411	stores the type qualifiers (const, volatile, restrict) separately from the types
				412	themselves: QualType is conceptually a pair of "Type*" and bits for the type
				413	qualifiers.</p>
				414
				415	<p>By storing the type qualifiers as bits in the conceptual pair, it is
				416	extremely efficient to get the set of qualifiers on a QualType (just return the
				417	field of the pair), add a type qualifier (which is a trivial constant-time
				418	operation that sets a bit), and remove one or more type qualifiers (just return
				419	a QualType with the bitfield set to empty).</p>
				420
				421	<p>Further, because the bits are stored outside of the type itself, we do not
				422	need to create duplicates of types with different sets of qualifiers (i.e. there
				423	is only a single heap allocated "int" type: "const int" and "volatile const int"
				424	both point to the same heap allocated "int" type). This reduces the heap size
				425	used to represent bits and also means we do not have to consider qualifiers when
				426	uniquing types (<a href="#Type">Type</a> does not even contain qualifiers).</p>
				427
				428	<p>In practice, on hosts where it is safe, the 3 type qualifiers are stored in
				429	the low bit of the pointer to the Type object. This means that QualType is
				430	exactly the same size as a pointer, and this works fine on any system where
				431	malloc'd objects are at least 8 byte aligned.</p>