Blame - docs/InternalsManual.html - platform/external/clang


Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	1	<title>"clang" CFE Internals Manual</title>
				2
				3	<h1>"clang" CFE Internals Manual</h1>
				4
				5	<ul>
				6	<li><a href="#intro">Introduction</a></li>
				7	<li><a href="#libsystem">LLVM System and Support Libraries</a></li>
				8	<li><a href="#libbasic">The clang 'Basic' Library</a>
				9	<ul>
				10	<li><a href="#SourceLocation">The SourceLocation and SourceManager
				11	classes</a></li>
				12	</ul>
				13	</li>
				14	<li><a href="#liblex">The Lexer and Preprocessor Library</a>
				15	<ul>
				16	<li><a href="#Token">The Token class</a></li>
				17	<li><a href="#Lexer">The Lexer class</a></li>
				18	<li><a href="#MacroExpander">The MacroExpander class</a></li>
				19	<li><a href="#MultipleIncludeOpt">The MultipleIncludeOpt class</a></li>
				20	</ul>
				21	</li>
				22	<li><a href="#libparse">The Parser Library</a>
				23	<ul>
				24	</ul>
				25	</li>
				26	<li><a href="#libast">The AST Library</a>
				27	<ul>
				28	<li><a href="#Type">The Type class and its subclasses</a></li>
				29	<li><a href="#QualType">The QualType class</a></li>
Ted Kremenek	8bc0571	2007-10-10 23:01:43 +0000	[diff] [blame]	30	<li><a href="#CFG">The CFG class</a></li>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	31	</ul>
				32	</li>
				33	</ul>
				34
				35
				36	<!-- ======================================================================= -->
				37	<h2 id="intro">Introduction</h2>
				38	<!-- ======================================================================= -->
				39
				40	<p>This document describes some of the more important APIs and internal design
				41	decisions made in the clang C front-end. The purpose of this document is to
both capture some of this high-level information and also describe some of the
				43	design decisions behind it. This is meant for people interested in hacking on
				44	clang, not for end-users. The description below is categorized by
				45	libraries, and does not describe any of the clients of the libraries.</p>
				46
				47	<!-- ======================================================================= -->
				48	<h2 id="libsystem">LLVM System and Support Libraries</h2>
				49	<!-- ======================================================================= -->
				50
				51	<p>The LLVM libsystem library provides the basic clang system abstraction layer,
				52	which is used for file system access. The LLVM libsupport library provides many
				53	underlying libraries and <a
				54	href="http://llvm.org/docs/ProgrammersManual.html">data-structures</a>,
				55	including command line option
				56	processing and various containers.</p>
				57
				58	<!-- ======================================================================= -->
				59	<h2 id="libbasic">The clang 'Basic' Library</h2>
				60	<!-- ======================================================================= -->
				61
				62	<p>This library certainly needs a better name. The 'basic' library contains a
				63	number of low-level utilities for tracking and manipulating source buffers,
				64	locations within the source buffers, diagnostics, tokens, target abstraction,
				65	and information about the subset of the language being compiled for.</p>
				66
				67	<p>Part of this infrastructure is specific to C (such as the TargetInfo class),
				68	other parts could be reused for other non-C-based languages (SourceLocation,
				69	SourceManager, Diagnostics, FileManager). When and if there is future demand
				70	we can figure out if it makes sense to introduce a new library, move the general
				71	classes somewhere else, or introduce some other solution.</p>
				72
				73	<p>We describe the roles of these classes in order of their dependencies.</p>
				74
				75	<!-- ======================================================================= -->
				76	<h3 id="SourceLocation">The SourceLocation and SourceManager classes</h3>
				77	<!-- ======================================================================= -->
				78
				79	<p>Strangely enough, the SourceLocation class represents a location within the
				80	source code of the program. Important design points include:</p>
				81
				82	<ol>
				83	<li>sizeof(SourceLocation) must be extremely small, as these are embedded into
				84	many AST nodes and are passed around often. Currently it is 32 bits.</li>
				85	<li>SourceLocation must be a simple value object that can be efficiently
				86	copied.</li>
				87	<li>We should be able to represent a source location for any byte of any input
				88	file. This includes in the middle of tokens, in whitespace, in trigraphs,
				89	etc.</li>
				90	<li>A SourceLocation must encode the current #include stack that was active when
				91	the location was processed. For example, if the location corresponds to a
				92	token, it should contain the set of #includes active when the token was
				93	lexed. This allows us to print the #include stack for a diagnostic.</li>
				94	<li>SourceLocation must be able to describe macro expansions, capturing both
				95	the ultimate instantiation point and the source of the original character
				96	data.</li>
				97	</ol>
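<p>The design points above can be illustrated with a minimal sketch. It assumes
a hypothetical split of the 32 bits into a file identifier and a byte offset;
the class name <tt>SourceLoc</tt>, the 22-bit offset width, and the accessor
names are illustrative assumptions, not the actual clang encoding.</p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 32-bit location: high bits pick a file (or include-stack
// entry), low bits hold a byte offset. The 10/22 split is an assumption.
class SourceLoc {
  static constexpr unsigned OffsetBits = 22;
  static constexpr std::uint32_t OffsetMask = (1u << OffsetBits) - 1;
  std::uint32_t Raw = 0; // 0 is reserved to mean "invalid location"

public:
  SourceLoc() = default;

  // FileID 0 is reserved so that a valid location never encodes to 0.
  static SourceLoc get(std::uint32_t FileID, std::uint32_t Offset) {
    assert(FileID != 0 && FileID < (1u << (32 - OffsetBits)));
    assert(Offset <= OffsetMask);
    SourceLoc L;
    L.Raw = (FileID << OffsetBits) | Offset;
    return L;
  }

  bool isValid() const { return Raw != 0; }
  std::uint32_t fileID() const { return Raw >> OffsetBits; }
  std::uint32_t offset() const { return Raw & OffsetMask; }
  bool operator==(SourceLoc O) const { return Raw == O.Raw; }
};

// A plain value object that is cheap to copy and embed in AST nodes.
static_assert(sizeof(SourceLoc) == 4, "location must stay 32 bits");
```

<p>Because the offset is a byte offset, such a value can name any byte of a
buffer, including bytes in the middle of a token or in whitespace. Mapping the
file identifier back to a buffer and its #include stack would be the job of a
separate manager object, which this sketch leaves out.</p>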
				98
				99	<p>In practice, the SourceLocation works together with the SourceManager class
to encode two pieces of information about a location: its physical location
and its virtual location. For most tokens, these will be the same. However,
				102	for a macro expansion (or tokens that came from a _Pragma directive) these will
				103	describe the location of the characters corresponding to the token and the
				104	location where the token was used (i.e. the macro instantiation point or the
				105	location of the _Pragma itself).</p>
				106
<p>For efficiency, we only track one level of macro instantiations: if a token was
				108	produced by multiple instantiations, we only track the source and ultimate
				109	destination. Though we could track the intermediate instantiation points, this
				110	would require extra bookkeeping and no known client would benefit substantially
				111	from this.</p>
				112
				113	<p>The clang front-end inherently depends on the location of a token being
				114	tracked correctly. If it is ever incorrect, the front-end may get confused and
				115	die. The reason for this is that the notion of the 'spelling' of a Token in
				116	clang depends on being able to find the original input characters for the token.
				117	This concept maps directly to the "physical" location for the token.</p>
				118
				119	<!-- ======================================================================= -->
				120	<h2 id="liblex">The Lexer and Preprocessor Library</h2>
				121	<!-- ======================================================================= -->
				122
				123	<p>The Lexer library contains several tightly-connected classes that are involved
				124	with the nasty process of lexing and preprocessing C source code. The main
				125	interface to this library for outside clients is the large <a
				126	href="#Preprocessor">Preprocessor</a> class.
				127	It contains the various pieces of state that are required to coherently read
				128	tokens out of a translation unit.</p>
				129
				130	<p>The core interface to the Preprocessor object (once it is set up) is the
				131	Preprocessor::Lex method, which returns the next <a href="#Token">Token</a> from
				132	the preprocessor stream. There are two types of token providers that the
				133	preprocessor is capable of reading from: a buffer lexer (provided by the <a
				134	href="#Lexer">Lexer</a> class) and a buffered token stream (provided by the <a
href="#MacroExpander">MacroExpander</a> class).</p>
				136
				137
				138	<!-- ======================================================================= -->
				139	<h3 id="Token">The Token class</h3>
				140	<!-- ======================================================================= -->
				141
				142	<p>The Token class is used to represent a single lexed token. Tokens are
intended to be used by the lexer/preprocessor and parser libraries, but are not
intended to live beyond them (for example, they should not live in the ASTs).</p>
				145
				146	<p>Tokens most often live on the stack (or some other location that is efficient
				147	to access) as the parser is running, but occasionally do get buffered up. For
				148	example, macro definitions are stored as a series of tokens, and the C++
				149	front-end will eventually need to buffer tokens up for tentative parsing and
various pieces of look-ahead. As such, the size of a Token matters. On a 32-bit
				151	system, sizeof(Token) is currently 16 bytes.</p>
				152
				153	<p>Tokens contain the following information:</p>
				154
				155	<ul>
				156	<li><b>A SourceLocation</b> - This indicates the location of the start of the
				157	token.</li>
				158
				159	<li><b>A length</b> - This stores the length of the token as stored in the
				160	SourceBuffer. For tokens that include them, this length includes trigraphs and
				161	escaped newlines which are ignored by later phases of the compiler. By pointing
				162	into the original source buffer, it is always possible to get the original
				163	spelling of a token completely accurately.</li>
				164
				165	<li><b>IdentifierInfo</b> - If a token takes the form of an identifier, and if
				166	identifier lookup was enabled when the token was lexed (e.g. the lexer was not
				167	reading in 'raw' mode) this contains a pointer to the unique hash value for the
				168	identifier. Because the lookup happens before keyword identification, this
				169	field is set even for language keywords like 'for'.</li>
				170
				171	<li><b>TokenKind</b> - This indicates the kind of token as classified by the
				172	lexer. This includes things like <tt>tok::starequal</tt> (for the "*="
				173	operator), <tt>tok::ampamp</tt> for the "&&" token, and keyword values
				174	(e.g. <tt>tok::kw_for</tt>) for identifiers that correspond to keywords. Note
				175	that some tokens can be spelled multiple ways. For example, C++ supports
				176	"operator keywords", where things like "and" are treated exactly like the
				177	"&&" operator. In these cases, the kind value is set to
				178	<tt>tok::ampamp</tt>, which is good for the parser, which doesn't have to
				179	consider both forms. For something that cares about which form is used (e.g.
				180	the preprocessor 'stringize' operator) the spelling indicates the original
				181	form.</li>
				182
				183	<li><b>Flags</b> - There are currently four flags tracked by the
				184	lexer/preprocessor system on a per-token basis:
				185
				186	<ol>
				187	<li><b>StartOfLine</b> - This was the first token that occurred on its input
				188	source line.</li>
				189	<li><b>LeadingSpace</b> - There was a space character either immediately
				190	before the token or transitively before the token as it was expanded
				191	through a macro. The definition of this flag is very closely defined by
				192	the stringizing requirements of the preprocessor.</li>
				193	<li><b>DisableExpand</b> - This flag is used internally to the preprocessor to
				194	represent identifier tokens which have macro expansion disabled. This
				195	prevents them from being considered as candidates for macro expansion ever
				196	in the future.</li>
				197	<li><b>NeedsCleaning</b> - This flag is set if the original spelling for the
				198	token includes a trigraph or escaped newline. Since this is uncommon,
many pieces of code can fast-path on tokens that did not need cleaning.</li>
				201	</ol>
				202	</li>
				203	</ul>
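<p>The list above can be sketched as a small value class. The field widths,
the flag bit values, and the name <tt>Tok</tt> are assumptions made for
illustration; the sketch does not try to reproduce the 16-byte layout
mentioned earlier, which depends on the host pointer size.</p>

```cpp
#include <cassert>
#include <cstdint>

struct IdentifierInfo; // opaque here; only a pointer to it is stored

// Hypothetical token record holding the five pieces of information.
class Tok {
public:
  // One bit per flag; the concrete values are an assumption.
  enum Flag : std::uint8_t {
    StartOfLine = 0x01,
    LeadingSpace = 0x02,
    DisableExpand = 0x04,
    NeedsCleaning = 0x08
  };

private:
  std::uint32_t Loc = 0;           // encoded start location
  std::uint32_t Length = 0;        // length in the source buffer
  IdentifierInfo *Ident = nullptr; // set only for identifier tokens
  std::uint8_t Kind = 0;           // token kind as classified by the lexer
  std::uint8_t Flags = 0;          // bitwise OR of Flag values

public:
  void setLocation(std::uint32_t L) { Loc = L; }
  std::uint32_t getLocation() const { return Loc; }
  void setLength(std::uint32_t N) { Length = N; }
  std::uint32_t getLength() const { return Length; }
  void setIdentifierInfo(IdentifierInfo *II) { Ident = II; }
  IdentifierInfo *getIdentifierInfo() const { return Ident; }
  void setKind(std::uint8_t K) { Kind = K; }
  std::uint8_t getKind() const { return Kind; }

  void setFlag(Flag F) { Flags |= F; }
  void clearFlag(Flag F) { Flags &= static_cast<std::uint8_t>(~F); }
  bool hasFlag(Flag F) const { return (Flags & F) != 0; }

  // Fast-path check: most tokens have no trigraphs or escaped newlines.
  bool needsCleaning() const { return hasFlag(NeedsCleaning); }
};
```

<p>A client that needs the spelling of a token could test
<tt>needsCleaning()</tt> first and copy the bytes straight out of the source
buffer in the common case where it returns false.</p>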
				204
				205	<p>One interesting (and somewhat unusual) aspect of tokens is that they don't
				206	contain any semantic information about the lexed value. For example, if the
				207	token was a pp-number token, we do not represent the value of the number that
				208	was lexed (this is left for later pieces of code to decide). Additionally, the
				209	lexer library has no notion of typedef names vs variable names: both are
				210	returned as identifiers, and the parser is left to decide whether a specific
				211	identifier is a typedef or a variable (tracking this requires scope information
				212	among other things).</p>
				213
				214	<!-- ======================================================================= -->
				215	<h3 id="Lexer">The Lexer class</h3>
				216	<!-- ======================================================================= -->
				217
				218	<p>The Lexer class provides the mechanics of lexing tokens out of a source
				219	buffer and deciding what they mean. The Lexer is complicated by the fact that
				220	it operates on raw buffers that have not had spelling eliminated (this is a
				221	necessity to get decent performance), but this is countered with careful coding
				222	as well as standard performance techniques (for example, the comment handling
				223	code is vectorized on X86 and PowerPC hosts).</p>
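<p>To make the cost of lexing raw buffers concrete, here is a minimal sketch
of the clean-up that a token spelling needs when it contains trigraphs or
escaped newlines. The function names are hypothetical, and the two-pass
structure is chosen for clarity rather than speed; it is not the lexer's
actual implementation.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Maps the third character of a trigraph to its replacement, or 0 if the
// three characters do not form a trigraph.
static char trigraphChar(char C) {
  switch (C) {
  case '=': return '#';
  case '(': return '[';
  case '/': return '\\';
  case ')': return ']';
  case '\'': return '^';
  case '<': return '{';
  case '!': return '|';
  case '>': return '}';
  case '-': return '~';
  default: return 0;
  }
}

// Returns the cleaned spelling of a raw token: trigraphs are translated
// first, then every backslash immediately followed by a newline is dropped.
std::string cleanSpelling(const std::string &Raw) {
  std::string Tmp;
  for (std::size_t I = 0; I < Raw.size(); ++I) {
    if (Raw[I] == '?' && I + 2 < Raw.size() && Raw[I + 1] == '?') {
      if (char C = trigraphChar(Raw[I + 2])) {
        Tmp += C;
        I += 2; // skip the rest of the trigraph
        continue;
      }
    }
    Tmp += Raw[I];
  }

  std::string Out;
  for (std::size_t I = 0; I < Tmp.size(); ++I) {
    if (Tmp[I] == '\\' && I + 1 < Tmp.size() && Tmp[I + 1] == '\n') {
      ++I; // skip the newline as well
      continue;
    }
    Out += Tmp[I];
  }
  return Out;
}
```

<p>Translating trigraphs before removing escaped newlines means that a
trigraph standing for a backslash, followed by a newline, is also treated as a
line continuation. The sketch ignores whether trigraphs are enabled for the
language being compiled.</p>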
				224
				225	<p>The lexer has a couple of interesting modal features:</p>
				226
				227	<ul>
				228	<li>The lexer can operate in 'raw' mode. This mode has several features that
				229	make it possible to quickly lex the file (e.g. it stops identifier lookup,
				230	doesn't specially handle preprocessor tokens, handles EOF differently, etc).
				231	This mode is used for lexing within an "<tt>#if 0</tt>" block, for
				232	example.</li>
				233	<li>The lexer can capture and return comments as tokens. This is required to
				234	support the -C preprocessor mode, which passes comments through, and is
used by the diagnostic checker to identify expect-error annotations.</li>
				236	<li>The lexer can be in ParsingFilename mode, which happens when preprocessing
Chris Lattner	8438624	2007-09-16 19:25:23 +0000	[diff] [blame]	237	after reading a #include directive. This mode changes the parsing of '<'
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	238	to return an "angled string" instead of a bunch of tokens for each thing
				239	within the filename.</li>
				240	<li>When parsing a preprocessor directive (after "<tt>#</tt>") the
				241	ParsingPreprocessorDirective mode is entered. This changes the parser to
				242	return EOM at a newline.</li>
				243	<li>The Lexer uses a LangOptions object to know whether trigraphs are enabled,
				244	whether C++ or ObjC keywords are recognized, etc.</li>
				245	</ul>
				246
				247	<p>In addition to these modes, the lexer keeps track of a couple of other
				248	features that are local to a lexed buffer, which change as the buffer is
				249	lexed:</p>
				250
				251	<ul>
				252	<li>The Lexer uses BufferPtr to keep track of the current character being
				253	lexed.</li>
				254	<li>The Lexer uses IsAtStartOfLine to keep track of whether the next lexed token
				255	will start with its "start of line" bit set.</li>
				256	<li>The Lexer keeps track of the current #if directives that are active (which
				257	can be nested).</li>
				258	<li>The Lexer keeps track of an <a href="#MultipleIncludeOpt">
				259	MultipleIncludeOpt</a> object, which is used to
				260	detect whether the buffer uses the standard "<tt>#ifndef XX</tt> /
				261	<tt>#define XX</tt>" idiom to prevent multiple inclusion. If a buffer does,
				262	subsequent includes can be ignored if the XX macro is defined.</li>
				263	</ul>
				264
				265	<!-- ======================================================================= -->
				266	<h3 id="MacroExpander">The MacroExpander class</h3>
				267	<!-- ======================================================================= -->
				268
				269	<p>The MacroExpander class is a token provider that returns tokens from a list
of tokens that came from somewhere else. It is typically used for two things: 1)
returning tokens from a macro definition as it is being expanded and 2) returning
tokens from an arbitrary buffer of tokens. The latter is used by _Pragma and
				273	will most likely be used to handle unbounded look-ahead for the C++ parser.</p>
				274
				275	<!-- ======================================================================= -->
				276	<h3 id="MultipleIncludeOpt">The MultipleIncludeOpt class</h3>
				277	<!-- ======================================================================= -->
				278
				279	<p>The MultipleIncludeOpt class implements a really simple little state machine
				280	that is used to detect the standard "<tt>#ifndef XX</tt> / <tt>#define XX</tt>"
				281	idiom that people typically use to prevent multiple inclusion of headers. If a
				282	buffer uses this idiom and is subsequently #include'd, the preprocessor can
				283	simply check to see whether the guarding condition is defined or not. If so,
				284	the preprocessor can completely ignore the include of the header.</p>
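<p>A state machine of this kind can be sketched in a few lines. The class and
method names below are hypothetical, and the sketch only checks that the whole
buffer is wrapped in a single top-level <tt>#ifndef</tt>; it does not verify
the matching <tt>#define</tt>.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical detector, driven by the lexer as it walks one buffer.
class IncludeGuardDetector {
  bool ReadAnyTokens = false; // saw something before the guard opened
  bool InsideGuard = false;   // currently between #ifndef and its #endif
  bool Invalid = false;       // the idiom was broken somewhere
  std::string Guard;          // macro named by the top-level #ifndef

public:
  // Call when "#ifndef Name" is seen with no enclosing conditional.
  void enterTopLevelIfndef(const std::string &Name) {
    if (ReadAnyTokens || !Guard.empty()) {
      Invalid = true; // tokens came first, or this is a second guard
      return;
    }
    Guard = Name;
    InsideGuard = true;
  }

  // Call when the #endif of a top-level conditional is seen.
  void exitTopLevelConditional() {
    if (!InsideGuard) {
      Invalid = true; // some other top-level conditional
      return;
    }
    InsideGuard = false;
  }

  // Call for every other token or directive.
  void readToken() {
    if (InsideGuard)
      return; // anything goes inside the guarded region
    ReadAnyTokens = true;
    if (!Guard.empty())
      Invalid = true; // tokens after the closing #endif
  }

  // At end of file: the guard macro, or "" if the idiom was not used.
  std::string controllingMacroAtEOF() const {
    return (Invalid || InsideGuard) ? std::string() : Guard;
  }
};
```

<p>When the result is non-empty, a later #include of the same file only needs
to check whether that macro is still defined before deciding to skip the
file.</p>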
				285
				286
				287
				288	<!-- ======================================================================= -->
				289	<h2 id="libparse">The Parser Library</h2>
				290	<!-- ======================================================================= -->
				291
				292	<!-- ======================================================================= -->
				293	<h2 id="libast">The AST Library</h2>
				294	<!-- ======================================================================= -->
				295
				296	<!-- ======================================================================= -->
				297	<h3 id="Type">The Type class and its subclasses</h3>
				298	<!-- ======================================================================= -->
				299
				300	<p>The Type class (and its subclasses) are an important part of the AST. Types
				301	are accessed through the ASTContext class, which implicitly creates and uniques
				302	them as they are needed. Types have a couple of non-obvious features: 1) they
				303	do not capture type qualifiers like const or volatile (See
				304	<a href="#QualType">QualType</a>), and 2) they implicitly capture typedef
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	305	information. Once created, types are immutable (unlike decls).</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	306
				307	<p>Typedefs in C make semantic analysis a bit more complex than it would
				308	be without them. The issue is that we want to capture typedef information
				309	and represent it in the AST perfectly, but the semantics of operations need to
				310	"see through" typedefs. For example, consider this code:</p>
				311
				312	<code>
				313	void func() {<br>
Bill Wendling	30d1775	2007-10-06 01:56:01 +0000	[diff] [blame]	314	typedef int foo;<br>
				315	foo X, *Y;<br>
				316	typedef foo* bar;<br>
				317	bar Z;<br>
				318	*X; <i>// error</i><br>
				319	**Y; <i>// error</i><br>
				320	**Z; <i>// error</i><br>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	321	}<br>
				322	</code>
				323
				324	<p>The code above is illegal, and thus we expect there to be diagnostics emitted
				325	on the annotated lines. In this example, we expect to get:</p>
				326
				327	<pre>
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	328	<b>test.c:6:1: error: indirection requires pointer operand ('foo' invalid)</b>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	329	*X; // error
				330	<font color="blue">^~</font>
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	331	<b>test.c:7:1: error: indirection requires pointer operand ('foo' invalid)</b>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	332	**Y; // error
				333	<font color="blue">^~~</font>
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	334	<b>test.c:8:1: error: indirection requires pointer operand ('foo' invalid)</b>
				335	**Z; // error
				336	<font color="blue">^~~</font>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	337	</pre>
				338
				339	<p>While this example is somewhat silly, it illustrates the point: we want to
				340	retain typedef information where possible, so that we can emit errors about
				341	"<tt>std::string</tt>" instead of "<tt>std::basic_string<char, std:...</tt>".
				342	Doing this requires properly keeping typedef information (for example, the type
				343	of "X" is "foo", not "int"), and requires properly propagating it through the
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	344	various operators (for example, the type of *Y is "foo", not "int"). In order
				345	to retain this information, the type of these expressions is an instance of the
				346	TypedefType class, which indicates that the type of these expressions is a
				347	typedef for foo.
				348	</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	349
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	350	<p>Representing types like this is great for diagnostics, because the
				351	user-specified type is always immediately available. There are two problems
				352	with this: first, various semantic checks need to make judgements about the
<em>actual structure</em> of a type, ignoring typedefs. Second, we need an
				354	efficient way to query whether two types are structurally identical to each
				355	other, ignoring typedefs. The solution to both of these problems is the idea of
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	356	canonical types.</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	357
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	358	<h4>Canonical Types</h4>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	359
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	360	<p>Every instance of the Type class contains a canonical type pointer. For
				361	simple types with no typedefs involved (e.g. "<tt>int</tt>", "<tt>int*</tt>",
				362	"<tt>int**</tt>"), the type just points to itself. For types that have a
				363	typedef somewhere in their structure (e.g. "<tt>foo</tt>", "<tt>foo*</tt>",
				364	"<tt>foo**</tt>", "<tt>bar</tt>"), the canonical type pointer points to their
structurally equivalent type without any typedefs (e.g. "<tt>int</tt>",
"<tt>int*</tt>", "<tt>int**</tt>", and "<tt>int*</tt>" respectively).</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	367
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	368	<p>This design provides a constant time operation (dereferencing the canonical
				369	type pointer) that gives us access to the structure of types. For example,
				370	we can trivially tell that "bar" and "foo*" are the same type by dereferencing
				371	their canonical type pointers and doing a pointer comparison (they both point
				372	to the single "<tt>int*</tt>" type).</p>
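<p>The following sketch shows one way the canonical pointer could be
maintained by a context object that creates and uniques types. All names
(<tt>Ty</tt>, <tt>TyContext</tt>, <tt>sameType</tt>) are hypothetical and the
model covers only builtin, pointer, and typedef types.</p>

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Ty {
  enum Kind { Builtin, Pointer, Typedef } K;
  std::string Name;    // used by Builtin and Typedef
  const Ty *Inner;     // pointee (Pointer) or underlying type (Typedef)
  const Ty *Canonical; // points to itself for typedef-free types
};

// Hypothetical context that owns all types and uniques them.
class TyContext {
  std::vector<std::unique_ptr<Ty>> Owned;
  std::map<std::string, const Ty *> Builtins;
  std::map<const Ty *, const Ty *> PointerTo;

  Ty *make(Ty::Kind K, const std::string &Name, const Ty *Inner) {
    Owned.push_back(std::unique_ptr<Ty>(new Ty{K, Name, Inner, nullptr}));
    return Owned.back().get();
  }

public:
  const Ty *getBuiltin(const std::string &Name) {
    const Ty *&Slot = Builtins[Name];
    if (!Slot) {
      Ty *T = make(Ty::Builtin, Name, nullptr);
      T->Canonical = T;
      Slot = T;
    }
    return Slot;
  }

  const Ty *getPointer(const Ty *Pointee) {
    const Ty *&Slot = PointerTo[Pointee]; // map references stay valid
    if (!Slot) {
      Ty *T = make(Ty::Pointer, "", Pointee);
      // A pointer to a canonical type is canonical; otherwise its
      // canonical type is the pointer to the canonical pointee.
      T->Canonical = (Pointee->Canonical == Pointee)
                         ? T
                         : getPointer(Pointee->Canonical);
      Slot = T;
    }
    return Slot;
  }

  const Ty *getTypedef(const std::string &Name, const Ty *Underlying) {
    Ty *T = make(Ty::Typedef, Name, Underlying);
    T->Canonical = Underlying->Canonical;
    return T;
  }
};

// Constant-time structural equality: compare canonical pointers.
bool sameType(const Ty *A, const Ty *B) { return A->Canonical == B->Canonical; }
```

<p>With "foo" defined as a typedef of "int" and "bar" as a typedef of "foo*",
<tt>sameType</tt> reports that "bar" and "foo*" are the same type because both
canonical pointers refer to the single uniqued "int*" node.</p>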
				373
				374	<p>Canonical types and typedef types bring up some complexities that must be
				375	carefully managed. Specifically, the "isa/cast/dyncast" operators generally
				376	shouldn't be used in code that is inspecting the AST. For example, when type
				377	checking the indirection operator (unary '*' on a pointer), the type checker
				378	must verify that the operand has a pointer type. It would not be correct to
				379	check that with "<tt>isa<PointerType>(SubExpr->getType())</tt>",
				380	because this predicate would fail if the subexpression had a typedef type.</p>
				381
<p>The solution to this problem is a set of helper methods on Type, used to
				383	check their properties. In this case, it would be correct to use
				384	"<tt>SubExpr->getType()->isPointerType()</tt>" to do the check. This
				385	predicate will return true if the <em>canonical type is a pointer</em>, which is
				386	true any time the type is structurally a pointer type. The only hard part here
				387	is remembering not to use the <tt>isa/cast/dyncast</tt> operations.</p>
				388
				389	<p>The second problem we face is how to get access to the pointer type once we
				390	know it exists. To continue the example, the result type of the indirection
				391	operator is the pointee type of the subexpression. In order to determine the
				392	type, we need to get the instance of PointerType that best captures the typedef
				393	information in the program. If the type of the expression is literally a
				394	PointerType, we can return that, otherwise we have to dig through the
				395	typedefs to find the pointer type. For example, if the subexpression had type
				396	"<tt>foo*</tt>", we could return that type as the result. If the subexpression
				397	had type "<tt>bar</tt>", we want to return "<tt>foo*</tt>" (note that we do
				398	<em>not</em> want "<tt>int*</tt>"). In order to provide all of this, Type has
Chris Lattner	11406c1	2007-07-31 16:50:51 +0000	[diff] [blame]	399	a getAsPointerType() method that checks whether the type is structurally a
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	400	PointerType and, if so, returns the best one. If not, it returns a null
				401	pointer.</p>
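<p>As a rough illustration of the two helpers described above, the sketch
below models types as hand-built nodes. The struct <tt>TypeNode</tt> and its
fields are hypothetical; only the method names <tt>isPointerType</tt> and
<tt>getAsPointerType</tt> come from the text, and their bodies here are an
assumption about how such helpers might work.</p>

```cpp
#include <cassert>

struct TypeNode {
  enum Kind { Builtin, Pointer, Typedef } K;
  const char *Name;          // spelling for Builtin and Typedef nodes
  const TypeNode *Inner;     // pointee (Pointer) or underlying (Typedef)
  const TypeNode *Canonical; // typedef-free equivalent; null means "self"

  const TypeNode *canonical() const { return Canonical ? Canonical : this; }

  // True whenever the type is structurally a pointer, even if the
  // pointer is hidden behind one or more typedefs.
  bool isPointerType() const { return canonical()->K == Pointer; }

  // Returns the pointer node that keeps as much typedef information as
  // possible, or null if the type is not structurally a pointer.
  const TypeNode *getAsPointerType() const {
    if (!isPointerType())
      return nullptr;
    const TypeNode *T = this;
    while (T->K == Typedef) // strip only the outer typedef layers
      T = T->Inner;
    return T;
  }
};
```

<p>For a node "bar" whose underlying type is "foo*", a direct kind check fails
because the node itself is a typedef, while <tt>isPointerType()</tt> succeeds
and <tt>getAsPointerType()</tt> returns the "foo*" node, whose pointee is
still "foo" rather than "int".</p>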
				402
				403	<p>This structure is somewhat mystical, but after meditating on it, it will
				404	make sense to you :).</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	405
				406	<!-- ======================================================================= -->
				407	<h3 id="QualType">The QualType class</h3>
				408	<!-- ======================================================================= -->
				409
				410	<p>The QualType class is designed as a trivial value class that is small,
				411	passed by-value and is efficient to query. The idea of QualType is that it
				412	stores the type qualifiers (const, volatile, restrict) separately from the types
				413	themselves: QualType is conceptually a pair of "Type*" and bits for the type
				414	qualifiers.</p>
				415
				416	<p>By storing the type qualifiers as bits in the conceptual pair, it is
				417	extremely efficient to get the set of qualifiers on a QualType (just return the
				418	field of the pair), add a type qualifier (which is a trivial constant-time
				419	operation that sets a bit), and remove one or more type qualifiers (just return
				420	a QualType with the bitfield set to empty).</p>
				421
				422	<p>Further, because the bits are stored outside of the type itself, we do not
				423	need to create duplicates of types with different sets of qualifiers (i.e. there
				424	is only a single heap allocated "int" type: "const int" and "volatile const int"
				425	both point to the same heap allocated "int" type). This reduces the heap size
				426	used to represent bits and also means we do not have to consider qualifiers when
				427	uniquing types (<a href="#Type">Type</a> does not even contain qualifiers).</p>
				428
<p>In practice, on hosts where it is safe, the 3 type qualifiers are stored in
the low bits of the pointer to the Type object. This means that QualType is
exactly the same size as a pointer, and this works fine on any system where
malloc'd objects are at least 8-byte aligned.</p>
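<p>The pointer-plus-bits idea can be sketched as follows. The names
<tt>QualTy</tt> and <tt>TypeNode</tt>, the assignment of bit values to
qualifiers, and the method names are assumptions for illustration; the sketch
also assumes that <tt>std::uintptr_t</tt> has the same size as a pointer on
the host.</p>

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a heap-allocated type object. The 8-byte alignment is what
// frees the three low bits of its address.
struct alignas(8) TypeNode {
  int Dummy;
};

class QualTy {
  static constexpr std::uintptr_t QualMask = 0x7;
  std::uintptr_t Bits = 0; // type pointer in high bits, qualifiers in low 3

public:
  // Which bit means which qualifier is an assumption of this sketch.
  enum Qual : unsigned { Const = 0x1, Restrict = 0x2, Volatile = 0x4 };

  QualTy() = default;
  QualTy(const TypeNode *T, unsigned Quals)
      : Bits(reinterpret_cast<std::uintptr_t>(T) | Quals) {
    assert((reinterpret_cast<std::uintptr_t>(T) & QualMask) == 0);
    assert((Quals & ~QualMask) == 0);
  }

  const TypeNode *getTypePtr() const {
    return reinterpret_cast<const TypeNode *>(Bits & ~QualMask);
  }
  unsigned getQualifiers() const {
    return static_cast<unsigned>(Bits & QualMask);
  }
  bool isConstQualified() const { return (Bits & Const) != 0; }

  // Adding a qualifier sets one bit; the type object is shared.
  QualTy withQualifier(Qual Q) const {
    return QualTy(getTypePtr(), getQualifiers() | Q);
  }
  // Removing all qualifiers clears the low bits.
  QualTy getUnqualifiedType() const { return QualTy(getTypePtr(), 0); }
};

static_assert(sizeof(QualTy) == sizeof(void *), "must be pointer sized");
```

<p>Two values that differ only in qualifiers, such as "int" and
"volatile const int", return the same <tt>getTypePtr()</tt>, which is why no
duplicate type objects are needed.</p>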
Ted Kremenek	8bc0571	2007-10-10 23:01:43 +0000	[diff] [blame]	433
				434	<!-- ======================================================================= -->
				435	<h3 id="CFG">The <tt>CFG</tt> class</h3>
				436	<!-- ======================================================================= -->
				437
				438	<p>The <tt>CFG</tt> class is designed to represent a source-level
				439	control-flow graph for a single statement (<tt>Stmt*</tt>). Typically
				440	instances of <tt>CFG</tt> are constructed for function bodies (usually
				441	an instance of <tt>CompoundStmt</tt>), but can also be instantiated to
				442	represent the control-flow of any class that subclasses <tt>Stmt</tt>,
				443	which includes simple expressions. Control-flow graphs are especially
				444	useful for performing
				445	<a href="http://en.wikipedia.org/wiki/Data_flow_analysis#Sensitivities">flow-
				446	or path-sensitive</a> program analyses on a given function.</p>
				447
				448	<h4>Basic Blocks</h4>
				449
				450	<p>Concretely, an instance of <tt>CFG</tt> is a collection of basic
				451	blocks. Each basic block is an instance of <tt>CFGBlock</tt>, which
				452	simply contains an ordered sequence of <tt>Stmt*</tt> (each referring
				453	to statements in the AST). The ordering of statements within a block
				454	indicates unconditional flow of control from one statement to the
				455	next. <a href="#ConditionalControlFlow">Conditional control-flow</a>
				456	is represented using edges between basic blocks. The statements
				457	within a given <tt>CFGBlock</tt> can be traversed using
				458	the <tt>CFGBlock::*iterator</tt> interface.</p>
				459
				460	<p>
Ted Kremenek	18e17e7	2007-10-18 22:50:52 +0000	[diff] [blame^]	461	A <tt>CFG</tt> object owns the instances of <tt>CFGBlock</tt> within
Ted Kremenek	8bc0571	2007-10-10 23:01:43 +0000	[diff] [blame]	462	the control-flow graph it represents. Each <tt>CFGBlock</tt> within a
				463	CFG is also uniquely numbered (accessible
				464	via <tt>CFGBlock::getBlockID()</tt>). Currently the number is
based on the order in which the blocks were created, but no assumptions
should be made about how <tt>CFGBlock</tt>s are numbered other than that their
numbers are unique and that they are numbered from 0..N-1 (where N is
				468	the number of basic blocks in the CFG).</p>
				469
				470	<h4>Entry and Exit Blocks</h4>
				471
<p>Each instance of <tt>CFG</tt> contains two special blocks:
an <i>entry</i> block (accessible via <tt>CFG::getEntry()</tt>), which
has no incoming edges, and an <i>exit</i> block (accessible
via <tt>CFG::getExit()</tt>), which has no outgoing edges. Neither
block contains any statements, and they serve the role of providing a
clear entrance and exit for a body of code such as a function body.
The presence of these empty blocks greatly simplifies the
implementation of many analyses built on top of CFGs.</p>
				480
				481	<h4 id ="ConditionalControlFlow">Conditional Control-Flow</h4>
				482
<p>Conditional control-flow (such as that induced by if-statements
				484	and loops) is represented as edges between <tt>CFGBlock</tt>s.
				485	Because different C language constructs can induce control-flow,
				486	each <tt>CFGBlock</tt> also records an extra <tt>Stmt*</tt> that
				487	represents the <i>terminator</i> of the block. A terminator is simply
				488	the statement that caused the control-flow, and is used to identify
				489	the nature of the conditional control-flow between blocks. For
				490	example, in the case of an if-statement, the terminator refers to
				491	the <tt>IfStmt</tt> object in the AST that represented the given
				492	branch.</p>
				493
				494	<p>To illustrate, consider the following code example:</p>
				495
				496	<code>
				497	int foo(int x) {<br>
				498	x = x + 1;<br>
				499	<br>
				500	if (x > 2) x++;<br>
				501	else {<br>
				502	x += 2;<br>
				503	x *= 2;<br>
				504	}<br>
				505	<br>
				506	return x;<br>
				507	}
				508	</code>
				509
				510	<p>After invoking the parser+semantic analyzer on this code fragment,
				511	the AST of the body of <tt>foo</tt> is referenced by a
				512	single <tt>Stmt*</tt>. We can then construct an instance
				513	of <tt>CFG</tt> representing the control-flow graph of this function
body by a single call to a static class method:</p>
				515
				516	<code>
				517	Stmt* FooBody = ...<br>
				518	CFG* FooCFG = <b>CFG::buildCFG</b>(FooBody);
				519	</code>
				520
				521	<p>It is the responsibility of the caller of <tt>CFG::buildCFG</tt>
				522	to <tt>delete</tt> the returned <tt>CFG*</tt> when the CFG is no
				523	longer needed.</p>
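<p>One way a caller might honor this ownership contract is to hand the
returned pointer to a smart pointer right away. The sketch below uses a
hypothetical <tt>ToyCFG</tt> type with a <tt>build()</tt> method as a
stand-in, since it is meant to compile on its own; it is an assumption
for illustration and does not reproduce the real <tt>CFG</tt>
class.</p>

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a heap-allocated graph; not clang's CFG class.
struct ToyCFG {
  static int liveCount;  // number of ToyCFG objects currently alive
  ToyCFG() { ++liveCount; }
  ~ToyCFG() { --liveCount; }

  // Same contract as described above: the caller owns the returned object.
  static ToyCFG* build() { return new ToyCFG(); }
};
int ToyCFG::liveCount = 0;

// Take ownership immediately so the graph is deleted on every exit path.
int buildAndRelease() {
  std::unique_ptr<ToyCFG> cfg(ToyCFG::build());
  return ToyCFG::liveCount;  // the graph is still alive at this point
}
```

<p>The live-object counter exists only so the example can show that
the object is released when the smart pointer goes out of scope.</p>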
				524
				525	<p>Along with providing an interface to iterate over
				526	its <tt>CFGBlock</tt>s, the <tt>CFG</tt> class also provides methods
				527	that are useful for debugging and visualizing CFGs. For example, the
				528	method
				529	<tt>CFG::dump()</tt> dumps a pretty-printed version of the CFG to
				530	standard error. This is especially useful when one is using a
				531	debugger such as gdb. Here is the output
				532	of <tt>FooCFG->dump()</tt>:</p>
				533
				534	<code>
				535	[ B5 (ENTRY) ]<br>
				536	Predecessors (0):<br>
				537	Successors (1): B4<br>
				538	<br>
				539	[ B4 ]<br>
				540	1: x = x + 1<br>
				541	2: (x > 2)<br>
				542	<b>T: if [B4.2]</b><br>
				543	Predecessors (1): B5<br>
				544	Successors (2): B3 B2<br>
				545	<br>
				546	[ B3 ]<br>
				547	1: x++<br>
				548	Predecessors (1): B4<br>
				549	Successors (1): B1<br>
				550	<br>
				551	[ B2 ]<br>
				552	1: x += 2<br>
				553	2: x *= 2<br>
				554	Predecessors (1): B4<br>
				555	Successors (1): B1<br>
				556	<br>
				557	[ B1 ]<br>
				558	1: return x;<br>
				559	Predecessors (2): B2 B3<br>
				560	Successors (1): B0<br>
				561	<br>
				562	[ B0 (EXIT) ]<br>
				563	Predecessors (1): B1<br>
				564	Successors (0):
				565	</code>
				566
				567	<p>For each block, the pretty-printed output displays
				568	the number of <i>predecessor</i> blocks (blocks that have outgoing
				569	control-flow to the given block) and <i>successor</i> blocks (blocks
				570	that have incoming control-flow from the given
				571	block). We can also clearly see the special entry and exit blocks at
				572	the beginning and end of the pretty-printed output. For the entry
				573	block (block B5), the number of predecessor blocks is 0, while for the
				574	exit block (block B0) the number of successor blocks is 0.</p>
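<p>The relationship between the two lists can be checked with a small
standalone sketch: predecessors are just successor edges read
backwards. The edge table below is transcribed from the dump of
<tt>foo</tt> above; the function names <tt>fooSuccessors</tt>
and <tt>predecessorsOf</tt> are made up for this example and are not
part of the <tt>CFG</tt> interface.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Successor lists indexed by block number, transcribed from the dump:
// B5 (entry) -> B4, B4 -> {B3, B2}, B3 -> B1, B2 -> B1, B1 -> B0 (exit).
std::vector<std::vector<int>> fooSuccessors() {
  return {
      {},      // B0 (EXIT)
      {0},     // B1
      {1},     // B2
      {1},     // B3
      {3, 2},  // B4
      {4},     // B5 (ENTRY)
  };
}

// Derive predecessor lists by reversing every successor edge.
std::vector<std::vector<int>> predecessorsOf(
    const std::vector<std::vector<int>>& succs) {
  std::vector<std::vector<int>> preds(succs.size());
  for (std::size_t from = 0; from < succs.size(); ++from)
    for (int to : succs[from])
      preds[to].push_back(static_cast<int>(from));
  return preds;
}
```

<p>Reversing the edges this way yields an empty predecessor list for
the entry block, and the exit block is the only one whose successor
list is empty, matching the dump.</p>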
				575
				576	<p>The most interesting block here is B4, whose outgoing control-flow
				577	represents the branching caused by the sole if-statement
				578	in <tt>foo</tt>. Of particular interest is the second statement in
				579	the block, <b><tt>(x > 2)</tt></b>, and the terminator, printed
				580	as <b><tt>if [B4.2]</tt></b>. The second statement represents the
				581	evaluation of the condition of the if-statement, which occurs before
				582	the actual branching of control-flow. Within the <tt>CFGBlock</tt>
				583	for B4, the <tt>Stmt*</tt> for the second statement refers to the
				584	actual expression in the AST for <b><tt>(x > 2)</tt></b>. Thus
				585	pointers to subclasses of <tt>Expr</tt> can appear in the list of
				586	statements in a block, and not just subclasses of <tt>Stmt</tt> that
				587	refer to proper C statements.</p>
				588
				589	<p>The terminator of block B4 is a pointer to the <tt>IfStmt</tt>
				590	object in the AST. The pretty-printer outputs <b><tt>if
				591	[B4.2]</tt></b> because the condition expression of the if-statement
				592	has an actual place in the basic block, and thus the terminator is
				593	essentially
				594	<i>referring</i> to the expression that is the second statement of
				595	block B4 (i.e., B4.2). In this manner, conditions for control-flow
				596	(which also include conditions for loops and switch statements) are
				597	hoisted into the actual basic block.</p>
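<p>To make the <tt>B4.2</tt> notation concrete, here is a minimal
sketch of how such a label could be produced when the terminator
refers back to a condition stored in the block. <tt>HoistedBlock</tt>
and <tt>terminatorLabel</tt> are hypothetical names chosen for this
example, and statements are held as strings purely for brevity; the
actual pretty-printer may work quite differently.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical miniature of a block whose terminator points back at a
// condition stored among the block's own statements. Not clang's API.
struct HoistedBlock {
  int id;
  std::vector<std::string> stmts;  // printed form of each statement
  std::string terminatorKeyword;   // e.g. "if"; empty when no terminator
  std::size_t conditionIndex;      // 0-based index into stmts
};

// Render the terminator as keyword plus a block.statement reference
// (statements are numbered from 1 in the printed form).
std::string terminatorLabel(const HoistedBlock& b) {
  if (b.terminatorKeyword.empty())
    return "";
  assert(b.conditionIndex < b.stmts.size());
  return b.terminatorKeyword + " [B" + std::to_string(b.id) + "." +
         std::to_string(b.conditionIndex + 1) + "]";
}
```

<p>For a block numbered 4 whose second statement is the condition,
this sketch produces the same <tt>if [B4.2]</tt> text that appears in
the dump.</p>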
				598
Ted Kremenek	98f19b6	2007-10-10 23:22:00 +0000	[diff] [blame]	599	<!--
Ted Kremenek	8bc0571	2007-10-10 23:01:43 +0000	[diff] [blame]	600	<h4>Implicit Control-Flow</h4>
Ted Kremenek	98f19b6	2007-10-10 23:22:00 +0000	[diff] [blame]	601	-->
Ted Kremenek	8bc0571	2007-10-10 23:01:43 +0000	[diff] [blame]	602
				603	<!--
				604	<p>A key design principle of the <tt>CFG</tt> class was to not require
				605	any transformations to the AST in order to represent control-flow.
				606	Thus the <tt>CFG</tt> does not perform any "lowering" of the
				607	statements in an AST: loops are not transformed into guarded gotos,
				608	short-circuit operations are not converted to a set of if-statements,
				609	and so on.</p>
				610	-->