Blame - docs/InternalsManual.html - platform/external/clang

blob: 1e1fc990adc57e025f27c4147756a5ef093f30c2 [file] [log] [blame]

Ted Kremenek	17a295d	2008-06-11 06:19:49 +0000	[diff] [blame]	1	<html>
				2	<head>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	3	<title>"clang" CFE Internals Manual</title>
Ted Kremenek	17a295d	2008-06-11 06:19:49 +0000	[diff] [blame]	4	<link type="text/css" rel="stylesheet" href="../menu.css" />
				5	<link type="text/css" rel="stylesheet" href="../content.css" />
				6	</head>
				7	<body>
				8
				9	<!--#include virtual="../menu.html.incl"-->
				10
				11	<div id="content">
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	12
				13	<h1>"clang" CFE Internals Manual</h1>
				14
				15	<ul>
				16	<li><a href="#intro">Introduction</a></li>
				17	<li><a href="#libsystem">LLVM System and Support Libraries</a></li>
				18	<li><a href="#libbasic">The clang 'Basic' Library</a>
				19	<ul>
				20	<li><a href="#SourceLocation">The SourceLocation and SourceManager
				21	classes</a></li>
				22	</ul>
				23	</li>
				24	<li><a href="#liblex">The Lexer and Preprocessor Library</a>
				25	<ul>
				26	<li><a href="#Token">The Token class</a></li>
				27	<li><a href="#Lexer">The Lexer class</a></li>
Chris Lattner	7928125	2008-03-09 02:27:26 +0000	[diff] [blame]	28	<li><a href="#TokenLexer">The TokenLexer class</a></li>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	29	<li><a href="#MultipleIncludeOpt">The MultipleIncludeOpt class</a></li>
				30	</ul>
				31	</li>
				32	<li><a href="#libparse">The Parser Library</a>
				33	<ul>
				34	</ul>
				35	</li>
				36	<li><a href="#libast">The AST Library</a>
				37	<ul>
				38	<li><a href="#Type">The Type class and its subclasses</a></li>
				39	<li><a href="#QualType">The QualType class</a></li>
Ted Kremenek	8bc0571	2007-10-10 23:01:43 +0000	[diff] [blame]	40	<li><a href="#CFG">The CFG class</a></li>
Chris Lattner	7bad199	2008-11-16 21:48:07 +0000	[diff] [blame^]	41	<li><a href="#Constants">Constant Folding in the Clang AST</a></li>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	42	</ul>
				43	</li>
				44	</ul>
				45
				46
				47	<!-- ======================================================================= -->
				48	<h2 id="intro">Introduction</h2>
				49	<!-- ======================================================================= -->
				50
				51	<p>This document describes some of the more important APIs and internal design
				52	decisions made in the clang C front-end. The purpose of this document is to
				53	both capture some of this high level information and also describe some of the
				54	design decisions behind it. This is meant for people interested in hacking on
				55	clang, not for end-users. The description below is categorized by
				56	libraries, and does not describe any of the clients of the libraries.</p>
				57
				58	<!-- ======================================================================= -->
				59	<h2 id="libsystem">LLVM System and Support Libraries</h2>
				60	<!-- ======================================================================= -->
				61
				62	<p>The LLVM libsystem library provides the basic clang system abstraction layer,
				63	which is used for file system access. The LLVM libsupport library provides many
				64	underlying libraries and <a
				65	href="http://llvm.org/docs/ProgrammersManual.html">data-structures</a>,
				66	including command line option
				67	processing and various containers.</p>
				68
				69	<!-- ======================================================================= -->
				70	<h2 id="libbasic">The clang 'Basic' Library</h2>
				71	<!-- ======================================================================= -->
				72
				73	<p>This library certainly needs a better name. The 'basic' library contains a
				74	number of low-level utilities for tracking and manipulating source buffers,
				75	locations within the source buffers, diagnostics, tokens, target abstraction,
				76	and information about the subset of the language being compiled for.</p>
				77
				78	<p>Part of this infrastructure is specific to C (such as the TargetInfo class),
				79	other parts could be reused for other non-C-based languages (SourceLocation,
				80	SourceManager, Diagnostics, FileManager). When and if there is future demand
				81	we can figure out if it makes sense to introduce a new library, move the general
				82	classes somewhere else, or introduce some other solution.</p>
				83
				84	<p>We describe the roles of these classes in order of their dependencies.</p>
				85
				86	<!-- ======================================================================= -->
				87	<h3 id="SourceLocation">The SourceLocation and SourceManager classes</h3>
				88	<!-- ======================================================================= -->
				89
				90	<p>Strangely enough, the SourceLocation class represents a location within the
				91	source code of the program. Important design points include:</p>
				92
				93	<ol>
				94	<li>sizeof(SourceLocation) must be extremely small, as these are embedded into
				95	many AST nodes and are passed around often. Currently it is 32 bits.</li>
				96	<li>SourceLocation must be a simple value object that can be efficiently
				97	copied.</li>
				98	<li>We should be able to represent a source location for any byte of any input
				99	file. This includes in the middle of tokens, in whitespace, in trigraphs,
				100	etc.</li>
				101	<li>A SourceLocation must encode the current #include stack that was active when
				102	the location was processed. For example, if the location corresponds to a
				103	token, it should contain the set of #includes active when the token was
				104	lexed. This allows us to print the #include stack for a diagnostic.</li>
				105	<li>SourceLocation must be able to describe macro expansions, capturing both
				106	the ultimate instantiation point and the source of the original character
				107	data.</li>
				108	</ol>
				109
				110	<p>In practice, the SourceLocation works together with the SourceManager class
				111	to encode two pieces of information about a location: it's physical location
				112	and it's virtual location. For most tokens, these will be the same. However,
				113	for a macro expansion (or tokens that came from a _Pragma directive) these will
				114	describe the location of the characters corresponding to the token and the
				115	location where the token was used (i.e. the macro instantiation point or the
				116	location of the _Pragma itself).</p>
				117
				118	<p>For efficiency, we only track one level of macro instantions: if a token was
				119	produced by multiple instantiations, we only track the source and ultimate
				120	destination. Though we could track the intermediate instantiation points, this
				121	would require extra bookkeeping and no known client would benefit substantially
				122	from this.</p>
				123
				124	<p>The clang front-end inherently depends on the location of a token being
				125	tracked correctly. If it is ever incorrect, the front-end may get confused and
				126	die. The reason for this is that the notion of the 'spelling' of a Token in
				127	clang depends on being able to find the original input characters for the token.
				128	This concept maps directly to the "physical" location for the token.</p>
				129
				130	<!-- ======================================================================= -->
				131	<h2 id="liblex">The Lexer and Preprocessor Library</h2>
				132	<!-- ======================================================================= -->
				133
				134	<p>The Lexer library contains several tightly-connected classes that are involved
				135	with the nasty process of lexing and preprocessing C source code. The main
				136	interface to this library for outside clients is the large <a
				137	href="#Preprocessor">Preprocessor</a> class.
				138	It contains the various pieces of state that are required to coherently read
				139	tokens out of a translation unit.</p>
				140
				141	<p>The core interface to the Preprocessor object (once it is set up) is the
				142	Preprocessor::Lex method, which returns the next <a href="#Token">Token</a> from
				143	the preprocessor stream. There are two types of token providers that the
				144	preprocessor is capable of reading from: a buffer lexer (provided by the <a
				145	href="#Lexer">Lexer</a> class) and a buffered token stream (provided by the <a
Chris Lattner	7928125	2008-03-09 02:27:26 +0000	[diff] [blame]	146	href="#TokenLexer">TokenLexer</a> class).
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	147
				148
				149	<!-- ======================================================================= -->
				150	<h3 id="Token">The Token class</h3>
				151	<!-- ======================================================================= -->
				152
				153	<p>The Token class is used to represent a single lexed token. Tokens are
				154	intended to be used by the lexer/preprocess and parser libraries, but are not
				155	intended to live beyond them (for example, they should not live in the ASTs).<p>
				156
				157	<p>Tokens most often live on the stack (or some other location that is efficient
				158	to access) as the parser is running, but occasionally do get buffered up. For
				159	example, macro definitions are stored as a series of tokens, and the C++
				160	front-end will eventually need to buffer tokens up for tentative parsing and
				161	various pieces of look-ahead. As such, the size of a Token matter. On a 32-bit
				162	system, sizeof(Token) is currently 16 bytes.</p>
				163
				164	<p>Tokens contain the following information:</p>
				165
				166	<ul>
				167	<li><b>A SourceLocation</b> - This indicates the location of the start of the
				168	token.</li>
				169
				170	<li><b>A length</b> - This stores the length of the token as stored in the
				171	SourceBuffer. For tokens that include them, this length includes trigraphs and
				172	escaped newlines which are ignored by later phases of the compiler. By pointing
				173	into the original source buffer, it is always possible to get the original
				174	spelling of a token completely accurately.</li>
				175
				176	<li><b>IdentifierInfo</b> - If a token takes the form of an identifier, and if
				177	identifier lookup was enabled when the token was lexed (e.g. the lexer was not
				178	reading in 'raw' mode) this contains a pointer to the unique hash value for the
				179	identifier. Because the lookup happens before keyword identification, this
				180	field is set even for language keywords like 'for'.</li>
				181
				182	<li><b>TokenKind</b> - This indicates the kind of token as classified by the
				183	lexer. This includes things like <tt>tok::starequal</tt> (for the "*="
				184	operator), <tt>tok::ampamp</tt> for the "&&" token, and keyword values
				185	(e.g. <tt>tok::kw_for</tt>) for identifiers that correspond to keywords. Note
				186	that some tokens can be spelled multiple ways. For example, C++ supports
				187	"operator keywords", where things like "and" are treated exactly like the
				188	"&&" operator. In these cases, the kind value is set to
				189	<tt>tok::ampamp</tt>, which is good for the parser, which doesn't have to
				190	consider both forms. For something that cares about which form is used (e.g.
				191	the preprocessor 'stringize' operator) the spelling indicates the original
				192	form.</li>
				193
				194	<li><b>Flags</b> - There are currently four flags tracked by the
				195	lexer/preprocessor system on a per-token basis:
				196
				197	<ol>
				198	<li><b>StartOfLine</b> - This was the first token that occurred on its input
				199	source line.</li>
				200	<li><b>LeadingSpace</b> - There was a space character either immediately
				201	before the token or transitively before the token as it was expanded
				202	through a macro. The definition of this flag is very closely defined by
				203	the stringizing requirements of the preprocessor.</li>
				204	<li><b>DisableExpand</b> - This flag is used internally to the preprocessor to
				205	represent identifier tokens which have macro expansion disabled. This
				206	prevents them from being considered as candidates for macro expansion ever
				207	in the future.</li>
				208	<li><b>NeedsCleaning</b> - This flag is set if the original spelling for the
				209	token includes a trigraph or escaped newline. Since this is uncommon,
				210	many pieces of code can fast-path on tokens that did not need cleaning.
				211	</p>
				212	</ol>
				213	</li>
				214	</ul>
				215
				216	<p>One interesting (and somewhat unusual) aspect of tokens is that they don't
				217	contain any semantic information about the lexed value. For example, if the
				218	token was a pp-number token, we do not represent the value of the number that
				219	was lexed (this is left for later pieces of code to decide). Additionally, the
				220	lexer library has no notion of typedef names vs variable names: both are
				221	returned as identifiers, and the parser is left to decide whether a specific
				222	identifier is a typedef or a variable (tracking this requires scope information
				223	among other things).</p>
				224
				225	<!-- ======================================================================= -->
				226	<h3 id="Lexer">The Lexer class</h3>
				227	<!-- ======================================================================= -->
				228
				229	<p>The Lexer class provides the mechanics of lexing tokens out of a source
				230	buffer and deciding what they mean. The Lexer is complicated by the fact that
				231	it operates on raw buffers that have not had spelling eliminated (this is a
				232	necessity to get decent performance), but this is countered with careful coding
				233	as well as standard performance techniques (for example, the comment handling
				234	code is vectorized on X86 and PowerPC hosts).</p>
				235
				236	<p>The lexer has a couple of interesting modal features:</p>
				237
				238	<ul>
				239	<li>The lexer can operate in 'raw' mode. This mode has several features that
				240	make it possible to quickly lex the file (e.g. it stops identifier lookup,
				241	doesn't specially handle preprocessor tokens, handles EOF differently, etc).
				242	This mode is used for lexing within an "<tt>#if 0</tt>" block, for
				243	example.</li>
				244	<li>The lexer can capture and return comments as tokens. This is required to
				245	support the -C preprocessor mode, which passes comments through, and is
				246	used by the diagnostic checker to identifier expect-error annotations.</li>
				247	<li>The lexer can be in ParsingFilename mode, which happens when preprocessing
Chris Lattner	8438624	2007-09-16 19:25:23 +0000	[diff] [blame]	248	after reading a #include directive. This mode changes the parsing of '<'
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	249	to return an "angled string" instead of a bunch of tokens for each thing
				250	within the filename.</li>
				251	<li>When parsing a preprocessor directive (after "<tt>#</tt>") the
				252	ParsingPreprocessorDirective mode is entered. This changes the parser to
				253	return EOM at a newline.</li>
				254	<li>The Lexer uses a LangOptions object to know whether trigraphs are enabled,
				255	whether C++ or ObjC keywords are recognized, etc.</li>
				256	</ul>
				257
				258	<p>In addition to these modes, the lexer keeps track of a couple of other
				259	features that are local to a lexed buffer, which change as the buffer is
				260	lexed:</p>
				261
				262	<ul>
				263	<li>The Lexer uses BufferPtr to keep track of the current character being
				264	lexed.</li>
				265	<li>The Lexer uses IsAtStartOfLine to keep track of whether the next lexed token
				266	will start with its "start of line" bit set.</li>
				267	<li>The Lexer keeps track of the current #if directives that are active (which
				268	can be nested).</li>
				269	<li>The Lexer keeps track of an <a href="#MultipleIncludeOpt">
				270	MultipleIncludeOpt</a> object, which is used to
				271	detect whether the buffer uses the standard "<tt>#ifndef XX</tt> /
				272	<tt>#define XX</tt>" idiom to prevent multiple inclusion. If a buffer does,
				273	subsequent includes can be ignored if the XX macro is defined.</li>
				274	</ul>
				275
				276	<!-- ======================================================================= -->
Chris Lattner	7928125	2008-03-09 02:27:26 +0000	[diff] [blame]	277	<h3 id="TokenLexer">The TokenLexer class</h3>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	278	<!-- ======================================================================= -->
				279
Chris Lattner	7928125	2008-03-09 02:27:26 +0000	[diff] [blame]	280	<p>The TokenLexer class is a token provider that returns tokens from a list
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	281	of tokens that came from somewhere else. It typically used for two things: 1)
				282	returning tokens from a macro definition as it is being expanded 2) returning
				283	tokens from an arbitrary buffer of tokens. The later use is used by _Pragma and
				284	will most likely be used to handle unbounded look-ahead for the C++ parser.</p>
				285
				286	<!-- ======================================================================= -->
				287	<h3 id="MultipleIncludeOpt">The MultipleIncludeOpt class</h3>
				288	<!-- ======================================================================= -->
				289
				290	<p>The MultipleIncludeOpt class implements a really simple little state machine
				291	that is used to detect the standard "<tt>#ifndef XX</tt> / <tt>#define XX</tt>"
				292	idiom that people typically use to prevent multiple inclusion of headers. If a
				293	buffer uses this idiom and is subsequently #include'd, the preprocessor can
				294	simply check to see whether the guarding condition is defined or not. If so,
				295	the preprocessor can completely ignore the include of the header.</p>
				296
				297
				298
				299	<!-- ======================================================================= -->
				300	<h2 id="libparse">The Parser Library</h2>
				301	<!-- ======================================================================= -->
				302
				303	<!-- ======================================================================= -->
				304	<h2 id="libast">The AST Library</h2>
				305	<!-- ======================================================================= -->
				306
				307	<!-- ======================================================================= -->
				308	<h3 id="Type">The Type class and its subclasses</h3>
				309	<!-- ======================================================================= -->
				310
				311	<p>The Type class (and its subclasses) are an important part of the AST. Types
				312	are accessed through the ASTContext class, which implicitly creates and uniques
				313	them as they are needed. Types have a couple of non-obvious features: 1) they
				314	do not capture type qualifiers like const or volatile (See
				315	<a href="#QualType">QualType</a>), and 2) they implicitly capture typedef
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	316	information. Once created, types are immutable (unlike decls).</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	317
				318	<p>Typedefs in C make semantic analysis a bit more complex than it would
				319	be without them. The issue is that we want to capture typedef information
				320	and represent it in the AST perfectly, but the semantics of operations need to
				321	"see through" typedefs. For example, consider this code:</p>
				322
				323	<code>
				324	void func() {<br>
Bill Wendling	30d1775	2007-10-06 01:56:01 +0000	[diff] [blame]	325	typedef int foo;<br>
				326	foo X, *Y;<br>
				327	typedef foo* bar;<br>
				328	bar Z;<br>
				329	*X; <i>// error</i><br>
				330	**Y; <i>// error</i><br>
				331	**Z; <i>// error</i><br>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	332	}<br>
				333	</code>
				334
				335	<p>The code above is illegal, and thus we expect there to be diagnostics emitted
				336	on the annotated lines. In this example, we expect to get:</p>
				337
				338	<pre>
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	339	<b>test.c:6:1: error: indirection requires pointer operand ('foo' invalid)</b>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	340	*X; // error
				341	<font color="blue">^~</font>
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	342	<b>test.c:7:1: error: indirection requires pointer operand ('foo' invalid)</b>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	343	**Y; // error
				344	<font color="blue">^~~</font>
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	345	<b>test.c:8:1: error: indirection requires pointer operand ('foo' invalid)</b>
				346	**Z; // error
				347	<font color="blue">^~~</font>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	348	</pre>
				349
				350	<p>While this example is somewhat silly, it illustrates the point: we want to
				351	retain typedef information where possible, so that we can emit errors about
				352	"<tt>std::string</tt>" instead of "<tt>std::basic_string<char, std:...</tt>".
				353	Doing this requires properly keeping typedef information (for example, the type
				354	of "X" is "foo", not "int"), and requires properly propagating it through the
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	355	various operators (for example, the type of *Y is "foo", not "int"). In order
				356	to retain this information, the type of these expressions is an instance of the
				357	TypedefType class, which indicates that the type of these expressions is a
				358	typedef for foo.
				359	</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	360
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	361	<p>Representing types like this is great for diagnostics, because the
				362	user-specified type is always immediately available. There are two problems
				363	with this: first, various semantic checks need to make judgements about the
Chris Lattner	33fc68a	2007-07-31 18:54:50 +0000	[diff] [blame]	364	<em>actual structure</em> of a type, ignoring typdefs. Second, we need an
				365	efficient way to query whether two types are structurally identical to each
				366	other, ignoring typedefs. The solution to both of these problems is the idea of
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	367	canonical types.</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	368
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	369	<h4>Canonical Types</h4>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	370
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	371	<p>Every instance of the Type class contains a canonical type pointer. For
				372	simple types with no typedefs involved (e.g. "<tt>int</tt>", "<tt>int*</tt>",
				373	"<tt>int**</tt>"), the type just points to itself. For types that have a
				374	typedef somewhere in their structure (e.g. "<tt>foo</tt>", "<tt>foo*</tt>",
				375	"<tt>foo**</tt>", "<tt>bar</tt>"), the canonical type pointer points to their
				376	structurally equivalent type without any typedefs (e.g. "<tt>int</tt>",
				377	"<tt>int</tt>", "<tt>int</tt>", and "<tt>int</tt>" respectively).</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	378
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	379	<p>This design provides a constant time operation (dereferencing the canonical
				380	type pointer) that gives us access to the structure of types. For example,
				381	we can trivially tell that "bar" and "foo*" are the same type by dereferencing
				382	their canonical type pointers and doing a pointer comparison (they both point
				383	to the single "<tt>int*</tt>" type).</p>
				384
				385	<p>Canonical types and typedef types bring up some complexities that must be
				386	carefully managed. Specifically, the "isa/cast/dyncast" operators generally
				387	shouldn't be used in code that is inspecting the AST. For example, when type
				388	checking the indirection operator (unary '*' on a pointer), the type checker
				389	must verify that the operand has a pointer type. It would not be correct to
				390	check that with "<tt>isa<PointerType>(SubExpr->getType())</tt>",
				391	because this predicate would fail if the subexpression had a typedef type.</p>
				392
				393	<p>The solution to this problem are a set of helper methods on Type, used to
				394	check their properties. In this case, it would be correct to use
				395	"<tt>SubExpr->getType()->isPointerType()</tt>" to do the check. This
				396	predicate will return true if the <em>canonical type is a pointer</em>, which is
				397	true any time the type is structurally a pointer type. The only hard part here
				398	is remembering not to use the <tt>isa/cast/dyncast</tt> operations.</p>
				399
				400	<p>The second problem we face is how to get access to the pointer type once we
				401	know it exists. To continue the example, the result type of the indirection
				402	operator is the pointee type of the subexpression. In order to determine the
				403	type, we need to get the instance of PointerType that best captures the typedef
				404	information in the program. If the type of the expression is literally a
				405	PointerType, we can return that, otherwise we have to dig through the
				406	typedefs to find the pointer type. For example, if the subexpression had type
				407	"<tt>foo*</tt>", we could return that type as the result. If the subexpression
				408	had type "<tt>bar</tt>", we want to return "<tt>foo*</tt>" (note that we do
				409	<em>not</em> want "<tt>int*</tt>"). In order to provide all of this, Type has
Chris Lattner	11406c1	2007-07-31 16:50:51 +0000	[diff] [blame]	410	a getAsPointerType() method that checks whether the type is structurally a
Chris Lattner	8a2bc62	2007-07-31 06:37:39 +0000	[diff] [blame]	411	PointerType and, if so, returns the best one. If not, it returns a null
				412	pointer.</p>
				413
				414	<p>This structure is somewhat mystical, but after meditating on it, it will
				415	make sense to you :).</p>
Chris Lattner	86920d3	2007-07-31 05:42:17 +0000	[diff] [blame]	416
				417	<!-- ======================================================================= -->
				418	<h3 id="QualType">The QualType class</h3>
				419	<!-- ======================================================================= -->
				420
				421	<p>The QualType class is designed as a trivial value class that is small,
				422	passed by-value and is efficient to query. The idea of QualType is that it
				423	stores the type qualifiers (const, volatile, restrict) separately from the types
				424	themselves: QualType is conceptually a pair of "Type*" and bits for the type
				425	qualifiers.</p>
				426
				427	<p>By storing the type qualifiers as bits in the conceptual pair, it is
				428	extremely efficient to get the set of qualifiers on a QualType (just return the
				429	field of the pair), add a type qualifier (which is a trivial constant-time
				430	operation that sets a bit), and remove one or more type qualifiers (just return
				431	a QualType with the bitfield set to empty).</p>
				432
				433	<p>Further, because the bits are stored outside of the type itself, we do not
				434	need to create duplicates of types with different sets of qualifiers (i.e. there
				435	is only a single heap allocated "int" type: "const int" and "volatile const int"
				436	both point to the same heap allocated "int" type). This reduces the heap size
				437	used to represent bits and also means we do not have to consider qualifiers when
				438	uniquing types (<a href="#Type">Type</a> does not even contain qualifiers).</p>
				439
				440	<p>In practice, on hosts where it is safe, the 3 type qualifiers are stored in
				441	the low bit of the pointer to the Type object. This means that QualType is
				442	exactly the same size as a pointer, and this works fine on any system where
				443	malloc'd objects are at least 8 byte aligned.</p>
Ted Kremenek	8bc0571	2007-10-10 23:01:43 +0000	[diff] [blame]	444
				445	<!-- ======================================================================= -->
				446	<h3 id="CFG">The <tt>CFG</tt> class</h3>
				447	<!-- ======================================================================= -->
				448
				449	<p>The <tt>CFG</tt> class is designed to represent a source-level
				450	control-flow graph for a single statement (<tt>Stmt*</tt>). Typically
				451	instances of <tt>CFG</tt> are constructed for function bodies (usually
				452	an instance of <tt>CompoundStmt</tt>), but can also be instantiated to
				453	represent the control-flow of any class that subclasses <tt>Stmt</tt>,
				454	which includes simple expressions. Control-flow graphs are especially
				455	useful for performing
				456	<a href="http://en.wikipedia.org/wiki/Data_flow_analysis#Sensitivities">flow-
				457	or path-sensitive</a> program analyses on a given function.</p>
				458
				459	<h4>Basic Blocks</h4>
				460
				461	<p>Concretely, an instance of <tt>CFG</tt> is a collection of basic
				462	blocks. Each basic block is an instance of <tt>CFGBlock</tt>, which
				463	simply contains an ordered sequence of <tt>Stmt*</tt> (each referring
				464	to statements in the AST). The ordering of statements within a block
				465	indicates unconditional flow of control from one statement to the
				466	next. <a href="#ConditionalControlFlow">Conditional control-flow</a>
				467	is represented using edges between basic blocks. The statements
				468	within a given <tt>CFGBlock</tt> can be traversed using
				469	the <tt>CFGBlock::*iterator</tt> interface.</p>
				470
				471	<p>
Ted Kremenek	18e17e7	2007-10-18 22:50:52 +0000	[diff] [blame]	472	A <tt>CFG</tt> object owns the instances of <tt>CFGBlock</tt> within
Ted Kremenek	8bc0571	2007-10-10 23:01:43 +0000	[diff] [blame]	473	the control-flow graph it represents. Each <tt>CFGBlock</tt> within a
				474	CFG is also uniquely numbered (accessible
				475	via <tt>CFGBlock::getBlockID()</tt>). Currently the number is
				476	based on the ordering the blocks were created, but no assumptions
				477	should be made on how <tt>CFGBlock</tt>s are numbered other than their
				478	numbers are unique and that they are numbered from 0..N-1 (where N is
				479	the number of basic blocks in the CFG).</p>
				480
				481	<h4>Entry and Exit Blocks</h4>
				482
				483	Each instance of <tt>CFG</tt> contains two special blocks:
				484	an <i>entry</i> block (accessible via <tt>CFG::getEntry()</tt>), which
				485	has no incoming edges, and an <i>exit</i> block (accessible
				486	via <tt>CFG::getExit()</tt>), which has no outgoing edges. Neither
				487	block contains any statements, and they serve the role of providing a
				488	clear entrance and exit for a body of code such as a function body.
				489	The presence of these empty blocks greatly simplifies the
				490	implementation of many analyses built on top of CFGs.
				491
				492	<h4 id ="ConditionalControlFlow">Conditional Control-Flow</h4>
				493
				494	<p>Conditional control-flow (such as those induced by if-statements
				495	and loops) is represented as edges between <tt>CFGBlock</tt>s.
				496	Because different C language constructs can induce control-flow,
				497	each <tt>CFGBlock</tt> also records an extra <tt>Stmt*</tt> that
				498	represents the <i>terminator</i> of the block. A terminator is simply
				499	the statement that caused the control-flow, and is used to identify
				500	the nature of the conditional control-flow between blocks. For
				501	example, in the case of an if-statement, the terminator refers to
				502	the <tt>IfStmt</tt> object in the AST that represented the given
				503	branch.</p>
				504
				505	<p>To illustrate, consider the following code example:</p>
				506
				507	<code>
				508	int foo(int x) {<br>
				509	x = x + 1;<br>
				510	<br>
				511	if (x > 2) x++;<br>
				512	else {<br>
				513	x += 2;<br>
				514	x *= 2;<br>
				515	}<br>
				516	<br>
				517	return x;<br>
				518	}
				519	</code>
				520
				521	<p>After invoking the parser+semantic analyzer on this code fragment,
				522	the AST of the body of <tt>foo</tt> is referenced by a
				523	single <tt>Stmt*</tt>. We can then construct an instance
				524	of <tt>CFG</tt> representing the control-flow graph of this function
				525	body by single call to a static class method:</p>
				526
				527	<code>
				528	Stmt* FooBody = ...<br>
				529	CFG* FooCFG = <b>CFG::buildCFG</b>(FooBody);
				530	</code>
				531
				532	<p>It is the responsibility of the caller of <tt>CFG::buildCFG</tt>
				533	to <tt>delete</tt> the returned <tt>CFG*</tt> when the CFG is no
				534	longer needed.</p>
				535
				536	<p>Along with providing an interface to iterate over
				537	its <tt>CFGBlock</tt>s, the <tt>CFG</tt> class also provides methods
				538	that are useful for debugging and visualizing CFGs. For example, the
				539	method
				540	<tt>CFG::dump()</tt> dumps a pretty-printed version of the CFG to
				541	standard error. This is especially useful when one is using a
				542	debugger such as gdb. For example, here is the output
				543	of <tt>FooCFG->dump()</tt>:</p>
				544
				545	<code>
				546	[ B5 (ENTRY) ]<br>
				547	Predecessors (0):<br>
				548	Successors (1): B4<br>
				549	<br>
				550	[ B4 ]<br>
				551	1: x = x + 1<br>
				552	2: (x > 2)<br>
				553	<b>T: if [B4.2]</b><br>
				554	Predecessors (1): B5<br>
				555	Successors (2): B3 B2<br>
				556	<br>
				557	[ B3 ]<br>
				558	1: x++<br>
				559	Predecessors (1): B4<br>
				560	Successors (1): B1<br>
				561	<br>
				562	[ B2 ]<br>
				563	1: x += 2<br>
				564	2: x *= 2<br>
				565	Predecessors (1): B4<br>
				566	Successors (1): B1<br>
				567	<br>
				568	[ B1 ]<br>
				569	1: return x;<br>
				570	Predecessors (2): B2 B3<br>
				571	Successors (1): B0<br>
				572	<br>
				573	[ B0 (EXIT) ]<br>
				574	Predecessors (1): B1<br>
				575	Successors (0):
				576	</code>
				577
				578	<p>For each block, the pretty-printed output displays for each block
				579	the number of <i>predecessor</i> blocks (blocks that have outgoing
				580	control-flow to the given block) and <i>successor</i> blocks (blocks
				581	that have control-flow that have incoming control-flow from the given
				582	block). We can also clearly see the special entry and exit blocks at
				583	the beginning and end of the pretty-printed output. For the entry
				584	block (block B5), the number of predecessor blocks is 0, while for the
				585	exit block (block B0) the number of successor blocks is 0.</p>
				586
				587	<p>The most interesting block here is B4, whose outgoing control-flow
				588	represents the branching caused by the sole if-statement
				589	in <tt>foo</tt>. Of particular interest is the second statement in
				590	the block, <b><tt>(x > 2)</tt></b>, and the terminator, printed
				591	as <b><tt>if [B4.2]</tt></b>. The second statement represents the
				592	evaluation of the condition of the if-statement, which occurs before
				593	the actual branching of control-flow. Within the <tt>CFGBlock</tt>
				594	for B4, the <tt>Stmt*</tt> for the second statement refers to the
				595	actual expression in the AST for <b><tt>(x > 2)</tt></b>. Thus
				596	pointers to subclasses of <tt>Expr</tt> can appear in the list of
				597	statements in a block, and not just subclasses of <tt>Stmt</tt> that
				598	refer to proper C statements.</p>
				599
				600	<p>The terminator of block B4 is a pointer to the <tt>IfStmt</tt>
				601	object in the AST. The pretty-printer outputs <b><tt>if
				602	[B4.2]</tt></b> because the condition expression of the if-statement
				603	has an actual place in the basic block, and thus the terminator is
				604	essentially
				605	<i>referring</i> to the expression that is the second statement of
				606	block B4 (i.e., B4.2). In this manner, conditions for control-flow
				607	(which also includes conditions for loops and switch statements) are
				608	hoisted into the actual basic block.</p>
				609
Ted Kremenek	98f19b6	2007-10-10 23:22:00 +0000	[diff] [blame]	610	<!--
Ted Kremenek	8bc0571	2007-10-10 23:01:43 +0000	[diff] [blame]	611	<h4>Implicit Control-Flow</h4>
Ted Kremenek	98f19b6	2007-10-10 23:22:00 +0000	[diff] [blame]	612	-->
Ted Kremenek	8bc0571	2007-10-10 23:01:43 +0000	[diff] [blame]	613
				614	<!--
				615	<p>A key design principle of the <tt>CFG</tt> class was to not require
				616	any transformations to the AST in order to represent control-flow.
				617	Thus the <tt>CFG</tt> does not perform any "lowering" of the
				618	statements in an AST: loops are not transformed into guarded gotos,
				619	short-circuit operations are not converted to a set of if-statements,
				620	and so on.</p>
				621	-->
Ted Kremenek	17a295d	2008-06-11 06:19:49 +0000	[diff] [blame]	622
Chris Lattner	7bad199	2008-11-16 21:48:07 +0000	[diff] [blame^]	623
				624	<!-- ======================================================================= -->
				625	<h3 id="Constants">Constant Folding in the Clang AST</h3>
				626	<!-- ======================================================================= -->
				627
				628	<p>There are several places where constants and constant folding matter a lot to
				629	the Clang front-end. First, in general, we prefer the AST to retain the source
				630	code as close to how the user wrote it as possible. This means that if they
				631	wrote "5+4", we want to keep the addition and two constants in the AST, we don't
				632	want to fold to "9". This means that constant folding in various ways turns
				633	into a tree walk that needs to handle the various cases.</p>
				634
				635	<p>However, there are places in both C and C++ that require constants to be
				636	folded. For example, the C standard defines what an "integer constant
				637	expression" (i-c-e) is with very precise and specific requirements. The
				638	language then requires i-c-e's in a lot of places (for example, the size of a
				639	bitfield, the value for a case statement, etc). For these, we have to be able
				640	to constant fold the constants, to do semantic checks (e.g. verify bitfield size
				641	is non-negative and that case statements aren't duplicated). We aim for Clang
				642	to be very pedantic about this, diagnosing cases when the code does not use an
				643	i-c-e where one is required, but accepting the code unless running with
				644	<tt>-pedantic-errors</tt>.</p>
				645
				646	<p>Things get a little bit more tricky when it comes to compatibility with
				647	real-world source code. Specifically, GCC has historically accepted a huge
				648	superset of expressions as i-c-e's, and a lot of real world code depends on this
				649	unfortuate accident of history (including, e.g., the glibc system headers). GCC
				650	accepts anything its "fold" optimizer is capable of reducing to an integer
				651	constant, which means that the definition of what it accepts changes as its
				652	optimizer does. One example is that GCC accepts things like "case X-X:" even
				653	when X is a variable, because it can fold this to 0.</p>
				654
				655	<p>Another issue are how constants interact with the extensions we support, such
				656	as __builtin_constant_p, __builtin_inf, __extension__ and many others. C99
				657	obviously does not specify the semantics of any of these extensions, and the
				658	definition of i-c-e does not include them. However, these extensions are often
				659	used in real code, and we have to have a way to reason about them.</p>
				660
				661	<p>Finally, this is not just a problem for semantic analysis. The code
				662	generator and other clients have to be able to fold constants (e.g. to
				663	initialize global variables) and has to handle a superset of what C99 allows.
				664	Further, these clients can benefit from extended information. For example, we
				665	know that "foo()\|\|1" always evaluates to true, but we can't replace the
				666	expression with true because it has side effects.</p>
				667
				668	<!-- ======================= -->
				669	<h4>Implementation Approach</h4>
				670	<!-- ======================= -->
				671
				672	<p>After trying several different approaches, we've finally converged on a
				673	design (Note, at the time of this writing, not all of this has been implemented,
				674	consider this a design goal!). Our basic approach is to define a single
				675	recursive method evaluation method (<tt>Expr::Evaluate</tt>), which is
				676	implemented in <tt>AST/ExprConstant.cpp</tt>. Given an expression with 'scalar'
				677	type (integer, fp, complex, or pointer) this method returns the following
				678	information:</p>
				679
				680	<ul>
				681	<li>Whether the expression is an integer constant expression, a general
				682	constant that was folded but has no side effects, a general constant that
				683	was folded but that does have side effects, or an uncomputable/unfoldable
				684	value.
				685	</li>
				686	<li>If the expression was computable in any way, this method returns the APValue
				687	for the result of the expression.</li>
				688	<li>If the expression is not evaluatable at all, this method returns
				689	information on one of the problems with the expression. This includes a
				690	SourceLocation for where the problem is, and a diagnostic ID that explains
				691	the problem. The diagnostic should be have ERROR type.</li>
				692	<li>If the expression is not an integer constant expression, this method returns
				693	information on one of the problems with the expression. This includes a
				694	SourceLocation for where the problem is, and a diagnostic ID that explains
				695	the problem. The diagnostic should be have EXTENSION type.</li>
				696	</ul>
				697
				698	<p>This information gives various clients the flexibility that they want, and we
				699	will eventually have some helper methods for various extensions. For example,
				700	Sema should have a <tt>Sema::VerifyIntegerConstantExpression</tt> method, which
				701	calls Evaluate on the expression. If the expression is not foldable, the error
				702	is emitted, and it would return true. If the expression is not an i-c-e, the
				703	EXTENSION diagnostic is emitted. Finally it would return false to indicate that
				704	the AST is ok.</p>
				705
				706	<p>Other clients can use the information in other ways, for example, codegen can
				707	just use expressions that are foldable in any way.</p>
				708
				709	<!-- ========== -->
				710	<h4>Extensions</h4>
				711	<!-- ========== -->
				712
				713	<p>This section describes how some of the various extensions clang supports
				714	interacts with constant evaluation:</p>
				715
				716	<ul>
				717	<li><b><tt>__extension__</tt></b>: The expression form of this extension causes
				718	any evaluatable subexpression to be accepted as an integer constant
				719	expression.</li>
				720	<li><b><tt>__builtin_constant_p</tt></b>: This returns true (as a integer
				721	constant expression) if the operand is any evaluatable constant.</li>
				722	<li><b><tt>__builtin_choose_expr</tt></b>: The condition is required to be an
				723	integer constant expression, but we accept any constant as an "extension of
				724	an extension". This only evaluates one operand depending on which way the
				725	condition evaluates.</li>
				726	<li><b><tt>__builtin_classify_type</tt></b>: This always returns an integer
				727	constant expression.</li>
				728	<li><b><tt>__builtin_inf,nan,..</tt></b>: These are treated just like a
				729	floating-point literal.</li>
				730	<li><b><tt>__builtin_abs,copysign,..</tt></b>: These are constant folded as
				731	general constant expressions.</li>
				732	</ul>
				733
				734
				735
				736
Ted Kremenek	17a295d	2008-06-11 06:19:49 +0000	[diff] [blame]	737	</div>
				738	</body>
				739	</html>