Blame - docs/tutorial/LangImpl2.html - fp2-dev/platform/external/llvm

blob: c153aa2c973f48e7a8431f462415b5886fed5c27 [file] [log] [blame]

Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame^]	1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
				2	"http://www.w3.org/TR/html4/strict.dtd">
				3
				4	<html>
				5	<head>
				6	<title>Kaleidoscope: Implementing a Parser and AST</title>
				7	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
				8	<meta name="author" content="Chris Lattner">
				9	<link rel="stylesheet" href="../llvm.css" type="text/css">
				10	</head>
				11
				12	<body>
				13
				14	<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
				15
				16	<div class="doc_author">
				17	<p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
				18	</div>
				19
				20	<!-- *********************************************************************** -->
				21	<div class="doc_section"><a name="intro">Part 2 Introduction</a></div>
				22	<!-- *********************************************************************** -->
				23
				24	<div class="doc_text">
				25
				26	<p>Welcome to part 2 of the "<a href="index.html">Implementing a language with
				27	LLVM</a>" tutorial. This chapter shows you how to use the <a
				28	href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
				29	href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
				30	our Kaleidoscope language and build an <a
				31	href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
				32	Tree</a> (AST).</p>
				33
				34	<p>The parser we will build uses a combination of <a
				35	href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
				36	Parsing</a> and <a href=
				37	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				38	Parsing</a> to parse the Kaleidoscope language (the later for binary expression
				39	and the former for everything else). Before we get to parsing though, lets talk
				40	about the output of the parser: the Abstract Syntax Tree.</p>
				41
				42	</div>
				43
				44	<!-- *********************************************************************** -->
				45	<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
				46	<!-- *********************************************************************** -->
				47
				48	<div class="doc_text">
				49
				50	<p>The AST for a program captures its behavior in a way that it is easy for
				51	later stages of the compiler (e.g. code generation) to interpret. We basically
				52	want one object for each construct in the language, and the AST should closely
				53	model the language. In Kaleidoscope, we have expressions, a prototype, and a
				54	function object. We'll start with expressions first:</p>
				55
				56	<div class="doc_code">
				57	<pre>
				58	/// ExprAST - Base class for all expression nodes.
				59	class ExprAST {
				60	public:
				61	virtual ~ExprAST() {}
				62	};
				63
				64	/// NumberExprAST - Expression class for numeric literals like "1.0".
				65	class NumberExprAST : public ExprAST {
				66	double Val;
				67	public:
				68	NumberExprAST(double val) : Val(val) {}
				69	};
				70	</pre>
				71	</div>
				72
				73	<p>The code above shows the definition of the base ExprAST class and one
				74	subclass which we use for numeric literals. The important thing about this is
				75	that the NumberExprAST class captures the numeric value of the literal in the
				76	class, so that later phases of the compiler can know what it is.</p>
				77
				78	<p>Right now we only create the AST, so there are no useful accessor methods on
				79	them. It would be very easy to add a virtual method to pretty print the code,
				80	for example. Here are the other expression AST node definitions that we'll use
				81	in the basic form of the Kaleidoscope language.
				82	</p>
				83
				84	<div class="doc_code">
				85	<pre>
				86	/// VariableExprAST - Expression class for referencing a variable, like "a".
				87	class VariableExprAST : public ExprAST {
				88	std::string Name;
				89	public:
				90	VariableExprAST(const std::string &name) : Name(name) {}
				91	};
				92
				93	/// BinaryExprAST - Expression class for a binary operator.
				94	class BinaryExprAST : public ExprAST {
				95	char Op;
				96	ExprAST LHS, RHS;
				97	public:
				98	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				99	: Op(op), LHS(lhs), RHS(rhs) {}
				100	};
				101
				102	/// CallExprAST - Expression class for function calls.
				103	class CallExprAST : public ExprAST {
				104	std::string Callee;
				105	std::vector<ExprAST*> Args;
				106	public:
				107	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				108	: Callee(callee), Args(args) {}
				109	};
				110	</pre>
				111	</div>
				112
				113	<p>This is all (intentially) rather straight-forward: variables capture the
				114	variable name, binary operators capture their opcode (e.g. '+'), and calls
				115	capture a function name and list of argument expressions. One thing that is
				116	nice about our AST is that it captures the language features without talking
				117	about the syntax of the language. Note that there is no discussion about
				118	precedence of binary operators, lexical structure etc.</p>
				119
				120	<p>For our basic language, these are all of the expression nodes we'll define.
				121	because it doesn't have conditional control flow, it isn't turing complete:
				122	we'll fix that in a later installment. The two things we need next are a way
				123	to talk about the interface to a function, and a way to talk about functions
				124	themselves:</p>
				125
				126	<div class="doc_code">
				127	<pre>
				128	/// PrototypeAST - This class represents the "prototype" for a function,
				129	/// which captures its argument names as well as if it is an operator.
				130	class PrototypeAST {
				131	std::string Name;
				132	std::vector<std::string> Args;
				133	public:
				134	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				135	: Name(name), Args(args) {}
				136	};
				137
				138	/// FunctionAST - This class represents a function definition itself.
				139	class FunctionAST {
				140	PrototypeAST *Proto;
				141	ExprAST *Body;
				142	public:
				143	FunctionAST(PrototypeAST proto, ExprAST body)
				144	: Proto(proto), Body(body) {}
				145	};
				146	</pre>
				147	</div>
				148
				149	<p>In Kaleidoscope, functions are typed with just a count of their arguments.
				150	Since all values are double precision floating point, this fact doesn't need to
				151	be captured anywhere. In a more aggressive and realistic language, the
				152	"ExprAST" class would probably have a type field.</p>
				153
				154	<p>With this scaffolding, we can now talk about parsing expressions and function
				155	bodies in Kaleidoscope.</p>
				156
				157	</div>
				158
				159	<!-- *********************************************************************** -->
				160	<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
				161	<!-- *********************************************************************** -->
				162
				163	<div class="doc_text">
				164
				165	<p>Now that we have an AST to build, we need to define the parser code to build
				166	it. The idea here is that we want to parse something like "x+y" (which is
				167	returned as three tokens by the lexer) into an AST that could be generated with
				168	calls like this:</p>
				169
				170	<div class="doc_code">
				171	<pre>
				172	ExprAST *X = new VariableExprAST("x");
				173	ExprAST *Y = new VariableExprAST("y");
				174	ExprAST *Result = new BinaryExprAST('+', X, Y);
				175	</pre>
				176	</div>
				177
				178	<p>In order to do this, we'll start by defining some basic helper routines:</p>
				179
				180	<div class="doc_code">
				181	<pre>
				182	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
				183	/// token the parser it looking at. getNextToken reads another token from the
				184	/// lexer and updates CurTok with its results.
				185	static int CurTok;
				186	static int getNextToken() {
				187	return CurTok = gettok();
				188	}
				189	</pre>
				190	</div>
				191
				192	<p>
				193	This implements a simple token buffer around the lexer. This allows
				194	us to look one token ahead at what the lexer is returning. Every function in
				195	our lexer will assume that CurTok is the current token that needs to be
				196	parsed.</p>
				197
				198	<p>Again, we define
				199	these with global variables: it would be better design to wrap the entire parser
				200	in a class and use instance variables for these.
				201	</p>
				202
				203	<div class="doc_code">
				204	<pre>
				205
				206	/// Error* - These are little helper functions for error handling.
				207	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				208	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				209	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				210	</pre>
				211	</div>
				212
				213	<p>
				214	The <tt>Error</tt> routines are simple helper routines that our parser will use
				215	to handle errors. The error recovery in our parser will not be the best and
				216	are not particular user-friendly, but it will be enough for our tutorial. These
				217	routines make it easier to handle errors in routines that have various return
				218	types: they always return null.</p>
				219
				220	<p>With these basic helper functions implemented, we can implement the first
				221	piece of our grammar: we'll start with numeric literals.</p>
				222
				223	</div>
				224
				225	<!-- *********************************************************************** -->
				226	<div class="doc_section"><a name="parserprimexprs">Basic Expression
				227	Parsing</a></div>
				228	<!-- *********************************************************************** -->
				229
				230	<div class="doc_text">
				231
				232	<p>We start with numeric literals, because they are the simplest to process.
				233	For each production in our grammar, we'll define a function which parses that
				234	production. For numeric literals, we have:
				235	</p>
				236
				237	<div class="doc_code">
				238	<pre>
				239	/// numberexpr ::= number
				240	static ExprAST *ParseNumberExpr() {
				241	ExprAST *Result = new NumberExprAST(NumVal);
				242	getNextToken(); // consume the number
				243	return Result;
				244	}
				245	</pre>
				246	</div>
				247
				248	<p>This routine is very simple: it expects to be called when the current token
				249	is a <tt>tok_number</tt> token. It takes the current number value, creates
				250	a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
				251	returns.</p>
				252
				253	<p>There are some interesting aspects of this. The most important one is that
				254	this routine eats all of the tokens that correspond to the production, and
				255	returns the lexer buffer with the next token (which is not part of the grammar
				256	production) ready to go. This is a fairly standard way to go for recursive
				257	descent parsers. For a better example, the parenthesis operator is defined like
				258	this:</p>
				259
				260	<div class="doc_code">
				261	<pre>
				262	/// parenexpr ::= '(' expression ')'
				263	static ExprAST *ParseParenExpr() {
				264	getNextToken(); // eat (.
				265	ExprAST *V = ParseExpression();
				266	if (!V) return 0;
				267
				268	if (CurTok != ')')
				269	return Error("expected ')'");
				270	getNextToken(); // eat ).
				271	return V;
				272	}
				273	</pre>
				274	</div>
				275
				276	<p>This function illustrates a number of interesting things about the parser:
				277	1) it shows how we use the Error routines. When called, this function expects
				278	that the current token is a '(' token, but after parsing the subexpression, it
				279	is possible that there is not a ')' waiting. For example, if the user types in
				280	"(4 x" instead of "(4)", the parser should emit an error. Because errors can
				281	occur, the parser needs a way to indicate that they happened: in our parser, we
				282	return null on an error.</p>
				283
				284	<p>Another interesting aspect of this function is that it uses recursion by
				285	calling <tt>ParseExpression</tt> (we will soon see that ParseExpression can call
				286	<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
				287	recursive grammars, and keeps each production very simple. Note that
				288	parenthesis do not cause construction of AST nodes themselves. While we could
				289	do this, the most important role of parens are to guide the parser and provide
				290	grouping. Once the parser constructs the AST, parens are not needed.</p>
				291
				292	<p>The next simple production is for handling variable references and function
				293	calls:</p>
				294
				295	<div class="doc_code">
				296	<pre>
				297	/// identifierexpr
				298	/// ::= identifer
				299	/// ::= identifer '(' expression* ')'
				300	static ExprAST *ParseIdentifierExpr() {
				301	std::string IdName = IdentifierStr;
				302
				303	getNextToken(); // eat identifer.
				304
				305	if (CurTok != '(') // Simple variable ref.
				306	return new VariableExprAST(IdName);
				307
				308	// Call.
				309	getNextToken(); // eat (
				310	std::vector<ExprAST*> Args;
				311	while (1) {
				312	ExprAST *Arg = ParseExpression();
				313	if (!Arg) return 0;
				314	Args.push_back(Arg);
				315
				316	if (CurTok == ')') break;
				317
				318	if (CurTok != ',')
				319	return Error("Expected ')'");
				320	getNextToken();
				321	}
				322
				323	// Eat the ')'.
				324	getNextToken();
				325
				326	return new CallExprAST(IdName, Args);
				327	}
				328	</pre>
				329	</div>
				330
				331	<p>This routine follows the same style as the other routines (it expects to be
				332	called if the current token is a <tt>tok_identifier</tt> token). It also has
				333	recursion and error handling. One interesting aspect of this is that it uses
				334	<em>look-ahead</em> to determine if the current identifier is a stand alone
				335	variable reference or if it is a function call expression. It handles this by
				336	checking to see if the token after the identifier is a '(' token, and constructs
				337	either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
				338	</p>
				339
				340	<p>Now that we have all of our simple expression parsing logic in place, we can
				341	define a helper function to wrap them up in a class. We call this class of
				342	expressions "primary" expressions, for reasons that will become more clear
				343	later. In order to parse a primary expression, we need to determine what sort
				344	of expression it is:</p>
				345
				346	<div class="doc_code">
				347	<pre>
				348	/// primary
				349	/// ::= identifierexpr
				350	/// ::= numberexpr
				351	/// ::= parenexpr
				352	static ExprAST *ParsePrimary() {
				353	switch (CurTok) {
				354	default: return Error("unknown token when expecting an expression");
				355	case tok_identifier: return ParseIdentifierExpr();
				356	case tok_number: return ParseNumberExpr();
				357	case '(': return ParseParenExpr();
				358	}
				359	}
				360	</pre>
				361	</div>
				362
				363	<p>Now that you see the definition of this function, it makes it more obvious
				364	why we can assume the state of CurTok in the various functions. This uses
				365	look-ahead to determine which sort of expression is being inspected, and parses
				366	it with a function call.</p>
				367
				368	<p>Now that basic expressions are handled, we need to handle binary expressions,
				369	which are a bit more complex.</p>
				370
				371	</div>
				372
				373	<!-- *********************************************************************** -->
				374	<div class="doc_section"><a name="parserbinops">Binary Expression
				375	Parsing</a></div>
				376	<!-- *********************************************************************** -->
				377
				378	<div class="doc_text">
				379
				380	<p>Binary expressions are significantly harder to parse because they are often
				381	ambiguous. For example, when given the string "x+y*z", the parser can choose
				382	to parse it as either "(x+y)z" or "x+(yz)". With common definitions from
				383	mathematics, we expect the later parse, because "*" (multiplication) has
				384	higher <em>precedence</em> than "+" (addition).</p>
				385
				386	<p>There are many ways to handle this, but an elegant and efficient way is to
				387	use <a href=
				388	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				389	Parsing</a>. This parsing technique uses the precedence of binary operators to
				390	guide recursion. To start with, we need a table of precedences:</p>
				391
				392	<div class="doc_code">
				393	<pre>
				394	/// BinopPrecedence - This holds the precedence for each binary operator that is
				395	/// defined.
				396	static std::map<char, int> BinopPrecedence;
				397
				398	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				399	static int GetTokPrecedence() {
				400	if (!isascii(CurTok))
				401	return -1;
				402
				403	// Make sure it's a declared binop.
				404	int TokPrec = BinopPrecedence[CurTok];
				405	if (TokPrec <= 0) return -1;
				406	return TokPrec;
				407	}
				408
				409	int main() {
				410	// Install standard binary operators.
				411	// 1 is lowest precedence.
				412	BinopPrecedence['<'] = 10;
				413	BinopPrecedence['+'] = 20;
				414	BinopPrecedence['-'] = 20;
				415	BinopPrecedence['*'] = 40; // highest.
				416	...
				417	}
				418	</pre>
				419	</div>
				420
				421	<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
				422	(this can obviously be extended by you, the reader). The
				423	<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
				424	or -1 if the token is not a binary operator. Having a map makes it easy to add
				425	new operators and makes it clear that the algorithm doesn't depend on the
				426	specific operators involved, but it would be easy enough to eliminate the map
				427	and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
				428
				429	<p>With the helper above defined, we can now start parsing binary expressions.
				430	The basic idea of operator precedence parsing is to break down an expression
				431	with potentially ambiguous binary operators into pieces. Consider for example
				432	the expression "a+b+(c+d)ef+g". Operator precedence parsing considers this
				433	as a stream of primary expressions separated by binary operators. As such,
				434	it will first parse the leading primary expression "a", then it will see the
				435	pairs [+, b] [+, (c+d)] [, e] [, f] and [+, g]. Note that because parentheses
				436	are primary expressions that the binary expression parser doesn't need to worry
				437	about nested subexpressions like (c+d) at all.
				438	</p>
				439
				440	<p>
				441	To start, an expression is a primary expression potentially followed by a
				442	sequence of [binop,primaryexpr] pairs:</p>
				443
				444	<div class="doc_code">
				445	<pre>
				446	/// expression
				447	/// ::= primary binoprhs
				448	///
				449	static ExprAST *ParseExpression() {
				450	ExprAST *LHS = ParsePrimary();
				451	if (!LHS) return 0;
				452
				453	return ParseBinOpRHS(0, LHS);
				454	}
				455	</pre>
				456	</div>
				457
				458	<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
				459	us. It takes a precedence and a pointer to an expression for the part parsed
				460	so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
				461	allowed to be empty, in which case it returns the expression that is passed into
				462	it. In our example above, the code passes the expression for "a" into
				463	<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
				464
				465	<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
				466	minimal operator precedence</em> that the function is allowed to eat. For
				467	example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
				468	passed in a precedence of 40, it will not consume any tokens (because the
				469	precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
				470	with:</p>
				471
				472	<div class="doc_code">
				473	<pre>
				474	/// binoprhs
				475	/// ::= ('+' primary)*
				476	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				477	// If this is a binop, find its precedence.
				478	while (1) {
				479	int TokPrec = GetTokPrecedence();
				480
				481	// If this is a binop that binds at least as tightly as the current binop,
				482	// consume it, otherwise we are done.
				483	if (TokPrec < ExprPrec)
				484	return LHS;
				485	</pre>
				486	</div>
				487
				488	<p>This code gets the precedence of the current token and checks to see if if is
				489	too low. Because we defined invalid tokens to have a precedence of -1, this
				490	check implicitly knows that the pair-stream ends when the token stream runs out
				491	of binary operators. If this check succeeds, we know that the token is a binary
				492	operator and that it will be included in this expression:</p>
				493
				494	<div class="doc_code">
				495	<pre>
				496	// Okay, we know this is a binop.
				497	int BinOp = CurTok;
				498	getNextToken(); // eat binop
				499
				500	// Parse the primary expression after the binary operator.
				501	ExprAST *RHS = ParsePrimary();
				502	if (!RHS) return 0;
				503	</pre>
				504	</div>
				505
				506	<p>As such, this code eats (and remembers) the binary operator and then parses
				507	the following primary expression. This builds up the whole pair, the first of
				508	which is [+, b] for the running example.</p>
				509
				510	<p>Now that we parsed the left-hand side of an expression and one pair of the
				511	RHS sequence, we have to decide which way the expression associates. In
				512	particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
				513	To determine this, we look ahead at "binop" to determine its precedence and
				514	compare it to BinOp's precedence (which is '+' in this case):</p>
				515
				516	<div class="doc_code">
				517	<pre>
				518	// If BinOp binds less tightly with RHS than the operator after RHS, let
				519	// the pending operator take RHS as its LHS.
				520	int NextPrec = GetTokPrecedence();
				521	if (TokPrec < NextPrec) {
				522	</pre>
				523	</div>
				524
				525	<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
				526	precedence of our current operator, then we know that the parentheses associate
				527	as "(a+b) binop ...". In our example, since the next operator is "+" and so is
				528	our current one, we know that they have the same precedence. In this case we'll
				529	create the AST node for "a+b", and then continue parsing:</p>
				530
				531	<div class="doc_code">
				532	<pre>
				533	... if body omitted ...
				534	}
				535
				536	// Merge LHS/RHS.
				537	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				538	} // loop around to the top of the while loop.
				539	}
				540	</pre>
				541	</div>
				542
				543	<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
				544	iteration of the loop, with "+" as the current token. The code above will eat
				545	and remember it and parse "(c+d)" as the primary expression, which makes the
				546	current pair be [+, (c+d)]. It will then enter the 'if' above with "*" as the
				547	binop to the right of the primary. In this case, the precedence of "*" is
				548	higher than the precedence of "+" so the if condition will be entered.</p>
				549
				550	<p>The critical question left here is "how can the if condition parse the right
				551	hand side in full"? In particular, to build the AST correctly for our example,
				552	it needs to get all of "(c+d)ef" as the RHS expression variable. The code to
				553	do this is surprisingly simple (code from the above two blocks duplicated for
				554	context):</p>
				555
				556	<div class="doc_code">
				557	<pre>
				558	// If BinOp binds less tightly with RHS than the operator after RHS, let
				559	// the pending operator take RHS as its LHS.
				560	int NextPrec = GetTokPrecedence();
				561	if (TokPrec < NextPrec) {
				562	RHS = ParseBinOpRHS(TokPrec+1, RHS);
				563	if (RHS == 0) return 0;
				564	}
				565	// Merge LHS/RHS.
				566	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				567	} // loop around to the top of the while loop.
				568	}
				569	</pre>
				570	</div>
				571
				572	<p>At this point, we know that the binary operator to the RHS of our primary
				573	has higher precedence than the binop we are currently parsing. As such, we know
				574	that any sequence of pairs whose operators are all higher precedence than "+"
				575	should be parsed together and returned as "RHS". To do this, we recursively
				576	invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
				577	precedence required for it to continue. In our example above, this will cause
				578	it to return the AST node for "(c+d)ef" as RHS, which is then set as the RHS
				579	of the '+' expression.</p>
				580
				581	<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
				582	and added to the AST. With this little bit of code (14 non-trivial lines), we
				583	correctly handle fully general binary expression parsing in a very elegant way.
				584	</p>
				585
				586	<p>This wraps up handling of expressions. At this point, we can point the
				587	parser at an arbitrary token stream and build an expression from them, stopping
				588	at the first token that is not part of the expression. Next up we need to
				589	handle function definitions etc.</p>
				590
				591	</div>
				592
				593	<!-- *********************************************************************** -->
				594	<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
				595	<!-- *********************************************************************** -->
				596
				597	<div class="doc_text">
				598
				599	<p>
				600	The first basic thing missing is that of function prototypes. In Kaleidoscope,
				601	these are used both for 'extern' function declarations as well as function body
				602	definitions. The code to do this is straight-forward and not very interesting
				603	(once you've survived expressions):
				604	</p>
				605
				606	<div class="doc_code">
				607	<pre>
				608	/// prototype
				609	/// ::= id '(' id* ')'
				610	static PrototypeAST *ParsePrototype() {
				611	if (CurTok != tok_identifier)
				612	return ErrorP("Expected function name in prototype");
				613
				614	std::string FnName = IdentifierStr;
				615	getNextToken();
				616
				617	if (CurTok != '(')
				618	return ErrorP("Expected '(' in prototype");
				619
				620	std::vector<std::string> ArgNames;
				621	while (getNextToken() == tok_identifier)
				622	ArgNames.push_back(IdentifierStr);
				623	if (CurTok != ')')
				624	return ErrorP("Expected ')' in prototype");
				625
				626	// success.
				627	getNextToken(); // eat ')'.
				628
				629	return new PrototypeAST(FnName, ArgNames);
				630	}
				631	</pre>
				632	</div>
				633
				634	<p>Given this, a function definition is very simple, just a prototype plus
				635	and expression to implement the body:</p>
				636
				637	<div class="doc_code">
				638	<pre>
				639	/// definition ::= 'def' prototype expression
				640	static FunctionAST *ParseDefinition() {
				641	getNextToken(); // eat def.
				642	PrototypeAST *Proto = ParsePrototype();
				643	if (Proto == 0) return 0;
				644
				645	if (ExprAST *E = ParseExpression())
				646	return new FunctionAST(Proto, E);
				647	return 0;
				648	}
				649	</pre>
				650	</div>
				651
				652	<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
				653	well as to support forward declaration of user functions. 'externs' are just
				654	prototypes with no body:</p>
				655
				656	<div class="doc_code">
				657	<pre>
				658	/// external ::= 'extern' prototype
				659	static PrototypeAST *ParseExtern() {
				660	getNextToken(); // eat extern.
				661	return ParsePrototype();
				662	}
				663	</pre>
				664	</div>
				665
				666	<p>Finally, we'll also let the user type in arbitrary top-level expressions and
				667	evaluate them on the fly. We will handle this by defining anonymous nullary
				668	(zero argument) functions for them:</p>
				669
				670	<div class="doc_code">
				671	<pre>
				672	/// toplevelexpr ::= expression
				673	static FunctionAST *ParseTopLevelExpr() {
				674	if (ExprAST *E = ParseExpression()) {
				675	// Make an anonymous proto.
				676	PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
				677	return new FunctionAST(Proto, E);
				678	}
				679	return 0;
				680	}
				681	</pre>
				682	</div>
				683
				684	<p>Now that we have all the pieces, lets build a little driver that will let us
				685	actually <em>execute</em> this code we've built!</p>
				686
				687	</div>
				688
				689	<!-- *********************************************************************** -->
				690	<div class="doc_section"><a name="driver">The Driver</a></div>
				691	<!-- *********************************************************************** -->
				692
				693	<div class="doc_text">
				694
				695	<p>The driver for this simply invokes all of the parsing pieces with a top-level
				696	dispatch loop. There isn't much interesting here, so I'll just include the
				697	top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
				698	Parsing" section.</p>
				699
				700	<div class="doc_code">
				701	<pre>
				702	/// top ::= definition \| external \| expression \| ';'
				703	static void MainLoop() {
				704	while (1) {
				705	fprintf(stderr, "ready> ");
				706	switch (CurTok) {
				707	case tok_eof: return;
				708	case ';': getNextToken(); break; // ignore top level semicolons.
				709	case tok_def: HandleDefinition(); break;
				710	case tok_extern: HandleExtern(); break;
				711	default: HandleTopLevelExpression(); break;
				712	}
				713	}
				714	}
				715	</pre>
				716	</div>
				717
				718	<p>The most interesting part of this is that we ignore top-level semi colons.
				719	Why is this do you ask? The basic reason is that if you type "4 + 5" at the
				720	command line, the parser doesn't know that that is the end of what you will
				721	type. For example, on the next line you could type "def foo..." in which case
				722	4+5 is the end of a top-level expression. Alternatively you could type "* 6",
				723	which would continue the expression. Having top-level semicolons allows you to
				724	type "4+5;" and the parser will know you are done.</p>
				725
				726	</div>
				727
				728	<!-- *********************************************************************** -->
				729	<div class="doc_section"><a name="code">Conclusions and the Full Code</a></div>
				730	<!-- *********************************************************************** -->
				731
				732	<div class="doc_text">
				733
				734	<p>With just under 400 lines of commented code, we fully defined our minimal
				735	language, including a lexer, parser and AST builder. With this done, the
				736	executable will validate code and tell us if it is gramatically invalid. For
				737	example, here is a sample interaction:</p>
				738
				739	<div class="doc_code">
				740	<pre>
				741	$ ./a.out
				742	ready> def foo(x y) x+foo(y, 4.0);
				743	ready> Parsed an function definition.
				744	ready> def foo(x y) x+y y;
				745	ready> Parsed an function definition.
				746	ready> Parsed a top-level expr
				747	ready> def foo(x y) x+y );
				748	ready> Parsed an function definition.
				749	ready> Error: unknown token when expecting an expression
				750	ready> extern sin(a);
				751	ready> Parsed an extern
				752	ready> ^D
				753	$
				754	</pre>
				755	</div>
				756
				757	<p>
				758	Here is the full code. Note that it is fully self-contained: you don't even
				759	need LLVM for this. In the <a href="LangImpl3.html">next installment</a>, we
				760	will describe how to generate LLVM IR from the AST.</p>
				761
				762	<div class="doc_code">
				763	<pre>
				764	// To build this:
				765	// g++ -g toy.cpp
				766	// ./a.out
				767
				768	#include <cstdio>
				769	#include <string>
				770	#include <
				771	#include <vector>
				772
				773	//===----------------------------------------------------------------------===//
				774	// Lexer
				775	//===----------------------------------------------------------------------===//
				776
				777	// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
				778	// of these for known things.
				779	enum Token {
				780	tok_eof = -1,
				781
				782	// commands
				783	tok_def = -2, tok_extern = -3,
				784
				785	// primary
				786	tok_identifier = -4, tok_number = -5,
				787	};
				788
				789	static std::string IdentifierStr; // Filled in if tok_identifier
				790	static double NumVal; // Filled in if tok_number
				791
				792	/// gettok - Return the next token from standard input.
				793	static int gettok() {
				794	static int LastChar = ' ';
				795
				796	// Skip any whitespace.
				797	while (isspace(LastChar))
				798	LastChar = getchar();
				799
				800	if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
				801	IdentifierStr = LastChar;
				802	while (isalnum((LastChar = getchar())))
				803	IdentifierStr += LastChar;
				804
				805	if (IdentifierStr == "def") return tok_def;
				806	if (IdentifierStr == "extern") return tok_extern;
				807	return tok_identifier;
				808	}
				809
				810	if (isdigit(LastChar) \|\| LastChar == '.') { // Number: [0-9.]+
				811	std::string NumStr;
				812	do {
				813	NumStr += LastChar;
				814	LastChar = getchar();
				815	} while (isdigit(LastChar) \|\| LastChar == '.');
				816
				817	NumVal = strtod(NumStr.c_str(), 0);
				818	return tok_number;
				819	}
				820
				821	if (LastChar == '#') {
				822	// Comment until end of line.
				823	do LastChar = getchar();
				824	while (LastChar != EOF && LastChar != '\n' & LastChar != '\r');
				825
				826	if (LastChar != EOF)
				827	return gettok();
				828	}
				829
				830	// Check for end of file. Don't eat the EOF.
				831	if (LastChar == EOF)
				832	return tok_eof;
				833
				834	// Otherwise, just return the character as its ascii value.
				835	int ThisChar = LastChar;
				836	LastChar = getchar();
				837	return ThisChar;
				838	}
				839
				840	//===----------------------------------------------------------------------===//
				841	// Abstract Syntax Tree (aka Parse Tree)
				842	//===----------------------------------------------------------------------===//
				843
				844	/// ExprAST - Base class for all expression nodes.
				845	class ExprAST {
				846	public:
				847	virtual ~ExprAST() {}
				848	};
				849
				850	/// NumberExprAST - Expression class for numeric literals like "1.0".
				851	class NumberExprAST : public ExprAST {
				852	double Val;
				853	public:
				854	NumberExprAST(double val) : Val(val) {}
				855	};
				856
				857	/// VariableExprAST - Expression class for referencing a variable, like "a".
				858	class VariableExprAST : public ExprAST {
				859	std::string Name;
				860	public:
				861	VariableExprAST(const std::string &name) : Name(name) {}
				862	};
				863
				864	/// BinaryExprAST - Expression class for a binary operator.
				865	class BinaryExprAST : public ExprAST {
				866	char Op;
				867	ExprAST LHS, RHS;
				868	public:
				869	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				870	: Op(op), LHS(lhs), RHS(rhs) {}
				871	};
				872
				873	/// CallExprAST - Expression class for function calls.
				874	class CallExprAST : public ExprAST {
				875	std::string Callee;
				876	std::vector<ExprAST*> Args;
				877	public:
				878	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				879	: Callee(callee), Args(args) {}
				880	};
				881
				882	/// PrototypeAST - This class represents the "prototype" for a function,
				883	/// which captures its argument names as well as if it is an operator.
				884	class PrototypeAST {
				885	std::string Name;
				886	std::vector< Args;
				887	public:
				888	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				889	: Name(name), Args(args) {}
				890
				891	};
				892
				893	/// FunctionAST - This class represents a function definition itself.
				894	class FunctionAST {
				895	PrototypeAST *Proto;
				896	ExprAST *Body;
				897	public:
				898	FunctionAST(PrototypeAST proto, ExprAST body)
				899	: Proto(proto), Body(body) {}
				900
				901	};
				902
				903	//===----------------------------------------------------------------------===//
				904	// Parser
				905	//===----------------------------------------------------------------------===//
				906
				907	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
				908	/// token the parser it looking at. getNextToken reads another token from the
				909	/// lexer and updates CurTok with its results.
				910	static int CurTok;
				911	static int getNextToken() {
				912	return CurTok = gettok();
				913	}
				914
				915	/// BinopPrecedence - This holds the precedence for each binary operator that is
				916	/// defined.
				917	static std::map<char, int> BinopPrecedence;
				918
				919	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				920	static int GetTokPrecedence() {
				921	if (!isascii(CurTok))
				922	return -1;
				923
				924	// Make sure it's a declared binop.
				925	int TokPrec = BinopPrecedence[CurTok];
				926	if (TokPrec <= 0) return -1;
				927	return TokPrec;
				928	}
				929
				930	/// Error* - These are little helper functions for error handling.
				931	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				932	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				933	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				934
				935	static ExprAST *ParseExpression();
				936
				937	/// identifierexpr
				938	/// ::= identifer
				939	/// ::= identifer '(' expression* ')'
				940	static ExprAST *ParseIdentifierExpr() {
				941	std::string IdName = IdentifierStr;
				942
				943	getNextToken(); // eat identifer.
				944
				945	if (CurTok != '(') // Simple variable ref.
				946	return new VariableExprAST(IdName);
				947
				948	// Call.
				949	getNextToken(); // eat (
				950	std::vector<ExprAST*> Args;
				951	while (1) {
				952	ExprAST *Arg = ParseExpression();
				953	if (!Arg) return 0;
				954	Args.push_back(Arg);
				955
				956	if (CurTok == ')') break;
				957
				958	if (CurTok != ',')
				959	return Error("Expected ')'");
				960	getNextToken();
				961	}
				962
				963	// Eat the ')'.
				964	getNextToken();
				965
				966	return new CallExprAST(IdName, Args);
				967	}
				968
				969	/// numberexpr ::= number
				970	static ExprAST *ParseNumberExpr() {
				971	ExprAST *Result = new NumberExprAST(NumVal);
				972	getNextToken(); // consume the number
				973	return Result;
				974	}
				975
				976	/// parenexpr ::= '(' expression ')'
				977	static ExprAST *ParseParenExpr() {
				978	getNextToken(); // eat (.
				979	ExprAST *V = ParseExpression();
				980	if (!V) return 0;
				981
				982	if (CurTok != ')')
				983	return Error("expected ')'");
				984	getNextToken(); // eat ).
				985	return V;
				986	}
				987
				988	/// primary
				989	/// ::= identifierexpr
				990	/// ::= numberexpr
				991	/// ::= parenexpr
				992	static ExprAST *ParsePrimary() {
				993	switch (CurTok) {
				994	default: return Error("unknown token when expecting an expression");
				995	case tok_identifier: return ParseIdentifierExpr();
				996	case tok_number: return ParseNumberExpr();
				997	case '(': return ParseParenExpr();
				998	}
				999	}
				1000
				1001	/// binoprhs
				1002	/// ::= ('+' primary)*
				1003	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				1004	// If this is a binop, find its precedence.
				1005	while (1) {
				1006	int TokPrec = GetTokPrecedence();
				1007
				1008	// If this is a binop that binds at least as tightly as the current binop,
				1009	// consume it, otherwise we are done.
				1010	if (TokPrec < ExprPrec)
				1011	return LHS;
				1012
				1013	// Okay, we know this is a binop.
				1014	int BinOp = CurTok;
				1015	getNextToken(); // eat binop
				1016
				1017	// Parse the primary expression after the binary operator.
				1018	ExprAST *RHS = ParsePrimary();
				1019	if (!RHS) return 0;
				1020
				1021	// If BinOp binds less tightly with RHS than the operator after RHS, let
				1022	// the pending operator take RHS as its LHS.
				1023	int NextPrec = GetTokPrecedence();
				1024	if (TokPrec < NextPrec) {
				1025	RHS = ParseBinOpRHS(TokPrec+1, RHS);
				1026	if (RHS == 0) return 0;
				1027	}
				1028
				1029	// Merge LHS/RHS.
				1030	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				1031	}
				1032	}
				1033
				1034	/// expression
				1035	/// ::= primary binoprhs
				1036	///
				1037	static ExprAST *ParseExpression() {
				1038	ExprAST *LHS = ParsePrimary();
				1039	if (!LHS) return 0;
				1040
				1041	return ParseBinOpRHS(0, LHS);
				1042	}
				1043
				1044	/// prototype
				1045	/// ::= id '(' id* ')'
				1046	static PrototypeAST *ParsePrototype() {
				1047	if (CurTok != tok_identifier)
				1048	return ErrorP("Expected function name in prototype");
				1049
				1050	std::string FnName = IdentifierStr;
				1051	getNextToken();
				1052
				1053	if (CurTok != '(')
				1054	return ErrorP("Expected '(' in prototype");
				1055
				1056	std::vector<std::string> ArgNames;
				1057	while (getNextToken() == tok_identifier)
				1058	ArgNames.push_back(IdentifierStr);
				1059	if (CurTok != ')')
				1060	return ErrorP("Expected ')' in prototype");
				1061
				1062	// success.
				1063	getNextToken(); // eat ')'.
				1064
				1065	return new PrototypeAST(FnName, ArgNames);
				1066	}
				1067
				1068	/// definition ::= 'def' prototype expression
				1069	static FunctionAST *ParseDefinition() {
				1070	getNextToken(); // eat def.
				1071	PrototypeAST *Proto = ParsePrototype();
				1072	if (Proto == 0) return 0;
				1073
				1074	if (ExprAST *E = ParseExpression())
				1075	return new FunctionAST(Proto, E);
				1076	return 0;
				1077	}
				1078
				1079	/// toplevelexpr ::= expression
				1080	static FunctionAST *ParseTopLevelExpr() {
				1081	if (ExprAST *E = ParseExpression()) {
				1082	// Make an anonymous proto.
				1083	PrototypeAST *Proto = new PrototypeAST("", std::vector<());
				1084	return new FunctionAST(Proto, E);
				1085	}
				1086	return 0;
				1087	}
				1088
				1089	/// external ::= 'extern' prototype
				1090	static PrototypeAST *ParseExtern() {
				1091	getNextToken(); // eat extern.
				1092	return ParsePrototype();
				1093	}
				1094
				1095	//===----------------------------------------------------------------------===//
				1096	// Top-Level parsing
				1097	//===----------------------------------------------------------------------===//
				1098
				1099	static void HandleDefinition() {
				1100	if (FunctionAST *F = ParseDefinition()) {
				1101	fprintf(stderr, "Parsed a function definition.\n");
				1102	} else {
				1103	// Skip token for error recovery.
				1104	getNextToken();
				1105	}
				1106	}
				1107
				1108	static void HandleExtern() {
				1109	if (PrototypeAST *P = ParseExtern()) {
				1110	fprintf(stderr, "Parsed an extern\n");
				1111	} else {
				1112	// Skip token for error recovery.
				1113	getNextToken();
				1114	}
				1115	}
				1116
				1117	static void HandleTopLevelExpression() {
				1118	// Evaluate a top level expression into an anonymous function.
				1119	if (FunctionAST *F = ParseTopLevelExpr()) {
				1120	fprintf(stderr, "Parsed a top-level expr\n");
				1121	} else {
				1122	// Skip token for error recovery.
				1123	getNextToken();
				1124	}
				1125	}
				1126
				1127	/// top ::= definition \| external \| expression \| ';'
				1128	static void MainLoop() {
				1129	while (1) {
				1130	fprintf(stderr, "ready> ");
				1131	switch (CurTok) {
				1132	case tok_eof: return;
				1133	case ';': getNextToken(); break; // ignore top level semicolons.
				1134	case tok_def: HandleDefinition(); break;
				1135	case tok_extern: HandleExtern(); break;
				1136	default: HandleTopLevelExpression(); break;
				1137	}
				1138	}
				1139	}
				1140
				1141	//===----------------------------------------------------------------------===//
				1142	// Main driver code.
				1143	//===----------------------------------------------------------------------===//
				1144
				1145	int main() {
				1146	// Install standard binary operators.
				1147	// 1 is lowest precedence.
				1148	BinopPrecedence['<'] = 10;
				1149	BinopPrecedence['+'] = 20;
				1150	BinopPrecedence['-'] = 20;
				1151	BinopPrecedence['*'] = 40; // highest.
				1152
				1153	// Prime the first token.
				1154	fprintf(stderr, "ready> ");
				1155	getNextToken();
				1156
				1157	MainLoop();
				1158	return 0;
				1159	}
				1160	</pre>
				1161	</div>
				1162	</div>
				1163
				1164	<!-- *********************************************************************** -->
				1165	<hr>
				1166	<address>
				1167	<a href="http://jigsaw.w3.org/css-validator/check/referer"><img
				1168	src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
				1169	<a href="http://validator.w3.org/check/referer"><img
				1170	src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!" /></a>
				1171
				1172	<a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
				1173	<a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
				1174	Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
				1175	</address>
				1176	</body>
				1177	</html>