Blame - docs/tutorial/LangImpl2.html - fp2-dev/platform/external/llvm

blob: 1fdd442c651cb14a0f7d983852ad20f1bcfb3609 [file] [log] [blame]

Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
				2	"http://www.w3.org/TR/html4/strict.dtd">
				3
				4	<html>
				5	<head>
				6	<title>Kaleidoscope: Implementing a Parser and AST</title>
				7	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
				8	<meta name="author" content="Chris Lattner">
				9	<link rel="stylesheet" href="../llvm.css" type="text/css">
				10	</head>
				11
				12	<body>
				13
				14	<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
				15
				16	<div class="doc_author">
				17	<p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
				18	</div>
				19
				20	<!-- *********************************************************************** -->
				21	<div class="doc_section"><a name="intro">Part 2 Introduction</a></div>
				22	<!-- *********************************************************************** -->
				23
				24	<div class="doc_text">
				25
				26	<p>Welcome to part 2 of the "<a href="index.html">Implementing a language with
				27	LLVM</a>" tutorial. This chapter shows you how to use the <a
				28	href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
				29	href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
				30	our Kaleidoscope language and build an <a
				31	href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
				32	Tree</a> (AST).</p>
				33
				34	<p>The parser we will build uses a combination of <a
				35	href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
				36	Parsing</a> and <a href=
				37	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				38	Parsing</a> to parse the Kaleidoscope language (the later for binary expression
				39	and the former for everything else). Before we get to parsing though, lets talk
				40	about the output of the parser: the Abstract Syntax Tree.</p>
				41
				42	</div>
				43
				44	<!-- *********************************************************************** -->
				45	<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
				46	<!-- *********************************************************************** -->
				47
				48	<div class="doc_text">
				49
				50	<p>The AST for a program captures its behavior in a way that it is easy for
				51	later stages of the compiler (e.g. code generation) to interpret. We basically
				52	want one object for each construct in the language, and the AST should closely
				53	model the language. In Kaleidoscope, we have expressions, a prototype, and a
				54	function object. We'll start with expressions first:</p>
				55
				56	<div class="doc_code">
				57	<pre>
				58	/// ExprAST - Base class for all expression nodes.
				59	class ExprAST {
				60	public:
				61	virtual ~ExprAST() {}
				62	};
				63
				64	/// NumberExprAST - Expression class for numeric literals like "1.0".
				65	class NumberExprAST : public ExprAST {
				66	double Val;
				67	public:
Chris Lattner	28571ed	2007-10-23 04:27:44 +0000	[diff] [blame^]	68	explicit NumberExprAST(double val) : Val(val) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	69	};
				70	</pre>
				71	</div>
				72
				73	<p>The code above shows the definition of the base ExprAST class and one
				74	subclass which we use for numeric literals. The important thing about this is
				75	that the NumberExprAST class captures the numeric value of the literal in the
				76	class, so that later phases of the compiler can know what it is.</p>
				77
				78	<p>Right now we only create the AST, so there are no useful accessor methods on
				79	them. It would be very easy to add a virtual method to pretty print the code,
				80	for example. Here are the other expression AST node definitions that we'll use
				81	in the basic form of the Kaleidoscope language.
				82	</p>
				83
				84	<div class="doc_code">
				85	<pre>
				86	/// VariableExprAST - Expression class for referencing a variable, like "a".
				87	class VariableExprAST : public ExprAST {
				88	std::string Name;
				89	public:
Chris Lattner	28571ed	2007-10-23 04:27:44 +0000	[diff] [blame^]	90	explicit VariableExprAST(const std::string &name) : Name(name) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	91	};
				92
				93	/// BinaryExprAST - Expression class for a binary operator.
				94	class BinaryExprAST : public ExprAST {
				95	char Op;
				96	ExprAST LHS, RHS;
				97	public:
				98	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				99	: Op(op), LHS(lhs), RHS(rhs) {}
				100	};
				101
				102	/// CallExprAST - Expression class for function calls.
				103	class CallExprAST : public ExprAST {
				104	std::string Callee;
				105	std::vector<ExprAST*> Args;
				106	public:
				107	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				108	: Callee(callee), Args(args) {}
				109	};
				110	</pre>
				111	</div>
				112
				113	<p>This is all (intentially) rather straight-forward: variables capture the
				114	variable name, binary operators capture their opcode (e.g. '+'), and calls
				115	capture a function name and list of argument expressions. One thing that is
				116	nice about our AST is that it captures the language features without talking
				117	about the syntax of the language. Note that there is no discussion about
				118	precedence of binary operators, lexical structure etc.</p>
				119
				120	<p>For our basic language, these are all of the expression nodes we'll define.
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	121	Because it doesn't have conditional control flow, it isn't Turing-complete;
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	122	we'll fix that in a later installment. The two things we need next are a way
				123	to talk about the interface to a function, and a way to talk about functions
				124	themselves:</p>
				125
				126	<div class="doc_code">
				127	<pre>
				128	/// PrototypeAST - This class represents the "prototype" for a function,
				129	/// which captures its argument names as well as if it is an operator.
				130	class PrototypeAST {
				131	std::string Name;
				132	std::vector<std::string> Args;
				133	public:
				134	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				135	: Name(name), Args(args) {}
				136	};
				137
				138	/// FunctionAST - This class represents a function definition itself.
				139	class FunctionAST {
				140	PrototypeAST *Proto;
				141	ExprAST *Body;
				142	public:
				143	FunctionAST(PrototypeAST proto, ExprAST body)
				144	: Proto(proto), Body(body) {}
				145	};
				146	</pre>
				147	</div>
				148
				149	<p>In Kaleidoscope, functions are typed with just a count of their arguments.
				150	Since all values are double precision floating point, this fact doesn't need to
				151	be captured anywhere. In a more aggressive and realistic language, the
				152	"ExprAST" class would probably have a type field.</p>
				153
				154	<p>With this scaffolding, we can now talk about parsing expressions and function
				155	bodies in Kaleidoscope.</p>
				156
				157	</div>
				158
				159	<!-- *********************************************************************** -->
				160	<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
				161	<!-- *********************************************************************** -->
				162
				163	<div class="doc_text">
				164
				165	<p>Now that we have an AST to build, we need to define the parser code to build
				166	it. The idea here is that we want to parse something like "x+y" (which is
				167	returned as three tokens by the lexer) into an AST that could be generated with
				168	calls like this:</p>
				169
				170	<div class="doc_code">
				171	<pre>
				172	ExprAST *X = new VariableExprAST("x");
				173	ExprAST *Y = new VariableExprAST("y");
				174	ExprAST *Result = new BinaryExprAST('+', X, Y);
				175	</pre>
				176	</div>
				177
				178	<p>In order to do this, we'll start by defining some basic helper routines:</p>
				179
				180	<div class="doc_code">
				181	<pre>
				182	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
				183	/// token the parser it looking at. getNextToken reads another token from the
				184	/// lexer and updates CurTok with its results.
				185	static int CurTok;
				186	static int getNextToken() {
				187	return CurTok = gettok();
				188	}
				189	</pre>
				190	</div>
				191
				192	<p>
				193	This implements a simple token buffer around the lexer. This allows
				194	us to look one token ahead at what the lexer is returning. Every function in
				195	our lexer will assume that CurTok is the current token that needs to be
				196	parsed.</p>
				197
Chris Lattner	bee175b	2007-10-22 16:44:31 +0000	[diff] [blame]	198	<p>Again, we define these with global variables; it would be better design to
				199	wrap the entire parser in a class and use instance variables for these.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	200	</p>
				201
				202	<div class="doc_code">
				203	<pre>
				204
				205	/// Error* - These are little helper functions for error handling.
				206	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				207	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				208	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				209	</pre>
				210	</div>
				211
				212	<p>
				213	The <tt>Error</tt> routines are simple helper routines that our parser will use
				214	to handle errors. The error recovery in our parser will not be the best and
				215	are not particular user-friendly, but it will be enough for our tutorial. These
				216	routines make it easier to handle errors in routines that have various return
				217	types: they always return null.</p>
				218
				219	<p>With these basic helper functions implemented, we can implement the first
				220	piece of our grammar: we'll start with numeric literals.</p>
				221
				222	</div>
				223
				224	<!-- *********************************************************************** -->
				225	<div class="doc_section"><a name="parserprimexprs">Basic Expression
				226	Parsing</a></div>
				227	<!-- *********************************************************************** -->
				228
				229	<div class="doc_text">
				230
				231	<p>We start with numeric literals, because they are the simplest to process.
				232	For each production in our grammar, we'll define a function which parses that
				233	production. For numeric literals, we have:
				234	</p>
				235
				236	<div class="doc_code">
				237	<pre>
				238	/// numberexpr ::= number
				239	static ExprAST *ParseNumberExpr() {
				240	ExprAST *Result = new NumberExprAST(NumVal);
				241	getNextToken(); // consume the number
				242	return Result;
				243	}
				244	</pre>
				245	</div>
				246
				247	<p>This routine is very simple: it expects to be called when the current token
				248	is a <tt>tok_number</tt> token. It takes the current number value, creates
				249	a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
				250	returns.</p>
				251
				252	<p>There are some interesting aspects of this. The most important one is that
				253	this routine eats all of the tokens that correspond to the production, and
				254	returns the lexer buffer with the next token (which is not part of the grammar
				255	production) ready to go. This is a fairly standard way to go for recursive
				256	descent parsers. For a better example, the parenthesis operator is defined like
				257	this:</p>
				258
				259	<div class="doc_code">
				260	<pre>
				261	/// parenexpr ::= '(' expression ')'
				262	static ExprAST *ParseParenExpr() {
				263	getNextToken(); // eat (.
				264	ExprAST *V = ParseExpression();
				265	if (!V) return 0;
				266
				267	if (CurTok != ')')
				268	return Error("expected ')'");
				269	getNextToken(); // eat ).
				270	return V;
				271	}
				272	</pre>
				273	</div>
				274
				275	<p>This function illustrates a number of interesting things about the parser:
				276	1) it shows how we use the Error routines. When called, this function expects
				277	that the current token is a '(' token, but after parsing the subexpression, it
				278	is possible that there is not a ')' waiting. For example, if the user types in
				279	"(4 x" instead of "(4)", the parser should emit an error. Because errors can
				280	occur, the parser needs a way to indicate that they happened: in our parser, we
				281	return null on an error.</p>
				282
				283	<p>Another interesting aspect of this function is that it uses recursion by
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	284	calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	285	<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
				286	recursive grammars, and keeps each production very simple. Note that
				287	parenthesis do not cause construction of AST nodes themselves. While we could
				288	do this, the most important role of parens are to guide the parser and provide
				289	grouping. Once the parser constructs the AST, parens are not needed.</p>
				290
				291	<p>The next simple production is for handling variable references and function
				292	calls:</p>
				293
				294	<div class="doc_code">
				295	<pre>
				296	/// identifierexpr
				297	/// ::= identifer
				298	/// ::= identifer '(' expression* ')'
				299	static ExprAST *ParseIdentifierExpr() {
				300	std::string IdName = IdentifierStr;
				301
				302	getNextToken(); // eat identifer.
				303
				304	if (CurTok != '(') // Simple variable ref.
				305	return new VariableExprAST(IdName);
				306
				307	// Call.
				308	getNextToken(); // eat (
				309	std::vector<ExprAST*> Args;
				310	while (1) {
				311	ExprAST *Arg = ParseExpression();
				312	if (!Arg) return 0;
				313	Args.push_back(Arg);
				314
				315	if (CurTok == ')') break;
				316
				317	if (CurTok != ',')
				318	return Error("Expected ')'");
				319	getNextToken();
				320	}
				321
				322	// Eat the ')'.
				323	getNextToken();
				324
				325	return new CallExprAST(IdName, Args);
				326	}
				327	</pre>
				328	</div>
				329
				330	<p>This routine follows the same style as the other routines (it expects to be
				331	called if the current token is a <tt>tok_identifier</tt> token). It also has
				332	recursion and error handling. One interesting aspect of this is that it uses
				333	<em>look-ahead</em> to determine if the current identifier is a stand alone
				334	variable reference or if it is a function call expression. It handles this by
				335	checking to see if the token after the identifier is a '(' token, and constructs
				336	either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
				337	</p>
				338
				339	<p>Now that we have all of our simple expression parsing logic in place, we can
				340	define a helper function to wrap them up in a class. We call this class of
				341	expressions "primary" expressions, for reasons that will become more clear
				342	later. In order to parse a primary expression, we need to determine what sort
				343	of expression it is:</p>
				344
				345	<div class="doc_code">
				346	<pre>
				347	/// primary
				348	/// ::= identifierexpr
				349	/// ::= numberexpr
				350	/// ::= parenexpr
				351	static ExprAST *ParsePrimary() {
				352	switch (CurTok) {
				353	default: return Error("unknown token when expecting an expression");
				354	case tok_identifier: return ParseIdentifierExpr();
				355	case tok_number: return ParseNumberExpr();
				356	case '(': return ParseParenExpr();
				357	}
				358	}
				359	</pre>
				360	</div>
				361
				362	<p>Now that you see the definition of this function, it makes it more obvious
				363	why we can assume the state of CurTok in the various functions. This uses
				364	look-ahead to determine which sort of expression is being inspected, and parses
				365	it with a function call.</p>
				366
				367	<p>Now that basic expressions are handled, we need to handle binary expressions,
				368	which are a bit more complex.</p>
				369
				370	</div>
				371
				372	<!-- *********************************************************************** -->
				373	<div class="doc_section"><a name="parserbinops">Binary Expression
				374	Parsing</a></div>
				375	<!-- *********************************************************************** -->
				376
				377	<div class="doc_text">
				378
				379	<p>Binary expressions are significantly harder to parse because they are often
				380	ambiguous. For example, when given the string "x+y*z", the parser can choose
				381	to parse it as either "(x+y)z" or "x+(yz)". With common definitions from
				382	mathematics, we expect the later parse, because "*" (multiplication) has
				383	higher <em>precedence</em> than "+" (addition).</p>
				384
				385	<p>There are many ways to handle this, but an elegant and efficient way is to
				386	use <a href=
				387	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				388	Parsing</a>. This parsing technique uses the precedence of binary operators to
				389	guide recursion. To start with, we need a table of precedences:</p>
				390
				391	<div class="doc_code">
				392	<pre>
				393	/// BinopPrecedence - This holds the precedence for each binary operator that is
				394	/// defined.
				395	static std::map<char, int> BinopPrecedence;
				396
				397	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				398	static int GetTokPrecedence() {
				399	if (!isascii(CurTok))
				400	return -1;
				401
				402	// Make sure it's a declared binop.
				403	int TokPrec = BinopPrecedence[CurTok];
				404	if (TokPrec <= 0) return -1;
				405	return TokPrec;
				406	}
				407
				408	int main() {
				409	// Install standard binary operators.
				410	// 1 is lowest precedence.
				411	BinopPrecedence['<'] = 10;
				412	BinopPrecedence['+'] = 20;
				413	BinopPrecedence['-'] = 20;
				414	BinopPrecedence['*'] = 40; // highest.
				415	...
				416	}
				417	</pre>
				418	</div>
				419
				420	<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
				421	(this can obviously be extended by you, the reader). The
				422	<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
				423	or -1 if the token is not a binary operator. Having a map makes it easy to add
				424	new operators and makes it clear that the algorithm doesn't depend on the
				425	specific operators involved, but it would be easy enough to eliminate the map
				426	and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
				427
				428	<p>With the helper above defined, we can now start parsing binary expressions.
				429	The basic idea of operator precedence parsing is to break down an expression
				430	with potentially ambiguous binary operators into pieces. Consider for example
				431	the expression "a+b+(c+d)ef+g". Operator precedence parsing considers this
				432	as a stream of primary expressions separated by binary operators. As such,
				433	it will first parse the leading primary expression "a", then it will see the
				434	pairs [+, b] [+, (c+d)] [, e] [, f] and [+, g]. Note that because parentheses
				435	are primary expressions that the binary expression parser doesn't need to worry
				436	about nested subexpressions like (c+d) at all.
				437	</p>
				438
				439	<p>
				440	To start, an expression is a primary expression potentially followed by a
				441	sequence of [binop,primaryexpr] pairs:</p>
				442
				443	<div class="doc_code">
				444	<pre>
				445	/// expression
				446	/// ::= primary binoprhs
				447	///
				448	static ExprAST *ParseExpression() {
				449	ExprAST *LHS = ParsePrimary();
				450	if (!LHS) return 0;
				451
				452	return ParseBinOpRHS(0, LHS);
				453	}
				454	</pre>
				455	</div>
				456
				457	<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
				458	us. It takes a precedence and a pointer to an expression for the part parsed
				459	so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
				460	allowed to be empty, in which case it returns the expression that is passed into
				461	it. In our example above, the code passes the expression for "a" into
				462	<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
				463
				464	<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
				465	minimal operator precedence</em> that the function is allowed to eat. For
				466	example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
				467	passed in a precedence of 40, it will not consume any tokens (because the
				468	precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
				469	with:</p>
				470
				471	<div class="doc_code">
				472	<pre>
				473	/// binoprhs
				474	/// ::= ('+' primary)*
				475	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				476	// If this is a binop, find its precedence.
				477	while (1) {
				478	int TokPrec = GetTokPrecedence();
				479
				480	// If this is a binop that binds at least as tightly as the current binop,
				481	// consume it, otherwise we are done.
				482	if (TokPrec < ExprPrec)
				483	return LHS;
				484	</pre>
				485	</div>
				486
				487	<p>This code gets the precedence of the current token and checks to see if if is
				488	too low. Because we defined invalid tokens to have a precedence of -1, this
				489	check implicitly knows that the pair-stream ends when the token stream runs out
				490	of binary operators. If this check succeeds, we know that the token is a binary
				491	operator and that it will be included in this expression:</p>
				492
				493	<div class="doc_code">
				494	<pre>
				495	// Okay, we know this is a binop.
				496	int BinOp = CurTok;
				497	getNextToken(); // eat binop
				498
				499	// Parse the primary expression after the binary operator.
				500	ExprAST *RHS = ParsePrimary();
				501	if (!RHS) return 0;
				502	</pre>
				503	</div>
				504
				505	<p>As such, this code eats (and remembers) the binary operator and then parses
				506	the following primary expression. This builds up the whole pair, the first of
				507	which is [+, b] for the running example.</p>
				508
				509	<p>Now that we parsed the left-hand side of an expression and one pair of the
				510	RHS sequence, we have to decide which way the expression associates. In
				511	particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
				512	To determine this, we look ahead at "binop" to determine its precedence and
				513	compare it to BinOp's precedence (which is '+' in this case):</p>
				514
				515	<div class="doc_code">
				516	<pre>
				517	// If BinOp binds less tightly with RHS than the operator after RHS, let
				518	// the pending operator take RHS as its LHS.
				519	int NextPrec = GetTokPrecedence();
				520	if (TokPrec < NextPrec) {
				521	</pre>
				522	</div>
				523
				524	<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
				525	precedence of our current operator, then we know that the parentheses associate
				526	as "(a+b) binop ...". In our example, since the next operator is "+" and so is
				527	our current one, we know that they have the same precedence. In this case we'll
				528	create the AST node for "a+b", and then continue parsing:</p>
				529
				530	<div class="doc_code">
				531	<pre>
				532	... if body omitted ...
				533	}
				534
				535	// Merge LHS/RHS.
				536	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				537	} // loop around to the top of the while loop.
				538	}
				539	</pre>
				540	</div>
				541
				542	<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
				543	iteration of the loop, with "+" as the current token. The code above will eat
				544	and remember it and parse "(c+d)" as the primary expression, which makes the
				545	current pair be [+, (c+d)]. It will then enter the 'if' above with "*" as the
				546	binop to the right of the primary. In this case, the precedence of "*" is
				547	higher than the precedence of "+" so the if condition will be entered.</p>
				548
				549	<p>The critical question left here is "how can the if condition parse the right
				550	hand side in full"? In particular, to build the AST correctly for our example,
				551	it needs to get all of "(c+d)ef" as the RHS expression variable. The code to
				552	do this is surprisingly simple (code from the above two blocks duplicated for
				553	context):</p>
				554
				555	<div class="doc_code">
				556	<pre>
				557	// If BinOp binds less tightly with RHS than the operator after RHS, let
				558	// the pending operator take RHS as its LHS.
				559	int NextPrec = GetTokPrecedence();
				560	if (TokPrec < NextPrec) {
				561	RHS = ParseBinOpRHS(TokPrec+1, RHS);
				562	if (RHS == 0) return 0;
				563	}
				564	// Merge LHS/RHS.
				565	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				566	} // loop around to the top of the while loop.
				567	}
				568	</pre>
				569	</div>
				570
				571	<p>At this point, we know that the binary operator to the RHS of our primary
				572	has higher precedence than the binop we are currently parsing. As such, we know
				573	that any sequence of pairs whose operators are all higher precedence than "+"
				574	should be parsed together and returned as "RHS". To do this, we recursively
				575	invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
				576	precedence required for it to continue. In our example above, this will cause
				577	it to return the AST node for "(c+d)ef" as RHS, which is then set as the RHS
				578	of the '+' expression.</p>
				579
				580	<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
				581	and added to the AST. With this little bit of code (14 non-trivial lines), we
				582	correctly handle fully general binary expression parsing in a very elegant way.
				583	</p>
				584
				585	<p>This wraps up handling of expressions. At this point, we can point the
				586	parser at an arbitrary token stream and build an expression from them, stopping
				587	at the first token that is not part of the expression. Next up we need to
				588	handle function definitions etc.</p>
				589
				590	</div>
				591
				592	<!-- *********************************************************************** -->
				593	<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
				594	<!-- *********************************************************************** -->
				595
				596	<div class="doc_text">
				597
				598	<p>
				599	The first basic thing missing is that of function prototypes. In Kaleidoscope,
				600	these are used both for 'extern' function declarations as well as function body
				601	definitions. The code to do this is straight-forward and not very interesting
				602	(once you've survived expressions):
				603	</p>
				604
				605	<div class="doc_code">
				606	<pre>
				607	/// prototype
				608	/// ::= id '(' id* ')'
				609	static PrototypeAST *ParsePrototype() {
				610	if (CurTok != tok_identifier)
				611	return ErrorP("Expected function name in prototype");
				612
				613	std::string FnName = IdentifierStr;
				614	getNextToken();
				615
				616	if (CurTok != '(')
				617	return ErrorP("Expected '(' in prototype");
				618
				619	std::vector<std::string> ArgNames;
				620	while (getNextToken() == tok_identifier)
				621	ArgNames.push_back(IdentifierStr);
				622	if (CurTok != ')')
				623	return ErrorP("Expected ')' in prototype");
				624
				625	// success.
				626	getNextToken(); // eat ')'.
				627
				628	return new PrototypeAST(FnName, ArgNames);
				629	}
				630	</pre>
				631	</div>
				632
				633	<p>Given this, a function definition is very simple, just a prototype plus
				634	and expression to implement the body:</p>
				635
				636	<div class="doc_code">
				637	<pre>
				638	/// definition ::= 'def' prototype expression
				639	static FunctionAST *ParseDefinition() {
				640	getNextToken(); // eat def.
				641	PrototypeAST *Proto = ParsePrototype();
				642	if (Proto == 0) return 0;
				643
				644	if (ExprAST *E = ParseExpression())
				645	return new FunctionAST(Proto, E);
				646	return 0;
				647	}
				648	</pre>
				649	</div>
				650
				651	<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
				652	well as to support forward declaration of user functions. 'externs' are just
				653	prototypes with no body:</p>
				654
				655	<div class="doc_code">
				656	<pre>
				657	/// external ::= 'extern' prototype
				658	static PrototypeAST *ParseExtern() {
				659	getNextToken(); // eat extern.
				660	return ParsePrototype();
				661	}
				662	</pre>
				663	</div>
				664
				665	<p>Finally, we'll also let the user type in arbitrary top-level expressions and
				666	evaluate them on the fly. We will handle this by defining anonymous nullary
				667	(zero argument) functions for them:</p>
				668
				669	<div class="doc_code">
				670	<pre>
				671	/// toplevelexpr ::= expression
				672	static FunctionAST *ParseTopLevelExpr() {
				673	if (ExprAST *E = ParseExpression()) {
				674	// Make an anonymous proto.
				675	PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
				676	return new FunctionAST(Proto, E);
				677	}
				678	return 0;
				679	}
				680	</pre>
				681	</div>
				682
				683	<p>Now that we have all the pieces, lets build a little driver that will let us
				684	actually <em>execute</em> this code we've built!</p>
				685
				686	</div>
				687
				688	<!-- *********************************************************************** -->
				689	<div class="doc_section"><a name="driver">The Driver</a></div>
				690	<!-- *********************************************************************** -->
				691
				692	<div class="doc_text">
				693
				694	<p>The driver for this simply invokes all of the parsing pieces with a top-level
				695	dispatch loop. There isn't much interesting here, so I'll just include the
				696	top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
				697	Parsing" section.</p>
				698
				699	<div class="doc_code">
				700	<pre>
				701	/// top ::= definition \| external \| expression \| ';'
				702	static void MainLoop() {
				703	while (1) {
				704	fprintf(stderr, "ready> ");
				705	switch (CurTok) {
				706	case tok_eof: return;
				707	case ';': getNextToken(); break; // ignore top level semicolons.
				708	case tok_def: HandleDefinition(); break;
				709	case tok_extern: HandleExtern(); break;
				710	default: HandleTopLevelExpression(); break;
				711	}
				712	}
				713	}
				714	</pre>
				715	</div>
				716
				717	<p>The most interesting part of this is that we ignore top-level semi colons.
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	718	Why is this, you ask? The basic reason is that if you type "4 + 5" at the
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	719	command line, the parser doesn't know that that is the end of what you will
				720	type. For example, on the next line you could type "def foo..." in which case
				721	4+5 is the end of a top-level expression. Alternatively you could type "* 6",
				722	which would continue the expression. Having top-level semicolons allows you to
				723	type "4+5;" and the parser will know you are done.</p>
				724
				725	</div>
				726
				727	<!-- *********************************************************************** -->
				728	<div class="doc_section"><a name="code">Conclusions and the Full Code</a></div>
				729	<!-- *********************************************************************** -->
				730
				731	<div class="doc_text">
				732
				733	<p>With just under 400 lines of commented code, we fully defined our minimal
				734	language, including a lexer, parser and AST builder. With this done, the
				735	executable will validate code and tell us if it is gramatically invalid. For
				736	example, here is a sample interaction:</p>
				737
				738	<div class="doc_code">
				739	<pre>
				740	$ ./a.out
				741	ready> def foo(x y) x+foo(y, 4.0);
				742	ready> Parsed an function definition.
				743	ready> def foo(x y) x+y y;
				744	ready> Parsed an function definition.
				745	ready> Parsed a top-level expr
				746	ready> def foo(x y) x+y );
				747	ready> Parsed an function definition.
				748	ready> Error: unknown token when expecting an expression
				749	ready> extern sin(a);
				750	ready> Parsed an extern
				751	ready> ^D
				752	$
				753	</pre>
				754	</div>
				755
				756	<p>
				757	Here is the full code. Note that it is fully self-contained: you don't even
				758	need LLVM for this. In the <a href="LangImpl3.html">next installment</a>, we
				759	will describe how to generate LLVM IR from the AST.</p>
				760
				761	<div class="doc_code">
				762	<pre>
				763	// To build this:
				764	// g++ -g toy.cpp
				765	// ./a.out
				766
				767	#include <cstdio>
				768	#include <string>
				769	#include <
				770	#include <vector>
				771
				772	//===----------------------------------------------------------------------===//
				773	// Lexer
				774	//===----------------------------------------------------------------------===//
				775
				776	// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
				777	// of these for known things.
				778	enum Token {
				779	tok_eof = -1,
				780
				781	// commands
				782	tok_def = -2, tok_extern = -3,
				783
				784	// primary
				785	tok_identifier = -4, tok_number = -5,
				786	};
				787
				788	static std::string IdentifierStr; // Filled in if tok_identifier
				789	static double NumVal; // Filled in if tok_number
				790
				791	/// gettok - Return the next token from standard input.
				792	static int gettok() {
				793	static int LastChar = ' ';
				794
				795	// Skip any whitespace.
				796	while (isspace(LastChar))
				797	LastChar = getchar();
				798
				799	if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
				800	IdentifierStr = LastChar;
				801	while (isalnum((LastChar = getchar())))
				802	IdentifierStr += LastChar;
				803
				804	if (IdentifierStr == "def") return tok_def;
				805	if (IdentifierStr == "extern") return tok_extern;
				806	return tok_identifier;
				807	}
				808
				809	if (isdigit(LastChar) \|\| LastChar == '.') { // Number: [0-9.]+
				810	std::string NumStr;
				811	do {
				812	NumStr += LastChar;
				813	LastChar = getchar();
				814	} while (isdigit(LastChar) \|\| LastChar == '.');
				815
				816	NumVal = strtod(NumStr.c_str(), 0);
				817	return tok_number;
				818	}
				819
				820	if (LastChar == '#') {
				821	// Comment until end of line.
				822	do LastChar = getchar();
				823	while (LastChar != EOF && LastChar != '\n' & LastChar != '\r');
				824
				825	if (LastChar != EOF)
				826	return gettok();
				827	}
				828
				829	// Check for end of file. Don't eat the EOF.
				830	if (LastChar == EOF)
				831	return tok_eof;
				832
				833	// Otherwise, just return the character as its ascii value.
				834	int ThisChar = LastChar;
				835	LastChar = getchar();
				836	return ThisChar;
				837	}
				838
				839	//===----------------------------------------------------------------------===//
				840	// Abstract Syntax Tree (aka Parse Tree)
				841	//===----------------------------------------------------------------------===//
				842
				843	/// ExprAST - Base class for all expression nodes.
				844	class ExprAST {
				845	public:
				846	virtual ~ExprAST() {}
				847	};
				848
				849	/// NumberExprAST - Expression class for numeric literals like "1.0".
				850	class NumberExprAST : public ExprAST {
				851	double Val;
				852	public:
Chris Lattner	28571ed	2007-10-23 04:27:44 +0000	[diff] [blame^]	853	explicit NumberExprAST(double val) : Val(val) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	854	};
				855
				856	/// VariableExprAST - Expression class for referencing a variable, like "a".
				857	class VariableExprAST : public ExprAST {
				858	std::string Name;
				859	public:
Chris Lattner	28571ed	2007-10-23 04:27:44 +0000	[diff] [blame^]	860	explicit VariableExprAST(const std::string &name) : Name(name) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	861	};
				862
				863	/// BinaryExprAST - Expression class for a binary operator.
				864	class BinaryExprAST : public ExprAST {
				865	char Op;
				866	ExprAST LHS, RHS;
				867	public:
				868	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				869	: Op(op), LHS(lhs), RHS(rhs) {}
				870	};
				871
				872	/// CallExprAST - Expression class for function calls.
				873	class CallExprAST : public ExprAST {
				874	std::string Callee;
				875	std::vector<ExprAST*> Args;
				876	public:
				877	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				878	: Callee(callee), Args(args) {}
				879	};
				880
				881	/// PrototypeAST - This class represents the "prototype" for a function,
				882	/// which captures its argument names as well as if it is an operator.
				883	class PrototypeAST {
				884	std::string Name;
				885	std::vector< Args;
				886	public:
				887	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				888	: Name(name), Args(args) {}
				889
				890	};
				891
				892	/// FunctionAST - This class represents a function definition itself.
				893	class FunctionAST {
				894	PrototypeAST *Proto;
				895	ExprAST *Body;
				896	public:
				897	FunctionAST(PrototypeAST proto, ExprAST body)
				898	: Proto(proto), Body(body) {}
				899
				900	};
				901
				902	//===----------------------------------------------------------------------===//
				903	// Parser
				904	//===----------------------------------------------------------------------===//
				905
				906	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
				907	/// token the parser it looking at. getNextToken reads another token from the
				908	/// lexer and updates CurTok with its results.
				909	static int CurTok;
				910	static int getNextToken() {
				911	return CurTok = gettok();
				912	}
				913
				914	/// BinopPrecedence - This holds the precedence for each binary operator that is
				915	/// defined.
				916	static std::map<char, int> BinopPrecedence;
				917
				918	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				919	static int GetTokPrecedence() {
				920	if (!isascii(CurTok))
				921	return -1;
				922
				923	// Make sure it's a declared binop.
				924	int TokPrec = BinopPrecedence[CurTok];
				925	if (TokPrec <= 0) return -1;
				926	return TokPrec;
				927	}
				928
				929	/// Error* - These are little helper functions for error handling.
				930	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				931	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				932	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				933
				934	static ExprAST *ParseExpression();
				935
				936	/// identifierexpr
				937	/// ::= identifer
				938	/// ::= identifer '(' expression* ')'
				939	static ExprAST *ParseIdentifierExpr() {
				940	std::string IdName = IdentifierStr;
				941
				942	getNextToken(); // eat identifer.
				943
				944	if (CurTok != '(') // Simple variable ref.
				945	return new VariableExprAST(IdName);
				946
				947	// Call.
				948	getNextToken(); // eat (
				949	std::vector<ExprAST*> Args;
				950	while (1) {
				951	ExprAST *Arg = ParseExpression();
				952	if (!Arg) return 0;
				953	Args.push_back(Arg);
				954
				955	if (CurTok == ')') break;
				956
				957	if (CurTok != ',')
				958	return Error("Expected ')'");
				959	getNextToken();
				960	}
				961
				962	// Eat the ')'.
				963	getNextToken();
				964
				965	return new CallExprAST(IdName, Args);
				966	}
				967
				968	/// numberexpr ::= number
				969	static ExprAST *ParseNumberExpr() {
				970	ExprAST *Result = new NumberExprAST(NumVal);
				971	getNextToken(); // consume the number
				972	return Result;
				973	}
				974
				975	/// parenexpr ::= '(' expression ')'
				976	static ExprAST *ParseParenExpr() {
				977	getNextToken(); // eat (.
				978	ExprAST *V = ParseExpression();
				979	if (!V) return 0;
				980
				981	if (CurTok != ')')
				982	return Error("expected ')'");
				983	getNextToken(); // eat ).
				984	return V;
				985	}
				986
				987	/// primary
				988	/// ::= identifierexpr
				989	/// ::= numberexpr
				990	/// ::= parenexpr
				991	static ExprAST *ParsePrimary() {
				992	switch (CurTok) {
				993	default: return Error("unknown token when expecting an expression");
				994	case tok_identifier: return ParseIdentifierExpr();
				995	case tok_number: return ParseNumberExpr();
				996	case '(': return ParseParenExpr();
				997	}
				998	}
				999
				1000	/// binoprhs
				1001	/// ::= ('+' primary)*
				1002	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				1003	// If this is a binop, find its precedence.
				1004	while (1) {
				1005	int TokPrec = GetTokPrecedence();
				1006
				1007	// If this is a binop that binds at least as tightly as the current binop,
				1008	// consume it, otherwise we are done.
				1009	if (TokPrec < ExprPrec)
				1010	return LHS;
				1011
				1012	// Okay, we know this is a binop.
				1013	int BinOp = CurTok;
				1014	getNextToken(); // eat binop
				1015
				1016	// Parse the primary expression after the binary operator.
				1017	ExprAST *RHS = ParsePrimary();
				1018	if (!RHS) return 0;
				1019
				1020	// If BinOp binds less tightly with RHS than the operator after RHS, let
				1021	// the pending operator take RHS as its LHS.
				1022	int NextPrec = GetTokPrecedence();
				1023	if (TokPrec < NextPrec) {
				1024	RHS = ParseBinOpRHS(TokPrec+1, RHS);
				1025	if (RHS == 0) return 0;
				1026	}
				1027
				1028	// Merge LHS/RHS.
				1029	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				1030	}
				1031	}
				1032
				1033	/// expression
				1034	/// ::= primary binoprhs
				1035	///
				1036	static ExprAST *ParseExpression() {
				1037	ExprAST *LHS = ParsePrimary();
				1038	if (!LHS) return 0;
				1039
				1040	return ParseBinOpRHS(0, LHS);
				1041	}
				1042
				1043	/// prototype
				1044	/// ::= id '(' id* ')'
				1045	static PrototypeAST *ParsePrototype() {
				1046	if (CurTok != tok_identifier)
				1047	return ErrorP("Expected function name in prototype");
				1048
				1049	std::string FnName = IdentifierStr;
				1050	getNextToken();
				1051
				1052	if (CurTok != '(')
				1053	return ErrorP("Expected '(' in prototype");
				1054
				1055	std::vector<std::string> ArgNames;
				1056	while (getNextToken() == tok_identifier)
				1057	ArgNames.push_back(IdentifierStr);
				1058	if (CurTok != ')')
				1059	return ErrorP("Expected ')' in prototype");
				1060
				1061	// success.
				1062	getNextToken(); // eat ')'.
				1063
				1064	return new PrototypeAST(FnName, ArgNames);
				1065	}
				1066
				1067	/// definition ::= 'def' prototype expression
				1068	static FunctionAST *ParseDefinition() {
				1069	getNextToken(); // eat def.
				1070	PrototypeAST *Proto = ParsePrototype();
				1071	if (Proto == 0) return 0;
				1072
				1073	if (ExprAST *E = ParseExpression())
				1074	return new FunctionAST(Proto, E);
				1075	return 0;
				1076	}
				1077
				1078	/// toplevelexpr ::= expression
				1079	static FunctionAST *ParseTopLevelExpr() {
				1080	if (ExprAST *E = ParseExpression()) {
				1081	// Make an anonymous proto.
				1082	PrototypeAST *Proto = new PrototypeAST("", std::vector<());
				1083	return new FunctionAST(Proto, E);
				1084	}
				1085	return 0;
				1086	}
				1087
				1088	/// external ::= 'extern' prototype
				1089	static PrototypeAST *ParseExtern() {
				1090	getNextToken(); // eat extern.
				1091	return ParsePrototype();
				1092	}
				1093
				1094	//===----------------------------------------------------------------------===//
				1095	// Top-Level parsing
				1096	//===----------------------------------------------------------------------===//
				1097
				1098	static void HandleDefinition() {
				1099	if (FunctionAST *F = ParseDefinition()) {
				1100	fprintf(stderr, "Parsed a function definition.\n");
				1101	} else {
				1102	// Skip token for error recovery.
				1103	getNextToken();
				1104	}
				1105	}
				1106
				1107	static void HandleExtern() {
				1108	if (PrototypeAST *P = ParseExtern()) {
				1109	fprintf(stderr, "Parsed an extern\n");
				1110	} else {
				1111	// Skip token for error recovery.
				1112	getNextToken();
				1113	}
				1114	}
				1115
				1116	static void HandleTopLevelExpression() {
				1117	// Evaluate a top level expression into an anonymous function.
				1118	if (FunctionAST *F = ParseTopLevelExpr()) {
				1119	fprintf(stderr, "Parsed a top-level expr\n");
				1120	} else {
				1121	// Skip token for error recovery.
				1122	getNextToken();
				1123	}
				1124	}
				1125
				1126	/// top ::= definition \| external \| expression \| ';'
				1127	static void MainLoop() {
				1128	while (1) {
				1129	fprintf(stderr, "ready> ");
				1130	switch (CurTok) {
				1131	case tok_eof: return;
				1132	case ';': getNextToken(); break; // ignore top level semicolons.
				1133	case tok_def: HandleDefinition(); break;
				1134	case tok_extern: HandleExtern(); break;
				1135	default: HandleTopLevelExpression(); break;
				1136	}
				1137	}
				1138	}
				1139
				1140	//===----------------------------------------------------------------------===//
				1141	// Main driver code.
				1142	//===----------------------------------------------------------------------===//
				1143
				1144	int main() {
				1145	// Install standard binary operators.
				1146	// 1 is lowest precedence.
				1147	BinopPrecedence['<'] = 10;
				1148	BinopPrecedence['+'] = 20;
				1149	BinopPrecedence['-'] = 20;
				1150	BinopPrecedence['*'] = 40; // highest.
				1151
				1152	// Prime the first token.
				1153	fprintf(stderr, "ready> ");
				1154	getNextToken();
				1155
				1156	MainLoop();
				1157	return 0;
				1158	}
				1159	</pre>
				1160	</div>
				1161	</div>
				1162
				1163	<!-- *********************************************************************** -->
				1164	<hr>
				1165	<address>
				1166	<a href="http://jigsaw.w3.org/css-validator/check/referer"><img
				1167	src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
				1168	<a href="http://validator.w3.org/check/referer"><img
				1169	src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!" /></a>
				1170
				1171	<a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
				1172	<a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
				1173	Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
				1174	</address>
				1175	</body>
				1176	</html>