Blame - docs/tutorial/LangImpl2.html - fp2-dev/platform/external/llvm

blob: 653975cd8be684b8a0bc2d73b83013f52a4ac152 [file] [log] [blame]

Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
				2	"http://www.w3.org/TR/html4/strict.dtd">
				3
				4	<html>
				5	<head>
				6	<title>Kaleidoscope: Implementing a Parser and AST</title>
				7	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
				8	<meta name="author" content="Chris Lattner">
				9	<link rel="stylesheet" href="../llvm.css" type="text/css">
				10	</head>
				11
				12	<body>
				13
				14	<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
				15
				16	<div class="doc_author">
				17	<p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
				18	</div>
				19
				20	<!-- *********************************************************************** -->
				21	<div class="doc_section"><a name="intro">Part 2 Introduction</a></div>
				22	<!-- *********************************************************************** -->
				23
				24	<div class="doc_text">
				25
				26	<p>Welcome to part 2 of the "<a href="index.html">Implementing a language with
				27	LLVM</a>" tutorial. This chapter shows you how to use the <a
				28	href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
				29	href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
				30	our Kaleidoscope language and build an <a
				31	href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
				32	Tree</a> (AST).</p>
				33
				34	<p>The parser we will build uses a combination of <a
				35	href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
				36	Parsing</a> and <a href=
				37	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				38	Parsing</a> to parse the Kaleidoscope language (the later for binary expression
				39	and the former for everything else). Before we get to parsing though, lets talk
				40	about the output of the parser: the Abstract Syntax Tree.</p>
				41
				42	</div>
				43
				44	<!-- *********************************************************************** -->
				45	<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
				46	<!-- *********************************************************************** -->
				47
				48	<div class="doc_text">
				49
				50	<p>The AST for a program captures its behavior in a way that it is easy for
				51	later stages of the compiler (e.g. code generation) to interpret. We basically
				52	want one object for each construct in the language, and the AST should closely
				53	model the language. In Kaleidoscope, we have expressions, a prototype, and a
				54	function object. We'll start with expressions first:</p>
				55
				56	<div class="doc_code">
				57	<pre>
				58	/// ExprAST - Base class for all expression nodes.
				59	class ExprAST {
				60	public:
				61	virtual ~ExprAST() {}
				62	};
				63
				64	/// NumberExprAST - Expression class for numeric literals like "1.0".
				65	class NumberExprAST : public ExprAST {
				66	double Val;
				67	public:
				68	NumberExprAST(double val) : Val(val) {}
				69	};
				70	</pre>
				71	</div>
				72
				73	<p>The code above shows the definition of the base ExprAST class and one
				74	subclass which we use for numeric literals. The important thing about this is
				75	that the NumberExprAST class captures the numeric value of the literal in the
				76	class, so that later phases of the compiler can know what it is.</p>
				77
				78	<p>Right now we only create the AST, so there are no useful accessor methods on
				79	them. It would be very easy to add a virtual method to pretty print the code,
				80	for example. Here are the other expression AST node definitions that we'll use
				81	in the basic form of the Kaleidoscope language.
				82	</p>
				83
				84	<div class="doc_code">
				85	<pre>
				86	/// VariableExprAST - Expression class for referencing a variable, like "a".
				87	class VariableExprAST : public ExprAST {
				88	std::string Name;
				89	public:
				90	VariableExprAST(const std::string &name) : Name(name) {}
				91	};
				92
				93	/// BinaryExprAST - Expression class for a binary operator.
				94	class BinaryExprAST : public ExprAST {
				95	char Op;
				96	ExprAST LHS, RHS;
				97	public:
				98	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				99	: Op(op), LHS(lhs), RHS(rhs) {}
				100	};
				101
				102	/// CallExprAST - Expression class for function calls.
				103	class CallExprAST : public ExprAST {
				104	std::string Callee;
				105	std::vector<ExprAST*> Args;
				106	public:
				107	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				108	: Callee(callee), Args(args) {}
				109	};
				110	</pre>
				111	</div>
				112
				113	<p>This is all (intentially) rather straight-forward: variables capture the
				114	variable name, binary operators capture their opcode (e.g. '+'), and calls
				115	capture a function name and list of argument expressions. One thing that is
				116	nice about our AST is that it captures the language features without talking
				117	about the syntax of the language. Note that there is no discussion about
				118	precedence of binary operators, lexical structure etc.</p>
				119
				120	<p>For our basic language, these are all of the expression nodes we'll define.
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame^]	121	Because it doesn't have conditional control flow, it isn't Turing-complete;
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	122	we'll fix that in a later installment. The two things we need next are a way
				123	to talk about the interface to a function, and a way to talk about functions
				124	themselves:</p>
				125
				126	<div class="doc_code">
				127	<pre>
				128	/// PrototypeAST - This class represents the "prototype" for a function,
				129	/// which captures its argument names as well as if it is an operator.
				130	class PrototypeAST {
				131	std::string Name;
				132	std::vector<std::string> Args;
				133	public:
				134	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				135	: Name(name), Args(args) {}
				136	};
				137
				138	/// FunctionAST - This class represents a function definition itself.
				139	class FunctionAST {
				140	PrototypeAST *Proto;
				141	ExprAST *Body;
				142	public:
				143	FunctionAST(PrototypeAST proto, ExprAST body)
				144	: Proto(proto), Body(body) {}
				145	};
				146	</pre>
				147	</div>
				148
				149	<p>In Kaleidoscope, functions are typed with just a count of their arguments.
				150	Since all values are double precision floating point, this fact doesn't need to
				151	be captured anywhere. In a more aggressive and realistic language, the
				152	"ExprAST" class would probably have a type field.</p>
				153
				154	<p>With this scaffolding, we can now talk about parsing expressions and function
				155	bodies in Kaleidoscope.</p>
				156
				157	</div>
				158
				159	<!-- *********************************************************************** -->
				160	<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
				161	<!-- *********************************************************************** -->
				162
				163	<div class="doc_text">
				164
				165	<p>Now that we have an AST to build, we need to define the parser code to build
				166	it. The idea here is that we want to parse something like "x+y" (which is
				167	returned as three tokens by the lexer) into an AST that could be generated with
				168	calls like this:</p>
				169
				170	<div class="doc_code">
				171	<pre>
				172	ExprAST *X = new VariableExprAST("x");
				173	ExprAST *Y = new VariableExprAST("y");
				174	ExprAST *Result = new BinaryExprAST('+', X, Y);
				175	</pre>
				176	</div>
				177
				178	<p>In order to do this, we'll start by defining some basic helper routines:</p>
				179
				180	<div class="doc_code">
				181	<pre>
				182	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
				183	/// token the parser it looking at. getNextToken reads another token from the
				184	/// lexer and updates CurTok with its results.
				185	static int CurTok;
				186	static int getNextToken() {
				187	return CurTok = gettok();
				188	}
				189	</pre>
				190	</div>
				191
				192	<p>
				193	This implements a simple token buffer around the lexer. This allows
				194	us to look one token ahead at what the lexer is returning. Every function in
				195	our lexer will assume that CurTok is the current token that needs to be
				196	parsed.</p>
				197
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame^]	198	<p>Again, we define these with global variables; it would be better design to wrap the entire parser in a class and use instance variables for these.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	199	</p>
				200
				201	<div class="doc_code">
				202	<pre>
				203
				204	/// Error* - These are little helper functions for error handling.
				205	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				206	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				207	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				208	</pre>
				209	</div>
				210
				211	<p>
				212	The <tt>Error</tt> routines are simple helper routines that our parser will use
				213	to handle errors. The error recovery in our parser will not be the best and
				214	are not particular user-friendly, but it will be enough for our tutorial. These
				215	routines make it easier to handle errors in routines that have various return
				216	types: they always return null.</p>
				217
				218	<p>With these basic helper functions implemented, we can implement the first
				219	piece of our grammar: we'll start with numeric literals.</p>
				220
				221	</div>
				222
				223	<!-- *********************************************************************** -->
				224	<div class="doc_section"><a name="parserprimexprs">Basic Expression
				225	Parsing</a></div>
				226	<!-- *********************************************************************** -->
				227
				228	<div class="doc_text">
				229
				230	<p>We start with numeric literals, because they are the simplest to process.
				231	For each production in our grammar, we'll define a function which parses that
				232	production. For numeric literals, we have:
				233	</p>
				234
				235	<div class="doc_code">
				236	<pre>
				237	/// numberexpr ::= number
				238	static ExprAST *ParseNumberExpr() {
				239	ExprAST *Result = new NumberExprAST(NumVal);
				240	getNextToken(); // consume the number
				241	return Result;
				242	}
				243	</pre>
				244	</div>
				245
				246	<p>This routine is very simple: it expects to be called when the current token
				247	is a <tt>tok_number</tt> token. It takes the current number value, creates
				248	a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
				249	returns.</p>
				250
				251	<p>There are some interesting aspects of this. The most important one is that
				252	this routine eats all of the tokens that correspond to the production, and
				253	returns the lexer buffer with the next token (which is not part of the grammar
				254	production) ready to go. This is a fairly standard way to go for recursive
				255	descent parsers. For a better example, the parenthesis operator is defined like
				256	this:</p>
				257
				258	<div class="doc_code">
				259	<pre>
				260	/// parenexpr ::= '(' expression ')'
				261	static ExprAST *ParseParenExpr() {
				262	getNextToken(); // eat (.
				263	ExprAST *V = ParseExpression();
				264	if (!V) return 0;
				265
				266	if (CurTok != ')')
				267	return Error("expected ')'");
				268	getNextToken(); // eat ).
				269	return V;
				270	}
				271	</pre>
				272	</div>
				273
				274	<p>This function illustrates a number of interesting things about the parser:
				275	1) it shows how we use the Error routines. When called, this function expects
				276	that the current token is a '(' token, but after parsing the subexpression, it
				277	is possible that there is not a ')' waiting. For example, if the user types in
				278	"(4 x" instead of "(4)", the parser should emit an error. Because errors can
				279	occur, the parser needs a way to indicate that they happened: in our parser, we
				280	return null on an error.</p>
				281
				282	<p>Another interesting aspect of this function is that it uses recursion by
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame^]	283	calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	284	<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
				285	recursive grammars, and keeps each production very simple. Note that
				286	parenthesis do not cause construction of AST nodes themselves. While we could
				287	do this, the most important role of parens are to guide the parser and provide
				288	grouping. Once the parser constructs the AST, parens are not needed.</p>
				289
				290	<p>The next simple production is for handling variable references and function
				291	calls:</p>
				292
				293	<div class="doc_code">
				294	<pre>
				295	/// identifierexpr
				296	/// ::= identifer
				297	/// ::= identifer '(' expression* ')'
				298	static ExprAST *ParseIdentifierExpr() {
				299	std::string IdName = IdentifierStr;
				300
				301	getNextToken(); // eat identifer.
				302
				303	if (CurTok != '(') // Simple variable ref.
				304	return new VariableExprAST(IdName);
				305
				306	// Call.
				307	getNextToken(); // eat (
				308	std::vector<ExprAST*> Args;
				309	while (1) {
				310	ExprAST *Arg = ParseExpression();
				311	if (!Arg) return 0;
				312	Args.push_back(Arg);
				313
				314	if (CurTok == ')') break;
				315
				316	if (CurTok != ',')
				317	return Error("Expected ')'");
				318	getNextToken();
				319	}
				320
				321	// Eat the ')'.
				322	getNextToken();
				323
				324	return new CallExprAST(IdName, Args);
				325	}
				326	</pre>
				327	</div>
				328
				329	<p>This routine follows the same style as the other routines (it expects to be
				330	called if the current token is a <tt>tok_identifier</tt> token). It also has
				331	recursion and error handling. One interesting aspect of this is that it uses
				332	<em>look-ahead</em> to determine if the current identifier is a stand alone
				333	variable reference or if it is a function call expression. It handles this by
				334	checking to see if the token after the identifier is a '(' token, and constructs
				335	either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
				336	</p>
				337
				338	<p>Now that we have all of our simple expression parsing logic in place, we can
				339	define a helper function to wrap them up in a class. We call this class of
				340	expressions "primary" expressions, for reasons that will become more clear
				341	later. In order to parse a primary expression, we need to determine what sort
				342	of expression it is:</p>
				343
				344	<div class="doc_code">
				345	<pre>
				346	/// primary
				347	/// ::= identifierexpr
				348	/// ::= numberexpr
				349	/// ::= parenexpr
				350	static ExprAST *ParsePrimary() {
				351	switch (CurTok) {
				352	default: return Error("unknown token when expecting an expression");
				353	case tok_identifier: return ParseIdentifierExpr();
				354	case tok_number: return ParseNumberExpr();
				355	case '(': return ParseParenExpr();
				356	}
				357	}
				358	</pre>
				359	</div>
				360
				361	<p>Now that you see the definition of this function, it makes it more obvious
				362	why we can assume the state of CurTok in the various functions. This uses
				363	look-ahead to determine which sort of expression is being inspected, and parses
				364	it with a function call.</p>
				365
				366	<p>Now that basic expressions are handled, we need to handle binary expressions,
				367	which are a bit more complex.</p>
				368
				369	</div>
				370
				371	<!-- *********************************************************************** -->
				372	<div class="doc_section"><a name="parserbinops">Binary Expression
				373	Parsing</a></div>
				374	<!-- *********************************************************************** -->
				375
				376	<div class="doc_text">
				377
				378	<p>Binary expressions are significantly harder to parse because they are often
				379	ambiguous. For example, when given the string "x+y*z", the parser can choose
				380	to parse it as either "(x+y)z" or "x+(yz)". With common definitions from
				381	mathematics, we expect the later parse, because "*" (multiplication) has
				382	higher <em>precedence</em> than "+" (addition).</p>
				383
				384	<p>There are many ways to handle this, but an elegant and efficient way is to
				385	use <a href=
				386	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				387	Parsing</a>. This parsing technique uses the precedence of binary operators to
				388	guide recursion. To start with, we need a table of precedences:</p>
				389
				390	<div class="doc_code">
				391	<pre>
				392	/// BinopPrecedence - This holds the precedence for each binary operator that is
				393	/// defined.
				394	static std::map<char, int> BinopPrecedence;
				395
				396	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				397	static int GetTokPrecedence() {
				398	if (!isascii(CurTok))
				399	return -1;
				400
				401	// Make sure it's a declared binop.
				402	int TokPrec = BinopPrecedence[CurTok];
				403	if (TokPrec <= 0) return -1;
				404	return TokPrec;
				405	}
				406
				407	int main() {
				408	// Install standard binary operators.
				409	// 1 is lowest precedence.
				410	BinopPrecedence['<'] = 10;
				411	BinopPrecedence['+'] = 20;
				412	BinopPrecedence['-'] = 20;
				413	BinopPrecedence['*'] = 40; // highest.
				414	...
				415	}
				416	</pre>
				417	</div>
				418
				419	<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
				420	(this can obviously be extended by you, the reader). The
				421	<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
				422	or -1 if the token is not a binary operator. Having a map makes it easy to add
				423	new operators and makes it clear that the algorithm doesn't depend on the
				424	specific operators involved, but it would be easy enough to eliminate the map
				425	and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
				426
				427	<p>With the helper above defined, we can now start parsing binary expressions.
				428	The basic idea of operator precedence parsing is to break down an expression
				429	with potentially ambiguous binary operators into pieces. Consider for example
				430	the expression "a+b+(c+d)ef+g". Operator precedence parsing considers this
				431	as a stream of primary expressions separated by binary operators. As such,
				432	it will first parse the leading primary expression "a", then it will see the
				433	pairs [+, b] [+, (c+d)] [, e] [, f] and [+, g]. Note that because parentheses
				434	are primary expressions that the binary expression parser doesn't need to worry
				435	about nested subexpressions like (c+d) at all.
				436	</p>
				437
				438	<p>
				439	To start, an expression is a primary expression potentially followed by a
				440	sequence of [binop,primaryexpr] pairs:</p>
				441
				442	<div class="doc_code">
				443	<pre>
				444	/// expression
				445	/// ::= primary binoprhs
				446	///
				447	static ExprAST *ParseExpression() {
				448	ExprAST *LHS = ParsePrimary();
				449	if (!LHS) return 0;
				450
				451	return ParseBinOpRHS(0, LHS);
				452	}
				453	</pre>
				454	</div>
				455
				456	<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
				457	us. It takes a precedence and a pointer to an expression for the part parsed
				458	so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
				459	allowed to be empty, in which case it returns the expression that is passed into
				460	it. In our example above, the code passes the expression for "a" into
				461	<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
				462
				463	<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
				464	minimal operator precedence</em> that the function is allowed to eat. For
				465	example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
				466	passed in a precedence of 40, it will not consume any tokens (because the
				467	precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
				468	with:</p>
				469
				470	<div class="doc_code">
				471	<pre>
				472	/// binoprhs
				473	/// ::= ('+' primary)*
				474	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				475	// If this is a binop, find its precedence.
				476	while (1) {
				477	int TokPrec = GetTokPrecedence();
				478
				479	// If this is a binop that binds at least as tightly as the current binop,
				480	// consume it, otherwise we are done.
				481	if (TokPrec < ExprPrec)
				482	return LHS;
				483	</pre>
				484	</div>
				485
				486	<p>This code gets the precedence of the current token and checks to see if if is
				487	too low. Because we defined invalid tokens to have a precedence of -1, this
				488	check implicitly knows that the pair-stream ends when the token stream runs out
				489	of binary operators. If this check succeeds, we know that the token is a binary
				490	operator and that it will be included in this expression:</p>
				491
				492	<div class="doc_code">
				493	<pre>
				494	// Okay, we know this is a binop.
				495	int BinOp = CurTok;
				496	getNextToken(); // eat binop
				497
				498	// Parse the primary expression after the binary operator.
				499	ExprAST *RHS = ParsePrimary();
				500	if (!RHS) return 0;
				501	</pre>
				502	</div>
				503
				504	<p>As such, this code eats (and remembers) the binary operator and then parses
				505	the following primary expression. This builds up the whole pair, the first of
				506	which is [+, b] for the running example.</p>
				507
				508	<p>Now that we parsed the left-hand side of an expression and one pair of the
				509	RHS sequence, we have to decide which way the expression associates. In
				510	particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
				511	To determine this, we look ahead at "binop" to determine its precedence and
				512	compare it to BinOp's precedence (which is '+' in this case):</p>
				513
				514	<div class="doc_code">
				515	<pre>
				516	// If BinOp binds less tightly with RHS than the operator after RHS, let
				517	// the pending operator take RHS as its LHS.
				518	int NextPrec = GetTokPrecedence();
				519	if (TokPrec < NextPrec) {
				520	</pre>
				521	</div>
				522
				523	<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
				524	precedence of our current operator, then we know that the parentheses associate
				525	as "(a+b) binop ...". In our example, since the next operator is "+" and so is
				526	our current one, we know that they have the same precedence. In this case we'll
				527	create the AST node for "a+b", and then continue parsing:</p>
				528
				529	<div class="doc_code">
				530	<pre>
				531	... if body omitted ...
				532	}
				533
				534	// Merge LHS/RHS.
				535	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				536	} // loop around to the top of the while loop.
				537	}
				538	</pre>
				539	</div>
				540
				541	<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
				542	iteration of the loop, with "+" as the current token. The code above will eat
				543	and remember it and parse "(c+d)" as the primary expression, which makes the
				544	current pair be [+, (c+d)]. It will then enter the 'if' above with "*" as the
				545	binop to the right of the primary. In this case, the precedence of "*" is
				546	higher than the precedence of "+" so the if condition will be entered.</p>
				547
				548	<p>The critical question left here is "how can the if condition parse the right
				549	hand side in full"? In particular, to build the AST correctly for our example,
				550	it needs to get all of "(c+d)ef" as the RHS expression variable. The code to
				551	do this is surprisingly simple (code from the above two blocks duplicated for
				552	context):</p>
				553
				554	<div class="doc_code">
				555	<pre>
				556	// If BinOp binds less tightly with RHS than the operator after RHS, let
				557	// the pending operator take RHS as its LHS.
				558	int NextPrec = GetTokPrecedence();
				559	if (TokPrec < NextPrec) {
				560	RHS = ParseBinOpRHS(TokPrec+1, RHS);
				561	if (RHS == 0) return 0;
				562	}
				563	// Merge LHS/RHS.
				564	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				565	} // loop around to the top of the while loop.
				566	}
				567	</pre>
				568	</div>
				569
				570	<p>At this point, we know that the binary operator to the RHS of our primary
				571	has higher precedence than the binop we are currently parsing. As such, we know
				572	that any sequence of pairs whose operators are all higher precedence than "+"
				573	should be parsed together and returned as "RHS". To do this, we recursively
				574	invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
				575	precedence required for it to continue. In our example above, this will cause
				576	it to return the AST node for "(c+d)ef" as RHS, which is then set as the RHS
				577	of the '+' expression.</p>
				578
				579	<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
				580	and added to the AST. With this little bit of code (14 non-trivial lines), we
				581	correctly handle fully general binary expression parsing in a very elegant way.
				582	</p>
				583
				584	<p>This wraps up handling of expressions. At this point, we can point the
				585	parser at an arbitrary token stream and build an expression from them, stopping
				586	at the first token that is not part of the expression. Next up we need to
				587	handle function definitions etc.</p>
				588
				589	</div>
				590
				591	<!-- *********************************************************************** -->
				592	<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
				593	<!-- *********************************************************************** -->
				594
				595	<div class="doc_text">
				596
				597	<p>
				598	The first basic thing missing is that of function prototypes. In Kaleidoscope,
				599	these are used both for 'extern' function declarations as well as function body
				600	definitions. The code to do this is straight-forward and not very interesting
				601	(once you've survived expressions):
				602	</p>
				603
				604	<div class="doc_code">
				605	<pre>
				606	/// prototype
				607	/// ::= id '(' id* ')'
				608	static PrototypeAST *ParsePrototype() {
				609	if (CurTok != tok_identifier)
				610	return ErrorP("Expected function name in prototype");
				611
				612	std::string FnName = IdentifierStr;
				613	getNextToken();
				614
				615	if (CurTok != '(')
				616	return ErrorP("Expected '(' in prototype");
				617
				618	std::vector<std::string> ArgNames;
				619	while (getNextToken() == tok_identifier)
				620	ArgNames.push_back(IdentifierStr);
				621	if (CurTok != ')')
				622	return ErrorP("Expected ')' in prototype");
				623
				624	// success.
				625	getNextToken(); // eat ')'.
				626
				627	return new PrototypeAST(FnName, ArgNames);
				628	}
				629	</pre>
				630	</div>
				631
				632	<p>Given this, a function definition is very simple, just a prototype plus
				633	and expression to implement the body:</p>
				634
				635	<div class="doc_code">
				636	<pre>
				637	/// definition ::= 'def' prototype expression
				638	static FunctionAST *ParseDefinition() {
				639	getNextToken(); // eat def.
				640	PrototypeAST *Proto = ParsePrototype();
				641	if (Proto == 0) return 0;
				642
				643	if (ExprAST *E = ParseExpression())
				644	return new FunctionAST(Proto, E);
				645	return 0;
				646	}
				647	</pre>
				648	</div>
				649
				650	<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
				651	well as to support forward declaration of user functions. 'externs' are just
				652	prototypes with no body:</p>
				653
				654	<div class="doc_code">
				655	<pre>
				656	/// external ::= 'extern' prototype
				657	static PrototypeAST *ParseExtern() {
				658	getNextToken(); // eat extern.
				659	return ParsePrototype();
				660	}
				661	</pre>
				662	</div>
				663
				664	<p>Finally, we'll also let the user type in arbitrary top-level expressions and
				665	evaluate them on the fly. We will handle this by defining anonymous nullary
				666	(zero argument) functions for them:</p>
				667
				668	<div class="doc_code">
				669	<pre>
				670	/// toplevelexpr ::= expression
				671	static FunctionAST *ParseTopLevelExpr() {
				672	if (ExprAST *E = ParseExpression()) {
				673	// Make an anonymous proto.
				674	PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
				675	return new FunctionAST(Proto, E);
				676	}
				677	return 0;
				678	}
				679	</pre>
				680	</div>
				681
				682	<p>Now that we have all the pieces, lets build a little driver that will let us
				683	actually <em>execute</em> this code we've built!</p>
				684
				685	</div>
				686
				687	<!-- *********************************************************************** -->
				688	<div class="doc_section"><a name="driver">The Driver</a></div>
				689	<!-- *********************************************************************** -->
				690
				691	<div class="doc_text">
				692
				693	<p>The driver for this simply invokes all of the parsing pieces with a top-level
				694	dispatch loop. There isn't much interesting here, so I'll just include the
				695	top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
				696	Parsing" section.</p>
				697
				698	<div class="doc_code">
				699	<pre>
				700	/// top ::= definition \| external \| expression \| ';'
				701	static void MainLoop() {
				702	while (1) {
				703	fprintf(stderr, "ready> ");
				704	switch (CurTok) {
				705	case tok_eof: return;
				706	case ';': getNextToken(); break; // ignore top level semicolons.
				707	case tok_def: HandleDefinition(); break;
				708	case tok_extern: HandleExtern(); break;
				709	default: HandleTopLevelExpression(); break;
				710	}
				711	}
				712	}
				713	</pre>
				714	</div>
				715
				716	<p>The most interesting part of this is that we ignore top-level semi colons.
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame^]	717	Why is this, you ask? The basic reason is that if you type "4 + 5" at the
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	718	command line, the parser doesn't know that that is the end of what you will
				719	type. For example, on the next line you could type "def foo..." in which case
				720	4+5 is the end of a top-level expression. Alternatively you could type "* 6",
				721	which would continue the expression. Having top-level semicolons allows you to
				722	type "4+5;" and the parser will know you are done.</p>
				723
				724	</div>
				725
				726	<!-- *********************************************************************** -->
				727	<div class="doc_section"><a name="code">Conclusions and the Full Code</a></div>
				728	<!-- *********************************************************************** -->
				729
				730	<div class="doc_text">
				731
				732	<p>With just under 400 lines of commented code, we fully defined our minimal
				733	language, including a lexer, parser and AST builder. With this done, the
				734	executable will validate code and tell us if it is gramatically invalid. For
				735	example, here is a sample interaction:</p>
				736
				737	<div class="doc_code">
				738	<pre>
				739	$ ./a.out
				740	ready> def foo(x y) x+foo(y, 4.0);
				741	ready> Parsed an function definition.
				742	ready> def foo(x y) x+y y;
				743	ready> Parsed an function definition.
				744	ready> Parsed a top-level expr
				745	ready> def foo(x y) x+y );
				746	ready> Parsed an function definition.
				747	ready> Error: unknown token when expecting an expression
				748	ready> extern sin(a);
				749	ready> Parsed an extern
				750	ready> ^D
				751	$
				752	</pre>
				753	</div>
				754
				755	<p>
				756	Here is the full code. Note that it is fully self-contained: you don't even
				757	need LLVM for this. In the <a href="LangImpl3.html">next installment</a>, we
				758	will describe how to generate LLVM IR from the AST.</p>
				759
				760	<div class="doc_code">
				761	<pre>
				762	// To build this:
				763	// g++ -g toy.cpp
				764	// ./a.out
				765
				766	#include <cstdio>
				767	#include <string>
				768	#include <
				769	#include <vector>
				770
				771	//===----------------------------------------------------------------------===//
				772	// Lexer
				773	//===----------------------------------------------------------------------===//
				774
				775	// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
				776	// of these for known things.
				777	enum Token {
				778	tok_eof = -1,
				779
				780	// commands
				781	tok_def = -2, tok_extern = -3,
				782
				783	// primary
				784	tok_identifier = -4, tok_number = -5,
				785	};
				786
				787	static std::string IdentifierStr; // Filled in if tok_identifier
				788	static double NumVal; // Filled in if tok_number
				789
				790	/// gettok - Return the next token from standard input.
				791	static int gettok() {
				792	static int LastChar = ' ';
				793
				794	// Skip any whitespace.
				795	while (isspace(LastChar))
				796	LastChar = getchar();
				797
				798	if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
				799	IdentifierStr = LastChar;
				800	while (isalnum((LastChar = getchar())))
				801	IdentifierStr += LastChar;
				802
				803	if (IdentifierStr == "def") return tok_def;
				804	if (IdentifierStr == "extern") return tok_extern;
				805	return tok_identifier;
				806	}
				807
				808	if (isdigit(LastChar) \|\| LastChar == '.') { // Number: [0-9.]+
				809	std::string NumStr;
				810	do {
				811	NumStr += LastChar;
				812	LastChar = getchar();
				813	} while (isdigit(LastChar) \|\| LastChar == '.');
				814
				815	NumVal = strtod(NumStr.c_str(), 0);
				816	return tok_number;
				817	}
				818
				819	if (LastChar == '#') {
				820	// Comment until end of line.
				821	do LastChar = getchar();
				822	while (LastChar != EOF && LastChar != '\n' & LastChar != '\r');
				823
				824	if (LastChar != EOF)
				825	return gettok();
				826	}
				827
				828	// Check for end of file. Don't eat the EOF.
				829	if (LastChar == EOF)
				830	return tok_eof;
				831
				832	// Otherwise, just return the character as its ascii value.
				833	int ThisChar = LastChar;
				834	LastChar = getchar();
				835	return ThisChar;
				836	}
				837
				838	//===----------------------------------------------------------------------===//
				839	// Abstract Syntax Tree (aka Parse Tree)
				840	//===----------------------------------------------------------------------===//
				841
				842	/// ExprAST - Base class for all expression nodes.
				843	class ExprAST {
				844	public:
				845	virtual ~ExprAST() {}
				846	};
				847
				848	/// NumberExprAST - Expression class for numeric literals like "1.0".
				849	class NumberExprAST : public ExprAST {
				850	double Val;
				851	public:
				852	NumberExprAST(double val) : Val(val) {}
				853	};
				854
				855	/// VariableExprAST - Expression class for referencing a variable, like "a".
				856	class VariableExprAST : public ExprAST {
				857	std::string Name;
				858	public:
				859	VariableExprAST(const std::string &name) : Name(name) {}
				860	};
				861
				862	/// BinaryExprAST - Expression class for a binary operator.
				863	class BinaryExprAST : public ExprAST {
				864	char Op;
				865	ExprAST LHS, RHS;
				866	public:
				867	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				868	: Op(op), LHS(lhs), RHS(rhs) {}
				869	};
				870
				871	/// CallExprAST - Expression class for function calls.
				872	class CallExprAST : public ExprAST {
				873	std::string Callee;
				874	std::vector<ExprAST*> Args;
				875	public:
				876	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				877	: Callee(callee), Args(args) {}
				878	};
				879
				880	/// PrototypeAST - This class represents the "prototype" for a function,
				881	/// which captures its argument names as well as if it is an operator.
				882	class PrototypeAST {
				883	std::string Name;
				884	std::vector< Args;
				885	public:
				886	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				887	: Name(name), Args(args) {}
				888
				889	};
				890
				891	/// FunctionAST - This class represents a function definition itself.
				892	class FunctionAST {
				893	PrototypeAST *Proto;
				894	ExprAST *Body;
				895	public:
				896	FunctionAST(PrototypeAST proto, ExprAST body)
				897	: Proto(proto), Body(body) {}
				898
				899	};
				900
				901	//===----------------------------------------------------------------------===//
				902	// Parser
				903	//===----------------------------------------------------------------------===//
				904
				905	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
				906	/// token the parser it looking at. getNextToken reads another token from the
				907	/// lexer and updates CurTok with its results.
				908	static int CurTok;
				909	static int getNextToken() {
				910	return CurTok = gettok();
				911	}
				912
				913	/// BinopPrecedence - This holds the precedence for each binary operator that is
				914	/// defined.
				915	static std::map<char, int> BinopPrecedence;
				916
				917	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				918	static int GetTokPrecedence() {
				919	if (!isascii(CurTok))
				920	return -1;
				921
				922	// Make sure it's a declared binop.
				923	int TokPrec = BinopPrecedence[CurTok];
				924	if (TokPrec <= 0) return -1;
				925	return TokPrec;
				926	}
				927
				928	/// Error* - These are little helper functions for error handling.
				929	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				930	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				931	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				932
				933	static ExprAST *ParseExpression();
				934
				935	/// identifierexpr
				936	/// ::= identifer
				937	/// ::= identifer '(' expression* ')'
				938	static ExprAST *ParseIdentifierExpr() {
				939	std::string IdName = IdentifierStr;
				940
				941	getNextToken(); // eat identifer.
				942
				943	if (CurTok != '(') // Simple variable ref.
				944	return new VariableExprAST(IdName);
				945
				946	// Call.
				947	getNextToken(); // eat (
				948	std::vector<ExprAST*> Args;
				949	while (1) {
				950	ExprAST *Arg = ParseExpression();
				951	if (!Arg) return 0;
				952	Args.push_back(Arg);
				953
				954	if (CurTok == ')') break;
				955
				956	if (CurTok != ',')
				957	return Error("Expected ')'");
				958	getNextToken();
				959	}
				960
				961	// Eat the ')'.
				962	getNextToken();
				963
				964	return new CallExprAST(IdName, Args);
				965	}
				966
				967	/// numberexpr ::= number
				968	static ExprAST *ParseNumberExpr() {
				969	ExprAST *Result = new NumberExprAST(NumVal);
				970	getNextToken(); // consume the number
				971	return Result;
				972	}
				973
				974	/// parenexpr ::= '(' expression ')'
				975	static ExprAST *ParseParenExpr() {
				976	getNextToken(); // eat (.
				977	ExprAST *V = ParseExpression();
				978	if (!V) return 0;
				979
				980	if (CurTok != ')')
				981	return Error("expected ')'");
				982	getNextToken(); // eat ).
				983	return V;
				984	}
				985
				986	/// primary
				987	/// ::= identifierexpr
				988	/// ::= numberexpr
				989	/// ::= parenexpr
				990	static ExprAST *ParsePrimary() {
				991	switch (CurTok) {
				992	default: return Error("unknown token when expecting an expression");
				993	case tok_identifier: return ParseIdentifierExpr();
				994	case tok_number: return ParseNumberExpr();
				995	case '(': return ParseParenExpr();
				996	}
				997	}
				998
				999	/// binoprhs
				1000	/// ::= ('+' primary)*
				1001	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				1002	// If this is a binop, find its precedence.
				1003	while (1) {
				1004	int TokPrec = GetTokPrecedence();
				1005
				1006	// If this is a binop that binds at least as tightly as the current binop,
				1007	// consume it, otherwise we are done.
				1008	if (TokPrec < ExprPrec)
				1009	return LHS;
				1010
				1011	// Okay, we know this is a binop.
				1012	int BinOp = CurTok;
				1013	getNextToken(); // eat binop
				1014
				1015	// Parse the primary expression after the binary operator.
				1016	ExprAST *RHS = ParsePrimary();
				1017	if (!RHS) return 0;
				1018
				1019	// If BinOp binds less tightly with RHS than the operator after RHS, let
				1020	// the pending operator take RHS as its LHS.
				1021	int NextPrec = GetTokPrecedence();
				1022	if (TokPrec < NextPrec) {
				1023	RHS = ParseBinOpRHS(TokPrec+1, RHS);
				1024	if (RHS == 0) return 0;
				1025	}
				1026
				1027	// Merge LHS/RHS.
				1028	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				1029	}
				1030	}
				1031
				1032	/// expression
				1033	/// ::= primary binoprhs
				1034	///
				1035	static ExprAST *ParseExpression() {
				1036	ExprAST *LHS = ParsePrimary();
				1037	if (!LHS) return 0;
				1038
				1039	return ParseBinOpRHS(0, LHS);
				1040	}
				1041
				1042	/// prototype
				1043	/// ::= id '(' id* ')'
				1044	static PrototypeAST *ParsePrototype() {
				1045	if (CurTok != tok_identifier)
				1046	return ErrorP("Expected function name in prototype");
				1047
				1048	std::string FnName = IdentifierStr;
				1049	getNextToken();
				1050
				1051	if (CurTok != '(')
				1052	return ErrorP("Expected '(' in prototype");
				1053
				1054	std::vector<std::string> ArgNames;
				1055	while (getNextToken() == tok_identifier)
				1056	ArgNames.push_back(IdentifierStr);
				1057	if (CurTok != ')')
				1058	return ErrorP("Expected ')' in prototype");
				1059
				1060	// success.
				1061	getNextToken(); // eat ')'.
				1062
				1063	return new PrototypeAST(FnName, ArgNames);
				1064	}
				1065
				1066	/// definition ::= 'def' prototype expression
				1067	static FunctionAST *ParseDefinition() {
				1068	getNextToken(); // eat def.
				1069	PrototypeAST *Proto = ParsePrototype();
				1070	if (Proto == 0) return 0;
				1071
				1072	if (ExprAST *E = ParseExpression())
				1073	return new FunctionAST(Proto, E);
				1074	return 0;
				1075	}
				1076
				1077	/// toplevelexpr ::= expression
				1078	static FunctionAST *ParseTopLevelExpr() {
				1079	if (ExprAST *E = ParseExpression()) {
				1080	// Make an anonymous proto.
				1081	PrototypeAST *Proto = new PrototypeAST("", std::vector<());
				1082	return new FunctionAST(Proto, E);
				1083	}
				1084	return 0;
				1085	}
				1086
				1087	/// external ::= 'extern' prototype
				1088	static PrototypeAST *ParseExtern() {
				1089	getNextToken(); // eat extern.
				1090	return ParsePrototype();
				1091	}
				1092
				1093	//===----------------------------------------------------------------------===//
				1094	// Top-Level parsing
				1095	//===----------------------------------------------------------------------===//
				1096
				1097	static void HandleDefinition() {
				1098	if (FunctionAST *F = ParseDefinition()) {
				1099	fprintf(stderr, "Parsed a function definition.\n");
				1100	} else {
				1101	// Skip token for error recovery.
				1102	getNextToken();
				1103	}
				1104	}
				1105
				1106	static void HandleExtern() {
				1107	if (PrototypeAST *P = ParseExtern()) {
				1108	fprintf(stderr, "Parsed an extern\n");
				1109	} else {
				1110	// Skip token for error recovery.
				1111	getNextToken();
				1112	}
				1113	}
				1114
				1115	static void HandleTopLevelExpression() {
				1116	// Evaluate a top level expression into an anonymous function.
				1117	if (FunctionAST *F = ParseTopLevelExpr()) {
				1118	fprintf(stderr, "Parsed a top-level expr\n");
				1119	} else {
				1120	// Skip token for error recovery.
				1121	getNextToken();
				1122	}
				1123	}
				1124
				1125	/// top ::= definition \| external \| expression \| ';'
				1126	static void MainLoop() {
				1127	while (1) {
				1128	fprintf(stderr, "ready> ");
				1129	switch (CurTok) {
				1130	case tok_eof: return;
				1131	case ';': getNextToken(); break; // ignore top level semicolons.
				1132	case tok_def: HandleDefinition(); break;
				1133	case tok_extern: HandleExtern(); break;
				1134	default: HandleTopLevelExpression(); break;
				1135	}
				1136	}
				1137	}
				1138
				1139	//===----------------------------------------------------------------------===//
				1140	// Main driver code.
				1141	//===----------------------------------------------------------------------===//
				1142
				1143	int main() {
				1144	// Install standard binary operators.
				1145	// 1 is lowest precedence.
				1146	BinopPrecedence['<'] = 10;
				1147	BinopPrecedence['+'] = 20;
				1148	BinopPrecedence['-'] = 20;
				1149	BinopPrecedence['*'] = 40; // highest.
				1150
				1151	// Prime the first token.
				1152	fprintf(stderr, "ready> ");
				1153	getNextToken();
				1154
				1155	MainLoop();
				1156	return 0;
				1157	}
				1158	</pre>
				1159	</div>
				1160	</div>
				1161
				1162	<!-- *********************************************************************** -->
				1163	<hr>
				1164	<address>
				1165	<a href="http://jigsaw.w3.org/css-validator/check/referer"><img
				1166	src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
				1167	<a href="http://validator.w3.org/check/referer"><img
				1168	src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!" /></a>
				1169
				1170	<a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
				1171	<a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
				1172	Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
				1173	</address>
				1174	</body>
				1175	</html>