Blame - docs/tutorial/LangImpl2.html - fp2-dev/platform/external/llvm

blob: d364f4de3e7c3b893d79bc235569b8e82a78ec96 [file] [log] [blame]

Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
				2	"http://www.w3.org/TR/html4/strict.dtd">
				3
				4	<html>
				5	<head>
				6	<title>Kaleidoscope: Implementing a Parser and AST</title>
				7	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
				8	<meta name="author" content="Chris Lattner">
				9	<link rel="stylesheet" href="../llvm.css" type="text/css">
				10	</head>
				11
				12	<body>
				13
				14	<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
				15
				16	<div class="doc_author">
				17	<p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
				18	</div>
				19
				20	<!-- *********************************************************************** -->
				21	<div class="doc_section"><a name="intro">Part 2 Introduction</a></div>
				22	<!-- *********************************************************************** -->
				23
				24	<div class="doc_text">
				25
				26	<p>Welcome to part 2 of the "<a href="index.html">Implementing a language with
				27	LLVM</a>" tutorial. This chapter shows you how to use the <a
				28	href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
				29	href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
				30	our Kaleidoscope language and build an <a
				31	href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
				32	Tree</a> (AST).</p>
				33
				34	<p>The parser we will build uses a combination of <a
				35	href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
				36	Parsing</a> and <a href=
				37	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				38	Parsing</a> to parse the Kaleidoscope language (the later for binary expression
				39	and the former for everything else). Before we get to parsing though, lets talk
				40	about the output of the parser: the Abstract Syntax Tree.</p>
				41
				42	</div>
				43
				44	<!-- *********************************************************************** -->
				45	<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
				46	<!-- *********************************************************************** -->
				47
				48	<div class="doc_text">
				49
				50	<p>The AST for a program captures its behavior in a way that it is easy for
				51	later stages of the compiler (e.g. code generation) to interpret. We basically
				52	want one object for each construct in the language, and the AST should closely
				53	model the language. In Kaleidoscope, we have expressions, a prototype, and a
				54	function object. We'll start with expressions first:</p>
				55
				56	<div class="doc_code">
				57	<pre>
				58	/// ExprAST - Base class for all expression nodes.
				59	class ExprAST {
				60	public:
				61	virtual ~ExprAST() {}
				62	};
				63
				64	/// NumberExprAST - Expression class for numeric literals like "1.0".
				65	class NumberExprAST : public ExprAST {
				66	double Val;
				67	public:
Chris Lattner	28571ed	2007-10-23 04:27:44 +0000	[diff] [blame]	68	explicit NumberExprAST(double val) : Val(val) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	69	};
				70	</pre>
				71	</div>
				72
				73	<p>The code above shows the definition of the base ExprAST class and one
				74	subclass which we use for numeric literals. The important thing about this is
				75	that the NumberExprAST class captures the numeric value of the literal in the
				76	class, so that later phases of the compiler can know what it is.</p>
				77
				78	<p>Right now we only create the AST, so there are no useful accessor methods on
				79	them. It would be very easy to add a virtual method to pretty print the code,
				80	for example. Here are the other expression AST node definitions that we'll use
				81	in the basic form of the Kaleidoscope language.
				82	</p>
				83
				84	<div class="doc_code">
				85	<pre>
				86	/// VariableExprAST - Expression class for referencing a variable, like "a".
				87	class VariableExprAST : public ExprAST {
				88	std::string Name;
				89	public:
Chris Lattner	28571ed	2007-10-23 04:27:44 +0000	[diff] [blame]	90	explicit VariableExprAST(const std::string &name) : Name(name) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	91	};
				92
				93	/// BinaryExprAST - Expression class for a binary operator.
				94	class BinaryExprAST : public ExprAST {
				95	char Op;
				96	ExprAST LHS, RHS;
				97	public:
				98	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				99	: Op(op), LHS(lhs), RHS(rhs) {}
				100	};
				101
				102	/// CallExprAST - Expression class for function calls.
				103	class CallExprAST : public ExprAST {
				104	std::string Callee;
				105	std::vector<ExprAST*> Args;
				106	public:
				107	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				108	: Callee(callee), Args(args) {}
				109	};
				110	</pre>
				111	</div>
				112
				113	<p>This is all (intentially) rather straight-forward: variables capture the
				114	variable name, binary operators capture their opcode (e.g. '+'), and calls
				115	capture a function name and list of argument expressions. One thing that is
				116	nice about our AST is that it captures the language features without talking
				117	about the syntax of the language. Note that there is no discussion about
				118	precedence of binary operators, lexical structure etc.</p>
				119
				120	<p>For our basic language, these are all of the expression nodes we'll define.
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	121	Because it doesn't have conditional control flow, it isn't Turing-complete;
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	122	we'll fix that in a later installment. The two things we need next are a way
				123	to talk about the interface to a function, and a way to talk about functions
				124	themselves:</p>
				125
				126	<div class="doc_code">
				127	<pre>
				128	/// PrototypeAST - This class represents the "prototype" for a function,
				129	/// which captures its argument names as well as if it is an operator.
				130	class PrototypeAST {
				131	std::string Name;
				132	std::vector<std::string> Args;
				133	public:
				134	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				135	: Name(name), Args(args) {}
				136	};
				137
				138	/// FunctionAST - This class represents a function definition itself.
				139	class FunctionAST {
				140	PrototypeAST *Proto;
				141	ExprAST *Body;
				142	public:
				143	FunctionAST(PrototypeAST proto, ExprAST body)
				144	: Proto(proto), Body(body) {}
				145	};
				146	</pre>
				147	</div>
				148
				149	<p>In Kaleidoscope, functions are typed with just a count of their arguments.
				150	Since all values are double precision floating point, this fact doesn't need to
				151	be captured anywhere. In a more aggressive and realistic language, the
				152	"ExprAST" class would probably have a type field.</p>
				153
				154	<p>With this scaffolding, we can now talk about parsing expressions and function
				155	bodies in Kaleidoscope.</p>
				156
				157	</div>
				158
				159	<!-- *********************************************************************** -->
				160	<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
				161	<!-- *********************************************************************** -->
				162
				163	<div class="doc_text">
				164
				165	<p>Now that we have an AST to build, we need to define the parser code to build
				166	it. The idea here is that we want to parse something like "x+y" (which is
				167	returned as three tokens by the lexer) into an AST that could be generated with
				168	calls like this:</p>
				169
				170	<div class="doc_code">
				171	<pre>
				172	ExprAST *X = new VariableExprAST("x");
				173	ExprAST *Y = new VariableExprAST("y");
				174	ExprAST *Result = new BinaryExprAST('+', X, Y);
				175	</pre>
				176	</div>
				177
				178	<p>In order to do this, we'll start by defining some basic helper routines:</p>
				179
				180	<div class="doc_code">
				181	<pre>
				182	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
				183	/// token the parser it looking at. getNextToken reads another token from the
				184	/// lexer and updates CurTok with its results.
				185	static int CurTok;
				186	static int getNextToken() {
				187	return CurTok = gettok();
				188	}
				189	</pre>
				190	</div>
				191
				192	<p>
				193	This implements a simple token buffer around the lexer. This allows
				194	us to look one token ahead at what the lexer is returning. Every function in
Chris Lattner	e949512	2007-10-25 18:05:29 +0000	[diff] [blame^]	195	our parser will assume that CurTok is the current token that needs to be
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	196	parsed.</p>
				197
Chris Lattner	bee175b	2007-10-22 16:44:31 +0000	[diff] [blame]	198	<p>Again, we define these with global variables; it would be better design to
				199	wrap the entire parser in a class and use instance variables for these.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	200	</p>
				201
				202	<div class="doc_code">
				203	<pre>
				204
				205	/// Error* - These are little helper functions for error handling.
				206	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				207	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				208	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				209	</pre>
				210	</div>
				211
				212	<p>
				213	The <tt>Error</tt> routines are simple helper routines that our parser will use
				214	to handle errors. The error recovery in our parser will not be the best and
				215	are not particular user-friendly, but it will be enough for our tutorial. These
				216	routines make it easier to handle errors in routines that have various return
				217	types: they always return null.</p>
				218
				219	<p>With these basic helper functions implemented, we can implement the first
				220	piece of our grammar: we'll start with numeric literals.</p>
				221
				222	</div>
				223
				224	<!-- *********************************************************************** -->
				225	<div class="doc_section"><a name="parserprimexprs">Basic Expression
				226	Parsing</a></div>
				227	<!-- *********************************************************************** -->
				228
				229	<div class="doc_text">
				230
				231	<p>We start with numeric literals, because they are the simplest to process.
				232	For each production in our grammar, we'll define a function which parses that
				233	production. For numeric literals, we have:
				234	</p>
				235
				236	<div class="doc_code">
				237	<pre>
				238	/// numberexpr ::= number
				239	static ExprAST *ParseNumberExpr() {
				240	ExprAST *Result = new NumberExprAST(NumVal);
				241	getNextToken(); // consume the number
				242	return Result;
				243	}
				244	</pre>
				245	</div>
				246
				247	<p>This routine is very simple: it expects to be called when the current token
				248	is a <tt>tok_number</tt> token. It takes the current number value, creates
				249	a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
				250	returns.</p>
				251
				252	<p>There are some interesting aspects of this. The most important one is that
				253	this routine eats all of the tokens that correspond to the production, and
				254	returns the lexer buffer with the next token (which is not part of the grammar
				255	production) ready to go. This is a fairly standard way to go for recursive
				256	descent parsers. For a better example, the parenthesis operator is defined like
				257	this:</p>
				258
				259	<div class="doc_code">
				260	<pre>
				261	/// parenexpr ::= '(' expression ')'
				262	static ExprAST *ParseParenExpr() {
				263	getNextToken(); // eat (.
				264	ExprAST *V = ParseExpression();
				265	if (!V) return 0;
				266
				267	if (CurTok != ')')
				268	return Error("expected ')'");
				269	getNextToken(); // eat ).
				270	return V;
				271	}
				272	</pre>
				273	</div>
				274
				275	<p>This function illustrates a number of interesting things about the parser:
				276	1) it shows how we use the Error routines. When called, this function expects
				277	that the current token is a '(' token, but after parsing the subexpression, it
				278	is possible that there is not a ')' waiting. For example, if the user types in
				279	"(4 x" instead of "(4)", the parser should emit an error. Because errors can
				280	occur, the parser needs a way to indicate that they happened: in our parser, we
				281	return null on an error.</p>
				282
				283	<p>Another interesting aspect of this function is that it uses recursion by
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	284	calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	285	<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
				286	recursive grammars, and keeps each production very simple. Note that
				287	parenthesis do not cause construction of AST nodes themselves. While we could
				288	do this, the most important role of parens are to guide the parser and provide
				289	grouping. Once the parser constructs the AST, parens are not needed.</p>
				290
				291	<p>The next simple production is for handling variable references and function
				292	calls:</p>
				293
				294	<div class="doc_code">
				295	<pre>
				296	/// identifierexpr
				297	/// ::= identifer
				298	/// ::= identifer '(' expression* ')'
				299	static ExprAST *ParseIdentifierExpr() {
				300	std::string IdName = IdentifierStr;
				301
				302	getNextToken(); // eat identifer.
				303
				304	if (CurTok != '(') // Simple variable ref.
				305	return new VariableExprAST(IdName);
				306
				307	// Call.
				308	getNextToken(); // eat (
				309	std::vector<ExprAST*> Args;
				310	while (1) {
				311	ExprAST *Arg = ParseExpression();
				312	if (!Arg) return 0;
				313	Args.push_back(Arg);
				314
				315	if (CurTok == ')') break;
				316
				317	if (CurTok != ',')
				318	return Error("Expected ')'");
				319	getNextToken();
				320	}
				321
				322	// Eat the ')'.
				323	getNextToken();
				324
				325	return new CallExprAST(IdName, Args);
				326	}
				327	</pre>
				328	</div>
				329
				330	<p>This routine follows the same style as the other routines (it expects to be
				331	called if the current token is a <tt>tok_identifier</tt> token). It also has
				332	recursion and error handling. One interesting aspect of this is that it uses
				333	<em>look-ahead</em> to determine if the current identifier is a stand alone
				334	variable reference or if it is a function call expression. It handles this by
				335	checking to see if the token after the identifier is a '(' token, and constructs
				336	either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
				337	</p>
				338
				339	<p>Now that we have all of our simple expression parsing logic in place, we can
				340	define a helper function to wrap them up in a class. We call this class of
				341	expressions "primary" expressions, for reasons that will become more clear
				342	later. In order to parse a primary expression, we need to determine what sort
				343	of expression it is:</p>
				344
				345	<div class="doc_code">
				346	<pre>
				347	/// primary
				348	/// ::= identifierexpr
				349	/// ::= numberexpr
				350	/// ::= parenexpr
				351	static ExprAST *ParsePrimary() {
				352	switch (CurTok) {
				353	default: return Error("unknown token when expecting an expression");
				354	case tok_identifier: return ParseIdentifierExpr();
				355	case tok_number: return ParseNumberExpr();
				356	case '(': return ParseParenExpr();
				357	}
				358	}
				359	</pre>
				360	</div>
				361
				362	<p>Now that you see the definition of this function, it makes it more obvious
				363	why we can assume the state of CurTok in the various functions. This uses
				364	look-ahead to determine which sort of expression is being inspected, and parses
				365	it with a function call.</p>
				366
				367	<p>Now that basic expressions are handled, we need to handle binary expressions,
				368	which are a bit more complex.</p>
				369
				370	</div>
				371
				372	<!-- *********************************************************************** -->
				373	<div class="doc_section"><a name="parserbinops">Binary Expression
				374	Parsing</a></div>
				375	<!-- *********************************************************************** -->
				376
				377	<div class="doc_text">
				378
				379	<p>Binary expressions are significantly harder to parse because they are often
				380	ambiguous. For example, when given the string "x+y*z", the parser can choose
				381	to parse it as either "(x+y)z" or "x+(yz)". With common definitions from
				382	mathematics, we expect the later parse, because "*" (multiplication) has
				383	higher <em>precedence</em> than "+" (addition).</p>
				384
				385	<p>There are many ways to handle this, but an elegant and efficient way is to
				386	use <a href=
				387	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				388	Parsing</a>. This parsing technique uses the precedence of binary operators to
				389	guide recursion. To start with, we need a table of precedences:</p>
				390
				391	<div class="doc_code">
				392	<pre>
				393	/// BinopPrecedence - This holds the precedence for each binary operator that is
				394	/// defined.
				395	static std::map<char, int> BinopPrecedence;
				396
				397	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				398	static int GetTokPrecedence() {
				399	if (!isascii(CurTok))
				400	return -1;
				401
				402	// Make sure it's a declared binop.
				403	int TokPrec = BinopPrecedence[CurTok];
				404	if (TokPrec <= 0) return -1;
				405	return TokPrec;
				406	}
				407
				408	int main() {
				409	// Install standard binary operators.
				410	// 1 is lowest precedence.
				411	BinopPrecedence['<'] = 10;
				412	BinopPrecedence['+'] = 20;
				413	BinopPrecedence['-'] = 20;
				414	BinopPrecedence['*'] = 40; // highest.
				415	...
				416	}
				417	</pre>
				418	</div>
				419
				420	<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
				421	(this can obviously be extended by you, the reader). The
				422	<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
				423	or -1 if the token is not a binary operator. Having a map makes it easy to add
				424	new operators and makes it clear that the algorithm doesn't depend on the
				425	specific operators involved, but it would be easy enough to eliminate the map
				426	and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
				427
				428	<p>With the helper above defined, we can now start parsing binary expressions.
				429	The basic idea of operator precedence parsing is to break down an expression
				430	with potentially ambiguous binary operators into pieces. Consider for example
				431	the expression "a+b+(c+d)ef+g". Operator precedence parsing considers this
				432	as a stream of primary expressions separated by binary operators. As such,
				433	it will first parse the leading primary expression "a", then it will see the
				434	pairs [+, b] [+, (c+d)] [, e] [, f] and [+, g]. Note that because parentheses
				435	are primary expressions that the binary expression parser doesn't need to worry
				436	about nested subexpressions like (c+d) at all.
				437	</p>
				438
				439	<p>
				440	To start, an expression is a primary expression potentially followed by a
				441	sequence of [binop,primaryexpr] pairs:</p>
				442
				443	<div class="doc_code">
				444	<pre>
				445	/// expression
				446	/// ::= primary binoprhs
				447	///
				448	static ExprAST *ParseExpression() {
				449	ExprAST *LHS = ParsePrimary();
				450	if (!LHS) return 0;
				451
				452	return ParseBinOpRHS(0, LHS);
				453	}
				454	</pre>
				455	</div>
				456
				457	<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
				458	us. It takes a precedence and a pointer to an expression for the part parsed
				459	so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
				460	allowed to be empty, in which case it returns the expression that is passed into
				461	it. In our example above, the code passes the expression for "a" into
				462	<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
				463
				464	<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
				465	minimal operator precedence</em> that the function is allowed to eat. For
				466	example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
				467	passed in a precedence of 40, it will not consume any tokens (because the
				468	precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
				469	with:</p>
				470
				471	<div class="doc_code">
				472	<pre>
				473	/// binoprhs
				474	/// ::= ('+' primary)*
				475	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				476	// If this is a binop, find its precedence.
				477	while (1) {
				478	int TokPrec = GetTokPrecedence();
				479
				480	// If this is a binop that binds at least as tightly as the current binop,
				481	// consume it, otherwise we are done.
				482	if (TokPrec < ExprPrec)
				483	return LHS;
				484	</pre>
				485	</div>
				486
				487	<p>This code gets the precedence of the current token and checks to see if if is
				488	too low. Because we defined invalid tokens to have a precedence of -1, this
				489	check implicitly knows that the pair-stream ends when the token stream runs out
				490	of binary operators. If this check succeeds, we know that the token is a binary
				491	operator and that it will be included in this expression:</p>
				492
				493	<div class="doc_code">
				494	<pre>
				495	// Okay, we know this is a binop.
				496	int BinOp = CurTok;
				497	getNextToken(); // eat binop
				498
				499	// Parse the primary expression after the binary operator.
				500	ExprAST *RHS = ParsePrimary();
				501	if (!RHS) return 0;
				502	</pre>
				503	</div>
				504
				505	<p>As such, this code eats (and remembers) the binary operator and then parses
				506	the following primary expression. This builds up the whole pair, the first of
				507	which is [+, b] for the running example.</p>
				508
				509	<p>Now that we parsed the left-hand side of an expression and one pair of the
				510	RHS sequence, we have to decide which way the expression associates. In
				511	particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
				512	To determine this, we look ahead at "binop" to determine its precedence and
				513	compare it to BinOp's precedence (which is '+' in this case):</p>
				514
				515	<div class="doc_code">
				516	<pre>
				517	// If BinOp binds less tightly with RHS than the operator after RHS, let
				518	// the pending operator take RHS as its LHS.
				519	int NextPrec = GetTokPrecedence();
				520	if (TokPrec < NextPrec) {
				521	</pre>
				522	</div>
				523
				524	<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
				525	precedence of our current operator, then we know that the parentheses associate
				526	as "(a+b) binop ...". In our example, since the next operator is "+" and so is
				527	our current one, we know that they have the same precedence. In this case we'll
				528	create the AST node for "a+b", and then continue parsing:</p>
				529
				530	<div class="doc_code">
				531	<pre>
				532	... if body omitted ...
				533	}
				534
				535	// Merge LHS/RHS.
				536	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				537	} // loop around to the top of the while loop.
				538	}
				539	</pre>
				540	</div>
				541
				542	<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
				543	iteration of the loop, with "+" as the current token. The code above will eat
				544	and remember it and parse "(c+d)" as the primary expression, which makes the
				545	current pair be [+, (c+d)]. It will then enter the 'if' above with "*" as the
				546	binop to the right of the primary. In this case, the precedence of "*" is
				547	higher than the precedence of "+" so the if condition will be entered.</p>
				548
				549	<p>The critical question left here is "how can the if condition parse the right
				550	hand side in full"? In particular, to build the AST correctly for our example,
				551	it needs to get all of "(c+d)ef" as the RHS expression variable. The code to
				552	do this is surprisingly simple (code from the above two blocks duplicated for
				553	context):</p>
				554
				555	<div class="doc_code">
				556	<pre>
				557	// If BinOp binds less tightly with RHS than the operator after RHS, let
				558	// the pending operator take RHS as its LHS.
				559	int NextPrec = GetTokPrecedence();
				560	if (TokPrec < NextPrec) {
				561	RHS = ParseBinOpRHS(TokPrec+1, RHS);
				562	if (RHS == 0) return 0;
				563	}
				564	// Merge LHS/RHS.
				565	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				566	} // loop around to the top of the while loop.
				567	}
				568	</pre>
				569	</div>
				570
				571	<p>At this point, we know that the binary operator to the RHS of our primary
				572	has higher precedence than the binop we are currently parsing. As such, we know
				573	that any sequence of pairs whose operators are all higher precedence than "+"
				574	should be parsed together and returned as "RHS". To do this, we recursively
				575	invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
				576	precedence required for it to continue. In our example above, this will cause
				577	it to return the AST node for "(c+d)ef" as RHS, which is then set as the RHS
				578	of the '+' expression.</p>
				579
				580	<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
				581	and added to the AST. With this little bit of code (14 non-trivial lines), we
				582	correctly handle fully general binary expression parsing in a very elegant way.
				583	</p>
				584
				585	<p>This wraps up handling of expressions. At this point, we can point the
				586	parser at an arbitrary token stream and build an expression from them, stopping
				587	at the first token that is not part of the expression. Next up we need to
				588	handle function definitions etc.</p>
				589
				590	</div>
				591
				592	<!-- *********************************************************************** -->
				593	<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
				594	<!-- *********************************************************************** -->
				595
				596	<div class="doc_text">
				597
				598	<p>
				599	The first basic thing missing is that of function prototypes. In Kaleidoscope,
				600	these are used both for 'extern' function declarations as well as function body
				601	definitions. The code to do this is straight-forward and not very interesting
				602	(once you've survived expressions):
				603	</p>
				604
				605	<div class="doc_code">
				606	<pre>
				607	/// prototype
				608	/// ::= id '(' id* ')'
				609	static PrototypeAST *ParsePrototype() {
				610	if (CurTok != tok_identifier)
				611	return ErrorP("Expected function name in prototype");
				612
				613	std::string FnName = IdentifierStr;
				614	getNextToken();
				615
				616	if (CurTok != '(')
				617	return ErrorP("Expected '(' in prototype");
				618
				619	std::vector<std::string> ArgNames;
				620	while (getNextToken() == tok_identifier)
				621	ArgNames.push_back(IdentifierStr);
				622	if (CurTok != ')')
				623	return ErrorP("Expected ')' in prototype");
				624
				625	// success.
				626	getNextToken(); // eat ')'.
				627
				628	return new PrototypeAST(FnName, ArgNames);
				629	}
				630	</pre>
				631	</div>
				632
				633	<p>Given this, a function definition is very simple, just a prototype plus
				634	and expression to implement the body:</p>
				635
				636	<div class="doc_code">
				637	<pre>
				638	/// definition ::= 'def' prototype expression
				639	static FunctionAST *ParseDefinition() {
				640	getNextToken(); // eat def.
				641	PrototypeAST *Proto = ParsePrototype();
				642	if (Proto == 0) return 0;
				643
				644	if (ExprAST *E = ParseExpression())
				645	return new FunctionAST(Proto, E);
				646	return 0;
				647	}
				648	</pre>
				649	</div>
				650
				651	<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
				652	well as to support forward declaration of user functions. 'externs' are just
				653	prototypes with no body:</p>
				654
				655	<div class="doc_code">
				656	<pre>
				657	/// external ::= 'extern' prototype
				658	static PrototypeAST *ParseExtern() {
				659	getNextToken(); // eat extern.
				660	return ParsePrototype();
				661	}
				662	</pre>
				663	</div>
				664
				665	<p>Finally, we'll also let the user type in arbitrary top-level expressions and
				666	evaluate them on the fly. We will handle this by defining anonymous nullary
				667	(zero argument) functions for them:</p>
				668
				669	<div class="doc_code">
				670	<pre>
				671	/// toplevelexpr ::= expression
				672	static FunctionAST *ParseTopLevelExpr() {
				673	if (ExprAST *E = ParseExpression()) {
				674	// Make an anonymous proto.
				675	PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
				676	return new FunctionAST(Proto, E);
				677	}
				678	return 0;
				679	}
				680	</pre>
				681	</div>
				682
				683	<p>Now that we have all the pieces, lets build a little driver that will let us
				684	actually <em>execute</em> this code we've built!</p>
				685
				686	</div>
				687
				688	<!-- *********************************************************************** -->
				689	<div class="doc_section"><a name="driver">The Driver</a></div>
				690	<!-- *********************************************************************** -->
				691
				692	<div class="doc_text">
				693
				694	<p>The driver for this simply invokes all of the parsing pieces with a top-level
				695	dispatch loop. There isn't much interesting here, so I'll just include the
				696	top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
				697	Parsing" section.</p>
				698
				699	<div class="doc_code">
				700	<pre>
				701	/// top ::= definition \| external \| expression \| ';'
				702	static void MainLoop() {
				703	while (1) {
				704	fprintf(stderr, "ready> ");
				705	switch (CurTok) {
				706	case tok_eof: return;
				707	case ';': getNextToken(); break; // ignore top level semicolons.
				708	case tok_def: HandleDefinition(); break;
				709	case tok_extern: HandleExtern(); break;
				710	default: HandleTopLevelExpression(); break;
				711	}
				712	}
				713	}
				714	</pre>
				715	</div>
				716
				717	<p>The most interesting part of this is that we ignore top-level semi colons.
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	718	Why is this, you ask? The basic reason is that if you type "4 + 5" at the
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	719	command line, the parser doesn't know that that is the end of what you will
				720	type. For example, on the next line you could type "def foo..." in which case
				721	4+5 is the end of a top-level expression. Alternatively you could type "* 6",
				722	which would continue the expression. Having top-level semicolons allows you to
				723	type "4+5;" and the parser will know you are done.</p>
				724
				725	</div>
				726
				727	<!-- *********************************************************************** -->
				728	<div class="doc_section"><a name="code">Conclusions and the Full Code</a></div>
				729	<!-- *********************************************************************** -->
				730
				731	<div class="doc_text">
				732
				733	<p>With just under 400 lines of commented code, we fully defined our minimal
				734	language, including a lexer, parser and AST builder. With this done, the
				735	executable will validate code and tell us if it is gramatically invalid. For
				736	example, here is a sample interaction:</p>
				737
				738	<div class="doc_code">
				739	<pre>
				740	$ ./a.out
Chris Lattner	35abbf5	2007-10-23 06:23:57 +0000	[diff] [blame]	741	ready> def foo(x y) x+foo(y, 4.0);
				742	ready> Parsed an function definition.
				743	ready> def foo(x y) x+y y;
				744	ready> Parsed an function definition.
				745	ready> Parsed a top-level expr
				746	ready> def foo(x y) x+y );
				747	ready> Parsed an function definition.
				748	ready> Error: unknown token when expecting an expression
				749	ready> extern sin(a);
				750	ready> Parsed an extern
				751	ready> ^D
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	752	$
				753	</pre>
				754	</div>
				755
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	756	<p>There is a lot of room for extension here. You can define new AST nodes,
				757	extend the language in many ways, etc. In the <a href="LangImpl3.html">next
				758	installment</a>, we will describe how to generate LLVM IR from the AST.</p>
				759
				760	</div>
				761
				762	<!-- *********************************************************************** -->
				763	<div class="doc_section"><a name="code">Full Code Listing</a></div>
				764	<!-- *********************************************************************** -->
				765
				766	<div class="doc_text">
				767
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	768	<p>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	769	Here is the complete code listing for this and the previous chapter.
				770	Note that it is fully self-contained: you don't need LLVM or any external
				771	libraries at all for this (other than the C and C++ standard libraries of
				772	course). To build this, just compile with:</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	773
				774	<div class="doc_code">
				775	<pre>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	776	# Compile
				777	g++ -g toy.cpp
				778	# Run
				779	./a.out
				780	</pre>
				781	</div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	782
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	783	<p>Here is the code:</p>
				784
				785	<div class="doc_code">
				786	<pre>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	787	#include <cstdio>
				788	#include <string>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	789	#include <map>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	790	#include <vector>
				791
				792	//===----------------------------------------------------------------------===//
				793	// Lexer
				794	//===----------------------------------------------------------------------===//
				795
				796	// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
				797	// of these for known things.
				798	enum Token {
				799	tok_eof = -1,
				800
				801	// commands
				802	tok_def = -2, tok_extern = -3,
				803
				804	// primary
				805	tok_identifier = -4, tok_number = -5,
				806	};
				807
				808	static std::string IdentifierStr; // Filled in if tok_identifier
				809	static double NumVal; // Filled in if tok_number
				810
				811	/// gettok - Return the next token from standard input.
				812	static int gettok() {
				813	static int LastChar = ' ';
				814
				815	// Skip any whitespace.
				816	while (isspace(LastChar))
				817	LastChar = getchar();
				818
				819	if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
				820	IdentifierStr = LastChar;
				821	while (isalnum((LastChar = getchar())))
				822	IdentifierStr += LastChar;
				823
				824	if (IdentifierStr == "def") return tok_def;
				825	if (IdentifierStr == "extern") return tok_extern;
				826	return tok_identifier;
				827	}
				828
				829	if (isdigit(LastChar) \|\| LastChar == '.') { // Number: [0-9.]+
				830	std::string NumStr;
				831	do {
				832	NumStr += LastChar;
				833	LastChar = getchar();
				834	} while (isdigit(LastChar) \|\| LastChar == '.');
				835
				836	NumVal = strtod(NumStr.c_str(), 0);
				837	return tok_number;
				838	}
				839
				840	if (LastChar == '#') {
				841	// Comment until end of line.
				842	do LastChar = getchar();
				843	while (LastChar != EOF && LastChar != '\n' & LastChar != '\r');
				844
				845	if (LastChar != EOF)
				846	return gettok();
				847	}
				848
				849	// Check for end of file. Don't eat the EOF.
				850	if (LastChar == EOF)
				851	return tok_eof;
				852
				853	// Otherwise, just return the character as its ascii value.
				854	int ThisChar = LastChar;
				855	LastChar = getchar();
				856	return ThisChar;
				857	}
				858
				859	//===----------------------------------------------------------------------===//
				860	// Abstract Syntax Tree (aka Parse Tree)
				861	//===----------------------------------------------------------------------===//
				862
				863	/// ExprAST - Base class for all expression nodes.
				864	class ExprAST {
				865	public:
				866	virtual ~ExprAST() {}
				867	};
				868
				869	/// NumberExprAST - Expression class for numeric literals like "1.0".
				870	class NumberExprAST : public ExprAST {
				871	double Val;
				872	public:
Chris Lattner	28571ed	2007-10-23 04:27:44 +0000	[diff] [blame]	873	explicit NumberExprAST(double val) : Val(val) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	874	};
				875
				876	/// VariableExprAST - Expression class for referencing a variable, like "a".
				877	class VariableExprAST : public ExprAST {
				878	std::string Name;
				879	public:
Chris Lattner	28571ed	2007-10-23 04:27:44 +0000	[diff] [blame]	880	explicit VariableExprAST(const std::string &name) : Name(name) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	881	};
				882
				883	/// BinaryExprAST - Expression class for a binary operator.
				884	class BinaryExprAST : public ExprAST {
				885	char Op;
				886	ExprAST LHS, RHS;
				887	public:
				888	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				889	: Op(op), LHS(lhs), RHS(rhs) {}
				890	};
				891
				892	/// CallExprAST - Expression class for function calls.
				893	class CallExprAST : public ExprAST {
				894	std::string Callee;
				895	std::vector<ExprAST*> Args;
				896	public:
				897	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				898	: Callee(callee), Args(args) {}
				899	};
				900
				901	/// PrototypeAST - This class represents the "prototype" for a function,
				902	/// which captures its argument names as well as if it is an operator.
				903	class PrototypeAST {
				904	std::string Name;
				905	std::vector< Args;
				906	public:
				907	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				908	: Name(name), Args(args) {}
				909
				910	};
				911
				912	/// FunctionAST - This class represents a function definition itself.
				913	class FunctionAST {
				914	PrototypeAST *Proto;
				915	ExprAST *Body;
				916	public:
				917	FunctionAST(PrototypeAST proto, ExprAST body)
				918	: Proto(proto), Body(body) {}
				919
				920	};
				921
				922	//===----------------------------------------------------------------------===//
				923	// Parser
				924	//===----------------------------------------------------------------------===//
				925
				926	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
				927	/// token the parser it looking at. getNextToken reads another token from the
				928	/// lexer and updates CurTok with its results.
				929	static int CurTok;
				930	static int getNextToken() {
				931	return CurTok = gettok();
				932	}
				933
				934	/// BinopPrecedence - This holds the precedence for each binary operator that is
				935	/// defined.
				936	static std::map<char, int> BinopPrecedence;
				937
				938	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				939	static int GetTokPrecedence() {
				940	if (!isascii(CurTok))
				941	return -1;
				942
				943	// Make sure it's a declared binop.
				944	int TokPrec = BinopPrecedence[CurTok];
				945	if (TokPrec <= 0) return -1;
				946	return TokPrec;
				947	}
				948
				949	/// Error* - These are little helper functions for error handling.
				950	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				951	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				952	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				953
				954	static ExprAST *ParseExpression();
				955
				956	/// identifierexpr
				957	/// ::= identifer
				958	/// ::= identifer '(' expression* ')'
				959	static ExprAST *ParseIdentifierExpr() {
				960	std::string IdName = IdentifierStr;
				961
				962	getNextToken(); // eat identifer.
				963
				964	if (CurTok != '(') // Simple variable ref.
				965	return new VariableExprAST(IdName);
				966
				967	// Call.
				968	getNextToken(); // eat (
				969	std::vector<ExprAST*> Args;
				970	while (1) {
				971	ExprAST *Arg = ParseExpression();
				972	if (!Arg) return 0;
				973	Args.push_back(Arg);
				974
				975	if (CurTok == ')') break;
				976
				977	if (CurTok != ',')
				978	return Error("Expected ')'");
				979	getNextToken();
				980	}
				981
				982	// Eat the ')'.
				983	getNextToken();
				984
				985	return new CallExprAST(IdName, Args);
				986	}
				987
				988	/// numberexpr ::= number
				989	static ExprAST *ParseNumberExpr() {
				990	ExprAST *Result = new NumberExprAST(NumVal);
				991	getNextToken(); // consume the number
				992	return Result;
				993	}
				994
				995	/// parenexpr ::= '(' expression ')'
				996	static ExprAST *ParseParenExpr() {
				997	getNextToken(); // eat (.
				998	ExprAST *V = ParseExpression();
				999	if (!V) return 0;
				1000
				1001	if (CurTok != ')')
				1002	return Error("expected ')'");
				1003	getNextToken(); // eat ).
				1004	return V;
				1005	}
				1006
				1007	/// primary
				1008	/// ::= identifierexpr
				1009	/// ::= numberexpr
				1010	/// ::= parenexpr
				1011	static ExprAST *ParsePrimary() {
				1012	switch (CurTok) {
				1013	default: return Error("unknown token when expecting an expression");
				1014	case tok_identifier: return ParseIdentifierExpr();
				1015	case tok_number: return ParseNumberExpr();
				1016	case '(': return ParseParenExpr();
				1017	}
				1018	}
				1019
				1020	/// binoprhs
				1021	/// ::= ('+' primary)*
				1022	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				1023	// If this is a binop, find its precedence.
				1024	while (1) {
				1025	int TokPrec = GetTokPrecedence();
				1026
				1027	// If this is a binop that binds at least as tightly as the current binop,
				1028	// consume it, otherwise we are done.
				1029	if (TokPrec < ExprPrec)
				1030	return LHS;
				1031
				1032	// Okay, we know this is a binop.
				1033	int BinOp = CurTok;
				1034	getNextToken(); // eat binop
				1035
				1036	// Parse the primary expression after the binary operator.
				1037	ExprAST *RHS = ParsePrimary();
				1038	if (!RHS) return 0;
				1039
				1040	// If BinOp binds less tightly with RHS than the operator after RHS, let
				1041	// the pending operator take RHS as its LHS.
				1042	int NextPrec = GetTokPrecedence();
				1043	if (TokPrec < NextPrec) {
				1044	RHS = ParseBinOpRHS(TokPrec+1, RHS);
				1045	if (RHS == 0) return 0;
				1046	}
				1047
				1048	// Merge LHS/RHS.
				1049	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				1050	}
				1051	}
				1052
				1053	/// expression
				1054	/// ::= primary binoprhs
				1055	///
				1056	static ExprAST *ParseExpression() {
				1057	ExprAST *LHS = ParsePrimary();
				1058	if (!LHS) return 0;
				1059
				1060	return ParseBinOpRHS(0, LHS);
				1061	}
				1062
				1063	/// prototype
				1064	/// ::= id '(' id* ')'
				1065	static PrototypeAST *ParsePrototype() {
				1066	if (CurTok != tok_identifier)
				1067	return ErrorP("Expected function name in prototype");
				1068
				1069	std::string FnName = IdentifierStr;
				1070	getNextToken();
				1071
				1072	if (CurTok != '(')
				1073	return ErrorP("Expected '(' in prototype");
				1074
				1075	std::vector<std::string> ArgNames;
				1076	while (getNextToken() == tok_identifier)
				1077	ArgNames.push_back(IdentifierStr);
				1078	if (CurTok != ')')
				1079	return ErrorP("Expected ')' in prototype");
				1080
				1081	// success.
				1082	getNextToken(); // eat ')'.
				1083
				1084	return new PrototypeAST(FnName, ArgNames);
				1085	}
				1086
				1087	/// definition ::= 'def' prototype expression
				1088	static FunctionAST *ParseDefinition() {
				1089	getNextToken(); // eat def.
				1090	PrototypeAST *Proto = ParsePrototype();
				1091	if (Proto == 0) return 0;
				1092
				1093	if (ExprAST *E = ParseExpression())
				1094	return new FunctionAST(Proto, E);
				1095	return 0;
				1096	}
				1097
				1098	/// toplevelexpr ::= expression
				1099	static FunctionAST *ParseTopLevelExpr() {
				1100	if (ExprAST *E = ParseExpression()) {
				1101	// Make an anonymous proto.
				1102	PrototypeAST *Proto = new PrototypeAST("", std::vector<());
				1103	return new FunctionAST(Proto, E);
				1104	}
				1105	return 0;
				1106	}
				1107
				1108	/// external ::= 'extern' prototype
				1109	static PrototypeAST *ParseExtern() {
				1110	getNextToken(); // eat extern.
				1111	return ParsePrototype();
				1112	}
				1113
				1114	//===----------------------------------------------------------------------===//
				1115	// Top-Level parsing
				1116	//===----------------------------------------------------------------------===//
				1117
				1118	static void HandleDefinition() {
				1119	if (FunctionAST *F = ParseDefinition()) {
				1120	fprintf(stderr, "Parsed a function definition.\n");
				1121	} else {
				1122	// Skip token for error recovery.
				1123	getNextToken();
				1124	}
				1125	}
				1126
				1127	static void HandleExtern() {
				1128	if (PrototypeAST *P = ParseExtern()) {
				1129	fprintf(stderr, "Parsed an extern\n");
				1130	} else {
				1131	// Skip token for error recovery.
				1132	getNextToken();
				1133	}
				1134	}
				1135
				1136	static void HandleTopLevelExpression() {
				1137	// Evaluate a top level expression into an anonymous function.
				1138	if (FunctionAST *F = ParseTopLevelExpr()) {
				1139	fprintf(stderr, "Parsed a top-level expr\n");
				1140	} else {
				1141	// Skip token for error recovery.
				1142	getNextToken();
				1143	}
				1144	}
				1145
				1146	/// top ::= definition \| external \| expression \| ';'
				1147	static void MainLoop() {
				1148	while (1) {
				1149	fprintf(stderr, "ready> ");
				1150	switch (CurTok) {
				1151	case tok_eof: return;
				1152	case ';': getNextToken(); break; // ignore top level semicolons.
				1153	case tok_def: HandleDefinition(); break;
				1154	case tok_extern: HandleExtern(); break;
				1155	default: HandleTopLevelExpression(); break;
				1156	}
				1157	}
				1158	}
				1159
				1160	//===----------------------------------------------------------------------===//
				1161	// Main driver code.
				1162	//===----------------------------------------------------------------------===//
				1163
				1164	int main() {
				1165	// Install standard binary operators.
				1166	// 1 is lowest precedence.
				1167	BinopPrecedence['<'] = 10;
				1168	BinopPrecedence['+'] = 20;
				1169	BinopPrecedence['-'] = 20;
				1170	BinopPrecedence['*'] = 40; // highest.
				1171
				1172	// Prime the first token.
				1173	fprintf(stderr, "ready> ");
				1174	getNextToken();
				1175
				1176	MainLoop();
				1177	return 0;
				1178	}
				1179	</pre>
				1180	</div>
				1181	</div>
				1182
				1183	<!-- *********************************************************************** -->
				1184	<hr>
				1185	<address>
				1186	<a href="http://jigsaw.w3.org/css-validator/check/referer"><img
				1187	src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
				1188	<a href="http://validator.w3.org/check/referer"><img
Chris Lattner	8eef4b2	2007-10-23 06:30:50 +0000	[diff] [blame]	1189	src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1190
				1191	<a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
				1192	<a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
				1193	Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
				1194	</address>
				1195	</body>
				1196	</html>