Blame - docs/tutorial/LangImpl2.html - fp2-dev/platform/external/llvm

blob: 9c13b486fa84a6e4bc4e2c90e07712f99ab7ef78 [file] [log] [blame]

Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
				2	"http://www.w3.org/TR/html4/strict.dtd">
				3
				4	<html>
				5	<head>
				6	<title>Kaleidoscope: Implementing a Parser and AST</title>
				7	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
				8	<meta name="author" content="Chris Lattner">
				9	<link rel="stylesheet" href="../llvm.css" type="text/css">
				10	</head>
				11
				12	<body>
				13
				14	<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
				15
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	16	<ul>
Chris Lattner	0e555b1	2007-11-05 20:04:56 +0000	[diff] [blame]	17	<li><a href="index.html">Up to Tutorial Index</a></li>
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	18	<li>Chapter 2
				19	<ol>
				20	<li><a href="#intro">Chapter 2 Introduction</a></li>
				21	<li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
				22	<li><a href="#parserbasics">Parser Basics</a></li>
				23	<li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
				24	<li><a href="#parserbinops">Binary Expression Parsing</a></li>
				25	<li><a href="#parsertop">Parsing the Rest</a></li>
				26	<li><a href="#driver">The Driver</a></li>
				27	<li><a href="#conclusions">Conclusions</a></li>
				28	<li><a href="#code">Full Code Listing</a></li>
				29	</ol>
				30	</li>
Chris Lattner	0e555b1	2007-11-05 20:04:56 +0000	[diff] [blame]	31	<li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	32	</ul>
				33
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	34	<div class="doc_author">
				35	<p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
				36	</div>
				37
				38	<!-- *********************************************************************** -->
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	39	<div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	40	<!-- *********************************************************************** -->
				41
				42	<div class="doc_text">
				43
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	44	<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	45	with LLVM</a>" tutorial. This chapter shows you how to use the lexer, built in
				46	<a href="LangImpl1.html">Chapter 1</a>, to build a full <a
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	47	href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	48	our Kaleidoscope language. Once we have a parser, we'll define and build an <a
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	49	href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
				50	Tree</a> (AST).</p>
				51
				52	<p>The parser we will build uses a combination of <a
				53	href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
				54	Parsing</a> and <a href=
				55	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	56	Parsing</a> to parse the Kaleidoscope language (the latter for
				57	binary expressions and the former for everything else). Before we get to
				58	parsing though, lets talk about the output of the parser: the Abstract Syntax
				59	Tree.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	60
				61	</div>
				62
				63	<!-- *********************************************************************** -->
				64	<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
				65	<!-- *********************************************************************** -->
				66
				67	<div class="doc_text">
				68
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	69	<p>The AST for a program captures its behavior in such a way that it is easy for
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	70	later stages of the compiler (e.g. code generation) to interpret. We basically
				71	want one object for each construct in the language, and the AST should closely
				72	model the language. In Kaleidoscope, we have expressions, a prototype, and a
				73	function object. We'll start with expressions first:</p>
				74
				75	<div class="doc_code">
				76	<pre>
				77	/// ExprAST - Base class for all expression nodes.
				78	class ExprAST {
				79	public:
				80	virtual ~ExprAST() {}
				81	};
				82
				83	/// NumberExprAST - Expression class for numeric literals like "1.0".
				84	class NumberExprAST : public ExprAST {
				85	double Val;
				86	public:
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	87	NumberExprAST(double val) : Val(val) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	88	};
				89	</pre>
				90	</div>
				91
				92	<p>The code above shows the definition of the base ExprAST class and one
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	93	subclass which we use for numeric literals. The important thing to note about
				94	this code is that the NumberExprAST class captures the numeric value of the
				95	literal as an instance variable. This allows later phases of the compiler to
				96	know what the stored numeric value is.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	97
				98	<p>Right now we only create the AST, so there are no useful accessor methods on
				99	them. It would be very easy to add a virtual method to pretty print the code,
				100	for example. Here are the other expression AST node definitions that we'll use
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	101	in the basic form of the Kaleidoscope language:
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	102	</p>
				103
				104	<div class="doc_code">
				105	<pre>
				106	/// VariableExprAST - Expression class for referencing a variable, like "a".
				107	class VariableExprAST : public ExprAST {
				108	std::string Name;
				109	public:
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	110	VariableExprAST(const std::string &name) : Name(name) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	111	};
				112
				113	/// BinaryExprAST - Expression class for a binary operator.
				114	class BinaryExprAST : public ExprAST {
				115	char Op;
				116	ExprAST LHS, RHS;
				117	public:
				118	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				119	: Op(op), LHS(lhs), RHS(rhs) {}
				120	};
				121
				122	/// CallExprAST - Expression class for function calls.
				123	class CallExprAST : public ExprAST {
				124	std::string Callee;
				125	std::vector<ExprAST*> Args;
				126	public:
				127	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				128	: Callee(callee), Args(args) {}
				129	};
				130	</pre>
				131	</div>
				132
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	133	<p>This is all (intentionally) rather straight-forward: variables capture the
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	134	variable name, binary operators capture their opcode (e.g. '+'), and calls
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	135	capture a function name as well as a list of any argument expressions. One thing
				136	that is nice about our AST is that it captures the language features without
				137	talking about the syntax of the language. Note that there is no discussion about
				138	precedence of binary operators, lexical structure, etc.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	139
				140	<p>For our basic language, these are all of the expression nodes we'll define.
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	141	Because it doesn't have conditional control flow, it isn't Turing-complete;
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	142	we'll fix that in a later installment. The two things we need next are a way
				143	to talk about the interface to a function, and a way to talk about functions
				144	themselves:</p>
				145
				146	<div class="doc_code">
				147	<pre>
				148	/// PrototypeAST - This class represents the "prototype" for a function,
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	149	/// which captures its name, and its argument names (thus implicitly the number
				150	/// of arguments the function takes).
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	151	class PrototypeAST {
				152	std::string Name;
				153	std::vector<std::string> Args;
				154	public:
				155	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				156	: Name(name), Args(args) {}
				157	};
				158
				159	/// FunctionAST - This class represents a function definition itself.
				160	class FunctionAST {
				161	PrototypeAST *Proto;
				162	ExprAST *Body;
				163	public:
				164	FunctionAST(PrototypeAST proto, ExprAST body)
				165	: Proto(proto), Body(body) {}
				166	};
				167	</pre>
				168	</div>
				169
				170	<p>In Kaleidoscope, functions are typed with just a count of their arguments.
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	171	Since all values are double precision floating point, the type of each argument
				172	doesn't need to be stored anywhere. In a more aggressive and realistic
				173	language, the "ExprAST" class would probably have a type field.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	174
				175	<p>With this scaffolding, we can now talk about parsing expressions and function
				176	bodies in Kaleidoscope.</p>
				177
				178	</div>
				179
				180	<!-- *********************************************************************** -->
				181	<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
				182	<!-- *********************************************************************** -->
				183
				184	<div class="doc_text">
				185
				186	<p>Now that we have an AST to build, we need to define the parser code to build
				187	it. The idea here is that we want to parse something like "x+y" (which is
				188	returned as three tokens by the lexer) into an AST that could be generated with
				189	calls like this:</p>
				190
				191	<div class="doc_code">
				192	<pre>
				193	ExprAST *X = new VariableExprAST("x");
				194	ExprAST *Y = new VariableExprAST("y");
				195	ExprAST *Result = new BinaryExprAST('+', X, Y);
				196	</pre>
				197	</div>
				198
				199	<p>In order to do this, we'll start by defining some basic helper routines:</p>
				200
				201	<div class="doc_code">
				202	<pre>
				203	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	204	/// token the parser is looking at. getNextToken reads another token from the
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	205	/// lexer and updates CurTok with its results.
				206	static int CurTok;
				207	static int getNextToken() {
				208	return CurTok = gettok();
				209	}
				210	</pre>
				211	</div>
				212
				213	<p>
				214	This implements a simple token buffer around the lexer. This allows
				215	us to look one token ahead at what the lexer is returning. Every function in
Chris Lattner	e949512	2007-10-25 18:05:29 +0000	[diff] [blame]	216	our parser will assume that CurTok is the current token that needs to be
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	217	parsed.</p>
				218
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	219	<div class="doc_code">
				220	<pre>
				221
				222	/// Error* - These are little helper functions for error handling.
				223	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				224	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				225	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				226	</pre>
				227	</div>
				228
				229	<p>
				230	The <tt>Error</tt> routines are simple helper routines that our parser will use
				231	to handle errors. The error recovery in our parser will not be the best and
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	232	is not particular user-friendly, but it will be enough for our tutorial. These
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	233	routines make it easier to handle errors in routines that have various return
				234	types: they always return null.</p>
				235
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	236	<p>With these basic helper functions, we can implement the first
				237	piece of our grammar: numeric literals.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	238
				239	</div>
				240
				241	<!-- *********************************************************************** -->
				242	<div class="doc_section"><a name="parserprimexprs">Basic Expression
				243	Parsing</a></div>
				244	<!-- *********************************************************************** -->
				245
				246	<div class="doc_text">
				247
				248	<p>We start with numeric literals, because they are the simplest to process.
				249	For each production in our grammar, we'll define a function which parses that
				250	production. For numeric literals, we have:
				251	</p>
				252
				253	<div class="doc_code">
				254	<pre>
				255	/// numberexpr ::= number
				256	static ExprAST *ParseNumberExpr() {
				257	ExprAST *Result = new NumberExprAST(NumVal);
				258	getNextToken(); // consume the number
				259	return Result;
				260	}
				261	</pre>
				262	</div>
				263
				264	<p>This routine is very simple: it expects to be called when the current token
				265	is a <tt>tok_number</tt> token. It takes the current number value, creates
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	266	a <tt>NumberExprAST</tt> node, advances the lexer to the next token, and finally
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	267	returns.</p>
				268
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	269	<p>There are some interesting aspects to this. The most important one is that
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	270	this routine eats all of the tokens that correspond to the production and
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	271	returns the lexer buffer with the next token (which is not part of the grammar
				272	production) ready to go. This is a fairly standard way to go for recursive
				273	descent parsers. For a better example, the parenthesis operator is defined like
				274	this:</p>
				275
				276	<div class="doc_code">
				277	<pre>
				278	/// parenexpr ::= '(' expression ')'
				279	static ExprAST *ParseParenExpr() {
				280	getNextToken(); // eat (.
				281	ExprAST *V = ParseExpression();
				282	if (!V) return 0;
				283
				284	if (CurTok != ')')
				285	return Error("expected ')'");
				286	getNextToken(); // eat ).
				287	return V;
				288	}
				289	</pre>
				290	</div>
				291
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	292	<p>This function illustrates a number of interesting things about the
				293	parser:</p>
				294
				295	<p>
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	296	1) It shows how we use the Error routines. When called, this function expects
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	297	that the current token is a '(' token, but after parsing the subexpression, it
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	298	is possible that there is no ')' waiting. For example, if the user types in
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	299	"(4 x" instead of "(4)", the parser should emit an error. Because errors can
				300	occur, the parser needs a way to indicate that they happened: in our parser, we
				301	return null on an error.</p>
				302
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	303	<p>2) Another interesting aspect of this function is that it uses recursion by
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	304	calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	305	<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
				306	recursive grammars, and keeps each production very simple. Note that
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	307	parentheses do not cause construction of AST nodes themselves. While we could
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	308	do it this way, the most important role of parentheses are to guide the parser
				309	and provide grouping. Once the parser constructs the AST, parentheses are not
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	310	needed.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	311
				312	<p>The next simple production is for handling variable references and function
				313	calls:</p>
				314
				315	<div class="doc_code">
				316	<pre>
				317	/// identifierexpr
Chris Lattner	20a0c80	2007-11-05 17:54:34 +0000	[diff] [blame]	318	/// ::= identifier
				319	/// ::= identifier '(' expression* ')'
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	320	static ExprAST *ParseIdentifierExpr() {
				321	std::string IdName = IdentifierStr;
				322
Chris Lattner	20a0c80	2007-11-05 17:54:34 +0000	[diff] [blame]	323	getNextToken(); // eat identifier.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	324
				325	if (CurTok != '(') // Simple variable ref.
				326	return new VariableExprAST(IdName);
				327
				328	// Call.
				329	getNextToken(); // eat (
				330	std::vector<ExprAST*> Args;
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	331	if (CurTok != ')') {
				332	while (1) {
				333	ExprAST *Arg = ParseExpression();
				334	if (!Arg) return 0;
				335	Args.push_back(Arg);
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	336
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	337	if (CurTok == ')') break;
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	338
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	339	if (CurTok != ',')
Chris Lattner	6c4be9c	2008-04-14 16:44:41 +0000	[diff] [blame]	340	return Error("Expected ')' or ',' in argument list");
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	341	getNextToken();
				342	}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	343	}
				344
				345	// Eat the ')'.
				346	getNextToken();
				347
				348	return new CallExprAST(IdName, Args);
				349	}
				350	</pre>
				351	</div>
				352
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	353	<p>This routine follows the same style as the other routines. (It expects to be
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	354	called if the current token is a <tt>tok_identifier</tt> token). It also has
				355	recursion and error handling. One interesting aspect of this is that it uses
				356	<em>look-ahead</em> to determine if the current identifier is a stand alone
				357	variable reference or if it is a function call expression. It handles this by
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	358	checking to see if the token after the identifier is a '(' token, constructing
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	359	either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
				360	</p>
				361
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	362	<p>Now that we have all of our simple expression-parsing logic in place, we can
				363	define a helper function to wrap it together into one entry point. We call this
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	364	class of expressions "primary" expressions, for reasons that will become more
				365	clear <a href="LangImpl6.html#unary">later in the tutorial</a>. In order to
				366	parse an arbitrary primary expression, we need to determine what sort of
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	367	expression it is:</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	368
				369	<div class="doc_code">
				370	<pre>
				371	/// primary
				372	/// ::= identifierexpr
				373	/// ::= numberexpr
				374	/// ::= parenexpr
				375	static ExprAST *ParsePrimary() {
				376	switch (CurTok) {
				377	default: return Error("unknown token when expecting an expression");
				378	case tok_identifier: return ParseIdentifierExpr();
				379	case tok_number: return ParseNumberExpr();
				380	case '(': return ParseParenExpr();
				381	}
				382	}
				383	</pre>
				384	</div>
				385
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	386	<p>Now that you see the definition of this function, it is more obvious why we
				387	can assume the state of CurTok in the various functions. This uses look-ahead
				388	to determine which sort of expression is being inspected, and then parses it
				389	with a function call.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	390
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	391	<p>Now that basic expressions are handled, we need to handle binary expressions.
				392	They are a bit more complex.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	393
				394	</div>
				395
				396	<!-- *********************************************************************** -->
				397	<div class="doc_section"><a name="parserbinops">Binary Expression
				398	Parsing</a></div>
				399	<!-- *********************************************************************** -->
				400
				401	<div class="doc_text">
				402
				403	<p>Binary expressions are significantly harder to parse because they are often
				404	ambiguous. For example, when given the string "x+y*z", the parser can choose
				405	to parse it as either "(x+y)z" or "x+(yz)". With common definitions from
				406	mathematics, we expect the later parse, because "*" (multiplication) has
				407	higher <em>precedence</em> than "+" (addition).</p>
				408
				409	<p>There are many ways to handle this, but an elegant and efficient way is to
				410	use <a href=
				411	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				412	Parsing</a>. This parsing technique uses the precedence of binary operators to
				413	guide recursion. To start with, we need a table of precedences:</p>
				414
				415	<div class="doc_code">
				416	<pre>
				417	/// BinopPrecedence - This holds the precedence for each binary operator that is
				418	/// defined.
				419	static std::map<char, int> BinopPrecedence;
				420
				421	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				422	static int GetTokPrecedence() {
				423	if (!isascii(CurTok))
				424	return -1;
				425
				426	// Make sure it's a declared binop.
				427	int TokPrec = BinopPrecedence[CurTok];
				428	if (TokPrec <= 0) return -1;
				429	return TokPrec;
				430	}
				431
				432	int main() {
				433	// Install standard binary operators.
				434	// 1 is lowest precedence.
				435	BinopPrecedence['<'] = 10;
				436	BinopPrecedence['+'] = 20;
				437	BinopPrecedence['-'] = 20;
				438	BinopPrecedence['*'] = 40; // highest.
				439	...
				440	}
				441	</pre>
				442	</div>
				443
				444	<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	445	(this can obviously be extended by you, our brave and intrepid reader). The
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	446	<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
				447	or -1 if the token is not a binary operator. Having a map makes it easy to add
				448	new operators and makes it clear that the algorithm doesn't depend on the
				449	specific operators involved, but it would be easy enough to eliminate the map
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	450	and do the comparisons in the <tt>GetTokPrecedence</tt> function. (Or just use
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	451	a fixed-size array).</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	452
				453	<p>With the helper above defined, we can now start parsing binary expressions.
				454	The basic idea of operator precedence parsing is to break down an expression
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	455	with potentially ambiguous binary operators into pieces. Consider ,for example,
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	456	the expression "a+b+(c+d)ef+g". Operator precedence parsing considers this
				457	as a stream of primary expressions separated by binary operators. As such,
				458	it will first parse the leading primary expression "a", then it will see the
				459	pairs [+, b] [+, (c+d)] [, e] [, f] and [+, g]. Note that because parentheses
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	460	are primary expressions, the binary expression parser doesn't need to worry
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	461	about nested subexpressions like (c+d) at all.
				462	</p>
				463
				464	<p>
				465	To start, an expression is a primary expression potentially followed by a
				466	sequence of [binop,primaryexpr] pairs:</p>
				467
				468	<div class="doc_code">
				469	<pre>
				470	/// expression
				471	/// ::= primary binoprhs
				472	///
				473	static ExprAST *ParseExpression() {
				474	ExprAST *LHS = ParsePrimary();
				475	if (!LHS) return 0;
				476
				477	return ParseBinOpRHS(0, LHS);
				478	}
				479	</pre>
				480	</div>
				481
				482	<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	483	us. It takes a precedence and a pointer to an expression for the part that has been
				484	parsed so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	485	allowed to be empty, in which case it returns the expression that is passed into
				486	it. In our example above, the code passes the expression for "a" into
				487	<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
				488
				489	<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
				490	minimal operator precedence</em> that the function is allowed to eat. For
				491	example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
				492	passed in a precedence of 40, it will not consume any tokens (because the
				493	precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
				494	with:</p>
				495
				496	<div class="doc_code">
				497	<pre>
				498	/// binoprhs
				499	/// ::= ('+' primary)*
				500	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				501	// If this is a binop, find its precedence.
				502	while (1) {
				503	int TokPrec = GetTokPrecedence();
				504
				505	// If this is a binop that binds at least as tightly as the current binop,
				506	// consume it, otherwise we are done.
				507	if (TokPrec < ExprPrec)
				508	return LHS;
				509	</pre>
				510	</div>
				511
				512	<p>This code gets the precedence of the current token and checks to see if if is
				513	too low. Because we defined invalid tokens to have a precedence of -1, this
				514	check implicitly knows that the pair-stream ends when the token stream runs out
				515	of binary operators. If this check succeeds, we know that the token is a binary
				516	operator and that it will be included in this expression:</p>
				517
				518	<div class="doc_code">
				519	<pre>
				520	// Okay, we know this is a binop.
				521	int BinOp = CurTok;
				522	getNextToken(); // eat binop
				523
				524	// Parse the primary expression after the binary operator.
				525	ExprAST *RHS = ParsePrimary();
				526	if (!RHS) return 0;
				527	</pre>
				528	</div>
				529
				530	<p>As such, this code eats (and remembers) the binary operator and then parses
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	531	the primary expression that follows. This builds up the whole pair, the first of
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	532	which is [+, b] for the running example.</p>
				533
				534	<p>Now that we parsed the left-hand side of an expression and one pair of the
				535	RHS sequence, we have to decide which way the expression associates. In
				536	particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
				537	To determine this, we look ahead at "binop" to determine its precedence and
				538	compare it to BinOp's precedence (which is '+' in this case):</p>
				539
				540	<div class="doc_code">
				541	<pre>
				542	// If BinOp binds less tightly with RHS than the operator after RHS, let
				543	// the pending operator take RHS as its LHS.
				544	int NextPrec = GetTokPrecedence();
				545	if (TokPrec < NextPrec) {
				546	</pre>
				547	</div>
				548
				549	<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
				550	precedence of our current operator, then we know that the parentheses associate
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	551	as "(a+b) binop ...". In our example, the current operator is "+" and the next
				552	operator is "+", we know that they have the same precedence. In this case we'll
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	553	create the AST node for "a+b", and then continue parsing:</p>
				554
				555	<div class="doc_code">
				556	<pre>
				557	... if body omitted ...
				558	}
				559
				560	// Merge LHS/RHS.
				561	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				562	} // loop around to the top of the while loop.
				563	}
				564	</pre>
				565	</div>
				566
				567	<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	568	iteration of the loop, with "+" as the current token. The code above will eat,
				569	remember, and parse "(c+d)" as the primary expression, which makes the
				570	current pair equal to [+, (c+d)]. It will then evaluate the 'if' conditional above with
				571	"" as the binop to the right of the primary. In this case, the precedence of "" is
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	572	higher than the precedence of "+" so the if condition will be entered.</p>
				573
				574	<p>The critical question left here is "how can the if condition parse the right
				575	hand side in full"? In particular, to build the AST correctly for our example,
				576	it needs to get all of "(c+d)ef" as the RHS expression variable. The code to
				577	do this is surprisingly simple (code from the above two blocks duplicated for
				578	context):</p>
				579
				580	<div class="doc_code">
				581	<pre>
				582	// If BinOp binds less tightly with RHS than the operator after RHS, let
				583	// the pending operator take RHS as its LHS.
				584	int NextPrec = GetTokPrecedence();
				585	if (TokPrec < NextPrec) {
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	586	<b>RHS = ParseBinOpRHS(TokPrec+1, RHS);
				587	if (RHS == 0) return 0;</b>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	588	}
				589	// Merge LHS/RHS.
				590	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				591	} // loop around to the top of the while loop.
				592	}
				593	</pre>
				594	</div>
				595
				596	<p>At this point, we know that the binary operator to the RHS of our primary
				597	has higher precedence than the binop we are currently parsing. As such, we know
				598	that any sequence of pairs whose operators are all higher precedence than "+"
				599	should be parsed together and returned as "RHS". To do this, we recursively
				600	invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
				601	precedence required for it to continue. In our example above, this will cause
				602	it to return the AST node for "(c+d)ef" as RHS, which is then set as the RHS
				603	of the '+' expression.</p>
				604
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	605	<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	606	and added to the AST. With this little bit of code (14 non-trivial lines), we
				607	correctly handle fully general binary expression parsing in a very elegant way.
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	608	This was a whirlwind tour of this code, and it is somewhat subtle. I recommend
				609	running through it with a few tough examples to see how it works.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	610	</p>
				611
				612	<p>This wraps up handling of expressions. At this point, we can point the
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	613	parser at an arbitrary token stream and build an expression from it, stopping
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	614	at the first token that is not part of the expression. Next up we need to
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	615	handle function definitions, etc.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	616
				617	</div>
				618
				619	<!-- *********************************************************************** -->
				620	<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
				621	<!-- *********************************************************************** -->
				622
				623	<div class="doc_text">
				624
				625	<p>
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	626	The next thing missing is handling of function prototypes. In Kaleidoscope,
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	627	these are used both for 'extern' function declarations as well as function body
				628	definitions. The code to do this is straight-forward and not very interesting
				629	(once you've survived expressions):
				630	</p>
				631
				632	<div class="doc_code">
				633	<pre>
				634	/// prototype
				635	/// ::= id '(' id* ')'
				636	static PrototypeAST *ParsePrototype() {
				637	if (CurTok != tok_identifier)
				638	return ErrorP("Expected function name in prototype");
				639
				640	std::string FnName = IdentifierStr;
				641	getNextToken();
				642
				643	if (CurTok != '(')
				644	return ErrorP("Expected '(' in prototype");
				645
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	646	// Read the list of argument names.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	647	std::vector<std::string> ArgNames;
				648	while (getNextToken() == tok_identifier)
				649	ArgNames.push_back(IdentifierStr);
				650	if (CurTok != ')')
				651	return ErrorP("Expected ')' in prototype");
				652
				653	// success.
				654	getNextToken(); // eat ')'.
				655
				656	return new PrototypeAST(FnName, ArgNames);
				657	}
				658	</pre>
				659	</div>
				660
				661	<p>Given this, a function definition is very simple, just a prototype plus
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	662	an expression to implement the body:</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	663
				664	<div class="doc_code">
				665	<pre>
				666	/// definition ::= 'def' prototype expression
				667	static FunctionAST *ParseDefinition() {
				668	getNextToken(); // eat def.
				669	PrototypeAST *Proto = ParsePrototype();
				670	if (Proto == 0) return 0;
				671
				672	if (ExprAST *E = ParseExpression())
				673	return new FunctionAST(Proto, E);
				674	return 0;
				675	}
				676	</pre>
				677	</div>
				678
				679	<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	680	well as to support forward declaration of user functions. These 'extern's are just
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	681	prototypes with no body:</p>
				682
				683	<div class="doc_code">
				684	<pre>
				685	/// external ::= 'extern' prototype
				686	static PrototypeAST *ParseExtern() {
				687	getNextToken(); // eat extern.
				688	return ParsePrototype();
				689	}
				690	</pre>
				691	</div>
				692
				693	<p>Finally, we'll also let the user type in arbitrary top-level expressions and
				694	evaluate them on the fly. We will handle this by defining anonymous nullary
				695	(zero argument) functions for them:</p>
				696
				697	<div class="doc_code">
				698	<pre>
				699	/// toplevelexpr ::= expression
				700	static FunctionAST *ParseTopLevelExpr() {
				701	if (ExprAST *E = ParseExpression()) {
				702	// Make an anonymous proto.
				703	PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
				704	return new FunctionAST(Proto, E);
				705	}
				706	return 0;
				707	}
				708	</pre>
				709	</div>
				710
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	711	<p>Now that we have all the pieces, let's build a little driver that will let us
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	712	actually <em>execute</em> this code we've built!</p>
				713
				714	</div>
				715
				716	<!-- *********************************************************************** -->
				717	<div class="doc_section"><a name="driver">The Driver</a></div>
				718	<!-- *********************************************************************** -->
				719
				720	<div class="doc_text">
				721
				722	<p>The driver for this simply invokes all of the parsing pieces with a top-level
				723	dispatch loop. There isn't much interesting here, so I'll just include the
				724	top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
				725	Parsing" section.</p>
				726
				727	<div class="doc_code">
				728	<pre>
				729	/// top ::= definition \| external \| expression \| ';'
				730	static void MainLoop() {
				731	while (1) {
				732	fprintf(stderr, "ready> ");
				733	switch (CurTok) {
				734	case tok_eof: return;
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	735	case ';': getNextToken(); break; // ignore top-level semicolons.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	736	case tok_def: HandleDefinition(); break;
				737	case tok_extern: HandleExtern(); break;
				738	default: HandleTopLevelExpression(); break;
				739	}
				740	}
				741	}
				742	</pre>
				743	</div>
				744
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	745	<p>The most interesting part of this is that we ignore top-level semicolons.
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	746	Why is this, you ask? The basic reason is that if you type "4 + 5" at the
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	747	command line, the parser doesn't know whether that is the end of what you will type
				748	or not. For example, on the next line you could type "def foo..." in which case
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	749	4+5 is the end of a top-level expression. Alternatively you could type "* 6",
				750	which would continue the expression. Having top-level semicolons allows you to
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	751	type "4+5;", and the parser will know you are done.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	752
				753	</div>
				754
				755	<!-- *********************************************************************** -->
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	756	<div class="doc_section"><a name="conclusions">Conclusions</a></div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	757	<!-- *********************************************************************** -->
				758
				759	<div class="doc_text">
				760
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	761	<p>With just under 400 lines of commented code (240 lines of non-comment,
				762	non-blank code), we fully defined our minimal language, including a lexer,
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	763	parser, and AST builder. With this done, the executable will validate
				764	Kaleidoscope code and tell us if it is grammatically invalid. For
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	765	example, here is a sample interaction:</p>
				766
				767	<div class="doc_code">
				768	<pre>
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	769	$ <b>./a.out</b>
				770	ready> <b>def foo(x y) x+foo(y, 4.0);</b>
				771	Parsed a function definition.
				772	ready> <b>def foo(x y) x+y y;</b>
				773	Parsed a function definition.
				774	Parsed a top-level expr
				775	ready> <b>def foo(x y) x+y );</b>
				776	Parsed a function definition.
				777	Error: unknown token when expecting an expression
				778	ready> <b>extern sin(a);</b>
Chris Lattner	35abbf5	2007-10-23 06:23:57 +0000	[diff] [blame]	779	ready> Parsed an extern
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	780	ready> <b>^D</b>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	781	$
				782	</pre>
				783	</div>
				784
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	785	<p>There is a lot of room for extension here. You can define new AST nodes,
				786	extend the language in many ways, etc. In the <a href="LangImpl3.html">next
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	787	installment</a>, we will describe how to generate LLVM Intermediate
				788	Representation (IR) from the AST.</p>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	789
				790	</div>
				791
				792	<!-- *********************************************************************** -->
				793	<div class="doc_section"><a name="code">Full Code Listing</a></div>
				794	<!-- *********************************************************************** -->
				795
				796	<div class="doc_text">
				797
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	798	<p>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	799	Here is the complete code listing for this and the previous chapter.
				800	Note that it is fully self-contained: you don't need LLVM or any external
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	801	libraries at all for this. (Besides the C and C++ standard libraries, of
				802	course.) To build this, just compile with:</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	803
				804	<div class="doc_code">
				805	<pre>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	806	# Compile
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	807	g++ -g -O3 toy.cpp
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	808	# Run
				809	./a.out
				810	</pre>
				811	</div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	812
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	813	<p>Here is the code:</p>
				814
				815	<div class="doc_code">
				816	<pre>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	817	#include <cstdio>
John McCall	46c8181	2009-08-17 21:07:37 +0000	[diff] [blame]	818	#include <cstdlib>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	819	#include <string>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	820	#include <map>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	821	#include <vector>
				822
				823	//===----------------------------------------------------------------------===//
				824	// Lexer
				825	//===----------------------------------------------------------------------===//
				826
				827	// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
				828	// of these for known things.
				829	enum Token {
				830	tok_eof = -1,
				831
				832	// commands
				833	tok_def = -2, tok_extern = -3,
				834
				835	// primary
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	836	tok_identifier = -4, tok_number = -5
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	837	};
				838
				839	static std::string IdentifierStr; // Filled in if tok_identifier
				840	static double NumVal; // Filled in if tok_number
				841
				842	/// gettok - Return the next token from standard input.
				843	static int gettok() {
				844	static int LastChar = ' ';
				845
				846	// Skip any whitespace.
				847	while (isspace(LastChar))
				848	LastChar = getchar();
				849
				850	if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
				851	IdentifierStr = LastChar;
				852	while (isalnum((LastChar = getchar())))
				853	IdentifierStr += LastChar;
				854
				855	if (IdentifierStr == "def") return tok_def;
				856	if (IdentifierStr == "extern") return tok_extern;
				857	return tok_identifier;
				858	}
				859
				860	if (isdigit(LastChar) \|\| LastChar == '.') { // Number: [0-9.]+
				861	std::string NumStr;
				862	do {
				863	NumStr += LastChar;
				864	LastChar = getchar();
				865	} while (isdigit(LastChar) \|\| LastChar == '.');
				866
				867	NumVal = strtod(NumStr.c_str(), 0);
				868	return tok_number;
				869	}
				870
				871	if (LastChar == '#') {
				872	// Comment until end of line.
				873	do LastChar = getchar();
Chris Lattner	c80c23f	2007-12-02 22:46:01 +0000	[diff] [blame]	874	while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	875
				876	if (LastChar != EOF)
				877	return gettok();
				878	}
				879
				880	// Check for end of file. Don't eat the EOF.
				881	if (LastChar == EOF)
				882	return tok_eof;
				883
				884	// Otherwise, just return the character as its ascii value.
				885	int ThisChar = LastChar;
				886	LastChar = getchar();
				887	return ThisChar;
				888	}
				889
				890	//===----------------------------------------------------------------------===//
				891	// Abstract Syntax Tree (aka Parse Tree)
				892	//===----------------------------------------------------------------------===//
				893
				894	/// ExprAST - Base class for all expression nodes.
				895	class ExprAST {
				896	public:
				897	virtual ~ExprAST() {}
				898	};
				899
				900	/// NumberExprAST - Expression class for numeric literals like "1.0".
				901	class NumberExprAST : public ExprAST {
				902	double Val;
				903	public:
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	904	NumberExprAST(double val) : Val(val) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	905	};
				906
				907	/// VariableExprAST - Expression class for referencing a variable, like "a".
				908	class VariableExprAST : public ExprAST {
				909	std::string Name;
				910	public:
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	911	VariableExprAST(const std::string &name) : Name(name) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	912	};
				913
				914	/// BinaryExprAST - Expression class for a binary operator.
				915	class BinaryExprAST : public ExprAST {
				916	char Op;
				917	ExprAST LHS, RHS;
				918	public:
				919	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				920	: Op(op), LHS(lhs), RHS(rhs) {}
				921	};
				922
				923	/// CallExprAST - Expression class for function calls.
				924	class CallExprAST : public ExprAST {
				925	std::string Callee;
				926	std::vector<ExprAST*> Args;
				927	public:
				928	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				929	: Callee(callee), Args(args) {}
				930	};
				931
				932	/// PrototypeAST - This class represents the "prototype" for a function,
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	933	/// which captures its name, and its argument names (thus implicitly the number
				934	/// of arguments the function takes).
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	935	class PrototypeAST {
				936	std::string Name;
Chris Lattner	5ea8ef8	2008-02-22 17:09:39 +0000	[diff] [blame]	937	std::vector<std::string> Args;
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	938	public:
				939	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				940	: Name(name), Args(args) {}
				941
				942	};
				943
				944	/// FunctionAST - This class represents a function definition itself.
				945	class FunctionAST {
				946	PrototypeAST *Proto;
				947	ExprAST *Body;
				948	public:
				949	FunctionAST(PrototypeAST proto, ExprAST body)
				950	: Proto(proto), Body(body) {}
				951
				952	};
				953
				954	//===----------------------------------------------------------------------===//
				955	// Parser
				956	//===----------------------------------------------------------------------===//
				957
				958	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	959	/// token the parser is looking at. getNextToken reads another token from the
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	960	/// lexer and updates CurTok with its results.
				961	static int CurTok;
				962	static int getNextToken() {
				963	return CurTok = gettok();
				964	}
				965
				966	/// BinopPrecedence - This holds the precedence for each binary operator that is
				967	/// defined.
				968	static std::map<char, int> BinopPrecedence;
				969
				970	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				971	static int GetTokPrecedence() {
				972	if (!isascii(CurTok))
				973	return -1;
				974
				975	// Make sure it's a declared binop.
				976	int TokPrec = BinopPrecedence[CurTok];
				977	if (TokPrec <= 0) return -1;
				978	return TokPrec;
				979	}
				980
				981	/// Error* - These are little helper functions for error handling.
				982	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				983	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				984	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				985
				986	static ExprAST *ParseExpression();
				987
				988	/// identifierexpr
Chris Lattner	20a0c80	2007-11-05 17:54:34 +0000	[diff] [blame]	989	/// ::= identifier
				990	/// ::= identifier '(' expression* ')'
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	991	static ExprAST *ParseIdentifierExpr() {
				992	std::string IdName = IdentifierStr;
				993
Chris Lattner	20a0c80	2007-11-05 17:54:34 +0000	[diff] [blame]	994	getNextToken(); // eat identifier.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	995
				996	if (CurTok != '(') // Simple variable ref.
				997	return new VariableExprAST(IdName);
				998
				999	// Call.
				1000	getNextToken(); // eat (
				1001	std::vector<ExprAST*> Args;
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	1002	if (CurTok != ')') {
				1003	while (1) {
				1004	ExprAST *Arg = ParseExpression();
				1005	if (!Arg) return 0;
				1006	Args.push_back(Arg);
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1007
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	1008	if (CurTok == ')') break;
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1009
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	1010	if (CurTok != ',')
Chris Lattner	6c4be9c	2008-04-14 16:44:41 +0000	[diff] [blame]	1011	return Error("Expected ')' or ',' in argument list");
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	1012	getNextToken();
				1013	}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1014	}
				1015
				1016	// Eat the ')'.
				1017	getNextToken();
				1018
				1019	return new CallExprAST(IdName, Args);
				1020	}
				1021
				1022	/// numberexpr ::= number
				1023	static ExprAST *ParseNumberExpr() {
				1024	ExprAST *Result = new NumberExprAST(NumVal);
				1025	getNextToken(); // consume the number
				1026	return Result;
				1027	}
				1028
				1029	/// parenexpr ::= '(' expression ')'
				1030	static ExprAST *ParseParenExpr() {
				1031	getNextToken(); // eat (.
				1032	ExprAST *V = ParseExpression();
				1033	if (!V) return 0;
				1034
				1035	if (CurTok != ')')
				1036	return Error("expected ')'");
				1037	getNextToken(); // eat ).
				1038	return V;
				1039	}
				1040
				1041	/// primary
				1042	/// ::= identifierexpr
				1043	/// ::= numberexpr
				1044	/// ::= parenexpr
				1045	static ExprAST *ParsePrimary() {
				1046	switch (CurTok) {
				1047	default: return Error("unknown token when expecting an expression");
				1048	case tok_identifier: return ParseIdentifierExpr();
				1049	case tok_number: return ParseNumberExpr();
				1050	case '(': return ParseParenExpr();
				1051	}
				1052	}
				1053
				1054	/// binoprhs
				1055	/// ::= ('+' primary)*
				1056	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				1057	// If this is a binop, find its precedence.
				1058	while (1) {
				1059	int TokPrec = GetTokPrecedence();
				1060
				1061	// If this is a binop that binds at least as tightly as the current binop,
				1062	// consume it, otherwise we are done.
				1063	if (TokPrec < ExprPrec)
				1064	return LHS;
				1065
				1066	// Okay, we know this is a binop.
				1067	int BinOp = CurTok;
				1068	getNextToken(); // eat binop
				1069
				1070	// Parse the primary expression after the binary operator.
				1071	ExprAST *RHS = ParsePrimary();
				1072	if (!RHS) return 0;
				1073
				1074	// If BinOp binds less tightly with RHS than the operator after RHS, let
				1075	// the pending operator take RHS as its LHS.
				1076	int NextPrec = GetTokPrecedence();
				1077	if (TokPrec < NextPrec) {
				1078	RHS = ParseBinOpRHS(TokPrec+1, RHS);
				1079	if (RHS == 0) return 0;
				1080	}
				1081
				1082	// Merge LHS/RHS.
				1083	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				1084	}
				1085	}
				1086
				1087	/// expression
				1088	/// ::= primary binoprhs
				1089	///
				1090	static ExprAST *ParseExpression() {
				1091	ExprAST *LHS = ParsePrimary();
				1092	if (!LHS) return 0;
				1093
				1094	return ParseBinOpRHS(0, LHS);
				1095	}
				1096
				1097	/// prototype
				1098	/// ::= id '(' id* ')'
				1099	static PrototypeAST *ParsePrototype() {
				1100	if (CurTok != tok_identifier)
				1101	return ErrorP("Expected function name in prototype");
				1102
				1103	std::string FnName = IdentifierStr;
				1104	getNextToken();
				1105
				1106	if (CurTok != '(')
				1107	return ErrorP("Expected '(' in prototype");
				1108
				1109	std::vector<std::string> ArgNames;
				1110	while (getNextToken() == tok_identifier)
				1111	ArgNames.push_back(IdentifierStr);
				1112	if (CurTok != ')')
				1113	return ErrorP("Expected ')' in prototype");
				1114
				1115	// success.
				1116	getNextToken(); // eat ')'.
				1117
				1118	return new PrototypeAST(FnName, ArgNames);
				1119	}
				1120
				1121	/// definition ::= 'def' prototype expression
				1122	static FunctionAST *ParseDefinition() {
				1123	getNextToken(); // eat def.
				1124	PrototypeAST *Proto = ParsePrototype();
				1125	if (Proto == 0) return 0;
				1126
				1127	if (ExprAST *E = ParseExpression())
				1128	return new FunctionAST(Proto, E);
				1129	return 0;
				1130	}
				1131
				1132	/// toplevelexpr ::= expression
				1133	static FunctionAST *ParseTopLevelExpr() {
				1134	if (ExprAST *E = ParseExpression()) {
				1135	// Make an anonymous proto.
Chris Lattner	5ea8ef8	2008-02-22 17:09:39 +0000	[diff] [blame]	1136	PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1137	return new FunctionAST(Proto, E);
				1138	}
				1139	return 0;
				1140	}
				1141
				1142	/// external ::= 'extern' prototype
				1143	static PrototypeAST *ParseExtern() {
				1144	getNextToken(); // eat extern.
				1145	return ParsePrototype();
				1146	}
				1147
				1148	//===----------------------------------------------------------------------===//
				1149	// Top-Level parsing
				1150	//===----------------------------------------------------------------------===//
				1151
				1152	static void HandleDefinition() {
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1153	if (ParseDefinition()) {
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1154	fprintf(stderr, "Parsed a function definition.\n");
				1155	} else {
				1156	// Skip token for error recovery.
				1157	getNextToken();
				1158	}
				1159	}
				1160
				1161	static void HandleExtern() {
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1162	if (ParseExtern()) {
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1163	fprintf(stderr, "Parsed an extern\n");
				1164	} else {
				1165	// Skip token for error recovery.
				1166	getNextToken();
				1167	}
				1168	}
				1169
				1170	static void HandleTopLevelExpression() {
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	1171	// Evaluate a top-level expression into an anonymous function.
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1172	if (ParseTopLevelExpr()) {
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1173	fprintf(stderr, "Parsed a top-level expr\n");
				1174	} else {
				1175	// Skip token for error recovery.
				1176	getNextToken();
				1177	}
				1178	}
				1179
				1180	/// top ::= definition \| external \| expression \| ';'
				1181	static void MainLoop() {
				1182	while (1) {
				1183	fprintf(stderr, "ready> ");
				1184	switch (CurTok) {
				1185	case tok_eof: return;
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	1186	case ';': getNextToken(); break; // ignore top-level semicolons.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1187	case tok_def: HandleDefinition(); break;
				1188	case tok_extern: HandleExtern(); break;
				1189	default: HandleTopLevelExpression(); break;
				1190	}
				1191	}
				1192	}
				1193
				1194	//===----------------------------------------------------------------------===//
				1195	// Main driver code.
				1196	//===----------------------------------------------------------------------===//
				1197
				1198	int main() {
				1199	// Install standard binary operators.
				1200	// 1 is lowest precedence.
				1201	BinopPrecedence['<'] = 10;
				1202	BinopPrecedence['+'] = 20;
				1203	BinopPrecedence['-'] = 20;
				1204	BinopPrecedence['*'] = 40; // highest.
				1205
				1206	// Prime the first token.
				1207	fprintf(stderr, "ready> ");
				1208	getNextToken();
				1209
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1210	// Run the main "interpreter loop" now.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1211	MainLoop();
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1212
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1213	return 0;
				1214	}
				1215	</pre>
				1216	</div>
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	1217	<a href="LangImpl3.html">Next: Implementing Code Generation to LLVM IR</a>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1218	</div>
				1219
				1220	<!-- *********************************************************************** -->
				1221	<hr>
				1222	<address>
				1223	<a href="http://jigsaw.w3.org/css-validator/check/referer"><img
				1224	src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
				1225	<a href="http://validator.w3.org/check/referer"><img
Chris Lattner	8eef4b2	2007-10-23 06:30:50 +0000	[diff] [blame]	1226	src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1227
				1228	<a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
				1229	<a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
Dan Gohman	523e392	2010-02-03 17:27:31 +0000	[diff] [blame]	1230	Last modified: $Date$
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1231	</address>
				1232	</body>
				1233	</html>