Blame - docs/tutorial/LangImpl2.html - fp2-dev/platform/external/llvm

blob: 50619e095502382959262f5dce1ecb24ae8b2174 [file] [log] [blame]

Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
				2	"http://www.w3.org/TR/html4/strict.dtd">
				3
				4	<html>
				5	<head>
				6	<title>Kaleidoscope: Implementing a Parser and AST</title>
				7	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
				8	<meta name="author" content="Chris Lattner">
				9	<link rel="stylesheet" href="../llvm.css" type="text/css">
				10	</head>
				11
				12	<body>
				13
				14	<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
				15
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	16	<ul>
Chris Lattner	0e555b1	2007-11-05 20:04:56 +0000	[diff] [blame]	17	<li><a href="index.html">Up to Tutorial Index</a></li>
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	18	<li>Chapter 2
				19	<ol>
				20	<li><a href="#intro">Chapter 2 Introduction</a></li>
				21	<li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
				22	<li><a href="#parserbasics">Parser Basics</a></li>
				23	<li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
				24	<li><a href="#parserbinops">Binary Expression Parsing</a></li>
				25	<li><a href="#parsertop">Parsing the Rest</a></li>
				26	<li><a href="#driver">The Driver</a></li>
				27	<li><a href="#conclusions">Conclusions</a></li>
				28	<li><a href="#code">Full Code Listing</a></li>
				29	</ol>
				30	</li>
Chris Lattner	0e555b1	2007-11-05 20:04:56 +0000	[diff] [blame]	31	<li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	32	</ul>
				33
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	34	<div class="doc_author">
				35	<p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
				36	</div>
				37
				38	<!-- *********************************************************************** -->
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	39	<div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	40	<!-- *********************************************************************** -->
				41
				42	<div class="doc_text">
				43
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	44	<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	45	with LLVM</a>" tutorial. This chapter shows you how to use the lexer, built in
				46	<a href="LangImpl1.html">Chapter 1</a>, to build a full <a
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	47	href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	48	our Kaleidoscope language. Once we have a parser, we'll define and build an <a
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	49	href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
				50	Tree</a> (AST).</p>
				51
				52	<p>The parser we will build uses a combination of <a
				53	href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
				54	Parsing</a> and <a href=
				55	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	56	Parsing</a> to parse the Kaleidoscope language (the latter for
				57	binary expressions and the former for everything else). Before we get to
				58	parsing though, lets talk about the output of the parser: the Abstract Syntax
				59	Tree.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	60
				61	</div>
				62
				63	<!-- *********************************************************************** -->
				64	<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
				65	<!-- *********************************************************************** -->
				66
				67	<div class="doc_text">
				68
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	69	<p>The AST for a program captures its behavior in such a way that it is easy for
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	70	later stages of the compiler (e.g. code generation) to interpret. We basically
				71	want one object for each construct in the language, and the AST should closely
				72	model the language. In Kaleidoscope, we have expressions, a prototype, and a
				73	function object. We'll start with expressions first:</p>
				74
				75	<div class="doc_code">
				76	<pre>
				77	/// ExprAST - Base class for all expression nodes.
				78	class ExprAST {
				79	public:
				80	virtual ~ExprAST() {}
				81	};
				82
				83	/// NumberExprAST - Expression class for numeric literals like "1.0".
				84	class NumberExprAST : public ExprAST {
				85	double Val;
				86	public:
Chris Lattner	28571ed	2007-10-23 04:27:44 +0000	[diff] [blame]	87	explicit NumberExprAST(double val) : Val(val) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	88	};
				89	</pre>
				90	</div>
				91
				92	<p>The code above shows the definition of the base ExprAST class and one
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	93	subclass which we use for numeric literals. The important thing to note about
				94	this code is that the NumberExprAST class captures the numeric value of the
				95	literal as an instance variable. This allows later phases of the compiler to
				96	know what the stored numeric value is.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	97
				98	<p>Right now we only create the AST, so there are no useful accessor methods on
				99	them. It would be very easy to add a virtual method to pretty print the code,
				100	for example. Here are the other expression AST node definitions that we'll use
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	101	in the basic form of the Kaleidoscope language:
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	102	</p>
				103
				104	<div class="doc_code">
				105	<pre>
				106	/// VariableExprAST - Expression class for referencing a variable, like "a".
				107	class VariableExprAST : public ExprAST {
				108	std::string Name;
				109	public:
Chris Lattner	28571ed	2007-10-23 04:27:44 +0000	[diff] [blame]	110	explicit VariableExprAST(const std::string &name) : Name(name) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	111	};
				112
				113	/// BinaryExprAST - Expression class for a binary operator.
				114	class BinaryExprAST : public ExprAST {
				115	char Op;
				116	ExprAST LHS, RHS;
				117	public:
				118	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				119	: Op(op), LHS(lhs), RHS(rhs) {}
				120	};
				121
				122	/// CallExprAST - Expression class for function calls.
				123	class CallExprAST : public ExprAST {
				124	std::string Callee;
				125	std::vector<ExprAST*> Args;
				126	public:
				127	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				128	: Callee(callee), Args(args) {}
				129	};
				130	</pre>
				131	</div>
				132
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	133	<p>This is all (intentionally) rather straight-forward: variables capture the
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	134	variable name, binary operators capture their opcode (e.g. '+'), and calls
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	135	capture a function name as well as a list of any argument expressions. One thing
				136	that is nice about our AST is that it captures the language features without
				137	talking about the syntax of the language. Note that there is no discussion about
				138	precedence of binary operators, lexical structure, etc.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	139
				140	<p>For our basic language, these are all of the expression nodes we'll define.
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	141	Because it doesn't have conditional control flow, it isn't Turing-complete;
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	142	we'll fix that in a later installment. The two things we need next are a way
				143	to talk about the interface to a function, and a way to talk about functions
				144	themselves:</p>
				145
				146	<div class="doc_code">
				147	<pre>
				148	/// PrototypeAST - This class represents the "prototype" for a function,
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	149	/// which captures its name, and its argument names (thus implicitly the number
				150	/// of arguments the function takes).
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	151	class PrototypeAST {
				152	std::string Name;
				153	std::vector<std::string> Args;
				154	public:
				155	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				156	: Name(name), Args(args) {}
				157	};
				158
				159	/// FunctionAST - This class represents a function definition itself.
				160	class FunctionAST {
				161	PrototypeAST *Proto;
				162	ExprAST *Body;
				163	public:
				164	FunctionAST(PrototypeAST proto, ExprAST body)
				165	: Proto(proto), Body(body) {}
				166	};
				167	</pre>
				168	</div>
				169
				170	<p>In Kaleidoscope, functions are typed with just a count of their arguments.
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	171	Since all values are double precision floating point, the type of each argument
				172	doesn't need to be stored anywhere. In a more aggressive and realistic
				173	language, the "ExprAST" class would probably have a type field.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	174
				175	<p>With this scaffolding, we can now talk about parsing expressions and function
				176	bodies in Kaleidoscope.</p>
				177
				178	</div>
				179
				180	<!-- *********************************************************************** -->
				181	<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
				182	<!-- *********************************************************************** -->
				183
				184	<div class="doc_text">
				185
				186	<p>Now that we have an AST to build, we need to define the parser code to build
				187	it. The idea here is that we want to parse something like "x+y" (which is
				188	returned as three tokens by the lexer) into an AST that could be generated with
				189	calls like this:</p>
				190
				191	<div class="doc_code">
				192	<pre>
				193	ExprAST *X = new VariableExprAST("x");
				194	ExprAST *Y = new VariableExprAST("y");
				195	ExprAST *Result = new BinaryExprAST('+', X, Y);
				196	</pre>
				197	</div>
				198
				199	<p>In order to do this, we'll start by defining some basic helper routines:</p>
				200
				201	<div class="doc_code">
				202	<pre>
				203	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	204	/// token the parser is looking at. getNextToken reads another token from the
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	205	/// lexer and updates CurTok with its results.
				206	static int CurTok;
				207	static int getNextToken() {
				208	return CurTok = gettok();
				209	}
				210	</pre>
				211	</div>
				212
				213	<p>
				214	This implements a simple token buffer around the lexer. This allows
				215	us to look one token ahead at what the lexer is returning. Every function in
Chris Lattner	e949512	2007-10-25 18:05:29 +0000	[diff] [blame]	216	our parser will assume that CurTok is the current token that needs to be
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	217	parsed.</p>
				218
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	219	<div class="doc_code">
				220	<pre>
				221
				222	/// Error* - These are little helper functions for error handling.
				223	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				224	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				225	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				226	</pre>
				227	</div>
				228
				229	<p>
				230	The <tt>Error</tt> routines are simple helper routines that our parser will use
				231	to handle errors. The error recovery in our parser will not be the best and
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	232	is not particular user-friendly, but it will be enough for our tutorial. These
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	233	routines make it easier to handle errors in routines that have various return
				234	types: they always return null.</p>
				235
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	236	<p>With these basic helper functions, we can implement the first
				237	piece of our grammar: numeric literals.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	238
				239	</div>
				240
				241	<!-- *********************************************************************** -->
				242	<div class="doc_section"><a name="parserprimexprs">Basic Expression
				243	Parsing</a></div>
				244	<!-- *********************************************************************** -->
				245
				246	<div class="doc_text">
				247
				248	<p>We start with numeric literals, because they are the simplest to process.
				249	For each production in our grammar, we'll define a function which parses that
				250	production. For numeric literals, we have:
				251	</p>
				252
				253	<div class="doc_code">
				254	<pre>
				255	/// numberexpr ::= number
				256	static ExprAST *ParseNumberExpr() {
				257	ExprAST *Result = new NumberExprAST(NumVal);
				258	getNextToken(); // consume the number
				259	return Result;
				260	}
				261	</pre>
				262	</div>
				263
				264	<p>This routine is very simple: it expects to be called when the current token
				265	is a <tt>tok_number</tt> token. It takes the current number value, creates
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	266	a <tt>NumberExprAST</tt> node, advances the lexer to the next token, and finally
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	267	returns.</p>
				268
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	269	<p>There are some interesting aspects to this. The most important one is that
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	270	this routine eats all of the tokens that correspond to the production and
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	271	returns the lexer buffer with the next token (which is not part of the grammar
				272	production) ready to go. This is a fairly standard way to go for recursive
				273	descent parsers. For a better example, the parenthesis operator is defined like
				274	this:</p>
				275
				276	<div class="doc_code">
				277	<pre>
				278	/// parenexpr ::= '(' expression ')'
				279	static ExprAST *ParseParenExpr() {
				280	getNextToken(); // eat (.
				281	ExprAST *V = ParseExpression();
				282	if (!V) return 0;
				283
				284	if (CurTok != ')')
				285	return Error("expected ')'");
				286	getNextToken(); // eat ).
				287	return V;
				288	}
				289	</pre>
				290	</div>
				291
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	292	<p>This function illustrates a number of interesting things about the
				293	parser:</p>
				294
				295	<p>
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	296	1) It shows how we use the Error routines. When called, this function expects
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	297	that the current token is a '(' token, but after parsing the subexpression, it
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	298	is possible that there is no ')' waiting. For example, if the user types in
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	299	"(4 x" instead of "(4)", the parser should emit an error. Because errors can
				300	occur, the parser needs a way to indicate that they happened: in our parser, we
				301	return null on an error.</p>
				302
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	303	<p>2) Another interesting aspect of this function is that it uses recursion by
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	304	calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	305	<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
				306	recursive grammars, and keeps each production very simple. Note that
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	307	parentheses do not cause construction of AST nodes themselves. While we could
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	308	do it this way, the most important role of parentheses are to guide the parser
				309	and provide grouping. Once the parser constructs the AST, parentheses are not
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	310	needed.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	311
				312	<p>The next simple production is for handling variable references and function
				313	calls:</p>
				314
				315	<div class="doc_code">
				316	<pre>
				317	/// identifierexpr
Chris Lattner	20a0c80	2007-11-05 17:54:34 +0000	[diff] [blame]	318	/// ::= identifier
				319	/// ::= identifier '(' expression* ')'
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	320	static ExprAST *ParseIdentifierExpr() {
				321	std::string IdName = IdentifierStr;
				322
Chris Lattner	20a0c80	2007-11-05 17:54:34 +0000	[diff] [blame]	323	getNextToken(); // eat identifier.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	324
				325	if (CurTok != '(') // Simple variable ref.
				326	return new VariableExprAST(IdName);
				327
				328	// Call.
				329	getNextToken(); // eat (
				330	std::vector<ExprAST*> Args;
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	331	if (CurTok != ')') {
				332	while (1) {
				333	ExprAST *Arg = ParseExpression();
				334	if (!Arg) return 0;
				335	Args.push_back(Arg);
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	336
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	337	if (CurTok == ')') break;
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	338
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	339	if (CurTok != ',')
				340	return Error("Expected ')'");
				341	getNextToken();
				342	}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	343	}
				344
				345	// Eat the ')'.
				346	getNextToken();
				347
				348	return new CallExprAST(IdName, Args);
				349	}
				350	</pre>
				351	</div>
				352
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	353	<p>This routine follows the same style as the other routines. (It expects to be
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	354	called if the current token is a <tt>tok_identifier</tt> token). It also has
				355	recursion and error handling. One interesting aspect of this is that it uses
				356	<em>look-ahead</em> to determine if the current identifier is a stand alone
				357	variable reference or if it is a function call expression. It handles this by
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	358	checking to see if the token after the identifier is a '(' token, constructing
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	359	either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
				360	</p>
				361
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	362	<p>Now that we have all of our simple expression-parsing logic in place, we can
				363	define a helper function to wrap it together into one entry point. We call this
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	364	class of expressions "primary" expressions, for reasons that will become more
				365	clear <a href="LangImpl6.html#unary">later in the tutorial</a>. In order to
				366	parse an arbitrary primary expression, we need to determine what sort of
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	367	expression it is:</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	368
				369	<div class="doc_code">
				370	<pre>
				371	/// primary
				372	/// ::= identifierexpr
				373	/// ::= numberexpr
				374	/// ::= parenexpr
				375	static ExprAST *ParsePrimary() {
				376	switch (CurTok) {
				377	default: return Error("unknown token when expecting an expression");
				378	case tok_identifier: return ParseIdentifierExpr();
				379	case tok_number: return ParseNumberExpr();
				380	case '(': return ParseParenExpr();
				381	}
				382	}
				383	</pre>
				384	</div>
				385
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	386	<p>Now that you see the definition of this function, it is more obvious why we
				387	can assume the state of CurTok in the various functions. This uses look-ahead
				388	to determine which sort of expression is being inspected, and then parses it
				389	with a function call.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	390
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	391	<p>Now that basic expressions are handled, we need to handle binary expressions.
				392	They are a bit more complex.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	393
				394	</div>
				395
				396	<!-- *********************************************************************** -->
				397	<div class="doc_section"><a name="parserbinops">Binary Expression
				398	Parsing</a></div>
				399	<!-- *********************************************************************** -->
				400
				401	<div class="doc_text">
				402
				403	<p>Binary expressions are significantly harder to parse because they are often
				404	ambiguous. For example, when given the string "x+y*z", the parser can choose
				405	to parse it as either "(x+y)z" or "x+(yz)". With common definitions from
				406	mathematics, we expect the later parse, because "*" (multiplication) has
				407	higher <em>precedence</em> than "+" (addition).</p>
				408
				409	<p>There are many ways to handle this, but an elegant and efficient way is to
				410	use <a href=
				411	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				412	Parsing</a>. This parsing technique uses the precedence of binary operators to
				413	guide recursion. To start with, we need a table of precedences:</p>
				414
				415	<div class="doc_code">
				416	<pre>
				417	/// BinopPrecedence - This holds the precedence for each binary operator that is
				418	/// defined.
				419	static std::map<char, int> BinopPrecedence;
				420
				421	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				422	static int GetTokPrecedence() {
				423	if (!isascii(CurTok))
				424	return -1;
				425
				426	// Make sure it's a declared binop.
				427	int TokPrec = BinopPrecedence[CurTok];
				428	if (TokPrec <= 0) return -1;
				429	return TokPrec;
				430	}
				431
				432	int main() {
				433	// Install standard binary operators.
				434	// 1 is lowest precedence.
				435	BinopPrecedence['<'] = 10;
				436	BinopPrecedence['+'] = 20;
				437	BinopPrecedence['-'] = 20;
				438	BinopPrecedence['*'] = 40; // highest.
				439	...
				440	}
				441	</pre>
				442	</div>
				443
				444	<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	445	(this can obviously be extended by you, our brave and intrepid reader). The
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	446	<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
				447	or -1 if the token is not a binary operator. Having a map makes it easy to add
				448	new operators and makes it clear that the algorithm doesn't depend on the
				449	specific operators involved, but it would be easy enough to eliminate the map
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	450	and do the comparisons in the <tt>GetTokPrecedence</tt> function. (Or just use
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	451	a fixed-size array).</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	452
				453	<p>With the helper above defined, we can now start parsing binary expressions.
				454	The basic idea of operator precedence parsing is to break down an expression
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	455	with potentially ambiguous binary operators into pieces. Consider ,for example,
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	456	the expression "a+b+(c+d)ef+g". Operator precedence parsing considers this
				457	as a stream of primary expressions separated by binary operators. As such,
				458	it will first parse the leading primary expression "a", then it will see the
				459	pairs [+, b] [+, (c+d)] [, e] [, f] and [+, g]. Note that because parentheses
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	460	are primary expressions, the binary expression parser doesn't need to worry
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	461	about nested subexpressions like (c+d) at all.
				462	</p>
				463
				464	<p>
				465	To start, an expression is a primary expression potentially followed by a
				466	sequence of [binop,primaryexpr] pairs:</p>
				467
				468	<div class="doc_code">
				469	<pre>
				470	/// expression
				471	/// ::= primary binoprhs
				472	///
				473	static ExprAST *ParseExpression() {
				474	ExprAST *LHS = ParsePrimary();
				475	if (!LHS) return 0;
				476
				477	return ParseBinOpRHS(0, LHS);
				478	}
				479	</pre>
				480	</div>
				481
				482	<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	483	us. It takes a precedence and a pointer to an expression for the part that has been
				484	parsed so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	485	allowed to be empty, in which case it returns the expression that is passed into
				486	it. In our example above, the code passes the expression for "a" into
				487	<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
				488
				489	<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
				490	minimal operator precedence</em> that the function is allowed to eat. For
				491	example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
				492	passed in a precedence of 40, it will not consume any tokens (because the
				493	precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
				494	with:</p>
				495
				496	<div class="doc_code">
				497	<pre>
				498	/// binoprhs
				499	/// ::= ('+' primary)*
				500	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				501	// If this is a binop, find its precedence.
				502	while (1) {
				503	int TokPrec = GetTokPrecedence();
				504
				505	// If this is a binop that binds at least as tightly as the current binop,
				506	// consume it, otherwise we are done.
				507	if (TokPrec < ExprPrec)
				508	return LHS;
				509	</pre>
				510	</div>
				511
				512	<p>This code gets the precedence of the current token and checks to see if if is
				513	too low. Because we defined invalid tokens to have a precedence of -1, this
				514	check implicitly knows that the pair-stream ends when the token stream runs out
				515	of binary operators. If this check succeeds, we know that the token is a binary
				516	operator and that it will be included in this expression:</p>
				517
				518	<div class="doc_code">
				519	<pre>
				520	// Okay, we know this is a binop.
				521	int BinOp = CurTok;
				522	getNextToken(); // eat binop
				523
				524	// Parse the primary expression after the binary operator.
				525	ExprAST *RHS = ParsePrimary();
				526	if (!RHS) return 0;
				527	</pre>
				528	</div>
				529
				530	<p>As such, this code eats (and remembers) the binary operator and then parses
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	531	the primary expression that follows. This builds up the whole pair, the first of
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	532	which is [+, b] for the running example.</p>
				533
				534	<p>Now that we parsed the left-hand side of an expression and one pair of the
				535	RHS sequence, we have to decide which way the expression associates. In
				536	particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
				537	To determine this, we look ahead at "binop" to determine its precedence and
				538	compare it to BinOp's precedence (which is '+' in this case):</p>
				539
				540	<div class="doc_code">
				541	<pre>
				542	// If BinOp binds less tightly with RHS than the operator after RHS, let
				543	// the pending operator take RHS as its LHS.
				544	int NextPrec = GetTokPrecedence();
				545	if (TokPrec < NextPrec) {
				546	</pre>
				547	</div>
				548
				549	<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
				550	precedence of our current operator, then we know that the parentheses associate
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	551	as "(a+b) binop ...". In our example, the current operator is "+" and the next
				552	operator is "+", we know that they have the same precedence. In this case we'll
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	553	create the AST node for "a+b", and then continue parsing:</p>
				554
				555	<div class="doc_code">
				556	<pre>
				557	... if body omitted ...
				558	}
				559
				560	// Merge LHS/RHS.
				561	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				562	} // loop around to the top of the while loop.
				563	}
				564	</pre>
				565	</div>
				566
				567	<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	568	iteration of the loop, with "+" as the current token. The code above will eat,
				569	remember, and parse "(c+d)" as the primary expression, which makes the
				570	current pair equal to [+, (c+d)]. It will then evaluate the 'if' conditional above with
				571	"" as the binop to the right of the primary. In this case, the precedence of "" is
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	572	higher than the precedence of "+" so the if condition will be entered.</p>
				573
				574	<p>The critical question left here is "how can the if condition parse the right
				575	hand side in full"? In particular, to build the AST correctly for our example,
				576	it needs to get all of "(c+d)ef" as the RHS expression variable. The code to
				577	do this is surprisingly simple (code from the above two blocks duplicated for
				578	context):</p>
				579
				580	<div class="doc_code">
				581	<pre>
				582	// If BinOp binds less tightly with RHS than the operator after RHS, let
				583	// the pending operator take RHS as its LHS.
				584	int NextPrec = GetTokPrecedence();
				585	if (TokPrec < NextPrec) {
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	586	<b>RHS = ParseBinOpRHS(TokPrec+1, RHS);
				587	if (RHS == 0) return 0;</b>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	588	}
				589	// Merge LHS/RHS.
				590	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				591	} // loop around to the top of the while loop.
				592	}
				593	</pre>
				594	</div>
				595
				596	<p>At this point, we know that the binary operator to the RHS of our primary
				597	has higher precedence than the binop we are currently parsing. As such, we know
				598	that any sequence of pairs whose operators are all higher precedence than "+"
				599	should be parsed together and returned as "RHS". To do this, we recursively
				600	invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
				601	precedence required for it to continue. In our example above, this will cause
				602	it to return the AST node for "(c+d)ef" as RHS, which is then set as the RHS
				603	of the '+' expression.</p>
				604
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	605	<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	606	and added to the AST. With this little bit of code (14 non-trivial lines), we
				607	correctly handle fully general binary expression parsing in a very elegant way.
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	608	This was a whirlwind tour of this code, and it is somewhat subtle. I recommend
				609	running through it with a few tough examples to see how it works.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	610	</p>
				611
				612	<p>This wraps up handling of expressions. At this point, we can point the
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	613	parser at an arbitrary token stream and build an expression from it, stopping
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	614	at the first token that is not part of the expression. Next up we need to
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	615	handle function definitions, etc.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	616
				617	</div>
				618
				619	<!-- *********************************************************************** -->
				620	<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
				621	<!-- *********************************************************************** -->
				622
				623	<div class="doc_text">
				624
				625	<p>
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	626	The next thing missing is handling of function prototypes. In Kaleidoscope,
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	627	these are used both for 'extern' function declarations as well as function body
				628	definitions. The code to do this is straight-forward and not very interesting
				629	(once you've survived expressions):
				630	</p>
				631
				632	<div class="doc_code">
				633	<pre>
				634	/// prototype
				635	/// ::= id '(' id* ')'
				636	static PrototypeAST *ParsePrototype() {
				637	if (CurTok != tok_identifier)
				638	return ErrorP("Expected function name in prototype");
				639
				640	std::string FnName = IdentifierStr;
				641	getNextToken();
				642
				643	if (CurTok != '(')
				644	return ErrorP("Expected '(' in prototype");
				645
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	646	// Read the list of argument names.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	647	std::vector<std::string> ArgNames;
				648	while (getNextToken() == tok_identifier)
				649	ArgNames.push_back(IdentifierStr);
				650	if (CurTok != ')')
				651	return ErrorP("Expected ')' in prototype");
				652
				653	// success.
				654	getNextToken(); // eat ')'.
				655
				656	return new PrototypeAST(FnName, ArgNames);
				657	}
				658	</pre>
				659	</div>
				660
				661	<p>Given this, a function definition is very simple, just a prototype plus
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	662	an expression to implement the body:</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	663
				664	<div class="doc_code">
				665	<pre>
				666	/// definition ::= 'def' prototype expression
				667	static FunctionAST *ParseDefinition() {
				668	getNextToken(); // eat def.
				669	PrototypeAST *Proto = ParsePrototype();
				670	if (Proto == 0) return 0;
				671
				672	if (ExprAST *E = ParseExpression())
				673	return new FunctionAST(Proto, E);
				674	return 0;
				675	}
				676	</pre>
				677	</div>
				678
				679	<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	680	well as to support forward declaration of user functions. These 'extern's are just
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	681	prototypes with no body:</p>
				682
				683	<div class="doc_code">
				684	<pre>
				685	/// external ::= 'extern' prototype
				686	static PrototypeAST *ParseExtern() {
				687	getNextToken(); // eat extern.
				688	return ParsePrototype();
				689	}
				690	</pre>
				691	</div>
				692
				693	<p>Finally, we'll also let the user type in arbitrary top-level expressions and
				694	evaluate them on the fly. We will handle this by defining anonymous nullary
				695	(zero argument) functions for them:</p>
				696
				697	<div class="doc_code">
				698	<pre>
				699	/// toplevelexpr ::= expression
				700	static FunctionAST *ParseTopLevelExpr() {
				701	if (ExprAST *E = ParseExpression()) {
				702	// Make an anonymous proto.
				703	PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
				704	return new FunctionAST(Proto, E);
				705	}
				706	return 0;
				707	}
				708	</pre>
				709	</div>
				710
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	711	<p>Now that we have all the pieces, let's build a little driver that will let us
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	712	actually <em>execute</em> this code we've built!</p>
				713
				714	</div>
				715
				716	<!-- *********************************************************************** -->
				717	<div class="doc_section"><a name="driver">The Driver</a></div>
				718	<!-- *********************************************************************** -->
				719
				720	<div class="doc_text">
				721
				722	<p>The driver for this simply invokes all of the parsing pieces with a top-level
				723	dispatch loop. There isn't much interesting here, so I'll just include the
				724	top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
				725	Parsing" section.</p>
				726
				727	<div class="doc_code">
				728	<pre>
				729	/// top ::= definition \| external \| expression \| ';'
				730	static void MainLoop() {
				731	while (1) {
				732	fprintf(stderr, "ready> ");
				733	switch (CurTok) {
				734	case tok_eof: return;
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	735	case ';': getNextToken(); break; // ignore top-level semicolons.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	736	case tok_def: HandleDefinition(); break;
				737	case tok_extern: HandleExtern(); break;
				738	default: HandleTopLevelExpression(); break;
				739	}
				740	}
				741	}
				742	</pre>
				743	</div>
				744
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	745	<p>The most interesting part of this is that we ignore top-level semicolons.
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	746	Why is this, you ask? The basic reason is that if you type "4 + 5" at the
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	747	command line, the parser doesn't know whether that is the end of what you will type
				748	or not. For example, on the next line you could type "def foo..." in which case
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	749	4+5 is the end of a top-level expression. Alternatively you could type "* 6",
				750	which would continue the expression. Having top-level semicolons allows you to
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	751	type "4+5;", and the parser will know you are done.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	752
				753	</div>
				754
				755	<!-- *********************************************************************** -->
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	756	<div class="doc_section"><a name="conclusions">Conclusions</a></div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	757	<!-- *********************************************************************** -->
				758
				759	<div class="doc_text">
				760
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	761	<p>With just under 400 lines of commented code (240 lines of non-comment,
				762	non-blank code), we fully defined our minimal language, including a lexer,
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	763	parser, and AST builder. With this done, the executable will validate
				764	Kaleidoscope code and tell us if it is grammatically invalid. For
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	765	example, here is a sample interaction:</p>
				766
				767	<div class="doc_code">
				768	<pre>
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	769	$ <b>./a.out</b>
				770	ready> <b>def foo(x y) x+foo(y, 4.0);</b>
				771	Parsed a function definition.
				772	ready> <b>def foo(x y) x+y y;</b>
				773	Parsed a function definition.
				774	Parsed a top-level expr
				775	ready> <b>def foo(x y) x+y );</b>
				776	Parsed a function definition.
				777	Error: unknown token when expecting an expression
				778	ready> <b>extern sin(a);</b>
Chris Lattner	35abbf5	2007-10-23 06:23:57 +0000	[diff] [blame]	779	ready> Parsed an extern
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	780	ready> <b>^D</b>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	781	$
				782	</pre>
				783	</div>
				784
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	785	<p>There is a lot of room for extension here. You can define new AST nodes,
				786	extend the language in many ways, etc. In the <a href="LangImpl3.html">next
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	787	installment</a>, we will describe how to generate LLVM Intermediate
				788	Representation (IR) from the AST.</p>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	789
				790	</div>
				791
				792	<!-- *********************************************************************** -->
				793	<div class="doc_section"><a name="code">Full Code Listing</a></div>
				794	<!-- *********************************************************************** -->
				795
				796	<div class="doc_text">
				797
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	798	<p>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	799	Here is the complete code listing for this and the previous chapter.
				800	Note that it is fully self-contained: you don't need LLVM or any external
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	801	libraries at all for this. (Besides the C and C++ standard libraries, of
				802	course.) To build this, just compile with:</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	803
				804	<div class="doc_code">
				805	<pre>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	806	# Compile
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	807	g++ -g -O3 toy.cpp
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	808	# Run
				809	./a.out
				810	</pre>
				811	</div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	812
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	813	<p>Here is the code:</p>
				814
				815	<div class="doc_code">
				816	<pre>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	817	#include <cstdio>
				818	#include <string>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	819	#include <map>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	820	#include <vector>
				821
				822	//===----------------------------------------------------------------------===//
				823	// Lexer
				824	//===----------------------------------------------------------------------===//
				825
				826	// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
				827	// of these for known things.
				828	enum Token {
				829	tok_eof = -1,
				830
				831	// commands
				832	tok_def = -2, tok_extern = -3,
				833
				834	// primary
				835	tok_identifier = -4, tok_number = -5,
				836	};
				837
				838	static std::string IdentifierStr; // Filled in if tok_identifier
				839	static double NumVal; // Filled in if tok_number
				840
				841	/// gettok - Return the next token from standard input.
				842	static int gettok() {
				843	static int LastChar = ' ';
				844
				845	// Skip any whitespace.
				846	while (isspace(LastChar))
				847	LastChar = getchar();
				848
				849	if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
				850	IdentifierStr = LastChar;
				851	while (isalnum((LastChar = getchar())))
				852	IdentifierStr += LastChar;
				853
				854	if (IdentifierStr == "def") return tok_def;
				855	if (IdentifierStr == "extern") return tok_extern;
				856	return tok_identifier;
				857	}
				858
				859	if (isdigit(LastChar) \|\| LastChar == '.') { // Number: [0-9.]+
				860	std::string NumStr;
				861	do {
				862	NumStr += LastChar;
				863	LastChar = getchar();
				864	} while (isdigit(LastChar) \|\| LastChar == '.');
				865
				866	NumVal = strtod(NumStr.c_str(), 0);
				867	return tok_number;
				868	}
				869
				870	if (LastChar == '#') {
				871	// Comment until end of line.
				872	do LastChar = getchar();
Chris Lattner	c80c23f	2007-12-02 22:46:01 +0000	[diff] [blame]	873	while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	874
				875	if (LastChar != EOF)
				876	return gettok();
				877	}
				878
				879	// Check for end of file. Don't eat the EOF.
				880	if (LastChar == EOF)
				881	return tok_eof;
				882
				883	// Otherwise, just return the character as its ascii value.
				884	int ThisChar = LastChar;
				885	LastChar = getchar();
				886	return ThisChar;
				887	}
				888
				889	//===----------------------------------------------------------------------===//
				890	// Abstract Syntax Tree (aka Parse Tree)
				891	//===----------------------------------------------------------------------===//
				892
				893	/// ExprAST - Base class for all expression nodes.
				894	class ExprAST {
				895	public:
				896	virtual ~ExprAST() {}
				897	};
				898
				899	/// NumberExprAST - Expression class for numeric literals like "1.0".
				900	class NumberExprAST : public ExprAST {
				901	double Val;
				902	public:
Chris Lattner	28571ed	2007-10-23 04:27:44 +0000	[diff] [blame]	903	explicit NumberExprAST(double val) : Val(val) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	904	};
				905
				906	/// VariableExprAST - Expression class for referencing a variable, like "a".
				907	class VariableExprAST : public ExprAST {
				908	std::string Name;
				909	public:
Chris Lattner	28571ed	2007-10-23 04:27:44 +0000	[diff] [blame]	910	explicit VariableExprAST(const std::string &name) : Name(name) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	911	};
				912
				913	/// BinaryExprAST - Expression class for a binary operator.
				914	class BinaryExprAST : public ExprAST {
				915	char Op;
				916	ExprAST LHS, RHS;
				917	public:
				918	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				919	: Op(op), LHS(lhs), RHS(rhs) {}
				920	};
				921
				922	/// CallExprAST - Expression class for function calls.
				923	class CallExprAST : public ExprAST {
				924	std::string Callee;
				925	std::vector<ExprAST*> Args;
				926	public:
				927	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				928	: Callee(callee), Args(args) {}
				929	};
				930
				931	/// PrototypeAST - This class represents the "prototype" for a function,
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	932	/// which captures its name, and its argument names (thus implicitly the number
				933	/// of arguments the function takes).
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	934	class PrototypeAST {
				935	std::string Name;
Chris Lattner	5ea8ef8	2008-02-22 17:09:39 +0000	[diff] [blame]	936	std::vector<std::string> Args;
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	937	public:
				938	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				939	: Name(name), Args(args) {}
				940
				941	};
				942
				943	/// FunctionAST - This class represents a function definition itself.
				944	class FunctionAST {
				945	PrototypeAST *Proto;
				946	ExprAST *Body;
				947	public:
				948	FunctionAST(PrototypeAST proto, ExprAST body)
				949	: Proto(proto), Body(body) {}
				950
				951	};
				952
				953	//===----------------------------------------------------------------------===//
				954	// Parser
				955	//===----------------------------------------------------------------------===//
				956
				957	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	958	/// token the parser is looking at. getNextToken reads another token from the
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	959	/// lexer and updates CurTok with its results.
				960	static int CurTok;
				961	static int getNextToken() {
				962	return CurTok = gettok();
				963	}
				964
				965	/// BinopPrecedence - This holds the precedence for each binary operator that is
				966	/// defined.
				967	static std::map<char, int> BinopPrecedence;
				968
				969	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				970	static int GetTokPrecedence() {
				971	if (!isascii(CurTok))
				972	return -1;
				973
				974	// Make sure it's a declared binop.
				975	int TokPrec = BinopPrecedence[CurTok];
				976	if (TokPrec <= 0) return -1;
				977	return TokPrec;
				978	}
				979
				980	/// Error* - These are little helper functions for error handling.
				981	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				982	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				983	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				984
				985	static ExprAST *ParseExpression();
				986
				987	/// identifierexpr
Chris Lattner	20a0c80	2007-11-05 17:54:34 +0000	[diff] [blame]	988	/// ::= identifier
				989	/// ::= identifier '(' expression* ')'
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	990	static ExprAST *ParseIdentifierExpr() {
				991	std::string IdName = IdentifierStr;
				992
Chris Lattner	20a0c80	2007-11-05 17:54:34 +0000	[diff] [blame]	993	getNextToken(); // eat identifier.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	994
				995	if (CurTok != '(') // Simple variable ref.
				996	return new VariableExprAST(IdName);
				997
				998	// Call.
				999	getNextToken(); // eat (
				1000	std::vector<ExprAST*> Args;
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	1001	if (CurTok != ')') {
				1002	while (1) {
				1003	ExprAST *Arg = ParseExpression();
				1004	if (!Arg) return 0;
				1005	Args.push_back(Arg);
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1006
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	1007	if (CurTok == ')') break;
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1008
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	1009	if (CurTok != ',')
				1010	return Error("Expected ')'");
				1011	getNextToken();
				1012	}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1013	}
				1014
				1015	// Eat the ')'.
				1016	getNextToken();
				1017
				1018	return new CallExprAST(IdName, Args);
				1019	}
				1020
				1021	/// numberexpr ::= number
				1022	static ExprAST *ParseNumberExpr() {
				1023	ExprAST *Result = new NumberExprAST(NumVal);
				1024	getNextToken(); // consume the number
				1025	return Result;
				1026	}
				1027
				1028	/// parenexpr ::= '(' expression ')'
				1029	static ExprAST *ParseParenExpr() {
				1030	getNextToken(); // eat (.
				1031	ExprAST *V = ParseExpression();
				1032	if (!V) return 0;
				1033
				1034	if (CurTok != ')')
				1035	return Error("expected ')'");
				1036	getNextToken(); // eat ).
				1037	return V;
				1038	}
				1039
				1040	/// primary
				1041	/// ::= identifierexpr
				1042	/// ::= numberexpr
				1043	/// ::= parenexpr
				1044	static ExprAST *ParsePrimary() {
				1045	switch (CurTok) {
				1046	default: return Error("unknown token when expecting an expression");
				1047	case tok_identifier: return ParseIdentifierExpr();
				1048	case tok_number: return ParseNumberExpr();
				1049	case '(': return ParseParenExpr();
				1050	}
				1051	}
				1052
				1053	/// binoprhs
				1054	/// ::= ('+' primary)*
				1055	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				1056	// If this is a binop, find its precedence.
				1057	while (1) {
				1058	int TokPrec = GetTokPrecedence();
				1059
				1060	// If this is a binop that binds at least as tightly as the current binop,
				1061	// consume it, otherwise we are done.
				1062	if (TokPrec < ExprPrec)
				1063	return LHS;
				1064
				1065	// Okay, we know this is a binop.
				1066	int BinOp = CurTok;
				1067	getNextToken(); // eat binop
				1068
				1069	// Parse the primary expression after the binary operator.
				1070	ExprAST *RHS = ParsePrimary();
				1071	if (!RHS) return 0;
				1072
				1073	// If BinOp binds less tightly with RHS than the operator after RHS, let
				1074	// the pending operator take RHS as its LHS.
				1075	int NextPrec = GetTokPrecedence();
				1076	if (TokPrec < NextPrec) {
				1077	RHS = ParseBinOpRHS(TokPrec+1, RHS);
				1078	if (RHS == 0) return 0;
				1079	}
				1080
				1081	// Merge LHS/RHS.
				1082	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				1083	}
				1084	}
				1085
				1086	/// expression
				1087	/// ::= primary binoprhs
				1088	///
				1089	static ExprAST *ParseExpression() {
				1090	ExprAST *LHS = ParsePrimary();
				1091	if (!LHS) return 0;
				1092
				1093	return ParseBinOpRHS(0, LHS);
				1094	}
				1095
				1096	/// prototype
				1097	/// ::= id '(' id* ')'
				1098	static PrototypeAST *ParsePrototype() {
				1099	if (CurTok != tok_identifier)
				1100	return ErrorP("Expected function name in prototype");
				1101
				1102	std::string FnName = IdentifierStr;
				1103	getNextToken();
				1104
				1105	if (CurTok != '(')
				1106	return ErrorP("Expected '(' in prototype");
				1107
				1108	std::vector<std::string> ArgNames;
				1109	while (getNextToken() == tok_identifier)
				1110	ArgNames.push_back(IdentifierStr);
				1111	if (CurTok != ')')
				1112	return ErrorP("Expected ')' in prototype");
				1113
				1114	// success.
				1115	getNextToken(); // eat ')'.
				1116
				1117	return new PrototypeAST(FnName, ArgNames);
				1118	}
				1119
				1120	/// definition ::= 'def' prototype expression
				1121	static FunctionAST *ParseDefinition() {
				1122	getNextToken(); // eat def.
				1123	PrototypeAST *Proto = ParsePrototype();
				1124	if (Proto == 0) return 0;
				1125
				1126	if (ExprAST *E = ParseExpression())
				1127	return new FunctionAST(Proto, E);
				1128	return 0;
				1129	}
				1130
				1131	/// toplevelexpr ::= expression
				1132	static FunctionAST *ParseTopLevelExpr() {
				1133	if (ExprAST *E = ParseExpression()) {
				1134	// Make an anonymous proto.
Chris Lattner	5ea8ef8	2008-02-22 17:09:39 +0000	[diff] [blame]	1135	PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1136	return new FunctionAST(Proto, E);
				1137	}
				1138	return 0;
				1139	}
				1140
				1141	/// external ::= 'extern' prototype
				1142	static PrototypeAST *ParseExtern() {
				1143	getNextToken(); // eat extern.
				1144	return ParsePrototype();
				1145	}
				1146
				1147	//===----------------------------------------------------------------------===//
				1148	// Top-Level parsing
				1149	//===----------------------------------------------------------------------===//
				1150
				1151	static void HandleDefinition() {
				1152	if (FunctionAST *F = ParseDefinition()) {
				1153	fprintf(stderr, "Parsed a function definition.\n");
				1154	} else {
				1155	// Skip token for error recovery.
				1156	getNextToken();
				1157	}
				1158	}
				1159
				1160	static void HandleExtern() {
				1161	if (PrototypeAST *P = ParseExtern()) {
				1162	fprintf(stderr, "Parsed an extern\n");
				1163	} else {
				1164	// Skip token for error recovery.
				1165	getNextToken();
				1166	}
				1167	}
				1168
				1169	static void HandleTopLevelExpression() {
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	1170	// Evaluate a top-level expression into an anonymous function.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1171	if (FunctionAST *F = ParseTopLevelExpr()) {
				1172	fprintf(stderr, "Parsed a top-level expr\n");
				1173	} else {
				1174	// Skip token for error recovery.
				1175	getNextToken();
				1176	}
				1177	}
				1178
				1179	/// top ::= definition \| external \| expression \| ';'
				1180	static void MainLoop() {
				1181	while (1) {
				1182	fprintf(stderr, "ready> ");
				1183	switch (CurTok) {
				1184	case tok_eof: return;
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	1185	case ';': getNextToken(); break; // ignore top-level semicolons.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1186	case tok_def: HandleDefinition(); break;
				1187	case tok_extern: HandleExtern(); break;
				1188	default: HandleTopLevelExpression(); break;
				1189	}
				1190	}
				1191	}
				1192
				1193	//===----------------------------------------------------------------------===//
				1194	// Main driver code.
				1195	//===----------------------------------------------------------------------===//
				1196
				1197	int main() {
				1198	// Install standard binary operators.
				1199	// 1 is lowest precedence.
				1200	BinopPrecedence['<'] = 10;
				1201	BinopPrecedence['+'] = 20;
				1202	BinopPrecedence['-'] = 20;
				1203	BinopPrecedence['*'] = 40; // highest.
				1204
				1205	// Prime the first token.
				1206	fprintf(stderr, "ready> ");
				1207	getNextToken();
				1208
				1209	MainLoop();
				1210	return 0;
				1211	}
				1212	</pre>
				1213	</div>
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	1214	<a href="LangImpl3.html">Next: Implementing Code Generation to LLVM IR</a>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1215	</div>
				1216
				1217	<!-- *********************************************************************** -->
				1218	<hr>
				1219	<address>
				1220	<a href="http://jigsaw.w3.org/css-validator/check/referer"><img
				1221	src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
				1222	<a href="http://validator.w3.org/check/referer"><img
Chris Lattner	8eef4b2	2007-10-23 06:30:50 +0000	[diff] [blame]	1223	src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1224
				1225	<a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
				1226	<a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
				1227	Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
				1228	</address>
				1229	</body>
				1230	</html>