Blame - llvm/docs/tutorial/LangImpl2.rst - toolchain/llvm-project

blob: 4b28e02a523e96b925e20999671454ceee61df46 [file] [log] [blame]

Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	1	===========================================
				2	Kaleidoscope: Implementing a Parser and AST
				3	===========================================
				4
				5	.. contents::
				6	:local:
				7
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	8	Chapter 2 Introduction
				9	======================
				10
				11	Welcome to Chapter 2 of the "`Implementing a language with
				12	LLVM <index.html>`_" tutorial. This chapter shows you how to use the
				13	lexer, built in `Chapter 1 <LangImpl1.html>`_, to build a full
				14	`parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
				15	language. Once we have a parser, we'll define and build an `Abstract
				16	Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
				17
				18	The parser we will build uses a combination of `Recursive Descent
				19	Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
				20	`Operator-Precedence
				21	Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
				22	parse the Kaleidoscope language (the latter for binary expressions and
				23	the former for everything else). Before we get to parsing though, lets
				24	talk about the output of the parser: the Abstract Syntax Tree.
				25
				26	The Abstract Syntax Tree (AST)
				27	==============================
				28
				29	The AST for a program captures its behavior in such a way that it is
				30	easy for later stages of the compiler (e.g. code generation) to
				31	interpret. We basically want one object for each construct in the
				32	language, and the AST should closely model the language. In
				33	Kaleidoscope, we have expressions, a prototype, and a function object.
				34	We'll start with expressions first:
				35
				36	.. code-block:: c++
				37
				38	/// ExprAST - Base class for all expression nodes.
				39	class ExprAST {
				40	public:
				41	virtual ~ExprAST() {}
				42	};
				43
				44	/// NumberExprAST - Expression class for numeric literals like "1.0".
				45	class NumberExprAST : public ExprAST {
				46	double Val;
				47	public:
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	48	NumberExprAST(double Val) : Val(Val) {}
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	49	};
				50
				51	The code above shows the definition of the base ExprAST class and one
				52	subclass which we use for numeric literals. The important thing to note
				53	about this code is that the NumberExprAST class captures the numeric
				54	value of the literal as an instance variable. This allows later phases
				55	of the compiler to know what the stored numeric value is.
				56
				57	Right now we only create the AST, so there are no useful accessor
				58	methods on them. It would be very easy to add a virtual method to pretty
				59	print the code, for example. Here are the other expression AST node
				60	definitions that we'll use in the basic form of the Kaleidoscope
				61	language:
				62
				63	.. code-block:: c++
				64
				65	/// VariableExprAST - Expression class for referencing a variable, like "a".
				66	class VariableExprAST : public ExprAST {
				67	std::string Name;
				68	public:
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	69	VariableExprAST(const std::string &Name) : Name(Name) {}
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	70	};
				71
				72	/// BinaryExprAST - Expression class for a binary operator.
				73	class BinaryExprAST : public ExprAST {
				74	char Op;
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	75	std::unique_ptr<ExprAST> LHS, RHS;
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	76	public:
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	77	BinaryExprAST(char op, std::unique_ptr<ExprAST> LHS,
				78	std::unique_ptr<ExprAST> RHS)
				79	: Op(op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	80	};
				81
				82	/// CallExprAST - Expression class for function calls.
				83	class CallExprAST : public ExprAST {
				84	std::string Callee;
				85	std::vector<ExprAST*> Args;
				86	public:
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	87	CallExprAST(const std::string &Callee,
				88	std::vector<std::unique_ptr<ExprAST>> Args)
				89	: Callee(Callee), Args(std::move(Args)) {}
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	90	};
				91
				92	This is all (intentionally) rather straight-forward: variables capture
				93	the variable name, binary operators capture their opcode (e.g. '+'), and
				94	calls capture a function name as well as a list of any argument
				95	expressions. One thing that is nice about our AST is that it captures
				96	the language features without talking about the syntax of the language.
				97	Note that there is no discussion about precedence of binary operators,
				98	lexical structure, etc.
				99
				100	For our basic language, these are all of the expression nodes we'll
				101	define. Because it doesn't have conditional control flow, it isn't
				102	Turing-complete; we'll fix that in a later installment. The two things
				103	we need next are a way to talk about the interface to a function, and a
				104	way to talk about functions themselves:
				105
				106	.. code-block:: c++
				107
				108	/// PrototypeAST - This class represents the "prototype" for a function,
				109	/// which captures its name, and its argument names (thus implicitly the number
				110	/// of arguments the function takes).
				111	class PrototypeAST {
				112	std::string Name;
				113	std::vector<std::string> Args;
				114	public:
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	115	PrototypeAST(const std::string &name, std::vector<std::string> Args)
				116	: Name(name), Args(std::move(Args)) {}
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	117	};
				118
				119	/// FunctionAST - This class represents a function definition itself.
				120	class FunctionAST {
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	121	std::unique_ptr<PrototypeAST> Proto;
				122	std::unique_ptr<ExprAST> Body;
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	123	public:
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	124	FunctionAST(std::unique_ptr<PrototypeAST> Proto,
				125	std::unique_ptr<ExprAST> Body)
				126	: Proto(std::move(Proto)), Body(std::move(Body)) {}
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	127	};
				128
				129	In Kaleidoscope, functions are typed with just a count of their
				130	arguments. Since all values are double precision floating point, the
				131	type of each argument doesn't need to be stored anywhere. In a more
				132	aggressive and realistic language, the "ExprAST" class would probably
				133	have a type field.
				134
				135	With this scaffolding, we can now talk about parsing expressions and
				136	function bodies in Kaleidoscope.
				137
				138	Parser Basics
				139	=============
				140
				141	Now that we have an AST to build, we need to define the parser code to
				142	build it. The idea here is that we want to parse something like "x+y"
				143	(which is returned as three tokens by the lexer) into an AST that could
				144	be generated with calls like this:
				145
				146	.. code-block:: c++
				147
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	148	auto LHS = llvm::make_unique<VariableExprAST>("x");
				149	auto RHS = llvm::make_unique<VariableExprAST>("y");
				150	auto Result = std::make_unique<BinaryExprAST>('+', std::move(LHS),
				151	std::move(RHS));
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	152
				153	In order to do this, we'll start by defining some basic helper routines:
				154
				155	.. code-block:: c++
				156
				157	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
				158	/// token the parser is looking at. getNextToken reads another token from the
				159	/// lexer and updates CurTok with its results.
				160	static int CurTok;
				161	static int getNextToken() {
				162	return CurTok = gettok();
				163	}
				164
				165	This implements a simple token buffer around the lexer. This allows us
				166	to look one token ahead at what the lexer is returning. Every function
				167	in our parser will assume that CurTok is the current token that needs to
				168	be parsed.
				169
				170	.. code-block:: c++
				171
				172
				173	/// Error* - These are little helper functions for error handling.
				174	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				175	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				176	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				177
				178	The ``Error`` routines are simple helper routines that our parser will
				179	use to handle errors. The error recovery in our parser will not be the
				180	best and is not particular user-friendly, but it will be enough for our
				181	tutorial. These routines make it easier to handle errors in routines
				182	that have various return types: they always return null.
				183
				184	With these basic helper functions, we can implement the first piece of
				185	our grammar: numeric literals.
				186
				187	Basic Expression Parsing
				188	========================
				189
				190	We start with numeric literals, because they are the simplest to
				191	process. For each production in our grammar, we'll define a function
				192	which parses that production. For numeric literals, we have:
				193
				194	.. code-block:: c++
				195
				196	/// numberexpr ::= number
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	197	static std::unique_ptr<ExprAST> ParseNumberExpr() {
				198	auto Result = llvm::make_unique<NumberExprAST>(NumVal);
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	199	getNextToken(); // consume the number
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	200	return std::move(Result);
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	201	}
				202
				203	This routine is very simple: it expects to be called when the current
				204	token is a ``tok_number`` token. It takes the current number value,
				205	creates a ``NumberExprAST`` node, advances the lexer to the next token,
				206	and finally returns.
				207
				208	There are some interesting aspects to this. The most important one is
				209	that this routine eats all of the tokens that correspond to the
				210	production and returns the lexer buffer with the next token (which is
				211	not part of the grammar production) ready to go. This is a fairly
				212	standard way to go for recursive descent parsers. For a better example,
				213	the parenthesis operator is defined like this:
				214
				215	.. code-block:: c++
				216
				217	/// parenexpr ::= '(' expression ')'
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	218	static std::unique_ptr<ExprAST> ParseParenExpr() {
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	219	getNextToken(); // eat (.
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	220	auto V = ParseExpression();
				221	if (!V) return nullptr;
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	222
				223	if (CurTok != ')')
				224	return Error("expected ')'");
				225	getNextToken(); // eat ).
				226	return V;
				227	}
				228
				229	This function illustrates a number of interesting things about the
				230	parser:
				231
				232	1) It shows how we use the Error routines. When called, this function
				233	expects that the current token is a '(' token, but after parsing the
				234	subexpression, it is possible that there is no ')' waiting. For example,
				235	if the user types in "(4 x" instead of "(4)", the parser should emit an
				236	error. Because errors can occur, the parser needs a way to indicate that
				237	they happened: in our parser, we return null on an error.
				238
				239	2) Another interesting aspect of this function is that it uses recursion
				240	by calling ``ParseExpression`` (we will soon see that
				241	``ParseExpression`` can call ``ParseParenExpr``). This is powerful
				242	because it allows us to handle recursive grammars, and keeps each
				243	production very simple. Note that parentheses do not cause construction
				244	of AST nodes themselves. While we could do it this way, the most
				245	important role of parentheses are to guide the parser and provide
				246	grouping. Once the parser constructs the AST, parentheses are not
				247	needed.
				248
				249	The next simple production is for handling variable references and
				250	function calls:
				251
				252	.. code-block:: c++
				253
				254	/// identifierexpr
				255	/// ::= identifier
				256	/// ::= identifier '(' expression* ')'
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	257	static std::unique_ptr<ExprAST> ParseIdentifierExpr() {
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	258	std::string IdName = IdentifierStr;
				259
				260	getNextToken(); // eat identifier.
				261
				262	if (CurTok != '(') // Simple variable ref.
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	263	return llvm::make_unique<VariableExprAST>(IdName);
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	264
				265	// Call.
				266	getNextToken(); // eat (
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	267	std::vector<std::unique_ptr<ExprAST>> Args;
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	268	if (CurTok != ')') {
				269	while (1) {
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	270	auto Arg = ParseExpression();
				271	if (!Arg) return nullptr;
				272	Args.push_back(std::move(Arg));
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	273
				274	if (CurTok == ')') break;
				275
				276	if (CurTok != ',')
				277	return Error("Expected ')' or ',' in argument list");
				278	getNextToken();
				279	}
				280	}
				281
				282	// Eat the ')'.
				283	getNextToken();
				284
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	285	return llvm::make_unique<CallExprAST>(IdName, std::move(Args));
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	286	}
				287
				288	This routine follows the same style as the other routines. (It expects
				289	to be called if the current token is a ``tok_identifier`` token). It
				290	also has recursion and error handling. One interesting aspect of this is
				291	that it uses look-ahead to determine if the current identifier is a
				292	stand alone variable reference or if it is a function call expression.
				293	It handles this by checking to see if the token after the identifier is
				294	a '(' token, constructing either a ``VariableExprAST`` or
				295	``CallExprAST`` node as appropriate.
				296
				297	Now that we have all of our simple expression-parsing logic in place, we
				298	can define a helper function to wrap it together into one entry point.
				299	We call this class of expressions "primary" expressions, for reasons
				300	that will become more clear `later in the
				301	tutorial <LangImpl6.html#unary>`_. In order to parse an arbitrary
				302	primary expression, we need to determine what sort of expression it is:
				303
				304	.. code-block:: c++
				305
				306	/// primary
				307	/// ::= identifierexpr
				308	/// ::= numberexpr
				309	/// ::= parenexpr
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	310	static std::unique_ptr<ExprAST> ParsePrimary() {
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	311	switch (CurTok) {
				312	default: return Error("unknown token when expecting an expression");
				313	case tok_identifier: return ParseIdentifierExpr();
				314	case tok_number: return ParseNumberExpr();
				315	case '(': return ParseParenExpr();
				316	}
				317	}
				318
				319	Now that you see the definition of this function, it is more obvious why
				320	we can assume the state of CurTok in the various functions. This uses
				321	look-ahead to determine which sort of expression is being inspected, and
				322	then parses it with a function call.
				323
				324	Now that basic expressions are handled, we need to handle binary
				325	expressions. They are a bit more complex.
				326
				327	Binary Expression Parsing
				328	=========================
				329
				330	Binary expressions are significantly harder to parse because they are
				331	often ambiguous. For example, when given the string "x+y\*z", the parser
				332	can choose to parse it as either "(x+y)\z" or "x+(y\z)". With common
				333	definitions from mathematics, we expect the later parse, because "\*"
				334	(multiplication) has higher precedence than "+" (addition).
				335
				336	There are many ways to handle this, but an elegant and efficient way is
				337	to use `Operator-Precedence
				338	Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
				339	This parsing technique uses the precedence of binary operators to guide
				340	recursion. To start with, we need a table of precedences:
				341
				342	.. code-block:: c++
				343
				344	/// BinopPrecedence - This holds the precedence for each binary operator that is
				345	/// defined.
				346	static std::map<char, int> BinopPrecedence;
				347
				348	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				349	static int GetTokPrecedence() {
				350	if (!isascii(CurTok))
				351	return -1;
				352
				353	// Make sure it's a declared binop.
				354	int TokPrec = BinopPrecedence[CurTok];
				355	if (TokPrec <= 0) return -1;
				356	return TokPrec;
				357	}
				358
				359	int main() {
				360	// Install standard binary operators.
				361	// 1 is lowest precedence.
				362	BinopPrecedence['<'] = 10;
				363	BinopPrecedence['+'] = 20;
				364	BinopPrecedence['-'] = 20;
				365	BinopPrecedence['*'] = 40; // highest.
				366	...
				367	}
				368
				369	For the basic form of Kaleidoscope, we will only support 4 binary
				370	operators (this can obviously be extended by you, our brave and intrepid
				371	reader). The ``GetTokPrecedence`` function returns the precedence for
				372	the current token, or -1 if the token is not a binary operator. Having a
				373	map makes it easy to add new operators and makes it clear that the
				374	algorithm doesn't depend on the specific operators involved, but it
				375	would be easy enough to eliminate the map and do the comparisons in the
				376	``GetTokPrecedence`` function. (Or just use a fixed-size array).
				377
				378	With the helper above defined, we can now start parsing binary
				379	expressions. The basic idea of operator precedence parsing is to break
				380	down an expression with potentially ambiguous binary operators into
				381	pieces. Consider ,for example, the expression "a+b+(c+d)\e\f+g".
				382	Operator precedence parsing considers this as a stream of primary
				383	expressions separated by binary operators. As such, it will first parse
				384	the leading primary expression "a", then it will see the pairs [+, b]
				385	[+, (c+d)] [\, e] [\, f] and [+, g]. Note that because parentheses are
				386	primary expressions, the binary expression parser doesn't need to worry
				387	about nested subexpressions like (c+d) at all.
				388
				389	To start, an expression is a primary expression potentially followed by
				390	a sequence of [binop,primaryexpr] pairs:
				391
				392	.. code-block:: c++
				393
				394	/// expression
				395	/// ::= primary binoprhs
				396	///
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	397	static std::unique_ptr<ExprAST> ParseExpression() {
				398	auto LHS = ParsePrimary();
				399	if (!LHS) return nullptr;
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	400
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	401	return ParseBinOpRHS(0, std::move(LHS));
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	402	}
				403
				404	``ParseBinOpRHS`` is the function that parses the sequence of pairs for
				405	us. It takes a precedence and a pointer to an expression for the part
				406	that has been parsed so far. Note that "x" is a perfectly valid
				407	expression: As such, "binoprhs" is allowed to be empty, in which case it
				408	returns the expression that is passed into it. In our example above, the
				409	code passes the expression for "a" into ``ParseBinOpRHS`` and the
				410	current token is "+".
				411
				412	The precedence value passed into ``ParseBinOpRHS`` indicates the
				413	minimal operator precedence that the function is allowed to eat. For
				414	example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
				415	passed in a precedence of 40, it will not consume any tokens (because
				416	the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
				417	starts with:
				418
				419	.. code-block:: c++
				420
				421	/// binoprhs
				422	/// ::= ('+' primary)*
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	423	static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
				424	std::unique_ptr<ExprAST> LHS) {
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	425	// If this is a binop, find its precedence.
				426	while (1) {
				427	int TokPrec = GetTokPrecedence();
				428
				429	// If this is a binop that binds at least as tightly as the current binop,
				430	// consume it, otherwise we are done.
				431	if (TokPrec < ExprPrec)
				432	return LHS;
				433
				434	This code gets the precedence of the current token and checks to see if
				435	if is too low. Because we defined invalid tokens to have a precedence of
				436	-1, this check implicitly knows that the pair-stream ends when the token
				437	stream runs out of binary operators. If this check succeeds, we know
				438	that the token is a binary operator and that it will be included in this
				439	expression:
				440
				441	.. code-block:: c++
				442
				443	// Okay, we know this is a binop.
				444	int BinOp = CurTok;
				445	getNextToken(); // eat binop
				446
				447	// Parse the primary expression after the binary operator.
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	448	auto RHS = ParsePrimary();
				449	if (!RHS) return nullptr;
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	450
				451	As such, this code eats (and remembers) the binary operator and then
				452	parses the primary expression that follows. This builds up the whole
				453	pair, the first of which is [+, b] for the running example.
				454
				455	Now that we parsed the left-hand side of an expression and one pair of
				456	the RHS sequence, we have to decide which way the expression associates.
				457	In particular, we could have "(a+b) binop unparsed" or "a + (b binop
				458	unparsed)". To determine this, we look ahead at "binop" to determine its
				459	precedence and compare it to BinOp's precedence (which is '+' in this
				460	case):
				461
				462	.. code-block:: c++
				463
				464	// If BinOp binds less tightly with RHS than the operator after RHS, let
				465	// the pending operator take RHS as its LHS.
				466	int NextPrec = GetTokPrecedence();
				467	if (TokPrec < NextPrec) {
				468
				469	If the precedence of the binop to the right of "RHS" is lower or equal
				470	to the precedence of our current operator, then we know that the
				471	parentheses associate as "(a+b) binop ...". In our example, the current
				472	operator is "+" and the next operator is "+", we know that they have the
				473	same precedence. In this case we'll create the AST node for "a+b", and
				474	then continue parsing:
				475
				476	.. code-block:: c++
				477
				478	... if body omitted ...
				479	}
				480
				481	// Merge LHS/RHS.
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	482	LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
				483	std::move(RHS));
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	484	} // loop around to the top of the while loop.
				485	}
				486
				487	In our example above, this will turn "a+b+" into "(a+b)" and execute the
				488	next iteration of the loop, with "+" as the current token. The code
				489	above will eat, remember, and parse "(c+d)" as the primary expression,
				490	which makes the current pair equal to [+, (c+d)]. It will then evaluate
				491	the 'if' conditional above with "\*" as the binop to the right of the
				492	primary. In this case, the precedence of "\*" is higher than the
				493	precedence of "+" so the if condition will be entered.
				494
				495	The critical question left here is "how can the if condition parse the
				496	right hand side in full"? In particular, to build the AST correctly for
				497	our example, it needs to get all of "(c+d)\e\f" as the RHS expression
				498	variable. The code to do this is surprisingly simple (code from the
				499	above two blocks duplicated for context):
				500
				501	.. code-block:: c++
				502
				503	// If BinOp binds less tightly with RHS than the operator after RHS, let
				504	// the pending operator take RHS as its LHS.
				505	int NextPrec = GetTokPrecedence();
				506	if (TokPrec < NextPrec) {
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	507	RHS = ParseBinOpRHS(TokPrec+1, std::move(RHS));
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	508	if (RHS == 0) return 0;
				509	}
				510	// Merge LHS/RHS.
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	511	LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
				512	std::move(RHS));
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	513	} // loop around to the top of the while loop.
				514	}
				515
				516	At this point, we know that the binary operator to the RHS of our
				517	primary has higher precedence than the binop we are currently parsing.
				518	As such, we know that any sequence of pairs whose operators are all
				519	higher precedence than "+" should be parsed together and returned as
				520	"RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
				521	specifying "TokPrec+1" as the minimum precedence required for it to
				522	continue. In our example above, this will cause it to return the AST
				523	node for "(c+d)\e\f" as RHS, which is then set as the RHS of the '+'
				524	expression.
				525
				526	Finally, on the next iteration of the while loop, the "+g" piece is
				527	parsed and added to the AST. With this little bit of code (14
				528	non-trivial lines), we correctly handle fully general binary expression
				529	parsing in a very elegant way. This was a whirlwind tour of this code,
				530	and it is somewhat subtle. I recommend running through it with a few
				531	tough examples to see how it works.
				532
				533	This wraps up handling of expressions. At this point, we can point the
				534	parser at an arbitrary token stream and build an expression from it,
				535	stopping at the first token that is not part of the expression. Next up
				536	we need to handle function definitions, etc.
				537
				538	Parsing the Rest
				539	================
				540
				541	The next thing missing is handling of function prototypes. In
				542	Kaleidoscope, these are used both for 'extern' function declarations as
				543	well as function body definitions. The code to do this is
				544	straight-forward and not very interesting (once you've survived
				545	expressions):
				546
				547	.. code-block:: c++
				548
				549	/// prototype
				550	/// ::= id '(' id* ')'
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	551	static std::unique_ptr<PrototypeAST> ParsePrototype() {
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	552	if (CurTok != tok_identifier)
				553	return ErrorP("Expected function name in prototype");
				554
				555	std::string FnName = IdentifierStr;
				556	getNextToken();
				557
				558	if (CurTok != '(')
				559	return ErrorP("Expected '(' in prototype");
				560
				561	// Read the list of argument names.
				562	std::vector<std::string> ArgNames;
				563	while (getNextToken() == tok_identifier)
				564	ArgNames.push_back(IdentifierStr);
				565	if (CurTok != ')')
				566	return ErrorP("Expected ')' in prototype");
				567
				568	// success.
				569	getNextToken(); // eat ')'.
				570
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	571	return llvm::make_unique<PrototypeAST>(FnName, std::move(ArgNames));
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	572	}
				573
				574	Given this, a function definition is very simple, just a prototype plus
				575	an expression to implement the body:
				576
				577	.. code-block:: c++
				578
				579	/// definition ::= 'def' prototype expression
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	580	static std::unique_ptr<FunctionAST> ParseDefinition() {
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	581	getNextToken(); // eat def.
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	582	auto Proto = ParsePrototype();
				583	if (!Proto) return nullptr;
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	584
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	585	if (auto E = ParseExpression())
				586	return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
				587	return nullptr;
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	588	}
				589
				590	In addition, we support 'extern' to declare functions like 'sin' and
				591	'cos' as well as to support forward declaration of user functions. These
				592	'extern's are just prototypes with no body:
				593
				594	.. code-block:: c++
				595
				596	/// external ::= 'extern' prototype
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	597	static std::unique_ptr<PrototypeAST> ParseExtern() {
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	598	getNextToken(); // eat extern.
				599	return ParsePrototype();
				600	}
				601
				602	Finally, we'll also let the user type in arbitrary top-level expressions
				603	and evaluate them on the fly. We will handle this by defining anonymous
				604	nullary (zero argument) functions for them:
				605
				606	.. code-block:: c++
				607
				608	/// toplevelexpr ::= expression
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	609	static std::unique_ptr<FunctionAST> ParseTopLevelExpr() {
				610	if (auto E = ParseExpression()) {
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	611	// Make an anonymous proto.
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	612	auto Proto = llvm::make_unique<PrototypeAST>("", std::vector<std::string>());
				613	return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	614	}
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	615	return nullptr;
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	616	}
				617
				618	Now that we have all the pieces, let's build a little driver that will
				619	let us actually execute this code we've built!
				620
				621	The Driver
				622	==========
				623
				624	The driver for this simply invokes all of the parsing pieces with a
				625	top-level dispatch loop. There isn't much interesting here, so I'll just
				626	include the top-level loop. See `below <#code>`_ for full code in the
				627	"Top-Level Parsing" section.
				628
				629	.. code-block:: c++
				630
				631	/// top ::= definition \| external \| expression \| ';'
				632	static void MainLoop() {
				633	while (1) {
				634	fprintf(stderr, "ready> ");
				635	switch (CurTok) {
				636	case tok_eof: return;
				637	case ';': getNextToken(); break; // ignore top-level semicolons.
				638	case tok_def: HandleDefinition(); break;
				639	case tok_extern: HandleExtern(); break;
				640	default: HandleTopLevelExpression(); break;
				641	}
				642	}
				643	}
				644
				645	The most interesting part of this is that we ignore top-level
				646	semicolons. Why is this, you ask? The basic reason is that if you type
				647	"4 + 5" at the command line, the parser doesn't know whether that is the
				648	end of what you will type or not. For example, on the next line you
				649	could type "def foo..." in which case 4+5 is the end of a top-level
				650	expression. Alternatively you could type "\* 6", which would continue
				651	the expression. Having top-level semicolons allows you to type "4+5;",
				652	and the parser will know you are done.
				653
				654	Conclusions
				655	===========
				656
				657	With just under 400 lines of commented code (240 lines of non-comment,
				658	non-blank code), we fully defined our minimal language, including a
				659	lexer, parser, and AST builder. With this done, the executable will
				660	validate Kaleidoscope code and tell us if it is grammatically invalid.
				661	For example, here is a sample interaction:
				662
				663	.. code-block:: bash
				664
				665	$ ./a.out
				666	ready> def foo(x y) x+foo(y, 4.0);
				667	Parsed a function definition.
				668	ready> def foo(x y) x+y y;
				669	Parsed a function definition.
				670	Parsed a top-level expr
				671	ready> def foo(x y) x+y );
				672	Parsed a function definition.
				673	Error: unknown token when expecting an expression
				674	ready> extern sin(a);
				675	ready> Parsed an extern
				676	ready> ^D
				677	$
				678
				679	There is a lot of room for extension here. You can define new AST nodes,
				680	extend the language in many ways, etc. In the `next
				681	installment <LangImpl3.html>`_, we will describe how to generate LLVM
				682	Intermediate Representation (IR) from the AST.
				683
				684	Full Code Listing
				685	=================
				686
				687	Here is the complete code listing for this and the previous chapter.
				688	Note that it is fully self-contained: you don't need LLVM or any
				689	external libraries at all for this. (Besides the C and C++ standard
				690	libraries, of course.) To build this, just compile with:
				691
				692	.. code-block:: bash
				693
				694	# Compile
				695	clang++ -g -O3 toy.cpp
				696	# Run
				697	./a.out
				698
				699	Here is the code:
				700
Logan Chien	855b17d	2013-06-08 09:03:03 +0000	[diff] [blame]	701	.. literalinclude:: ../../examples/Kaleidoscope/Chapter2/toy.cpp
				702	:language: c++
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	703
				704	`Next: Implementing Code Generation to LLVM IR <LangImpl3.html>`_
				705