===========================================
Kaleidoscope: Implementing a Parser and AST
===========================================

.. contents::
   :local:

Chapter 2 Introduction
======================

Welcome to Chapter 2 of the "`Implementing a language with
LLVM <index.html>`_" tutorial. This chapter shows you how to use the
lexer, built in `Chapter 1 <LangImpl1.html>`_, to build a full
`parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
language. Once we have a parser, we'll define and build an `Abstract
Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).

The parser we will build uses a combination of `Recursive Descent
Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
`Operator-Precedence
Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
parse the Kaleidoscope language (the latter for binary expressions and
the former for everything else). Before we get to parsing though, let's
talk about the output of the parser: the Abstract Syntax Tree.

The Abstract Syntax Tree (AST)
==============================

The AST for a program captures its behavior in such a way that it is
easy for later stages of the compiler (e.g. code generation) to
interpret. We basically want one object for each construct in the
language, and the AST should closely model the language. In
Kaleidoscope, we have expressions, a prototype, and a function object.
We'll start with expressions first:

.. code-block:: c++

    /// ExprAST - Base class for all expression nodes.
    class ExprAST {
    public:
      virtual ~ExprAST() {}
    };

    /// NumberExprAST - Expression class for numeric literals like "1.0".
    class NumberExprAST : public ExprAST {
      double Val;

    public:
      NumberExprAST(double Val) : Val(Val) {}
    };

The code above shows the definition of the base ExprAST class and one
subclass which we use for numeric literals. The important thing to note
about this code is that the NumberExprAST class captures the numeric
value of the literal as an instance variable. This allows later phases
of the compiler to know what the stored numeric value is.

Right now we only create the AST, so the nodes have no useful accessor
methods. It would be very easy to add a virtual method to pretty
print the code, for example. Here are the other expression AST node
definitions that we'll use in the basic form of the Kaleidoscope
language:

.. code-block:: c++

    /// VariableExprAST - Expression class for referencing a variable, like "a".
    class VariableExprAST : public ExprAST {
      std::string Name;

    public:
      VariableExprAST(const std::string &Name) : Name(Name) {}
    };

    /// BinaryExprAST - Expression class for a binary operator.
    class BinaryExprAST : public ExprAST {
      char Op;
      std::unique_ptr<ExprAST> LHS, RHS;

    public:
      BinaryExprAST(char op, std::unique_ptr<ExprAST> LHS,
                    std::unique_ptr<ExprAST> RHS)
        : Op(op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
    };

    /// CallExprAST - Expression class for function calls.
    class CallExprAST : public ExprAST {
      std::string Callee;
      std::vector<std::unique_ptr<ExprAST>> Args;

    public:
      CallExprAST(const std::string &Callee,
                  std::vector<std::unique_ptr<ExprAST>> Args)
        : Callee(Callee), Args(std::move(Args)) {}
    };

This is all (intentionally) rather straightforward: variables capture
the variable name, binary operators capture their opcode (e.g. '+'), and
calls capture a function name as well as a list of any argument
expressions. One thing that is nice about our AST is that it captures
the language features without talking about the syntax of the language.
Note that there is no discussion about precedence of binary operators,
lexical structure, etc.

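The virtual pretty-printing method mentioned earlier could look roughly
like the sketch below. This is only an illustration and not part of the
tutorial's code: the ``print`` method and the ``toString`` helper are
hypothetical names, the node classes are redeclared so the example
stands alone, and ``CallExprAST`` is left out for brevity.

.. code-block:: c++

    #include <memory>
    #include <sstream>
    #include <string>
    #include <utility>

    // Hypothetical variant of the AST with a virtual print method; the
    // tutorial's own classes have no such method.
    class ExprAST {
    public:
      virtual ~ExprAST() {}
      virtual void print(std::ostream &OS) const = 0;
    };

    class NumberExprAST : public ExprAST {
      double Val;

    public:
      NumberExprAST(double Val) : Val(Val) {}
      void print(std::ostream &OS) const override { OS << Val; }
    };

    class VariableExprAST : public ExprAST {
      std::string Name;

    public:
      VariableExprAST(const std::string &Name) : Name(Name) {}
      void print(std::ostream &OS) const override { OS << Name; }
    };

    class BinaryExprAST : public ExprAST {
      char Op;
      std::unique_ptr<ExprAST> LHS, RHS;

    public:
      BinaryExprAST(char Op, std::unique_ptr<ExprAST> LHS,
                    std::unique_ptr<ExprAST> RHS)
        : Op(Op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}

      // Parenthesize every binary node so the tree shape stays visible.
      void print(std::ostream &OS) const override {
        OS << '(';
        LHS->print(OS);
        OS << ' ' << Op << ' ';
        RHS->print(OS);
        OS << ')';
      }
    };

    // Hypothetical helper: render an expression tree to a string.
    std::string toString(const ExprAST &E) {
      std::ostringstream OS;
      E.print(OS);
      return OS.str();
    }

Because the printer emits parentheses around every binary node, the
output shows grouping that the AST itself stores only as tree structure.
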
For our basic language, these are all of the expression nodes we'll
define. Because it doesn't have conditional control flow, it isn't
Turing-complete; we'll fix that in a later installment. The two things
we need next are a way to talk about the interface to a function, and a
way to talk about functions themselves:

.. code-block:: c++

    /// PrototypeAST - This class represents the "prototype" for a function,
    /// which captures its name, and its argument names (thus implicitly the number
    /// of arguments the function takes).
    class PrototypeAST {
      std::string Name;
      std::vector<std::string> Args;

    public:
      PrototypeAST(const std::string &name, std::vector<std::string> Args)
        : Name(name), Args(std::move(Args)) {}
    };

    /// FunctionAST - This class represents a function definition itself.
    class FunctionAST {
      std::unique_ptr<PrototypeAST> Proto;
      std::unique_ptr<ExprAST> Body;

    public:
      FunctionAST(std::unique_ptr<PrototypeAST> Proto,
                  std::unique_ptr<ExprAST> Body)
        : Proto(std::move(Proto)), Body(std::move(Body)) {}
    };

In Kaleidoscope, functions are typed with just a count of their
arguments. Since all values are double precision floating point, the
type of each argument doesn't need to be stored anywhere. In a more
aggressive and realistic language, the "ExprAST" class would probably
have a type field.

With this scaffolding, we can now talk about parsing expressions and
function bodies in Kaleidoscope.

Parser Basics
=============

Now that we have an AST to build, we need to define the parser code to
build it. The idea here is that we want to parse something like "x+y"
(which is returned as three tokens by the lexer) into an AST that could
be generated with calls like this:

.. code-block:: c++

      auto LHS = llvm::make_unique<VariableExprAST>("x");
      auto RHS = llvm::make_unique<VariableExprAST>("y");
      auto Result = llvm::make_unique<BinaryExprAST>('+', std::move(LHS),
                                                     std::move(RHS));

In order to do this, we'll start by defining some basic helper routines:

.. code-block:: c++

    /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
    /// token the parser is looking at.  getNextToken reads another token from the
    /// lexer and updates CurTok with its results.
    static int CurTok;
    static int getNextToken() {
      return CurTok = gettok();
    }

This implements a simple token buffer around the lexer. This allows us
to look one token ahead at what the lexer is returning. Every function
in our parser will assume that CurTok is the current token that needs to
be parsed.

.. code-block:: c++

    /// Error* - These are little helper functions for error handling.
    std::unique_ptr<ExprAST> Error(const char *Str) {
      fprintf(stderr, "Error: %s\n", Str);
      return nullptr;
    }
    std::unique_ptr<PrototypeAST> ErrorP(const char *Str) {
      Error(Str);
      return nullptr;
    }

The ``Error`` routines are simple helper routines that our parser will
use to handle errors. The error recovery in our parser will not be the
best and is not particularly user-friendly, but it will be enough for our
tutorial. These routines make it easier to handle errors in routines
that have various return types: they always return null.

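To see the one-token buffer from ``getNextToken`` in isolation, here is
a small sketch. It assumes a stand-in ``gettok`` that returns each
character of a string as its own token and -1 at the end of input; the
real lexer from Chapter 1 is more elaborate, and ``Input`` and ``Pos``
are names invented for this example.

.. code-block:: c++

    #include <cstddef>
    #include <string>

    // Stand-in for the Chapter 1 lexer (an assumption for this sketch):
    // every character of Input is one token, and -1 marks end of input.
    static std::string Input;
    static std::size_t Pos = 0;
    static int gettok() {
      return Pos < Input.size() ? static_cast<unsigned char>(Input[Pos++]) : -1;
    }

    // Same shape as the tutorial's token buffer: CurTok always holds the
    // token the parser is currently looking at.
    static int CurTok;
    static int getNextToken() {
      return CurTok = gettok();
    }

With ``Input`` set to ``"x+"``, the first call to ``getNextToken()``
leaves ``'x'`` in ``CurTok``, where any number of parsing routines can
inspect it without consuming it. Only the next call advances to ``'+'``.
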
With these basic helper functions, we can implement the first piece of
our grammar: numeric literals.

Basic Expression Parsing
========================

We start with numeric literals, because they are the simplest to
process. For each production in our grammar, we'll define a function
which parses that production. For numeric literals, we have:

.. code-block:: c++

    /// numberexpr ::= number
    static std::unique_ptr<ExprAST> ParseNumberExpr() {
      auto Result = llvm::make_unique<NumberExprAST>(NumVal);
      getNextToken(); // consume the number
      return std::move(Result);
    }

This routine is very simple: it expects to be called when the current
token is a ``tok_number`` token. It takes the current number value,
creates a ``NumberExprAST`` node, advances the lexer to the next token,
and finally returns.

There are some interesting aspects to this. The most important one is
that this routine eats all of the tokens that correspond to the
production and returns the lexer buffer with the next token (which is
not part of the grammar production) ready to go. This is a fairly
standard way to go for recursive descent parsers. For a better example,
the parenthesis operator is defined like this:

.. code-block:: c++

    /// parenexpr ::= '(' expression ')'
    static std::unique_ptr<ExprAST> ParseParenExpr() {
      getNextToken(); // eat (.
      auto V = ParseExpression();
      if (!V)
        return nullptr;

      if (CurTok != ')')
        return Error("expected ')'");
      getNextToken(); // eat ).
      return V;
    }

This function illustrates a number of interesting things about the
parser:

1) It shows how we use the Error routines. When called, this function
expects that the current token is a '(' token, but after parsing the
subexpression, it is possible that there is no ')' waiting. For example,
if the user types in "(4 x" instead of "(4)", the parser should emit an
error. Because errors can occur, the parser needs a way to indicate that
they happened: in our parser, we return null on an error.

2) Another interesting aspect of this function is that it uses recursion
by calling ``ParseExpression`` (we will soon see that
``ParseExpression`` can call ``ParseParenExpr``). This is powerful
because it allows us to handle recursive grammars, and keeps each
production very simple. Note that parentheses do not cause construction
of AST nodes themselves. While we could do it this way, the most
important role of parentheses is to guide the parser and provide
grouping. Once the parser constructs the AST, parentheses are not
needed.

The next simple production is for handling variable references and
function calls:

.. code-block:: c++

    /// identifierexpr
    ///   ::= identifier
    ///   ::= identifier '(' expression* ')'
    static std::unique_ptr<ExprAST> ParseIdentifierExpr() {
      std::string IdName = IdentifierStr;

      getNextToken();  // eat identifier.

      if (CurTok != '(') // Simple variable ref.
        return llvm::make_unique<VariableExprAST>(IdName);

      // Call.
      getNextToken();  // eat (
      std::vector<std::unique_ptr<ExprAST>> Args;
      if (CurTok != ')') {
        while (1) {
          if (auto Arg = ParseExpression())
            Args.push_back(std::move(Arg));
          else
            return nullptr;

          if (CurTok == ')')
            break;

          if (CurTok != ',')
            return Error("Expected ')' or ',' in argument list");
          getNextToken();
        }
      }

      // Eat the ')'.
      getNextToken();

      return llvm::make_unique<CallExprAST>(IdName, std::move(Args));
    }

This routine follows the same style as the other routines. (It expects
to be called if the current token is a ``tok_identifier`` token.) It
also has recursion and error handling. One interesting aspect of this is
that it uses look-ahead to determine if the current identifier is a
standalone variable reference or if it is a function call expression.
It handles this by checking to see if the token after the identifier is
a '(' token, constructing either a ``VariableExprAST`` or
``CallExprAST`` node as appropriate.

Now that we have all of our simple expression-parsing logic in place, we
can define a helper function to wrap it together into one entry point.
We call this class of expressions "primary" expressions, for reasons
that will become more clear `later in the
tutorial <LangImpl6.html#user-defined-unary-operators>`_. In order to
parse an arbitrary primary expression, we need to determine what sort of
expression it is:

.. code-block:: c++

    /// primary
    ///   ::= identifierexpr
    ///   ::= numberexpr
    ///   ::= parenexpr
    static std::unique_ptr<ExprAST> ParsePrimary() {
      switch (CurTok) {
      default:
        return Error("unknown token when expecting an expression");
      case tok_identifier:
        return ParseIdentifierExpr();
      case tok_number:
        return ParseNumberExpr();
      case '(':
        return ParseParenExpr();
      }
    }

Now that you see the definition of this function, it is more obvious why
we can assume the state of CurTok in the various functions. This uses
look-ahead to determine which sort of expression is being inspected, and
then parses it with a function call.

Now that basic expressions are handled, we need to handle binary
expressions. They are a bit more complex.

Binary Expression Parsing
=========================

Binary expressions are significantly harder to parse because they are
often ambiguous. For example, when given the string "x+y\*z", the parser
can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
definitions from mathematics, we expect the latter parse, because "\*"
(multiplication) has higher precedence than "+" (addition).

There are many ways to handle this, but an elegant and efficient way is
to use `Operator-Precedence
Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
This parsing technique uses the precedence of binary operators to guide
recursion. To start with, we need a table of precedences:

.. code-block:: c++

    /// BinopPrecedence - This holds the precedence for each binary operator that is
    /// defined.
    static std::map<char, int> BinopPrecedence;

    /// GetTokPrecedence - Get the precedence of the pending binary operator token.
    static int GetTokPrecedence() {
      if (!isascii(CurTok))
        return -1;

      // Make sure it's a declared binop.
      int TokPrec = BinopPrecedence[CurTok];
      if (TokPrec <= 0) return -1;
      return TokPrec;
    }

    int main() {
      // Install standard binary operators.
      // 1 is lowest precedence.
      BinopPrecedence['<'] = 10;
      BinopPrecedence['+'] = 20;
      BinopPrecedence['-'] = 20;
      BinopPrecedence['*'] = 40;  // highest.
      ...
    }

For the basic form of Kaleidoscope, we will only support 4 binary
operators (this can obviously be extended by you, our brave and intrepid
reader). The ``GetTokPrecedence`` function returns the precedence for
the current token, or -1 if the token is not a binary operator. Having a
map makes it easy to add new operators and makes it clear that the
algorithm doesn't depend on the specific operators involved, but it
would be easy enough to eliminate the map and do the comparisons in the
``GetTokPrecedence`` function. (Or just use a fixed-size array.)

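The fixed-size array alternative mentioned above might look like the
following sketch. The names ``BinopPrecedenceTable``,
``InstallStandardBinops`` and ``GetPrecedenceOf`` are hypothetical, and
the lookup takes the token as a parameter instead of reading the global
``CurTok`` so that the example stands alone. One small difference from
the map version: ``std::map::operator[]`` inserts a value-initialized
entry when the key is missing, whereas an array lookup never modifies
the table.

.. code-block:: c++

    // Hypothetical array-based replacement for the std::map table. The
    // index is the ASCII code of the operator; 0 means "not a binary
    // operator".
    static int BinopPrecedenceTable[128] = {};

    static void InstallStandardBinops() {
      // 1 is lowest precedence.
      BinopPrecedenceTable['<'] = 10;
      BinopPrecedenceTable['+'] = 20;
      BinopPrecedenceTable['-'] = 20;
      BinopPrecedenceTable['*'] = 40; // highest.
    }

    // Same contract as GetTokPrecedence: the operator's precedence, or -1
    // if Tok is not a declared binary operator.
    static int GetPrecedenceOf(int Tok) {
      if (Tok < 0 || Tok >= 128) // non-ASCII tokens such as tok_eof
        return -1;
      int Prec = BinopPrecedenceTable[Tok];
      return Prec <= 0 ? -1 : Prec;
    }

The explicit range check plays the role of the ``isascii`` test in the
map version, since the negative token codes produced by the lexer must
never be used as array indices.
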
With the helper above defined, we can now start parsing binary
expressions. The basic idea of operator precedence parsing is to break
down an expression with potentially ambiguous binary operators into
pieces. Consider, for example, the expression "a+b+(c+d)\*e\*f+g".
Operator precedence parsing considers this as a stream of primary
expressions separated by binary operators. As such, it will first parse
the leading primary expression "a", then it will see the pairs [+, b]
[+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
primary expressions, the binary expression parser doesn't need to worry
about nested subexpressions like (c+d) at all.

To start, an expression is a primary expression potentially followed by
a sequence of [binop,primaryexpr] pairs:

.. code-block:: c++

    /// expression
    ///   ::= primary binoprhs
    ///
    static std::unique_ptr<ExprAST> ParseExpression() {
      auto LHS = ParsePrimary();
      if (!LHS)
        return nullptr;

      return ParseBinOpRHS(0, std::move(LHS));
    }

``ParseBinOpRHS`` is the function that parses the sequence of pairs for
us. It takes a precedence and a pointer to an expression for the part
that has been parsed so far. Note that "x" is a perfectly valid
expression: as such, "binoprhs" is allowed to be empty, in which case it
returns the expression that is passed into it. In our example above, the
code passes the expression for "a" into ``ParseBinOpRHS`` and the
current token is "+".

The precedence value passed into ``ParseBinOpRHS`` indicates the
minimal operator precedence that the function is allowed to eat. For
example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
passed in a precedence of 40, it will not consume any tokens (because
the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
starts with:

.. code-block:: c++

    /// binoprhs
    ///   ::= ('+' primary)*
    static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
                                                  std::unique_ptr<ExprAST> LHS) {
      // If this is a binop, find its precedence.
      while (1) {
        int TokPrec = GetTokPrecedence();

        // If this is a binop that binds at least as tightly as the current binop,
        // consume it, otherwise we are done.
        if (TokPrec < ExprPrec)
          return LHS;

This code gets the precedence of the current token and checks to see if
it is too low. Because we defined invalid tokens to have a precedence of
-1, this check implicitly knows that the pair-stream ends when the token
stream runs out of binary operators. If the token passes this check
(that is, the function does not return), we know that the token is a
binary operator and that it will be included in this expression:

.. code-block:: c++

        // Okay, we know this is a binop.
        int BinOp = CurTok;
        getNextToken();  // eat binop

        // Parse the primary expression after the binary operator.
        auto RHS = ParsePrimary();
        if (!RHS)
          return nullptr;

As such, this code eats (and remembers) the binary operator and then
parses the primary expression that follows. This builds up the whole
pair, the first of which is [+, b] for the running example.

Now that we parsed the left-hand side of an expression and one pair of
the RHS sequence, we have to decide which way the expression associates.
In particular, we could have "(a+b) binop unparsed" or "a + (b binop
unparsed)". To determine this, we look ahead at "binop" to determine its
precedence and compare it to BinOp's precedence (which is '+' in this
case):

.. code-block:: c++

        // If BinOp binds less tightly with RHS than the operator after RHS, let
        // the pending operator take RHS as its LHS.
        int NextPrec = GetTokPrecedence();
        if (TokPrec < NextPrec) {

If the precedence of the binop to the right of "RHS" is lower than or
equal to the precedence of our current operator, then we know that the
parentheses associate as "(a+b) binop ...". In our example, the current
operator is "+" and the next operator is "+", so they have the same
precedence. In this case we'll create the AST node for "a+b", and
then continue parsing:

.. code-block:: c++

          ... if body omitted ...
        }

        // Merge LHS/RHS.
        LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
                                               std::move(RHS));
      }  // loop around to the top of the while loop.
    }

In our example above, this will turn "a+b+" into "(a+b)" and execute the
next iteration of the loop, with "+" as the current token. The code
above will eat, remember, and parse "(c+d)" as the primary expression,
which makes the current pair equal to [+, (c+d)]. It will then evaluate
the 'if' conditional above with "\*" as the binop to the right of the
primary. In this case, the precedence of "\*" is higher than the
precedence of "+" so the body of the 'if' will be entered.

The critical question left here is "how can the if condition parse the
right hand side in full"? In particular, to build the AST correctly for
our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
variable. The code to do this is surprisingly simple (code from the
above two blocks duplicated for context):

.. code-block:: c++

        // If BinOp binds less tightly with RHS than the operator after RHS, let
        // the pending operator take RHS as its LHS.
        int NextPrec = GetTokPrecedence();
        if (TokPrec < NextPrec) {
          RHS = ParseBinOpRHS(TokPrec+1, std::move(RHS));
          if (!RHS)
            return nullptr;
        }
        // Merge LHS/RHS.
        LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
                                               std::move(RHS));
      }  // loop around to the top of the while loop.
    }

At this point, we know that the binary operator to the RHS of our
primary has higher precedence than the binop we are currently parsing.
As such, we know that any sequence of pairs whose operators are all
higher precedence than "+" should be parsed together and returned as
"RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
specifying "TokPrec+1" as the minimum precedence required for it to
continue. In our example above, this will cause it to return the AST
node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+'
expression.

Finally, on the next iteration of the while loop, the "+g" piece is
parsed and added to the AST. With this little bit of code (14
non-trivial lines), we correctly handle fully general binary expression
parsing in a very elegant way. This was a whirlwind tour of this code,
and it is somewhat subtle. I recommend running through it with a few
tough examples to see how it works.

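One way to run through such examples is a stripped-down model of the
algorithm. The sketch below is not the tutorial's parser: it assumes
that every token is a single character, that identifiers are single
lowercase letters, and that an "AST node" is just a fully parenthesized
string, with the empty string standing in for a null pointer. All of
the names (``group``, ``advance``, ``precedenceOf`` and so on) are
invented for this example. The control flow of ``parseBinOpRHS``
mirrors ``ParseBinOpRHS`` above.

.. code-block:: c++

    #include <cstddef>
    #include <string>

    // Simplified stand-ins for the lexer state: one character per token,
    // -1 at the end of input.
    static std::string Src;
    static std::size_t Idx;
    static int Cur;

    static int advance() {
      return Cur = Idx < Src.size() ? static_cast<unsigned char>(Src[Idx++]) : -1;
    }

    static int precedenceOf(int Tok) {
      switch (Tok) {
      case '<': return 10;
      case '+':
      case '-': return 20;
      case '*': return 40;
      default:  return -1; // not a binary operator
      }
    }

    static std::string parseExpr();

    // primary ::= letter | '(' expression ')'
    static std::string parsePrimary() {
      if (Cur == '(') {
        advance(); // eat '('
        std::string Inner = parseExpr();
        if (Inner.empty() || Cur != ')')
          return ""; // error: bad subexpression or missing ')'
        advance(); // eat ')'
        return Inner;
      }
      if (Cur < 'a' || Cur > 'z')
        return ""; // error: not a primary expression
      std::string Name(1, static_cast<char>(Cur));
      advance(); // eat the letter
      return Name;
    }

    // Same loop structure as ParseBinOpRHS in the text.
    static std::string parseBinOpRHS(int ExprPrec, std::string LHS) {
      while (true) {
        int TokPrec = precedenceOf(Cur);
        if (TokPrec < ExprPrec)
          return LHS;

        char BinOp = static_cast<char>(Cur);
        advance(); // eat binop

        std::string RHS = parsePrimary();
        if (RHS.empty())
          return "";

        // If the next operator binds more tightly, let it take RHS first.
        if (TokPrec < precedenceOf(Cur)) {
          RHS = parseBinOpRHS(TokPrec + 1, RHS);
          if (RHS.empty())
            return "";
        }

        // Merge LHS/RHS, making the grouping explicit.
        LHS = "(" + LHS + BinOp + RHS + ")";
      }
    }

    static std::string parseExpr() {
      std::string LHS = parsePrimary();
      if (LHS.empty())
        return "";
      return parseBinOpRHS(0, LHS);
    }

    // Hypothetical entry point: parse Text and return its grouped form.
    static std::string group(const std::string &Text) {
      Src = Text;
      Idx = 0;
      advance(); // prime the first token
      return parseExpr();
    }

Calling ``group`` on the running example ``a+b+(c+d)*e*f+g`` returns
``(((a+b)+(((c+d)*e)*f))+g)``, and ``x+y*z`` becomes ``(x+(y*z))``,
which matches the associations described in the walkthrough. An input
with a missing close parenthesis, such as ``(a b``, yields the empty
string.
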
This wraps up handling of expressions. At this point, we can point the
parser at an arbitrary token stream and build an expression from it,
stopping at the first token that is not part of the expression. Next up
we need to handle function definitions, etc.

Parsing the Rest
================

The next thing missing is handling of function prototypes. In
Kaleidoscope, these are used both for 'extern' function declarations as
well as function body definitions. The code to do this is
straight-forward and not very interesting (once you've survived
expressions):

.. code-block:: c++

    /// prototype
    ///   ::= id '(' id* ')'
    static std::unique_ptr<PrototypeAST> ParsePrototype() {
      if (CurTok != tok_identifier)
        return ErrorP("Expected function name in prototype");

      std::string FnName = IdentifierStr;
      getNextToken();

      if (CurTok != '(')
        return ErrorP("Expected '(' in prototype");

      // Read the list of argument names.
      std::vector<std::string> ArgNames;
      while (getNextToken() == tok_identifier)
        ArgNames.push_back(IdentifierStr);
      if (CurTok != ')')
        return ErrorP("Expected ')' in prototype");

      // success.
      getNextToken();  // eat ')'.

      return llvm::make_unique<PrototypeAST>(FnName, std::move(ArgNames));
    }

				595	Given this, a function definition is very simple, just a prototype plus
				596	an expression to implement the body:
				597
				598	.. code-block:: c++
				599
				600	/// definition ::= 'def' prototype expression
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	601	static std::unique_ptr<FunctionAST> ParseDefinition() {
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	602	getNextToken(); // eat def.
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	603	auto Proto = ParsePrototype();
				604	if (!Proto) return nullptr;
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	605
Lang Hames	09bf4c1	2015-08-18 18:11:06 +0000	[diff] [blame]	606	if (auto E = ParseExpression())
				607	return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
				608	return nullptr;
Sean Silva	d7fb396	2012-12-05 00:26:32 +0000	[diff] [blame]	609	}
				610
In addition, we support 'extern' to declare functions like 'sin' and
'cos' as well as to support forward declaration of user functions. These
'extern's are just prototypes with no body:

.. code-block:: c++

    /// external ::= 'extern' prototype
    static std::unique_ptr<PrototypeAST> ParseExtern() {
      getNextToken(); // eat extern.
      return ParsePrototype();
    }

Finally, we'll also let the user type in arbitrary top-level expressions
and evaluate them on the fly. We will handle this by defining anonymous
nullary (zero argument) functions for them:

.. code-block:: c++

    /// toplevelexpr ::= expression
    static std::unique_ptr<FunctionAST> ParseTopLevelExpr() {
      if (auto E = ParseExpression()) {
        // Make an anonymous proto.
        auto Proto = llvm::make_unique<PrototypeAST>("", std::vector<std::string>());
        return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
      }
      return nullptr;
    }

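To see the wrapping step in isolation, here is a small standalone
sketch. It assumes simplified stand-ins for the tutorial's AST classes
(public members, a single expression node type), and
``wrapTopLevelExpr`` is a hypothetical helper name; only the idea of an
empty name and an empty argument list comes from the code above:

.. code-block:: c++

    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    // Simplified stand-ins for the tutorial's AST classes (assumption).
    struct ExprAST {
      virtual ~ExprAST() = default;
    };

    struct NumberExprAST : ExprAST {
      double Val;
      explicit NumberExprAST(double Val) : Val(Val) {}
    };

    struct PrototypeAST {
      std::string Name;
      std::vector<std::string> Args;
      PrototypeAST(std::string Name, std::vector<std::string> Args)
          : Name(std::move(Name)), Args(std::move(Args)) {}
    };

    struct FunctionAST {
      std::unique_ptr<PrototypeAST> Proto;
      std::unique_ptr<ExprAST> Body;
      FunctionAST(std::unique_ptr<PrototypeAST> Proto,
                  std::unique_ptr<ExprAST> Body)
          : Proto(std::move(Proto)), Body(std::move(Body)) {}
    };

    // Hypothetical helper: wrap an already-parsed expression in an
    // anonymous function that takes no arguments. A null expression
    // (a parse error) is passed through as a null function.
    static std::unique_ptr<FunctionAST>
    wrapTopLevelExpr(std::unique_ptr<ExprAST> E) {
      if (!E)
        return nullptr;
      std::unique_ptr<PrototypeAST> Proto(
          new PrototypeAST("", std::vector<std::string>()));
      return std::unique_ptr<FunctionAST>(
          new FunctionAST(std::move(Proto), std::move(E)));
    }

Representing a top-level expression as a function with an empty name
means later stages only need to handle one kind of top-level code
object, the function.
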
Now that we have all the pieces, let's build a little driver that will
let us actually execute this code we've built!

The Driver
==========

The driver for this simply invokes all of the parsing pieces with a
top-level dispatch loop. There isn't much interesting here, so I'll just
include the top-level loop. See `below <#full-code-listing>`_ for the
full code, in the "Top-Level Parsing" section.

.. code-block:: c++

    /// top ::= definition | external | expression | ';'
    static void MainLoop() {
      while (1) {
        fprintf(stderr, "ready> ");
        switch (CurTok) {
        case tok_eof:
          return;
        case ';': // ignore top-level semicolons.
          getNextToken();
          break;
        case tok_def:
          HandleDefinition();
          break;
        case tok_extern:
          HandleExtern();
          break;
        default:
          HandleTopLevelExpression();
          break;
        }
      }
    }

The most interesting part of this is that we ignore top-level
semicolons. Why is this, you ask? The basic reason is that if you type
"4 + 5" at the command line, the parser doesn't know whether that is the
end of what you will type or not. For example, on the next line you
could type "def foo..." in which case 4+5 is the end of a top-level
expression. Alternatively you could type "\* 6", which would continue
the expression. Having top-level semicolons allows you to type "4+5;",
and the parser will know you are done.

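The dispatch performed by the loop can be sketched on its own as a
function of the current token. This is only an illustration under
stated assumptions: tokens are represented as strings rather than the
lexer's integer codes, an empty string stands in for ``tok_eof``, and
``TopLevelKind`` and ``classifyTopLevel`` are hypothetical names:

.. code-block:: c++

    #include <string>

    // Hypothetical result type: which branch of the switch would run.
    enum class TopLevelKind {
      EndOfInput, // tok_eof: leave the loop
      Skip,       // ';': consume it and continue
      Definition, // 'def': HandleDefinition
      Extern,     // 'extern': HandleExtern
      Expression  // anything else: HandleTopLevelExpression
    };

    // Hypothetical helper mirroring the switch in MainLoop. An empty
    // string stands in for tok_eof (assumption of this sketch).
    static TopLevelKind classifyTopLevel(const std::string &Tok) {
      if (Tok.empty())
        return TopLevelKind::EndOfInput;
      if (Tok == ";")
        return TopLevelKind::Skip;
      if (Tok == "def")
        return TopLevelKind::Definition;
      if (Tok == "extern")
        return TopLevelKind::Extern;
      return TopLevelKind::Expression;
    }

Note that every token other than the three special cases and end of
input falls through to the expression case, which is why a stray ``)``
at the top level is reported as an expression error.
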
Conclusions
===========

With just under 400 lines of commented code (240 lines of non-comment,
non-blank code), we fully defined our minimal language, including a
lexer, parser, and AST builder. With this done, the executable will
validate Kaleidoscope code and tell us if it is grammatically invalid.
For example, here is a sample interaction:

.. code-block:: bash

    $ ./a.out
    ready> def foo(x y) x+foo(y, 4.0);
    Parsed a function definition.
    ready> def foo(x y) x+y y;
    Parsed a function definition.
    Parsed a top-level expr
    ready> def foo(x y) x+y );
    Parsed a function definition.
    Error: unknown token when expecting an expression
    ready> extern sin(a);
    ready> Parsed an extern
    ready> ^D
    $

There is a lot of room for extension here. You can define new AST nodes,
extend the language in many ways, etc. In the `next
installment <LangImpl3.html>`_, we will describe how to generate LLVM
Intermediate Representation (IR) from the AST.

Full Code Listing
=================

Here is the complete code listing for this and the previous chapter.
Note that it is fully self-contained: you don't need LLVM or any
external libraries at all for this. (Besides the C and C++ standard
libraries, of course.) To build this, just compile with:

.. code-block:: bash

    # Compile
    clang++ -g -O3 toy.cpp
    # Run
    ./a.out

Here is the code:

.. literalinclude:: ../../examples/Kaleidoscope/Chapter2/toy.cpp
   :language: c++

`Next: Implementing Code Generation to LLVM IR <LangImpl3.html>`_
