Blame - docs/tutorial/LangImpl2.rst - fp2-dev/platform/external/llvm

blob: 0d62894a24028434981b60ac896b03794b945082 [file] [log] [blame]

Sean Silva	ee47edf	2012-12-05 00:26:32 +0000	[diff] [blame^]	1	===========================================
				2	Kaleidoscope: Implementing a Parser and AST
				3	===========================================
				4
				5	.. contents::
				6	:local:
				7
				8	Written by `Chris Lattner <mailto:sabre@nondot.org>`_
				9
				10	Chapter 2 Introduction
				11	======================
				12
				13	Welcome to Chapter 2 of the "`Implementing a language with
				14	LLVM <index.html>`_" tutorial. This chapter shows you how to use the
				15	lexer, built in `Chapter 1 <LangImpl1.html>`_, to build a full
				16	`parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
				17	language. Once we have a parser, we'll define and build an `Abstract
				18	Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
				19
				20	The parser we will build uses a combination of `Recursive Descent
				21	Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
				22	`Operator-Precedence
				23	Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
				24	parse the Kaleidoscope language (the latter for binary expressions and
				25	the former for everything else). Before we get to parsing though, lets
				26	talk about the output of the parser: the Abstract Syntax Tree.
				27
				28	The Abstract Syntax Tree (AST)
				29	==============================
				30
				31	The AST for a program captures its behavior in such a way that it is
				32	easy for later stages of the compiler (e.g. code generation) to
				33	interpret. We basically want one object for each construct in the
				34	language, and the AST should closely model the language. In
				35	Kaleidoscope, we have expressions, a prototype, and a function object.
				36	We'll start with expressions first:
				37
				38	.. code-block:: c++
				39
				40	/// ExprAST - Base class for all expression nodes.
				41	class ExprAST {
				42	public:
				43	virtual ~ExprAST() {}
				44	};
				45
				46	/// NumberExprAST - Expression class for numeric literals like "1.0".
				47	class NumberExprAST : public ExprAST {
				48	double Val;
				49	public:
				50	NumberExprAST(double val) : Val(val) {}
				51	};
				52
				53	The code above shows the definition of the base ExprAST class and one
				54	subclass which we use for numeric literals. The important thing to note
				55	about this code is that the NumberExprAST class captures the numeric
				56	value of the literal as an instance variable. This allows later phases
				57	of the compiler to know what the stored numeric value is.
				58
				59	Right now we only create the AST, so there are no useful accessor
				60	methods on them. It would be very easy to add a virtual method to pretty
				61	print the code, for example. Here are the other expression AST node
				62	definitions that we'll use in the basic form of the Kaleidoscope
				63	language:
				64
				65	.. code-block:: c++
				66
				67	/// VariableExprAST - Expression class for referencing a variable, like "a".
				68	class VariableExprAST : public ExprAST {
				69	std::string Name;
				70	public:
				71	VariableExprAST(const std::string &name) : Name(name) {}
				72	};
				73
				74	/// BinaryExprAST - Expression class for a binary operator.
				75	class BinaryExprAST : public ExprAST {
				76	char Op;
				77	ExprAST LHS, RHS;
				78	public:
				79	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				80	: Op(op), LHS(lhs), RHS(rhs) {}
				81	};
				82
				83	/// CallExprAST - Expression class for function calls.
				84	class CallExprAST : public ExprAST {
				85	std::string Callee;
				86	std::vector<ExprAST*> Args;
				87	public:
				88	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				89	: Callee(callee), Args(args) {}
				90	};
				91
				92	This is all (intentionally) rather straight-forward: variables capture
				93	the variable name, binary operators capture their opcode (e.g. '+'), and
				94	calls capture a function name as well as a list of any argument
				95	expressions. One thing that is nice about our AST is that it captures
				96	the language features without talking about the syntax of the language.
				97	Note that there is no discussion about precedence of binary operators,
				98	lexical structure, etc.
				99
				100	For our basic language, these are all of the expression nodes we'll
				101	define. Because it doesn't have conditional control flow, it isn't
				102	Turing-complete; we'll fix that in a later installment. The two things
				103	we need next are a way to talk about the interface to a function, and a
				104	way to talk about functions themselves:
				105
				106	.. code-block:: c++
				107
				108	/// PrototypeAST - This class represents the "prototype" for a function,
				109	/// which captures its name, and its argument names (thus implicitly the number
				110	/// of arguments the function takes).
				111	class PrototypeAST {
				112	std::string Name;
				113	std::vector<std::string> Args;
				114	public:
				115	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				116	: Name(name), Args(args) {}
				117	};
				118
				119	/// FunctionAST - This class represents a function definition itself.
				120	class FunctionAST {
				121	PrototypeAST *Proto;
				122	ExprAST *Body;
				123	public:
				124	FunctionAST(PrototypeAST proto, ExprAST body)
				125	: Proto(proto), Body(body) {}
				126	};
				127
				128	In Kaleidoscope, functions are typed with just a count of their
				129	arguments. Since all values are double precision floating point, the
				130	type of each argument doesn't need to be stored anywhere. In a more
				131	aggressive and realistic language, the "ExprAST" class would probably
				132	have a type field.
				133
				134	With this scaffolding, we can now talk about parsing expressions and
				135	function bodies in Kaleidoscope.
				136
				137	Parser Basics
				138	=============
				139
				140	Now that we have an AST to build, we need to define the parser code to
				141	build it. The idea here is that we want to parse something like "x+y"
				142	(which is returned as three tokens by the lexer) into an AST that could
				143	be generated with calls like this:
				144
				145	.. code-block:: c++
				146
				147	ExprAST *X = new VariableExprAST("x");
				148	ExprAST *Y = new VariableExprAST("y");
				149	ExprAST *Result = new BinaryExprAST('+', X, Y);
				150
				151	In order to do this, we'll start by defining some basic helper routines:
				152
				153	.. code-block:: c++
				154
				155	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
				156	/// token the parser is looking at. getNextToken reads another token from the
				157	/// lexer and updates CurTok with its results.
				158	static int CurTok;
				159	static int getNextToken() {
				160	return CurTok = gettok();
				161	}
				162
				163	This implements a simple token buffer around the lexer. This allows us
				164	to look one token ahead at what the lexer is returning. Every function
				165	in our parser will assume that CurTok is the current token that needs to
				166	be parsed.
				167
				168	.. code-block:: c++
				169
				170
				171	/// Error* - These are little helper functions for error handling.
				172	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				173	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				174	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				175
				176	The ``Error`` routines are simple helper routines that our parser will
				177	use to handle errors. The error recovery in our parser will not be the
				178	best and is not particular user-friendly, but it will be enough for our
				179	tutorial. These routines make it easier to handle errors in routines
				180	that have various return types: they always return null.
				181
				182	With these basic helper functions, we can implement the first piece of
				183	our grammar: numeric literals.
				184
				185	Basic Expression Parsing
				186	========================
				187
				188	We start with numeric literals, because they are the simplest to
				189	process. For each production in our grammar, we'll define a function
				190	which parses that production. For numeric literals, we have:
				191
				192	.. code-block:: c++
				193
				194	/// numberexpr ::= number
				195	static ExprAST *ParseNumberExpr() {
				196	ExprAST *Result = new NumberExprAST(NumVal);
				197	getNextToken(); // consume the number
				198	return Result;
				199	}
				200
				201	This routine is very simple: it expects to be called when the current
				202	token is a ``tok_number`` token. It takes the current number value,
				203	creates a ``NumberExprAST`` node, advances the lexer to the next token,
				204	and finally returns.
				205
				206	There are some interesting aspects to this. The most important one is
				207	that this routine eats all of the tokens that correspond to the
				208	production and returns the lexer buffer with the next token (which is
				209	not part of the grammar production) ready to go. This is a fairly
				210	standard way to go for recursive descent parsers. For a better example,
				211	the parenthesis operator is defined like this:
				212
				213	.. code-block:: c++
				214
				215	/// parenexpr ::= '(' expression ')'
				216	static ExprAST *ParseParenExpr() {
				217	getNextToken(); // eat (.
				218	ExprAST *V = ParseExpression();
				219	if (!V) return 0;
				220
				221	if (CurTok != ')')
				222	return Error("expected ')'");
				223	getNextToken(); // eat ).
				224	return V;
				225	}
				226
				227	This function illustrates a number of interesting things about the
				228	parser:
				229
				230	1) It shows how we use the Error routines. When called, this function
				231	expects that the current token is a '(' token, but after parsing the
				232	subexpression, it is possible that there is no ')' waiting. For example,
				233	if the user types in "(4 x" instead of "(4)", the parser should emit an
				234	error. Because errors can occur, the parser needs a way to indicate that
				235	they happened: in our parser, we return null on an error.
				236
				237	2) Another interesting aspect of this function is that it uses recursion
				238	by calling ``ParseExpression`` (we will soon see that
				239	``ParseExpression`` can call ``ParseParenExpr``). This is powerful
				240	because it allows us to handle recursive grammars, and keeps each
				241	production very simple. Note that parentheses do not cause construction
				242	of AST nodes themselves. While we could do it this way, the most
				243	important role of parentheses are to guide the parser and provide
				244	grouping. Once the parser constructs the AST, parentheses are not
				245	needed.
				246
				247	The next simple production is for handling variable references and
				248	function calls:
				249
				250	.. code-block:: c++
				251
				252	/// identifierexpr
				253	/// ::= identifier
				254	/// ::= identifier '(' expression* ')'
				255	static ExprAST *ParseIdentifierExpr() {
				256	std::string IdName = IdentifierStr;
				257
				258	getNextToken(); // eat identifier.
				259
				260	if (CurTok != '(') // Simple variable ref.
				261	return new VariableExprAST(IdName);
				262
				263	// Call.
				264	getNextToken(); // eat (
				265	std::vector<ExprAST*> Args;
				266	if (CurTok != ')') {
				267	while (1) {
				268	ExprAST *Arg = ParseExpression();
				269	if (!Arg) return 0;
				270	Args.push_back(Arg);
				271
				272	if (CurTok == ')') break;
				273
				274	if (CurTok != ',')
				275	return Error("Expected ')' or ',' in argument list");
				276	getNextToken();
				277	}
				278	}
				279
				280	// Eat the ')'.
				281	getNextToken();
				282
				283	return new CallExprAST(IdName, Args);
				284	}
				285
				286	This routine follows the same style as the other routines. (It expects
				287	to be called if the current token is a ``tok_identifier`` token). It
				288	also has recursion and error handling. One interesting aspect of this is
				289	that it uses look-ahead to determine if the current identifier is a
				290	stand alone variable reference or if it is a function call expression.
				291	It handles this by checking to see if the token after the identifier is
				292	a '(' token, constructing either a ``VariableExprAST`` or
				293	``CallExprAST`` node as appropriate.
				294
				295	Now that we have all of our simple expression-parsing logic in place, we
				296	can define a helper function to wrap it together into one entry point.
				297	We call this class of expressions "primary" expressions, for reasons
				298	that will become more clear `later in the
				299	tutorial <LangImpl6.html#unary>`_. In order to parse an arbitrary
				300	primary expression, we need to determine what sort of expression it is:
				301
				302	.. code-block:: c++
				303
				304	/// primary
				305	/// ::= identifierexpr
				306	/// ::= numberexpr
				307	/// ::= parenexpr
				308	static ExprAST *ParsePrimary() {
				309	switch (CurTok) {
				310	default: return Error("unknown token when expecting an expression");
				311	case tok_identifier: return ParseIdentifierExpr();
				312	case tok_number: return ParseNumberExpr();
				313	case '(': return ParseParenExpr();
				314	}
				315	}
				316
				317	Now that you see the definition of this function, it is more obvious why
				318	we can assume the state of CurTok in the various functions. This uses
				319	look-ahead to determine which sort of expression is being inspected, and
				320	then parses it with a function call.
				321
				322	Now that basic expressions are handled, we need to handle binary
				323	expressions. They are a bit more complex.
				324
				325	Binary Expression Parsing
				326	=========================
				327
				328	Binary expressions are significantly harder to parse because they are
				329	often ambiguous. For example, when given the string "x+y\*z", the parser
				330	can choose to parse it as either "(x+y)\z" or "x+(y\z)". With common
				331	definitions from mathematics, we expect the later parse, because "\*"
				332	(multiplication) has higher precedence than "+" (addition).
				333
				334	There are many ways to handle this, but an elegant and efficient way is
				335	to use `Operator-Precedence
				336	Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
				337	This parsing technique uses the precedence of binary operators to guide
				338	recursion. To start with, we need a table of precedences:
				339
				340	.. code-block:: c++
				341
				342	/// BinopPrecedence - This holds the precedence for each binary operator that is
				343	/// defined.
				344	static std::map<char, int> BinopPrecedence;
				345
				346	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				347	static int GetTokPrecedence() {
				348	if (!isascii(CurTok))
				349	return -1;
				350
				351	// Make sure it's a declared binop.
				352	int TokPrec = BinopPrecedence[CurTok];
				353	if (TokPrec <= 0) return -1;
				354	return TokPrec;
				355	}
				356
				357	int main() {
				358	// Install standard binary operators.
				359	// 1 is lowest precedence.
				360	BinopPrecedence['<'] = 10;
				361	BinopPrecedence['+'] = 20;
				362	BinopPrecedence['-'] = 20;
				363	BinopPrecedence['*'] = 40; // highest.
				364	...
				365	}
				366
				367	For the basic form of Kaleidoscope, we will only support 4 binary
				368	operators (this can obviously be extended by you, our brave and intrepid
				369	reader). The ``GetTokPrecedence`` function returns the precedence for
				370	the current token, or -1 if the token is not a binary operator. Having a
				371	map makes it easy to add new operators and makes it clear that the
				372	algorithm doesn't depend on the specific operators involved, but it
				373	would be easy enough to eliminate the map and do the comparisons in the
				374	``GetTokPrecedence`` function. (Or just use a fixed-size array).
				375
				376	With the helper above defined, we can now start parsing binary
				377	expressions. The basic idea of operator precedence parsing is to break
				378	down an expression with potentially ambiguous binary operators into
				379	pieces. Consider ,for example, the expression "a+b+(c+d)\e\f+g".
				380	Operator precedence parsing considers this as a stream of primary
				381	expressions separated by binary operators. As such, it will first parse
				382	the leading primary expression "a", then it will see the pairs [+, b]
				383	[+, (c+d)] [\, e] [\, f] and [+, g]. Note that because parentheses are
				384	primary expressions, the binary expression parser doesn't need to worry
				385	about nested subexpressions like (c+d) at all.
				386
				387	To start, an expression is a primary expression potentially followed by
				388	a sequence of [binop,primaryexpr] pairs:
				389
				390	.. code-block:: c++
				391
				392	/// expression
				393	/// ::= primary binoprhs
				394	///
				395	static ExprAST *ParseExpression() {
				396	ExprAST *LHS = ParsePrimary();
				397	if (!LHS) return 0;
				398
				399	return ParseBinOpRHS(0, LHS);
				400	}
				401
				402	``ParseBinOpRHS`` is the function that parses the sequence of pairs for
				403	us. It takes a precedence and a pointer to an expression for the part
				404	that has been parsed so far. Note that "x" is a perfectly valid
				405	expression: As such, "binoprhs" is allowed to be empty, in which case it
				406	returns the expression that is passed into it. In our example above, the
				407	code passes the expression for "a" into ``ParseBinOpRHS`` and the
				408	current token is "+".
				409
				410	The precedence value passed into ``ParseBinOpRHS`` indicates the
				411	minimal operator precedence that the function is allowed to eat. For
				412	example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
				413	passed in a precedence of 40, it will not consume any tokens (because
				414	the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
				415	starts with:
				416
				417	.. code-block:: c++
				418
				419	/// binoprhs
				420	/// ::= ('+' primary)*
				421	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				422	// If this is a binop, find its precedence.
				423	while (1) {
				424	int TokPrec = GetTokPrecedence();
				425
				426	// If this is a binop that binds at least as tightly as the current binop,
				427	// consume it, otherwise we are done.
				428	if (TokPrec < ExprPrec)
				429	return LHS;
				430
				431	This code gets the precedence of the current token and checks to see if
				432	if is too low. Because we defined invalid tokens to have a precedence of
				433	-1, this check implicitly knows that the pair-stream ends when the token
				434	stream runs out of binary operators. If this check succeeds, we know
				435	that the token is a binary operator and that it will be included in this
				436	expression:
				437
				438	.. code-block:: c++
				439
				440	// Okay, we know this is a binop.
				441	int BinOp = CurTok;
				442	getNextToken(); // eat binop
				443
				444	// Parse the primary expression after the binary operator.
				445	ExprAST *RHS = ParsePrimary();
				446	if (!RHS) return 0;
				447
				448	As such, this code eats (and remembers) the binary operator and then
				449	parses the primary expression that follows. This builds up the whole
				450	pair, the first of which is [+, b] for the running example.
				451
				452	Now that we parsed the left-hand side of an expression and one pair of
				453	the RHS sequence, we have to decide which way the expression associates.
				454	In particular, we could have "(a+b) binop unparsed" or "a + (b binop
				455	unparsed)". To determine this, we look ahead at "binop" to determine its
				456	precedence and compare it to BinOp's precedence (which is '+' in this
				457	case):
				458
				459	.. code-block:: c++
				460
				461	// If BinOp binds less tightly with RHS than the operator after RHS, let
				462	// the pending operator take RHS as its LHS.
				463	int NextPrec = GetTokPrecedence();
				464	if (TokPrec < NextPrec) {
				465
				466	If the precedence of the binop to the right of "RHS" is lower or equal
				467	to the precedence of our current operator, then we know that the
				468	parentheses associate as "(a+b) binop ...". In our example, the current
				469	operator is "+" and the next operator is "+", we know that they have the
				470	same precedence. In this case we'll create the AST node for "a+b", and
				471	then continue parsing:
				472
				473	.. code-block:: c++
				474
				475	... if body omitted ...
				476	}
				477
				478	// Merge LHS/RHS.
				479	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				480	} // loop around to the top of the while loop.
				481	}
				482
				483	In our example above, this will turn "a+b+" into "(a+b)" and execute the
				484	next iteration of the loop, with "+" as the current token. The code
				485	above will eat, remember, and parse "(c+d)" as the primary expression,
				486	which makes the current pair equal to [+, (c+d)]. It will then evaluate
				487	the 'if' conditional above with "\*" as the binop to the right of the
				488	primary. In this case, the precedence of "\*" is higher than the
				489	precedence of "+" so the if condition will be entered.
				490
				491	The critical question left here is "how can the if condition parse the
				492	right hand side in full"? In particular, to build the AST correctly for
				493	our example, it needs to get all of "(c+d)\e\f" as the RHS expression
				494	variable. The code to do this is surprisingly simple (code from the
				495	above two blocks duplicated for context):
				496
				497	.. code-block:: c++
				498
				499	// If BinOp binds less tightly with RHS than the operator after RHS, let
				500	// the pending operator take RHS as its LHS.
				501	int NextPrec = GetTokPrecedence();
				502	if (TokPrec < NextPrec) {
				503	RHS = ParseBinOpRHS(TokPrec+1, RHS);
				504	if (RHS == 0) return 0;
				505	}
				506	// Merge LHS/RHS.
				507	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				508	} // loop around to the top of the while loop.
				509	}
				510
				511	At this point, we know that the binary operator to the RHS of our
				512	primary has higher precedence than the binop we are currently parsing.
				513	As such, we know that any sequence of pairs whose operators are all
				514	higher precedence than "+" should be parsed together and returned as
				515	"RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
				516	specifying "TokPrec+1" as the minimum precedence required for it to
				517	continue. In our example above, this will cause it to return the AST
				518	node for "(c+d)\e\f" as RHS, which is then set as the RHS of the '+'
				519	expression.
				520
				521	Finally, on the next iteration of the while loop, the "+g" piece is
				522	parsed and added to the AST. With this little bit of code (14
				523	non-trivial lines), we correctly handle fully general binary expression
				524	parsing in a very elegant way. This was a whirlwind tour of this code,
				525	and it is somewhat subtle. I recommend running through it with a few
				526	tough examples to see how it works.
				527
				528	This wraps up handling of expressions. At this point, we can point the
				529	parser at an arbitrary token stream and build an expression from it,
				530	stopping at the first token that is not part of the expression. Next up
				531	we need to handle function definitions, etc.
				532
				533	Parsing the Rest
				534	================
				535
				536	The next thing missing is handling of function prototypes. In
				537	Kaleidoscope, these are used both for 'extern' function declarations as
				538	well as function body definitions. The code to do this is
				539	straight-forward and not very interesting (once you've survived
				540	expressions):
				541
				542	.. code-block:: c++
				543
				544	/// prototype
				545	/// ::= id '(' id* ')'
				546	static PrototypeAST *ParsePrototype() {
				547	if (CurTok != tok_identifier)
				548	return ErrorP("Expected function name in prototype");
				549
				550	std::string FnName = IdentifierStr;
				551	getNextToken();
				552
				553	if (CurTok != '(')
				554	return ErrorP("Expected '(' in prototype");
				555
				556	// Read the list of argument names.
				557	std::vector<std::string> ArgNames;
				558	while (getNextToken() == tok_identifier)
				559	ArgNames.push_back(IdentifierStr);
				560	if (CurTok != ')')
				561	return ErrorP("Expected ')' in prototype");
				562
				563	// success.
				564	getNextToken(); // eat ')'.
				565
				566	return new PrototypeAST(FnName, ArgNames);
				567	}
				568
				569	Given this, a function definition is very simple, just a prototype plus
				570	an expression to implement the body:
				571
				572	.. code-block:: c++
				573
				574	/// definition ::= 'def' prototype expression
				575	static FunctionAST *ParseDefinition() {
				576	getNextToken(); // eat def.
				577	PrototypeAST *Proto = ParsePrototype();
				578	if (Proto == 0) return 0;
				579
				580	if (ExprAST *E = ParseExpression())
				581	return new FunctionAST(Proto, E);
				582	return 0;
				583	}
				584
				585	In addition, we support 'extern' to declare functions like 'sin' and
				586	'cos' as well as to support forward declaration of user functions. These
				587	'extern's are just prototypes with no body:
				588
				589	.. code-block:: c++
				590
				591	/// external ::= 'extern' prototype
				592	static PrototypeAST *ParseExtern() {
				593	getNextToken(); // eat extern.
				594	return ParsePrototype();
				595	}
				596
				597	Finally, we'll also let the user type in arbitrary top-level expressions
				598	and evaluate them on the fly. We will handle this by defining anonymous
				599	nullary (zero argument) functions for them:
				600
				601	.. code-block:: c++
				602
				603	/// toplevelexpr ::= expression
				604	static FunctionAST *ParseTopLevelExpr() {
				605	if (ExprAST *E = ParseExpression()) {
				606	// Make an anonymous proto.
				607	PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
				608	return new FunctionAST(Proto, E);
				609	}
				610	return 0;
				611	}
				612
				613	Now that we have all the pieces, let's build a little driver that will
				614	let us actually execute this code we've built!
				615
				616	The Driver
				617	==========
				618
				619	The driver for this simply invokes all of the parsing pieces with a
				620	top-level dispatch loop. There isn't much interesting here, so I'll just
				621	include the top-level loop. See `below <#code>`_ for full code in the
				622	"Top-Level Parsing" section.
				623
				624	.. code-block:: c++
				625
				626	/// top ::= definition \| external \| expression \| ';'
				627	static void MainLoop() {
				628	while (1) {
				629	fprintf(stderr, "ready> ");
				630	switch (CurTok) {
				631	case tok_eof: return;
				632	case ';': getNextToken(); break; // ignore top-level semicolons.
				633	case tok_def: HandleDefinition(); break;
				634	case tok_extern: HandleExtern(); break;
				635	default: HandleTopLevelExpression(); break;
				636	}
				637	}
				638	}
				639
				640	The most interesting part of this is that we ignore top-level
				641	semicolons. Why is this, you ask? The basic reason is that if you type
				642	"4 + 5" at the command line, the parser doesn't know whether that is the
				643	end of what you will type or not. For example, on the next line you
				644	could type "def foo..." in which case 4+5 is the end of a top-level
				645	expression. Alternatively you could type "\* 6", which would continue
				646	the expression. Having top-level semicolons allows you to type "4+5;",
				647	and the parser will know you are done.
				648
				649	Conclusions
				650	===========
				651
				652	With just under 400 lines of commented code (240 lines of non-comment,
				653	non-blank code), we fully defined our minimal language, including a
				654	lexer, parser, and AST builder. With this done, the executable will
				655	validate Kaleidoscope code and tell us if it is grammatically invalid.
				656	For example, here is a sample interaction:
				657
				658	.. code-block:: bash
				659
				660	$ ./a.out
				661	ready> def foo(x y) x+foo(y, 4.0);
				662	Parsed a function definition.
				663	ready> def foo(x y) x+y y;
				664	Parsed a function definition.
				665	Parsed a top-level expr
				666	ready> def foo(x y) x+y );
				667	Parsed a function definition.
				668	Error: unknown token when expecting an expression
				669	ready> extern sin(a);
				670	ready> Parsed an extern
				671	ready> ^D
				672	$
				673
				674	There is a lot of room for extension here. You can define new AST nodes,
				675	extend the language in many ways, etc. In the `next
				676	installment <LangImpl3.html>`_, we will describe how to generate LLVM
				677	Intermediate Representation (IR) from the AST.
				678
				679	Full Code Listing
				680	=================
				681
				682	Here is the complete code listing for this and the previous chapter.
				683	Note that it is fully self-contained: you don't need LLVM or any
				684	external libraries at all for this. (Besides the C and C++ standard
				685	libraries, of course.) To build this, just compile with:
				686
				687	.. code-block:: bash
				688
				689	# Compile
				690	clang++ -g -O3 toy.cpp
				691	# Run
				692	./a.out
				693
				694	Here is the code:
				695
				696	.. code-block:: c++
				697
				698	#include <cstdio>
				699	#include <cstdlib>
				700	#include <string>
				701	#include <map>
				702	#include <vector>
				703
				704	//===----------------------------------------------------------------------===//
				705	// Lexer
				706	//===----------------------------------------------------------------------===//
				707
				708	// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
				709	// of these for known things.
				710	enum Token {
				711	tok_eof = -1,
				712
				713	// commands
				714	tok_def = -2, tok_extern = -3,
				715
				716	// primary
				717	tok_identifier = -4, tok_number = -5
				718	};
				719
				720	static std::string IdentifierStr; // Filled in if tok_identifier
				721	static double NumVal; // Filled in if tok_number
				722
				723	/// gettok - Return the next token from standard input.
				724	static int gettok() {
				725	static int LastChar = ' ';
				726
				727	// Skip any whitespace.
				728	while (isspace(LastChar))
				729	LastChar = getchar();
				730
				731	if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
				732	IdentifierStr = LastChar;
				733	while (isalnum((LastChar = getchar())))
				734	IdentifierStr += LastChar;
				735
				736	if (IdentifierStr == "def") return tok_def;
				737	if (IdentifierStr == "extern") return tok_extern;
				738	return tok_identifier;
				739	}
				740
				741	if (isdigit(LastChar) \|\| LastChar == '.') { // Number: [0-9.]+
				742	std::string NumStr;
				743	do {
				744	NumStr += LastChar;
				745	LastChar = getchar();
				746	} while (isdigit(LastChar) \|\| LastChar == '.');
				747
				748	NumVal = strtod(NumStr.c_str(), 0);
				749	return tok_number;
				750	}
				751
				752	if (LastChar == '#') {
				753	// Comment until end of line.
				754	do LastChar = getchar();
				755	while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
				756
				757	if (LastChar != EOF)
				758	return gettok();
				759	}
				760
				761	// Check for end of file. Don't eat the EOF.
				762	if (LastChar == EOF)
				763	return tok_eof;
				764
				765	// Otherwise, just return the character as its ascii value.
				766	int ThisChar = LastChar;
				767	LastChar = getchar();
				768	return ThisChar;
				769	}
				770
				771	//===----------------------------------------------------------------------===//
				772	// Abstract Syntax Tree (aka Parse Tree)
				773	//===----------------------------------------------------------------------===//
				774
				775	/// ExprAST - Base class for all expression nodes.
				776	class ExprAST {
				777	public:
				778	virtual ~ExprAST() {}
				779	};
				780
				781	/// NumberExprAST - Expression class for numeric literals like "1.0".
				782	class NumberExprAST : public ExprAST {
				783	double Val;
				784	public:
				785	NumberExprAST(double val) : Val(val) {}
				786	};
				787
				788	/// VariableExprAST - Expression class for referencing a variable, like "a".
				789	class VariableExprAST : public ExprAST {
				790	std::string Name;
				791	public:
				792	VariableExprAST(const std::string &name) : Name(name) {}
				793	};
				794
				795	/// BinaryExprAST - Expression class for a binary operator.
				796	class BinaryExprAST : public ExprAST {
				797	char Op;
				798	ExprAST LHS, RHS;
				799	public:
				800	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				801	: Op(op), LHS(lhs), RHS(rhs) {}
				802	};
				803
				804	/// CallExprAST - Expression class for function calls.
				805	class CallExprAST : public ExprAST {
				806	std::string Callee;
				807	std::vector<ExprAST*> Args;
				808	public:
				809	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				810	: Callee(callee), Args(args) {}
				811	};
				812
				813	/// PrototypeAST - This class represents the "prototype" for a function,
				814	/// which captures its name, and its argument names (thus implicitly the number
				815	/// of arguments the function takes).
				816	class PrototypeAST {
				817	std::string Name;
				818	std::vector<std::string> Args;
				819	public:
				820	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				821	: Name(name), Args(args) {}
				822
				823	};
				824
				825	/// FunctionAST - This class represents a function definition itself.
				826	class FunctionAST {
				827	PrototypeAST *Proto;
				828	ExprAST *Body;
				829	public:
				830	FunctionAST(PrototypeAST proto, ExprAST body)
				831	: Proto(proto), Body(body) {}
				832
				833	};
				834
				835	//===----------------------------------------------------------------------===//
				836	// Parser
				837	//===----------------------------------------------------------------------===//
				838
				839	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
				840	/// token the parser is looking at. getNextToken reads another token from the
				841	/// lexer and updates CurTok with its results.
				842	static int CurTok;
				843	static int getNextToken() {
				844	return CurTok = gettok();
				845	}
				846
				847	/// BinopPrecedence - This holds the precedence for each binary operator that is
				848	/// defined.
				849	static std::map<char, int> BinopPrecedence;
				850
				851	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				852	static int GetTokPrecedence() {
				853	if (!isascii(CurTok))
				854	return -1;
				855
				856	// Make sure it's a declared binop.
				857	int TokPrec = BinopPrecedence[CurTok];
				858	if (TokPrec <= 0) return -1;
				859	return TokPrec;
				860	}
				861
				862	/// Error* - These are little helper functions for error handling.
				863	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				864	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				865	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				866
				867	static ExprAST *ParseExpression();
				868
				869	/// identifierexpr
				870	/// ::= identifier
				871	/// ::= identifier '(' expression* ')'
				872	static ExprAST *ParseIdentifierExpr() {
				873	std::string IdName = IdentifierStr;
				874
				875	getNextToken(); // eat identifier.
				876
				877	if (CurTok != '(') // Simple variable ref.
				878	return new VariableExprAST(IdName);
				879
				880	// Call.
				881	getNextToken(); // eat (
				882	std::vector<ExprAST*> Args;
				883	if (CurTok != ')') {
				884	while (1) {
				885	ExprAST *Arg = ParseExpression();
				886	if (!Arg) return 0;
				887	Args.push_back(Arg);
				888
				889	if (CurTok == ')') break;
				890
				891	if (CurTok != ',')
				892	return Error("Expected ')' or ',' in argument list");
				893	getNextToken();
				894	}
				895	}
				896
				897	// Eat the ')'.
				898	getNextToken();
				899
				900	return new CallExprAST(IdName, Args);
				901	}
				902
				903	/// numberexpr ::= number
				904	static ExprAST *ParseNumberExpr() {
				905	ExprAST *Result = new NumberExprAST(NumVal);
				906	getNextToken(); // consume the number
				907	return Result;
				908	}
				909
				910	/// parenexpr ::= '(' expression ')'
				911	static ExprAST *ParseParenExpr() {
				912	getNextToken(); // eat (.
				913	ExprAST *V = ParseExpression();
				914	if (!V) return 0;
				915
				916	if (CurTok != ')')
				917	return Error("expected ')'");
				918	getNextToken(); // eat ).
				919	return V;
				920	}
				921
				922	/// primary
				923	/// ::= identifierexpr
				924	/// ::= numberexpr
				925	/// ::= parenexpr
				926	static ExprAST *ParsePrimary() {
				927	switch (CurTok) {
				928	default: return Error("unknown token when expecting an expression");
				929	case tok_identifier: return ParseIdentifierExpr();
				930	case tok_number: return ParseNumberExpr();
				931	case '(': return ParseParenExpr();
				932	}
				933	}
				934
				935	/// binoprhs
				936	/// ::= ('+' primary)*
				937	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				938	// If this is a binop, find its precedence.
				939	while (1) {
				940	int TokPrec = GetTokPrecedence();
				941
				942	// If this is a binop that binds at least as tightly as the current binop,
				943	// consume it, otherwise we are done.
				944	if (TokPrec < ExprPrec)
				945	return LHS;
				946
				947	// Okay, we know this is a binop.
				948	int BinOp = CurTok;
				949	getNextToken(); // eat binop
				950
				951	// Parse the primary expression after the binary operator.
				952	ExprAST *RHS = ParsePrimary();
				953	if (!RHS) return 0;
				954
				955	// If BinOp binds less tightly with RHS than the operator after RHS, let
				956	// the pending operator take RHS as its LHS.
				957	int NextPrec = GetTokPrecedence();
				958	if (TokPrec < NextPrec) {
				959	RHS = ParseBinOpRHS(TokPrec+1, RHS);
				960	if (RHS == 0) return 0;
				961	}
				962
				963	// Merge LHS/RHS.
				964	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				965	}
				966	}
				967
				968	/// expression
				969	/// ::= primary binoprhs
				970	///
				971	static ExprAST *ParseExpression() {
				972	ExprAST *LHS = ParsePrimary();
				973	if (!LHS) return 0;
				974
				975	return ParseBinOpRHS(0, LHS);
				976	}
				977
				978	/// prototype
				979	/// ::= id '(' id* ')'
				980	static PrototypeAST *ParsePrototype() {
				981	if (CurTok != tok_identifier)
				982	return ErrorP("Expected function name in prototype");
				983
				984	std::string FnName = IdentifierStr;
				985	getNextToken();
				986
				987	if (CurTok != '(')
				988	return ErrorP("Expected '(' in prototype");
				989
				990	std::vector<std::string> ArgNames;
				991	while (getNextToken() == tok_identifier)
				992	ArgNames.push_back(IdentifierStr);
				993	if (CurTok != ')')
				994	return ErrorP("Expected ')' in prototype");
				995
				996	// success.
				997	getNextToken(); // eat ')'.
				998
				999	return new PrototypeAST(FnName, ArgNames);
				1000	}
				1001
				1002	/// definition ::= 'def' prototype expression
				1003	static FunctionAST *ParseDefinition() {
				1004	getNextToken(); // eat def.
				1005	PrototypeAST *Proto = ParsePrototype();
				1006	if (Proto == 0) return 0;
				1007
				1008	if (ExprAST *E = ParseExpression())
				1009	return new FunctionAST(Proto, E);
				1010	return 0;
				1011	}
				1012
				1013	/// toplevelexpr ::= expression
				1014	static FunctionAST *ParseTopLevelExpr() {
				1015	if (ExprAST *E = ParseExpression()) {
				1016	// Make an anonymous proto.
				1017	PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
				1018	return new FunctionAST(Proto, E);
				1019	}
				1020	return 0;
				1021	}
				1022
				1023	/// external ::= 'extern' prototype
				1024	static PrototypeAST *ParseExtern() {
				1025	getNextToken(); // eat extern.
				1026	return ParsePrototype();
				1027	}
				1028
				1029	//===----------------------------------------------------------------------===//
				1030	// Top-Level parsing
				1031	//===----------------------------------------------------------------------===//
				1032
				1033	static void HandleDefinition() {
				1034	if (ParseDefinition()) {
				1035	fprintf(stderr, "Parsed a function definition.\n");
				1036	} else {
				1037	// Skip token for error recovery.
				1038	getNextToken();
				1039	}
				1040	}
				1041
				1042	static void HandleExtern() {
				1043	if (ParseExtern()) {
				1044	fprintf(stderr, "Parsed an extern\n");
				1045	} else {
				1046	// Skip token for error recovery.
				1047	getNextToken();
				1048	}
				1049	}
				1050
				1051	static void HandleTopLevelExpression() {
				1052	// Evaluate a top-level expression into an anonymous function.
				1053	if (ParseTopLevelExpr()) {
				1054	fprintf(stderr, "Parsed a top-level expr\n");
				1055	} else {
				1056	// Skip token for error recovery.
				1057	getNextToken();
				1058	}
				1059	}
				1060
				1061	/// top ::= definition \| external \| expression \| ';'
				1062	static void MainLoop() {
				1063	while (1) {
				1064	fprintf(stderr, "ready> ");
				1065	switch (CurTok) {
				1066	case tok_eof: return;
				1067	case ';': getNextToken(); break; // ignore top-level semicolons.
				1068	case tok_def: HandleDefinition(); break;
				1069	case tok_extern: HandleExtern(); break;
				1070	default: HandleTopLevelExpression(); break;
				1071	}
				1072	}
				1073	}
				1074
				1075	//===----------------------------------------------------------------------===//
				1076	// Main driver code.
				1077	//===----------------------------------------------------------------------===//
				1078
				1079	int main() {
				1080	// Install standard binary operators.
				1081	// 1 is lowest precedence.
				1082	BinopPrecedence['<'] = 10;
				1083	BinopPrecedence['+'] = 20;
				1084	BinopPrecedence['-'] = 20;
				1085	BinopPrecedence['*'] = 40; // highest.
				1086
				1087	// Prime the first token.
				1088	fprintf(stderr, "ready> ");
				1089	getNextToken();
				1090
				1091	// Run the main "interpreter loop" now.
				1092	MainLoop();
				1093
				1094	return 0;
				1095	}
				1096
				1097	`Next: Implementing Code Generation to LLVM IR <LangImpl3.html>`_
				1098