Blame - docs/tutorial/LangImpl2.html - fp2-dev/platform/external/llvm

blob: acccd20a09048426ca3dbd62e2dfaa18d4b36f7a [file] [log] [blame]

Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
				2	"http://www.w3.org/TR/html4/strict.dtd">
				3
				4	<html>
				5	<head>
				6	<title>Kaleidoscope: Implementing a Parser and AST</title>
				7	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
				8	<meta name="author" content="Chris Lattner">
				9	<link rel="stylesheet" href="../llvm.css" type="text/css">
				10	</head>
				11
				12	<body>
				13
NAKAMURA Takumi	05d0265	2011-04-18 23:59:50 +0000	[diff] [blame]	14	<h1>Kaleidoscope: Implementing a Parser and AST</h1>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	15
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	16	<ul>
Chris Lattner	0e555b1	2007-11-05 20:04:56 +0000	[diff] [blame]	17	<li><a href="index.html">Up to Tutorial Index</a></li>
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	18	<li>Chapter 2
				19	<ol>
				20	<li><a href="#intro">Chapter 2 Introduction</a></li>
				21	<li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
				22	<li><a href="#parserbasics">Parser Basics</a></li>
				23	<li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
				24	<li><a href="#parserbinops">Binary Expression Parsing</a></li>
				25	<li><a href="#parsertop">Parsing the Rest</a></li>
				26	<li><a href="#driver">The Driver</a></li>
				27	<li><a href="#conclusions">Conclusions</a></li>
				28	<li><a href="#code">Full Code Listing</a></li>
				29	</ol>
				30	</li>
Chris Lattner	0e555b1	2007-11-05 20:04:56 +0000	[diff] [blame]	31	<li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	32	</ul>
				33
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	34	<div class="doc_author">
				35	<p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
				36	</div>
				37
				38	<!-- *********************************************************************** -->
NAKAMURA Takumi	05d0265	2011-04-18 23:59:50 +0000	[diff] [blame]	39	<h2><a name="intro">Chapter 2 Introduction</a></h2>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	40	<!-- *********************************************************************** -->
				41
NAKAMURA Takumi	f5af6ad	2011-04-23 00:30:22 +0000	[diff] [blame]	42	<div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	43
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame]	44	<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	45	with LLVM</a>" tutorial. This chapter shows you how to use the lexer, built in
				46	<a href="LangImpl1.html">Chapter 1</a>, to build a full <a
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	47	href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	48	our Kaleidoscope language. Once we have a parser, we'll define and build an <a
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	49	href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
				50	Tree</a> (AST).</p>
				51
				52	<p>The parser we will build uses a combination of <a
				53	href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
				54	Parsing</a> and <a href=
				55	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	56	Parsing</a> to parse the Kaleidoscope language (the latter for
				57	binary expressions and the former for everything else). Before we get to
				58	parsing though, lets talk about the output of the parser: the Abstract Syntax
				59	Tree.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	60
				61	</div>
				62
				63	<!-- *********************************************************************** -->
NAKAMURA Takumi	05d0265	2011-04-18 23:59:50 +0000	[diff] [blame]	64	<h2><a name="ast">The Abstract Syntax Tree (AST)</a></h2>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	65	<!-- *********************************************************************** -->
				66
NAKAMURA Takumi	f5af6ad	2011-04-23 00:30:22 +0000	[diff] [blame]	67	<div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	68
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	69	<p>The AST for a program captures its behavior in such a way that it is easy for
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	70	later stages of the compiler (e.g. code generation) to interpret. We basically
				71	want one object for each construct in the language, and the AST should closely
				72	model the language. In Kaleidoscope, we have expressions, a prototype, and a
				73	function object. We'll start with expressions first:</p>
				74
				75	<div class="doc_code">
				76	<pre>
				77	/// ExprAST - Base class for all expression nodes.
				78	class ExprAST {
				79	public:
				80	virtual ~ExprAST() {}
				81	};
				82
				83	/// NumberExprAST - Expression class for numeric literals like "1.0".
				84	class NumberExprAST : public ExprAST {
				85	double Val;
				86	public:
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	87	NumberExprAST(double val) : Val(val) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	88	};
				89	</pre>
				90	</div>
				91
				92	<p>The code above shows the definition of the base ExprAST class and one
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	93	subclass which we use for numeric literals. The important thing to note about
				94	this code is that the NumberExprAST class captures the numeric value of the
				95	literal as an instance variable. This allows later phases of the compiler to
				96	know what the stored numeric value is.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	97
				98	<p>Right now we only create the AST, so there are no useful accessor methods on
				99	them. It would be very easy to add a virtual method to pretty print the code,
				100	for example. Here are the other expression AST node definitions that we'll use
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	101	in the basic form of the Kaleidoscope language:
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	102	</p>
				103
				104	<div class="doc_code">
				105	<pre>
				106	/// VariableExprAST - Expression class for referencing a variable, like "a".
				107	class VariableExprAST : public ExprAST {
				108	std::string Name;
				109	public:
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	110	VariableExprAST(const std::string &name) : Name(name) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	111	};
				112
				113	/// BinaryExprAST - Expression class for a binary operator.
				114	class BinaryExprAST : public ExprAST {
				115	char Op;
				116	ExprAST LHS, RHS;
				117	public:
				118	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				119	: Op(op), LHS(lhs), RHS(rhs) {}
				120	};
				121
				122	/// CallExprAST - Expression class for function calls.
				123	class CallExprAST : public ExprAST {
				124	std::string Callee;
				125	std::vector<ExprAST*> Args;
				126	public:
				127	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				128	: Callee(callee), Args(args) {}
				129	};
				130	</pre>
				131	</div>
				132
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	133	<p>This is all (intentionally) rather straight-forward: variables capture the
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	134	variable name, binary operators capture their opcode (e.g. '+'), and calls
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	135	capture a function name as well as a list of any argument expressions. One thing
				136	that is nice about our AST is that it captures the language features without
				137	talking about the syntax of the language. Note that there is no discussion about
				138	precedence of binary operators, lexical structure, etc.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	139
				140	<p>For our basic language, these are all of the expression nodes we'll define.
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	141	Because it doesn't have conditional control flow, it isn't Turing-complete;
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	142	we'll fix that in a later installment. The two things we need next are a way
				143	to talk about the interface to a function, and a way to talk about functions
				144	themselves:</p>
				145
				146	<div class="doc_code">
				147	<pre>
				148	/// PrototypeAST - This class represents the "prototype" for a function,
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	149	/// which captures its name, and its argument names (thus implicitly the number
				150	/// of arguments the function takes).
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	151	class PrototypeAST {
				152	std::string Name;
				153	std::vector<std::string> Args;
				154	public:
				155	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				156	: Name(name), Args(args) {}
				157	};
				158
				159	/// FunctionAST - This class represents a function definition itself.
				160	class FunctionAST {
				161	PrototypeAST *Proto;
				162	ExprAST *Body;
				163	public:
				164	FunctionAST(PrototypeAST proto, ExprAST body)
				165	: Proto(proto), Body(body) {}
				166	};
				167	</pre>
				168	</div>
				169
				170	<p>In Kaleidoscope, functions are typed with just a count of their arguments.
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	171	Since all values are double precision floating point, the type of each argument
				172	doesn't need to be stored anywhere. In a more aggressive and realistic
				173	language, the "ExprAST" class would probably have a type field.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	174
				175	<p>With this scaffolding, we can now talk about parsing expressions and function
				176	bodies in Kaleidoscope.</p>
				177
				178	</div>
				179
				180	<!-- *********************************************************************** -->
NAKAMURA Takumi	05d0265	2011-04-18 23:59:50 +0000	[diff] [blame]	181	<h2><a name="parserbasics">Parser Basics</a></h2>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	182	<!-- *********************************************************************** -->
				183
NAKAMURA Takumi	f5af6ad	2011-04-23 00:30:22 +0000	[diff] [blame]	184	<div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	185
				186	<p>Now that we have an AST to build, we need to define the parser code to build
				187	it. The idea here is that we want to parse something like "x+y" (which is
				188	returned as three tokens by the lexer) into an AST that could be generated with
				189	calls like this:</p>
				190
				191	<div class="doc_code">
				192	<pre>
				193	ExprAST *X = new VariableExprAST("x");
				194	ExprAST *Y = new VariableExprAST("y");
				195	ExprAST *Result = new BinaryExprAST('+', X, Y);
				196	</pre>
				197	</div>
				198
				199	<p>In order to do this, we'll start by defining some basic helper routines:</p>
				200
				201	<div class="doc_code">
				202	<pre>
				203	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	204	/// token the parser is looking at. getNextToken reads another token from the
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	205	/// lexer and updates CurTok with its results.
				206	static int CurTok;
				207	static int getNextToken() {
				208	return CurTok = gettok();
				209	}
				210	</pre>
				211	</div>
				212
				213	<p>
				214	This implements a simple token buffer around the lexer. This allows
				215	us to look one token ahead at what the lexer is returning. Every function in
Chris Lattner	e949512	2007-10-25 18:05:29 +0000	[diff] [blame]	216	our parser will assume that CurTok is the current token that needs to be
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	217	parsed.</p>
				218
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	219	<div class="doc_code">
				220	<pre>
				221
				222	/// Error* - These are little helper functions for error handling.
				223	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				224	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				225	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				226	</pre>
				227	</div>
				228
				229	<p>
				230	The <tt>Error</tt> routines are simple helper routines that our parser will use
				231	to handle errors. The error recovery in our parser will not be the best and
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	232	is not particular user-friendly, but it will be enough for our tutorial. These
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	233	routines make it easier to handle errors in routines that have various return
				234	types: they always return null.</p>
				235
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	236	<p>With these basic helper functions, we can implement the first
				237	piece of our grammar: numeric literals.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	238
				239	</div>
				240
				241	<!-- *********************************************************************** -->
NAKAMURA Takumi	05d0265	2011-04-18 23:59:50 +0000	[diff] [blame]	242	<h2><a name="parserprimexprs">Basic Expression Parsing</a></h2>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	243	<!-- *********************************************************************** -->
				244
NAKAMURA Takumi	f5af6ad	2011-04-23 00:30:22 +0000	[diff] [blame]	245	<div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	246
				247	<p>We start with numeric literals, because they are the simplest to process.
				248	For each production in our grammar, we'll define a function which parses that
				249	production. For numeric literals, we have:
				250	</p>
				251
				252	<div class="doc_code">
				253	<pre>
				254	/// numberexpr ::= number
				255	static ExprAST *ParseNumberExpr() {
				256	ExprAST *Result = new NumberExprAST(NumVal);
				257	getNextToken(); // consume the number
				258	return Result;
				259	}
				260	</pre>
				261	</div>
				262
				263	<p>This routine is very simple: it expects to be called when the current token
				264	is a <tt>tok_number</tt> token. It takes the current number value, creates
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	265	a <tt>NumberExprAST</tt> node, advances the lexer to the next token, and finally
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	266	returns.</p>
				267
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	268	<p>There are some interesting aspects to this. The most important one is that
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	269	this routine eats all of the tokens that correspond to the production and
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	270	returns the lexer buffer with the next token (which is not part of the grammar
				271	production) ready to go. This is a fairly standard way to go for recursive
				272	descent parsers. For a better example, the parenthesis operator is defined like
				273	this:</p>
				274
				275	<div class="doc_code">
				276	<pre>
				277	/// parenexpr ::= '(' expression ')'
				278	static ExprAST *ParseParenExpr() {
				279	getNextToken(); // eat (.
				280	ExprAST *V = ParseExpression();
				281	if (!V) return 0;
				282
				283	if (CurTok != ')')
				284	return Error("expected ')'");
				285	getNextToken(); // eat ).
				286	return V;
				287	}
				288	</pre>
				289	</div>
				290
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	291	<p>This function illustrates a number of interesting things about the
				292	parser:</p>
				293
				294	<p>
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	295	1) It shows how we use the Error routines. When called, this function expects
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	296	that the current token is a '(' token, but after parsing the subexpression, it
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	297	is possible that there is no ')' waiting. For example, if the user types in
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	298	"(4 x" instead of "(4)", the parser should emit an error. Because errors can
				299	occur, the parser needs a way to indicate that they happened: in our parser, we
				300	return null on an error.</p>
				301
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	302	<p>2) Another interesting aspect of this function is that it uses recursion by
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	303	calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	304	<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
				305	recursive grammars, and keeps each production very simple. Note that
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	306	parentheses do not cause construction of AST nodes themselves. While we could
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	307	do it this way, the most important role of parentheses are to guide the parser
				308	and provide grouping. Once the parser constructs the AST, parentheses are not
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	309	needed.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	310
				311	<p>The next simple production is for handling variable references and function
				312	calls:</p>
				313
				314	<div class="doc_code">
				315	<pre>
				316	/// identifierexpr
Chris Lattner	20a0c80	2007-11-05 17:54:34 +0000	[diff] [blame]	317	/// ::= identifier
				318	/// ::= identifier '(' expression* ')'
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	319	static ExprAST *ParseIdentifierExpr() {
				320	std::string IdName = IdentifierStr;
				321
Chris Lattner	20a0c80	2007-11-05 17:54:34 +0000	[diff] [blame]	322	getNextToken(); // eat identifier.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	323
				324	if (CurTok != '(') // Simple variable ref.
				325	return new VariableExprAST(IdName);
				326
				327	// Call.
				328	getNextToken(); // eat (
				329	std::vector<ExprAST*> Args;
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	330	if (CurTok != ')') {
				331	while (1) {
				332	ExprAST *Arg = ParseExpression();
				333	if (!Arg) return 0;
				334	Args.push_back(Arg);
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	335
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	336	if (CurTok == ')') break;
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	337
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	338	if (CurTok != ',')
Chris Lattner	6c4be9c	2008-04-14 16:44:41 +0000	[diff] [blame]	339	return Error("Expected ')' or ',' in argument list");
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	340	getNextToken();
				341	}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	342	}
				343
				344	// Eat the ')'.
				345	getNextToken();
				346
				347	return new CallExprAST(IdName, Args);
				348	}
				349	</pre>
				350	</div>
				351
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	352	<p>This routine follows the same style as the other routines. (It expects to be
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	353	called if the current token is a <tt>tok_identifier</tt> token). It also has
				354	recursion and error handling. One interesting aspect of this is that it uses
				355	<em>look-ahead</em> to determine if the current identifier is a stand alone
				356	variable reference or if it is a function call expression. It handles this by
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	357	checking to see if the token after the identifier is a '(' token, constructing
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	358	either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
				359	</p>
				360
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	361	<p>Now that we have all of our simple expression-parsing logic in place, we can
				362	define a helper function to wrap it together into one entry point. We call this
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	363	class of expressions "primary" expressions, for reasons that will become more
				364	clear <a href="LangImpl6.html#unary">later in the tutorial</a>. In order to
				365	parse an arbitrary primary expression, we need to determine what sort of
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	366	expression it is:</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	367
				368	<div class="doc_code">
				369	<pre>
				370	/// primary
				371	/// ::= identifierexpr
				372	/// ::= numberexpr
				373	/// ::= parenexpr
				374	static ExprAST *ParsePrimary() {
				375	switch (CurTok) {
				376	default: return Error("unknown token when expecting an expression");
				377	case tok_identifier: return ParseIdentifierExpr();
				378	case tok_number: return ParseNumberExpr();
				379	case '(': return ParseParenExpr();
				380	}
				381	}
				382	</pre>
				383	</div>
				384
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	385	<p>Now that you see the definition of this function, it is more obvious why we
				386	can assume the state of CurTok in the various functions. This uses look-ahead
				387	to determine which sort of expression is being inspected, and then parses it
				388	with a function call.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	389
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	390	<p>Now that basic expressions are handled, we need to handle binary expressions.
				391	They are a bit more complex.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	392
				393	</div>
				394
				395	<!-- *********************************************************************** -->
NAKAMURA Takumi	05d0265	2011-04-18 23:59:50 +0000	[diff] [blame]	396	<h2><a name="parserbinops">Binary Expression Parsing</a></h2>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	397	<!-- *********************************************************************** -->
				398
NAKAMURA Takumi	f5af6ad	2011-04-23 00:30:22 +0000	[diff] [blame]	399	<div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	400
				401	<p>Binary expressions are significantly harder to parse because they are often
				402	ambiguous. For example, when given the string "x+y*z", the parser can choose
				403	to parse it as either "(x+y)z" or "x+(yz)". With common definitions from
				404	mathematics, we expect the later parse, because "*" (multiplication) has
				405	higher <em>precedence</em> than "+" (addition).</p>
				406
				407	<p>There are many ways to handle this, but an elegant and efficient way is to
				408	use <a href=
				409	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				410	Parsing</a>. This parsing technique uses the precedence of binary operators to
				411	guide recursion. To start with, we need a table of precedences:</p>
				412
				413	<div class="doc_code">
				414	<pre>
				415	/// BinopPrecedence - This holds the precedence for each binary operator that is
				416	/// defined.
				417	static std::map<char, int> BinopPrecedence;
				418
				419	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				420	static int GetTokPrecedence() {
				421	if (!isascii(CurTok))
				422	return -1;
				423
				424	// Make sure it's a declared binop.
				425	int TokPrec = BinopPrecedence[CurTok];
				426	if (TokPrec <= 0) return -1;
				427	return TokPrec;
				428	}
				429
				430	int main() {
				431	// Install standard binary operators.
				432	// 1 is lowest precedence.
				433	BinopPrecedence['<'] = 10;
				434	BinopPrecedence['+'] = 20;
				435	BinopPrecedence['-'] = 20;
				436	BinopPrecedence['*'] = 40; // highest.
				437	...
				438	}
				439	</pre>
				440	</div>
				441
				442	<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	443	(this can obviously be extended by you, our brave and intrepid reader). The
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	444	<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
				445	or -1 if the token is not a binary operator. Having a map makes it easy to add
				446	new operators and makes it clear that the algorithm doesn't depend on the
				447	specific operators involved, but it would be easy enough to eliminate the map
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	448	and do the comparisons in the <tt>GetTokPrecedence</tt> function. (Or just use
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	449	a fixed-size array).</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	450
				451	<p>With the helper above defined, we can now start parsing binary expressions.
				452	The basic idea of operator precedence parsing is to break down an expression
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	453	with potentially ambiguous binary operators into pieces. Consider ,for example,
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	454	the expression "a+b+(c+d)ef+g". Operator precedence parsing considers this
				455	as a stream of primary expressions separated by binary operators. As such,
				456	it will first parse the leading primary expression "a", then it will see the
				457	pairs [+, b] [+, (c+d)] [, e] [, f] and [+, g]. Note that because parentheses
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	458	are primary expressions, the binary expression parser doesn't need to worry
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	459	about nested subexpressions like (c+d) at all.
				460	</p>
				461
				462	<p>
				463	To start, an expression is a primary expression potentially followed by a
				464	sequence of [binop,primaryexpr] pairs:</p>
				465
				466	<div class="doc_code">
				467	<pre>
				468	/// expression
				469	/// ::= primary binoprhs
				470	///
				471	static ExprAST *ParseExpression() {
				472	ExprAST *LHS = ParsePrimary();
				473	if (!LHS) return 0;
				474
				475	return ParseBinOpRHS(0, LHS);
				476	}
				477	</pre>
				478	</div>
				479
				480	<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	481	us. It takes a precedence and a pointer to an expression for the part that has been
				482	parsed so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	483	allowed to be empty, in which case it returns the expression that is passed into
				484	it. In our example above, the code passes the expression for "a" into
				485	<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
				486
				487	<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
				488	minimal operator precedence</em> that the function is allowed to eat. For
				489	example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
				490	passed in a precedence of 40, it will not consume any tokens (because the
				491	precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
				492	with:</p>
				493
				494	<div class="doc_code">
				495	<pre>
				496	/// binoprhs
				497	/// ::= ('+' primary)*
				498	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				499	// If this is a binop, find its precedence.
				500	while (1) {
				501	int TokPrec = GetTokPrecedence();
				502
				503	// If this is a binop that binds at least as tightly as the current binop,
				504	// consume it, otherwise we are done.
				505	if (TokPrec < ExprPrec)
				506	return LHS;
				507	</pre>
				508	</div>
				509
				510	<p>This code gets the precedence of the current token and checks to see if if is
				511	too low. Because we defined invalid tokens to have a precedence of -1, this
				512	check implicitly knows that the pair-stream ends when the token stream runs out
				513	of binary operators. If this check succeeds, we know that the token is a binary
				514	operator and that it will be included in this expression:</p>
				515
				516	<div class="doc_code">
				517	<pre>
				518	// Okay, we know this is a binop.
				519	int BinOp = CurTok;
				520	getNextToken(); // eat binop
				521
				522	// Parse the primary expression after the binary operator.
				523	ExprAST *RHS = ParsePrimary();
				524	if (!RHS) return 0;
				525	</pre>
				526	</div>
				527
				528	<p>As such, this code eats (and remembers) the binary operator and then parses
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	529	the primary expression that follows. This builds up the whole pair, the first of
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	530	which is [+, b] for the running example.</p>
				531
				532	<p>Now that we parsed the left-hand side of an expression and one pair of the
				533	RHS sequence, we have to decide which way the expression associates. In
				534	particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
				535	To determine this, we look ahead at "binop" to determine its precedence and
				536	compare it to BinOp's precedence (which is '+' in this case):</p>
				537
				538	<div class="doc_code">
				539	<pre>
				540	// If BinOp binds less tightly with RHS than the operator after RHS, let
				541	// the pending operator take RHS as its LHS.
				542	int NextPrec = GetTokPrecedence();
				543	if (TokPrec < NextPrec) {
				544	</pre>
				545	</div>
				546
				547	<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
				548	precedence of our current operator, then we know that the parentheses associate
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	549	as "(a+b) binop ...". In our example, the current operator is "+" and the next
				550	operator is "+", we know that they have the same precedence. In this case we'll
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	551	create the AST node for "a+b", and then continue parsing:</p>
				552
				553	<div class="doc_code">
				554	<pre>
				555	... if body omitted ...
				556	}
				557
				558	// Merge LHS/RHS.
				559	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				560	} // loop around to the top of the while loop.
				561	}
				562	</pre>
				563	</div>
				564
				565	<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	566	iteration of the loop, with "+" as the current token. The code above will eat,
				567	remember, and parse "(c+d)" as the primary expression, which makes the
				568	current pair equal to [+, (c+d)]. It will then evaluate the 'if' conditional above with
				569	"" as the binop to the right of the primary. In this case, the precedence of "" is
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	570	higher than the precedence of "+" so the if condition will be entered.</p>
				571
				572	<p>The critical question left here is "how can the if condition parse the right
				573	hand side in full"? In particular, to build the AST correctly for our example,
				574	it needs to get all of "(c+d)ef" as the RHS expression variable. The code to
				575	do this is surprisingly simple (code from the above two blocks duplicated for
				576	context):</p>
				577
				578	<div class="doc_code">
				579	<pre>
				580	// If BinOp binds less tightly with RHS than the operator after RHS, let
				581	// the pending operator take RHS as its LHS.
				582	int NextPrec = GetTokPrecedence();
				583	if (TokPrec < NextPrec) {
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	584	<b>RHS = ParseBinOpRHS(TokPrec+1, RHS);
				585	if (RHS == 0) return 0;</b>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	586	}
				587	// Merge LHS/RHS.
				588	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				589	} // loop around to the top of the while loop.
				590	}
				591	</pre>
				592	</div>
				593
				594	<p>At this point, we know that the binary operator to the RHS of our primary
				595	has higher precedence than the binop we are currently parsing. As such, we know
				596	that any sequence of pairs whose operators are all higher precedence than "+"
				597	should be parsed together and returned as "RHS". To do this, we recursively
				598	invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
				599	precedence required for it to continue. In our example above, this will cause
				600	it to return the AST node for "(c+d)ef" as RHS, which is then set as the RHS
				601	of the '+' expression.</p>
				602
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	603	<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	604	and added to the AST. With this little bit of code (14 non-trivial lines), we
				605	correctly handle fully general binary expression parsing in a very elegant way.
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	606	This was a whirlwind tour of this code, and it is somewhat subtle. I recommend
				607	running through it with a few tough examples to see how it works.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	608	</p>
				609
				610	<p>This wraps up handling of expressions. At this point, we can point the
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	611	parser at an arbitrary token stream and build an expression from it, stopping
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	612	at the first token that is not part of the expression. Next up we need to
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	613	handle function definitions, etc.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	614
				615	</div>
				616
				617	<!-- *********************************************************************** -->
NAKAMURA Takumi	05d0265	2011-04-18 23:59:50 +0000	[diff] [blame]	618	<h2><a name="parsertop">Parsing the Rest</a></h2>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	619	<!-- *********************************************************************** -->
				620
NAKAMURA Takumi	f5af6ad	2011-04-23 00:30:22 +0000	[diff] [blame]	621	<div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	622
				623	<p>
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	624	The next thing missing is handling of function prototypes. In Kaleidoscope,
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	625	these are used both for 'extern' function declarations as well as function body
				626	definitions. The code to do this is straight-forward and not very interesting
				627	(once you've survived expressions):
				628	</p>
				629
				630	<div class="doc_code">
				631	<pre>
				632	/// prototype
				633	/// ::= id '(' id* ')'
				634	static PrototypeAST *ParsePrototype() {
				635	if (CurTok != tok_identifier)
				636	return ErrorP("Expected function name in prototype");
				637
				638	std::string FnName = IdentifierStr;
				639	getNextToken();
				640
				641	if (CurTok != '(')
				642	return ErrorP("Expected '(' in prototype");
				643
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	644	// Read the list of argument names.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	645	std::vector<std::string> ArgNames;
				646	while (getNextToken() == tok_identifier)
				647	ArgNames.push_back(IdentifierStr);
				648	if (CurTok != ')')
				649	return ErrorP("Expected ')' in prototype");
				650
				651	// success.
				652	getNextToken(); // eat ')'.
				653
				654	return new PrototypeAST(FnName, ArgNames);
				655	}
				656	</pre>
				657	</div>
				658
				659	<p>Given this, a function definition is very simple, just a prototype plus
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	660	an expression to implement the body:</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	661
				662	<div class="doc_code">
				663	<pre>
				664	/// definition ::= 'def' prototype expression
				665	static FunctionAST *ParseDefinition() {
				666	getNextToken(); // eat def.
				667	PrototypeAST *Proto = ParsePrototype();
				668	if (Proto == 0) return 0;
				669
				670	if (ExprAST *E = ParseExpression())
				671	return new FunctionAST(Proto, E);
				672	return 0;
				673	}
				674	</pre>
				675	</div>
				676
				677	<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	678	well as to support forward declaration of user functions. These 'extern's are just
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	679	prototypes with no body:</p>
				680
				681	<div class="doc_code">
				682	<pre>
				683	/// external ::= 'extern' prototype
				684	static PrototypeAST *ParseExtern() {
				685	getNextToken(); // eat extern.
				686	return ParsePrototype();
				687	}
				688	</pre>
				689	</div>
				690
				691	<p>Finally, we'll also let the user type in arbitrary top-level expressions and
				692	evaluate them on the fly. We will handle this by defining anonymous nullary
				693	(zero argument) functions for them:</p>
				694
				695	<div class="doc_code">
				696	<pre>
				697	/// toplevelexpr ::= expression
				698	static FunctionAST *ParseTopLevelExpr() {
				699	if (ExprAST *E = ParseExpression()) {
				700	// Make an anonymous proto.
				701	PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
				702	return new FunctionAST(Proto, E);
				703	}
				704	return 0;
				705	}
				706	</pre>
				707	</div>
				708
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	709	<p>Now that we have all the pieces, let's build a little driver that will let us
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	710	actually <em>execute</em> this code we've built!</p>
				711
				712	</div>
				713
				714	<!-- *********************************************************************** -->
NAKAMURA Takumi	05d0265	2011-04-18 23:59:50 +0000	[diff] [blame]	715	<h2><a name="driver">The Driver</a></h2>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	716	<!-- *********************************************************************** -->
				717
NAKAMURA Takumi	f5af6ad	2011-04-23 00:30:22 +0000	[diff] [blame]	718	<div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	719
				720	<p>The driver for this simply invokes all of the parsing pieces with a top-level
				721	dispatch loop. There isn't much interesting here, so I'll just include the
				722	top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
				723	Parsing" section.</p>
				724
				725	<div class="doc_code">
				726	<pre>
				727	/// top ::= definition \| external \| expression \| ';'
				728	static void MainLoop() {
				729	while (1) {
				730	fprintf(stderr, "ready> ");
				731	switch (CurTok) {
				732	case tok_eof: return;
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	733	case ';': getNextToken(); break; // ignore top-level semicolons.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	734	case tok_def: HandleDefinition(); break;
				735	case tok_extern: HandleExtern(); break;
				736	default: HandleTopLevelExpression(); break;
				737	}
				738	}
				739	}
				740	</pre>
				741	</div>
				742
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	743	<p>The most interesting part of this is that we ignore top-level semicolons.
Owen Anderson	c6311b9	2007-10-22 06:48:28 +0000	[diff] [blame]	744	Why is this, you ask? The basic reason is that if you type "4 + 5" at the
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	745	command line, the parser doesn't know whether that is the end of what you will type
				746	or not. For example, on the next line you could type "def foo..." in which case
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	747	4+5 is the end of a top-level expression. Alternatively you could type "* 6",
				748	which would continue the expression. Having top-level semicolons allows you to
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	749	type "4+5;", and the parser will know you are done.</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	750
				751	</div>
				752
				753	<!-- *********************************************************************** -->
NAKAMURA Takumi	05d0265	2011-04-18 23:59:50 +0000	[diff] [blame]	754	<h2><a name="conclusions">Conclusions</a></h2>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	755	<!-- *********************************************************************** -->
				756
NAKAMURA Takumi	f5af6ad	2011-04-23 00:30:22 +0000	[diff] [blame]	757	<div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	758
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	759	<p>With just under 400 lines of commented code (240 lines of non-comment,
				760	non-blank code), we fully defined our minimal language, including a lexer,
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	761	parser, and AST builder. With this done, the executable will validate
				762	Kaleidoscope code and tell us if it is grammatically invalid. For
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	763	example, here is a sample interaction:</p>
				764
				765	<div class="doc_code">
				766	<pre>
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	767	$ <b>./a.out</b>
				768	ready> <b>def foo(x y) x+foo(y, 4.0);</b>
				769	Parsed a function definition.
				770	ready> <b>def foo(x y) x+y y;</b>
				771	Parsed a function definition.
				772	Parsed a top-level expr
				773	ready> <b>def foo(x y) x+y );</b>
				774	Parsed a function definition.
				775	Error: unknown token when expecting an expression
				776	ready> <b>extern sin(a);</b>
Chris Lattner	35abbf5	2007-10-23 06:23:57 +0000	[diff] [blame]	777	ready> Parsed an extern
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	778	ready> <b>^D</b>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	779	$
				780	</pre>
				781	</div>
				782
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	783	<p>There is a lot of room for extension here. You can define new AST nodes,
				784	extend the language in many ways, etc. In the <a href="LangImpl3.html">next
Chris Lattner	bd779a3	2007-11-06 18:13:32 +0000	[diff] [blame]	785	installment</a>, we will describe how to generate LLVM Intermediate
				786	Representation (IR) from the AST.</p>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	787
				788	</div>
				789
				790	<!-- *********************************************************************** -->
NAKAMURA Takumi	05d0265	2011-04-18 23:59:50 +0000	[diff] [blame]	791	<h2><a name="code">Full Code Listing</a></h2>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	792	<!-- *********************************************************************** -->
				793
NAKAMURA Takumi	f5af6ad	2011-04-23 00:30:22 +0000	[diff] [blame]	794	<div>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	795
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	796	<p>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	797	Here is the complete code listing for this and the previous chapter.
				798	Note that it is fully self-contained: you don't need LLVM or any external
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	799	libraries at all for this. (Besides the C and C++ standard libraries, of
				800	course.) To build this, just compile with:</p>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	801
				802	<div class="doc_code">
				803	<pre>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	804	# Compile
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	805	g++ -g -O3 toy.cpp
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	806	# Run
				807	./a.out
				808	</pre>
				809	</div>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	810
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	811	<p>Here is the code:</p>
				812
				813	<div class="doc_code">
				814	<pre>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	815	#include <cstdio>
John McCall	46c8181	2009-08-17 21:07:37 +0000	[diff] [blame]	816	#include <cstdlib>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	817	#include <string>
Chris Lattner	d93a584	2007-10-23 05:43:01 +0000	[diff] [blame]	818	#include <map>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	819	#include <vector>
				820
				821	//===----------------------------------------------------------------------===//
				822	// Lexer
				823	//===----------------------------------------------------------------------===//
				824
				825	// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
				826	// of these for known things.
				827	enum Token {
				828	tok_eof = -1,
				829
				830	// commands
				831	tok_def = -2, tok_extern = -3,
				832
				833	// primary
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	834	tok_identifier = -4, tok_number = -5
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	835	};
				836
				837	static std::string IdentifierStr; // Filled in if tok_identifier
				838	static double NumVal; // Filled in if tok_number
				839
				840	/// gettok - Return the next token from standard input.
				841	static int gettok() {
				842	static int LastChar = ' ';
				843
				844	// Skip any whitespace.
				845	while (isspace(LastChar))
				846	LastChar = getchar();
				847
				848	if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
				849	IdentifierStr = LastChar;
				850	while (isalnum((LastChar = getchar())))
				851	IdentifierStr += LastChar;
				852
				853	if (IdentifierStr == "def") return tok_def;
				854	if (IdentifierStr == "extern") return tok_extern;
				855	return tok_identifier;
				856	}
				857
				858	if (isdigit(LastChar) \|\| LastChar == '.') { // Number: [0-9.]+
				859	std::string NumStr;
				860	do {
				861	NumStr += LastChar;
				862	LastChar = getchar();
				863	} while (isdigit(LastChar) \|\| LastChar == '.');
				864
				865	NumVal = strtod(NumStr.c_str(), 0);
				866	return tok_number;
				867	}
				868
				869	if (LastChar == '#') {
				870	// Comment until end of line.
				871	do LastChar = getchar();
Chris Lattner	c80c23f	2007-12-02 22:46:01 +0000	[diff] [blame]	872	while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	873
				874	if (LastChar != EOF)
				875	return gettok();
				876	}
				877
				878	// Check for end of file. Don't eat the EOF.
				879	if (LastChar == EOF)
				880	return tok_eof;
				881
				882	// Otherwise, just return the character as its ascii value.
				883	int ThisChar = LastChar;
				884	LastChar = getchar();
				885	return ThisChar;
				886	}
				887
				888	//===----------------------------------------------------------------------===//
				889	// Abstract Syntax Tree (aka Parse Tree)
				890	//===----------------------------------------------------------------------===//
				891
				892	/// ExprAST - Base class for all expression nodes.
				893	class ExprAST {
				894	public:
				895	virtual ~ExprAST() {}
				896	};
				897
				898	/// NumberExprAST - Expression class for numeric literals like "1.0".
				899	class NumberExprAST : public ExprAST {
				900	double Val;
				901	public:
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	902	NumberExprAST(double val) : Val(val) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	903	};
				904
				905	/// VariableExprAST - Expression class for referencing a variable, like "a".
				906	class VariableExprAST : public ExprAST {
				907	std::string Name;
				908	public:
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	909	VariableExprAST(const std::string &name) : Name(name) {}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	910	};
				911
				912	/// BinaryExprAST - Expression class for a binary operator.
				913	class BinaryExprAST : public ExprAST {
				914	char Op;
				915	ExprAST LHS, RHS;
				916	public:
				917	BinaryExprAST(char op, ExprAST lhs, ExprAST rhs)
				918	: Op(op), LHS(lhs), RHS(rhs) {}
				919	};
				920
				921	/// CallExprAST - Expression class for function calls.
				922	class CallExprAST : public ExprAST {
				923	std::string Callee;
				924	std::vector<ExprAST*> Args;
				925	public:
				926	CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
				927	: Callee(callee), Args(args) {}
				928	};
				929
				930	/// PrototypeAST - This class represents the "prototype" for a function,
Chris Lattner	cde1d9d	2007-11-06 07:16:22 +0000	[diff] [blame]	931	/// which captures its name, and its argument names (thus implicitly the number
				932	/// of arguments the function takes).
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	933	class PrototypeAST {
				934	std::string Name;
Chris Lattner	5ea8ef8	2008-02-22 17:09:39 +0000	[diff] [blame]	935	std::vector<std::string> Args;
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	936	public:
				937	PrototypeAST(const std::string &name, const std::vector<std::string> &args)
				938	: Name(name), Args(args) {}
				939
				940	};
				941
				942	/// FunctionAST - This class represents a function definition itself.
				943	class FunctionAST {
				944	PrototypeAST *Proto;
				945	ExprAST *Body;
				946	public:
				947	FunctionAST(PrototypeAST proto, ExprAST body)
				948	: Proto(proto), Body(body) {}
				949
				950	};
				951
				952	//===----------------------------------------------------------------------===//
				953	// Parser
				954	//===----------------------------------------------------------------------===//
				955
				956	/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	957	/// token the parser is looking at. getNextToken reads another token from the
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	958	/// lexer and updates CurTok with its results.
				959	static int CurTok;
				960	static int getNextToken() {
				961	return CurTok = gettok();
				962	}
				963
				964	/// BinopPrecedence - This holds the precedence for each binary operator that is
				965	/// defined.
				966	static std::map<char, int> BinopPrecedence;
				967
				968	/// GetTokPrecedence - Get the precedence of the pending binary operator token.
				969	static int GetTokPrecedence() {
				970	if (!isascii(CurTok))
				971	return -1;
				972
				973	// Make sure it's a declared binop.
				974	int TokPrec = BinopPrecedence[CurTok];
				975	if (TokPrec <= 0) return -1;
				976	return TokPrec;
				977	}
				978
				979	/// Error* - These are little helper functions for error handling.
				980	ExprAST Error(const char Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
				981	PrototypeAST ErrorP(const char Str) { Error(Str); return 0; }
				982	FunctionAST ErrorF(const char Str) { Error(Str); return 0; }
				983
				984	static ExprAST *ParseExpression();
				985
				986	/// identifierexpr
Chris Lattner	20a0c80	2007-11-05 17:54:34 +0000	[diff] [blame]	987	/// ::= identifier
				988	/// ::= identifier '(' expression* ')'
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	989	static ExprAST *ParseIdentifierExpr() {
				990	std::string IdName = IdentifierStr;
				991
Chris Lattner	20a0c80	2007-11-05 17:54:34 +0000	[diff] [blame]	992	getNextToken(); // eat identifier.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	993
				994	if (CurTok != '(') // Simple variable ref.
				995	return new VariableExprAST(IdName);
				996
				997	// Call.
				998	getNextToken(); // eat (
				999	std::vector<ExprAST*> Args;
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	1000	if (CurTok != ')') {
				1001	while (1) {
				1002	ExprAST *Arg = ParseExpression();
				1003	if (!Arg) return 0;
				1004	Args.push_back(Arg);
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1005
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	1006	if (CurTok == ')') break;
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1007
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	1008	if (CurTok != ',')
Chris Lattner	6c4be9c	2008-04-14 16:44:41 +0000	[diff] [blame]	1009	return Error("Expected ')' or ',' in argument list");
Chris Lattner	7115521	2007-11-06 01:39:12 +0000	[diff] [blame]	1010	getNextToken();
				1011	}
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1012	}
				1013
				1014	// Eat the ')'.
				1015	getNextToken();
				1016
				1017	return new CallExprAST(IdName, Args);
				1018	}
				1019
				1020	/// numberexpr ::= number
				1021	static ExprAST *ParseNumberExpr() {
				1022	ExprAST *Result = new NumberExprAST(NumVal);
				1023	getNextToken(); // consume the number
				1024	return Result;
				1025	}
				1026
				1027	/// parenexpr ::= '(' expression ')'
				1028	static ExprAST *ParseParenExpr() {
				1029	getNextToken(); // eat (.
				1030	ExprAST *V = ParseExpression();
				1031	if (!V) return 0;
				1032
				1033	if (CurTok != ')')
				1034	return Error("expected ')'");
				1035	getNextToken(); // eat ).
				1036	return V;
				1037	}
				1038
				1039	/// primary
				1040	/// ::= identifierexpr
				1041	/// ::= numberexpr
				1042	/// ::= parenexpr
				1043	static ExprAST *ParsePrimary() {
				1044	switch (CurTok) {
				1045	default: return Error("unknown token when expecting an expression");
				1046	case tok_identifier: return ParseIdentifierExpr();
				1047	case tok_number: return ParseNumberExpr();
				1048	case '(': return ParseParenExpr();
				1049	}
				1050	}
				1051
				1052	/// binoprhs
				1053	/// ::= ('+' primary)*
				1054	static ExprAST ParseBinOpRHS(int ExprPrec, ExprAST LHS) {
				1055	// If this is a binop, find its precedence.
				1056	while (1) {
				1057	int TokPrec = GetTokPrecedence();
				1058
				1059	// If this is a binop that binds at least as tightly as the current binop,
				1060	// consume it, otherwise we are done.
				1061	if (TokPrec < ExprPrec)
				1062	return LHS;
				1063
				1064	// Okay, we know this is a binop.
				1065	int BinOp = CurTok;
				1066	getNextToken(); // eat binop
				1067
				1068	// Parse the primary expression after the binary operator.
				1069	ExprAST *RHS = ParsePrimary();
				1070	if (!RHS) return 0;
				1071
				1072	// If BinOp binds less tightly with RHS than the operator after RHS, let
				1073	// the pending operator take RHS as its LHS.
				1074	int NextPrec = GetTokPrecedence();
				1075	if (TokPrec < NextPrec) {
				1076	RHS = ParseBinOpRHS(TokPrec+1, RHS);
				1077	if (RHS == 0) return 0;
				1078	}
				1079
				1080	// Merge LHS/RHS.
				1081	LHS = new BinaryExprAST(BinOp, LHS, RHS);
				1082	}
				1083	}
				1084
				1085	/// expression
				1086	/// ::= primary binoprhs
				1087	///
				1088	static ExprAST *ParseExpression() {
				1089	ExprAST *LHS = ParsePrimary();
				1090	if (!LHS) return 0;
				1091
				1092	return ParseBinOpRHS(0, LHS);
				1093	}
				1094
				1095	/// prototype
				1096	/// ::= id '(' id* ')'
				1097	static PrototypeAST *ParsePrototype() {
				1098	if (CurTok != tok_identifier)
				1099	return ErrorP("Expected function name in prototype");
				1100
				1101	std::string FnName = IdentifierStr;
				1102	getNextToken();
				1103
				1104	if (CurTok != '(')
				1105	return ErrorP("Expected '(' in prototype");
				1106
				1107	std::vector<std::string> ArgNames;
				1108	while (getNextToken() == tok_identifier)
				1109	ArgNames.push_back(IdentifierStr);
				1110	if (CurTok != ')')
				1111	return ErrorP("Expected ')' in prototype");
				1112
				1113	// success.
				1114	getNextToken(); // eat ')'.
				1115
				1116	return new PrototypeAST(FnName, ArgNames);
				1117	}
				1118
				1119	/// definition ::= 'def' prototype expression
				1120	static FunctionAST *ParseDefinition() {
				1121	getNextToken(); // eat def.
				1122	PrototypeAST *Proto = ParsePrototype();
				1123	if (Proto == 0) return 0;
				1124
				1125	if (ExprAST *E = ParseExpression())
				1126	return new FunctionAST(Proto, E);
				1127	return 0;
				1128	}
				1129
				1130	/// toplevelexpr ::= expression
				1131	static FunctionAST *ParseTopLevelExpr() {
				1132	if (ExprAST *E = ParseExpression()) {
				1133	// Make an anonymous proto.
Chris Lattner	5ea8ef8	2008-02-22 17:09:39 +0000	[diff] [blame]	1134	PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1135	return new FunctionAST(Proto, E);
				1136	}
				1137	return 0;
				1138	}
				1139
				1140	/// external ::= 'extern' prototype
				1141	static PrototypeAST *ParseExtern() {
				1142	getNextToken(); // eat extern.
				1143	return ParsePrototype();
				1144	}
				1145
				1146	//===----------------------------------------------------------------------===//
				1147	// Top-Level parsing
				1148	//===----------------------------------------------------------------------===//
				1149
				1150	static void HandleDefinition() {
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1151	if (ParseDefinition()) {
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1152	fprintf(stderr, "Parsed a function definition.\n");
				1153	} else {
				1154	// Skip token for error recovery.
				1155	getNextToken();
				1156	}
				1157	}
				1158
				1159	static void HandleExtern() {
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1160	if (ParseExtern()) {
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1161	fprintf(stderr, "Parsed an extern\n");
				1162	} else {
				1163	// Skip token for error recovery.
				1164	getNextToken();
				1165	}
				1166	}
				1167
				1168	static void HandleTopLevelExpression() {
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	1169	// Evaluate a top-level expression into an anonymous function.
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1170	if (ParseTopLevelExpr()) {
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1171	fprintf(stderr, "Parsed a top-level expr\n");
				1172	} else {
				1173	// Skip token for error recovery.
				1174	getNextToken();
				1175	}
				1176	}
				1177
				1178	/// top ::= definition \| external \| expression \| ';'
				1179	static void MainLoop() {
				1180	while (1) {
				1181	fprintf(stderr, "ready> ");
				1182	switch (CurTok) {
				1183	case tok_eof: return;
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	1184	case ';': getNextToken(); break; // ignore top-level semicolons.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1185	case tok_def: HandleDefinition(); break;
				1186	case tok_extern: HandleExtern(); break;
				1187	default: HandleTopLevelExpression(); break;
				1188	}
				1189	}
				1190	}
				1191
				1192	//===----------------------------------------------------------------------===//
				1193	// Main driver code.
				1194	//===----------------------------------------------------------------------===//
				1195
				1196	int main() {
				1197	// Install standard binary operators.
				1198	// 1 is lowest precedence.
				1199	BinopPrecedence['<'] = 10;
				1200	BinopPrecedence['+'] = 20;
				1201	BinopPrecedence['-'] = 20;
				1202	BinopPrecedence['*'] = 40; // highest.
				1203
				1204	// Prime the first token.
				1205	fprintf(stderr, "ready> ");
				1206	getNextToken();
				1207
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1208	// Run the main "interpreter loop" now.
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1209	MainLoop();
Erick Tryzelaar	fd1ec5e	2009-09-22 21:14:49 +0000	[diff] [blame]	1210
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1211	return 0;
				1212	}
				1213	</pre>
				1214	</div>
Chris Lattner	729eb14	2008-02-10 19:11:04 +0000	[diff] [blame]	1215	<a href="LangImpl3.html">Next: Implementing Code Generation to LLVM IR</a>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1216	</div>
				1217
				1218	<!-- *********************************************************************** -->
				1219	<hr>
				1220	<address>
				1221	<a href="http://jigsaw.w3.org/css-validator/check/referer"><img
				1222	src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
				1223	<a href="http://validator.w3.org/check/referer"><img
Chris Lattner	8eef4b2	2007-10-23 06:30:50 +0000	[diff] [blame]	1224	src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1225
				1226	<a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
NAKAMURA Takumi	b9a3363	2011-04-09 02:13:37 +0000	[diff] [blame]	1227	<a href="http://llvm.org/">The LLVM Compiler Infrastructure</a><br>
Dan Gohman	523e392	2010-02-03 17:27:31 +0000	[diff] [blame]	1228	Last modified: $Date$
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	1229	</address>
				1230	</body>
				1231	</html>