Blame - llvm/docs/tutorial/OCamlLangImpl2.html - toolchain/llvm-project

blob: 20e006d97c8dd15956600bf842ba8526ada3055b [file] [log] [blame]

Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
				2	"http://www.w3.org/TR/html4/strict.dtd">
				3
				4	<html>
				5	<head>
				6	<title>Kaleidoscope: Implementing a Parser and AST</title>
				7	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
				8	<meta name="author" content="Chris Lattner">
				9	<meta name="author" content="Erick Tryzelaar">
				10	<link rel="stylesheet" href="../llvm.css" type="text/css">
				11	</head>
				12
				13	<body>
				14
NAKAMURA Takumi	fc8d930	2011-04-18 23:59:50 +0000	[diff] [blame]	15	<h1>Kaleidoscope: Implementing a Parser and AST</h1>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	16
				17	<ul>
				18	<li><a href="index.html">Up to Tutorial Index</a></li>
				19	<li>Chapter 2
				20	<ol>
				21	<li><a href="#intro">Chapter 2 Introduction</a></li>
				22	<li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
				23	<li><a href="#parserbasics">Parser Basics</a></li>
				24	<li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
				25	<li><a href="#parserbinops">Binary Expression Parsing</a></li>
				26	<li><a href="#parsertop">Parsing the Rest</a></li>
				27	<li><a href="#driver">The Driver</a></li>
				28	<li><a href="#conclusions">Conclusions</a></li>
				29	<li><a href="#code">Full Code Listing</a></li>
				30	</ol>
				31	</li>
				32	<li><a href="OCamlLangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
				33	</ul>
				34
				35	<div class="doc_author">
				36	<p>
				37	Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a>
				38	and <a href="mailto:idadesub@users.sourceforge.net">Erick Tryzelaar</a>
				39	</p>
				40	</div>
				41
				42	<!-- *********************************************************************** -->
NAKAMURA Takumi	fc8d930	2011-04-18 23:59:50 +0000	[diff] [blame]	43	<h2><a name="intro">Chapter 2 Introduction</a></h2>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	44	<!-- *********************************************************************** -->
				45
NAKAMURA Takumi	aa3d624	2011-04-23 00:30:22 +0000	[diff] [blame]	46	<div>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	47
				48	<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
				49	with LLVM in Objective Caml</a>" tutorial. This chapter shows you how to use
				50	the lexer, built in <a href="OCamlLangImpl1.html">Chapter 1</a>, to build a
				51	full <a href="http://en.wikipedia.org/wiki/Parsing">parser</a> for our
				52	Kaleidoscope language. Once we have a parser, we'll define and build an <a
				53	href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
				54	Tree</a> (AST).</p>
				55
				56	<p>The parser we will build uses a combination of <a
				57	href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
				58	Parsing</a> and <a href=
				59	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				60	Parsing</a> to parse the Kaleidoscope language (the latter for
				61	binary expressions and the former for everything else). Before we get to
				62	parsing though, lets talk about the output of the parser: the Abstract Syntax
				63	Tree.</p>
				64
				65	</div>
				66
				67	<!-- *********************************************************************** -->
NAKAMURA Takumi	fc8d930	2011-04-18 23:59:50 +0000	[diff] [blame]	68	<h2><a name="ast">The Abstract Syntax Tree (AST)</a></h2>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	69	<!-- *********************************************************************** -->
				70
NAKAMURA Takumi	aa3d624	2011-04-23 00:30:22 +0000	[diff] [blame]	71	<div>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	72
				73	<p>The AST for a program captures its behavior in such a way that it is easy for
				74	later stages of the compiler (e.g. code generation) to interpret. We basically
				75	want one object for each construct in the language, and the AST should closely
				76	model the language. In Kaleidoscope, we have expressions, a prototype, and a
				77	function object. We'll start with expressions first:</p>
				78
				79	<div class="doc_code">
				80	<pre>
				81	(* expr - Base type for all expression nodes. *)
				82	type expr =
				83	(* variant for numeric literals like "1.0". *)
				84	\| Number of float
				85	</pre>
				86	</div>
				87
				88	<p>The code above shows the definition of the base ExprAST class and one
				89	subclass which we use for numeric literals. The important thing to note about
				90	this code is that the Number variant captures the numeric value of the
				91	literal as an instance variable. This allows later phases of the compiler to
				92	know what the stored numeric value is.</p>
				93
				94	<p>Right now we only create the AST, so there are no useful functions on
				95	them. It would be very easy to add a function to pretty print the code,
				96	for example. Here are the other expression AST node definitions that we'll use
				97	in the basic form of the Kaleidoscope language:
				98	</p>
				99
				100	<div class="doc_code">
				101	<pre>
				102	(* variant for referencing a variable, like "a". *)
				103	\| Variable of string
				104
				105	(* variant for a binary operator. *)
				106	\| Binary of char * expr * expr
				107
				108	(* variant for function calls. *)
				109	\| Call of string * expr array
				110	</pre>
				111	</div>
				112
				113	<p>This is all (intentionally) rather straight-forward: variables capture the
				114	variable name, binary operators capture their opcode (e.g. '+'), and calls
				115	capture a function name as well as a list of any argument expressions. One thing
				116	that is nice about our AST is that it captures the language features without
				117	talking about the syntax of the language. Note that there is no discussion about
				118	precedence of binary operators, lexical structure, etc.</p>
				119
				120	<p>For our basic language, these are all of the expression nodes we'll define.
				121	Because it doesn't have conditional control flow, it isn't Turing-complete;
				122	we'll fix that in a later installment. The two things we need next are a way
				123	to talk about the interface to a function, and a way to talk about functions
				124	themselves:</p>
				125
				126	<div class="doc_code">
				127	<pre>
				128	(* proto - This type represents the "prototype" for a function, which captures
				129	* its name, and its argument names (thus implicitly the number of arguments the
				130	* function takes). *)
				131	type proto = Prototype of string * string array
				132
				133	(* func - This type represents a function definition itself. *)
				134	type func = Function of proto * expr
				135	</pre>
				136	</div>
				137
				138	<p>In Kaleidoscope, functions are typed with just a count of their arguments.
				139	Since all values are double precision floating point, the type of each argument
				140	doesn't need to be stored anywhere. In a more aggressive and realistic
				141	language, the "expr" variants would probably have a type field.</p>
				142
				143	<p>With this scaffolding, we can now talk about parsing expressions and function
				144	bodies in Kaleidoscope.</p>
				145
				146	</div>
				147
				148	<!-- *********************************************************************** -->
NAKAMURA Takumi	fc8d930	2011-04-18 23:59:50 +0000	[diff] [blame]	149	<h2><a name="parserbasics">Parser Basics</a></h2>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	150	<!-- *********************************************************************** -->
				151
NAKAMURA Takumi	aa3d624	2011-04-23 00:30:22 +0000	[diff] [blame]	152	<div>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	153
				154	<p>Now that we have an AST to build, we need to define the parser code to build
				155	it. The idea here is that we want to parse something like "x+y" (which is
				156	returned as three tokens by the lexer) into an AST that could be generated with
				157	calls like this:</p>
				158
				159	<div class="doc_code">
				160	<pre>
				161	let x = Variable "x" in
				162	let y = Variable "y" in
				163	let result = Binary ('+', x, y) in
				164	...
				165	</pre>
				166	</div>
				167
				168	<p>
				169	The error handling routines make use of the builtin <tt>Stream.Failure</tt> and
				170	<tt>Stream.Error</tt>s. <tt>Stream.Failure</tt> is raised when the parser is
				171	unable to find any matching token in the first position of a pattern.
				172	<tt>Stream.Error</tt> is raised when the first token matches, but the rest do
				173	not. The error recovery in our parser will not be the best and is not
				174	particular user-friendly, but it will be enough for our tutorial. These
				175	exceptions make it easier to handle errors in routines that have various return
				176	types.</p>
				177
				178	<p>With these basic types and exceptions, we can implement the first
				179	piece of our grammar: numeric literals.</p>
				180
				181	</div>
				182
				183	<!-- *********************************************************************** -->
NAKAMURA Takumi	fc8d930	2011-04-18 23:59:50 +0000	[diff] [blame]	184	<h2><a name="parserprimexprs">Basic Expression Parsing</a></h2>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	185	<!-- *********************************************************************** -->
				186
NAKAMURA Takumi	aa3d624	2011-04-23 00:30:22 +0000	[diff] [blame]	187	<div>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	188
				189	<p>We start with numeric literals, because they are the simplest to process.
				190	For each production in our grammar, we'll define a function which parses that
				191	production. We call this class of expressions "primary" expressions, for
				192	reasons that will become more clear <a href="OCamlLangImpl6.html#unary">
				193	later in the tutorial</a>. In order to parse an arbitrary primary expression,
				194	we need to determine what sort of expression it is. For numeric literals, we
				195	have:</p>
				196
				197	<div class="doc_code">
				198	<pre>
				199	(* primary
				200	* ::= identifier
				201	* ::= numberexpr
				202	* ::= parenexpr *)
				203	parse_primary = parser
				204	(* numberexpr ::= number *)
				205	\| [< 'Token.Number n >] -> Ast.Number n
				206	</pre>
				207	</div>
				208
				209	<p>This routine is very simple: it expects to be called when the current token
				210	is a <tt>Token.Number</tt> token. It takes the current number value, creates
				211	a <tt>Ast.Number</tt> node, advances the lexer to the next token, and finally
				212	returns.</p>
				213
				214	<p>There are some interesting aspects to this. The most important one is that
				215	this routine eats all of the tokens that correspond to the production and
				216	returns the lexer buffer with the next token (which is not part of the grammar
				217	production) ready to go. This is a fairly standard way to go for recursive
				218	descent parsers. For a better example, the parenthesis operator is defined like
				219	this:</p>
				220
				221	<div class="doc_code">
				222	<pre>
				223	(* parenexpr ::= '(' expression ')' *)
				224	\| [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
				225	</pre>
				226	</div>
				227
				228	<p>This function illustrates a number of interesting things about the
				229	parser:</p>
				230
				231	<p>
				232	1) It shows how we use the <tt>Stream.Error</tt> exception. When called, this
				233	function expects that the current token is a '(' token, but after parsing the
				234	subexpression, it is possible that there is no ')' waiting. For example, if
				235	the user types in "(4 x" instead of "(4)", the parser should emit an error.
				236	Because errors can occur, the parser needs a way to indicate that they
				237	happened. In our parser, we use the camlp4 shortcut syntax <tt>token ?? "parse
				238	error"</tt>, where if the token before the <tt>??</tt> does not match, then
				239	<tt>Stream.Error "parse error"</tt> will be raised.</p>
				240
				241	<p>2) Another interesting aspect of this function is that it uses recursion by
Erick Tryzelaar	a2eba03	2008-03-30 19:14:31 +0000	[diff] [blame]	242	calling <tt>Parser.parse_primary</tt> (we will soon see that
				243	<tt>Parser.parse_primary</tt> can call <tt>Parser.parse_primary</tt>). This is
				244	powerful because it allows us to handle recursive grammars, and keeps each
				245	production very simple. Note that parentheses do not cause construction of AST
				246	nodes themselves. While we could do it this way, the most important role of
				247	parentheses are to guide the parser and provide grouping. Once the parser
				248	constructs the AST, parentheses are not needed.</p>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	249
				250	<p>The next simple production is for handling variable references and function
				251	calls:</p>
				252
				253	<div class="doc_code">
				254	<pre>
				255	(* identifierexpr
				256	* ::= identifier
				257	* ::= identifier '(' argumentexpr ')' *)
				258	\| [< 'Token.Ident id; stream >] ->
				259	let rec parse_args accumulator = parser
				260	\| [< e=parse_expr; stream >] ->
				261	begin parser
				262	\| [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
				263	\| [< >] -> e :: accumulator
				264	end stream
				265	\| [< >] -> accumulator
				266	in
				267	let rec parse_ident id = parser
				268	(* Call. *)
				269	\| [< 'Token.Kwd '(';
				270	args=parse_args [];
				271	'Token.Kwd ')' ?? "expected ')'">] ->
				272	Ast.Call (id, Array.of_list (List.rev args))
				273
				274	(* Simple variable ref. *)
				275	\| [< >] -> Ast.Variable id
				276	in
				277	parse_ident id stream
				278	</pre>
				279	</div>
				280
				281	<p>This routine follows the same style as the other routines. (It expects to be
				282	called if the current token is a <tt>Token.Ident</tt> token). It also has
				283	recursion and error handling. One interesting aspect of this is that it uses
				284	<em>look-ahead</em> to determine if the current identifier is a stand alone
				285	variable reference or if it is a function call expression. It handles this by
				286	checking to see if the token after the identifier is a '(' token, constructing
				287	either a <tt>Ast.Variable</tt> or <tt>Ast.Call</tt> node as appropriate.
				288	</p>
				289
				290	<p>We finish up by raising an exception if we received a token we didn't
				291	expect:</p>
				292
				293	<div class="doc_code">
				294	<pre>
				295	\| [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
				296	</pre>
				297	</div>
				298
				299	<p>Now that basic expressions are handled, we need to handle binary expressions.
				300	They are a bit more complex.</p>
				301
				302	</div>
				303
				304	<!-- *********************************************************************** -->
NAKAMURA Takumi	fc8d930	2011-04-18 23:59:50 +0000	[diff] [blame]	305	<h2><a name="parserbinops">Binary Expression Parsing</a></h2>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	306	<!-- *********************************************************************** -->
				307
NAKAMURA Takumi	aa3d624	2011-04-23 00:30:22 +0000	[diff] [blame]	308	<div>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	309
				310	<p>Binary expressions are significantly harder to parse because they are often
				311	ambiguous. For example, when given the string "x+y*z", the parser can choose
				312	to parse it as either "(x+y)z" or "x+(yz)". With common definitions from
				313	mathematics, we expect the later parse, because "*" (multiplication) has
				314	higher <em>precedence</em> than "+" (addition).</p>
				315
				316	<p>There are many ways to handle this, but an elegant and efficient way is to
				317	use <a href=
				318	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				319	Parsing</a>. This parsing technique uses the precedence of binary operators to
				320	guide recursion. To start with, we need a table of precedences:</p>
				321
				322	<div class="doc_code">
				323	<pre>
				324	(* binop_precedence - This holds the precedence for each binary operator that is
				325	* defined *)
				326	let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
				327
				328	(* precedence - Get the precedence of the pending binary operator token. *)
				329	let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1
				330
				331	...
				332
				333	let main () =
				334	(* Install standard binary operators.
				335	* 1 is the lowest precedence. *)
				336	Hashtbl.add Parser.binop_precedence '<' 10;
				337	Hashtbl.add Parser.binop_precedence '+' 20;
				338	Hashtbl.add Parser.binop_precedence '-' 20;
				339	Hashtbl.add Parser.binop_precedence '' 40; ( highest. *)
				340	...
				341	</pre>
				342	</div>
				343
				344	<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
				345	(this can obviously be extended by you, our brave and intrepid reader). The
Erick Tryzelaar	a2eba03	2008-03-30 19:14:31 +0000	[diff] [blame]	346	<tt>Parser.precedence</tt> function returns the precedence for the current
				347	token, or -1 if the token is not a binary operator. Having a <tt>Hashtbl.t</tt>
				348	makes it easy to add new operators and makes it clear that the algorithm doesn't
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	349	depend on the specific operators involved, but it would be easy enough to
				350	eliminate the <tt>Hashtbl.t</tt> and do the comparisons in the
Erick Tryzelaar	a2eba03	2008-03-30 19:14:31 +0000	[diff] [blame]	351	<tt>Parser.precedence</tt> function. (Or just use a fixed-size array).</p>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	352
				353	<p>With the helper above defined, we can now start parsing binary expressions.
				354	The basic idea of operator precedence parsing is to break down an expression
				355	with potentially ambiguous binary operators into pieces. Consider ,for example,
				356	the expression "a+b+(c+d)ef+g". Operator precedence parsing considers this
				357	as a stream of primary expressions separated by binary operators. As such,
				358	it will first parse the leading primary expression "a", then it will see the
				359	pairs [+, b] [+, (c+d)] [, e] [, f] and [+, g]. Note that because parentheses
				360	are primary expressions, the binary expression parser doesn't need to worry
				361	about nested subexpressions like (c+d) at all.
				362	</p>
				363
				364	<p>
				365	To start, an expression is a primary expression potentially followed by a
				366	sequence of [binop,primaryexpr] pairs:</p>
				367
				368	<div class="doc_code">
				369	<pre>
				370	(* expression
				371	* ::= primary binoprhs *)
				372	and parse_expr = parser
				373	\| [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
				374	</pre>
				375	</div>
				376
Erick Tryzelaar	a2eba03	2008-03-30 19:14:31 +0000	[diff] [blame]	377	<p><tt>Parser.parse_bin_rhs</tt> is the function that parses the sequence of
				378	pairs for us. It takes a precedence and a pointer to an expression for the part
				379	that has been parsed so far. Note that "x" is a perfectly valid expression: As
				380	such, "binoprhs" is allowed to be empty, in which case it returns the expression
				381	that is passed into it. In our example above, the code passes the expression for
				382	"a" into <tt>Parser.parse_bin_rhs</tt> and the current token is "+".</p>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	383
Erick Tryzelaar	a2eba03	2008-03-30 19:14:31 +0000	[diff] [blame]	384	<p>The precedence value passed into <tt>Parser.parse_bin_rhs</tt> indicates the
				385	<em>minimal operator precedence</em> that the function is allowed to eat. For
				386	example, if the current pair stream is [+, x] and <tt>Parser.parse_bin_rhs</tt>
				387	is passed in a precedence of 40, it will not consume any tokens (because the
				388	precedence of '+' is only 20). With this in mind, <tt>Parser.parse_bin_rhs</tt>
				389	starts with:</p>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	390
				391	<div class="doc_code">
				392	<pre>
				393	(* binoprhs
				394	* ::= ('+' primary)* *)
				395	and parse_bin_rhs expr_prec lhs stream =
				396	match Stream.peek stream with
				397	(* If this is a binop, find its precedence. *)
				398	\| Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
				399	let token_prec = precedence c in
				400
				401	(* If this is a binop that binds at least as tightly as the current binop,
				402	* consume it, otherwise we are done. *)
				403	if token_prec < expr_prec then lhs else begin
				404	</pre>
				405	</div>
				406
				407	<p>This code gets the precedence of the current token and checks to see if if is
				408	too low. Because we defined invalid tokens to have a precedence of -1, this
				409	check implicitly knows that the pair-stream ends when the token stream runs out
				410	of binary operators. If this check succeeds, we know that the token is a binary
				411	operator and that it will be included in this expression:</p>
				412
				413	<div class="doc_code">
				414	<pre>
				415	(* Eat the binop. *)
				416	Stream.junk stream;
				417
				418	(* Okay, we know this is a binop. *)
				419	let rhs =
				420	match Stream.peek stream with
				421	\| Some (Token.Kwd c2) ->
				422	</pre>
				423	</div>
				424
				425	<p>As such, this code eats (and remembers) the binary operator and then parses
				426	the primary expression that follows. This builds up the whole pair, the first of
				427	which is [+, b] for the running example.</p>
				428
				429	<p>Now that we parsed the left-hand side of an expression and one pair of the
				430	RHS sequence, we have to decide which way the expression associates. In
				431	particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
				432	To determine this, we look ahead at "binop" to determine its precedence and
				433	compare it to BinOp's precedence (which is '+' in this case):</p>
				434
				435	<div class="doc_code">
				436	<pre>
				437	(* If BinOp binds less tightly with rhs than the operator after
				438	* rhs, let the pending operator take rhs as its lhs. *)
				439	let next_prec = precedence c2 in
				440	if token_prec < next_prec
				441	</pre>
				442	</div>
				443
				444	<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
				445	precedence of our current operator, then we know that the parentheses associate
				446	as "(a+b) binop ...". In our example, the current operator is "+" and the next
				447	operator is "+", we know that they have the same precedence. In this case we'll
				448	create the AST node for "a+b", and then continue parsing:</p>
				449
				450	<div class="doc_code">
				451	<pre>
				452	... if body omitted ...
				453	in
				454
				455	(* Merge lhs/rhs. *)
				456	let lhs = Ast.Binary (c, lhs, rhs) in
				457	parse_bin_rhs expr_prec lhs stream
				458	end
				459	</pre>
				460	</div>
				461
				462	<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
				463	iteration of the loop, with "+" as the current token. The code above will eat,
				464	remember, and parse "(c+d)" as the primary expression, which makes the
				465	current pair equal to [+, (c+d)]. It will then evaluate the 'if' conditional above with
				466	"" as the binop to the right of the primary. In this case, the precedence of "" is
				467	higher than the precedence of "+" so the if condition will be entered.</p>
				468
				469	<p>The critical question left here is "how can the if condition parse the right
				470	hand side in full"? In particular, to build the AST correctly for our example,
				471	it needs to get all of "(c+d)ef" as the RHS expression variable. The code to
				472	do this is surprisingly simple (code from the above two blocks duplicated for
				473	context):</p>
				474
				475	<div class="doc_code">
				476	<pre>
				477	match Stream.peek stream with
				478	\| Some (Token.Kwd c2) ->
				479	(* If BinOp binds less tightly with rhs than the operator after
				480	* rhs, let the pending operator take rhs as its lhs. *)
				481	if token_prec < precedence c2
				482	then <b>parse_bin_rhs (token_prec + 1) rhs stream</b>
				483	else rhs
				484	\| _ -> rhs
				485	in
				486
				487	(* Merge lhs/rhs. *)
				488	let lhs = Ast.Binary (c, lhs, rhs) in
				489	parse_bin_rhs expr_prec lhs stream
				490	end
				491	</pre>
				492	</div>
				493
				494	<p>At this point, we know that the binary operator to the RHS of our primary
				495	has higher precedence than the binop we are currently parsing. As such, we know
				496	that any sequence of pairs whose operators are all higher precedence than "+"
				497	should be parsed together and returned as "RHS". To do this, we recursively
Erick Tryzelaar	a2eba03	2008-03-30 19:14:31 +0000	[diff] [blame]	498	invoke the <tt>Parser.parse_bin_rhs</tt> function specifying "token_prec+1" as
				499	the minimum precedence required for it to continue. In our example above, this
				500	will cause it to return the AST node for "(c+d)ef" as RHS, which is then set
				501	as the RHS of the '+' expression.</p>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	502
				503	<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed
				504	and added to the AST. With this little bit of code (14 non-trivial lines), we
				505	correctly handle fully general binary expression parsing in a very elegant way.
				506	This was a whirlwind tour of this code, and it is somewhat subtle. I recommend
				507	running through it with a few tough examples to see how it works.
				508	</p>
				509
				510	<p>This wraps up handling of expressions. At this point, we can point the
				511	parser at an arbitrary token stream and build an expression from it, stopping
				512	at the first token that is not part of the expression. Next up we need to
				513	handle function definitions, etc.</p>
				514
				515	</div>
				516
				517	<!-- *********************************************************************** -->
NAKAMURA Takumi	fc8d930	2011-04-18 23:59:50 +0000	[diff] [blame]	518	<h2><a name="parsertop">Parsing the Rest</a></h2>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	519	<!-- *********************************************************************** -->
				520
NAKAMURA Takumi	aa3d624	2011-04-23 00:30:22 +0000	[diff] [blame]	521	<div>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	522
				523	<p>
				524	The next thing missing is handling of function prototypes. In Kaleidoscope,
				525	these are used both for 'extern' function declarations as well as function body
				526	definitions. The code to do this is straight-forward and not very interesting
				527	(once you've survived expressions):
				528	</p>
				529
				530	<div class="doc_code">
				531	<pre>
				532	(* prototype
				533	* ::= id '(' id* ')' *)
				534	let parse_prototype =
				535	let rec parse_args accumulator = parser
				536	\| [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
				537	\| [< >] -> accumulator
				538	in
				539
				540	parser
				541	\| [< 'Token.Ident id;
				542	'Token.Kwd '(' ?? "expected '(' in prototype";
				543	args=parse_args [];
				544	'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
				545	(* success. *)
				546	Ast.Prototype (id, Array.of_list (List.rev args))
				547
				548	\| [< >] ->
				549	raise (Stream.Error "expected function name in prototype")
				550	</pre>
				551	</div>
				552
				553	<p>Given this, a function definition is very simple, just a prototype plus
				554	an expression to implement the body:</p>
				555
				556	<div class="doc_code">
				557	<pre>
				558	(* definition ::= 'def' prototype expression *)
				559	let parse_definition = parser
				560	\| [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
				561	Ast.Function (p, e)
				562	</pre>
				563	</div>
				564
				565	<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
				566	well as to support forward declaration of user functions. These 'extern's are just
				567	prototypes with no body:</p>
				568
				569	<div class="doc_code">
				570	<pre>
				571	(* external ::= 'extern' prototype *)
				572	let parse_extern = parser
				573	\| [< 'Token.Extern; e=parse_prototype >] -> e
				574	</pre>
				575	</div>
				576
				577	<p>Finally, we'll also let the user type in arbitrary top-level expressions and
				578	evaluate them on the fly. We will handle this by defining anonymous nullary
				579	(zero argument) functions for them:</p>
				580
				581	<div class="doc_code">
				582	<pre>
				583	(* toplevelexpr ::= expression *)
				584	let parse_toplevel = parser
				585	\| [< e=parse_expr >] ->
				586	(* Make an anonymous proto. *)
				587	Ast.Function (Ast.Prototype ("", [\|\|]), e)
				588	</pre>
				589	</div>
				590
				591	<p>Now that we have all the pieces, let's build a little driver that will let us
				592	actually <em>execute</em> this code we've built!</p>
				593
				594	</div>
				595
				596	<!-- *********************************************************************** -->
NAKAMURA Takumi	fc8d930	2011-04-18 23:59:50 +0000	[diff] [blame]	597	<h2><a name="driver">The Driver</a></h2>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	598	<!-- *********************************************************************** -->
				599
NAKAMURA Takumi	aa3d624	2011-04-23 00:30:22 +0000	[diff] [blame]	600	<div>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	601
				602	<p>The driver for this simply invokes all of the parsing pieces with a top-level
				603	dispatch loop. There isn't much interesting here, so I'll just include the
				604	top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
				605	Parsing" section.</p>
				606
				607	<div class="doc_code">
				608	<pre>
				609	(* top ::= definition \| external \| expression \| ';' *)
				610	let rec main_loop stream =
				611	match Stream.peek stream with
				612	\| None -> ()
				613
				614	(* ignore top-level semicolons. *)
				615	\| Some (Token.Kwd ';') ->
				616	Stream.junk stream;
				617	main_loop stream
				618
				619	\| Some token ->
				620	begin
				621	try match token with
				622	\| Token.Def ->
				623	ignore(Parser.parse_definition stream);
				624	print_endline "parsed a function definition.";
				625	\| Token.Extern ->
				626	ignore(Parser.parse_extern stream);
				627	print_endline "parsed an extern.";
				628	\| _ ->
				629	(* Evaluate a top-level expression into an anonymous function. *)
				630	ignore(Parser.parse_toplevel stream);
				631	print_endline "parsed a top-level expr";
				632	with Stream.Error s ->
				633	(* Skip token for error recovery. *)
				634	Stream.junk stream;
				635	print_endline s;
				636	end;
				637	print_string "ready> "; flush stdout;
				638	main_loop stream
				639	</pre>
				640	</div>
				641
				642	<p>The most interesting part of this is that we ignore top-level semicolons.
				643	Why is this, you ask? The basic reason is that if you type "4 + 5" at the
				644	command line, the parser doesn't know whether that is the end of what you will type
				645	or not. For example, on the next line you could type "def foo..." in which case
				646	4+5 is the end of a top-level expression. Alternatively you could type "* 6",
				647	which would continue the expression. Having top-level semicolons allows you to
				648	type "4+5;", and the parser will know you are done.</p>
				649
				650	</div>
				651
				652	<!-- *********************************************************************** -->
NAKAMURA Takumi	fc8d930	2011-04-18 23:59:50 +0000	[diff] [blame]	653	<h2><a name="conclusions">Conclusions</a></h2>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	654	<!-- *********************************************************************** -->
				655
NAKAMURA Takumi	aa3d624	2011-04-23 00:30:22 +0000	[diff] [blame]	656	<div>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	657
				658	<p>With just under 300 lines of commented code (240 lines of non-comment,
				659	non-blank code), we fully defined our minimal language, including a lexer,
				660	parser, and AST builder. With this done, the executable will validate
				661	Kaleidoscope code and tell us if it is grammatically invalid. For
				662	example, here is a sample interaction:</p>
				663
				664	<div class="doc_code">
				665	<pre>
				666	$ <b>./toy.byte</b>
				667	ready> <b>def foo(x y) x+foo(y, 4.0);</b>
				668	Parsed a function definition.
				669	ready> <b>def foo(x y) x+y y;</b>
				670	Parsed a function definition.
				671	Parsed a top-level expr
				672	ready> <b>def foo(x y) x+y );</b>
				673	Parsed a function definition.
				674	Error: unknown token when expecting an expression
				675	ready> <b>extern sin(a);</b>
				676	ready> Parsed an extern
				677	ready> <b>^D</b>
				678	$
				679	</pre>
				680	</div>
				681
				682	<p>There is a lot of room for extension here. You can define new AST nodes,
				683	extend the language in many ways, etc. In the <a href="OCamlLangImpl3.html">
				684	next installment</a>, we will describe how to generate LLVM Intermediate
				685	Representation (IR) from the AST.</p>
				686
				687	</div>
				688
				689	<!-- *********************************************************************** -->
NAKAMURA Takumi	fc8d930	2011-04-18 23:59:50 +0000	[diff] [blame]	690	<h2><a name="code">Full Code Listing</a></h2>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	691	<!-- *********************************************************************** -->
				692
NAKAMURA Takumi	aa3d624	2011-04-23 00:30:22 +0000	[diff] [blame]	693	<div>
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	694
				695	<p>
				696	Here is the complete code listing for this and the previous chapter.
				697	Note that it is fully self-contained: you don't need LLVM or any external
				698	libraries at all for this. (Besides the ocaml standard libraries, of
				699	course.) To build this, just compile with:</p>
				700
				701	<div class="doc_code">
				702	<pre>
				703	# Compile
				704	ocamlbuild toy.byte
				705	# Run
Erick Tryzelaar	a2eba03	2008-03-30 19:14:31 +0000	[diff] [blame]	706	./toy.byte
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	707	</pre>
				708	</div>
				709
				710	<p>Here is the code:</p>
				711
				712	<dl>
				713	<dt>_tags:</dt>
				714	<dd class="doc_code">
				715	<pre>
				716	<{lexer,parser}.ml>: use_camlp4, pp(camlp4of)
				717	</pre>
				718	</dd>
				719
				720	<dt>token.ml:</dt>
				721	<dd class="doc_code">
				722	<pre>
				723	(*===----------------------------------------------------------------------===
				724	* Lexer Tokens
				725	===----------------------------------------------------------------------===)
				726
				727	(* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of
				728	* these others for known things. *)
				729	type token =
				730	(* commands *)
				731	\| Def \| Extern
				732
				733	(* primary *)
				734	\| Ident of string \| Number of float
				735
				736	(* unknown *)
				737	\| Kwd of char
				738	</pre>
				739	</dd>
				740
				741	<dt>lexer.ml:</dt>
				742	<dd class="doc_code">
				743	<pre>
				744	(*===----------------------------------------------------------------------===
				745	* Lexer
				746	===----------------------------------------------------------------------===)
				747
				748	let rec lex = parser
				749	(* Skip any whitespace. *)
				750	\| [< ' (' ' \| '\n' \| '\r' \| '\t'); stream >] -> lex stream
				751
				752	(* identifier: [a-zA-Z][a-zA-Z0-9] *)
				753	\| [< ' ('A' .. 'Z' \| 'a' .. 'z' as c); stream >] ->
				754	let buffer = Buffer.create 1 in
				755	Buffer.add_char buffer c;
				756	lex_ident buffer stream
				757
				758	(* number: [0-9.]+ *)
				759	\| [< ' ('0' .. '9' as c); stream >] ->
				760	let buffer = Buffer.create 1 in
				761	Buffer.add_char buffer c;
				762	lex_number buffer stream
				763
				764	(* Comment until end of line. *)
				765	\| [< ' ('#'); stream >] ->
				766	lex_comment stream
				767
				768	(* Otherwise, just return the character as its ascii value. *)
				769	\| [< 'c; stream >] ->
				770	[< 'Token.Kwd c; lex stream >]
				771
				772	(* end of stream. *)
				773	\| [< >] -> [< >]
				774
				775	and lex_number buffer = parser
				776	\| [< ' ('0' .. '9' \| '.' as c); stream >] ->
				777	Buffer.add_char buffer c;
				778	lex_number buffer stream
				779	\| [< stream=lex >] ->
				780	[< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >]
				781
				782	and lex_ident buffer = parser
				783	\| [< ' ('A' .. 'Z' \| 'a' .. 'z' \| '0' .. '9' as c); stream >] ->
				784	Buffer.add_char buffer c;
				785	lex_ident buffer stream
				786	\| [< stream=lex >] ->
				787	match Buffer.contents buffer with
				788	\| "def" -> [< 'Token.Def; stream >]
				789	\| "extern" -> [< 'Token.Extern; stream >]
				790	\| id -> [< 'Token.Ident id; stream >]
				791
				792	and lex_comment = parser
				793	\| [< ' ('\n'); stream=lex >] -> stream
				794	\| [< 'c; e=lex_comment >] -> e
				795	\| [< >] -> [< >]
				796	</pre>
				797	</dd>
				798
				799	<dt>ast.ml:</dt>
				800	<dd class="doc_code">
				801	<pre>
				802	(*===----------------------------------------------------------------------===
				803	* Abstract Syntax Tree (aka Parse Tree)
				804	===----------------------------------------------------------------------===)
				805
				806	(* expr - Base type for all expression nodes. *)
				807	type expr =
				808	(* variant for numeric literals like "1.0". *)
				809	\| Number of float
				810
				811	(* variant for referencing a variable, like "a". *)
				812	\| Variable of string
				813
				814	(* variant for a binary operator. *)
				815	\| Binary of char * expr * expr
				816
				817	(* variant for function calls. *)
				818	\| Call of string * expr array
				819
				820	(* proto - This type represents the "prototype" for a function, which captures
				821	* its name, and its argument names (thus implicitly the number of arguments the
				822	* function takes). *)
				823	type proto = Prototype of string * string array
				824
				825	(* func - This type represents a function definition itself. *)
				826	type func = Function of proto * expr
				827	</pre>
				828	</dd>
				829
				830	<dt>parser.ml:</dt>
				831	<dd class="doc_code">
				832	<pre>
				833	(*===---------------------------------------------------------------------===
				834	* Parser
				835	===---------------------------------------------------------------------===)
				836
				837	(* binop_precedence - This holds the precedence for each binary operator that is
				838	* defined *)
				839	let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
				840
				841	(* precedence - Get the precedence of the pending binary operator token. *)
				842	let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1
				843
				844	(* primary
				845	* ::= identifier
				846	* ::= numberexpr
				847	* ::= parenexpr *)
				848	let rec parse_primary = parser
				849	(* numberexpr ::= number *)
				850	\| [< 'Token.Number n >] -> Ast.Number n
				851
				852	(* parenexpr ::= '(' expression ')' *)
				853	\| [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
				854
				855	(* identifierexpr
				856	* ::= identifier
				857	* ::= identifier '(' argumentexpr ')' *)
				858	\| [< 'Token.Ident id; stream >] ->
				859	let rec parse_args accumulator = parser
				860	\| [< e=parse_expr; stream >] ->
				861	begin parser
				862	\| [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
				863	\| [< >] -> e :: accumulator
				864	end stream
				865	\| [< >] -> accumulator
				866	in
				867	let rec parse_ident id = parser
				868	(* Call. *)
				869	\| [< 'Token.Kwd '(';
				870	args=parse_args [];
				871	'Token.Kwd ')' ?? "expected ')'">] ->
				872	Ast.Call (id, Array.of_list (List.rev args))
				873
				874	(* Simple variable ref. *)
				875	\| [< >] -> Ast.Variable id
				876	in
				877	parse_ident id stream
				878
				879	\| [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
				880
				881	(* binoprhs
				882	* ::= ('+' primary)* *)
				883	and parse_bin_rhs expr_prec lhs stream =
				884	match Stream.peek stream with
				885	(* If this is a binop, find its precedence. *)
				886	\| Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
				887	let token_prec = precedence c in
				888
				889	(* If this is a binop that binds at least as tightly as the current binop,
				890	* consume it, otherwise we are done. *)
				891	if token_prec < expr_prec then lhs else begin
				892	(* Eat the binop. *)
				893	Stream.junk stream;
				894
				895	(* Parse the primary expression after the binary operator. *)
				896	let rhs = parse_primary stream in
				897
				898	(* Okay, we know this is a binop. *)
				899	let rhs =
				900	match Stream.peek stream with
				901	\| Some (Token.Kwd c2) ->
				902	(* If BinOp binds less tightly with rhs than the operator after
				903	* rhs, let the pending operator take rhs as its lhs. *)
				904	let next_prec = precedence c2 in
				905	if token_prec < next_prec
				906	then parse_bin_rhs (token_prec + 1) rhs stream
				907	else rhs
				908	\| _ -> rhs
				909	in
				910
				911	(* Merge lhs/rhs. *)
				912	let lhs = Ast.Binary (c, lhs, rhs) in
				913	parse_bin_rhs expr_prec lhs stream
				914	end
				915	\| _ -> lhs
				916
				917	(* expression
				918	* ::= primary binoprhs *)
				919	and parse_expr = parser
				920	\| [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
				921
				922	(* prototype
				923	* ::= id '(' id* ')' *)
				924	let parse_prototype =
				925	let rec parse_args accumulator = parser
				926	\| [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
				927	\| [< >] -> accumulator
				928	in
				929
				930	parser
				931	\| [< 'Token.Ident id;
				932	'Token.Kwd '(' ?? "expected '(' in prototype";
				933	args=parse_args [];
				934	'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
				935	(* success. *)
				936	Ast.Prototype (id, Array.of_list (List.rev args))
				937
				938	\| [< >] ->
				939	raise (Stream.Error "expected function name in prototype")
				940
				941	(* definition ::= 'def' prototype expression *)
				942	let parse_definition = parser
				943	\| [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
				944	Ast.Function (p, e)
				945
				946	(* toplevelexpr ::= expression *)
				947	let parse_toplevel = parser
				948	\| [< e=parse_expr >] ->
				949	(* Make an anonymous proto. *)
				950	Ast.Function (Ast.Prototype ("", [\|\|]), e)
				951
				952	(* external ::= 'extern' prototype *)
				953	let parse_extern = parser
				954	\| [< 'Token.Extern; e=parse_prototype >] -> e
				955	</pre>
				956	</dd>
				957
				958	<dt>toplevel.ml:</dt>
				959	<dd class="doc_code">
				960	<pre>
				961	(*===----------------------------------------------------------------------===
				962	* Top-Level parsing and JIT Driver
				963	===----------------------------------------------------------------------===)
				964
				965	(* top ::= definition \| external \| expression \| ';' *)
				966	let rec main_loop stream =
				967	match Stream.peek stream with
				968	\| None -> ()
				969
				970	(* ignore top-level semicolons. *)
				971	\| Some (Token.Kwd ';') ->
				972	Stream.junk stream;
				973	main_loop stream
				974
				975	\| Some token ->
				976	begin
				977	try match token with
				978	\| Token.Def ->
				979	ignore(Parser.parse_definition stream);
				980	print_endline "parsed a function definition.";
				981	\| Token.Extern ->
				982	ignore(Parser.parse_extern stream);
				983	print_endline "parsed an extern.";
				984	\| _ ->
				985	(* Evaluate a top-level expression into an anonymous function. *)
				986	ignore(Parser.parse_toplevel stream);
				987	print_endline "parsed a top-level expr";
				988	with Stream.Error s ->
				989	(* Skip token for error recovery. *)
				990	Stream.junk stream;
				991	print_endline s;
				992	end;
				993	print_string "ready> "; flush stdout;
				994	main_loop stream
				995	</pre>
				996	</dd>
				997
				998	<dt>toy.ml:</dt>
				999	<dd class="doc_code">
				1000	<pre>
				1001	(*===----------------------------------------------------------------------===
				1002	* Main driver code.
				1003	===----------------------------------------------------------------------===)
				1004
				1005	let main () =
				1006	(* Install standard binary operators.
				1007	* 1 is the lowest precedence. *)
				1008	Hashtbl.add Parser.binop_precedence '<' 10;
				1009	Hashtbl.add Parser.binop_precedence '+' 20;
				1010	Hashtbl.add Parser.binop_precedence '-' 20;
				1011	Hashtbl.add Parser.binop_precedence '' 40; ( highest. *)
				1012
				1013	(* Prime the first token. *)
				1014	print_string "ready> "; flush stdout;
				1015	let stream = Lexer.lex (Stream.of_channel stdin) in
				1016
				1017	(* Run the main "interpreter loop" now. *)
				1018	Toplevel.main_loop stream;
				1019	;;
				1020
				1021	main ()
				1022	</pre>
				1023	</dd>
				1024	</dl>
				1025
				1026	<a href="OCamlLangImpl3.html">Next: Implementing Code Generation to LLVM IR</a>
				1027	</div>
				1028
				1029	<!-- *********************************************************************** -->
				1030	<hr>
				1031	<address>
				1032	<a href="http://jigsaw.w3.org/css-validator/check/referer"><img
				1033	src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
				1034	<a href="http://validator.w3.org/check/referer"><img
				1035	src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
				1036
				1037	<a href="mailto:sabre@nondot.org">Chris Lattner</a>
				1038	<a href="mailto:erickt@users.sourceforge.net">Erick Tryzelaar</a><br>
NAKAMURA Takumi	ca46f5a	2011-04-09 02:13:37 +0000	[diff] [blame]	1039	<a href="http://llvm.org/">The LLVM Compiler Infrastructure</a><br>
Dan Gohman	b44d2ab	2010-02-03 17:27:31 +0000	[diff] [blame]	1040	Last modified: $Date$
Erick Tryzelaar	a5e13fa	2008-03-27 08:18:07 +0000	[diff] [blame]	1041	</address>
				1042	</body>
				1043	</html>