<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
                      "http://www.w3.org/TR/html4/strict.dtd">
				3
				4	<html>
				5	<head>
				6	<title>Kaleidoscope: Implementing a Parser and AST</title>
				7	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
				8	<meta name="author" content="Chris Lattner">
				9	<meta name="author" content="Erick Tryzelaar">
				10	<link rel="stylesheet" href="../llvm.css" type="text/css">
				11	</head>
				12
				13	<body>
				14
				15	<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
				16
				17	<ul>
				18	<li><a href="index.html">Up to Tutorial Index</a></li>
				19	<li>Chapter 2
				20	<ol>
				21	<li><a href="#intro">Chapter 2 Introduction</a></li>
				22	<li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
				23	<li><a href="#parserbasics">Parser Basics</a></li>
				24	<li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
				25	<li><a href="#parserbinops">Binary Expression Parsing</a></li>
				26	<li><a href="#parsertop">Parsing the Rest</a></li>
				27	<li><a href="#driver">The Driver</a></li>
				28	<li><a href="#conclusions">Conclusions</a></li>
				29	<li><a href="#code">Full Code Listing</a></li>
				30	</ol>
				31	</li>
				32	<li><a href="OCamlLangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
				33	</ul>
				34
				35	<div class="doc_author">
				36	<p>
				37	Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a>
				38	and <a href="mailto:idadesub@users.sourceforge.net">Erick Tryzelaar</a>
				39	</p>
				40	</div>
				41
				42	<!-- *********************************************************************** -->
				43	<div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
				44	<!-- *********************************************************************** -->
				45
				46	<div class="doc_text">
				47
				48	<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
				49	with LLVM in Objective Caml</a>" tutorial. This chapter shows you how to use
				50	the lexer, built in <a href="OCamlLangImpl1.html">Chapter 1</a>, to build a
				51	full <a href="http://en.wikipedia.org/wiki/Parsing">parser</a> for our
				52	Kaleidoscope language. Once we have a parser, we'll define and build an <a
				53	href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
				54	Tree</a> (AST).</p>
				55
<p>The parser we will build uses a combination of <a
href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
Parsing</a> and <a href=
"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
Parsing</a> to parse the Kaleidoscope language (the latter for
binary expressions and the former for everything else). Before we get to
parsing, though, let's talk about the output of the parser: the Abstract Syntax
Tree.</p>
				64
				65	</div>
				66
				67	<!-- *********************************************************************** -->
				68	<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
				69	<!-- *********************************************************************** -->
				70
				71	<div class="doc_text">
				72
				73	<p>The AST for a program captures its behavior in such a way that it is easy for
				74	later stages of the compiler (e.g. code generation) to interpret. We basically
				75	want one object for each construct in the language, and the AST should closely
				76	model the language. In Kaleidoscope, we have expressions, a prototype, and a
				77	function object. We'll start with expressions first:</p>
				78
				79	<div class="doc_code">
				80	<pre>
(* expr - Base type for all expression nodes. *)
type expr =
  (* variant for numeric literals like "1.0". *)
  | Number of float
				85	</pre>
				86	</div>
				87
<p>The code above shows the definition of the <tt>expr</tt> variant type and
one constructor, which we use for numeric literals. The important thing to note
about this code is that the <tt>Number</tt> variant carries the numeric value of
the literal as its payload. This allows later phases of the compiler to know
what the stored numeric value is.</p>

<p>Right now we only create the AST, so there are no useful functions that
operate on it. It would be very easy to add a function to pretty print the code,
for example. Here are the other expression AST node definitions that we'll use
in the basic form of the Kaleidoscope language:
</p>
				99
				100	<div class="doc_code">
				101	<pre>
  (* variant for referencing a variable, like "a". *)
  | Variable of string

  (* variant for a binary operator. *)
  | Binary of char * expr * expr

  (* variant for function calls. *)
  | Call of string * expr array
				110	</pre>
				111	</div>
				112
<p>This is all (intentionally) rather straightforward: variables capture the
variable name, binary operators capture their opcode (e.g. '+'), and calls
capture a function name as well as an array of any argument expressions. One thing
that is nice about our AST is that it captures the language features without
talking about the syntax of the language. Note that there is no discussion about
precedence of binary operators, lexical structure, etc.</p>
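<p>As an illustration of how little is needed to work with this AST, here is a
minimal sketch of the pretty printer mentioned earlier. The function name
<tt>to_string</tt> and the output format are assumptions made for this example;
they are not part of the tutorial's code.</p>

```ocaml
(* The expression type from this chapter, repeated so the sketch is
 * self-contained. *)
type expr =
  | Number of float
  | Variable of string
  | Binary of char * expr * expr
  | Call of string * expr array

(* to_string - Hypothetical helper that renders an expression with explicit
 * parentheses around every binary operation. *)
let rec to_string = function
  | Number n -> Printf.sprintf "%g" n
  | Variable name -> name
  | Binary (op, lhs, rhs) ->
      Printf.sprintf "(%s %c %s)" (to_string lhs) op (to_string rhs)
  | Call (callee, args) ->
      let args = Array.to_list (Array.map to_string args) in
      Printf.sprintf "%s(%s)" callee (String.concat ", " args)

let () =
  let e = Binary ('+', Variable "x",
                  Call ("foo", [| Variable "y"; Number 4.0 |])) in
  print_endline (to_string e)
```

<p>Because the AST carries no syntactic detail, the printer has to decide on its
own where parentheses go; this sketch simply parenthesizes every binary
node.</p>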
				119
				120	<p>For our basic language, these are all of the expression nodes we'll define.
				121	Because it doesn't have conditional control flow, it isn't Turing-complete;
				122	we'll fix that in a later installment. The two things we need next are a way
				123	to talk about the interface to a function, and a way to talk about functions
				124	themselves:</p>
				125
				126	<div class="doc_code">
				127	<pre>
				128	(* proto - This type represents the "prototype" for a function, which captures
				129	* its name, and its argument names (thus implicitly the number of arguments the
				130	* function takes). *)
				131	type proto = Prototype of string * string array
				132
				133	(* func - This type represents a function definition itself. *)
				134	type func = Function of proto * expr
				135	</pre>
				136	</div>
				137
				138	<p>In Kaleidoscope, functions are typed with just a count of their arguments.
				139	Since all values are double precision floating point, the type of each argument
				140	doesn't need to be stored anywhere. In a more aggressive and realistic
				141	language, the "expr" variants would probably have a type field.</p>
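<p>To make the shape of these types concrete, the following sketch builds by
hand the values that a parser could produce for
<tt>def foo(x y) x+foo(y, 4.0)</tt> and <tt>extern sin(a)</tt>. The helper
<tt>arity</tt> is hypothetical and only illustrates that the argument count is
the whole "type" of a function here.</p>

```ocaml
type expr =
  | Number of float
  | Variable of string
  | Binary of char * expr * expr
  | Call of string * expr array

type proto = Prototype of string * string array
type func = Function of proto * expr

(* def foo(x y) x+foo(y, 4.0) *)
let foo =
  Function
    ( Prototype ("foo", [| "x"; "y" |]),
      Binary ('+', Variable "x",
              Call ("foo", [| Variable "y"; Number 4.0 |])) )

(* extern sin(a) -- a prototype with no body. *)
let sin_proto = Prototype ("sin", [| "a" |])

(* arity - Hypothetical helper: the number of arguments a prototype takes. *)
let arity (Prototype (_, args)) = Array.length args
```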
				142
				143	<p>With this scaffolding, we can now talk about parsing expressions and function
				144	bodies in Kaleidoscope.</p>
				145
				146	</div>
				147
				148	<!-- *********************************************************************** -->
				149	<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
				150	<!-- *********************************************************************** -->
				151
				152	<div class="doc_text">
				153
				154	<p>Now that we have an AST to build, we need to define the parser code to build
				155	it. The idea here is that we want to parse something like "x+y" (which is
				156	returned as three tokens by the lexer) into an AST that could be generated with
				157	calls like this:</p>
				158
				159	<div class="doc_code">
				160	<pre>
				161	let x = Variable "x" in
				162	let y = Variable "y" in
				163	let result = Binary ('+', x, y) in
				164	...
				165	</pre>
				166	</div>
				167
<p>
The error handling routines make use of the builtin <tt>Stream.Failure</tt> and
<tt>Stream.Error</tt> exceptions. <tt>Stream.Failure</tt> is raised when the
parser is unable to find any matching token in the first position of a pattern.
<tt>Stream.Error</tt> is raised when the first token matches, but the rest do
not. The error recovery in our parser will not be the best and is not
particularly user-friendly, but it will be enough for our tutorial. These
exceptions make it easier to handle errors in routines that have various return
types.</p>
				177
				178	<p>With these basic types and exceptions, we can implement the first
				179	piece of our grammar: numeric literals.</p>
				180
				181	</div>
				182
				183	<!-- *********************************************************************** -->
				184	<div class="doc_section"><a name="parserprimexprs">Basic Expression
				185	Parsing</a></div>
				186	<!-- *********************************************************************** -->
				187
				188	<div class="doc_text">
				189
				190	<p>We start with numeric literals, because they are the simplest to process.
				191	For each production in our grammar, we'll define a function which parses that
				192	production. We call this class of expressions "primary" expressions, for
				193	reasons that will become more clear <a href="OCamlLangImpl6.html#unary">
				194	later in the tutorial</a>. In order to parse an arbitrary primary expression,
				195	we need to determine what sort of expression it is. For numeric literals, we
				196	have:</p>
				197
				198	<div class="doc_code">
				199	<pre>
(* primary
 *   ::= identifier
 *   ::= numberexpr
 *   ::= parenexpr *)
let rec parse_primary = parser
  (* numberexpr ::= number *)
  | [< 'Token.Number n >] -> Ast.Number n
				207	</pre>
				208	</div>
				209
				210	<p>This routine is very simple: it expects to be called when the current token
				211	is a <tt>Token.Number</tt> token. It takes the current number value, creates
				212	a <tt>Ast.Number</tt> node, advances the lexer to the next token, and finally
				213	returns.</p>
				214
				215	<p>There are some interesting aspects to this. The most important one is that
				216	this routine eats all of the tokens that correspond to the production and
				217	returns the lexer buffer with the next token (which is not part of the grammar
				218	production) ready to go. This is a fairly standard way to go for recursive
				219	descent parsers. For a better example, the parenthesis operator is defined like
				220	this:</p>
				221
				222	<div class="doc_code">
				223	<pre>
  (* parenexpr ::= '(' expression ')' *)
  | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
				226	</pre>
				227	</div>
				228
				229	<p>This function illustrates a number of interesting things about the
				230	parser:</p>
				231
				232	<p>
				233	1) It shows how we use the <tt>Stream.Error</tt> exception. When called, this
				234	function expects that the current token is a '(' token, but after parsing the
				235	subexpression, it is possible that there is no ')' waiting. For example, if
				236	the user types in "(4 x" instead of "(4)", the parser should emit an error.
				237	Because errors can occur, the parser needs a way to indicate that they
				238	happened. In our parser, we use the camlp4 shortcut syntax <tt>token ?? "parse
				239	error"</tt>, where if the token before the <tt>??</tt> does not match, then
				240	<tt>Stream.Error "parse error"</tt> will be raised.</p>
				241
<p>2) Another interesting aspect of this function is that it uses recursion by
calling <tt>Parser.parse_expr</tt> (we will soon see that
<tt>Parser.parse_expr</tt> can call <tt>Parser.parse_primary</tt>). This is
powerful because it allows us to handle recursive grammars, and keeps each
production very simple. Note that parentheses do not cause construction of AST
nodes themselves. While we could do it this way, the most important role of
parentheses is to guide the parser and provide grouping. Once the parser
constructs the AST, parentheses are not needed.</p>

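<p>The two points above can be sketched without camlp4 by parsing a plain list
of tokens. This is only an approximation under stated assumptions: the
exceptions <tt>No_match</tt> and <tt>Parse_error</tt> are hypothetical stand-ins
for <tt>Stream.Failure</tt> and <tt>Stream.Error</tt>, the token and AST types
are cut down, and a parenthesized expression is limited to a single primary
because binary operators have not been introduced yet.</p>

```ocaml
type token = Number of float | Ident of string | Kwd of char
type expr = Num of float | Var of string

exception No_match               (* stand-in for Stream.Failure *)
exception Parse_error of string  (* stand-in for Stream.Error *)

(* parse_primary - Returns the parsed expression together with the tokens
 * that were not consumed. *)
let rec parse_primary = function
  | Number n :: rest -> (Num n, rest)
  | Ident id :: rest -> (Var id, rest)
  | Kwd '(' :: rest ->
      (* The first token matched, so any later mismatch is a hard error. *)
      let e, rest =
        try parse_primary rest
        with No_match -> raise (Parse_error "expected expression after '('")
      in
      (match rest with
       | Kwd ')' :: rest -> (e, rest)  (* no AST node for the parentheses *)
       | _ -> raise (Parse_error "expected ')'"))
  | _ -> raise No_match
```

<p>With this sketch, the tokens for "(4)" yield just the number node, while the
tokens for "(4 x" raise <tt>Parse_error "expected ')'"</tt>.</p>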
				251	<p>The next simple production is for handling variable references and function
				252	calls:</p>
				253
				254	<div class="doc_code">
				255	<pre>
  (* identifierexpr
   *   ::= identifier
   *   ::= identifier '(' argumentexpr ')' *)
  | [< 'Token.Ident id; stream >] ->
      let rec parse_args accumulator = parser
        | [< e=parse_expr; stream >] ->
            begin parser
              | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
              | [< >] -> e :: accumulator
            end stream
        | [< >] -> accumulator
      in
      let rec parse_ident id = parser
        (* Call. *)
        | [< 'Token.Kwd '(';
             args=parse_args [];
             'Token.Kwd ')' ?? "expected ')'">] ->
            Ast.Call (id, Array.of_list (List.rev args))

        (* Simple variable ref. *)
        | [< >] -> Ast.Variable id
      in
      parse_ident id stream
				279	</pre>
				280	</div>
				281
<p>This routine follows the same style as the other routines (it expects to be
called if the current token is a <tt>Token.Ident</tt> token). It also has
recursion and error handling. One interesting aspect of this is that it uses
<em>look-ahead</em> to determine if the current identifier is a standalone
variable reference or if it is a function call expression. It handles this by
checking to see if the token after the identifier is a '(' token, constructing
either an <tt>Ast.Variable</tt> or <tt>Ast.Call</tt> node as appropriate.
</p>
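<p>The look-ahead decision can be sketched on a plain token list, where peeking
at the token after the identifier is just a nested pattern. This is a simplified
illustration, not the tutorial's camlp4 code: the types are cut down, arguments
are restricted to primaries, and <tt>Parse_error</tt> is a hypothetical stand-in
for <tt>Stream.Error</tt>.</p>

```ocaml
type token = Number of float | Ident of string | Kwd of char

type expr =
  | Num of float
  | Variable of string
  | Call of string * expr array

exception Parse_error of string

let rec parse_primary = function
  | Number n :: rest -> (Num n, rest)
  (* Look-ahead: an identifier followed by '(' starts a call. *)
  | Ident id :: Kwd '(' :: rest ->
      let args, rest = parse_args [] rest in
      (Call (id, Array.of_list (List.rev args)), rest)
  (* Otherwise it is a simple variable reference. *)
  | Ident id :: rest -> (Variable id, rest)
  | _ -> raise (Parse_error "unknown token when expecting an expression.")

(* parse_args - Collects comma-separated arguments in reverse order and
 * consumes the closing ')'. *)
and parse_args acc = function
  | Kwd ')' :: rest -> (acc, rest)
  | tokens ->
      let e, rest = parse_primary tokens in
      (match rest with
       | Kwd ',' :: rest -> parse_args (e :: acc) rest
       | Kwd ')' :: rest -> (e :: acc, rest)
       | _ -> raise (Parse_error "expected ')'"))
```

<p>As in the tutorial code, the arguments are accumulated in reverse and
flipped with <tt>List.rev</tt> before being stored in the array.</p>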
				290
				291	<p>We finish up by raising an exception if we received a token we didn't
				292	expect:</p>
				293
				294	<div class="doc_code">
				295	<pre>
  | [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
				297	</pre>
				298	</div>
				299
				300	<p>Now that basic expressions are handled, we need to handle binary expressions.
				301	They are a bit more complex.</p>
				302
				303	</div>
				304
				305	<!-- *********************************************************************** -->
				306	<div class="doc_section"><a name="parserbinops">Binary Expression
				307	Parsing</a></div>
				308	<!-- *********************************************************************** -->
				309
				310	<div class="doc_text">
				311
<p>Binary expressions are significantly harder to parse because they are often
ambiguous. For example, when given the string "x+y*z", the parser can choose
to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
mathematics, we expect the latter parse, because "*" (multiplication) has
higher <em>precedence</em> than "+" (addition).</p>
				317
				318	<p>There are many ways to handle this, but an elegant and efficient way is to
				319	use <a href=
				320	"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
				321	Parsing</a>. This parsing technique uses the precedence of binary operators to
				322	guide recursion. To start with, we need a table of precedences:</p>
				323
				324	<div class="doc_code">
				325	<pre>
(* binop_precedence - This holds the precedence for each binary operator that is
 * defined *)
let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10

(* precedence - Get the precedence of the pending binary operator token. *)
let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1

...

let main () =
  (* Install standard binary operators.
   * 1 is the lowest precedence. *)
  Hashtbl.add Parser.binop_precedence '<' 10;
  Hashtbl.add Parser.binop_precedence '+' 20;
  Hashtbl.add Parser.binop_precedence '-' 20;
  Hashtbl.add Parser.binop_precedence '*' 40;    (* highest. *)
  ...
				343	</pre>
				344	</div>
				345
<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
(this can obviously be extended by you, our brave and intrepid reader). The
<tt>Parser.precedence</tt> function returns the precedence for the current
token, or -1 if the token is not a binary operator. Having a <tt>Hashtbl.t</tt>
makes it easy to add new operators and makes it clear that the algorithm doesn't
depend on the specific operators involved, but it would be easy enough to
eliminate the <tt>Hashtbl.t</tt> and do the comparisons in the
<tt>Parser.precedence</tt> function. (Or just use a fixed-size array).</p>

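<p>The two options described above can be placed side by side. The first half
of this sketch is the table-driven lookup; the second half, named
<tt>precedence_fixed</tt> purely for this example, is one possible way to do the
comparisons directly in the function.</p>

```ocaml
(* Table-driven lookup: operators can be added at run time. *)
let binop_precedence : (char, int) Hashtbl.t = Hashtbl.create 10

let precedence c =
  try Hashtbl.find binop_precedence c with Not_found -> -1

let () =
  List.iter
    (fun (op, prec) -> Hashtbl.add binop_precedence op prec)
    [ ('<', 10); ('+', 20); ('-', 20); ('*', 40) ]

(* Hypothetical table-free variant: the operator set is fixed at compile
 * time, and anything that is not a known binary operator yields -1. *)
let precedence_fixed = function
  | '<' -> 10
  | '+' | '-' -> 20
  | '*' -> 40
  | _ -> -1
```

<p>Both functions agree on every character; the table version is the one that
lets later chapters install new operators without touching the parser.</p>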
<p>With the helper above defined, we can now start parsing binary expressions.
The basic idea of operator precedence parsing is to break down an expression
with potentially ambiguous binary operators into pieces. Consider, for example,
the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
as a stream of primary expressions separated by binary operators. As such,
it will first parse the leading primary expression "a", then it will see the
pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
are primary expressions, the binary expression parser doesn't need to worry
about nested subexpressions like (c+d) at all.
</p>
				365
				366	<p>
				367	To start, an expression is a primary expression potentially followed by a
				368	sequence of [binop,primaryexpr] pairs:</p>
				369
				370	<div class="doc_code">
				371	<pre>
(* expression
 *   ::= primary binoprhs *)
and parse_expr = parser
  | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
				376	</pre>
				377	</div>
				378
<p><tt>Parser.parse_bin_rhs</tt> is the function that parses the sequence of
pairs for us. It takes a precedence and the expression for the part
that has been parsed so far. Note that "x" is a perfectly valid expression: as
such, "binoprhs" is allowed to be empty, in which case it returns the expression
that is passed into it. In our example above, the code passes the expression for
"a" into <tt>Parser.parse_bin_rhs</tt> and the current token is "+".</p>

<p>The precedence value passed into <tt>Parser.parse_bin_rhs</tt> indicates the
<em>minimal operator precedence</em> that the function is allowed to eat. For
example, if the current pair stream is [+, x] and <tt>Parser.parse_bin_rhs</tt>
is passed in a precedence of 40, it will not consume any tokens (because the
precedence of '+' is only 20). With this in mind, <tt>Parser.parse_bin_rhs</tt>
starts with:</p>

				393	<div class="doc_code">
				394	<pre>
(* binoprhs
 *   ::= ('+' primary)* *)
and parse_bin_rhs expr_prec lhs stream =
  match Stream.peek stream with
  (* If this is a binop, find its precedence. *)
  | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
      let token_prec = precedence c in

      (* If this is a binop that binds at least as tightly as the current binop,
       * consume it, otherwise we are done. *)
      if token_prec < expr_prec then lhs else begin
				406	</pre>
				407	</div>
				408
<p>This code gets the precedence of the current token and checks to see if it is
too low. Because we defined invalid tokens to have a precedence of -1, this
check implicitly knows that the pair-stream ends when the token stream runs out
of binary operators. If this check succeeds, we know that the token is a binary
operator and that it will be included in this expression:</p>
				414
				415	<div class="doc_code">
				416	<pre>
        (* Eat the binop. *)
        Stream.junk stream;

        (* Parse the primary expression after the binary operator. *)
        let rhs = parse_primary stream in

        (* Okay, we know this is a binop. *)
        let rhs =
          match Stream.peek stream with
          | Some (Token.Kwd c2) ->
				424	</pre>
				425	</div>
				426
<p>Now that we parsed the left-hand side of an expression and one pair of the
RHS sequence, we have to decide which way the expression associates. In
particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
To determine this, we look ahead at "binop" to determine its precedence and
compare it to the precedence of the current operator, <tt>token_prec</tt>
(the precedence of '+' in this case):</p>
				436
				437	<div class="doc_code">
				438	<pre>
              (* If BinOp binds less tightly with rhs than the operator after
               * rhs, let the pending operator take rhs as its lhs. *)
              let next_prec = precedence c2 in
              if token_prec < next_prec
				443	</pre>
				444	</div>
				445
<p>If the precedence of the binop to the right of "RHS" is lower than or equal
to the precedence of our current operator, then we know that the parentheses
associate as "(a+b) binop ...". In our example, the current operator is "+" and
the next operator is "+", so they have the same precedence. In this case we'll
create the AST node for "a+b", and then continue parsing:</p>
				451
				452	<div class="doc_code">
				453	<pre>
          ... if body omitted ...
        in

        (* Merge lhs/rhs. *)
        let lhs = Ast.Binary (c, lhs, rhs) in
        parse_bin_rhs expr_prec lhs stream
      end
				460	end
				461	</pre>
				462	</div>
				463
<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
iteration (a recursive call to <tt>parse_bin_rhs</tt>), with "+" as the current
token. The code above will eat, remember, and parse "(c+d)" as the primary
expression, which makes the current pair equal to [+, (c+d)]. It will then
evaluate the 'if' conditional above with "*" as the binop to the right of the
primary. In this case, the precedence of "*" is higher than the precedence of
"+" so the if condition will be entered.</p>
				470
<p>The critical question left here is "how can the if condition parse the right
hand side in full?" In particular, to build the AST correctly for our example,
it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
do this is surprisingly simple (code from the above two blocks duplicated for
context):</p>
				476
				477	<div class="doc_code">
				478	<pre>
          match Stream.peek stream with
          | Some (Token.Kwd c2) ->
              (* If BinOp binds less tightly with rhs than the operator after
               * rhs, let the pending operator take rhs as its lhs. *)
              if token_prec < precedence c2
              then <b>parse_bin_rhs (token_prec + 1) rhs stream</b>
              else rhs
          | _ -> rhs
        in

        (* Merge lhs/rhs. *)
        let lhs = Ast.Binary (c, lhs, rhs) in
        parse_bin_rhs expr_prec lhs stream
      end
				492	end
				493	</pre>
				494	</div>
				495
<p>At this point, we know that the binary operator to the RHS of our primary
has higher precedence than the binop we are currently parsing. As such, we know
that any sequence of pairs whose operators are all higher precedence than "+"
should be parsed together and returned as "RHS". To do this, we recursively
invoke the <tt>Parser.parse_bin_rhs</tt> function specifying "token_prec+1" as
the minimum precedence required for it to continue. In our example above, this
will cause it to return the AST node for "(c+d)*e*f" as RHS, which is then set
as the RHS of the '+' expression.</p>

<p>Finally, on the next recursive call, the "+g" piece is parsed
and added to the AST. With this little bit of code (14 non-trivial lines), we
correctly handle fully general binary expression parsing in a very elegant way.
This was a whirlwind tour of this code, and it is somewhat subtle. I recommend
running through it with a few tough examples to see how it works.
</p>
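<p>For readers who want to run through a few tough examples without setting up
camlp4, here is a self-contained sketch of the same algorithm over a plain list
of tokens. Several things in it are assumptions made for the example: the token
and AST types are cut down to identifiers and operators, the precedences are
hard-coded instead of stored in a <tt>Hashtbl.t</tt>, <tt>tokenize</tt> treats
every character as one token, and <tt>show</tt> prints the tree fully
parenthesized.</p>

```ocaml
type token = Ident of string | Kwd of char
type expr = Variable of string | Binary of char * expr * expr

exception Parse_error of string

(* Hard-coded precedences; -1 means "not a binary operator". *)
let precedence = function
  | '<' -> 10
  | '+' | '-' -> 20
  | '*' -> 40
  | _ -> -1

(* Each parsing function returns the result and the unconsumed tokens. *)
let rec parse_primary = function
  | Ident id :: rest -> (Variable id, rest)
  | Kwd '(' :: rest ->
      let e, rest = parse_expr rest in
      (match rest with
       | Kwd ')' :: rest -> (e, rest)
       | _ -> raise (Parse_error "expected ')'"))
  | _ -> raise (Parse_error "unknown token when expecting an expression.")

(* expression ::= primary binoprhs *)
and parse_expr tokens =
  let lhs, rest = parse_primary tokens in
  parse_bin_rhs 0 lhs rest

and parse_bin_rhs expr_prec lhs tokens =
  match tokens with
  (* Only consume operators that bind at least as tightly as expr_prec. *)
  | Kwd c :: rest when precedence c >= 0 && precedence c >= expr_prec ->
      let token_prec = precedence c in
      let rhs, rest = parse_primary rest in
      let rhs, rest =
        match rest with
        (* The next operator binds tighter: let it take rhs as its lhs. *)
        | Kwd c2 :: _ when token_prec < precedence c2 ->
            parse_bin_rhs (token_prec + 1) rhs rest
        | _ -> (rhs, rest)
      in
      (* Merge lhs/rhs and keep going at the same minimum precedence. *)
      parse_bin_rhs expr_prec (Binary (c, lhs, rhs)) rest
  | _ -> (lhs, tokens)

(* tokenize - One token per character; lowercase letters are identifiers. *)
let tokenize s =
  List.map
    (fun ch ->
      match ch with
      | 'a' .. 'z' -> Ident (String.make 1 ch)
      | _ -> Kwd ch)
    (List.of_seq (String.to_seq s))

let rec show = function
  | Variable v -> v
  | Binary (op, l, r) -> Printf.sprintf "(%s%c%s)" (show l) op (show r)

let () =
  let e, _ = parse_expr (tokenize "a+b+(c+d)*e*f+g") in
  print_endline (show e)
```

<p>For the running example this prints
<tt>(((a+b)+(((c+d)*e)*f))+g)</tt>, which matches the grouping described
above: "(c+d)*e*f" becomes the right-hand side of the second '+'.</p>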
				511
				512	<p>This wraps up handling of expressions. At this point, we can point the
				513	parser at an arbitrary token stream and build an expression from it, stopping
				514	at the first token that is not part of the expression. Next up we need to
				515	handle function definitions, etc.</p>
				516
				517	</div>
				518
				519	<!-- *********************************************************************** -->
				520	<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
				521	<!-- *********************************************************************** -->
				522
				523	<div class="doc_text">
				524
<p>
The next thing missing is handling of function prototypes. In Kaleidoscope,
these are used both for 'extern' function declarations as well as function body
definitions. The code to do this is straightforward and not very interesting
(once you've survived expressions):
</p>
				531
				532	<div class="doc_code">
				533	<pre>
(* prototype
 *   ::= id '(' id* ')' *)
let parse_prototype =
  let rec parse_args accumulator = parser
    | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
    | [< >] -> accumulator
  in

  parser
  | [< 'Token.Ident id;
       'Token.Kwd '(' ?? "expected '(' in prototype";
       args=parse_args [];
       'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
      (* success. *)
      Ast.Prototype (id, Array.of_list (List.rev args))

  | [< >] ->
      raise (Stream.Error "expected function name in prototype")
				552	</pre>
				553	</div>
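<p>The same grammar rule can be sketched over a plain token list, which makes
the three error cases easy to see. This is an illustration only: the token type
is cut down, <tt>Parse_error</tt> is a hypothetical stand-in for
<tt>Stream.Error</tt>, and the function returns the unconsumed tokens alongside
the prototype.</p>

```ocaml
type token = Def | Extern | Ident of string | Kwd of char
type proto = Prototype of string * string array

exception Parse_error of string

(* prototype ::= id '(' id* ')' *)
let parse_prototype = function
  | Ident name :: Kwd '(' :: rest ->
      (* Argument names are accumulated in reverse, then flipped. *)
      let rec parse_args acc = function
        | Ident id :: rest -> parse_args (id :: acc) rest
        | Kwd ')' :: rest ->
            (Prototype (name, Array.of_list (List.rev acc)), rest)
        | _ -> raise (Parse_error "expected ')' in prototype")
      in
      parse_args [] rest
  | Ident _ :: _ -> raise (Parse_error "expected '(' in prototype")
  | _ -> raise (Parse_error "expected function name in prototype")
```

<p>Note that, as in Kaleidoscope itself, argument names are separated by
whitespace rather than commas, so "foo(x y)" is the expected input form.</p>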
				554
				555	<p>Given this, a function definition is very simple, just a prototype plus
				556	an expression to implement the body:</p>
				557
				558	<div class="doc_code">
				559	<pre>
(* definition ::= 'def' prototype expression *)
let parse_definition = parser
  | [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
      Ast.Function (p, e)
				564	</pre>
				565	</div>
				566
				567	<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
				568	well as to support forward declaration of user functions. These 'extern's are just
				569	prototypes with no body:</p>
				570
				571	<div class="doc_code">
				572	<pre>
(*  external ::= 'extern' prototype *)
let parse_extern = parser
  | [< 'Token.Extern; e=parse_prototype >] -> e
				576	</pre>
				577	</div>
				578
				579	<p>Finally, we'll also let the user type in arbitrary top-level expressions and
				580	evaluate them on the fly. We will handle this by defining anonymous nullary
				581	(zero argument) functions for them:</p>
				582
				583	<div class="doc_code">
				584	<pre>
(* toplevelexpr ::= expression *)
let parse_toplevel = parser
  | [< e=parse_expr >] ->
      (* Make an anonymous proto. *)
      Ast.Function (Ast.Prototype ("", [||]), e)
				590	</pre>
				591	</div>
				592
				593	<p>Now that we have all the pieces, let's build a little driver that will let us
				594	actually <em>execute</em> this code we've built!</p>
				595
				596	</div>
				597
				598	<!-- *********************************************************************** -->
				599	<div class="doc_section"><a name="driver">The Driver</a></div>
				600	<!-- *********************************************************************** -->
				601
				602	<div class="doc_text">
				603
				604	<p>The driver for this simply invokes all of the parsing pieces with a top-level
				605	dispatch loop. There isn't much interesting here, so I'll just include the
				606	top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
				607	Parsing" section.</p>
				608
				609	<div class="doc_code">
				610	<pre>
(* top ::= definition | external | expression | ';' *)
let rec main_loop stream =
  match Stream.peek stream with
  | None -> ()

  (* ignore top-level semicolons. *)
  | Some (Token.Kwd ';') ->
      Stream.junk stream;
      main_loop stream

  | Some token ->
      begin
        try match token with
        | Token.Def ->
            ignore(Parser.parse_definition stream);
            print_endline "parsed a function definition.";
        | Token.Extern ->
            ignore(Parser.parse_extern stream);
            print_endline "parsed an extern.";
        | _ ->
            (* Evaluate a top-level expression into an anonymous function. *)
            ignore(Parser.parse_toplevel stream);
            print_endline "parsed a top-level expr";
        with Stream.Error s ->
          (* Skip token for error recovery. *)
          Stream.junk stream;
          print_endline s;
      end;
      print_string "ready> "; flush stdout;
      main_loop stream
				641	</pre>
				642	</div>
				643
				644	<p>The most interesting part of this is that we ignore top-level semicolons.
				645	Why is this, you ask? The basic reason is that if you type "4 + 5" at the
				646	command line, the parser doesn't know whether that is the end of what you will type
				647	or not. For example, on the next line you could type "def foo..." in which case
				648	4+5 is the end of a top-level expression. Alternatively you could type "* 6",
				649	which would continue the expression. Having top-level semicolons allows you to
				650	type "4+5;", and the parser will know you are done.</p>
				651
				652	</div>
				653
				654	<!-- *********************************************************************** -->
				655	<div class="doc_section"><a name="conclusions">Conclusions</a></div>
				656	<!-- *********************************************************************** -->
				657
				658	<div class="doc_text">
				659
				660	<p>With just under 300 lines of commented code (240 lines of non-comment,
				661	non-blank code), we fully defined our minimal language, including a lexer,
				662	parser, and AST builder. With this done, the executable will validate
				663	Kaleidoscope code and tell us if it is grammatically invalid. For
				664	example, here is a sample interaction:</p>
				665
				666	<div class="doc_code">
				667	<pre>
				668	$ <b>./toy.byte</b>
				669	ready> <b>def foo(x y) x+foo(y, 4.0);</b>
				670	Parsed a function definition.
				671	ready> <b>def foo(x y) x+y y;</b>
				672	Parsed a function definition.
				673	Parsed a top-level expr
				674	ready> <b>def foo(x y) x+y );</b>
				675	Parsed a function definition.
				676	Error: unknown token when expecting an expression
				677	ready> <b>extern sin(a);</b>
				678	ready> Parsed an extern
				679	ready> <b>^D</b>
				680	$
				681	</pre>
				682	</div>
				683
				684	<p>There is a lot of room for extension here. You can define new AST nodes,
				685	extend the language in many ways, etc. In the <a href="OCamlLangImpl3.html">
				686	next installment</a>, we will describe how to generate LLVM Intermediate
				687	Representation (IR) from the AST.</p>
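<p>To give a feel for what such an extension might look like, the sketch below
copies the <tt>expr</tt> type from ast.ml and adds one extra variant. The
<tt>Unary</tt> variant and the <tt>to_string</tt> function are assumptions made up
for this example; they are not part of this chapter's code.</p>

```ocaml
(* Copy of the expr type from ast.ml, plus one hypothetical extra variant. *)
type expr =
  | Number of float
  | Variable of string
  | Unary of char * expr            (* hypothetical addition, e.g. "-x" *)
  | Binary of char * expr * expr
  | Call of string * expr array

(* Render an expression as a fully parenthesized string. *)
let rec to_string = function
  | Number n -> Printf.sprintf "%g" n
  | Variable name -> name
  | Unary (op, e) -> Printf.sprintf "(%c%s)" op (to_string e)
  | Binary (op, l, r) ->
      Printf.sprintf "(%s %c %s)" (to_string l) op (to_string r)
  | Call (f, args) ->
      let args = Array.to_list (Array.map to_string args) in
      Printf.sprintf "%s(%s)" f (String.concat ", " args)

let () =
  (* Prints "(x + (-4))". *)
  print_endline (to_string (Binary ('+', Variable "x", Unary ('-', Number 4.0))))
```

<p>Because the AST is an ordinary variant type, adding a node means adding a
constructor and then handling it wherever the compiler reports a non-exhaustive
match.</p>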
				688
				689	</div>
				690
				691	<!-- *********************************************************************** -->
				692	<div class="doc_section"><a name="code">Full Code Listing</a></div>
				693	<!-- *********************************************************************** -->
				694
				695	<div class="doc_text">
				696
				697	<p>
				698	Here is the complete code listing for this and the previous chapter.
				699	Note that it is fully self-contained: you don't need LLVM or any external
				700	libraries at all for this, apart from the OCaml standard libraries and the
				701	Camlp4 preprocessor enabled in the _tags file. To build this, just compile with:</p>
				702
				703	<div class="doc_code">
				704	<pre>
				705	# Compile
				706	ocamlbuild toy.byte
				707	# Run
Erick Tryzelaar	d564686	2008-03-30 19:14:31 +0000	[diff] [blame]	708	./toy.byte
Erick Tryzelaar	9ba8a57	2008-03-27 08:18:07 +0000	[diff] [blame]	709	</pre>
				710	</div>
				711
				712	<p>Here is the code:</p>
				713
				714	<dl>
				715	<dt>_tags:</dt>
				716	<dd class="doc_code">
				717	<pre>
				718	<{lexer,parser}.ml>: use_camlp4, pp(camlp4of)
				719	</pre>
				720	</dd>
				721
				722	<dt>token.ml:</dt>
				723	<dd class="doc_code">
				724	<pre>
				725	(*===----------------------------------------------------------------------===
				726	 * Lexer Tokens
				727	 *===----------------------------------------------------------------------===*)
				728
				729	(* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of
				730	 * these others for known things. *)
				731	type token =
				732	  (* commands *)
				733	  | Def | Extern
				734
				735	  (* primary *)
				736	  | Ident of string | Number of float
				737
				738	  (* unknown *)
				739	  | Kwd of char
				740	</pre>
				741	</dd>
				742
				743	<dt>lexer.ml:</dt>
				744	<dd class="doc_code">
				745	<pre>
				746	(*===----------------------------------------------------------------------===
				747	 * Lexer
				748	 *===----------------------------------------------------------------------===*)
				749
				750	let rec lex = parser
				751	  (* Skip any whitespace. *)
				752	  | [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
				753
				754	  (* identifier: [a-zA-Z][a-zA-Z0-9]* *)
				755	  | [< ' ('A' .. 'Z' | 'a' .. 'z' as c); stream >] ->
				756	      let buffer = Buffer.create 1 in
				757	      Buffer.add_char buffer c;
				758	      lex_ident buffer stream
				759
				760	  (* number: [0-9.]+ *)
				761	  | [< ' ('0' .. '9' as c); stream >] ->
				762	      let buffer = Buffer.create 1 in
				763	      Buffer.add_char buffer c;
				764	      lex_number buffer stream
				765
				766	  (* Comment until end of line. *)
				767	  | [< ' ('#'); stream >] ->
				768	      lex_comment stream
				769
				770	  (* Otherwise, just return the character as its ASCII value. *)
				771	  | [< 'c; stream >] ->
				772	      [< 'Token.Kwd c; lex stream >]
				773
				774	  (* end of stream. *)
				775	  | [< >] -> [< >]
				776
				777	and lex_number buffer = parser
				778	  | [< ' ('0' .. '9' | '.' as c); stream >] ->
				779	      Buffer.add_char buffer c;
				780	      lex_number buffer stream
				781	  | [< stream=lex >] ->
				782	      [< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >]
				783
				784	and lex_ident buffer = parser
				785	  | [< ' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' as c); stream >] ->
				786	      Buffer.add_char buffer c;
				787	      lex_ident buffer stream
				788	  | [< stream=lex >] ->
				789	      match Buffer.contents buffer with
				790	      | "def" -> [< 'Token.Def; stream >]
				791	      | "extern" -> [< 'Token.Extern; stream >]
				792	      | id -> [< 'Token.Ident id; stream >]
				793
				794	and lex_comment = parser
				795	  | [< ' ('\n'); stream=lex >] -> stream
				796	  | [< 'c; e=lex_comment >] -> e
				797	  | [< >] -> [< >]
				798	</pre>
				799	</dd>
				800
				801	<dt>ast.ml:</dt>
				802	<dd class="doc_code">
				803	<pre>
				804	(*===----------------------------------------------------------------------===
				805	 * Abstract Syntax Tree (aka Parse Tree)
				806	 *===----------------------------------------------------------------------===*)
				807
				808	(* expr - Base type for all expression nodes. *)
				809	type expr =
				810	  (* variant for numeric literals like "1.0". *)
				811	  | Number of float
				812
				813	  (* variant for referencing a variable, like "a". *)
				814	  | Variable of string
				815
				816	  (* variant for a binary operator. *)
				817	  | Binary of char * expr * expr
				818
				819	  (* variant for function calls. *)
				820	  | Call of string * expr array
				821
				822	(* proto - This type represents the "prototype" for a function, which captures
				823	 * its name, and its argument names (thus implicitly the number of arguments the
				824	 * function takes). *)
				825	type proto = Prototype of string * string array
				826
				827	(* func - This type represents a function definition itself. *)
				828	type func = Function of proto * expr
				829	</pre>
				830	</dd>
				831
				832	<dt>parser.ml:</dt>
				833	<dd class="doc_code">
				834	<pre>
				835	(*===---------------------------------------------------------------------===
				836	 * Parser
				837	 *===---------------------------------------------------------------------===*)
				838
				839	(* binop_precedence - This holds the precedence for each binary operator that is
				840	 * defined *)
				841	let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
				842
				843	(* precedence - Get the precedence of the pending binary operator token. *)
				844	let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1
				845
				846	(* primary
				847	 *   ::= identifier
				848	 *   ::= numberexpr
				849	 *   ::= parenexpr *)
				850	let rec parse_primary = parser
				851	  (* numberexpr ::= number *)
				852	  | [< 'Token.Number n >] -> Ast.Number n
				853
				854	  (* parenexpr ::= '(' expression ')' *)
				855	  | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
				856
				857	  (* identifierexpr
				858	   *   ::= identifier
				859	   *   ::= identifier '(' argumentexpr ')' *)
				860	  | [< 'Token.Ident id; stream >] ->
				861	      let rec parse_args accumulator = parser
				862	        | [< e=parse_expr; stream >] ->
				863	            begin parser
				864	              | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
				865	              | [< >] -> e :: accumulator
				866	            end stream
				867	        | [< >] -> accumulator
				868	      in
				869	      let rec parse_ident id = parser
				870	        (* Call. *)
				871	        | [< 'Token.Kwd '(';
				872	             args=parse_args [];
				873	             'Token.Kwd ')' ?? "expected ')'">] ->
				874	            Ast.Call (id, Array.of_list (List.rev args))
				875
				876	        (* Simple variable ref. *)
				877	        | [< >] -> Ast.Variable id
				878	      in
				879	      parse_ident id stream
				880
				881	  | [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
				882
				883	(* binoprhs
				884	 *   ::= ('+' primary)* *)
				885	and parse_bin_rhs expr_prec lhs stream =
				886	  match Stream.peek stream with
				887	  (* If this is a binop, find its precedence. *)
				888	  | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
				889	      let token_prec = precedence c in
				890
				891	      (* If this is a binop that binds at least as tightly as the current binop,
				892	       * consume it, otherwise we are done. *)
				893	      if token_prec < expr_prec then lhs else begin
				894	        (* Eat the binop. *)
				895	        Stream.junk stream;
				896
				897	        (* Parse the primary expression after the binary operator. *)
				898	        let rhs = parse_primary stream in
				899
				900	        (* Okay, we know this is a binop. *)
				901	        let rhs =
				902	          match Stream.peek stream with
				903	          | Some (Token.Kwd c2) ->
				904	              (* If BinOp binds less tightly with rhs than the operator after
				905	               * rhs, let the pending operator take rhs as its lhs. *)
				906	              let next_prec = precedence c2 in
				907	              if token_prec < next_prec
				908	              then parse_bin_rhs (token_prec + 1) rhs stream
				909	              else rhs
				910	          | _ -> rhs
				911	        in
				912
				913	        (* Merge lhs/rhs. *)
				914	        let lhs = Ast.Binary (c, lhs, rhs) in
				915	        parse_bin_rhs expr_prec lhs stream
				916	      end
				917	  | _ -> lhs
				918
				919	(* expression
				920	 *   ::= primary binoprhs *)
				921	and parse_expr = parser
				922	  | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
				923
				924	(* prototype
				925	 *   ::= id '(' id* ')' *)
				926	let parse_prototype =
				927	  let rec parse_args accumulator = parser
				928	    | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
				929	    | [< >] -> accumulator
				930	  in
				931
				932	  parser
				933	  | [< 'Token.Ident id;
				934	       'Token.Kwd '(' ?? "expected '(' in prototype";
				935	       args=parse_args [];
				936	       'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
				937	      (* success. *)
				938	      Ast.Prototype (id, Array.of_list (List.rev args))
				939
				940	  | [< >] ->
				941	      raise (Stream.Error "expected function name in prototype")
				942
				943	(* definition ::= 'def' prototype expression *)
				944	let parse_definition = parser
				945	  | [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
				946	      Ast.Function (p, e)
				947
				948	(* toplevelexpr ::= expression *)
				949	let parse_toplevel = parser
				950	  | [< e=parse_expr >] ->
				951	      (* Make an anonymous proto. *)
				952	      Ast.Function (Ast.Prototype ("", [||]), e)
				953
				954	(* external ::= 'extern' prototype *)
				955	let parse_extern = parser
				956	  | [< 'Token.Extern; e=parse_prototype >] -> e
				957	</pre>
				958	</dd>
				959
				960	<dt>toplevel.ml:</dt>
				961	<dd class="doc_code">
				962	<pre>
				963	(*===----------------------------------------------------------------------===
				964	 * Top-Level parsing and JIT Driver
				965	 *===----------------------------------------------------------------------===*)
				966
				967	(* top ::= definition | external | expression | ';' *)
				968	let rec main_loop stream =
				969	  match Stream.peek stream with
				970	  | None -> ()
				971
				972	  (* ignore top-level semicolons. *)
				973	  | Some (Token.Kwd ';') ->
				974	      Stream.junk stream;
				975	      main_loop stream
				976
				977	  | Some token ->
				978	      begin
				979	        try match token with
				980	        | Token.Def ->
				981	            ignore(Parser.parse_definition stream);
				982	            print_endline "parsed a function definition.";
				983	        | Token.Extern ->
				984	            ignore(Parser.parse_extern stream);
				985	            print_endline "parsed an extern.";
				986	        | _ ->
				987	            (* Evaluate a top-level expression into an anonymous function. *)
				988	            ignore(Parser.parse_toplevel stream);
				989	            print_endline "parsed a top-level expr";
				990	        with Stream.Error s ->
				991	          (* Skip token for error recovery. *)
				992	          Stream.junk stream;
				993	          print_endline s;
				994	      end;
				995	      print_string "ready> "; flush stdout;
				996	      main_loop stream
				997	</pre>
				998	</dd>
				999
				1000	<dt>toy.ml:</dt>
				1001	<dd class="doc_code">
				1002	<pre>
				1003	(*===----------------------------------------------------------------------===
				1004	 * Main driver code.
				1005	 *===----------------------------------------------------------------------===*)
				1006
				1007	let main () =
				1008	  (* Install standard binary operators.
				1009	   * 1 is the lowest precedence. *)
				1010	  Hashtbl.add Parser.binop_precedence '<' 10;
				1011	  Hashtbl.add Parser.binop_precedence '+' 20;
				1012	  Hashtbl.add Parser.binop_precedence '-' 20;
				1013	  Hashtbl.add Parser.binop_precedence '*' 40;    (* highest. *)
				1014
				1015	  (* Prime the first token. *)
				1016	  print_string "ready> "; flush stdout;
				1017	  let stream = Lexer.lex (Stream.of_channel stdin) in
				1018
				1019	  (* Run the main "interpreter loop" now. *)
				1020	  Toplevel.main_loop stream;
				1021	;;
				1022
				1023	main ()
				1024	</pre>
				1025	</dd>
				1026	</dl>
				1027
				1028	<a href="OCamlLangImpl3.html">Next: Implementing Code Generation to LLVM IR</a>
				1029	</div>
				1030
				1031	<!-- *********************************************************************** -->
				1032	<hr>
				1033	<address>
				1034	<a href="http://jigsaw.w3.org/css-validator/check/referer"><img
				1035	src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
				1036	<a href="http://validator.w3.org/check/referer"><img
				1037	src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
				1038
				1039	<a href="mailto:sabre@nondot.org">Chris Lattner</a>
				1040	<a href="mailto:erickt@users.sourceforge.net">Erick Tryzelaar</a><br>
NAKAMURA Takumi	b9a3363	2011-04-09 02:13:37 +0000	[diff] [blame^]	1041	<a href="http://llvm.org/">The LLVM Compiler Infrastructure</a><br>
Dan Gohman	523e392	2010-02-03 17:27:31 +0000	[diff] [blame]	1042	Last modified: $Date$
Erick Tryzelaar	9ba8a57	2008-03-27 08:18:07 +0000	[diff] [blame]	1043	</address>
				1044	</body>
				1045	</html>