Blame - docs/tutorial/OCamlLangImpl2.rst - fp2-dev/platform/external/llvm

blob: 07490e1f67df87805adda2af0146c45d3c6f681d [file] [log] [blame]

Sean Silva	ee47edf	2012-12-05 00:26:32 +0000	[diff] [blame]	1	===========================================
				2	Kaleidoscope: Implementing a Parser and AST
				3	===========================================
				4
				5	.. contents::
				6	:local:
				7
				8	Written by `Chris Lattner <mailto:sabre@nondot.org>`_ and `Erick
				9	Tryzelaar <mailto:idadesub@users.sourceforge.net>`_
				10
				11	Chapter 2 Introduction
				12	======================
				13
				14	Welcome to Chapter 2 of the "`Implementing a language with LLVM in
				15	Objective Caml <index.html>`_" tutorial. This chapter shows you how to
				16	use the lexer, built in `Chapter 1 <OCamlLangImpl1.html>`_, to build a
				17	full `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our
				18	Kaleidoscope language. Once we have a parser, we'll define and build an
				19	`Abstract Syntax
				20	Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
				21
				22	The parser we will build uses a combination of `Recursive Descent
				23	Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
				24	`Operator-Precedence
				25	Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
				26	parse the Kaleidoscope language (the latter for binary expressions and
				27	the former for everything else). Before we get to parsing though, lets
				28	talk about the output of the parser: the Abstract Syntax Tree.
				29
				30	The Abstract Syntax Tree (AST)
				31	==============================
				32
				33	The AST for a program captures its behavior in such a way that it is
				34	easy for later stages of the compiler (e.g. code generation) to
				35	interpret. We basically want one object for each construct in the
				36	language, and the AST should closely model the language. In
				37	Kaleidoscope, we have expressions, a prototype, and a function object.
				38	We'll start with expressions first:
				39
				40	.. code-block:: ocaml
				41
				42	(* expr - Base type for all expression nodes. *)
				43	type expr =
				44	(* variant for numeric literals like "1.0". *)
				45	\| Number of float
				46
				47	The code above shows the definition of the base ExprAST class and one
				48	subclass which we use for numeric literals. The important thing to note
				49	about this code is that the Number variant captures the numeric value of
				50	the literal as an instance variable. This allows later phases of the
				51	compiler to know what the stored numeric value is.
				52
				53	Right now we only create the AST, so there are no useful functions on
				54	them. It would be very easy to add a function to pretty print the code,
				55	for example. Here are the other expression AST node definitions that
				56	we'll use in the basic form of the Kaleidoscope language:
				57
				58	.. code-block:: ocaml
				59
				60	(* variant for referencing a variable, like "a". *)
				61	\| Variable of string
				62
				63	(* variant for a binary operator. *)
				64	\| Binary of char * expr * expr
				65
				66	(* variant for function calls. *)
				67	\| Call of string * expr array
				68
				69	This is all (intentionally) rather straight-forward: variables capture
				70	the variable name, binary operators capture their opcode (e.g. '+'), and
				71	calls capture a function name as well as a list of any argument
				72	expressions. One thing that is nice about our AST is that it captures
				73	the language features without talking about the syntax of the language.
				74	Note that there is no discussion about precedence of binary operators,
				75	lexical structure, etc.
				76
				77	For our basic language, these are all of the expression nodes we'll
				78	define. Because it doesn't have conditional control flow, it isn't
				79	Turing-complete; we'll fix that in a later installment. The two things
				80	we need next are a way to talk about the interface to a function, and a
				81	way to talk about functions themselves:
				82
				83	.. code-block:: ocaml
				84
				85	(* proto - This type represents the "prototype" for a function, which captures
				86	* its name, and its argument names (thus implicitly the number of arguments the
				87	* function takes). *)
				88	type proto = Prototype of string * string array
				89
				90	(* func - This type represents a function definition itself. *)
				91	type func = Function of proto * expr
				92
				93	In Kaleidoscope, functions are typed with just a count of their
				94	arguments. Since all values are double precision floating point, the
				95	type of each argument doesn't need to be stored anywhere. In a more
				96	aggressive and realistic language, the "expr" variants would probably
				97	have a type field.
				98
				99	With this scaffolding, we can now talk about parsing expressions and
				100	function bodies in Kaleidoscope.
				101
				102	Parser Basics
				103	=============
				104
				105	Now that we have an AST to build, we need to define the parser code to
				106	build it. The idea here is that we want to parse something like "x+y"
				107	(which is returned as three tokens by the lexer) into an AST that could
				108	be generated with calls like this:
				109
				110	.. code-block:: ocaml
				111
				112	let x = Variable "x" in
				113	let y = Variable "y" in
				114	let result = Binary ('+', x, y) in
				115	...
				116
				117	The error handling routines make use of the builtin ``Stream.Failure``
				118	and ``Stream.Error``s. ``Stream.Failure`` is raised when the parser is
				119	unable to find any matching token in the first position of a pattern.
				120	``Stream.Error`` is raised when the first token matches, but the rest do
				121	not. The error recovery in our parser will not be the best and is not
				122	particular user-friendly, but it will be enough for our tutorial. These
				123	exceptions make it easier to handle errors in routines that have various
				124	return types.
				125
				126	With these basic types and exceptions, we can implement the first piece
				127	of our grammar: numeric literals.
				128
				129	Basic Expression Parsing
				130	========================
				131
				132	We start with numeric literals, because they are the simplest to
				133	process. For each production in our grammar, we'll define a function
				134	which parses that production. We call this class of expressions
				135	"primary" expressions, for reasons that will become more clear `later in
				136	the tutorial <OCamlLangImpl6.html#unary>`_. In order to parse an
				137	arbitrary primary expression, we need to determine what sort of
				138	expression it is. For numeric literals, we have:
				139
				140	.. code-block:: ocaml
				141
				142	(* primary
				143	* ::= identifier
				144	* ::= numberexpr
				145	* ::= parenexpr *)
				146	parse_primary = parser
				147	(* numberexpr ::= number *)
				148	\| [< 'Token.Number n >] -> Ast.Number n
				149
				150	This routine is very simple: it expects to be called when the current
				151	token is a ``Token.Number`` token. It takes the current number value,
				152	creates a ``Ast.Number`` node, advances the lexer to the next token, and
				153	finally returns.
				154
				155	There are some interesting aspects to this. The most important one is
				156	that this routine eats all of the tokens that correspond to the
				157	production and returns the lexer buffer with the next token (which is
				158	not part of the grammar production) ready to go. This is a fairly
				159	standard way to go for recursive descent parsers. For a better example,
				160	the parenthesis operator is defined like this:
				161
				162	.. code-block:: ocaml
				163
				164	(* parenexpr ::= '(' expression ')' *)
				165	\| [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
				166
				167	This function illustrates a number of interesting things about the
				168	parser:
				169
				170	1) It shows how we use the ``Stream.Error`` exception. When called, this
				171	function expects that the current token is a '(' token, but after
				172	parsing the subexpression, it is possible that there is no ')' waiting.
				173	For example, if the user types in "(4 x" instead of "(4)", the parser
				174	should emit an error. Because errors can occur, the parser needs a way
				175	to indicate that they happened. In our parser, we use the camlp4
				176	shortcut syntax ``token ?? "parse error"``, where if the token before
				177	the ``??`` does not match, then ``Stream.Error "parse error"`` will be
				178	raised.
				179
				180	2) Another interesting aspect of this function is that it uses recursion
				181	by calling ``Parser.parse_primary`` (we will soon see that
				182	``Parser.parse_primary`` can call ``Parser.parse_primary``). This is
				183	powerful because it allows us to handle recursive grammars, and keeps
				184	each production very simple. Note that parentheses do not cause
				185	construction of AST nodes themselves. While we could do it this way, the
				186	most important role of parentheses are to guide the parser and provide
				187	grouping. Once the parser constructs the AST, parentheses are not
				188	needed.
				189
				190	The next simple production is for handling variable references and
				191	function calls:
				192
				193	.. code-block:: ocaml
				194
				195	(* identifierexpr
				196	* ::= identifier
				197	* ::= identifier '(' argumentexpr ')' *)
				198	\| [< 'Token.Ident id; stream >] ->
				199	let rec parse_args accumulator = parser
				200	\| [< e=parse_expr; stream >] ->
				201	begin parser
				202	\| [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
				203	\| [< >] -> e :: accumulator
				204	end stream
				205	\| [< >] -> accumulator
				206	in
				207	let rec parse_ident id = parser
				208	(* Call. *)
				209	\| [< 'Token.Kwd '(';
				210	args=parse_args [];
				211	'Token.Kwd ')' ?? "expected ')'">] ->
				212	Ast.Call (id, Array.of_list (List.rev args))
				213
				214	(* Simple variable ref. *)
				215	\| [< >] -> Ast.Variable id
				216	in
				217	parse_ident id stream
				218
				219	This routine follows the same style as the other routines. (It expects
				220	to be called if the current token is a ``Token.Ident`` token). It also
				221	has recursion and error handling. One interesting aspect of this is that
				222	it uses look-ahead to determine if the current identifier is a stand
				223	alone variable reference or if it is a function call expression. It
				224	handles this by checking to see if the token after the identifier is a
				225	'(' token, constructing either a ``Ast.Variable`` or ``Ast.Call`` node
				226	as appropriate.
				227
				228	We finish up by raising an exception if we received a token we didn't
				229	expect:
				230
				231	.. code-block:: ocaml
				232
				233	\| [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
				234
				235	Now that basic expressions are handled, we need to handle binary
				236	expressions. They are a bit more complex.
				237
				238	Binary Expression Parsing
				239	=========================
				240
				241	Binary expressions are significantly harder to parse because they are
				242	often ambiguous. For example, when given the string "x+y\*z", the parser
				243	can choose to parse it as either "(x+y)\z" or "x+(y\z)". With common
				244	definitions from mathematics, we expect the later parse, because "\*"
				245	(multiplication) has higher precedence than "+" (addition).
				246
				247	There are many ways to handle this, but an elegant and efficient way is
				248	to use `Operator-Precedence
				249	Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
				250	This parsing technique uses the precedence of binary operators to guide
				251	recursion. To start with, we need a table of precedences:
				252
				253	.. code-block:: ocaml
				254
				255	(* binop_precedence - This holds the precedence for each binary operator that is
				256	* defined *)
				257	let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
				258
				259	(* precedence - Get the precedence of the pending binary operator token. *)
				260	let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1
				261
				262	...
				263
				264	let main () =
				265	(* Install standard binary operators.
				266	* 1 is the lowest precedence. *)
				267	Hashtbl.add Parser.binop_precedence '<' 10;
				268	Hashtbl.add Parser.binop_precedence '+' 20;
				269	Hashtbl.add Parser.binop_precedence '-' 20;
				270	Hashtbl.add Parser.binop_precedence '' 40; ( highest. *)
				271	...
				272
				273	For the basic form of Kaleidoscope, we will only support 4 binary
				274	operators (this can obviously be extended by you, our brave and intrepid
				275	reader). The ``Parser.precedence`` function returns the precedence for
				276	the current token, or -1 if the token is not a binary operator. Having a
				277	``Hashtbl.t`` makes it easy to add new operators and makes it clear that
				278	the algorithm doesn't depend on the specific operators involved, but it
				279	would be easy enough to eliminate the ``Hashtbl.t`` and do the
				280	comparisons in the ``Parser.precedence`` function. (Or just use a
				281	fixed-size array).
				282
				283	With the helper above defined, we can now start parsing binary
				284	expressions. The basic idea of operator precedence parsing is to break
				285	down an expression with potentially ambiguous binary operators into
				286	pieces. Consider ,for example, the expression "a+b+(c+d)\e\f+g".
				287	Operator precedence parsing considers this as a stream of primary
				288	expressions separated by binary operators. As such, it will first parse
				289	the leading primary expression "a", then it will see the pairs [+, b]
				290	[+, (c+d)] [\, e] [\, f] and [+, g]. Note that because parentheses are
				291	primary expressions, the binary expression parser doesn't need to worry
				292	about nested subexpressions like (c+d) at all.
				293
				294	To start, an expression is a primary expression potentially followed by
				295	a sequence of [binop,primaryexpr] pairs:
				296
				297	.. code-block:: ocaml
				298
				299	(* expression
				300	* ::= primary binoprhs *)
				301	and parse_expr = parser
				302	\| [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
				303
				304	``Parser.parse_bin_rhs`` is the function that parses the sequence of
				305	pairs for us. It takes a precedence and a pointer to an expression for
				306	the part that has been parsed so far. Note that "x" is a perfectly valid
				307	expression: As such, "binoprhs" is allowed to be empty, in which case it
				308	returns the expression that is passed into it. In our example above, the
				309	code passes the expression for "a" into ``Parser.parse_bin_rhs`` and the
				310	current token is "+".
				311
				312	The precedence value passed into ``Parser.parse_bin_rhs`` indicates the
				313	minimal operator precedence that the function is allowed to eat. For
				314	example, if the current pair stream is [+, x] and
				315	``Parser.parse_bin_rhs`` is passed in a precedence of 40, it will not
				316	consume any tokens (because the precedence of '+' is only 20). With this
				317	in mind, ``Parser.parse_bin_rhs`` starts with:
				318
				319	.. code-block:: ocaml
				320
				321	(* binoprhs
				322	* ::= ('+' primary)* *)
				323	and parse_bin_rhs expr_prec lhs stream =
				324	match Stream.peek stream with
				325	(* If this is a binop, find its precedence. *)
				326	\| Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
				327	let token_prec = precedence c in
				328
				329	(* If this is a binop that binds at least as tightly as the current binop,
				330	* consume it, otherwise we are done. *)
				331	if token_prec < expr_prec then lhs else begin
				332
				333	This code gets the precedence of the current token and checks to see if
				334	if is too low. Because we defined invalid tokens to have a precedence of
				335	-1, this check implicitly knows that the pair-stream ends when the token
				336	stream runs out of binary operators. If this check succeeds, we know
				337	that the token is a binary operator and that it will be included in this
				338	expression:
				339
				340	.. code-block:: ocaml
				341
				342	(* Eat the binop. *)
				343	Stream.junk stream;
				344
				345	(* Okay, we know this is a binop. *)
				346	let rhs =
				347	match Stream.peek stream with
				348	\| Some (Token.Kwd c2) ->
				349
				350	As such, this code eats (and remembers) the binary operator and then
				351	parses the primary expression that follows. This builds up the whole
				352	pair, the first of which is [+, b] for the running example.
				353
				354	Now that we parsed the left-hand side of an expression and one pair of
				355	the RHS sequence, we have to decide which way the expression associates.
				356	In particular, we could have "(a+b) binop unparsed" or "a + (b binop
				357	unparsed)". To determine this, we look ahead at "binop" to determine its
				358	precedence and compare it to BinOp's precedence (which is '+' in this
				359	case):
				360
				361	.. code-block:: ocaml
				362
				363	(* If BinOp binds less tightly with rhs than the operator after
				364	* rhs, let the pending operator take rhs as its lhs. *)
				365	let next_prec = precedence c2 in
				366	if token_prec < next_prec
				367
				368	If the precedence of the binop to the right of "RHS" is lower or equal
				369	to the precedence of our current operator, then we know that the
				370	parentheses associate as "(a+b) binop ...". In our example, the current
				371	operator is "+" and the next operator is "+", we know that they have the
				372	same precedence. In this case we'll create the AST node for "a+b", and
				373	then continue parsing:
				374
				375	.. code-block:: ocaml
				376
				377	... if body omitted ...
				378	in
				379
				380	(* Merge lhs/rhs. *)
				381	let lhs = Ast.Binary (c, lhs, rhs) in
				382	parse_bin_rhs expr_prec lhs stream
				383	end
				384
				385	In our example above, this will turn "a+b+" into "(a+b)" and execute the
				386	next iteration of the loop, with "+" as the current token. The code
				387	above will eat, remember, and parse "(c+d)" as the primary expression,
				388	which makes the current pair equal to [+, (c+d)]. It will then evaluate
				389	the 'if' conditional above with "\*" as the binop to the right of the
				390	primary. In this case, the precedence of "\*" is higher than the
				391	precedence of "+" so the if condition will be entered.
				392
				393	The critical question left here is "how can the if condition parse the
				394	right hand side in full"? In particular, to build the AST correctly for
				395	our example, it needs to get all of "(c+d)\e\f" as the RHS expression
				396	variable. The code to do this is surprisingly simple (code from the
				397	above two blocks duplicated for context):
				398
				399	.. code-block:: ocaml
				400
				401	match Stream.peek stream with
				402	\| Some (Token.Kwd c2) ->
				403	(* If BinOp binds less tightly with rhs than the operator after
				404	* rhs, let the pending operator take rhs as its lhs. *)
				405	if token_prec < precedence c2
				406	then parse_bin_rhs (token_prec + 1) rhs stream
				407	else rhs
				408	\| _ -> rhs
				409	in
				410
				411	(* Merge lhs/rhs. *)
				412	let lhs = Ast.Binary (c, lhs, rhs) in
				413	parse_bin_rhs expr_prec lhs stream
				414	end
				415
				416	At this point, we know that the binary operator to the RHS of our
				417	primary has higher precedence than the binop we are currently parsing.
				418	As such, we know that any sequence of pairs whose operators are all
				419	higher precedence than "+" should be parsed together and returned as
				420	"RHS". To do this, we recursively invoke the ``Parser.parse_bin_rhs``
				421	function specifying "token\_prec+1" as the minimum precedence required
				422	for it to continue. In our example above, this will cause it to return
				423	the AST node for "(c+d)\e\f" as RHS, which is then set as the RHS of
				424	the '+' expression.
				425
				426	Finally, on the next iteration of the while loop, the "+g" piece is
				427	parsed and added to the AST. With this little bit of code (14
				428	non-trivial lines), we correctly handle fully general binary expression
				429	parsing in a very elegant way. This was a whirlwind tour of this code,
				430	and it is somewhat subtle. I recommend running through it with a few
				431	tough examples to see how it works.
				432
				433	This wraps up handling of expressions. At this point, we can point the
				434	parser at an arbitrary token stream and build an expression from it,
				435	stopping at the first token that is not part of the expression. Next up
				436	we need to handle function definitions, etc.
				437
				438	Parsing the Rest
				439	================
				440
				441	The next thing missing is handling of function prototypes. In
				442	Kaleidoscope, these are used both for 'extern' function declarations as
				443	well as function body definitions. The code to do this is
				444	straight-forward and not very interesting (once you've survived
				445	expressions):
				446
				447	.. code-block:: ocaml
				448
				449	(* prototype
				450	* ::= id '(' id* ')' *)
				451	let parse_prototype =
				452	let rec parse_args accumulator = parser
				453	\| [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
				454	\| [< >] -> accumulator
				455	in
				456
				457	parser
				458	\| [< 'Token.Ident id;
				459	'Token.Kwd '(' ?? "expected '(' in prototype";
				460	args=parse_args [];
				461	'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
				462	(* success. *)
				463	Ast.Prototype (id, Array.of_list (List.rev args))
				464
				465	\| [< >] ->
				466	raise (Stream.Error "expected function name in prototype")
				467
				468	Given this, a function definition is very simple, just a prototype plus
				469	an expression to implement the body:
				470
				471	.. code-block:: ocaml
				472
				473	(* definition ::= 'def' prototype expression *)
				474	let parse_definition = parser
				475	\| [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
				476	Ast.Function (p, e)
				477
				478	In addition, we support 'extern' to declare functions like 'sin' and
				479	'cos' as well as to support forward declaration of user functions. These
				480	'extern's are just prototypes with no body:
				481
				482	.. code-block:: ocaml
				483
				484	(* external ::= 'extern' prototype *)
				485	let parse_extern = parser
				486	\| [< 'Token.Extern; e=parse_prototype >] -> e
				487
				488	Finally, we'll also let the user type in arbitrary top-level expressions
				489	and evaluate them on the fly. We will handle this by defining anonymous
				490	nullary (zero argument) functions for them:
				491
				492	.. code-block:: ocaml
				493
				494	(* toplevelexpr ::= expression *)
				495	let parse_toplevel = parser
				496	\| [< e=parse_expr >] ->
				497	(* Make an anonymous proto. *)
				498	Ast.Function (Ast.Prototype ("", [\|\|]), e)
				499
				500	Now that we have all the pieces, let's build a little driver that will
				501	let us actually execute this code we've built!
				502
				503	The Driver
				504	==========
				505
				506	The driver for this simply invokes all of the parsing pieces with a
				507	top-level dispatch loop. There isn't much interesting here, so I'll just
				508	include the top-level loop. See `below <#code>`_ for full code in the
				509	"Top-Level Parsing" section.
				510
				511	.. code-block:: ocaml
				512
				513	(* top ::= definition \| external \| expression \| ';' *)
				514	let rec main_loop stream =
				515	match Stream.peek stream with
				516	\| None -> ()
				517
				518	(* ignore top-level semicolons. *)
				519	\| Some (Token.Kwd ';') ->
				520	Stream.junk stream;
				521	main_loop stream
				522
				523	\| Some token ->
				524	begin
				525	try match token with
				526	\| Token.Def ->
				527	ignore(Parser.parse_definition stream);
				528	print_endline "parsed a function definition.";
				529	\| Token.Extern ->
				530	ignore(Parser.parse_extern stream);
				531	print_endline "parsed an extern.";
				532	\| _ ->
				533	(* Evaluate a top-level expression into an anonymous function. *)
				534	ignore(Parser.parse_toplevel stream);
				535	print_endline "parsed a top-level expr";
				536	with Stream.Error s ->
				537	(* Skip token for error recovery. *)
				538	Stream.junk stream;
				539	print_endline s;
				540	end;
				541	print_string "ready> "; flush stdout;
				542	main_loop stream
				543
				544	The most interesting part of this is that we ignore top-level
				545	semicolons. Why is this, you ask? The basic reason is that if you type
				546	"4 + 5" at the command line, the parser doesn't know whether that is the
				547	end of what you will type or not. For example, on the next line you
				548	could type "def foo..." in which case 4+5 is the end of a top-level
				549	expression. Alternatively you could type "\* 6", which would continue
				550	the expression. Having top-level semicolons allows you to type "4+5;",
				551	and the parser will know you are done.
				552
				553	Conclusions
				554	===========
				555
				556	With just under 300 lines of commented code (240 lines of non-comment,
				557	non-blank code), we fully defined our minimal language, including a
				558	lexer, parser, and AST builder. With this done, the executable will
				559	validate Kaleidoscope code and tell us if it is grammatically invalid.
				560	For example, here is a sample interaction:
				561
				562	.. code-block:: bash
				563
				564	$ ./toy.byte
				565	ready> def foo(x y) x+foo(y, 4.0);
				566	Parsed a function definition.
				567	ready> def foo(x y) x+y y;
				568	Parsed a function definition.
				569	Parsed a top-level expr
				570	ready> def foo(x y) x+y );
				571	Parsed a function definition.
				572	Error: unknown token when expecting an expression
				573	ready> extern sin(a);
				574	ready> Parsed an extern
				575	ready> ^D
				576	$
				577
				578	There is a lot of room for extension here. You can define new AST nodes,
				579	extend the language in many ways, etc. In the `next
				580	installment <OCamlLangImpl3.html>`_, we will describe how to generate
				581	LLVM Intermediate Representation (IR) from the AST.
				582
				583	Full Code Listing
				584	=================
				585
				586	Here is the complete code listing for this and the previous chapter.
				587	Note that it is fully self-contained: you don't need LLVM or any
				588	external libraries at all for this. (Besides the ocaml standard
				589	libraries, of course.) To build this, just compile with:
				590
				591	.. code-block:: bash
				592
				593	# Compile
				594	ocamlbuild toy.byte
				595	# Run
				596	./toy.byte
				597
				598	Here is the code:
				599
				600	\_tags:
				601	::
				602
				603	<{lexer,parser}.ml>: use_camlp4, pp(camlp4of)
				604
				605	token.ml:
				606	.. code-block:: ocaml
				607
				608	(*===----------------------------------------------------------------------===
				609	* Lexer Tokens
				610	===----------------------------------------------------------------------===)
				611
				612	(* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of
				613	* these others for known things. *)
				614	type token =
				615	(* commands *)
				616	\| Def \| Extern
				617
				618	(* primary *)
				619	\| Ident of string \| Number of float
				620
				621	(* unknown *)
				622	\| Kwd of char
				623
				624	lexer.ml:
				625	.. code-block:: ocaml
				626
				627	(*===----------------------------------------------------------------------===
				628	* Lexer
				629	===----------------------------------------------------------------------===)
				630
				631	let rec lex = parser
				632	(* Skip any whitespace. *)
				633	\| [< ' (' ' \| '\n' \| '\r' \| '\t'); stream >] -> lex stream
				634
				635	(* identifier: [a-zA-Z][a-zA-Z0-9] *)
				636	\| [< ' ('A' .. 'Z' \| 'a' .. 'z' as c); stream >] ->
				637	let buffer = Buffer.create 1 in
				638	Buffer.add_char buffer c;
				639	lex_ident buffer stream
				640
				641	(* number: [0-9.]+ *)
				642	\| [< ' ('0' .. '9' as c); stream >] ->
				643	let buffer = Buffer.create 1 in
				644	Buffer.add_char buffer c;
				645	lex_number buffer stream
				646
				647	(* Comment until end of line. *)
				648	\| [< ' ('#'); stream >] ->
				649	lex_comment stream
				650
				651	(* Otherwise, just return the character as its ascii value. *)
				652	\| [< 'c; stream >] ->
				653	[< 'Token.Kwd c; lex stream >]
				654
				655	(* end of stream. *)
				656	\| [< >] -> [< >]
				657
				658	and lex_number buffer = parser
				659	\| [< ' ('0' .. '9' \| '.' as c); stream >] ->
				660	Buffer.add_char buffer c;
				661	lex_number buffer stream
				662	\| [< stream=lex >] ->
				663	[< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >]
				664
				665	and lex_ident buffer = parser
				666	\| [< ' ('A' .. 'Z' \| 'a' .. 'z' \| '0' .. '9' as c); stream >] ->
				667	Buffer.add_char buffer c;
				668	lex_ident buffer stream
				669	\| [< stream=lex >] ->
				670	match Buffer.contents buffer with
				671	\| "def" -> [< 'Token.Def; stream >]
				672	\| "extern" -> [< 'Token.Extern; stream >]
				673	\| id -> [< 'Token.Ident id; stream >]
				674
				675	and lex_comment = parser
				676	\| [< ' ('\n'); stream=lex >] -> stream
				677	\| [< 'c; e=lex_comment >] -> e
				678	\| [< >] -> [< >]
				679
				680	ast.ml:
				681	.. code-block:: ocaml
				682
				683	(*===----------------------------------------------------------------------===
				684	* Abstract Syntax Tree (aka Parse Tree)
				685	===----------------------------------------------------------------------===)
				686
				687	(* expr - Base type for all expression nodes. *)
				688	type expr =
				689	(* variant for numeric literals like "1.0". *)
				690	\| Number of float
				691
				692	(* variant for referencing a variable, like "a". *)
				693	\| Variable of string
				694
				695	(* variant for a binary operator. *)
				696	\| Binary of char * expr * expr
				697
				698	(* variant for function calls. *)
				699	\| Call of string * expr array
				700
				701	(* proto - This type represents the "prototype" for a function, which captures
				702	* its name, and its argument names (thus implicitly the number of arguments the
				703	* function takes). *)
				704	type proto = Prototype of string * string array
				705
				706	(* func - This type represents a function definition itself. *)
				707	type func = Function of proto * expr
				708
				709	parser.ml:
				710	.. code-block:: ocaml
				711
				712	(*===---------------------------------------------------------------------===
				713	* Parser
				714	===---------------------------------------------------------------------===)
				715
				716	(* binop_precedence - This holds the precedence for each binary operator that is
				717	* defined *)
				718	let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
				719
				720	(* precedence - Get the precedence of the pending binary operator token. *)
				721	let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1
				722
				723	(* primary
				724	* ::= identifier
				725	* ::= numberexpr
				726	* ::= parenexpr *)
				727	let rec parse_primary = parser
				728	(* numberexpr ::= number *)
				729	\| [< 'Token.Number n >] -> Ast.Number n
				730
				731	(* parenexpr ::= '(' expression ')' *)
				732	\| [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
				733
				734	(* identifierexpr
				735	* ::= identifier
				736	* ::= identifier '(' argumentexpr ')' *)
				737	\| [< 'Token.Ident id; stream >] ->
				738	let rec parse_args accumulator = parser
				739	\| [< e=parse_expr; stream >] ->
				740	begin parser
				741	\| [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
				742	\| [< >] -> e :: accumulator
				743	end stream
				744	\| [< >] -> accumulator
				745	in
				746	let rec parse_ident id = parser
				747	(* Call. *)
				748	\| [< 'Token.Kwd '(';
				749	args=parse_args [];
				750	'Token.Kwd ')' ?? "expected ')'">] ->
				751	Ast.Call (id, Array.of_list (List.rev args))
				752
				753	(* Simple variable ref. *)
				754	\| [< >] -> Ast.Variable id
				755	in
				756	parse_ident id stream
				757
				758	\| [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
				759
				760	(* binoprhs
				761	* ::= ('+' primary)* *)
				762	and parse_bin_rhs expr_prec lhs stream =
				763	match Stream.peek stream with
				764	(* If this is a binop, find its precedence. *)
				765	\| Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
				766	let token_prec = precedence c in
				767
				768	(* If this is a binop that binds at least as tightly as the current binop,
				769	* consume it, otherwise we are done. *)
				770	if token_prec < expr_prec then lhs else begin
				771	(* Eat the binop. *)
				772	Stream.junk stream;
				773
				774	(* Parse the primary expression after the binary operator. *)
				775	let rhs = parse_primary stream in
				776
				777	(* Okay, we know this is a binop. *)
				778	let rhs =
				779	match Stream.peek stream with
				780	\| Some (Token.Kwd c2) ->
				781	(* If BinOp binds less tightly with rhs than the operator after
				782	* rhs, let the pending operator take rhs as its lhs. *)
				783	let next_prec = precedence c2 in
				784	if token_prec < next_prec
				785	then parse_bin_rhs (token_prec + 1) rhs stream
				786	else rhs
				787	\| _ -> rhs
				788	in
				789
				790	(* Merge lhs/rhs. *)
				791	let lhs = Ast.Binary (c, lhs, rhs) in
				792	parse_bin_rhs expr_prec lhs stream
				793	end
				794	\| _ -> lhs
				795
				796	(* expression
				797	* ::= primary binoprhs *)
				798	and parse_expr = parser
				799	\| [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
				800
				801	(* prototype
				802	* ::= id '(' id* ')' *)
				803	let parse_prototype =
				804	let rec parse_args accumulator = parser
				805	\| [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
				806	\| [< >] -> accumulator
				807	in
				808
				809	parser
				810	\| [< 'Token.Ident id;
				811	'Token.Kwd '(' ?? "expected '(' in prototype";
				812	args=parse_args [];
				813	'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
				814	(* success. *)
				815	Ast.Prototype (id, Array.of_list (List.rev args))
				816
				817	\| [< >] ->
				818	raise (Stream.Error "expected function name in prototype")
				819
				820	(* definition ::= 'def' prototype expression *)
				821	let parse_definition = parser
				822	\| [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
				823	Ast.Function (p, e)
				824
				825	(* toplevelexpr ::= expression *)
				826	let parse_toplevel = parser
				827	\| [< e=parse_expr >] ->
				828	(* Make an anonymous proto. *)
				829	Ast.Function (Ast.Prototype ("", [\|\|]), e)
				830
				831	(* external ::= 'extern' prototype *)
				832	let parse_extern = parser
				833	\| [< 'Token.Extern; e=parse_prototype >] -> e
				834
				835	toplevel.ml:
				836	.. code-block:: ocaml
				837
				838	(*===----------------------------------------------------------------------===
				839	* Top-Level parsing and JIT Driver
				840	===----------------------------------------------------------------------===)
				841
				842	(* top ::= definition \| external \| expression \| ';' *)
				843	let rec main_loop stream =
				844	match Stream.peek stream with
				845	\| None -> ()
				846
				847	(* ignore top-level semicolons. *)
				848	\| Some (Token.Kwd ';') ->
				849	Stream.junk stream;
				850	main_loop stream
				851
				852	\| Some token ->
				853	begin
				854	try match token with
				855	\| Token.Def ->
				856	ignore(Parser.parse_definition stream);
				857	print_endline "parsed a function definition.";
				858	\| Token.Extern ->
				859	ignore(Parser.parse_extern stream);
				860	print_endline "parsed an extern.";
				861	\| _ ->
				862	(* Evaluate a top-level expression into an anonymous function. *)
				863	ignore(Parser.parse_toplevel stream);
				864	print_endline "parsed a top-level expr";
				865	with Stream.Error s ->
				866	(* Skip token for error recovery. *)
				867	Stream.junk stream;
				868	print_endline s;
				869	end;
				870	print_string "ready> "; flush stdout;
				871	main_loop stream
				872
				873	toy.ml:
				874	.. code-block:: ocaml
				875
				876	(*===----------------------------------------------------------------------===
				877	* Main driver code.
				878	===----------------------------------------------------------------------===)
				879
				880	let main () =
				881	(* Install standard binary operators.
				882	* 1 is the lowest precedence. *)
				883	Hashtbl.add Parser.binop_precedence '<' 10;
				884	Hashtbl.add Parser.binop_precedence '+' 20;
				885	Hashtbl.add Parser.binop_precedence '-' 20;
				886	Hashtbl.add Parser.binop_precedence '' 40; ( highest. *)
				887
				888	(* Prime the first token. *)
				889	print_string "ready> "; flush stdout;
				890	let stream = Lexer.lex (Stream.of_channel stdin) in
				891
				892	(* Run the main "interpreter loop" now. *)
				893	Toplevel.main_loop stream;
				894	;;
				895
				896	main ()
				897
				898	`Next: Implementing Code Generation to LLVM IR <OCamlLangImpl3.html>`_
				899