Blame - docs/tutorial/OCamlLangImpl2.rst - fp2-dev/platform/external/llvm

blob: 83a22ab22d4f438d73fed8a6aa2346ec7a754fb1 [file] [log] [blame]

Sean Silva	ee47edf	2012-12-05 00:26:32 +0000	[diff] [blame]	1	===========================================
				2	Kaleidoscope: Implementing a Parser and AST
				3	===========================================
				4
				5	.. contents::
				6	:local:
				7
Sean Silva	ee47edf	2012-12-05 00:26:32 +0000	[diff] [blame]	8	Chapter 2 Introduction
				9	======================
				10
				11	Welcome to Chapter 2 of the "`Implementing a language with LLVM in
				12	Objective Caml <index.html>`_" tutorial. This chapter shows you how to
				13	use the lexer, built in `Chapter 1 <OCamlLangImpl1.html>`_, to build a
				14	full `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our
				15	Kaleidoscope language. Once we have a parser, we'll define and build an
				16	`Abstract Syntax
				17	Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
				18
				19	The parser we will build uses a combination of `Recursive Descent
				20	Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
				21	`Operator-Precedence
				22	Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
				23	parse the Kaleidoscope language (the latter for binary expressions and
				24	the former for everything else). Before we get to parsing though, lets
				25	talk about the output of the parser: the Abstract Syntax Tree.
				26
				27	The Abstract Syntax Tree (AST)
				28	==============================
				29
				30	The AST for a program captures its behavior in such a way that it is
				31	easy for later stages of the compiler (e.g. code generation) to
				32	interpret. We basically want one object for each construct in the
				33	language, and the AST should closely model the language. In
				34	Kaleidoscope, we have expressions, a prototype, and a function object.
				35	We'll start with expressions first:
				36
				37	.. code-block:: ocaml
				38
				39	(* expr - Base type for all expression nodes. *)
				40	type expr =
				41	(* variant for numeric literals like "1.0". *)
				42	\| Number of float
				43
				44	The code above shows the definition of the base ExprAST class and one
				45	subclass which we use for numeric literals. The important thing to note
				46	about this code is that the Number variant captures the numeric value of
				47	the literal as an instance variable. This allows later phases of the
				48	compiler to know what the stored numeric value is.
				49
				50	Right now we only create the AST, so there are no useful functions on
				51	them. It would be very easy to add a function to pretty print the code,
				52	for example. Here are the other expression AST node definitions that
				53	we'll use in the basic form of the Kaleidoscope language:
				54
				55	.. code-block:: ocaml
				56
				57	(* variant for referencing a variable, like "a". *)
				58	\| Variable of string
				59
				60	(* variant for a binary operator. *)
				61	\| Binary of char * expr * expr
				62
				63	(* variant for function calls. *)
				64	\| Call of string * expr array
				65
				66	This is all (intentionally) rather straight-forward: variables capture
				67	the variable name, binary operators capture their opcode (e.g. '+'), and
				68	calls capture a function name as well as a list of any argument
				69	expressions. One thing that is nice about our AST is that it captures
				70	the language features without talking about the syntax of the language.
				71	Note that there is no discussion about precedence of binary operators,
				72	lexical structure, etc.
				73
				74	For our basic language, these are all of the expression nodes we'll
				75	define. Because it doesn't have conditional control flow, it isn't
				76	Turing-complete; we'll fix that in a later installment. The two things
				77	we need next are a way to talk about the interface to a function, and a
				78	way to talk about functions themselves:
				79
				80	.. code-block:: ocaml
				81
				82	(* proto - This type represents the "prototype" for a function, which captures
				83	* its name, and its argument names (thus implicitly the number of arguments the
				84	* function takes). *)
				85	type proto = Prototype of string * string array
				86
				87	(* func - This type represents a function definition itself. *)
				88	type func = Function of proto * expr
				89
				90	In Kaleidoscope, functions are typed with just a count of their
				91	arguments. Since all values are double precision floating point, the
				92	type of each argument doesn't need to be stored anywhere. In a more
				93	aggressive and realistic language, the "expr" variants would probably
				94	have a type field.
				95
				96	With this scaffolding, we can now talk about parsing expressions and
				97	function bodies in Kaleidoscope.
				98
				99	Parser Basics
				100	=============
				101
				102	Now that we have an AST to build, we need to define the parser code to
				103	build it. The idea here is that we want to parse something like "x+y"
				104	(which is returned as three tokens by the lexer) into an AST that could
				105	be generated with calls like this:
				106
				107	.. code-block:: ocaml
				108
				109	let x = Variable "x" in
				110	let y = Variable "y" in
				111	let result = Binary ('+', x, y) in
				112	...
				113
				114	The error handling routines make use of the builtin ``Stream.Failure``
				115	and ``Stream.Error``s. ``Stream.Failure`` is raised when the parser is
				116	unable to find any matching token in the first position of a pattern.
				117	``Stream.Error`` is raised when the first token matches, but the rest do
				118	not. The error recovery in our parser will not be the best and is not
				119	particular user-friendly, but it will be enough for our tutorial. These
				120	exceptions make it easier to handle errors in routines that have various
				121	return types.
				122
				123	With these basic types and exceptions, we can implement the first piece
				124	of our grammar: numeric literals.
				125
				126	Basic Expression Parsing
				127	========================
				128
				129	We start with numeric literals, because they are the simplest to
				130	process. For each production in our grammar, we'll define a function
				131	which parses that production. We call this class of expressions
				132	"primary" expressions, for reasons that will become more clear `later in
				133	the tutorial <OCamlLangImpl6.html#unary>`_. In order to parse an
				134	arbitrary primary expression, we need to determine what sort of
				135	expression it is. For numeric literals, we have:
				136
				137	.. code-block:: ocaml
				138
				139	(* primary
				140	* ::= identifier
				141	* ::= numberexpr
				142	* ::= parenexpr *)
				143	parse_primary = parser
				144	(* numberexpr ::= number *)
				145	\| [< 'Token.Number n >] -> Ast.Number n
				146
				147	This routine is very simple: it expects to be called when the current
				148	token is a ``Token.Number`` token. It takes the current number value,
				149	creates a ``Ast.Number`` node, advances the lexer to the next token, and
				150	finally returns.
				151
				152	There are some interesting aspects to this. The most important one is
				153	that this routine eats all of the tokens that correspond to the
				154	production and returns the lexer buffer with the next token (which is
				155	not part of the grammar production) ready to go. This is a fairly
				156	standard way to go for recursive descent parsers. For a better example,
				157	the parenthesis operator is defined like this:
				158
				159	.. code-block:: ocaml
				160
				161	(* parenexpr ::= '(' expression ')' *)
				162	\| [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
				163
				164	This function illustrates a number of interesting things about the
				165	parser:
				166
				167	1) It shows how we use the ``Stream.Error`` exception. When called, this
				168	function expects that the current token is a '(' token, but after
				169	parsing the subexpression, it is possible that there is no ')' waiting.
				170	For example, if the user types in "(4 x" instead of "(4)", the parser
				171	should emit an error. Because errors can occur, the parser needs a way
				172	to indicate that they happened. In our parser, we use the camlp4
				173	shortcut syntax ``token ?? "parse error"``, where if the token before
				174	the ``??`` does not match, then ``Stream.Error "parse error"`` will be
				175	raised.
				176
				177	2) Another interesting aspect of this function is that it uses recursion
				178	by calling ``Parser.parse_primary`` (we will soon see that
				179	``Parser.parse_primary`` can call ``Parser.parse_primary``). This is
				180	powerful because it allows us to handle recursive grammars, and keeps
				181	each production very simple. Note that parentheses do not cause
				182	construction of AST nodes themselves. While we could do it this way, the
				183	most important role of parentheses are to guide the parser and provide
				184	grouping. Once the parser constructs the AST, parentheses are not
				185	needed.
				186
				187	The next simple production is for handling variable references and
				188	function calls:
				189
				190	.. code-block:: ocaml
				191
				192	(* identifierexpr
				193	* ::= identifier
				194	* ::= identifier '(' argumentexpr ')' *)
				195	\| [< 'Token.Ident id; stream >] ->
				196	let rec parse_args accumulator = parser
				197	\| [< e=parse_expr; stream >] ->
				198	begin parser
				199	\| [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
				200	\| [< >] -> e :: accumulator
				201	end stream
				202	\| [< >] -> accumulator
				203	in
				204	let rec parse_ident id = parser
				205	(* Call. *)
				206	\| [< 'Token.Kwd '(';
				207	args=parse_args [];
				208	'Token.Kwd ')' ?? "expected ')'">] ->
				209	Ast.Call (id, Array.of_list (List.rev args))
				210
				211	(* Simple variable ref. *)
				212	\| [< >] -> Ast.Variable id
				213	in
				214	parse_ident id stream
				215
				216	This routine follows the same style as the other routines. (It expects
				217	to be called if the current token is a ``Token.Ident`` token). It also
				218	has recursion and error handling. One interesting aspect of this is that
				219	it uses look-ahead to determine if the current identifier is a stand
				220	alone variable reference or if it is a function call expression. It
				221	handles this by checking to see if the token after the identifier is a
				222	'(' token, constructing either a ``Ast.Variable`` or ``Ast.Call`` node
				223	as appropriate.
				224
				225	We finish up by raising an exception if we received a token we didn't
				226	expect:
				227
				228	.. code-block:: ocaml
				229
				230	\| [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
				231
				232	Now that basic expressions are handled, we need to handle binary
				233	expressions. They are a bit more complex.
				234
				235	Binary Expression Parsing
				236	=========================
				237
				238	Binary expressions are significantly harder to parse because they are
				239	often ambiguous. For example, when given the string "x+y\*z", the parser
				240	can choose to parse it as either "(x+y)\z" or "x+(y\z)". With common
				241	definitions from mathematics, we expect the later parse, because "\*"
				242	(multiplication) has higher precedence than "+" (addition).
				243
				244	There are many ways to handle this, but an elegant and efficient way is
				245	to use `Operator-Precedence
				246	Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
				247	This parsing technique uses the precedence of binary operators to guide
				248	recursion. To start with, we need a table of precedences:
				249
				250	.. code-block:: ocaml
				251
				252	(* binop_precedence - This holds the precedence for each binary operator that is
				253	* defined *)
				254	let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
				255
				256	(* precedence - Get the precedence of the pending binary operator token. *)
				257	let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1
				258
				259	...
				260
				261	let main () =
				262	(* Install standard binary operators.
				263	* 1 is the lowest precedence. *)
				264	Hashtbl.add Parser.binop_precedence '<' 10;
				265	Hashtbl.add Parser.binop_precedence '+' 20;
				266	Hashtbl.add Parser.binop_precedence '-' 20;
				267	Hashtbl.add Parser.binop_precedence '' 40; ( highest. *)
				268	...
				269
				270	For the basic form of Kaleidoscope, we will only support 4 binary
				271	operators (this can obviously be extended by you, our brave and intrepid
				272	reader). The ``Parser.precedence`` function returns the precedence for
				273	the current token, or -1 if the token is not a binary operator. Having a
				274	``Hashtbl.t`` makes it easy to add new operators and makes it clear that
				275	the algorithm doesn't depend on the specific operators involved, but it
				276	would be easy enough to eliminate the ``Hashtbl.t`` and do the
				277	comparisons in the ``Parser.precedence`` function. (Or just use a
				278	fixed-size array).
				279
				280	With the helper above defined, we can now start parsing binary
				281	expressions. The basic idea of operator precedence parsing is to break
				282	down an expression with potentially ambiguous binary operators into
				283	pieces. Consider ,for example, the expression "a+b+(c+d)\e\f+g".
				284	Operator precedence parsing considers this as a stream of primary
				285	expressions separated by binary operators. As such, it will first parse
				286	the leading primary expression "a", then it will see the pairs [+, b]
				287	[+, (c+d)] [\, e] [\, f] and [+, g]. Note that because parentheses are
				288	primary expressions, the binary expression parser doesn't need to worry
				289	about nested subexpressions like (c+d) at all.
				290
				291	To start, an expression is a primary expression potentially followed by
				292	a sequence of [binop,primaryexpr] pairs:
				293
				294	.. code-block:: ocaml
				295
				296	(* expression
				297	* ::= primary binoprhs *)
				298	and parse_expr = parser
				299	\| [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
				300
				301	``Parser.parse_bin_rhs`` is the function that parses the sequence of
				302	pairs for us. It takes a precedence and a pointer to an expression for
				303	the part that has been parsed so far. Note that "x" is a perfectly valid
				304	expression: As such, "binoprhs" is allowed to be empty, in which case it
				305	returns the expression that is passed into it. In our example above, the
				306	code passes the expression for "a" into ``Parser.parse_bin_rhs`` and the
				307	current token is "+".
				308
				309	The precedence value passed into ``Parser.parse_bin_rhs`` indicates the
				310	minimal operator precedence that the function is allowed to eat. For
				311	example, if the current pair stream is [+, x] and
				312	``Parser.parse_bin_rhs`` is passed in a precedence of 40, it will not
				313	consume any tokens (because the precedence of '+' is only 20). With this
				314	in mind, ``Parser.parse_bin_rhs`` starts with:
				315
				316	.. code-block:: ocaml
				317
				318	(* binoprhs
				319	* ::= ('+' primary)* *)
				320	and parse_bin_rhs expr_prec lhs stream =
				321	match Stream.peek stream with
				322	(* If this is a binop, find its precedence. *)
				323	\| Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
				324	let token_prec = precedence c in
				325
				326	(* If this is a binop that binds at least as tightly as the current binop,
				327	* consume it, otherwise we are done. *)
				328	if token_prec < expr_prec then lhs else begin
				329
				330	This code gets the precedence of the current token and checks to see if
				331	if is too low. Because we defined invalid tokens to have a precedence of
				332	-1, this check implicitly knows that the pair-stream ends when the token
				333	stream runs out of binary operators. If this check succeeds, we know
				334	that the token is a binary operator and that it will be included in this
				335	expression:
				336
				337	.. code-block:: ocaml
				338
				339	(* Eat the binop. *)
				340	Stream.junk stream;
				341
				342	(* Okay, we know this is a binop. *)
				343	let rhs =
				344	match Stream.peek stream with
				345	\| Some (Token.Kwd c2) ->
				346
				347	As such, this code eats (and remembers) the binary operator and then
				348	parses the primary expression that follows. This builds up the whole
				349	pair, the first of which is [+, b] for the running example.
				350
				351	Now that we parsed the left-hand side of an expression and one pair of
				352	the RHS sequence, we have to decide which way the expression associates.
				353	In particular, we could have "(a+b) binop unparsed" or "a + (b binop
				354	unparsed)". To determine this, we look ahead at "binop" to determine its
				355	precedence and compare it to BinOp's precedence (which is '+' in this
				356	case):
				357
				358	.. code-block:: ocaml
				359
				360	(* If BinOp binds less tightly with rhs than the operator after
				361	* rhs, let the pending operator take rhs as its lhs. *)
				362	let next_prec = precedence c2 in
				363	if token_prec < next_prec
				364
				365	If the precedence of the binop to the right of "RHS" is lower or equal
				366	to the precedence of our current operator, then we know that the
				367	parentheses associate as "(a+b) binop ...". In our example, the current
				368	operator is "+" and the next operator is "+", we know that they have the
				369	same precedence. In this case we'll create the AST node for "a+b", and
				370	then continue parsing:
				371
				372	.. code-block:: ocaml
				373
				374	... if body omitted ...
				375	in
				376
				377	(* Merge lhs/rhs. *)
				378	let lhs = Ast.Binary (c, lhs, rhs) in
				379	parse_bin_rhs expr_prec lhs stream
				380	end
				381
				382	In our example above, this will turn "a+b+" into "(a+b)" and execute the
				383	next iteration of the loop, with "+" as the current token. The code
				384	above will eat, remember, and parse "(c+d)" as the primary expression,
				385	which makes the current pair equal to [+, (c+d)]. It will then evaluate
				386	the 'if' conditional above with "\*" as the binop to the right of the
				387	primary. In this case, the precedence of "\*" is higher than the
				388	precedence of "+" so the if condition will be entered.
				389
				390	The critical question left here is "how can the if condition parse the
				391	right hand side in full"? In particular, to build the AST correctly for
				392	our example, it needs to get all of "(c+d)\e\f" as the RHS expression
				393	variable. The code to do this is surprisingly simple (code from the
				394	above two blocks duplicated for context):
				395
				396	.. code-block:: ocaml
				397
				398	match Stream.peek stream with
				399	\| Some (Token.Kwd c2) ->
				400	(* If BinOp binds less tightly with rhs than the operator after
				401	* rhs, let the pending operator take rhs as its lhs. *)
				402	if token_prec < precedence c2
				403	then parse_bin_rhs (token_prec + 1) rhs stream
				404	else rhs
				405	\| _ -> rhs
				406	in
				407
				408	(* Merge lhs/rhs. *)
				409	let lhs = Ast.Binary (c, lhs, rhs) in
				410	parse_bin_rhs expr_prec lhs stream
				411	end
				412
				413	At this point, we know that the binary operator to the RHS of our
				414	primary has higher precedence than the binop we are currently parsing.
				415	As such, we know that any sequence of pairs whose operators are all
				416	higher precedence than "+" should be parsed together and returned as
				417	"RHS". To do this, we recursively invoke the ``Parser.parse_bin_rhs``
				418	function specifying "token\_prec+1" as the minimum precedence required
				419	for it to continue. In our example above, this will cause it to return
				420	the AST node for "(c+d)\e\f" as RHS, which is then set as the RHS of
				421	the '+' expression.
				422
				423	Finally, on the next iteration of the while loop, the "+g" piece is
				424	parsed and added to the AST. With this little bit of code (14
				425	non-trivial lines), we correctly handle fully general binary expression
				426	parsing in a very elegant way. This was a whirlwind tour of this code,
				427	and it is somewhat subtle. I recommend running through it with a few
				428	tough examples to see how it works.
				429
				430	This wraps up handling of expressions. At this point, we can point the
				431	parser at an arbitrary token stream and build an expression from it,
				432	stopping at the first token that is not part of the expression. Next up
				433	we need to handle function definitions, etc.
				434
				435	Parsing the Rest
				436	================
				437
				438	The next thing missing is handling of function prototypes. In
				439	Kaleidoscope, these are used both for 'extern' function declarations as
				440	well as function body definitions. The code to do this is
				441	straight-forward and not very interesting (once you've survived
				442	expressions):
				443
				444	.. code-block:: ocaml
				445
				446	(* prototype
				447	* ::= id '(' id* ')' *)
				448	let parse_prototype =
				449	let rec parse_args accumulator = parser
				450	\| [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
				451	\| [< >] -> accumulator
				452	in
				453
				454	parser
				455	\| [< 'Token.Ident id;
				456	'Token.Kwd '(' ?? "expected '(' in prototype";
				457	args=parse_args [];
				458	'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
				459	(* success. *)
				460	Ast.Prototype (id, Array.of_list (List.rev args))
				461
				462	\| [< >] ->
				463	raise (Stream.Error "expected function name in prototype")
				464
				465	Given this, a function definition is very simple, just a prototype plus
				466	an expression to implement the body:
				467
				468	.. code-block:: ocaml
				469
				470	(* definition ::= 'def' prototype expression *)
				471	let parse_definition = parser
				472	\| [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
				473	Ast.Function (p, e)
				474
				475	In addition, we support 'extern' to declare functions like 'sin' and
				476	'cos' as well as to support forward declaration of user functions. These
				477	'extern's are just prototypes with no body:
				478
				479	.. code-block:: ocaml
				480
				481	(* external ::= 'extern' prototype *)
				482	let parse_extern = parser
				483	\| [< 'Token.Extern; e=parse_prototype >] -> e
				484
				485	Finally, we'll also let the user type in arbitrary top-level expressions
				486	and evaluate them on the fly. We will handle this by defining anonymous
				487	nullary (zero argument) functions for them:
				488
				489	.. code-block:: ocaml
				490
				491	(* toplevelexpr ::= expression *)
				492	let parse_toplevel = parser
				493	\| [< e=parse_expr >] ->
				494	(* Make an anonymous proto. *)
				495	Ast.Function (Ast.Prototype ("", [\|\|]), e)
				496
				497	Now that we have all the pieces, let's build a little driver that will
				498	let us actually execute this code we've built!
				499
				500	The Driver
				501	==========
				502
				503	The driver for this simply invokes all of the parsing pieces with a
				504	top-level dispatch loop. There isn't much interesting here, so I'll just
				505	include the top-level loop. See `below <#code>`_ for full code in the
				506	"Top-Level Parsing" section.
				507
				508	.. code-block:: ocaml
				509
				510	(* top ::= definition \| external \| expression \| ';' *)
				511	let rec main_loop stream =
				512	match Stream.peek stream with
				513	\| None -> ()
				514
				515	(* ignore top-level semicolons. *)
				516	\| Some (Token.Kwd ';') ->
				517	Stream.junk stream;
				518	main_loop stream
				519
				520	\| Some token ->
				521	begin
				522	try match token with
				523	\| Token.Def ->
				524	ignore(Parser.parse_definition stream);
				525	print_endline "parsed a function definition.";
				526	\| Token.Extern ->
				527	ignore(Parser.parse_extern stream);
				528	print_endline "parsed an extern.";
				529	\| _ ->
				530	(* Evaluate a top-level expression into an anonymous function. *)
				531	ignore(Parser.parse_toplevel stream);
				532	print_endline "parsed a top-level expr";
				533	with Stream.Error s ->
				534	(* Skip token for error recovery. *)
				535	Stream.junk stream;
				536	print_endline s;
				537	end;
				538	print_string "ready> "; flush stdout;
				539	main_loop stream
				540
				541	The most interesting part of this is that we ignore top-level
				542	semicolons. Why is this, you ask? The basic reason is that if you type
				543	"4 + 5" at the command line, the parser doesn't know whether that is the
				544	end of what you will type or not. For example, on the next line you
				545	could type "def foo..." in which case 4+5 is the end of a top-level
				546	expression. Alternatively you could type "\* 6", which would continue
				547	the expression. Having top-level semicolons allows you to type "4+5;",
				548	and the parser will know you are done.
				549
				550	Conclusions
				551	===========
				552
				553	With just under 300 lines of commented code (240 lines of non-comment,
				554	non-blank code), we fully defined our minimal language, including a
				555	lexer, parser, and AST builder. With this done, the executable will
				556	validate Kaleidoscope code and tell us if it is grammatically invalid.
				557	For example, here is a sample interaction:
				558
				559	.. code-block:: bash
				560
				561	$ ./toy.byte
				562	ready> def foo(x y) x+foo(y, 4.0);
				563	Parsed a function definition.
				564	ready> def foo(x y) x+y y;
				565	Parsed a function definition.
				566	Parsed a top-level expr
				567	ready> def foo(x y) x+y );
				568	Parsed a function definition.
				569	Error: unknown token when expecting an expression
				570	ready> extern sin(a);
				571	ready> Parsed an extern
				572	ready> ^D
				573	$
				574
				575	There is a lot of room for extension here. You can define new AST nodes,
				576	extend the language in many ways, etc. In the `next
				577	installment <OCamlLangImpl3.html>`_, we will describe how to generate
				578	LLVM Intermediate Representation (IR) from the AST.
				579
				580	Full Code Listing
				581	=================
				582
				583	Here is the complete code listing for this and the previous chapter.
				584	Note that it is fully self-contained: you don't need LLVM or any
				585	external libraries at all for this. (Besides the ocaml standard
				586	libraries, of course.) To build this, just compile with:
				587
				588	.. code-block:: bash
				589
				590	# Compile
				591	ocamlbuild toy.byte
				592	# Run
				593	./toy.byte
				594
				595	Here is the code:
				596
				597	\_tags:
				598	::
				599
				600	<{lexer,parser}.ml>: use_camlp4, pp(camlp4of)
				601
				602	token.ml:
				603	.. code-block:: ocaml
				604
				605	(*===----------------------------------------------------------------------===
				606	* Lexer Tokens
				607	===----------------------------------------------------------------------===)
				608
				609	(* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of
				610	* these others for known things. *)
				611	type token =
				612	(* commands *)
				613	\| Def \| Extern
				614
				615	(* primary *)
				616	\| Ident of string \| Number of float
				617
				618	(* unknown *)
				619	\| Kwd of char
				620
				621	lexer.ml:
				622	.. code-block:: ocaml
				623
				624	(*===----------------------------------------------------------------------===
				625	* Lexer
				626	===----------------------------------------------------------------------===)
				627
				628	let rec lex = parser
				629	(* Skip any whitespace. *)
				630	\| [< ' (' ' \| '\n' \| '\r' \| '\t'); stream >] -> lex stream
				631
				632	(* identifier: [a-zA-Z][a-zA-Z0-9] *)
				633	\| [< ' ('A' .. 'Z' \| 'a' .. 'z' as c); stream >] ->
				634	let buffer = Buffer.create 1 in
				635	Buffer.add_char buffer c;
				636	lex_ident buffer stream
				637
				638	(* number: [0-9.]+ *)
				639	\| [< ' ('0' .. '9' as c); stream >] ->
				640	let buffer = Buffer.create 1 in
				641	Buffer.add_char buffer c;
				642	lex_number buffer stream
				643
				644	(* Comment until end of line. *)
				645	\| [< ' ('#'); stream >] ->
				646	lex_comment stream
				647
				648	(* Otherwise, just return the character as its ascii value. *)
				649	\| [< 'c; stream >] ->
				650	[< 'Token.Kwd c; lex stream >]
				651
				652	(* end of stream. *)
				653	\| [< >] -> [< >]
				654
				655	and lex_number buffer = parser
				656	\| [< ' ('0' .. '9' \| '.' as c); stream >] ->
				657	Buffer.add_char buffer c;
				658	lex_number buffer stream
				659	\| [< stream=lex >] ->
				660	[< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >]
				661
				662	and lex_ident buffer = parser
				663	\| [< ' ('A' .. 'Z' \| 'a' .. 'z' \| '0' .. '9' as c); stream >] ->
				664	Buffer.add_char buffer c;
				665	lex_ident buffer stream
				666	\| [< stream=lex >] ->
				667	match Buffer.contents buffer with
				668	\| "def" -> [< 'Token.Def; stream >]
				669	\| "extern" -> [< 'Token.Extern; stream >]
				670	\| id -> [< 'Token.Ident id; stream >]
				671
				672	and lex_comment = parser
				673	\| [< ' ('\n'); stream=lex >] -> stream
				674	\| [< 'c; e=lex_comment >] -> e
				675	\| [< >] -> [< >]
				676
				677	ast.ml:
				678	.. code-block:: ocaml
				679
				680	(*===----------------------------------------------------------------------===
				681	* Abstract Syntax Tree (aka Parse Tree)
				682	===----------------------------------------------------------------------===)
				683
				684	(* expr - Base type for all expression nodes. *)
				685	type expr =
				686	(* variant for numeric literals like "1.0". *)
				687	\| Number of float
				688
				689	(* variant for referencing a variable, like "a". *)
				690	\| Variable of string
				691
				692	(* variant for a binary operator. *)
				693	\| Binary of char * expr * expr
				694
				695	(* variant for function calls. *)
				696	\| Call of string * expr array
				697
				698	(* proto - This type represents the "prototype" for a function, which captures
				699	* its name, and its argument names (thus implicitly the number of arguments the
				700	* function takes). *)
				701	type proto = Prototype of string * string array
				702
				703	(* func - This type represents a function definition itself. *)
				704	type func = Function of proto * expr
				705
				706	parser.ml:
				707	.. code-block:: ocaml
				708
				709	(*===---------------------------------------------------------------------===
				710	* Parser
				711	===---------------------------------------------------------------------===)
				712
				713	(* binop_precedence - This holds the precedence for each binary operator that is
				714	* defined *)
				715	let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
				716
				717	(* precedence - Get the precedence of the pending binary operator token. *)
				718	let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1
				719
				720	(* primary
				721	* ::= identifier
				722	* ::= numberexpr
				723	* ::= parenexpr *)
				724	let rec parse_primary = parser
				725	(* numberexpr ::= number *)
				726	\| [< 'Token.Number n >] -> Ast.Number n
				727
				728	(* parenexpr ::= '(' expression ')' *)
				729	\| [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
				730
				731	(* identifierexpr
				732	* ::= identifier
				733	* ::= identifier '(' argumentexpr ')' *)
				734	\| [< 'Token.Ident id; stream >] ->
				735	let rec parse_args accumulator = parser
				736	\| [< e=parse_expr; stream >] ->
				737	begin parser
				738	\| [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
				739	\| [< >] -> e :: accumulator
				740	end stream
				741	\| [< >] -> accumulator
				742	in
				743	let rec parse_ident id = parser
				744	(* Call. *)
				745	\| [< 'Token.Kwd '(';
				746	args=parse_args [];
				747	'Token.Kwd ')' ?? "expected ')'">] ->
				748	Ast.Call (id, Array.of_list (List.rev args))
				749
				750	(* Simple variable ref. *)
				751	\| [< >] -> Ast.Variable id
				752	in
				753	parse_ident id stream
				754
				755	\| [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
				756
				757	(* binoprhs
				758	* ::= ('+' primary)* *)
				759	and parse_bin_rhs expr_prec lhs stream =
				760	match Stream.peek stream with
				761	(* If this is a binop, find its precedence. *)
				762	\| Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
				763	let token_prec = precedence c in
				764
				765	(* If this is a binop that binds at least as tightly as the current binop,
				766	* consume it, otherwise we are done. *)
				767	if token_prec < expr_prec then lhs else begin
				768	(* Eat the binop. *)
				769	Stream.junk stream;
				770
				771	(* Parse the primary expression after the binary operator. *)
				772	let rhs = parse_primary stream in
				773
				774	(* Okay, we know this is a binop. *)
				775	let rhs =
				776	match Stream.peek stream with
				777	\| Some (Token.Kwd c2) ->
				778	(* If BinOp binds less tightly with rhs than the operator after
				779	* rhs, let the pending operator take rhs as its lhs. *)
				780	let next_prec = precedence c2 in
				781	if token_prec < next_prec
				782	then parse_bin_rhs (token_prec + 1) rhs stream
				783	else rhs
				784	\| _ -> rhs
				785	in
				786
				787	(* Merge lhs/rhs. *)
				788	let lhs = Ast.Binary (c, lhs, rhs) in
				789	parse_bin_rhs expr_prec lhs stream
				790	end
				791	\| _ -> lhs
				792
				793	(* expression
				794	* ::= primary binoprhs *)
				795	and parse_expr = parser
				796	\| [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
				797
				798	(* prototype
				799	* ::= id '(' id* ')' *)
				800	let parse_prototype =
				801	let rec parse_args accumulator = parser
				802	\| [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
				803	\| [< >] -> accumulator
				804	in
				805
				806	parser
				807	\| [< 'Token.Ident id;
				808	'Token.Kwd '(' ?? "expected '(' in prototype";
				809	args=parse_args [];
				810	'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
				811	(* success. *)
				812	Ast.Prototype (id, Array.of_list (List.rev args))
				813
				814	\| [< >] ->
				815	raise (Stream.Error "expected function name in prototype")
				816
				817	(* definition ::= 'def' prototype expression *)
				818	let parse_definition = parser
				819	\| [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
				820	Ast.Function (p, e)
				821
				822	(* toplevelexpr ::= expression *)
				823	let parse_toplevel = parser
				824	\| [< e=parse_expr >] ->
				825	(* Make an anonymous proto. *)
				826	Ast.Function (Ast.Prototype ("", [\|\|]), e)
				827
				828	(* external ::= 'extern' prototype *)
				829	let parse_extern = parser
				830	\| [< 'Token.Extern; e=parse_prototype >] -> e
				831
				832	toplevel.ml:
				833	.. code-block:: ocaml
				834
				835	(*===----------------------------------------------------------------------===
				836	* Top-Level parsing and JIT Driver
				837	===----------------------------------------------------------------------===)
				838
				839	(* top ::= definition \| external \| expression \| ';' *)
				840	let rec main_loop stream =
				841	match Stream.peek stream with
				842	\| None -> ()
				843
				844	(* ignore top-level semicolons. *)
				845	\| Some (Token.Kwd ';') ->
				846	Stream.junk stream;
				847	main_loop stream
				848
				849	\| Some token ->
				850	begin
				851	try match token with
				852	\| Token.Def ->
				853	ignore(Parser.parse_definition stream);
				854	print_endline "parsed a function definition.";
				855	\| Token.Extern ->
				856	ignore(Parser.parse_extern stream);
				857	print_endline "parsed an extern.";
				858	\| _ ->
				859	(* Evaluate a top-level expression into an anonymous function. *)
				860	ignore(Parser.parse_toplevel stream);
				861	print_endline "parsed a top-level expr";
				862	with Stream.Error s ->
				863	(* Skip token for error recovery. *)
				864	Stream.junk stream;
				865	print_endline s;
				866	end;
				867	print_string "ready> "; flush stdout;
				868	main_loop stream
				869
				870	toy.ml:
				871	.. code-block:: ocaml
				872
				873	(*===----------------------------------------------------------------------===
				874	* Main driver code.
				875	===----------------------------------------------------------------------===)
				876
				877	let main () =
				878	(* Install standard binary operators.
				879	* 1 is the lowest precedence. *)
				880	Hashtbl.add Parser.binop_precedence '<' 10;
				881	Hashtbl.add Parser.binop_precedence '+' 20;
				882	Hashtbl.add Parser.binop_precedence '-' 20;
				883	Hashtbl.add Parser.binop_precedence '' 40; ( highest. *)
				884
				885	(* Prime the first token. *)
				886	print_string "ready> "; flush stdout;
				887	let stream = Lexer.lex (Stream.of_channel stdin) in
				888
				889	(* Run the main "interpreter loop" now. *)
				890	Toplevel.main_loop stream;
				891	;;
				892
				893	main ()
				894
				895	`Next: Implementing Code Generation to LLVM IR <OCamlLangImpl3.html>`_
				896