.. Git blame annotation: Sean Silva, ee47edf, 2012-12-05 00:26:32 +0000

===========================================
Kaleidoscope: Implementing a Parser and AST
===========================================

.. contents::
   :local:

Chapter 2 Introduction
======================

Welcome to Chapter 2 of the "`Implementing a language with LLVM in
Objective Caml <index.html>`_" tutorial. This chapter shows you how to
use the lexer, built in `Chapter 1 <OCamlLangImpl1.html>`_, to build a
full `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our
Kaleidoscope language. Once we have a parser, we'll define and build an
`Abstract Syntax
Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).

The parser we will build uses a combination of `Recursive Descent
Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
`Operator-Precedence
Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
parse the Kaleidoscope language (the latter for binary expressions and
the former for everything else). Before we get to parsing, though, let's
talk about the output of the parser: the Abstract Syntax Tree.

The Abstract Syntax Tree (AST)
==============================

The AST for a program captures its behavior in such a way that it is
easy for later stages of the compiler (e.g. code generation) to
interpret. We basically want one object for each construct in the
language, and the AST should closely model the language. In
Kaleidoscope, we have expressions, a prototype, and a function object.
We'll start with expressions first:

.. code-block:: ocaml

    (* expr - Base type for all expression nodes. *)
    type expr =
      (* variant for numeric literals like "1.0". *)
      | Number of float

The code above shows the definition of the base ``expr`` type and one
variant, which we use for numeric literals. The important thing to note
about this code is that the ``Number`` variant carries the numeric value
of the literal as its argument. This allows later phases of the compiler
to know what the stored numeric value is.

Right now we only create the AST, so there are no useful functions that
operate on it. It would be very easy to add a function to pretty print
the code, for example. Here are the other expression AST node
definitions that we'll use in the basic form of the Kaleidoscope
language:

.. code-block:: ocaml

      (* variant for referencing a variable, like "a". *)
      | Variable of string

      (* variant for a binary operator. *)
      | Binary of char * expr * expr

      (* variant for function calls. *)
      | Call of string * expr array

This is all (intentionally) rather straightforward: variables capture
the variable name, binary operators capture their opcode (e.g. '+'), and
calls capture a function name as well as an array of argument
expressions. One thing that is nice about our AST is that it captures
the language features without talking about the syntax of the language.
Note that there is no discussion about precedence of binary operators,
lexical structure, etc.
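
As a small illustration of how later code can consume this AST, here is
a minimal sketch of the kind of pretty-printing function mentioned
earlier. The name ``string_of_expr`` and the fully parenthesized output
format are assumptions made for this example; they are not part of the
tutorial's code. The ``expr`` type is repeated so that the snippet
compiles on its own:

.. code-block:: ocaml

    (* Same shape as the tutorial's expr type, repeated to be self-contained. *)
    type expr =
      | Number of float
      | Variable of string
      | Binary of char * expr * expr
      | Call of string * expr array

    (* Hypothetical helper: render an expression as a string, parenthesizing
     * every binary operation so that no precedence rules are needed. *)
    let rec string_of_expr = function
      | Number n -> Printf.sprintf "%g" n
      | Variable name -> name
      | Binary (op, lhs, rhs) ->
          Printf.sprintf "(%s %c %s)" (string_of_expr lhs) op (string_of_expr rhs)
      | Call (callee, args) ->
          let args = Array.to_list (Array.map string_of_expr args) in
          Printf.sprintf "%s(%s)" callee (String.concat ", " args)

Because every ``Binary`` node is parenthesized on output, the printer
never has to reason about operator precedence, which reflects the point
that the AST itself carries no syntactic detail.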

For our basic language, these are all of the expression nodes we'll
define. Because it doesn't have conditional control flow, it isn't
Turing-complete; we'll fix that in a later installment. The two things
we need next are a way to talk about the interface to a function, and a
way to talk about functions themselves:

.. code-block:: ocaml

    (* proto - This type represents the "prototype" for a function, which captures
     * its name, and its argument names (thus implicitly the number of arguments the
     * function takes). *)
    type proto = Prototype of string * string array

    (* func - This type represents a function definition itself. *)
    type func = Function of proto * expr

In Kaleidoscope, functions are typed with just a count of their
arguments. Since all values are double-precision floating point, the
type of each argument doesn't need to be stored anywhere. In a more
aggressive and realistic language, the "expr" variants would probably
have a type field.
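
To make these types concrete, the following sketch builds by hand the
AST that a parser would be expected to produce for the definition
``def foo(x y) x+foo(y, 4.0)``. The types are repeated so the snippet
stands alone, and the ``arity`` helper is a hypothetical addition used
only to show that the argument count is implied by the prototype:

.. code-block:: ocaml

    type expr =
      | Number of float
      | Variable of string
      | Binary of char * expr * expr
      | Call of string * expr array

    type proto = Prototype of string * string array

    type func = Function of proto * expr

    (* AST for:  def foo(x y) x+foo(y, 4.0)  *)
    let foo =
      Function
        ( Prototype ("foo", [| "x"; "y" |]),
          Binary ('+', Variable "x",
                  Call ("foo", [| Variable "y"; Number 4.0 |])) )

    (* Hypothetical helper: the number of arguments is just the length of
     * the array of argument names stored in the prototype. *)
    let arity (Function (Prototype (_, args), _)) = Array.length args

Here ``arity foo`` evaluates to ``2``, with no separate type or count
field stored anywhere in the AST.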

With this scaffolding, we can now talk about parsing expressions and
function bodies in Kaleidoscope.

Parser Basics
=============

Now that we have an AST to build, we need to define the parser code to
build it. The idea here is that we want to parse something like "x+y"
(which is returned as three tokens by the lexer) into an AST that could
be generated with calls like this:

.. code-block:: ocaml

    let x = Variable "x" in
    let y = Variable "y" in
    let result = Binary ('+', x, y) in
    ...

The error handling routines make use of the builtin ``Stream.Failure``
and ``Stream.Error`` exceptions. ``Stream.Failure`` is raised when the
parser is unable to find any matching token in the first position of a
pattern. ``Stream.Error`` is raised when the first token matches, but
the rest do not. The error recovery in our parser will not be the best
and is not particularly user-friendly, but it will be enough for our
tutorial. These exceptions make it easier to handle errors in routines
that have various return types.

With these basic types and exceptions, we can implement the first piece
of our grammar: numeric literals.

Basic Expression Parsing
========================

We start with numeric literals, because they are the simplest to
process. For each production in our grammar, we'll define a function
which parses that production. We call this class of expressions
"primary" expressions, for reasons that will become clearer `later in
the tutorial <OCamlLangImpl6.html#unary>`_. In order to parse an
arbitrary primary expression, we need to determine what sort of
expression it is. For numeric literals, we have:

.. code-block:: ocaml

    (* primary
     *   ::= identifier
     *   ::= numberexpr
     *   ::= parenexpr *)
    let rec parse_primary = parser
      (* numberexpr ::= number *)
      | [< 'Token.Number n >] -> Ast.Number n

This routine is very simple: it expects to be called when the current
token is a ``Token.Number`` token. It takes the current number value,
creates an ``Ast.Number`` node, advances the lexer to the next token,
and finally returns.

There are some interesting aspects to this. The most important one is
that this routine eats all of the tokens that correspond to the
production and returns the lexer buffer with the next token (which is
not part of the grammar production) ready to go. This is a fairly
standard way to go for recursive descent parsers. For a better example,
the parenthesis operator is defined like this:

.. code-block:: ocaml

      (* parenexpr ::= '(' expression ')' *)
      | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e

This function illustrates a number of interesting things about the
parser:

1) It shows how we use the ``Stream.Error`` exception. When called, this
   function expects that the current token is a '(' token, but after
   parsing the subexpression, it is possible that there is no ')'
   waiting. For example, if the user types in "(4 x" instead of "(4)",
   the parser should emit an error. Because errors can occur, the parser
   needs a way to indicate that they happened. In our parser, we use the
   camlp4 shortcut syntax ``token ?? "parse error"``, where if the token
   before the ``??`` does not match, then ``Stream.Error "parse error"``
   will be raised.

2) Another interesting aspect of this function is that it uses recursion
   by calling ``Parser.parse_expr`` (we will soon see that
   ``Parser.parse_expr`` can call ``Parser.parse_primary``). This is
   powerful because it allows us to handle recursive grammars, and keeps
   each production very simple. Note that parentheses do not cause
   construction of AST nodes themselves. While we could do it this way,
   the most important role of parentheses is to guide the parser and
   provide grouping. Once the parser constructs the AST, parentheses are
   not needed.

The next simple production is for handling variable references and
function calls:

.. code-block:: ocaml

      (* identifierexpr
       *   ::= identifier
       *   ::= identifier '(' argumentexpr ')' *)
      | [< 'Token.Ident id; stream >] ->
          let rec parse_args accumulator = parser
            | [< e=parse_expr; stream >] ->
                begin parser
                  | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
                  | [< >] -> e :: accumulator
                end stream
            | [< >] -> accumulator
          in
          let rec parse_ident id = parser
            (* Call. *)
            | [< 'Token.Kwd '(';
                 args=parse_args [];
                 'Token.Kwd ')' ?? "expected ')'">] ->
                Ast.Call (id, Array.of_list (List.rev args))

            (* Simple variable ref. *)
            | [< >] -> Ast.Variable id
          in
          parse_ident id stream

This routine follows the same style as the other routines. (It expects
to be called if the current token is a ``Token.Ident`` token.) It also
has recursion and error handling. One interesting aspect of this is that
it uses *look-ahead* to determine if the current identifier is a
standalone variable reference or if it is a function call expression. It
handles this by checking to see if the token after the identifier is a
'(' token, constructing either an ``Ast.Variable`` or ``Ast.Call`` node
as appropriate.

We finish up by raising an exception if we received a token we didn't
expect:

.. code-block:: ocaml

      | [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
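
For readers without camlp4 at hand, the look-ahead idea can also be
sketched in plain OCaml over a token *list*. This is not the tutorial's
implementation: it assumes tokens arrive as a list rather than a
``Stream.t``, uses a locally defined ``Parse_error`` exception in place
of ``Stream.Error``, and restricts call arguments to primary expressions
to keep the example short. Each function returns the parsed node together
with the tokens it did not consume:

.. code-block:: ocaml

    module Token = struct
      type token = Ident of string | Number of float | Kwd of char
    end

    module Ast = struct
      type expr =
        | Number of float
        | Variable of string
        | Call of string * expr array
    end

    (* Assumption: stands in for Stream.Error in this list-based sketch. *)
    exception Parse_error of string

    let rec parse_primary = function
      | Token.Number n :: rest -> (Ast.Number n, rest)
      (* Look-ahead: an identifier followed by '(' starts a call. *)
      | Token.Ident id :: Token.Kwd '(' :: Token.Kwd ')' :: rest ->
          (Ast.Call (id, [||]), rest)
      | Token.Ident id :: Token.Kwd '(' :: rest ->
          (match parse_args [] rest with
           | args, Token.Kwd ')' :: rest ->
               (Ast.Call (id, Array.of_list (List.rev args)), rest)
           | _ -> raise (Parse_error "expected ')'"))
      (* Otherwise the identifier is a simple variable reference. *)
      | Token.Ident id :: rest -> (Ast.Variable id, rest)
      | _ -> raise (Parse_error "unknown token when expecting an expression.")

    (* Collect comma-separated arguments in reverse order. *)
    and parse_args accumulator tokens =
      let e, rest = parse_primary tokens in
      match rest with
      | Token.Kwd ',' :: rest -> parse_args (e :: accumulator) rest
      | _ -> (e :: accumulator, rest)

The list pattern ``Token.Ident id :: Token.Kwd '(' :: rest`` plays the
role of peeking one token past the identifier: if it does not match, the
later ``Token.Ident id :: rest`` case produces a plain variable
reference and leaves the following token untouched.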

Now that basic expressions are handled, we need to handle binary
expressions. They are a bit more complex.

Binary Expression Parsing
=========================

Binary expressions are significantly harder to parse because they are
often ambiguous. For example, when given the string "x+y\*z", the parser
can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
definitions from mathematics, we expect the latter parse, because "\*"
(multiplication) has higher *precedence* than "+" (addition).

There are many ways to handle this, but an elegant and efficient way is
to use `Operator-Precedence
Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
This parsing technique uses the precedence of binary operators to guide
recursion. To start with, we need a table of precedences:

.. code-block:: ocaml

    (* binop_precedence - This holds the precedence for each binary operator that is
     * defined *)
    let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10

    (* precedence - Get the precedence of the pending binary operator token. *)
    let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1

    ...

    let main () =
      (* Install standard binary operators.
       * 1 is the lowest precedence. *)
      Hashtbl.add Parser.binop_precedence '<' 10;
      Hashtbl.add Parser.binop_precedence '+' 20;
      Hashtbl.add Parser.binop_precedence '-' 20;
      Hashtbl.add Parser.binop_precedence '*' 40;    (* highest. *)
      ...

For the basic form of Kaleidoscope, we will only support four binary
operators (this can obviously be extended by you, our brave and intrepid
reader). The ``Parser.precedence`` function returns the precedence for
the current token, or -1 if the token is not a binary operator. Having a
``Hashtbl.t`` makes it easy to add new operators and makes it clear that
the algorithm doesn't depend on the specific operators involved, but it
would be easy enough to eliminate the ``Hashtbl.t`` and do the
comparisons in the ``Parser.precedence`` function (or just use a
fixed-size array).
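
As a sketch of that alternative, the table can be folded into the
function itself with a pattern match. This variant is an illustration
only, not what the tutorial's code uses; the precedence values are the
ones installed in ``main`` above:

.. code-block:: ocaml

    (* Precedence of a binary operator character, or -1 if the character
     * is not a binary operator. 1 would be the lowest precedence. *)
    let precedence = function
      | '<' -> 10
      | '+' | '-' -> 20
      | '*' -> 40
      | _ -> -1

The trade-off is that adding an operator now means editing this
function, whereas the ``Hashtbl.t`` version can be extended at run time
with ``Hashtbl.add``.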

With the helper above defined, we can now start parsing binary
expressions. The basic idea of operator precedence parsing is to break
down an expression with potentially ambiguous binary operators into
pieces. Consider, for example, the expression "a+b+(c+d)\*e\*f+g".
Operator precedence parsing considers this as a stream of primary
expressions separated by binary operators. As such, it will first parse
the leading primary expression "a", then it will see the pairs [+, b]
[+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
primary expressions, the binary expression parser doesn't need to worry
about nested subexpressions like (c+d) at all.

To start, an expression is a primary expression potentially followed by
a sequence of [binop,primaryexpr] pairs:

.. code-block:: ocaml

    (* expression
     *   ::= primary binoprhs *)
    and parse_expr = parser
      | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream

``Parser.parse_bin_rhs`` is the function that parses the sequence of
pairs for us. It takes a precedence and the expression for the part that
has been parsed so far. Note that "x" is a perfectly valid expression:
as such, "binoprhs" is allowed to be empty, in which case it returns the
expression that is passed into it. In our example above, the code passes
the expression for "a" into ``Parser.parse_bin_rhs`` and the current
token is "+".

The precedence value passed into ``Parser.parse_bin_rhs`` indicates the
*minimal operator precedence* that the function is allowed to eat. For
example, if the current pair stream is [+, x] and
``Parser.parse_bin_rhs`` is passed in a precedence of 40, it will not
consume any tokens (because the precedence of '+' is only 20). With this
in mind, ``Parser.parse_bin_rhs`` starts with:

.. code-block:: ocaml

    (* binoprhs
     *   ::= ('+' primary)* *)
    and parse_bin_rhs expr_prec lhs stream =
      match Stream.peek stream with
      (* If this is a binop, find its precedence. *)
      | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
          let token_prec = precedence c in

          (* If this is a binop that binds at least as tightly as the current binop,
           * consume it, otherwise we are done. *)
          if token_prec < expr_prec then lhs else begin

This code gets the precedence of the current token and checks to see if
it is too low. Because we defined invalid tokens to have a precedence of
-1, this check implicitly knows that the pair-stream ends when the token
stream runs out of binary operators. If the precedence is not too low,
we know that the token is a binary operator and that it will be included
in this expression:

.. code-block:: ocaml

            (* Eat the binop. *)
            Stream.junk stream;

            (* Parse the primary expression after the binary operator. *)
            let rhs = parse_primary stream in

            (* Okay, we know this is a binop. *)
            let rhs =
              match Stream.peek stream with
              | Some (Token.Kwd c2) ->

As such, this code eats (and remembers) the binary operator and then
parses the primary expression that follows. This builds up the whole
pair, the first of which is [+, b] for the running example.

Now that we parsed the left-hand side of an expression and one pair of
the RHS sequence, we have to decide which way the expression associates.
In particular, we could have "(a+b) binop unparsed" or "a + (b binop
unparsed)". To determine this, we look ahead at "binop" to determine its
precedence and compare it to the precedence of the current operator
``c`` (which is '+' in this case):

.. code-block:: ocaml

                  (* If BinOp binds less tightly with rhs than the operator after
                   * rhs, let the pending operator take rhs as its lhs. *)
                  let next_prec = precedence c2 in
                  if token_prec < next_prec

If the precedence of the binop to the right of "RHS" is lower than or
equal to the precedence of our current operator, then we know that the
parentheses associate as "(a+b) binop ...". In our example, the current
operator is "+" and the next operator is "+", so they have the same
precedence. In this case we'll create the AST node for "a+b", and then
continue parsing:

.. code-block:: ocaml

              ... if body omitted ...
            in

            (* Merge lhs/rhs. *)
            let lhs = Ast.Binary (c, lhs, rhs) in
            parse_bin_rhs expr_prec lhs stream
          end

In our example above, this will turn "a+b+" into "(a+b)" and make the
next recursive call to ``Parser.parse_bin_rhs``, with "+" as the current
token. The code above will eat and remember the "+", and parse "(c+d)"
as the primary expression, which makes the current pair equal to
[+, (c+d)]. It will then evaluate the 'if' conditional above with "\*"
as the binop to the right of the primary. In this case, the precedence
of "\*" is higher than the precedence of "+", so the 'then' branch will
be taken.

The critical question left here is "how can the 'then' branch parse the
right-hand side in full?" In particular, to build the AST correctly for
our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
variable. The code to do this is surprisingly simple (code from the
above two blocks duplicated for context):

.. code-block:: ocaml

              match Stream.peek stream with
              | Some (Token.Kwd c2) ->
                  (* If BinOp binds less tightly with rhs than the operator after
                   * rhs, let the pending operator take rhs as its lhs. *)
                  if token_prec < precedence c2
                  then parse_bin_rhs (token_prec + 1) rhs stream
                  else rhs
              | _ -> rhs
            in

            (* Merge lhs/rhs. *)
            let lhs = Ast.Binary (c, lhs, rhs) in
            parse_bin_rhs expr_prec lhs stream
          end

At this point, we know that the binary operator to the RHS of our
primary has higher precedence than the binop we are currently parsing.
As such, we know that any sequence of pairs whose operators are all
higher precedence than "+" should be parsed together and returned as
"RHS". To do this, we recursively invoke the ``Parser.parse_bin_rhs``
function specifying "token\_prec+1" as the minimum precedence required
for it to continue. In our example above, this will cause it to return
the AST node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of
the '+' expression.

Finally, on the next recursive call, the "+g" piece is parsed and added
to the AST. With this little bit of code (14 non-trivial lines), we
correctly handle fully general binary expression parsing in a very
elegant way. This was a whirlwind tour of this code, and it is somewhat
subtle. I recommend running through it with a few tough examples to see
how it works.
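
One way to experiment with such examples without camlp4 is the following
self-contained sketch of the same algorithm over a token *list*. It is
an approximation of the tutorial's parser, not a copy of it: tokens are
assumed to arrive as a list instead of a ``Stream.t``, the precedence
table is a pattern match, ``Parse_error`` is a local stand-in for
``Stream.Error``, and function calls are omitted. Each function returns
the parsed expression together with the unconsumed tokens:

.. code-block:: ocaml

    module Token = struct
      type token = Ident of string | Number of float | Kwd of char
    end

    module Ast = struct
      type expr =
        | Number of float
        | Variable of string
        | Binary of char * expr * expr
    end

    (* Assumption: stands in for Stream.Error in this list-based sketch. *)
    exception Parse_error of string

    let precedence = function
      | '<' -> 10
      | '+' | '-' -> 20
      | '*' -> 40
      | _ -> -1

    let rec parse_primary = function
      | Token.Number n :: rest -> (Ast.Number n, rest)
      | Token.Ident id :: rest -> (Ast.Variable id, rest)
      | Token.Kwd '(' :: rest ->
          (match parse_expr rest with
           | e, Token.Kwd ')' :: rest -> (e, rest)
           | _ -> raise (Parse_error "expected ')'"))
      | _ -> raise (Parse_error "unknown token when expecting an expression.")

    (* Consume [binop, primary] pairs whose operator precedence is at
     * least expr_prec, folding them into lhs. *)
    and parse_bin_rhs expr_prec lhs tokens =
      match tokens with
      | Token.Kwd c :: rest when precedence c >= 0 && precedence c >= expr_prec ->
          let token_prec = precedence c in
          let rhs, rest = parse_primary rest in
          (* If the next operator binds more tightly than c, let it take
           * rhs as its left-hand side first. *)
          let rhs, rest =
            match rest with
            | Token.Kwd c2 :: _ when token_prec < precedence c2 ->
                parse_bin_rhs (token_prec + 1) rhs rest
            | _ -> (rhs, rest)
          in
          parse_bin_rhs expr_prec (Ast.Binary (c, lhs, rhs)) rest
      | _ -> (lhs, tokens)

    and parse_expr tokens =
      let lhs, rest = parse_primary tokens in
      parse_bin_rhs 0 lhs rest

    (* Tokens for the running example:  a+b+(c+d)*e*f+g  *)
    let tokens =
      Token.[ Ident "a"; Kwd '+'; Ident "b"; Kwd '+'; Kwd '('; Ident "c";
              Kwd '+'; Ident "d"; Kwd ')'; Kwd '*'; Ident "e"; Kwd '*';
              Ident "f"; Kwd '+'; Ident "g" ]

    let ast, remaining = parse_expr tokens

For the running example, ``ast`` groups as
``((a+b) + (((c+d)*e)*f)) + g`` and ``remaining`` is empty. Calling
``parse_bin_rhs 40`` on the pair stream [+, x] returns its ``lhs``
argument unchanged and consumes nothing, matching the description of the
minimal precedence above.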

This wraps up handling of expressions. At this point, we can point the
parser at an arbitrary token stream and build an expression from it,
stopping at the first token that is not part of the expression. Next up
we need to handle function definitions, etc.

Parsing the Rest
================

The next thing missing is handling of function prototypes. In
Kaleidoscope, these are used both for 'extern' function declarations and
for function body definitions. The code to do this is straightforward
and not very interesting (once you've survived expressions):

.. code-block:: ocaml

    (* prototype
     *   ::= id '(' id* ')' *)
    let parse_prototype =
      let rec parse_args accumulator = parser
        | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
        | [< >] -> accumulator
      in

      parser
      | [< 'Token.Ident id;
           'Token.Kwd '(' ?? "expected '(' in prototype";
           args=parse_args [];
           'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
          (* success. *)
          Ast.Prototype (id, Array.of_list (List.rev args))

      | [< >] ->
          raise (Stream.Error "expected function name in prototype")
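
The grammar ``id '(' id* ')'`` means that argument names in a prototype
are separated by whitespace only, not by commas. The sketch below
restates the prototype parser over a plain token list to make this easy
to try out. As before, the list-based interface and the ``Parse_error``
exception are assumptions of this example rather than part of the
tutorial's code:

.. code-block:: ocaml

    type token = Ident of string | Number of float | Kwd of char

    type proto = Prototype of string * string array

    (* Assumption: stands in for Stream.Error in this list-based sketch. *)
    exception Parse_error of string

    (* Returns the prototype together with the unconsumed tokens. *)
    let parse_prototype tokens =
      (* Collect consecutive identifiers as argument names. *)
      let rec parse_args accumulator = function
        | Ident id :: rest -> parse_args (id :: accumulator) rest
        | rest -> (List.rev accumulator, rest)
      in
      match tokens with
      | Ident name :: Kwd '(' :: rest ->
          (match parse_args [] rest with
           | args, Kwd ')' :: rest ->
               (Prototype (name, Array.of_list args), rest)
           | _ -> raise (Parse_error "expected ')' in prototype"))
      | Ident _ :: _ -> raise (Parse_error "expected '(' in prototype")
      | _ -> raise (Parse_error "expected function name in prototype")

With this sketch, the tokens for ``foo(x y)`` yield
``Prototype ("foo", [| "x"; "y" |])``, while the tokens for ``foo(x, y)``
raise ``Parse_error "expected ')' in prototype"`` because the comma ends
the run of identifiers.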

Given this, a function definition is very simple, just a prototype plus
an expression to implement the body:

.. code-block:: ocaml

    (* definition ::= 'def' prototype expression *)
    let parse_definition = parser
      | [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
          Ast.Function (p, e)

In addition, we support 'extern' to declare functions like 'sin' and
'cos' as well as to support forward declaration of user functions. These
'extern's are just prototypes with no body:

.. code-block:: ocaml

    (* external ::= 'extern' prototype *)
    let parse_extern = parser
      | [< 'Token.Extern; e=parse_prototype >] -> e

Finally, we'll also let the user type in arbitrary top-level expressions
and evaluate them on the fly. We will handle this by defining anonymous
nullary (zero argument) functions for them:

.. code-block:: ocaml

    (* toplevelexpr ::= expression *)
    let parse_toplevel = parser
      | [< e=parse_expr >] ->
          (* Make an anonymous proto. *)
          Ast.Function (Ast.Prototype ("", [||]), e)

Now that we have all the pieces, let's build a little driver that will
let us actually *execute* this code we've built!

The Driver
==========

The driver for this simply invokes all of the parsing pieces with a
top-level dispatch loop. There isn't much interesting here, so I'll just
include the top-level loop. See `below <#code>`_ for full code in the
"Top-Level Parsing" section.

.. code-block:: ocaml

    (* top ::= definition | external | expression | ';' *)
    let rec main_loop stream =
      match Stream.peek stream with
      | None -> ()

      (* ignore top-level semicolons. *)
      | Some (Token.Kwd ';') ->
          Stream.junk stream;
          main_loop stream

      | Some token ->
          begin
            try match token with
            | Token.Def ->
                ignore(Parser.parse_definition stream);
                print_endline "parsed a function definition.";
            | Token.Extern ->
                ignore(Parser.parse_extern stream);
                print_endline "parsed an extern.";
            | _ ->
                (* Evaluate a top-level expression into an anonymous function. *)
                ignore(Parser.parse_toplevel stream);
                print_endline "parsed a top-level expr";
            with Stream.Error s ->
              (* Skip token for error recovery. *)
              Stream.junk stream;
              print_endline s;
          end;
          print_string "ready> "; flush stdout;
          main_loop stream

The most interesting part of this is that we ignore top-level
semicolons. Why is this, you ask? The basic reason is that if you type
"4 + 5" at the command line, the parser doesn't know whether that is the
end of what you will type or not. For example, on the next line you
could type "def foo..." in which case 4+5 is the end of a top-level
expression. Alternatively you could type "\* 6", which would continue
the expression. Having top-level semicolons allows you to type "4+5;",
and the parser will know you are done.

Conclusions
===========

With just under 300 lines of commented code (240 lines of non-comment,
non-blank code), we fully defined our minimal language, including a
lexer, parser, and AST builder. With this done, the executable will
validate Kaleidoscope code and tell us if it is grammatically invalid.
For example, here is a sample interaction:

.. code-block:: bash

    $ ./toy.byte
    ready> def foo(x y) x+foo(y, 4.0);
    Parsed a function definition.
    ready> def foo(x y) x+y y;
    Parsed a function definition.
    Parsed a top-level expr
    ready> def foo(x y) x+y );
    Parsed a function definition.
    Error: unknown token when expecting an expression
    ready> extern sin(a);
    ready> Parsed an extern
    ready> ^D
    $

There is a lot of room for extension here. You can define new AST nodes,
extend the language in many ways, etc. In the `next
installment <OCamlLangImpl3.html>`_, we will describe how to generate
LLVM Intermediate Representation (IR) from the AST.

Full Code Listing
=================

Here is the complete code listing for this and the previous chapter.
Note that it is fully self-contained: you don't need LLVM or any
external libraries at all for this. (Besides the OCaml standard
libraries, of course.) To build this, just compile with:

.. code-block:: bash

    # Compile
    ocamlbuild toy.byte
    # Run
    ./toy.byte

Here is the code:

\_tags:
    ::

        <{lexer,parser}.ml>: use_camlp4, pp(camlp4of)

token.ml:
    .. code-block:: ocaml

        (*===----------------------------------------------------------------------===
         * Lexer Tokens
         *===----------------------------------------------------------------------===*)

        (* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of
         * these others for known things. *)
        type token =
          (* commands *)
          | Def | Extern

          (* primary *)
          | Ident of string | Number of float

          (* unknown *)
          | Kwd of char

lexer.ml:
    .. code-block:: ocaml

        (*===----------------------------------------------------------------------===
         * Lexer
         *===----------------------------------------------------------------------===*)

        let rec lex = parser
          (* Skip any whitespace. *)
          | [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream

          (* identifier: [a-zA-Z][a-zA-Z0-9]* *)
          | [< ' ('A' .. 'Z' | 'a' .. 'z' as c); stream >] ->
              let buffer = Buffer.create 1 in
              Buffer.add_char buffer c;
              lex_ident buffer stream

          (* number: [0-9.]+ *)
          | [< ' ('0' .. '9' as c); stream >] ->
              let buffer = Buffer.create 1 in
              Buffer.add_char buffer c;
              lex_number buffer stream

          (* Comment until end of line. *)
          | [< ' ('#'); stream >] ->
              lex_comment stream

          (* Otherwise, just return the character as its ascii value. *)
          | [< 'c; stream >] ->
              [< 'Token.Kwd c; lex stream >]

          (* end of stream. *)
          | [< >] -> [< >]

        and lex_number buffer = parser
          | [< ' ('0' .. '9' | '.' as c); stream >] ->
              Buffer.add_char buffer c;
              lex_number buffer stream
          | [< stream=lex >] ->
              [< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >]

        and lex_ident buffer = parser
          | [< ' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' as c); stream >] ->
              Buffer.add_char buffer c;
              lex_ident buffer stream
          | [< stream=lex >] ->
              match Buffer.contents buffer with
              | "def" -> [< 'Token.Def; stream >]
              | "extern" -> [< 'Token.Extern; stream >]
              | id -> [< 'Token.Ident id; stream >]

        and lex_comment = parser
          | [< ' ('\n'); stream=lex >] -> stream
          | [< 'c; e=lex_comment >] -> e
          | [< >] -> [< >]

ast.ml:
    .. code-block:: ocaml

        (*===----------------------------------------------------------------------===
         * Abstract Syntax Tree (aka Parse Tree)
         *===----------------------------------------------------------------------===*)

        (* expr - Base type for all expression nodes. *)
        type expr =
          (* variant for numeric literals like "1.0". *)
          | Number of float

          (* variant for referencing a variable, like "a". *)
          | Variable of string

          (* variant for a binary operator. *)
          | Binary of char * expr * expr

          (* variant for function calls. *)
          | Call of string * expr array

        (* proto - This type represents the "prototype" for a function, which captures
         * its name, and its argument names (thus implicitly the number of arguments the
         * function takes). *)
        type proto = Prototype of string * string array

        (* func - This type represents a function definition itself. *)
        type func = Function of proto * expr

parser.ml:
    .. code-block:: ocaml

        (*===---------------------------------------------------------------------===
         * Parser
         *===---------------------------------------------------------------------===*)

        (* binop_precedence - This holds the precedence for each binary operator that is
         * defined *)
        let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10

        (* precedence - Get the precedence of the pending binary operator token. *)
        let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1

        (* primary
         *   ::= identifier
         *   ::= numberexpr
         *   ::= parenexpr *)
        let rec parse_primary = parser
          (* numberexpr ::= number *)
          | [< 'Token.Number n >] -> Ast.Number n

          (* parenexpr ::= '(' expression ')' *)
          | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e

          (* identifierexpr
           *   ::= identifier
           *   ::= identifier '(' argumentexpr ')' *)
          | [< 'Token.Ident id; stream >] ->
              let rec parse_args accumulator = parser
                | [< e=parse_expr; stream >] ->
                    begin parser
                      | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
                      | [< >] -> e :: accumulator
                    end stream
                | [< >] -> accumulator
              in
              let rec parse_ident id = parser
                (* Call. *)
                | [< 'Token.Kwd '(';
                     args=parse_args [];
                     'Token.Kwd ')' ?? "expected ')'">] ->
                    Ast.Call (id, Array.of_list (List.rev args))

                (* Simple variable ref. *)
                | [< >] -> Ast.Variable id
              in
              parse_ident id stream

          | [< >] -> raise (Stream.Error "unknown token when expecting an expression.")

        (* binoprhs
         *   ::= ('+' primary)* *)
        and parse_bin_rhs expr_prec lhs stream =
          match Stream.peek stream with
          (* If this is a binop, find its precedence. *)
          | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
              let token_prec = precedence c in

              (* If this is a binop that binds at least as tightly as the current binop,
               * consume it, otherwise we are done. *)
              if token_prec < expr_prec then lhs else begin
                (* Eat the binop. *)
                Stream.junk stream;

                (* Parse the primary expression after the binary operator. *)
                let rhs = parse_primary stream in

                (* Okay, we know this is a binop. *)
                let rhs =
                  match Stream.peek stream with
                  | Some (Token.Kwd c2) ->
                      (* If BinOp binds less tightly with rhs than the operator after
                       * rhs, let the pending operator take rhs as its lhs. *)
                      let next_prec = precedence c2 in
                      if token_prec < next_prec
                      then parse_bin_rhs (token_prec + 1) rhs stream
                      else rhs
                  | _ -> rhs
                in

                (* Merge lhs/rhs. *)
                let lhs = Ast.Binary (c, lhs, rhs) in
                parse_bin_rhs expr_prec lhs stream
              end
          | _ -> lhs

        (* expression
         *   ::= primary binoprhs *)
        and parse_expr = parser
          | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream

        (* prototype
         *   ::= id '(' id* ')' *)
        let parse_prototype =
          let rec parse_args accumulator = parser
            | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
            | [< >] -> accumulator
          in

          parser
          | [< 'Token.Ident id;
               'Token.Kwd '(' ?? "expected '(' in prototype";
               args=parse_args [];
               'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
              (* success. *)
              Ast.Prototype (id, Array.of_list (List.rev args))

          | [< >] ->
              raise (Stream.Error "expected function name in prototype")

        (* definition ::= 'def' prototype expression *)
        let parse_definition = parser
          | [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
              Ast.Function (p, e)

        (* toplevelexpr ::= expression *)
        let parse_toplevel = parser
          | [< e=parse_expr >] ->
              (* Make an anonymous proto. *)
              Ast.Function (Ast.Prototype ("", [||]), e)

        (* external ::= 'extern' prototype *)
        let parse_extern = parser
          | [< 'Token.Extern; e=parse_prototype >] -> e

toplevel.ml:
    .. code-block:: ocaml

        (*===----------------------------------------------------------------------===
         * Top-Level parsing and JIT Driver
         *===----------------------------------------------------------------------===*)

        (* top ::= definition | external | expression | ';' *)
        let rec main_loop stream =
          match Stream.peek stream with
          | None -> ()

          (* ignore top-level semicolons. *)
          | Some (Token.Kwd ';') ->
              Stream.junk stream;
              main_loop stream

          | Some token ->
              begin
                try match token with
                | Token.Def ->
                    ignore(Parser.parse_definition stream);
                    print_endline "parsed a function definition.";
                | Token.Extern ->
                    ignore(Parser.parse_extern stream);
                    print_endline "parsed an extern.";
                | _ ->
                    (* Evaluate a top-level expression into an anonymous function. *)
                    ignore(Parser.parse_toplevel stream);
                    print_endline "parsed a top-level expr";
                with Stream.Error s ->
                  (* Skip token for error recovery. *)
                  Stream.junk stream;
                  print_endline s;
              end;
              print_string "ready> "; flush stdout;
              main_loop stream

toy.ml:
    .. code-block:: ocaml

        (*===----------------------------------------------------------------------===
         * Main driver code.
         *===----------------------------------------------------------------------===*)

        let main () =
          (* Install standard binary operators.
           * 1 is the lowest precedence. *)
          Hashtbl.add Parser.binop_precedence '<' 10;
          Hashtbl.add Parser.binop_precedence '+' 20;
          Hashtbl.add Parser.binop_precedence '-' 20;
          Hashtbl.add Parser.binop_precedence '*' 40;    (* highest. *)

          (* Prime the first token. *)
          print_string "ready> "; flush stdout;
          let stream = Lexer.lex (Stream.of_channel stdin) in

          (* Run the main "interpreter loop" now. *)
          Toplevel.main_loop stream;
        ;;

        main ()

`Next: Implementing Code Generation to LLVM IR <OCamlLangImpl3.html>`_
896