blob: 07490e1f67df87805adda2af0146c45d3c6f681d [file] [log] [blame]
Sean Silvaee47edf2012-12-05 00:26:32 +00001===========================================
2Kaleidoscope: Implementing a Parser and AST
3===========================================
4
5.. contents::
6 :local:
7
8Written by `Chris Lattner <mailto:sabre@nondot.org>`_ and `Erick
9Tryzelaar <mailto:idadesub@users.sourceforge.net>`_
10
11Chapter 2 Introduction
12======================
13
14Welcome to Chapter 2 of the "`Implementing a language with LLVM in
15Objective Caml <index.html>`_" tutorial. This chapter shows you how to
16use the lexer, built in `Chapter 1 <OCamlLangImpl1.html>`_, to build a
17full `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our
18Kaleidoscope language. Once we have a parser, we'll define and build an
19`Abstract Syntax
20Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
21
22The parser we will build uses a combination of `Recursive Descent
23Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
24`Operator-Precedence
25Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
26parse the Kaleidoscope language (the latter for binary expressions and
27the former for everything else). Before we get to parsing though, lets
28talk about the output of the parser: the Abstract Syntax Tree.
29
30The Abstract Syntax Tree (AST)
31==============================
32
33The AST for a program captures its behavior in such a way that it is
34easy for later stages of the compiler (e.g. code generation) to
35interpret. We basically want one object for each construct in the
36language, and the AST should closely model the language. In
37Kaleidoscope, we have expressions, a prototype, and a function object.
38We'll start with expressions first:
39
40.. code-block:: ocaml
41
42 (* expr - Base type for all expression nodes. *)
43 type expr =
44 (* variant for numeric literals like "1.0". *)
45 | Number of float
46
47The code above shows the definition of the base ExprAST class and one
48subclass which we use for numeric literals. The important thing to note
49about this code is that the Number variant captures the numeric value of
50the literal as an instance variable. This allows later phases of the
51compiler to know what the stored numeric value is.
52
53Right now we only create the AST, so there are no useful functions on
54them. It would be very easy to add a function to pretty print the code,
55for example. Here are the other expression AST node definitions that
56we'll use in the basic form of the Kaleidoscope language:
57
58.. code-block:: ocaml
59
60 (* variant for referencing a variable, like "a". *)
61 | Variable of string
62
63 (* variant for a binary operator. *)
64 | Binary of char * expr * expr
65
66 (* variant for function calls. *)
67 | Call of string * expr array
68
69This is all (intentionally) rather straight-forward: variables capture
70the variable name, binary operators capture their opcode (e.g. '+'), and
71calls capture a function name as well as a list of any argument
72expressions. One thing that is nice about our AST is that it captures
73the language features without talking about the syntax of the language.
74Note that there is no discussion about precedence of binary operators,
75lexical structure, etc.
76
77For our basic language, these are all of the expression nodes we'll
78define. Because it doesn't have conditional control flow, it isn't
79Turing-complete; we'll fix that in a later installment. The two things
80we need next are a way to talk about the interface to a function, and a
81way to talk about functions themselves:
82
83.. code-block:: ocaml
84
85 (* proto - This type represents the "prototype" for a function, which captures
86 * its name, and its argument names (thus implicitly the number of arguments the
87 * function takes). *)
88 type proto = Prototype of string * string array
89
90 (* func - This type represents a function definition itself. *)
91 type func = Function of proto * expr
92
93In Kaleidoscope, functions are typed with just a count of their
94arguments. Since all values are double precision floating point, the
95type of each argument doesn't need to be stored anywhere. In a more
96aggressive and realistic language, the "expr" variants would probably
97have a type field.
98
99With this scaffolding, we can now talk about parsing expressions and
100function bodies in Kaleidoscope.
101
102Parser Basics
103=============
104
105Now that we have an AST to build, we need to define the parser code to
106build it. The idea here is that we want to parse something like "x+y"
107(which is returned as three tokens by the lexer) into an AST that could
108be generated with calls like this:
109
110.. code-block:: ocaml
111
112 let x = Variable "x" in
113 let y = Variable "y" in
114 let result = Binary ('+', x, y) in
115 ...
116
117The error handling routines make use of the builtin ``Stream.Failure``
118and ``Stream.Error``s. ``Stream.Failure`` is raised when the parser is
119unable to find any matching token in the first position of a pattern.
120``Stream.Error`` is raised when the first token matches, but the rest do
121not. The error recovery in our parser will not be the best and is not
122particular user-friendly, but it will be enough for our tutorial. These
123exceptions make it easier to handle errors in routines that have various
124return types.
125
126With these basic types and exceptions, we can implement the first piece
127of our grammar: numeric literals.
128
129Basic Expression Parsing
130========================
131
132We start with numeric literals, because they are the simplest to
133process. For each production in our grammar, we'll define a function
134which parses that production. We call this class of expressions
135"primary" expressions, for reasons that will become more clear `later in
136the tutorial <OCamlLangImpl6.html#unary>`_. In order to parse an
137arbitrary primary expression, we need to determine what sort of
138expression it is. For numeric literals, we have:
139
140.. code-block:: ocaml
141
142 (* primary
143 * ::= identifier
144 * ::= numberexpr
145 * ::= parenexpr *)
146 parse_primary = parser
147 (* numberexpr ::= number *)
148 | [< 'Token.Number n >] -> Ast.Number n
149
150This routine is very simple: it expects to be called when the current
151token is a ``Token.Number`` token. It takes the current number value,
152creates a ``Ast.Number`` node, advances the lexer to the next token, and
153finally returns.
154
155There are some interesting aspects to this. The most important one is
156that this routine eats all of the tokens that correspond to the
157production and returns the lexer buffer with the next token (which is
158not part of the grammar production) ready to go. This is a fairly
159standard way to go for recursive descent parsers. For a better example,
160the parenthesis operator is defined like this:
161
162.. code-block:: ocaml
163
164 (* parenexpr ::= '(' expression ')' *)
165 | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
166
167This function illustrates a number of interesting things about the
168parser:
169
1701) It shows how we use the ``Stream.Error`` exception. When called, this
171function expects that the current token is a '(' token, but after
172parsing the subexpression, it is possible that there is no ')' waiting.
173For example, if the user types in "(4 x" instead of "(4)", the parser
174should emit an error. Because errors can occur, the parser needs a way
175to indicate that they happened. In our parser, we use the camlp4
176shortcut syntax ``token ?? "parse error"``, where if the token before
177the ``??`` does not match, then ``Stream.Error "parse error"`` will be
178raised.
179
1802) Another interesting aspect of this function is that it uses recursion
181by calling ``Parser.parse_primary`` (we will soon see that
182``Parser.parse_primary`` can call ``Parser.parse_primary``). This is
183powerful because it allows us to handle recursive grammars, and keeps
184each production very simple. Note that parentheses do not cause
185construction of AST nodes themselves. While we could do it this way, the
186most important role of parentheses are to guide the parser and provide
187grouping. Once the parser constructs the AST, parentheses are not
188needed.
189
190The next simple production is for handling variable references and
191function calls:
192
193.. code-block:: ocaml
194
195 (* identifierexpr
196 * ::= identifier
197 * ::= identifier '(' argumentexpr ')' *)
198 | [< 'Token.Ident id; stream >] ->
199 let rec parse_args accumulator = parser
200 | [< e=parse_expr; stream >] ->
201 begin parser
202 | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
203 | [< >] -> e :: accumulator
204 end stream
205 | [< >] -> accumulator
206 in
207 let rec parse_ident id = parser
208 (* Call. *)
209 | [< 'Token.Kwd '(';
210 args=parse_args [];
211 'Token.Kwd ')' ?? "expected ')'">] ->
212 Ast.Call (id, Array.of_list (List.rev args))
213
214 (* Simple variable ref. *)
215 | [< >] -> Ast.Variable id
216 in
217 parse_ident id stream
218
219This routine follows the same style as the other routines. (It expects
220to be called if the current token is a ``Token.Ident`` token). It also
221has recursion and error handling. One interesting aspect of this is that
222it uses *look-ahead* to determine if the current identifier is a stand
223alone variable reference or if it is a function call expression. It
224handles this by checking to see if the token after the identifier is a
225'(' token, constructing either a ``Ast.Variable`` or ``Ast.Call`` node
226as appropriate.
227
228We finish up by raising an exception if we received a token we didn't
229expect:
230
231.. code-block:: ocaml
232
233 | [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
234
235Now that basic expressions are handled, we need to handle binary
236expressions. They are a bit more complex.
237
238Binary Expression Parsing
239=========================
240
241Binary expressions are significantly harder to parse because they are
242often ambiguous. For example, when given the string "x+y\*z", the parser
243can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
244definitions from mathematics, we expect the later parse, because "\*"
245(multiplication) has higher *precedence* than "+" (addition).
246
247There are many ways to handle this, but an elegant and efficient way is
248to use `Operator-Precedence
249Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
250This parsing technique uses the precedence of binary operators to guide
251recursion. To start with, we need a table of precedences:
252
253.. code-block:: ocaml
254
255 (* binop_precedence - This holds the precedence for each binary operator that is
256 * defined *)
257 let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
258
259 (* precedence - Get the precedence of the pending binary operator token. *)
260 let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1
261
262 ...
263
264 let main () =
265 (* Install standard binary operators.
266 * 1 is the lowest precedence. *)
267 Hashtbl.add Parser.binop_precedence '<' 10;
268 Hashtbl.add Parser.binop_precedence '+' 20;
269 Hashtbl.add Parser.binop_precedence '-' 20;
270 Hashtbl.add Parser.binop_precedence '*' 40; (* highest. *)
271 ...
272
273For the basic form of Kaleidoscope, we will only support 4 binary
274operators (this can obviously be extended by you, our brave and intrepid
275reader). The ``Parser.precedence`` function returns the precedence for
276the current token, or -1 if the token is not a binary operator. Having a
277``Hashtbl.t`` makes it easy to add new operators and makes it clear that
278the algorithm doesn't depend on the specific operators involved, but it
279would be easy enough to eliminate the ``Hashtbl.t`` and do the
280comparisons in the ``Parser.precedence`` function. (Or just use a
281fixed-size array).
282
283With the helper above defined, we can now start parsing binary
284expressions. The basic idea of operator precedence parsing is to break
285down an expression with potentially ambiguous binary operators into
286pieces. Consider ,for example, the expression "a+b+(c+d)\*e\*f+g".
287Operator precedence parsing considers this as a stream of primary
288expressions separated by binary operators. As such, it will first parse
289the leading primary expression "a", then it will see the pairs [+, b]
290[+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
291primary expressions, the binary expression parser doesn't need to worry
292about nested subexpressions like (c+d) at all.
293
294To start, an expression is a primary expression potentially followed by
295a sequence of [binop,primaryexpr] pairs:
296
297.. code-block:: ocaml
298
299 (* expression
300 * ::= primary binoprhs *)
301 and parse_expr = parser
302 | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
303
304``Parser.parse_bin_rhs`` is the function that parses the sequence of
305pairs for us. It takes a precedence and a pointer to an expression for
306the part that has been parsed so far. Note that "x" is a perfectly valid
307expression: As such, "binoprhs" is allowed to be empty, in which case it
308returns the expression that is passed into it. In our example above, the
309code passes the expression for "a" into ``Parser.parse_bin_rhs`` and the
310current token is "+".
311
312The precedence value passed into ``Parser.parse_bin_rhs`` indicates the
313*minimal operator precedence* that the function is allowed to eat. For
314example, if the current pair stream is [+, x] and
315``Parser.parse_bin_rhs`` is passed in a precedence of 40, it will not
316consume any tokens (because the precedence of '+' is only 20). With this
317in mind, ``Parser.parse_bin_rhs`` starts with:
318
319.. code-block:: ocaml
320
321 (* binoprhs
322 * ::= ('+' primary)* *)
323 and parse_bin_rhs expr_prec lhs stream =
324 match Stream.peek stream with
325 (* If this is a binop, find its precedence. *)
326 | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
327 let token_prec = precedence c in
328
329 (* If this is a binop that binds at least as tightly as the current binop,
330 * consume it, otherwise we are done. *)
331 if token_prec < expr_prec then lhs else begin
332
333This code gets the precedence of the current token and checks to see if
334if is too low. Because we defined invalid tokens to have a precedence of
335-1, this check implicitly knows that the pair-stream ends when the token
336stream runs out of binary operators. If this check succeeds, we know
337that the token is a binary operator and that it will be included in this
338expression:
339
340.. code-block:: ocaml
341
342 (* Eat the binop. *)
343 Stream.junk stream;
344
345 (* Okay, we know this is a binop. *)
346 let rhs =
347 match Stream.peek stream with
348 | Some (Token.Kwd c2) ->
349
350As such, this code eats (and remembers) the binary operator and then
351parses the primary expression that follows. This builds up the whole
352pair, the first of which is [+, b] for the running example.
353
354Now that we parsed the left-hand side of an expression and one pair of
355the RHS sequence, we have to decide which way the expression associates.
356In particular, we could have "(a+b) binop unparsed" or "a + (b binop
357unparsed)". To determine this, we look ahead at "binop" to determine its
358precedence and compare it to BinOp's precedence (which is '+' in this
359case):
360
361.. code-block:: ocaml
362
363 (* If BinOp binds less tightly with rhs than the operator after
364 * rhs, let the pending operator take rhs as its lhs. *)
365 let next_prec = precedence c2 in
366 if token_prec < next_prec
367
368If the precedence of the binop to the right of "RHS" is lower or equal
369to the precedence of our current operator, then we know that the
370parentheses associate as "(a+b) binop ...". In our example, the current
371operator is "+" and the next operator is "+", we know that they have the
372same precedence. In this case we'll create the AST node for "a+b", and
373then continue parsing:
374
375.. code-block:: ocaml
376
377 ... if body omitted ...
378 in
379
380 (* Merge lhs/rhs. *)
381 let lhs = Ast.Binary (c, lhs, rhs) in
382 parse_bin_rhs expr_prec lhs stream
383 end
384
385In our example above, this will turn "a+b+" into "(a+b)" and execute the
386next iteration of the loop, with "+" as the current token. The code
387above will eat, remember, and parse "(c+d)" as the primary expression,
388which makes the current pair equal to [+, (c+d)]. It will then evaluate
389the 'if' conditional above with "\*" as the binop to the right of the
390primary. In this case, the precedence of "\*" is higher than the
391precedence of "+" so the if condition will be entered.
392
393The critical question left here is "how can the if condition parse the
394right hand side in full"? In particular, to build the AST correctly for
395our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
396variable. The code to do this is surprisingly simple (code from the
397above two blocks duplicated for context):
398
399.. code-block:: ocaml
400
401 match Stream.peek stream with
402 | Some (Token.Kwd c2) ->
403 (* If BinOp binds less tightly with rhs than the operator after
404 * rhs, let the pending operator take rhs as its lhs. *)
405 if token_prec < precedence c2
406 then parse_bin_rhs (token_prec + 1) rhs stream
407 else rhs
408 | _ -> rhs
409 in
410
411 (* Merge lhs/rhs. *)
412 let lhs = Ast.Binary (c, lhs, rhs) in
413 parse_bin_rhs expr_prec lhs stream
414 end
415
416At this point, we know that the binary operator to the RHS of our
417primary has higher precedence than the binop we are currently parsing.
418As such, we know that any sequence of pairs whose operators are all
419higher precedence than "+" should be parsed together and returned as
420"RHS". To do this, we recursively invoke the ``Parser.parse_bin_rhs``
421function specifying "token\_prec+1" as the minimum precedence required
422for it to continue. In our example above, this will cause it to return
423the AST node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of
424the '+' expression.
425
426Finally, on the next iteration of the while loop, the "+g" piece is
427parsed and added to the AST. With this little bit of code (14
428non-trivial lines), we correctly handle fully general binary expression
429parsing in a very elegant way. This was a whirlwind tour of this code,
430and it is somewhat subtle. I recommend running through it with a few
431tough examples to see how it works.
432
433This wraps up handling of expressions. At this point, we can point the
434parser at an arbitrary token stream and build an expression from it,
435stopping at the first token that is not part of the expression. Next up
436we need to handle function definitions, etc.
437
438Parsing the Rest
439================
440
441The next thing missing is handling of function prototypes. In
442Kaleidoscope, these are used both for 'extern' function declarations as
443well as function body definitions. The code to do this is
444straight-forward and not very interesting (once you've survived
445expressions):
446
447.. code-block:: ocaml
448
449 (* prototype
450 * ::= id '(' id* ')' *)
451 let parse_prototype =
452 let rec parse_args accumulator = parser
453 | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
454 | [< >] -> accumulator
455 in
456
457 parser
458 | [< 'Token.Ident id;
459 'Token.Kwd '(' ?? "expected '(' in prototype";
460 args=parse_args [];
461 'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
462 (* success. *)
463 Ast.Prototype (id, Array.of_list (List.rev args))
464
465 | [< >] ->
466 raise (Stream.Error "expected function name in prototype")
467
468Given this, a function definition is very simple, just a prototype plus
469an expression to implement the body:
470
471.. code-block:: ocaml
472
473 (* definition ::= 'def' prototype expression *)
474 let parse_definition = parser
475 | [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
476 Ast.Function (p, e)
477
478In addition, we support 'extern' to declare functions like 'sin' and
479'cos' as well as to support forward declaration of user functions. These
480'extern's are just prototypes with no body:
481
482.. code-block:: ocaml
483
484 (* external ::= 'extern' prototype *)
485 let parse_extern = parser
486 | [< 'Token.Extern; e=parse_prototype >] -> e
487
488Finally, we'll also let the user type in arbitrary top-level expressions
489and evaluate them on the fly. We will handle this by defining anonymous
490nullary (zero argument) functions for them:
491
492.. code-block:: ocaml
493
494 (* toplevelexpr ::= expression *)
495 let parse_toplevel = parser
496 | [< e=parse_expr >] ->
497 (* Make an anonymous proto. *)
498 Ast.Function (Ast.Prototype ("", [||]), e)
499
500Now that we have all the pieces, let's build a little driver that will
501let us actually *execute* this code we've built!
502
503The Driver
504==========
505
506The driver for this simply invokes all of the parsing pieces with a
507top-level dispatch loop. There isn't much interesting here, so I'll just
508include the top-level loop. See `below <#code>`_ for full code in the
509"Top-Level Parsing" section.
510
511.. code-block:: ocaml
512
513 (* top ::= definition | external | expression | ';' *)
514 let rec main_loop stream =
515 match Stream.peek stream with
516 | None -> ()
517
518 (* ignore top-level semicolons. *)
519 | Some (Token.Kwd ';') ->
520 Stream.junk stream;
521 main_loop stream
522
523 | Some token ->
524 begin
525 try match token with
526 | Token.Def ->
527 ignore(Parser.parse_definition stream);
528 print_endline "parsed a function definition.";
529 | Token.Extern ->
530 ignore(Parser.parse_extern stream);
531 print_endline "parsed an extern.";
532 | _ ->
533 (* Evaluate a top-level expression into an anonymous function. *)
534 ignore(Parser.parse_toplevel stream);
535 print_endline "parsed a top-level expr";
536 with Stream.Error s ->
537 (* Skip token for error recovery. *)
538 Stream.junk stream;
539 print_endline s;
540 end;
541 print_string "ready> "; flush stdout;
542 main_loop stream
543
544The most interesting part of this is that we ignore top-level
545semicolons. Why is this, you ask? The basic reason is that if you type
546"4 + 5" at the command line, the parser doesn't know whether that is the
547end of what you will type or not. For example, on the next line you
548could type "def foo..." in which case 4+5 is the end of a top-level
549expression. Alternatively you could type "\* 6", which would continue
550the expression. Having top-level semicolons allows you to type "4+5;",
551and the parser will know you are done.
552
553Conclusions
554===========
555
556With just under 300 lines of commented code (240 lines of non-comment,
557non-blank code), we fully defined our minimal language, including a
558lexer, parser, and AST builder. With this done, the executable will
559validate Kaleidoscope code and tell us if it is grammatically invalid.
560For example, here is a sample interaction:
561
562.. code-block:: bash
563
564 $ ./toy.byte
565 ready> def foo(x y) x+foo(y, 4.0);
566 Parsed a function definition.
567 ready> def foo(x y) x+y y;
568 Parsed a function definition.
569 Parsed a top-level expr
570 ready> def foo(x y) x+y );
571 Parsed a function definition.
572 Error: unknown token when expecting an expression
573 ready> extern sin(a);
574 ready> Parsed an extern
575 ready> ^D
576 $
577
578There is a lot of room for extension here. You can define new AST nodes,
579extend the language in many ways, etc. In the `next
580installment <OCamlLangImpl3.html>`_, we will describe how to generate
581LLVM Intermediate Representation (IR) from the AST.
582
583Full Code Listing
584=================
585
586Here is the complete code listing for this and the previous chapter.
587Note that it is fully self-contained: you don't need LLVM or any
588external libraries at all for this. (Besides the ocaml standard
589libraries, of course.) To build this, just compile with:
590
591.. code-block:: bash
592
593 # Compile
594 ocamlbuild toy.byte
595 # Run
596 ./toy.byte
597
598Here is the code:
599
600\_tags:
601 ::
602
603 <{lexer,parser}.ml>: use_camlp4, pp(camlp4of)
604
605token.ml:
606 .. code-block:: ocaml
607
608 (*===----------------------------------------------------------------------===
609 * Lexer Tokens
610 *===----------------------------------------------------------------------===*)
611
612 (* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of
613 * these others for known things. *)
614 type token =
615 (* commands *)
616 | Def | Extern
617
618 (* primary *)
619 | Ident of string | Number of float
620
621 (* unknown *)
622 | Kwd of char
623
624lexer.ml:
625 .. code-block:: ocaml
626
627 (*===----------------------------------------------------------------------===
628 * Lexer
629 *===----------------------------------------------------------------------===*)
630
631 let rec lex = parser
632 (* Skip any whitespace. *)
633 | [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
634
635 (* identifier: [a-zA-Z][a-zA-Z0-9] *)
636 | [< ' ('A' .. 'Z' | 'a' .. 'z' as c); stream >] ->
637 let buffer = Buffer.create 1 in
638 Buffer.add_char buffer c;
639 lex_ident buffer stream
640
641 (* number: [0-9.]+ *)
642 | [< ' ('0' .. '9' as c); stream >] ->
643 let buffer = Buffer.create 1 in
644 Buffer.add_char buffer c;
645 lex_number buffer stream
646
647 (* Comment until end of line. *)
648 | [< ' ('#'); stream >] ->
649 lex_comment stream
650
651 (* Otherwise, just return the character as its ascii value. *)
652 | [< 'c; stream >] ->
653 [< 'Token.Kwd c; lex stream >]
654
655 (* end of stream. *)
656 | [< >] -> [< >]
657
658 and lex_number buffer = parser
659 | [< ' ('0' .. '9' | '.' as c); stream >] ->
660 Buffer.add_char buffer c;
661 lex_number buffer stream
662 | [< stream=lex >] ->
663 [< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >]
664
665 and lex_ident buffer = parser
666 | [< ' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' as c); stream >] ->
667 Buffer.add_char buffer c;
668 lex_ident buffer stream
669 | [< stream=lex >] ->
670 match Buffer.contents buffer with
671 | "def" -> [< 'Token.Def; stream >]
672 | "extern" -> [< 'Token.Extern; stream >]
673 | id -> [< 'Token.Ident id; stream >]
674
675 and lex_comment = parser
676 | [< ' ('\n'); stream=lex >] -> stream
677 | [< 'c; e=lex_comment >] -> e
678 | [< >] -> [< >]
679
680ast.ml:
681 .. code-block:: ocaml
682
683 (*===----------------------------------------------------------------------===
684 * Abstract Syntax Tree (aka Parse Tree)
685 *===----------------------------------------------------------------------===*)
686
687 (* expr - Base type for all expression nodes. *)
688 type expr =
689 (* variant for numeric literals like "1.0". *)
690 | Number of float
691
692 (* variant for referencing a variable, like "a". *)
693 | Variable of string
694
695 (* variant for a binary operator. *)
696 | Binary of char * expr * expr
697
698 (* variant for function calls. *)
699 | Call of string * expr array
700
701 (* proto - This type represents the "prototype" for a function, which captures
702 * its name, and its argument names (thus implicitly the number of arguments the
703 * function takes). *)
704 type proto = Prototype of string * string array
705
706 (* func - This type represents a function definition itself. *)
707 type func = Function of proto * expr
708
709parser.ml:
710 .. code-block:: ocaml
711
712 (*===---------------------------------------------------------------------===
713 * Parser
714 *===---------------------------------------------------------------------===*)
715
716 (* binop_precedence - This holds the precedence for each binary operator that is
717 * defined *)
718 let binop_precedence:(char, int) Hashtbl.t = Hashtbl.create 10
719
720 (* precedence - Get the precedence of the pending binary operator token. *)
721 let precedence c = try Hashtbl.find binop_precedence c with Not_found -> -1
722
723 (* primary
724 * ::= identifier
725 * ::= numberexpr
726 * ::= parenexpr *)
727 let rec parse_primary = parser
728 (* numberexpr ::= number *)
729 | [< 'Token.Number n >] -> Ast.Number n
730
731 (* parenexpr ::= '(' expression ')' *)
732 | [< 'Token.Kwd '('; e=parse_expr; 'Token.Kwd ')' ?? "expected ')'" >] -> e
733
734 (* identifierexpr
735 * ::= identifier
736 * ::= identifier '(' argumentexpr ')' *)
737 | [< 'Token.Ident id; stream >] ->
738 let rec parse_args accumulator = parser
739 | [< e=parse_expr; stream >] ->
740 begin parser
741 | [< 'Token.Kwd ','; e=parse_args (e :: accumulator) >] -> e
742 | [< >] -> e :: accumulator
743 end stream
744 | [< >] -> accumulator
745 in
746 let rec parse_ident id = parser
747 (* Call. *)
748 | [< 'Token.Kwd '(';
749 args=parse_args [];
750 'Token.Kwd ')' ?? "expected ')'">] ->
751 Ast.Call (id, Array.of_list (List.rev args))
752
753 (* Simple variable ref. *)
754 | [< >] -> Ast.Variable id
755 in
756 parse_ident id stream
757
758 | [< >] -> raise (Stream.Error "unknown token when expecting an expression.")
759
760 (* binoprhs
761 * ::= ('+' primary)* *)
762 and parse_bin_rhs expr_prec lhs stream =
763 match Stream.peek stream with
764 (* If this is a binop, find its precedence. *)
765 | Some (Token.Kwd c) when Hashtbl.mem binop_precedence c ->
766 let token_prec = precedence c in
767
768 (* If this is a binop that binds at least as tightly as the current binop,
769 * consume it, otherwise we are done. *)
770 if token_prec < expr_prec then lhs else begin
771 (* Eat the binop. *)
772 Stream.junk stream;
773
774 (* Parse the primary expression after the binary operator. *)
775 let rhs = parse_primary stream in
776
777 (* Okay, we know this is a binop. *)
778 let rhs =
779 match Stream.peek stream with
780 | Some (Token.Kwd c2) ->
781 (* If BinOp binds less tightly with rhs than the operator after
782 * rhs, let the pending operator take rhs as its lhs. *)
783 let next_prec = precedence c2 in
784 if token_prec < next_prec
785 then parse_bin_rhs (token_prec + 1) rhs stream
786 else rhs
787 | _ -> rhs
788 in
789
790 (* Merge lhs/rhs. *)
791 let lhs = Ast.Binary (c, lhs, rhs) in
792 parse_bin_rhs expr_prec lhs stream
793 end
794 | _ -> lhs
795
796 (* expression
797 * ::= primary binoprhs *)
798 and parse_expr = parser
799 | [< lhs=parse_primary; stream >] -> parse_bin_rhs 0 lhs stream
800
801 (* prototype
802 * ::= id '(' id* ')' *)
803 let parse_prototype =
804 let rec parse_args accumulator = parser
805 | [< 'Token.Ident id; e=parse_args (id::accumulator) >] -> e
806 | [< >] -> accumulator
807 in
808
809 parser
810 | [< 'Token.Ident id;
811 'Token.Kwd '(' ?? "expected '(' in prototype";
812 args=parse_args [];
813 'Token.Kwd ')' ?? "expected ')' in prototype" >] ->
814 (* success. *)
815 Ast.Prototype (id, Array.of_list (List.rev args))
816
817 | [< >] ->
818 raise (Stream.Error "expected function name in prototype")
819
820 (* definition ::= 'def' prototype expression *)
821 let parse_definition = parser
822 | [< 'Token.Def; p=parse_prototype; e=parse_expr >] ->
823 Ast.Function (p, e)
824
825 (* toplevelexpr ::= expression *)
826 let parse_toplevel = parser
827 | [< e=parse_expr >] ->
828 (* Make an anonymous proto. *)
829 Ast.Function (Ast.Prototype ("", [||]), e)
830
831 (* external ::= 'extern' prototype *)
832 let parse_extern = parser
833 | [< 'Token.Extern; e=parse_prototype >] -> e
834
835toplevel.ml:
836 .. code-block:: ocaml
837
838 (*===----------------------------------------------------------------------===
839 * Top-Level parsing and JIT Driver
840 *===----------------------------------------------------------------------===*)
841
842 (* top ::= definition | external | expression | ';' *)
843 let rec main_loop stream =
844 match Stream.peek stream with
845 | None -> ()
846
847 (* ignore top-level semicolons. *)
848 | Some (Token.Kwd ';') ->
849 Stream.junk stream;
850 main_loop stream
851
852 | Some token ->
853 begin
854 try match token with
855 | Token.Def ->
856 ignore(Parser.parse_definition stream);
857 print_endline "parsed a function definition.";
858 | Token.Extern ->
859 ignore(Parser.parse_extern stream);
860 print_endline "parsed an extern.";
861 | _ ->
862 (* Evaluate a top-level expression into an anonymous function. *)
863 ignore(Parser.parse_toplevel stream);
864 print_endline "parsed a top-level expr";
865 with Stream.Error s ->
866 (* Skip token for error recovery. *)
867 Stream.junk stream;
868 print_endline s;
869 end;
870 print_string "ready> "; flush stdout;
871 main_loop stream
872
873toy.ml:
874 .. code-block:: ocaml
875
876 (*===----------------------------------------------------------------------===
877 * Main driver code.
878 *===----------------------------------------------------------------------===*)
879
880 let main () =
881 (* Install standard binary operators.
882 * 1 is the lowest precedence. *)
883 Hashtbl.add Parser.binop_precedence '<' 10;
884 Hashtbl.add Parser.binop_precedence '+' 20;
885 Hashtbl.add Parser.binop_precedence '-' 20;
886 Hashtbl.add Parser.binop_precedence '*' 40; (* highest. *)
887
888 (* Prime the first token. *)
889 print_string "ready> "; flush stdout;
890 let stream = Lexer.lex (Stream.of_channel stdin) in
891
892 (* Run the main "interpreter loop" now. *)
893 Toplevel.main_loop stream;
894 ;;
895
896 main ()
897
898`Next: Implementing Code Generation to LLVM IR <OCamlLangImpl3.html>`_
899