blob: 0d62894a24028434981b60ac896b03794b945082 [file] [log] [blame]
Sean Silvaee47edf2012-12-05 00:26:32 +00001===========================================
2Kaleidoscope: Implementing a Parser and AST
3===========================================
4
5.. contents::
6 :local:
7
8Written by `Chris Lattner <mailto:sabre@nondot.org>`_
9
10Chapter 2 Introduction
11======================
12
13Welcome to Chapter 2 of the "`Implementing a language with
14LLVM <index.html>`_" tutorial. This chapter shows you how to use the
15lexer, built in `Chapter 1 <LangImpl1.html>`_, to build a full
16`parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
17language. Once we have a parser, we'll define and build an `Abstract
18Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
19
20The parser we will build uses a combination of `Recursive Descent
21Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
22`Operator-Precedence
23Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
24parse the Kaleidoscope language (the latter for binary expressions and
25the former for everything else). Before we get to parsing though, lets
26talk about the output of the parser: the Abstract Syntax Tree.
27
28The Abstract Syntax Tree (AST)
29==============================
30
31The AST for a program captures its behavior in such a way that it is
32easy for later stages of the compiler (e.g. code generation) to
33interpret. We basically want one object for each construct in the
34language, and the AST should closely model the language. In
35Kaleidoscope, we have expressions, a prototype, and a function object.
36We'll start with expressions first:
37
38.. code-block:: c++
39
40 /// ExprAST - Base class for all expression nodes.
41 class ExprAST {
42 public:
43 virtual ~ExprAST() {}
44 };
45
46 /// NumberExprAST - Expression class for numeric literals like "1.0".
47 class NumberExprAST : public ExprAST {
48 double Val;
49 public:
50 NumberExprAST(double val) : Val(val) {}
51 };
52
53The code above shows the definition of the base ExprAST class and one
54subclass which we use for numeric literals. The important thing to note
55about this code is that the NumberExprAST class captures the numeric
56value of the literal as an instance variable. This allows later phases
57of the compiler to know what the stored numeric value is.
58
59Right now we only create the AST, so there are no useful accessor
60methods on them. It would be very easy to add a virtual method to pretty
61print the code, for example. Here are the other expression AST node
62definitions that we'll use in the basic form of the Kaleidoscope
63language:
64
65.. code-block:: c++
66
67 /// VariableExprAST - Expression class for referencing a variable, like "a".
68 class VariableExprAST : public ExprAST {
69 std::string Name;
70 public:
71 VariableExprAST(const std::string &name) : Name(name) {}
72 };
73
74 /// BinaryExprAST - Expression class for a binary operator.
75 class BinaryExprAST : public ExprAST {
76 char Op;
77 ExprAST *LHS, *RHS;
78 public:
79 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
80 : Op(op), LHS(lhs), RHS(rhs) {}
81 };
82
83 /// CallExprAST - Expression class for function calls.
84 class CallExprAST : public ExprAST {
85 std::string Callee;
86 std::vector<ExprAST*> Args;
87 public:
88 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
89 : Callee(callee), Args(args) {}
90 };
91
92This is all (intentionally) rather straight-forward: variables capture
93the variable name, binary operators capture their opcode (e.g. '+'), and
94calls capture a function name as well as a list of any argument
95expressions. One thing that is nice about our AST is that it captures
96the language features without talking about the syntax of the language.
97Note that there is no discussion about precedence of binary operators,
98lexical structure, etc.
99
100For our basic language, these are all of the expression nodes we'll
101define. Because it doesn't have conditional control flow, it isn't
102Turing-complete; we'll fix that in a later installment. The two things
103we need next are a way to talk about the interface to a function, and a
104way to talk about functions themselves:
105
106.. code-block:: c++
107
108 /// PrototypeAST - This class represents the "prototype" for a function,
109 /// which captures its name, and its argument names (thus implicitly the number
110 /// of arguments the function takes).
111 class PrototypeAST {
112 std::string Name;
113 std::vector<std::string> Args;
114 public:
115 PrototypeAST(const std::string &name, const std::vector<std::string> &args)
116 : Name(name), Args(args) {}
117 };
118
119 /// FunctionAST - This class represents a function definition itself.
120 class FunctionAST {
121 PrototypeAST *Proto;
122 ExprAST *Body;
123 public:
124 FunctionAST(PrototypeAST *proto, ExprAST *body)
125 : Proto(proto), Body(body) {}
126 };
127
128In Kaleidoscope, functions are typed with just a count of their
129arguments. Since all values are double precision floating point, the
130type of each argument doesn't need to be stored anywhere. In a more
131aggressive and realistic language, the "ExprAST" class would probably
132have a type field.
133
134With this scaffolding, we can now talk about parsing expressions and
135function bodies in Kaleidoscope.
136
137Parser Basics
138=============
139
140Now that we have an AST to build, we need to define the parser code to
141build it. The idea here is that we want to parse something like "x+y"
142(which is returned as three tokens by the lexer) into an AST that could
143be generated with calls like this:
144
145.. code-block:: c++
146
147 ExprAST *X = new VariableExprAST("x");
148 ExprAST *Y = new VariableExprAST("y");
149 ExprAST *Result = new BinaryExprAST('+', X, Y);
150
151In order to do this, we'll start by defining some basic helper routines:
152
153.. code-block:: c++
154
155 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
156 /// token the parser is looking at. getNextToken reads another token from the
157 /// lexer and updates CurTok with its results.
158 static int CurTok;
159 static int getNextToken() {
160 return CurTok = gettok();
161 }
162
163This implements a simple token buffer around the lexer. This allows us
164to look one token ahead at what the lexer is returning. Every function
165in our parser will assume that CurTok is the current token that needs to
166be parsed.
167
168.. code-block:: c++
169
170
171 /// Error* - These are little helper functions for error handling.
172 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
173 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
174 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
175
176The ``Error`` routines are simple helper routines that our parser will
177use to handle errors. The error recovery in our parser will not be the
178best and is not particular user-friendly, but it will be enough for our
179tutorial. These routines make it easier to handle errors in routines
180that have various return types: they always return null.
181
182With these basic helper functions, we can implement the first piece of
183our grammar: numeric literals.
184
185Basic Expression Parsing
186========================
187
188We start with numeric literals, because they are the simplest to
189process. For each production in our grammar, we'll define a function
190which parses that production. For numeric literals, we have:
191
192.. code-block:: c++
193
194 /// numberexpr ::= number
195 static ExprAST *ParseNumberExpr() {
196 ExprAST *Result = new NumberExprAST(NumVal);
197 getNextToken(); // consume the number
198 return Result;
199 }
200
201This routine is very simple: it expects to be called when the current
202token is a ``tok_number`` token. It takes the current number value,
203creates a ``NumberExprAST`` node, advances the lexer to the next token,
204and finally returns.
205
206There are some interesting aspects to this. The most important one is
207that this routine eats all of the tokens that correspond to the
208production and returns the lexer buffer with the next token (which is
209not part of the grammar production) ready to go. This is a fairly
210standard way to go for recursive descent parsers. For a better example,
211the parenthesis operator is defined like this:
212
213.. code-block:: c++
214
215 /// parenexpr ::= '(' expression ')'
216 static ExprAST *ParseParenExpr() {
217 getNextToken(); // eat (.
218 ExprAST *V = ParseExpression();
219 if (!V) return 0;
220
221 if (CurTok != ')')
222 return Error("expected ')'");
223 getNextToken(); // eat ).
224 return V;
225 }
226
227This function illustrates a number of interesting things about the
228parser:
229
2301) It shows how we use the Error routines. When called, this function
231expects that the current token is a '(' token, but after parsing the
232subexpression, it is possible that there is no ')' waiting. For example,
233if the user types in "(4 x" instead of "(4)", the parser should emit an
234error. Because errors can occur, the parser needs a way to indicate that
235they happened: in our parser, we return null on an error.
236
2372) Another interesting aspect of this function is that it uses recursion
238by calling ``ParseExpression`` (we will soon see that
239``ParseExpression`` can call ``ParseParenExpr``). This is powerful
240because it allows us to handle recursive grammars, and keeps each
241production very simple. Note that parentheses do not cause construction
242of AST nodes themselves. While we could do it this way, the most
243important role of parentheses are to guide the parser and provide
244grouping. Once the parser constructs the AST, parentheses are not
245needed.
246
247The next simple production is for handling variable references and
248function calls:
249
250.. code-block:: c++
251
252 /// identifierexpr
253 /// ::= identifier
254 /// ::= identifier '(' expression* ')'
255 static ExprAST *ParseIdentifierExpr() {
256 std::string IdName = IdentifierStr;
257
258 getNextToken(); // eat identifier.
259
260 if (CurTok != '(') // Simple variable ref.
261 return new VariableExprAST(IdName);
262
263 // Call.
264 getNextToken(); // eat (
265 std::vector<ExprAST*> Args;
266 if (CurTok != ')') {
267 while (1) {
268 ExprAST *Arg = ParseExpression();
269 if (!Arg) return 0;
270 Args.push_back(Arg);
271
272 if (CurTok == ')') break;
273
274 if (CurTok != ',')
275 return Error("Expected ')' or ',' in argument list");
276 getNextToken();
277 }
278 }
279
280 // Eat the ')'.
281 getNextToken();
282
283 return new CallExprAST(IdName, Args);
284 }
285
286This routine follows the same style as the other routines. (It expects
287to be called if the current token is a ``tok_identifier`` token). It
288also has recursion and error handling. One interesting aspect of this is
289that it uses *look-ahead* to determine if the current identifier is a
290stand alone variable reference or if it is a function call expression.
291It handles this by checking to see if the token after the identifier is
292a '(' token, constructing either a ``VariableExprAST`` or
293``CallExprAST`` node as appropriate.
294
295Now that we have all of our simple expression-parsing logic in place, we
296can define a helper function to wrap it together into one entry point.
297We call this class of expressions "primary" expressions, for reasons
298that will become more clear `later in the
299tutorial <LangImpl6.html#unary>`_. In order to parse an arbitrary
300primary expression, we need to determine what sort of expression it is:
301
302.. code-block:: c++
303
304 /// primary
305 /// ::= identifierexpr
306 /// ::= numberexpr
307 /// ::= parenexpr
308 static ExprAST *ParsePrimary() {
309 switch (CurTok) {
310 default: return Error("unknown token when expecting an expression");
311 case tok_identifier: return ParseIdentifierExpr();
312 case tok_number: return ParseNumberExpr();
313 case '(': return ParseParenExpr();
314 }
315 }
316
317Now that you see the definition of this function, it is more obvious why
318we can assume the state of CurTok in the various functions. This uses
319look-ahead to determine which sort of expression is being inspected, and
320then parses it with a function call.
321
322Now that basic expressions are handled, we need to handle binary
323expressions. They are a bit more complex.
324
325Binary Expression Parsing
326=========================
327
328Binary expressions are significantly harder to parse because they are
329often ambiguous. For example, when given the string "x+y\*z", the parser
330can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
331definitions from mathematics, we expect the later parse, because "\*"
332(multiplication) has higher *precedence* than "+" (addition).
333
334There are many ways to handle this, but an elegant and efficient way is
335to use `Operator-Precedence
336Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
337This parsing technique uses the precedence of binary operators to guide
338recursion. To start with, we need a table of precedences:
339
340.. code-block:: c++
341
342 /// BinopPrecedence - This holds the precedence for each binary operator that is
343 /// defined.
344 static std::map<char, int> BinopPrecedence;
345
346 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
347 static int GetTokPrecedence() {
348 if (!isascii(CurTok))
349 return -1;
350
351 // Make sure it's a declared binop.
352 int TokPrec = BinopPrecedence[CurTok];
353 if (TokPrec <= 0) return -1;
354 return TokPrec;
355 }
356
357 int main() {
358 // Install standard binary operators.
359 // 1 is lowest precedence.
360 BinopPrecedence['<'] = 10;
361 BinopPrecedence['+'] = 20;
362 BinopPrecedence['-'] = 20;
363 BinopPrecedence['*'] = 40; // highest.
364 ...
365 }
366
367For the basic form of Kaleidoscope, we will only support 4 binary
368operators (this can obviously be extended by you, our brave and intrepid
369reader). The ``GetTokPrecedence`` function returns the precedence for
370the current token, or -1 if the token is not a binary operator. Having a
371map makes it easy to add new operators and makes it clear that the
372algorithm doesn't depend on the specific operators involved, but it
373would be easy enough to eliminate the map and do the comparisons in the
374``GetTokPrecedence`` function. (Or just use a fixed-size array).
375
376With the helper above defined, we can now start parsing binary
377expressions. The basic idea of operator precedence parsing is to break
378down an expression with potentially ambiguous binary operators into
379pieces. Consider ,for example, the expression "a+b+(c+d)\*e\*f+g".
380Operator precedence parsing considers this as a stream of primary
381expressions separated by binary operators. As such, it will first parse
382the leading primary expression "a", then it will see the pairs [+, b]
383[+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
384primary expressions, the binary expression parser doesn't need to worry
385about nested subexpressions like (c+d) at all.
386
387To start, an expression is a primary expression potentially followed by
388a sequence of [binop,primaryexpr] pairs:
389
390.. code-block:: c++
391
392 /// expression
393 /// ::= primary binoprhs
394 ///
395 static ExprAST *ParseExpression() {
396 ExprAST *LHS = ParsePrimary();
397 if (!LHS) return 0;
398
399 return ParseBinOpRHS(0, LHS);
400 }
401
402``ParseBinOpRHS`` is the function that parses the sequence of pairs for
403us. It takes a precedence and a pointer to an expression for the part
404that has been parsed so far. Note that "x" is a perfectly valid
405expression: As such, "binoprhs" is allowed to be empty, in which case it
406returns the expression that is passed into it. In our example above, the
407code passes the expression for "a" into ``ParseBinOpRHS`` and the
408current token is "+".
409
410The precedence value passed into ``ParseBinOpRHS`` indicates the
411*minimal operator precedence* that the function is allowed to eat. For
412example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
413passed in a precedence of 40, it will not consume any tokens (because
414the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
415starts with:
416
417.. code-block:: c++
418
419 /// binoprhs
420 /// ::= ('+' primary)*
421 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
422 // If this is a binop, find its precedence.
423 while (1) {
424 int TokPrec = GetTokPrecedence();
425
426 // If this is a binop that binds at least as tightly as the current binop,
427 // consume it, otherwise we are done.
428 if (TokPrec < ExprPrec)
429 return LHS;
430
431This code gets the precedence of the current token and checks to see if
432if is too low. Because we defined invalid tokens to have a precedence of
433-1, this check implicitly knows that the pair-stream ends when the token
434stream runs out of binary operators. If this check succeeds, we know
435that the token is a binary operator and that it will be included in this
436expression:
437
438.. code-block:: c++
439
440 // Okay, we know this is a binop.
441 int BinOp = CurTok;
442 getNextToken(); // eat binop
443
444 // Parse the primary expression after the binary operator.
445 ExprAST *RHS = ParsePrimary();
446 if (!RHS) return 0;
447
448As such, this code eats (and remembers) the binary operator and then
449parses the primary expression that follows. This builds up the whole
450pair, the first of which is [+, b] for the running example.
451
452Now that we parsed the left-hand side of an expression and one pair of
453the RHS sequence, we have to decide which way the expression associates.
454In particular, we could have "(a+b) binop unparsed" or "a + (b binop
455unparsed)". To determine this, we look ahead at "binop" to determine its
456precedence and compare it to BinOp's precedence (which is '+' in this
457case):
458
459.. code-block:: c++
460
461 // If BinOp binds less tightly with RHS than the operator after RHS, let
462 // the pending operator take RHS as its LHS.
463 int NextPrec = GetTokPrecedence();
464 if (TokPrec < NextPrec) {
465
466If the precedence of the binop to the right of "RHS" is lower or equal
467to the precedence of our current operator, then we know that the
468parentheses associate as "(a+b) binop ...". In our example, the current
469operator is "+" and the next operator is "+", we know that they have the
470same precedence. In this case we'll create the AST node for "a+b", and
471then continue parsing:
472
473.. code-block:: c++
474
475 ... if body omitted ...
476 }
477
478 // Merge LHS/RHS.
479 LHS = new BinaryExprAST(BinOp, LHS, RHS);
480 } // loop around to the top of the while loop.
481 }
482
483In our example above, this will turn "a+b+" into "(a+b)" and execute the
484next iteration of the loop, with "+" as the current token. The code
485above will eat, remember, and parse "(c+d)" as the primary expression,
486which makes the current pair equal to [+, (c+d)]. It will then evaluate
487the 'if' conditional above with "\*" as the binop to the right of the
488primary. In this case, the precedence of "\*" is higher than the
489precedence of "+" so the if condition will be entered.
490
491The critical question left here is "how can the if condition parse the
492right hand side in full"? In particular, to build the AST correctly for
493our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
494variable. The code to do this is surprisingly simple (code from the
495above two blocks duplicated for context):
496
497.. code-block:: c++
498
499 // If BinOp binds less tightly with RHS than the operator after RHS, let
500 // the pending operator take RHS as its LHS.
501 int NextPrec = GetTokPrecedence();
502 if (TokPrec < NextPrec) {
503 RHS = ParseBinOpRHS(TokPrec+1, RHS);
504 if (RHS == 0) return 0;
505 }
506 // Merge LHS/RHS.
507 LHS = new BinaryExprAST(BinOp, LHS, RHS);
508 } // loop around to the top of the while loop.
509 }
510
511At this point, we know that the binary operator to the RHS of our
512primary has higher precedence than the binop we are currently parsing.
513As such, we know that any sequence of pairs whose operators are all
514higher precedence than "+" should be parsed together and returned as
515"RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
516specifying "TokPrec+1" as the minimum precedence required for it to
517continue. In our example above, this will cause it to return the AST
518node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+'
519expression.
520
521Finally, on the next iteration of the while loop, the "+g" piece is
522parsed and added to the AST. With this little bit of code (14
523non-trivial lines), we correctly handle fully general binary expression
524parsing in a very elegant way. This was a whirlwind tour of this code,
525and it is somewhat subtle. I recommend running through it with a few
526tough examples to see how it works.
527
528This wraps up handling of expressions. At this point, we can point the
529parser at an arbitrary token stream and build an expression from it,
530stopping at the first token that is not part of the expression. Next up
531we need to handle function definitions, etc.
532
533Parsing the Rest
534================
535
536The next thing missing is handling of function prototypes. In
537Kaleidoscope, these are used both for 'extern' function declarations as
538well as function body definitions. The code to do this is
539straight-forward and not very interesting (once you've survived
540expressions):
541
542.. code-block:: c++
543
544 /// prototype
545 /// ::= id '(' id* ')'
546 static PrototypeAST *ParsePrototype() {
547 if (CurTok != tok_identifier)
548 return ErrorP("Expected function name in prototype");
549
550 std::string FnName = IdentifierStr;
551 getNextToken();
552
553 if (CurTok != '(')
554 return ErrorP("Expected '(' in prototype");
555
556 // Read the list of argument names.
557 std::vector<std::string> ArgNames;
558 while (getNextToken() == tok_identifier)
559 ArgNames.push_back(IdentifierStr);
560 if (CurTok != ')')
561 return ErrorP("Expected ')' in prototype");
562
563 // success.
564 getNextToken(); // eat ')'.
565
566 return new PrototypeAST(FnName, ArgNames);
567 }
568
569Given this, a function definition is very simple, just a prototype plus
570an expression to implement the body:
571
572.. code-block:: c++
573
574 /// definition ::= 'def' prototype expression
575 static FunctionAST *ParseDefinition() {
576 getNextToken(); // eat def.
577 PrototypeAST *Proto = ParsePrototype();
578 if (Proto == 0) return 0;
579
580 if (ExprAST *E = ParseExpression())
581 return new FunctionAST(Proto, E);
582 return 0;
583 }
584
585In addition, we support 'extern' to declare functions like 'sin' and
586'cos' as well as to support forward declaration of user functions. These
587'extern's are just prototypes with no body:
588
589.. code-block:: c++
590
591 /// external ::= 'extern' prototype
592 static PrototypeAST *ParseExtern() {
593 getNextToken(); // eat extern.
594 return ParsePrototype();
595 }
596
597Finally, we'll also let the user type in arbitrary top-level expressions
598and evaluate them on the fly. We will handle this by defining anonymous
599nullary (zero argument) functions for them:
600
601.. code-block:: c++
602
603 /// toplevelexpr ::= expression
604 static FunctionAST *ParseTopLevelExpr() {
605 if (ExprAST *E = ParseExpression()) {
606 // Make an anonymous proto.
607 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
608 return new FunctionAST(Proto, E);
609 }
610 return 0;
611 }
612
613Now that we have all the pieces, let's build a little driver that will
614let us actually *execute* this code we've built!
615
616The Driver
617==========
618
619The driver for this simply invokes all of the parsing pieces with a
620top-level dispatch loop. There isn't much interesting here, so I'll just
621include the top-level loop. See `below <#code>`_ for full code in the
622"Top-Level Parsing" section.
623
624.. code-block:: c++
625
626 /// top ::= definition | external | expression | ';'
627 static void MainLoop() {
628 while (1) {
629 fprintf(stderr, "ready> ");
630 switch (CurTok) {
631 case tok_eof: return;
632 case ';': getNextToken(); break; // ignore top-level semicolons.
633 case tok_def: HandleDefinition(); break;
634 case tok_extern: HandleExtern(); break;
635 default: HandleTopLevelExpression(); break;
636 }
637 }
638 }
639
640The most interesting part of this is that we ignore top-level
641semicolons. Why is this, you ask? The basic reason is that if you type
642"4 + 5" at the command line, the parser doesn't know whether that is the
643end of what you will type or not. For example, on the next line you
644could type "def foo..." in which case 4+5 is the end of a top-level
645expression. Alternatively you could type "\* 6", which would continue
646the expression. Having top-level semicolons allows you to type "4+5;",
647and the parser will know you are done.
648
649Conclusions
650===========
651
652With just under 400 lines of commented code (240 lines of non-comment,
653non-blank code), we fully defined our minimal language, including a
654lexer, parser, and AST builder. With this done, the executable will
655validate Kaleidoscope code and tell us if it is grammatically invalid.
656For example, here is a sample interaction:
657
658.. code-block:: bash
659
660 $ ./a.out
661 ready> def foo(x y) x+foo(y, 4.0);
662 Parsed a function definition.
663 ready> def foo(x y) x+y y;
664 Parsed a function definition.
665 Parsed a top-level expr
666 ready> def foo(x y) x+y );
667 Parsed a function definition.
668 Error: unknown token when expecting an expression
669 ready> extern sin(a);
670 ready> Parsed an extern
671 ready> ^D
672 $
673
674There is a lot of room for extension here. You can define new AST nodes,
675extend the language in many ways, etc. In the `next
676installment <LangImpl3.html>`_, we will describe how to generate LLVM
677Intermediate Representation (IR) from the AST.
678
679Full Code Listing
680=================
681
682Here is the complete code listing for this and the previous chapter.
683Note that it is fully self-contained: you don't need LLVM or any
684external libraries at all for this. (Besides the C and C++ standard
685libraries, of course.) To build this, just compile with:
686
687.. code-block:: bash
688
689 # Compile
690 clang++ -g -O3 toy.cpp
691 # Run
692 ./a.out
693
694Here is the code:
695
696.. code-block:: c++
697
698 #include <cstdio>
699 #include <cstdlib>
700 #include <string>
701 #include <map>
702 #include <vector>
703
704 //===----------------------------------------------------------------------===//
705 // Lexer
706 //===----------------------------------------------------------------------===//
707
708 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
709 // of these for known things.
710 enum Token {
711 tok_eof = -1,
712
713 // commands
714 tok_def = -2, tok_extern = -3,
715
716 // primary
717 tok_identifier = -4, tok_number = -5
718 };
719
720 static std::string IdentifierStr; // Filled in if tok_identifier
721 static double NumVal; // Filled in if tok_number
722
723 /// gettok - Return the next token from standard input.
724 static int gettok() {
725 static int LastChar = ' ';
726
727 // Skip any whitespace.
728 while (isspace(LastChar))
729 LastChar = getchar();
730
731 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
732 IdentifierStr = LastChar;
733 while (isalnum((LastChar = getchar())))
734 IdentifierStr += LastChar;
735
736 if (IdentifierStr == "def") return tok_def;
737 if (IdentifierStr == "extern") return tok_extern;
738 return tok_identifier;
739 }
740
741 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
742 std::string NumStr;
743 do {
744 NumStr += LastChar;
745 LastChar = getchar();
746 } while (isdigit(LastChar) || LastChar == '.');
747
748 NumVal = strtod(NumStr.c_str(), 0);
749 return tok_number;
750 }
751
752 if (LastChar == '#') {
753 // Comment until end of line.
754 do LastChar = getchar();
755 while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
756
757 if (LastChar != EOF)
758 return gettok();
759 }
760
761 // Check for end of file. Don't eat the EOF.
762 if (LastChar == EOF)
763 return tok_eof;
764
765 // Otherwise, just return the character as its ascii value.
766 int ThisChar = LastChar;
767 LastChar = getchar();
768 return ThisChar;
769 }
770
771 //===----------------------------------------------------------------------===//
772 // Abstract Syntax Tree (aka Parse Tree)
773 //===----------------------------------------------------------------------===//
774
775 /// ExprAST - Base class for all expression nodes.
776 class ExprAST {
777 public:
778 virtual ~ExprAST() {}
779 };
780
781 /// NumberExprAST - Expression class for numeric literals like "1.0".
782 class NumberExprAST : public ExprAST {
783 double Val;
784 public:
785 NumberExprAST(double val) : Val(val) {}
786 };
787
788 /// VariableExprAST - Expression class for referencing a variable, like "a".
789 class VariableExprAST : public ExprAST {
790 std::string Name;
791 public:
792 VariableExprAST(const std::string &name) : Name(name) {}
793 };
794
795 /// BinaryExprAST - Expression class for a binary operator.
796 class BinaryExprAST : public ExprAST {
797 char Op;
798 ExprAST *LHS, *RHS;
799 public:
800 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
801 : Op(op), LHS(lhs), RHS(rhs) {}
802 };
803
804 /// CallExprAST - Expression class for function calls.
805 class CallExprAST : public ExprAST {
806 std::string Callee;
807 std::vector<ExprAST*> Args;
808 public:
809 CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
810 : Callee(callee), Args(args) {}
811 };
812
813 /// PrototypeAST - This class represents the "prototype" for a function,
814 /// which captures its name, and its argument names (thus implicitly the number
815 /// of arguments the function takes).
816 class PrototypeAST {
817 std::string Name;
818 std::vector<std::string> Args;
819 public:
820 PrototypeAST(const std::string &name, const std::vector<std::string> &args)
821 : Name(name), Args(args) {}
822
823 };
824
825 /// FunctionAST - This class represents a function definition itself.
826 class FunctionAST {
827 PrototypeAST *Proto;
828 ExprAST *Body;
829 public:
830 FunctionAST(PrototypeAST *proto, ExprAST *body)
831 : Proto(proto), Body(body) {}
832
833 };
834
835 //===----------------------------------------------------------------------===//
836 // Parser
837 //===----------------------------------------------------------------------===//
838
839 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
840 /// token the parser is looking at. getNextToken reads another token from the
841 /// lexer and updates CurTok with its results.
842 static int CurTok;
843 static int getNextToken() {
844 return CurTok = gettok();
845 }
846
847 /// BinopPrecedence - This holds the precedence for each binary operator that is
848 /// defined.
849 static std::map<char, int> BinopPrecedence;
850
851 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
852 static int GetTokPrecedence() {
853 if (!isascii(CurTok))
854 return -1;
855
856 // Make sure it's a declared binop.
857 int TokPrec = BinopPrecedence[CurTok];
858 if (TokPrec <= 0) return -1;
859 return TokPrec;
860 }
861
862 /// Error* - These are little helper functions for error handling.
863 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
864 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
865 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
866
867 static ExprAST *ParseExpression();
868
869 /// identifierexpr
870 /// ::= identifier
871 /// ::= identifier '(' expression* ')'
872 static ExprAST *ParseIdentifierExpr() {
873 std::string IdName = IdentifierStr;
874
875 getNextToken(); // eat identifier.
876
877 if (CurTok != '(') // Simple variable ref.
878 return new VariableExprAST(IdName);
879
880 // Call.
881 getNextToken(); // eat (
882 std::vector<ExprAST*> Args;
883 if (CurTok != ')') {
884 while (1) {
885 ExprAST *Arg = ParseExpression();
886 if (!Arg) return 0;
887 Args.push_back(Arg);
888
889 if (CurTok == ')') break;
890
891 if (CurTok != ',')
892 return Error("Expected ')' or ',' in argument list");
893 getNextToken();
894 }
895 }
896
897 // Eat the ')'.
898 getNextToken();
899
900 return new CallExprAST(IdName, Args);
901 }
902
903 /// numberexpr ::= number
904 static ExprAST *ParseNumberExpr() {
905 ExprAST *Result = new NumberExprAST(NumVal);
906 getNextToken(); // consume the number
907 return Result;
908 }
909
910 /// parenexpr ::= '(' expression ')'
911 static ExprAST *ParseParenExpr() {
912 getNextToken(); // eat (.
913 ExprAST *V = ParseExpression();
914 if (!V) return 0;
915
916 if (CurTok != ')')
917 return Error("expected ')'");
918 getNextToken(); // eat ).
919 return V;
920 }
921
922 /// primary
923 /// ::= identifierexpr
924 /// ::= numberexpr
925 /// ::= parenexpr
926 static ExprAST *ParsePrimary() {
927 switch (CurTok) {
928 default: return Error("unknown token when expecting an expression");
929 case tok_identifier: return ParseIdentifierExpr();
930 case tok_number: return ParseNumberExpr();
931 case '(': return ParseParenExpr();
932 }
933 }
934
935 /// binoprhs
936 /// ::= ('+' primary)*
937 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
938 // If this is a binop, find its precedence.
939 while (1) {
940 int TokPrec = GetTokPrecedence();
941
942 // If this is a binop that binds at least as tightly as the current binop,
943 // consume it, otherwise we are done.
944 if (TokPrec < ExprPrec)
945 return LHS;
946
947 // Okay, we know this is a binop.
948 int BinOp = CurTok;
949 getNextToken(); // eat binop
950
951 // Parse the primary expression after the binary operator.
952 ExprAST *RHS = ParsePrimary();
953 if (!RHS) return 0;
954
955 // If BinOp binds less tightly with RHS than the operator after RHS, let
956 // the pending operator take RHS as its LHS.
957 int NextPrec = GetTokPrecedence();
958 if (TokPrec < NextPrec) {
959 RHS = ParseBinOpRHS(TokPrec+1, RHS);
960 if (RHS == 0) return 0;
961 }
962
963 // Merge LHS/RHS.
964 LHS = new BinaryExprAST(BinOp, LHS, RHS);
965 }
966 }
967
968 /// expression
969 /// ::= primary binoprhs
970 ///
971 static ExprAST *ParseExpression() {
972 ExprAST *LHS = ParsePrimary();
973 if (!LHS) return 0;
974
975 return ParseBinOpRHS(0, LHS);
976 }
977
978 /// prototype
979 /// ::= id '(' id* ')'
980 static PrototypeAST *ParsePrototype() {
981 if (CurTok != tok_identifier)
982 return ErrorP("Expected function name in prototype");
983
984 std::string FnName = IdentifierStr;
985 getNextToken();
986
987 if (CurTok != '(')
988 return ErrorP("Expected '(' in prototype");
989
990 std::vector<std::string> ArgNames;
991 while (getNextToken() == tok_identifier)
992 ArgNames.push_back(IdentifierStr);
993 if (CurTok != ')')
994 return ErrorP("Expected ')' in prototype");
995
996 // success.
997 getNextToken(); // eat ')'.
998
999 return new PrototypeAST(FnName, ArgNames);
1000 }
1001
1002 /// definition ::= 'def' prototype expression
1003 static FunctionAST *ParseDefinition() {
1004 getNextToken(); // eat def.
1005 PrototypeAST *Proto = ParsePrototype();
1006 if (Proto == 0) return 0;
1007
1008 if (ExprAST *E = ParseExpression())
1009 return new FunctionAST(Proto, E);
1010 return 0;
1011 }
1012
1013 /// toplevelexpr ::= expression
1014 static FunctionAST *ParseTopLevelExpr() {
1015 if (ExprAST *E = ParseExpression()) {
1016 // Make an anonymous proto.
1017 PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
1018 return new FunctionAST(Proto, E);
1019 }
1020 return 0;
1021 }
1022
1023 /// external ::= 'extern' prototype
1024 static PrototypeAST *ParseExtern() {
1025 getNextToken(); // eat extern.
1026 return ParsePrototype();
1027 }
1028
1029 //===----------------------------------------------------------------------===//
1030 // Top-Level parsing
1031 //===----------------------------------------------------------------------===//
1032
1033 static void HandleDefinition() {
1034 if (ParseDefinition()) {
1035 fprintf(stderr, "Parsed a function definition.\n");
1036 } else {
1037 // Skip token for error recovery.
1038 getNextToken();
1039 }
1040 }
1041
1042 static void HandleExtern() {
1043 if (ParseExtern()) {
1044 fprintf(stderr, "Parsed an extern\n");
1045 } else {
1046 // Skip token for error recovery.
1047 getNextToken();
1048 }
1049 }
1050
1051 static void HandleTopLevelExpression() {
1052 // Evaluate a top-level expression into an anonymous function.
1053 if (ParseTopLevelExpr()) {
1054 fprintf(stderr, "Parsed a top-level expr\n");
1055 } else {
1056 // Skip token for error recovery.
1057 getNextToken();
1058 }
1059 }
1060
1061 /// top ::= definition | external | expression | ';'
1062 static void MainLoop() {
1063 while (1) {
1064 fprintf(stderr, "ready> ");
1065 switch (CurTok) {
1066 case tok_eof: return;
1067 case ';': getNextToken(); break; // ignore top-level semicolons.
1068 case tok_def: HandleDefinition(); break;
1069 case tok_extern: HandleExtern(); break;
1070 default: HandleTopLevelExpression(); break;
1071 }
1072 }
1073 }
1074
1075 //===----------------------------------------------------------------------===//
1076 // Main driver code.
1077 //===----------------------------------------------------------------------===//
1078
1079 int main() {
1080 // Install standard binary operators.
1081 // 1 is lowest precedence.
1082 BinopPrecedence['<'] = 10;
1083 BinopPrecedence['+'] = 20;
1084 BinopPrecedence['-'] = 20;
1085 BinopPrecedence['*'] = 40; // highest.
1086
1087 // Prime the first token.
1088 fprintf(stderr, "ready> ");
1089 getNextToken();
1090
1091 // Run the main "interpreter loop" now.
1092 MainLoop();
1093
1094 return 0;
1095 }
1096
1097`Next: Implementing Code Generation to LLVM IR <LangImpl3.html>`_
1098