blob: c153aa2c973f48e7a8431f462415b5886fed5c27 [file] [log] [blame]
Chris Lattnere6c91042007-10-22 06:34:15 +00001<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
3
4<html>
5<head>
6 <title>Kaleidoscope: Implementing a Parser and AST</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
10</head>
11
12<body>
13
14<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
15
16<div class="doc_author">
17 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
18</div>
19
20<!-- *********************************************************************** -->
21<div class="doc_section"><a name="intro">Part 2 Introduction</a></div>
22<!-- *********************************************************************** -->
23
24<div class="doc_text">
25
26<p>Welcome to part 2 of the "<a href="index.html">Implementing a language with
27LLVM</a>" tutorial. This chapter shows you how to use the <a
28href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
29href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
30our Kaleidoscope language and build an <a
31href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
32Tree</a> (AST).</p>
33
34<p>The parser we will build uses a combination of <a
35href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
36Parsing</a> and <a href=
37"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
38Parsing</a> to parse the Kaleidoscope language (the later for binary expression
39and the former for everything else). Before we get to parsing though, lets talk
40about the output of the parser: the Abstract Syntax Tree.</p>
41
42</div>
43
44<!-- *********************************************************************** -->
45<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
46<!-- *********************************************************************** -->
47
48<div class="doc_text">
49
50<p>The AST for a program captures its behavior in a way that it is easy for
51later stages of the compiler (e.g. code generation) to interpret. We basically
52want one object for each construct in the language, and the AST should closely
53model the language. In Kaleidoscope, we have expressions, a prototype, and a
54function object. We'll start with expressions first:</p>
55
56<div class="doc_code">
57<pre>
58/// ExprAST - Base class for all expression nodes.
59class ExprAST {
60public:
61 virtual ~ExprAST() {}
62};
63
64/// NumberExprAST - Expression class for numeric literals like "1.0".
65class NumberExprAST : public ExprAST {
66 double Val;
67public:
68 NumberExprAST(double val) : Val(val) {}
69};
70</pre>
71</div>
72
73<p>The code above shows the definition of the base ExprAST class and one
74subclass which we use for numeric literals. The important thing about this is
75that the NumberExprAST class captures the numeric value of the literal in the
76class, so that later phases of the compiler can know what it is.</p>
77
78<p>Right now we only create the AST, so there are no useful accessor methods on
79them. It would be very easy to add a virtual method to pretty print the code,
80for example. Here are the other expression AST node definitions that we'll use
81in the basic form of the Kaleidoscope language.
82</p>
83
84<div class="doc_code">
85<pre>
86/// VariableExprAST - Expression class for referencing a variable, like "a".
87class VariableExprAST : public ExprAST {
88 std::string Name;
89public:
90 VariableExprAST(const std::string &amp;name) : Name(name) {}
91};
92
93/// BinaryExprAST - Expression class for a binary operator.
94class BinaryExprAST : public ExprAST {
95 char Op;
96 ExprAST *LHS, *RHS;
97public:
98 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
99 : Op(op), LHS(lhs), RHS(rhs) {}
100};
101
102/// CallExprAST - Expression class for function calls.
103class CallExprAST : public ExprAST {
104 std::string Callee;
105 std::vector&lt;ExprAST*&gt; Args;
106public:
107 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
108 : Callee(callee), Args(args) {}
109};
110</pre>
111</div>
112
113<p>This is all (intentially) rather straight-forward: variables capture the
114variable name, binary operators capture their opcode (e.g. '+'), and calls
115capture a function name and list of argument expressions. One thing that is
116nice about our AST is that it captures the language features without talking
117about the syntax of the language. Note that there is no discussion about
118precedence of binary operators, lexical structure etc.</p>
119
120<p>For our basic language, these are all of the expression nodes we'll define.
121because it doesn't have conditional control flow, it isn't turing complete:
122we'll fix that in a later installment. The two things we need next are a way
123to talk about the interface to a function, and a way to talk about functions
124themselves:</p>
125
126<div class="doc_code">
127<pre>
128/// PrototypeAST - This class represents the "prototype" for a function,
129/// which captures its argument names as well as if it is an operator.
130class PrototypeAST {
131 std::string Name;
132 std::vector&lt;std::string&gt; Args;
133public:
134 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
135 : Name(name), Args(args) {}
136};
137
138/// FunctionAST - This class represents a function definition itself.
139class FunctionAST {
140 PrototypeAST *Proto;
141 ExprAST *Body;
142public:
143 FunctionAST(PrototypeAST *proto, ExprAST *body)
144 : Proto(proto), Body(body) {}
145};
146</pre>
147</div>
148
149<p>In Kaleidoscope, functions are typed with just a count of their arguments.
150Since all values are double precision floating point, this fact doesn't need to
151be captured anywhere. In a more aggressive and realistic language, the
152"ExprAST" class would probably have a type field.</p>
153
154<p>With this scaffolding, we can now talk about parsing expressions and function
155bodies in Kaleidoscope.</p>
156
157</div>
158
159<!-- *********************************************************************** -->
160<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
161<!-- *********************************************************************** -->
162
163<div class="doc_text">
164
165<p>Now that we have an AST to build, we need to define the parser code to build
166it. The idea here is that we want to parse something like "x+y" (which is
167returned as three tokens by the lexer) into an AST that could be generated with
168calls like this:</p>
169
170<div class="doc_code">
171<pre>
172 ExprAST *X = new VariableExprAST("x");
173 ExprAST *Y = new VariableExprAST("y");
174 ExprAST *Result = new BinaryExprAST('+', X, Y);
175</pre>
176</div>
177
178<p>In order to do this, we'll start by defining some basic helper routines:</p>
179
180<div class="doc_code">
181<pre>
182/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
183/// token the parser it looking at. getNextToken reads another token from the
184/// lexer and updates CurTok with its results.
185static int CurTok;
186static int getNextToken() {
187 return CurTok = gettok();
188}
189</pre>
190</div>
191
192<p>
193This implements a simple token buffer around the lexer. This allows
194us to look one token ahead at what the lexer is returning. Every function in
195our lexer will assume that CurTok is the current token that needs to be
196parsed.</p>
197
198<p>Again, we define
199these with global variables: it would be better design to wrap the entire parser
200in a class and use instance variables for these.
201</p>
202
203<div class="doc_code">
204<pre>
205
206/// Error* - These are little helper functions for error handling.
207ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
208PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
209FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
210</pre>
211</div>
212
213<p>
214The <tt>Error</tt> routines are simple helper routines that our parser will use
215to handle errors. The error recovery in our parser will not be the best and
216are not particular user-friendly, but it will be enough for our tutorial. These
217routines make it easier to handle errors in routines that have various return
218types: they always return null.</p>
219
220<p>With these basic helper functions implemented, we can implement the first
221piece of our grammar: we'll start with numeric literals.</p>
222
223</div>
224
225<!-- *********************************************************************** -->
226<div class="doc_section"><a name="parserprimexprs">Basic Expression
227 Parsing</a></div>
228<!-- *********************************************************************** -->
229
230<div class="doc_text">
231
232<p>We start with numeric literals, because they are the simplest to process.
233For each production in our grammar, we'll define a function which parses that
234production. For numeric literals, we have:
235</p>
236
237<div class="doc_code">
238<pre>
239/// numberexpr ::= number
240static ExprAST *ParseNumberExpr() {
241 ExprAST *Result = new NumberExprAST(NumVal);
242 getNextToken(); // consume the number
243 return Result;
244}
245</pre>
246</div>
247
248<p>This routine is very simple: it expects to be called when the current token
249is a <tt>tok_number</tt> token. It takes the current number value, creates
250a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
251returns.</p>
252
253<p>There are some interesting aspects of this. The most important one is that
254this routine eats all of the tokens that correspond to the production, and
255returns the lexer buffer with the next token (which is not part of the grammar
256production) ready to go. This is a fairly standard way to go for recursive
257descent parsers. For a better example, the parenthesis operator is defined like
258this:</p>
259
260<div class="doc_code">
261<pre>
262/// parenexpr ::= '(' expression ')'
263static ExprAST *ParseParenExpr() {
264 getNextToken(); // eat (.
265 ExprAST *V = ParseExpression();
266 if (!V) return 0;
267
268 if (CurTok != ')')
269 return Error("expected ')'");
270 getNextToken(); // eat ).
271 return V;
272}
273</pre>
274</div>
275
276<p>This function illustrates a number of interesting things about the parser:
2771) it shows how we use the Error routines. When called, this function expects
278that the current token is a '(' token, but after parsing the subexpression, it
279is possible that there is not a ')' waiting. For example, if the user types in
280"(4 x" instead of "(4)", the parser should emit an error. Because errors can
281occur, the parser needs a way to indicate that they happened: in our parser, we
282return null on an error.</p>
283
284<p>Another interesting aspect of this function is that it uses recursion by
285calling <tt>ParseExpression</tt> (we will soon see that ParseExpression can call
286<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
287recursive grammars, and keeps each production very simple. Note that
288parenthesis do not cause construction of AST nodes themselves. While we could
289do this, the most important role of parens are to guide the parser and provide
290grouping. Once the parser constructs the AST, parens are not needed.</p>
291
292<p>The next simple production is for handling variable references and function
293calls:</p>
294
295<div class="doc_code">
296<pre>
297/// identifierexpr
298/// ::= identifer
299/// ::= identifer '(' expression* ')'
300static ExprAST *ParseIdentifierExpr() {
301 std::string IdName = IdentifierStr;
302
303 getNextToken(); // eat identifer.
304
305 if (CurTok != '(') // Simple variable ref.
306 return new VariableExprAST(IdName);
307
308 // Call.
309 getNextToken(); // eat (
310 std::vector&lt;ExprAST*&gt; Args;
311 while (1) {
312 ExprAST *Arg = ParseExpression();
313 if (!Arg) return 0;
314 Args.push_back(Arg);
315
316 if (CurTok == ')') break;
317
318 if (CurTok != ',')
319 return Error("Expected ')'");
320 getNextToken();
321 }
322
323 // Eat the ')'.
324 getNextToken();
325
326 return new CallExprAST(IdName, Args);
327}
328</pre>
329</div>
330
331<p>This routine follows the same style as the other routines (it expects to be
332called if the current token is a <tt>tok_identifier</tt> token). It also has
333recursion and error handling. One interesting aspect of this is that it uses
334<em>look-ahead</em> to determine if the current identifier is a stand alone
335variable reference or if it is a function call expression. It handles this by
336checking to see if the token after the identifier is a '(' token, and constructs
337either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
338</p>
339
340<p>Now that we have all of our simple expression parsing logic in place, we can
341define a helper function to wrap them up in a class. We call this class of
342expressions "primary" expressions, for reasons that will become more clear
343later. In order to parse a primary expression, we need to determine what sort
344of expression it is:</p>
345
346<div class="doc_code">
347<pre>
348/// primary
349/// ::= identifierexpr
350/// ::= numberexpr
351/// ::= parenexpr
352static ExprAST *ParsePrimary() {
353 switch (CurTok) {
354 default: return Error("unknown token when expecting an expression");
355 case tok_identifier: return ParseIdentifierExpr();
356 case tok_number: return ParseNumberExpr();
357 case '(': return ParseParenExpr();
358 }
359}
360</pre>
361</div>
362
363<p>Now that you see the definition of this function, it makes it more obvious
364why we can assume the state of CurTok in the various functions. This uses
365look-ahead to determine which sort of expression is being inspected, and parses
366it with a function call.</p>
367
368<p>Now that basic expressions are handled, we need to handle binary expressions,
369which are a bit more complex.</p>
370
371</div>
372
373<!-- *********************************************************************** -->
374<div class="doc_section"><a name="parserbinops">Binary Expression
375 Parsing</a></div>
376<!-- *********************************************************************** -->
377
378<div class="doc_text">
379
380<p>Binary expressions are significantly harder to parse because they are often
381ambiguous. For example, when given the string "x+y*z", the parser can choose
382to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
383mathematics, we expect the later parse, because "*" (multiplication) has
384higher <em>precedence</em> than "+" (addition).</p>
385
386<p>There are many ways to handle this, but an elegant and efficient way is to
387use <a href=
388"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
389Parsing</a>. This parsing technique uses the precedence of binary operators to
390guide recursion. To start with, we need a table of precedences:</p>
391
392<div class="doc_code">
393<pre>
394/// BinopPrecedence - This holds the precedence for each binary operator that is
395/// defined.
396static std::map&lt;char, int&gt; BinopPrecedence;
397
398/// GetTokPrecedence - Get the precedence of the pending binary operator token.
399static int GetTokPrecedence() {
400 if (!isascii(CurTok))
401 return -1;
402
403 // Make sure it's a declared binop.
404 int TokPrec = BinopPrecedence[CurTok];
405 if (TokPrec &lt;= 0) return -1;
406 return TokPrec;
407}
408
409int main() {
410 // Install standard binary operators.
411 // 1 is lowest precedence.
412 BinopPrecedence['&lt;'] = 10;
413 BinopPrecedence['+'] = 20;
414 BinopPrecedence['-'] = 20;
415 BinopPrecedence['*'] = 40; // highest.
416 ...
417}
418</pre>
419</div>
420
421<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
422(this can obviously be extended by you, the reader). The
423<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
424or -1 if the token is not a binary operator. Having a map makes it easy to add
425new operators and makes it clear that the algorithm doesn't depend on the
426specific operators involved, but it would be easy enough to eliminate the map
427and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
428
429<p>With the helper above defined, we can now start parsing binary expressions.
430The basic idea of operator precedence parsing is to break down an expression
431with potentially ambiguous binary operators into pieces. Consider for example
432the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
433as a stream of primary expressions separated by binary operators. As such,
434it will first parse the leading primary expression "a", then it will see the
435pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
436are primary expressions that the binary expression parser doesn't need to worry
437about nested subexpressions like (c+d) at all.
438</p>
439
440<p>
441To start, an expression is a primary expression potentially followed by a
442sequence of [binop,primaryexpr] pairs:</p>
443
444<div class="doc_code">
445<pre>
446/// expression
447/// ::= primary binoprhs
448///
449static ExprAST *ParseExpression() {
450 ExprAST *LHS = ParsePrimary();
451 if (!LHS) return 0;
452
453 return ParseBinOpRHS(0, LHS);
454}
455</pre>
456</div>
457
458<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
459us. It takes a precedence and a pointer to an expression for the part parsed
460so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
461allowed to be empty, in which case it returns the expression that is passed into
462it. In our example above, the code passes the expression for "a" into
463<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
464
465<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
466minimal operator precedence</em> that the function is allowed to eat. For
467example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
468passed in a precedence of 40, it will not consume any tokens (because the
469precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
470with:</p>
471
472<div class="doc_code">
473<pre>
474/// binoprhs
475/// ::= ('+' primary)*
476static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
477 // If this is a binop, find its precedence.
478 while (1) {
479 int TokPrec = GetTokPrecedence();
480
481 // If this is a binop that binds at least as tightly as the current binop,
482 // consume it, otherwise we are done.
483 if (TokPrec &lt; ExprPrec)
484 return LHS;
485</pre>
486</div>
487
488<p>This code gets the precedence of the current token and checks to see if if is
489too low. Because we defined invalid tokens to have a precedence of -1, this
490check implicitly knows that the pair-stream ends when the token stream runs out
491of binary operators. If this check succeeds, we know that the token is a binary
492operator and that it will be included in this expression:</p>
493
494<div class="doc_code">
495<pre>
496 // Okay, we know this is a binop.
497 int BinOp = CurTok;
498 getNextToken(); // eat binop
499
500 // Parse the primary expression after the binary operator.
501 ExprAST *RHS = ParsePrimary();
502 if (!RHS) return 0;
503</pre>
504</div>
505
506<p>As such, this code eats (and remembers) the binary operator and then parses
507the following primary expression. This builds up the whole pair, the first of
508which is [+, b] for the running example.</p>
509
510<p>Now that we parsed the left-hand side of an expression and one pair of the
511RHS sequence, we have to decide which way the expression associates. In
512particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
513To determine this, we look ahead at "binop" to determine its precedence and
514compare it to BinOp's precedence (which is '+' in this case):</p>
515
516<div class="doc_code">
517<pre>
518 // If BinOp binds less tightly with RHS than the operator after RHS, let
519 // the pending operator take RHS as its LHS.
520 int NextPrec = GetTokPrecedence();
521 if (TokPrec &lt; NextPrec) {
522</pre>
523</div>
524
525<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
526precedence of our current operator, then we know that the parentheses associate
527as "(a+b) binop ...". In our example, since the next operator is "+" and so is
528our current one, we know that they have the same precedence. In this case we'll
529create the AST node for "a+b", and then continue parsing:</p>
530
531<div class="doc_code">
532<pre>
533 ... if body omitted ...
534 }
535
536 // Merge LHS/RHS.
537 LHS = new BinaryExprAST(BinOp, LHS, RHS);
538 } // loop around to the top of the while loop.
539}
540</pre>
541</div>
542
543<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
544iteration of the loop, with "+" as the current token. The code above will eat
545and remember it and parse "(c+d)" as the primary expression, which makes the
546current pair be [+, (c+d)]. It will then enter the 'if' above with "*" as the
547binop to the right of the primary. In this case, the precedence of "*" is
548higher than the precedence of "+" so the if condition will be entered.</p>
549
550<p>The critical question left here is "how can the if condition parse the right
551hand side in full"? In particular, to build the AST correctly for our example,
552it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
553do this is surprisingly simple (code from the above two blocks duplicated for
554context):</p>
555
556<div class="doc_code">
557<pre>
558 // If BinOp binds less tightly with RHS than the operator after RHS, let
559 // the pending operator take RHS as its LHS.
560 int NextPrec = GetTokPrecedence();
561 if (TokPrec &lt; NextPrec) {
562 RHS = ParseBinOpRHS(TokPrec+1, RHS);
563 if (RHS == 0) return 0;
564 }
565 // Merge LHS/RHS.
566 LHS = new BinaryExprAST(BinOp, LHS, RHS);
567 } // loop around to the top of the while loop.
568}
569</pre>
570</div>
571
572<p>At this point, we know that the binary operator to the RHS of our primary
573has higher precedence than the binop we are currently parsing. As such, we know
574that any sequence of pairs whose operators are all higher precedence than "+"
575should be parsed together and returned as "RHS". To do this, we recursively
576invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
577precedence required for it to continue. In our example above, this will cause
578it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
579of the '+' expression.</p>
580
581<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
582and added to the AST. With this little bit of code (14 non-trivial lines), we
583correctly handle fully general binary expression parsing in a very elegant way.
584</p>
585
586<p>This wraps up handling of expressions. At this point, we can point the
587parser at an arbitrary token stream and build an expression from them, stopping
588at the first token that is not part of the expression. Next up we need to
589handle function definitions etc.</p>
590
591</div>
592
593<!-- *********************************************************************** -->
594<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
595<!-- *********************************************************************** -->
596
597<div class="doc_text">
598
599<p>
600The first basic thing missing is that of function prototypes. In Kaleidoscope,
601these are used both for 'extern' function declarations as well as function body
602definitions. The code to do this is straight-forward and not very interesting
603(once you've survived expressions):
604</p>
605
606<div class="doc_code">
607<pre>
608/// prototype
609/// ::= id '(' id* ')'
610static PrototypeAST *ParsePrototype() {
611 if (CurTok != tok_identifier)
612 return ErrorP("Expected function name in prototype");
613
614 std::string FnName = IdentifierStr;
615 getNextToken();
616
617 if (CurTok != '(')
618 return ErrorP("Expected '(' in prototype");
619
620 std::vector&lt;std::string&gt; ArgNames;
621 while (getNextToken() == tok_identifier)
622 ArgNames.push_back(IdentifierStr);
623 if (CurTok != ')')
624 return ErrorP("Expected ')' in prototype");
625
626 // success.
627 getNextToken(); // eat ')'.
628
629 return new PrototypeAST(FnName, ArgNames);
630}
631</pre>
632</div>
633
634<p>Given this, a function definition is very simple, just a prototype plus
635and expression to implement the body:</p>
636
637<div class="doc_code">
638<pre>
639/// definition ::= 'def' prototype expression
640static FunctionAST *ParseDefinition() {
641 getNextToken(); // eat def.
642 PrototypeAST *Proto = ParsePrototype();
643 if (Proto == 0) return 0;
644
645 if (ExprAST *E = ParseExpression())
646 return new FunctionAST(Proto, E);
647 return 0;
648}
649</pre>
650</div>
651
652<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
653well as to support forward declaration of user functions. 'externs' are just
654prototypes with no body:</p>
655
656<div class="doc_code">
657<pre>
658/// external ::= 'extern' prototype
659static PrototypeAST *ParseExtern() {
660 getNextToken(); // eat extern.
661 return ParsePrototype();
662}
663</pre>
664</div>
665
666<p>Finally, we'll also let the user type in arbitrary top-level expressions and
667evaluate them on the fly. We will handle this by defining anonymous nullary
668(zero argument) functions for them:</p>
669
670<div class="doc_code">
671<pre>
672/// toplevelexpr ::= expression
673static FunctionAST *ParseTopLevelExpr() {
674 if (ExprAST *E = ParseExpression()) {
675 // Make an anonymous proto.
676 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
677 return new FunctionAST(Proto, E);
678 }
679 return 0;
680}
681</pre>
682</div>
683
684<p>Now that we have all the pieces, lets build a little driver that will let us
685actually <em>execute</em> this code we've built!</p>
686
687</div>
688
689<!-- *********************************************************************** -->
690<div class="doc_section"><a name="driver">The Driver</a></div>
691<!-- *********************************************************************** -->
692
693<div class="doc_text">
694
695<p>The driver for this simply invokes all of the parsing pieces with a top-level
696dispatch loop. There isn't much interesting here, so I'll just include the
697top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
698Parsing" section.</p>
699
700<div class="doc_code">
701<pre>
702/// top ::= definition | external | expression | ';'
703static void MainLoop() {
704 while (1) {
705 fprintf(stderr, "ready&gt; ");
706 switch (CurTok) {
707 case tok_eof: return;
708 case ';': getNextToken(); break; // ignore top level semicolons.
709 case tok_def: HandleDefinition(); break;
710 case tok_extern: HandleExtern(); break;
711 default: HandleTopLevelExpression(); break;
712 }
713 }
714}
715</pre>
716</div>
717
718<p>The most interesting part of this is that we ignore top-level semi colons.
719Why is this do you ask? The basic reason is that if you type "4 + 5" at the
720command line, the parser doesn't know that that is the end of what you will
721type. For example, on the next line you could type "def foo..." in which case
7224+5 is the end of a top-level expression. Alternatively you could type "* 6",
723which would continue the expression. Having top-level semicolons allows you to
724type "4+5;" and the parser will know you are done.</p>
725
726</div>
727
728<!-- *********************************************************************** -->
729<div class="doc_section"><a name="code">Conclusions and the Full Code</a></div>
730<!-- *********************************************************************** -->
731
732<div class="doc_text">
733
734<p>With just under 400 lines of commented code, we fully defined our minimal
735language, including a lexer, parser and AST builder. With this done, the
736executable will validate code and tell us if it is gramatically invalid. For
737example, here is a sample interaction:</p>
738
739<div class="doc_code">
740<pre>
741$ ./a.out
742ready> def foo(x y) x+foo(y, 4.0);
743ready> Parsed an function definition.
744ready> def foo(x y) x+y y;
745ready> Parsed an function definition.
746ready> Parsed a top-level expr
747ready> def foo(x y) x+y );
748ready> Parsed an function definition.
749ready> Error: unknown token when expecting an expression
750ready> extern sin(a);
751ready> Parsed an extern
752ready> ^D
753$
754</pre>
755</div>
756
757<p>
758Here is the full code. Note that it is fully self-contained: you don't even
759need LLVM for this. In the <a href="LangImpl3.html">next installment</a>, we
760will describe how to generate LLVM IR from the AST.</p>
761
762<div class="doc_code">
763<pre>
764// To build this:
765// g++ -g toy.cpp
766// ./a.out
767
768#include &lt;cstdio&gt;
769#include &lt;string&gt;
770#include &lt;
771#include &lt;vector&gt;
772
773//===----------------------------------------------------------------------===//
774// Lexer
775//===----------------------------------------------------------------------===//
776
777// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
778// of these for known things.
779enum Token {
780 tok_eof = -1,
781
782 // commands
783 tok_def = -2, tok_extern = -3,
784
785 // primary
786 tok_identifier = -4, tok_number = -5,
787};
788
789static std::string IdentifierStr; // Filled in if tok_identifier
790static double NumVal; // Filled in if tok_number
791
792/// gettok - Return the next token from standard input.
793static int gettok() {
794 static int LastChar = ' ';
795
796 // Skip any whitespace.
797 while (isspace(LastChar))
798 LastChar = getchar();
799
800 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
801 IdentifierStr = LastChar;
802 while (isalnum((LastChar = getchar())))
803 IdentifierStr += LastChar;
804
805 if (IdentifierStr == "def") return tok_def;
806 if (IdentifierStr == "extern") return tok_extern;
807 return tok_identifier;
808 }
809
810 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
811 std::string NumStr;
812 do {
813 NumStr += LastChar;
814 LastChar = getchar();
815 } while (isdigit(LastChar) || LastChar == '.');
816
817 NumVal = strtod(NumStr.c_str(), 0);
818 return tok_number;
819 }
820
821 if (LastChar == '#') {
822 // Comment until end of line.
823 do LastChar = getchar();
824 while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
825
826 if (LastChar != EOF)
827 return gettok();
828 }
829
830 // Check for end of file. Don't eat the EOF.
831 if (LastChar == EOF)
832 return tok_eof;
833
834 // Otherwise, just return the character as its ascii value.
835 int ThisChar = LastChar;
836 LastChar = getchar();
837 return ThisChar;
838}
839
840//===----------------------------------------------------------------------===//
841// Abstract Syntax Tree (aka Parse Tree)
842//===----------------------------------------------------------------------===//
843
844/// ExprAST - Base class for all expression nodes.
845class ExprAST {
846public:
847 virtual ~ExprAST() {}
848};
849
850/// NumberExprAST - Expression class for numeric literals like "1.0".
851class NumberExprAST : public ExprAST {
852 double Val;
853public:
854 NumberExprAST(double val) : Val(val) {}
855};
856
857/// VariableExprAST - Expression class for referencing a variable, like "a".
858class VariableExprAST : public ExprAST {
859 std::string Name;
860public:
861 VariableExprAST(const std::string &amp;name) : Name(name) {}
862};
863
864/// BinaryExprAST - Expression class for a binary operator.
865class BinaryExprAST : public ExprAST {
866 char Op;
867 ExprAST *LHS, *RHS;
868public:
869 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
870 : Op(op), LHS(lhs), RHS(rhs) {}
871};
872
873/// CallExprAST - Expression class for function calls.
874class CallExprAST : public ExprAST {
875 std::string Callee;
876 std::vector&lt;ExprAST*&gt; Args;
877public:
878 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
879 : Callee(callee), Args(args) {}
880};
881
882/// PrototypeAST - This class represents the "prototype" for a function,
883/// which captures its argument names as well as if it is an operator.
884class PrototypeAST {
885 std::string Name;
886 std::vector&lt; Args;
887public:
888 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
889 : Name(name), Args(args) {}
890
891};
892
893/// FunctionAST - This class represents a function definition itself.
894class FunctionAST {
895 PrototypeAST *Proto;
896 ExprAST *Body;
897public:
898 FunctionAST(PrototypeAST *proto, ExprAST *body)
899 : Proto(proto), Body(body) {}
900
901};
902
903//===----------------------------------------------------------------------===//
904// Parser
905//===----------------------------------------------------------------------===//
906
907/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
908/// token the parser it looking at. getNextToken reads another token from the
909/// lexer and updates CurTok with its results.
910static int CurTok;
911static int getNextToken() {
912 return CurTok = gettok();
913}
914
915/// BinopPrecedence - This holds the precedence for each binary operator that is
916/// defined.
917static std::map&lt;char, int&gt; BinopPrecedence;
918
919/// GetTokPrecedence - Get the precedence of the pending binary operator token.
920static int GetTokPrecedence() {
921 if (!isascii(CurTok))
922 return -1;
923
924 // Make sure it's a declared binop.
925 int TokPrec = BinopPrecedence[CurTok];
926 if (TokPrec &lt;= 0) return -1;
927 return TokPrec;
928}
929
930/// Error* - These are little helper functions for error handling.
931ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
932PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
933FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
934
935static ExprAST *ParseExpression();
936
937/// identifierexpr
938/// ::= identifer
939/// ::= identifer '(' expression* ')'
940static ExprAST *ParseIdentifierExpr() {
941 std::string IdName = IdentifierStr;
942
943 getNextToken(); // eat identifer.
944
945 if (CurTok != '(') // Simple variable ref.
946 return new VariableExprAST(IdName);
947
948 // Call.
949 getNextToken(); // eat (
950 std::vector&lt;ExprAST*&gt; Args;
951 while (1) {
952 ExprAST *Arg = ParseExpression();
953 if (!Arg) return 0;
954 Args.push_back(Arg);
955
956 if (CurTok == ')') break;
957
958 if (CurTok != ',')
959 return Error("Expected ')'");
960 getNextToken();
961 }
962
963 // Eat the ')'.
964 getNextToken();
965
966 return new CallExprAST(IdName, Args);
967}
968
969/// numberexpr ::= number
970static ExprAST *ParseNumberExpr() {
971 ExprAST *Result = new NumberExprAST(NumVal);
972 getNextToken(); // consume the number
973 return Result;
974}
975
976/// parenexpr ::= '(' expression ')'
977static ExprAST *ParseParenExpr() {
978 getNextToken(); // eat (.
979 ExprAST *V = ParseExpression();
980 if (!V) return 0;
981
982 if (CurTok != ')')
983 return Error("expected ')'");
984 getNextToken(); // eat ).
985 return V;
986}
987
988/// primary
989/// ::= identifierexpr
990/// ::= numberexpr
991/// ::= parenexpr
992static ExprAST *ParsePrimary() {
993 switch (CurTok) {
994 default: return Error("unknown token when expecting an expression");
995 case tok_identifier: return ParseIdentifierExpr();
996 case tok_number: return ParseNumberExpr();
997 case '(': return ParseParenExpr();
998 }
999}
1000
1001/// binoprhs
1002/// ::= ('+' primary)*
1003static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1004 // If this is a binop, find its precedence.
1005 while (1) {
1006 int TokPrec = GetTokPrecedence();
1007
1008 // If this is a binop that binds at least as tightly as the current binop,
1009 // consume it, otherwise we are done.
1010 if (TokPrec &lt; ExprPrec)
1011 return LHS;
1012
1013 // Okay, we know this is a binop.
1014 int BinOp = CurTok;
1015 getNextToken(); // eat binop
1016
1017 // Parse the primary expression after the binary operator.
1018 ExprAST *RHS = ParsePrimary();
1019 if (!RHS) return 0;
1020
1021 // If BinOp binds less tightly with RHS than the operator after RHS, let
1022 // the pending operator take RHS as its LHS.
1023 int NextPrec = GetTokPrecedence();
1024 if (TokPrec &lt; NextPrec) {
1025 RHS = ParseBinOpRHS(TokPrec+1, RHS);
1026 if (RHS == 0) return 0;
1027 }
1028
1029 // Merge LHS/RHS.
1030 LHS = new BinaryExprAST(BinOp, LHS, RHS);
1031 }
1032}
1033
1034/// expression
1035/// ::= primary binoprhs
1036///
1037static ExprAST *ParseExpression() {
1038 ExprAST *LHS = ParsePrimary();
1039 if (!LHS) return 0;
1040
1041 return ParseBinOpRHS(0, LHS);
1042}
1043
1044/// prototype
1045/// ::= id '(' id* ')'
1046static PrototypeAST *ParsePrototype() {
1047 if (CurTok != tok_identifier)
1048 return ErrorP("Expected function name in prototype");
1049
1050 std::string FnName = IdentifierStr;
1051 getNextToken();
1052
1053 if (CurTok != '(')
1054 return ErrorP("Expected '(' in prototype");
1055
1056 std::vector&lt;std::string&gt; ArgNames;
1057 while (getNextToken() == tok_identifier)
1058 ArgNames.push_back(IdentifierStr);
1059 if (CurTok != ')')
1060 return ErrorP("Expected ')' in prototype");
1061
1062 // success.
1063 getNextToken(); // eat ')'.
1064
1065 return new PrototypeAST(FnName, ArgNames);
1066}
1067
1068/// definition ::= 'def' prototype expression
1069static FunctionAST *ParseDefinition() {
1070 getNextToken(); // eat def.
1071 PrototypeAST *Proto = ParsePrototype();
1072 if (Proto == 0) return 0;
1073
1074 if (ExprAST *E = ParseExpression())
1075 return new FunctionAST(Proto, E);
1076 return 0;
1077}
1078
1079/// toplevelexpr ::= expression
1080static FunctionAST *ParseTopLevelExpr() {
1081 if (ExprAST *E = ParseExpression()) {
1082 // Make an anonymous proto.
1083 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;());
1084 return new FunctionAST(Proto, E);
1085 }
1086 return 0;
1087}
1088
1089/// external ::= 'extern' prototype
1090static PrototypeAST *ParseExtern() {
1091 getNextToken(); // eat extern.
1092 return ParsePrototype();
1093}
1094
1095//===----------------------------------------------------------------------===//
1096// Top-Level parsing
1097//===----------------------------------------------------------------------===//
1098
1099static void HandleDefinition() {
1100 if (FunctionAST *F = ParseDefinition()) {
1101 fprintf(stderr, "Parsed a function definition.\n");
1102 } else {
1103 // Skip token for error recovery.
1104 getNextToken();
1105 }
1106}
1107
1108static void HandleExtern() {
1109 if (PrototypeAST *P = ParseExtern()) {
1110 fprintf(stderr, "Parsed an extern\n");
1111 } else {
1112 // Skip token for error recovery.
1113 getNextToken();
1114 }
1115}
1116
1117static void HandleTopLevelExpression() {
1118 // Evaluate a top level expression into an anonymous function.
1119 if (FunctionAST *F = ParseTopLevelExpr()) {
1120 fprintf(stderr, "Parsed a top-level expr\n");
1121 } else {
1122 // Skip token for error recovery.
1123 getNextToken();
1124 }
1125}
1126
1127/// top ::= definition | external | expression | ';'
1128static void MainLoop() {
1129 while (1) {
1130 fprintf(stderr, "ready&gt; ");
1131 switch (CurTok) {
1132 case tok_eof: return;
1133 case ';': getNextToken(); break; // ignore top level semicolons.
1134 case tok_def: HandleDefinition(); break;
1135 case tok_extern: HandleExtern(); break;
1136 default: HandleTopLevelExpression(); break;
1137 }
1138 }
1139}
1140
1141//===----------------------------------------------------------------------===//
1142// Main driver code.
1143//===----------------------------------------------------------------------===//
1144
1145int main() {
1146 // Install standard binary operators.
1147 // 1 is lowest precedence.
1148 BinopPrecedence['&lt;'] = 10;
1149 BinopPrecedence['+'] = 20;
1150 BinopPrecedence['-'] = 20;
1151 BinopPrecedence['*'] = 40; // highest.
1152
1153 // Prime the first token.
1154 fprintf(stderr, "ready&gt; ");
1155 getNextToken();
1156
1157 MainLoop();
1158 return 0;
1159}
1160</pre>
1161</div>
1162</div>
1163
1164<!-- *********************************************************************** -->
1165<hr>
1166<address>
1167 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1168 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1169 <a href="http://validator.w3.org/check/referer"><img
1170 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!" /></a>
1171
1172 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1173 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1174 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1175</address>
1176</body>
1177</html>