blob: 653975cd8be684b8a0bc2d73b83013f52a4ac152 [file] [log] [blame]
Chris Lattnere6c91042007-10-22 06:34:15 +00001<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
3
4<html>
5<head>
6 <title>Kaleidoscope: Implementing a Parser and AST</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
10</head>
11
12<body>
13
14<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
15
16<div class="doc_author">
17 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
18</div>
19
20<!-- *********************************************************************** -->
21<div class="doc_section"><a name="intro">Part 2 Introduction</a></div>
22<!-- *********************************************************************** -->
23
24<div class="doc_text">
25
26<p>Welcome to part 2 of the "<a href="index.html">Implementing a language with
27LLVM</a>" tutorial. This chapter shows you how to use the <a
28href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
29href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
30our Kaleidoscope language and build an <a
31href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
32Tree</a> (AST).</p>
33
34<p>The parser we will build uses a combination of <a
35href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
36Parsing</a> and <a href=
37"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
38Parsing</a> to parse the Kaleidoscope language (the later for binary expression
39and the former for everything else). Before we get to parsing though, lets talk
40about the output of the parser: the Abstract Syntax Tree.</p>
41
42</div>
43
44<!-- *********************************************************************** -->
45<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
46<!-- *********************************************************************** -->
47
48<div class="doc_text">
49
50<p>The AST for a program captures its behavior in a way that it is easy for
51later stages of the compiler (e.g. code generation) to interpret. We basically
52want one object for each construct in the language, and the AST should closely
53model the language. In Kaleidoscope, we have expressions, a prototype, and a
54function object. We'll start with expressions first:</p>
55
56<div class="doc_code">
57<pre>
58/// ExprAST - Base class for all expression nodes.
59class ExprAST {
60public:
61 virtual ~ExprAST() {}
62};
63
64/// NumberExprAST - Expression class for numeric literals like "1.0".
65class NumberExprAST : public ExprAST {
66 double Val;
67public:
68 NumberExprAST(double val) : Val(val) {}
69};
70</pre>
71</div>
72
73<p>The code above shows the definition of the base ExprAST class and one
74subclass which we use for numeric literals. The important thing about this is
75that the NumberExprAST class captures the numeric value of the literal in the
76class, so that later phases of the compiler can know what it is.</p>
77
78<p>Right now we only create the AST, so there are no useful accessor methods on
79them. It would be very easy to add a virtual method to pretty print the code,
80for example. Here are the other expression AST node definitions that we'll use
81in the basic form of the Kaleidoscope language.
82</p>
83
84<div class="doc_code">
85<pre>
86/// VariableExprAST - Expression class for referencing a variable, like "a".
87class VariableExprAST : public ExprAST {
88 std::string Name;
89public:
90 VariableExprAST(const std::string &amp;name) : Name(name) {}
91};
92
93/// BinaryExprAST - Expression class for a binary operator.
94class BinaryExprAST : public ExprAST {
95 char Op;
96 ExprAST *LHS, *RHS;
97public:
98 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
99 : Op(op), LHS(lhs), RHS(rhs) {}
100};
101
102/// CallExprAST - Expression class for function calls.
103class CallExprAST : public ExprAST {
104 std::string Callee;
105 std::vector&lt;ExprAST*&gt; Args;
106public:
107 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
108 : Callee(callee), Args(args) {}
109};
110</pre>
111</div>
112
113<p>This is all (intentially) rather straight-forward: variables capture the
114variable name, binary operators capture their opcode (e.g. '+'), and calls
115capture a function name and list of argument expressions. One thing that is
116nice about our AST is that it captures the language features without talking
117about the syntax of the language. Note that there is no discussion about
118precedence of binary operators, lexical structure etc.</p>
119
120<p>For our basic language, these are all of the expression nodes we'll define.
Owen Andersonc6311b92007-10-22 06:48:28 +0000121Because it doesn't have conditional control flow, it isn't Turing-complete;
Chris Lattnere6c91042007-10-22 06:34:15 +0000122we'll fix that in a later installment. The two things we need next are a way
123to talk about the interface to a function, and a way to talk about functions
124themselves:</p>
125
126<div class="doc_code">
127<pre>
128/// PrototypeAST - This class represents the "prototype" for a function,
129/// which captures its argument names as well as if it is an operator.
130class PrototypeAST {
131 std::string Name;
132 std::vector&lt;std::string&gt; Args;
133public:
134 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
135 : Name(name), Args(args) {}
136};
137
138/// FunctionAST - This class represents a function definition itself.
139class FunctionAST {
140 PrototypeAST *Proto;
141 ExprAST *Body;
142public:
143 FunctionAST(PrototypeAST *proto, ExprAST *body)
144 : Proto(proto), Body(body) {}
145};
146</pre>
147</div>
148
149<p>In Kaleidoscope, functions are typed with just a count of their arguments.
150Since all values are double precision floating point, this fact doesn't need to
151be captured anywhere. In a more aggressive and realistic language, the
152"ExprAST" class would probably have a type field.</p>
153
154<p>With this scaffolding, we can now talk about parsing expressions and function
155bodies in Kaleidoscope.</p>
156
157</div>
158
159<!-- *********************************************************************** -->
160<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
161<!-- *********************************************************************** -->
162
163<div class="doc_text">
164
165<p>Now that we have an AST to build, we need to define the parser code to build
166it. The idea here is that we want to parse something like "x+y" (which is
167returned as three tokens by the lexer) into an AST that could be generated with
168calls like this:</p>
169
170<div class="doc_code">
171<pre>
172 ExprAST *X = new VariableExprAST("x");
173 ExprAST *Y = new VariableExprAST("y");
174 ExprAST *Result = new BinaryExprAST('+', X, Y);
175</pre>
176</div>
177
178<p>In order to do this, we'll start by defining some basic helper routines:</p>
179
180<div class="doc_code">
181<pre>
182/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
183/// token the parser it looking at. getNextToken reads another token from the
184/// lexer and updates CurTok with its results.
185static int CurTok;
186static int getNextToken() {
187 return CurTok = gettok();
188}
189</pre>
190</div>
191
192<p>
193This implements a simple token buffer around the lexer. This allows
194us to look one token ahead at what the lexer is returning. Every function in
195our lexer will assume that CurTok is the current token that needs to be
196parsed.</p>
197
Owen Andersonc6311b92007-10-22 06:48:28 +0000198<p>Again, we define these with global variables; it would be better design to wrap the entire parser in a class and use instance variables for these.
Chris Lattnere6c91042007-10-22 06:34:15 +0000199</p>
200
201<div class="doc_code">
202<pre>
203
204/// Error* - These are little helper functions for error handling.
205ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
206PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
207FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
208</pre>
209</div>
210
211<p>
212The <tt>Error</tt> routines are simple helper routines that our parser will use
213to handle errors. The error recovery in our parser will not be the best and
214are not particular user-friendly, but it will be enough for our tutorial. These
215routines make it easier to handle errors in routines that have various return
216types: they always return null.</p>
217
218<p>With these basic helper functions implemented, we can implement the first
219piece of our grammar: we'll start with numeric literals.</p>
220
221</div>
222
223<!-- *********************************************************************** -->
224<div class="doc_section"><a name="parserprimexprs">Basic Expression
225 Parsing</a></div>
226<!-- *********************************************************************** -->
227
228<div class="doc_text">
229
230<p>We start with numeric literals, because they are the simplest to process.
231For each production in our grammar, we'll define a function which parses that
232production. For numeric literals, we have:
233</p>
234
235<div class="doc_code">
236<pre>
237/// numberexpr ::= number
238static ExprAST *ParseNumberExpr() {
239 ExprAST *Result = new NumberExprAST(NumVal);
240 getNextToken(); // consume the number
241 return Result;
242}
243</pre>
244</div>
245
246<p>This routine is very simple: it expects to be called when the current token
247is a <tt>tok_number</tt> token. It takes the current number value, creates
248a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
249returns.</p>
250
251<p>There are some interesting aspects of this. The most important one is that
252this routine eats all of the tokens that correspond to the production, and
253returns the lexer buffer with the next token (which is not part of the grammar
254production) ready to go. This is a fairly standard way to go for recursive
255descent parsers. For a better example, the parenthesis operator is defined like
256this:</p>
257
258<div class="doc_code">
259<pre>
260/// parenexpr ::= '(' expression ')'
261static ExprAST *ParseParenExpr() {
262 getNextToken(); // eat (.
263 ExprAST *V = ParseExpression();
264 if (!V) return 0;
265
266 if (CurTok != ')')
267 return Error("expected ')'");
268 getNextToken(); // eat ).
269 return V;
270}
271</pre>
272</div>
273
274<p>This function illustrates a number of interesting things about the parser:
2751) it shows how we use the Error routines. When called, this function expects
276that the current token is a '(' token, but after parsing the subexpression, it
277is possible that there is not a ')' waiting. For example, if the user types in
278"(4 x" instead of "(4)", the parser should emit an error. Because errors can
279occur, the parser needs a way to indicate that they happened: in our parser, we
280return null on an error.</p>
281
282<p>Another interesting aspect of this function is that it uses recursion by
Owen Andersonc6311b92007-10-22 06:48:28 +0000283calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
Chris Lattnere6c91042007-10-22 06:34:15 +0000284<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
285recursive grammars, and keeps each production very simple. Note that
286parenthesis do not cause construction of AST nodes themselves. While we could
287do this, the most important role of parens are to guide the parser and provide
288grouping. Once the parser constructs the AST, parens are not needed.</p>
289
290<p>The next simple production is for handling variable references and function
291calls:</p>
292
293<div class="doc_code">
294<pre>
295/// identifierexpr
296/// ::= identifer
297/// ::= identifer '(' expression* ')'
298static ExprAST *ParseIdentifierExpr() {
299 std::string IdName = IdentifierStr;
300
301 getNextToken(); // eat identifer.
302
303 if (CurTok != '(') // Simple variable ref.
304 return new VariableExprAST(IdName);
305
306 // Call.
307 getNextToken(); // eat (
308 std::vector&lt;ExprAST*&gt; Args;
309 while (1) {
310 ExprAST *Arg = ParseExpression();
311 if (!Arg) return 0;
312 Args.push_back(Arg);
313
314 if (CurTok == ')') break;
315
316 if (CurTok != ',')
317 return Error("Expected ')'");
318 getNextToken();
319 }
320
321 // Eat the ')'.
322 getNextToken();
323
324 return new CallExprAST(IdName, Args);
325}
326</pre>
327</div>
328
329<p>This routine follows the same style as the other routines (it expects to be
330called if the current token is a <tt>tok_identifier</tt> token). It also has
331recursion and error handling. One interesting aspect of this is that it uses
332<em>look-ahead</em> to determine if the current identifier is a stand alone
333variable reference or if it is a function call expression. It handles this by
334checking to see if the token after the identifier is a '(' token, and constructs
335either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
336</p>
337
338<p>Now that we have all of our simple expression parsing logic in place, we can
339define a helper function to wrap them up in a class. We call this class of
340expressions "primary" expressions, for reasons that will become more clear
341later. In order to parse a primary expression, we need to determine what sort
342of expression it is:</p>
343
344<div class="doc_code">
345<pre>
346/// primary
347/// ::= identifierexpr
348/// ::= numberexpr
349/// ::= parenexpr
350static ExprAST *ParsePrimary() {
351 switch (CurTok) {
352 default: return Error("unknown token when expecting an expression");
353 case tok_identifier: return ParseIdentifierExpr();
354 case tok_number: return ParseNumberExpr();
355 case '(': return ParseParenExpr();
356 }
357}
358</pre>
359</div>
360
361<p>Now that you see the definition of this function, it makes it more obvious
362why we can assume the state of CurTok in the various functions. This uses
363look-ahead to determine which sort of expression is being inspected, and parses
364it with a function call.</p>
365
366<p>Now that basic expressions are handled, we need to handle binary expressions,
367which are a bit more complex.</p>
368
369</div>
370
371<!-- *********************************************************************** -->
372<div class="doc_section"><a name="parserbinops">Binary Expression
373 Parsing</a></div>
374<!-- *********************************************************************** -->
375
376<div class="doc_text">
377
378<p>Binary expressions are significantly harder to parse because they are often
379ambiguous. For example, when given the string "x+y*z", the parser can choose
380to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
381mathematics, we expect the later parse, because "*" (multiplication) has
382higher <em>precedence</em> than "+" (addition).</p>
383
384<p>There are many ways to handle this, but an elegant and efficient way is to
385use <a href=
386"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
387Parsing</a>. This parsing technique uses the precedence of binary operators to
388guide recursion. To start with, we need a table of precedences:</p>
389
390<div class="doc_code">
391<pre>
392/// BinopPrecedence - This holds the precedence for each binary operator that is
393/// defined.
394static std::map&lt;char, int&gt; BinopPrecedence;
395
396/// GetTokPrecedence - Get the precedence of the pending binary operator token.
397static int GetTokPrecedence() {
398 if (!isascii(CurTok))
399 return -1;
400
401 // Make sure it's a declared binop.
402 int TokPrec = BinopPrecedence[CurTok];
403 if (TokPrec &lt;= 0) return -1;
404 return TokPrec;
405}
406
407int main() {
408 // Install standard binary operators.
409 // 1 is lowest precedence.
410 BinopPrecedence['&lt;'] = 10;
411 BinopPrecedence['+'] = 20;
412 BinopPrecedence['-'] = 20;
413 BinopPrecedence['*'] = 40; // highest.
414 ...
415}
416</pre>
417</div>
418
419<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
420(this can obviously be extended by you, the reader). The
421<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
422or -1 if the token is not a binary operator. Having a map makes it easy to add
423new operators and makes it clear that the algorithm doesn't depend on the
424specific operators involved, but it would be easy enough to eliminate the map
425and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
426
427<p>With the helper above defined, we can now start parsing binary expressions.
428The basic idea of operator precedence parsing is to break down an expression
429with potentially ambiguous binary operators into pieces. Consider for example
430the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
431as a stream of primary expressions separated by binary operators. As such,
432it will first parse the leading primary expression "a", then it will see the
433pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
434are primary expressions that the binary expression parser doesn't need to worry
435about nested subexpressions like (c+d) at all.
436</p>
437
438<p>
439To start, an expression is a primary expression potentially followed by a
440sequence of [binop,primaryexpr] pairs:</p>
441
442<div class="doc_code">
443<pre>
444/// expression
445/// ::= primary binoprhs
446///
447static ExprAST *ParseExpression() {
448 ExprAST *LHS = ParsePrimary();
449 if (!LHS) return 0;
450
451 return ParseBinOpRHS(0, LHS);
452}
453</pre>
454</div>
455
456<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
457us. It takes a precedence and a pointer to an expression for the part parsed
458so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
459allowed to be empty, in which case it returns the expression that is passed into
460it. In our example above, the code passes the expression for "a" into
461<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
462
463<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
464minimal operator precedence</em> that the function is allowed to eat. For
465example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
466passed in a precedence of 40, it will not consume any tokens (because the
467precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
468with:</p>
469
470<div class="doc_code">
471<pre>
472/// binoprhs
473/// ::= ('+' primary)*
474static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
475 // If this is a binop, find its precedence.
476 while (1) {
477 int TokPrec = GetTokPrecedence();
478
479 // If this is a binop that binds at least as tightly as the current binop,
480 // consume it, otherwise we are done.
481 if (TokPrec &lt; ExprPrec)
482 return LHS;
483</pre>
484</div>
485
486<p>This code gets the precedence of the current token and checks to see if if is
487too low. Because we defined invalid tokens to have a precedence of -1, this
488check implicitly knows that the pair-stream ends when the token stream runs out
489of binary operators. If this check succeeds, we know that the token is a binary
490operator and that it will be included in this expression:</p>
491
492<div class="doc_code">
493<pre>
494 // Okay, we know this is a binop.
495 int BinOp = CurTok;
496 getNextToken(); // eat binop
497
498 // Parse the primary expression after the binary operator.
499 ExprAST *RHS = ParsePrimary();
500 if (!RHS) return 0;
501</pre>
502</div>
503
504<p>As such, this code eats (and remembers) the binary operator and then parses
505the following primary expression. This builds up the whole pair, the first of
506which is [+, b] for the running example.</p>
507
508<p>Now that we parsed the left-hand side of an expression and one pair of the
509RHS sequence, we have to decide which way the expression associates. In
510particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
511To determine this, we look ahead at "binop" to determine its precedence and
512compare it to BinOp's precedence (which is '+' in this case):</p>
513
514<div class="doc_code">
515<pre>
516 // If BinOp binds less tightly with RHS than the operator after RHS, let
517 // the pending operator take RHS as its LHS.
518 int NextPrec = GetTokPrecedence();
519 if (TokPrec &lt; NextPrec) {
520</pre>
521</div>
522
523<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
524precedence of our current operator, then we know that the parentheses associate
525as "(a+b) binop ...". In our example, since the next operator is "+" and so is
526our current one, we know that they have the same precedence. In this case we'll
527create the AST node for "a+b", and then continue parsing:</p>
528
529<div class="doc_code">
530<pre>
531 ... if body omitted ...
532 }
533
534 // Merge LHS/RHS.
535 LHS = new BinaryExprAST(BinOp, LHS, RHS);
536 } // loop around to the top of the while loop.
537}
538</pre>
539</div>
540
541<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
542iteration of the loop, with "+" as the current token. The code above will eat
543and remember it and parse "(c+d)" as the primary expression, which makes the
544current pair be [+, (c+d)]. It will then enter the 'if' above with "*" as the
545binop to the right of the primary. In this case, the precedence of "*" is
546higher than the precedence of "+" so the if condition will be entered.</p>
547
548<p>The critical question left here is "how can the if condition parse the right
549hand side in full"? In particular, to build the AST correctly for our example,
550it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
551do this is surprisingly simple (code from the above two blocks duplicated for
552context):</p>
553
554<div class="doc_code">
555<pre>
556 // If BinOp binds less tightly with RHS than the operator after RHS, let
557 // the pending operator take RHS as its LHS.
558 int NextPrec = GetTokPrecedence();
559 if (TokPrec &lt; NextPrec) {
560 RHS = ParseBinOpRHS(TokPrec+1, RHS);
561 if (RHS == 0) return 0;
562 }
563 // Merge LHS/RHS.
564 LHS = new BinaryExprAST(BinOp, LHS, RHS);
565 } // loop around to the top of the while loop.
566}
567</pre>
568</div>
569
570<p>At this point, we know that the binary operator to the RHS of our primary
571has higher precedence than the binop we are currently parsing. As such, we know
572that any sequence of pairs whose operators are all higher precedence than "+"
573should be parsed together and returned as "RHS". To do this, we recursively
574invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
575precedence required for it to continue. In our example above, this will cause
576it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
577of the '+' expression.</p>
578
579<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
580and added to the AST. With this little bit of code (14 non-trivial lines), we
581correctly handle fully general binary expression parsing in a very elegant way.
582</p>
583
584<p>This wraps up handling of expressions. At this point, we can point the
585parser at an arbitrary token stream and build an expression from them, stopping
586at the first token that is not part of the expression. Next up we need to
587handle function definitions etc.</p>
588
589</div>
590
591<!-- *********************************************************************** -->
592<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
593<!-- *********************************************************************** -->
594
595<div class="doc_text">
596
597<p>
598The first basic thing missing is that of function prototypes. In Kaleidoscope,
599these are used both for 'extern' function declarations as well as function body
600definitions. The code to do this is straight-forward and not very interesting
601(once you've survived expressions):
602</p>
603
604<div class="doc_code">
605<pre>
606/// prototype
607/// ::= id '(' id* ')'
608static PrototypeAST *ParsePrototype() {
609 if (CurTok != tok_identifier)
610 return ErrorP("Expected function name in prototype");
611
612 std::string FnName = IdentifierStr;
613 getNextToken();
614
615 if (CurTok != '(')
616 return ErrorP("Expected '(' in prototype");
617
618 std::vector&lt;std::string&gt; ArgNames;
619 while (getNextToken() == tok_identifier)
620 ArgNames.push_back(IdentifierStr);
621 if (CurTok != ')')
622 return ErrorP("Expected ')' in prototype");
623
624 // success.
625 getNextToken(); // eat ')'.
626
627 return new PrototypeAST(FnName, ArgNames);
628}
629</pre>
630</div>
631
632<p>Given this, a function definition is very simple, just a prototype plus
633and expression to implement the body:</p>
634
635<div class="doc_code">
636<pre>
637/// definition ::= 'def' prototype expression
638static FunctionAST *ParseDefinition() {
639 getNextToken(); // eat def.
640 PrototypeAST *Proto = ParsePrototype();
641 if (Proto == 0) return 0;
642
643 if (ExprAST *E = ParseExpression())
644 return new FunctionAST(Proto, E);
645 return 0;
646}
647</pre>
648</div>
649
650<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
651well as to support forward declaration of user functions. 'externs' are just
652prototypes with no body:</p>
653
654<div class="doc_code">
655<pre>
656/// external ::= 'extern' prototype
657static PrototypeAST *ParseExtern() {
658 getNextToken(); // eat extern.
659 return ParsePrototype();
660}
661</pre>
662</div>
663
664<p>Finally, we'll also let the user type in arbitrary top-level expressions and
665evaluate them on the fly. We will handle this by defining anonymous nullary
666(zero argument) functions for them:</p>
667
668<div class="doc_code">
669<pre>
670/// toplevelexpr ::= expression
671static FunctionAST *ParseTopLevelExpr() {
672 if (ExprAST *E = ParseExpression()) {
673 // Make an anonymous proto.
674 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
675 return new FunctionAST(Proto, E);
676 }
677 return 0;
678}
679</pre>
680</div>
681
682<p>Now that we have all the pieces, lets build a little driver that will let us
683actually <em>execute</em> this code we've built!</p>
684
685</div>
686
687<!-- *********************************************************************** -->
688<div class="doc_section"><a name="driver">The Driver</a></div>
689<!-- *********************************************************************** -->
690
691<div class="doc_text">
692
693<p>The driver for this simply invokes all of the parsing pieces with a top-level
694dispatch loop. There isn't much interesting here, so I'll just include the
695top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
696Parsing" section.</p>
697
698<div class="doc_code">
699<pre>
700/// top ::= definition | external | expression | ';'
701static void MainLoop() {
702 while (1) {
703 fprintf(stderr, "ready&gt; ");
704 switch (CurTok) {
705 case tok_eof: return;
706 case ';': getNextToken(); break; // ignore top level semicolons.
707 case tok_def: HandleDefinition(); break;
708 case tok_extern: HandleExtern(); break;
709 default: HandleTopLevelExpression(); break;
710 }
711 }
712}
713</pre>
714</div>
715
716<p>The most interesting part of this is that we ignore top-level semi colons.
Owen Andersonc6311b92007-10-22 06:48:28 +0000717Why is this, you ask? The basic reason is that if you type "4 + 5" at the
Chris Lattnere6c91042007-10-22 06:34:15 +0000718command line, the parser doesn't know that that is the end of what you will
719type. For example, on the next line you could type "def foo..." in which case
7204+5 is the end of a top-level expression. Alternatively you could type "* 6",
721which would continue the expression. Having top-level semicolons allows you to
722type "4+5;" and the parser will know you are done.</p>
723
724</div>
725
726<!-- *********************************************************************** -->
727<div class="doc_section"><a name="code">Conclusions and the Full Code</a></div>
728<!-- *********************************************************************** -->
729
730<div class="doc_text">
731
732<p>With just under 400 lines of commented code, we fully defined our minimal
733language, including a lexer, parser and AST builder. With this done, the
734executable will validate code and tell us if it is gramatically invalid. For
735example, here is a sample interaction:</p>
736
737<div class="doc_code">
738<pre>
739$ ./a.out
740ready> def foo(x y) x+foo(y, 4.0);
741ready> Parsed an function definition.
742ready> def foo(x y) x+y y;
743ready> Parsed an function definition.
744ready> Parsed a top-level expr
745ready> def foo(x y) x+y );
746ready> Parsed an function definition.
747ready> Error: unknown token when expecting an expression
748ready> extern sin(a);
749ready> Parsed an extern
750ready> ^D
751$
752</pre>
753</div>
754
755<p>
756Here is the full code. Note that it is fully self-contained: you don't even
757need LLVM for this. In the <a href="LangImpl3.html">next installment</a>, we
758will describe how to generate LLVM IR from the AST.</p>
759
760<div class="doc_code">
761<pre>
762// To build this:
763// g++ -g toy.cpp
764// ./a.out
765
766#include &lt;cstdio&gt;
767#include &lt;string&gt;
768#include &lt;
769#include &lt;vector&gt;
770
771//===----------------------------------------------------------------------===//
772// Lexer
773//===----------------------------------------------------------------------===//
774
775// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
776// of these for known things.
777enum Token {
778 tok_eof = -1,
779
780 // commands
781 tok_def = -2, tok_extern = -3,
782
783 // primary
784 tok_identifier = -4, tok_number = -5,
785};
786
787static std::string IdentifierStr; // Filled in if tok_identifier
788static double NumVal; // Filled in if tok_number
789
790/// gettok - Return the next token from standard input.
791static int gettok() {
792 static int LastChar = ' ';
793
794 // Skip any whitespace.
795 while (isspace(LastChar))
796 LastChar = getchar();
797
798 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
799 IdentifierStr = LastChar;
800 while (isalnum((LastChar = getchar())))
801 IdentifierStr += LastChar;
802
803 if (IdentifierStr == "def") return tok_def;
804 if (IdentifierStr == "extern") return tok_extern;
805 return tok_identifier;
806 }
807
808 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
809 std::string NumStr;
810 do {
811 NumStr += LastChar;
812 LastChar = getchar();
813 } while (isdigit(LastChar) || LastChar == '.');
814
815 NumVal = strtod(NumStr.c_str(), 0);
816 return tok_number;
817 }
818
819 if (LastChar == '#') {
820 // Comment until end of line.
821 do LastChar = getchar();
822 while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
823
824 if (LastChar != EOF)
825 return gettok();
826 }
827
828 // Check for end of file. Don't eat the EOF.
829 if (LastChar == EOF)
830 return tok_eof;
831
832 // Otherwise, just return the character as its ascii value.
833 int ThisChar = LastChar;
834 LastChar = getchar();
835 return ThisChar;
836}
837
838//===----------------------------------------------------------------------===//
839// Abstract Syntax Tree (aka Parse Tree)
840//===----------------------------------------------------------------------===//
841
842/// ExprAST - Base class for all expression nodes.
843class ExprAST {
844public:
845 virtual ~ExprAST() {}
846};
847
848/// NumberExprAST - Expression class for numeric literals like "1.0".
849class NumberExprAST : public ExprAST {
850 double Val;
851public:
852 NumberExprAST(double val) : Val(val) {}
853};
854
855/// VariableExprAST - Expression class for referencing a variable, like "a".
856class VariableExprAST : public ExprAST {
857 std::string Name;
858public:
859 VariableExprAST(const std::string &amp;name) : Name(name) {}
860};
861
862/// BinaryExprAST - Expression class for a binary operator.
863class BinaryExprAST : public ExprAST {
864 char Op;
865 ExprAST *LHS, *RHS;
866public:
867 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
868 : Op(op), LHS(lhs), RHS(rhs) {}
869};
870
871/// CallExprAST - Expression class for function calls.
872class CallExprAST : public ExprAST {
873 std::string Callee;
874 std::vector&lt;ExprAST*&gt; Args;
875public:
876 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
877 : Callee(callee), Args(args) {}
878};
879
880/// PrototypeAST - This class represents the "prototype" for a function,
881/// which captures its argument names as well as if it is an operator.
882class PrototypeAST {
883 std::string Name;
884 std::vector&lt; Args;
885public:
886 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
887 : Name(name), Args(args) {}
888
889};
890
891/// FunctionAST - This class represents a function definition itself.
892class FunctionAST {
893 PrototypeAST *Proto;
894 ExprAST *Body;
895public:
896 FunctionAST(PrototypeAST *proto, ExprAST *body)
897 : Proto(proto), Body(body) {}
898
899};
900
901//===----------------------------------------------------------------------===//
902// Parser
903//===----------------------------------------------------------------------===//
904
905/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
906/// token the parser it looking at. getNextToken reads another token from the
907/// lexer and updates CurTok with its results.
908static int CurTok;
909static int getNextToken() {
910 return CurTok = gettok();
911}
912
913/// BinopPrecedence - This holds the precedence for each binary operator that is
914/// defined.
915static std::map&lt;char, int&gt; BinopPrecedence;
916
917/// GetTokPrecedence - Get the precedence of the pending binary operator token.
918static int GetTokPrecedence() {
919 if (!isascii(CurTok))
920 return -1;
921
922 // Make sure it's a declared binop.
923 int TokPrec = BinopPrecedence[CurTok];
924 if (TokPrec &lt;= 0) return -1;
925 return TokPrec;
926}
927
928/// Error* - These are little helper functions for error handling.
929ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
930PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
931FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
932
933static ExprAST *ParseExpression();
934
935/// identifierexpr
936/// ::= identifer
937/// ::= identifer '(' expression* ')'
938static ExprAST *ParseIdentifierExpr() {
939 std::string IdName = IdentifierStr;
940
941 getNextToken(); // eat identifer.
942
943 if (CurTok != '(') // Simple variable ref.
944 return new VariableExprAST(IdName);
945
946 // Call.
947 getNextToken(); // eat (
948 std::vector&lt;ExprAST*&gt; Args;
949 while (1) {
950 ExprAST *Arg = ParseExpression();
951 if (!Arg) return 0;
952 Args.push_back(Arg);
953
954 if (CurTok == ')') break;
955
956 if (CurTok != ',')
957 return Error("Expected ')'");
958 getNextToken();
959 }
960
961 // Eat the ')'.
962 getNextToken();
963
964 return new CallExprAST(IdName, Args);
965}
966
967/// numberexpr ::= number
968static ExprAST *ParseNumberExpr() {
969 ExprAST *Result = new NumberExprAST(NumVal);
970 getNextToken(); // consume the number
971 return Result;
972}
973
974/// parenexpr ::= '(' expression ')'
975static ExprAST *ParseParenExpr() {
976 getNextToken(); // eat (.
977 ExprAST *V = ParseExpression();
978 if (!V) return 0;
979
980 if (CurTok != ')')
981 return Error("expected ')'");
982 getNextToken(); // eat ).
983 return V;
984}
985
986/// primary
987/// ::= identifierexpr
988/// ::= numberexpr
989/// ::= parenexpr
990static ExprAST *ParsePrimary() {
991 switch (CurTok) {
992 default: return Error("unknown token when expecting an expression");
993 case tok_identifier: return ParseIdentifierExpr();
994 case tok_number: return ParseNumberExpr();
995 case '(': return ParseParenExpr();
996 }
997}
998
999/// binoprhs
1000/// ::= ('+' primary)*
1001static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1002 // If this is a binop, find its precedence.
1003 while (1) {
1004 int TokPrec = GetTokPrecedence();
1005
1006 // If this is a binop that binds at least as tightly as the current binop,
1007 // consume it, otherwise we are done.
1008 if (TokPrec &lt; ExprPrec)
1009 return LHS;
1010
1011 // Okay, we know this is a binop.
1012 int BinOp = CurTok;
1013 getNextToken(); // eat binop
1014
1015 // Parse the primary expression after the binary operator.
1016 ExprAST *RHS = ParsePrimary();
1017 if (!RHS) return 0;
1018
1019 // If BinOp binds less tightly with RHS than the operator after RHS, let
1020 // the pending operator take RHS as its LHS.
1021 int NextPrec = GetTokPrecedence();
1022 if (TokPrec &lt; NextPrec) {
1023 RHS = ParseBinOpRHS(TokPrec+1, RHS);
1024 if (RHS == 0) return 0;
1025 }
1026
1027 // Merge LHS/RHS.
1028 LHS = new BinaryExprAST(BinOp, LHS, RHS);
1029 }
1030}
1031
1032/// expression
1033/// ::= primary binoprhs
1034///
1035static ExprAST *ParseExpression() {
1036 ExprAST *LHS = ParsePrimary();
1037 if (!LHS) return 0;
1038
1039 return ParseBinOpRHS(0, LHS);
1040}
1041
1042/// prototype
1043/// ::= id '(' id* ')'
1044static PrototypeAST *ParsePrototype() {
1045 if (CurTok != tok_identifier)
1046 return ErrorP("Expected function name in prototype");
1047
1048 std::string FnName = IdentifierStr;
1049 getNextToken();
1050
1051 if (CurTok != '(')
1052 return ErrorP("Expected '(' in prototype");
1053
1054 std::vector&lt;std::string&gt; ArgNames;
1055 while (getNextToken() == tok_identifier)
1056 ArgNames.push_back(IdentifierStr);
1057 if (CurTok != ')')
1058 return ErrorP("Expected ')' in prototype");
1059
1060 // success.
1061 getNextToken(); // eat ')'.
1062
1063 return new PrototypeAST(FnName, ArgNames);
1064}
1065
1066/// definition ::= 'def' prototype expression
1067static FunctionAST *ParseDefinition() {
1068 getNextToken(); // eat def.
1069 PrototypeAST *Proto = ParsePrototype();
1070 if (Proto == 0) return 0;
1071
1072 if (ExprAST *E = ParseExpression())
1073 return new FunctionAST(Proto, E);
1074 return 0;
1075}
1076
1077/// toplevelexpr ::= expression
1078static FunctionAST *ParseTopLevelExpr() {
1079 if (ExprAST *E = ParseExpression()) {
1080 // Make an anonymous proto.
1081 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;());
1082 return new FunctionAST(Proto, E);
1083 }
1084 return 0;
1085}
1086
1087/// external ::= 'extern' prototype
1088static PrototypeAST *ParseExtern() {
1089 getNextToken(); // eat extern.
1090 return ParsePrototype();
1091}
1092
1093//===----------------------------------------------------------------------===//
1094// Top-Level parsing
1095//===----------------------------------------------------------------------===//
1096
1097static void HandleDefinition() {
1098 if (FunctionAST *F = ParseDefinition()) {
1099 fprintf(stderr, "Parsed a function definition.\n");
1100 } else {
1101 // Skip token for error recovery.
1102 getNextToken();
1103 }
1104}
1105
1106static void HandleExtern() {
1107 if (PrototypeAST *P = ParseExtern()) {
1108 fprintf(stderr, "Parsed an extern\n");
1109 } else {
1110 // Skip token for error recovery.
1111 getNextToken();
1112 }
1113}
1114
1115static void HandleTopLevelExpression() {
1116 // Evaluate a top level expression into an anonymous function.
1117 if (FunctionAST *F = ParseTopLevelExpr()) {
1118 fprintf(stderr, "Parsed a top-level expr\n");
1119 } else {
1120 // Skip token for error recovery.
1121 getNextToken();
1122 }
1123}
1124
1125/// top ::= definition | external | expression | ';'
1126static void MainLoop() {
1127 while (1) {
1128 fprintf(stderr, "ready&gt; ");
1129 switch (CurTok) {
1130 case tok_eof: return;
1131 case ';': getNextToken(); break; // ignore top level semicolons.
1132 case tok_def: HandleDefinition(); break;
1133 case tok_extern: HandleExtern(); break;
1134 default: HandleTopLevelExpression(); break;
1135 }
1136 }
1137}
1138
1139//===----------------------------------------------------------------------===//
1140// Main driver code.
1141//===----------------------------------------------------------------------===//
1142
1143int main() {
1144 // Install standard binary operators.
1145 // 1 is lowest precedence.
1146 BinopPrecedence['&lt;'] = 10;
1147 BinopPrecedence['+'] = 20;
1148 BinopPrecedence['-'] = 20;
1149 BinopPrecedence['*'] = 40; // highest.
1150
1151 // Prime the first token.
1152 fprintf(stderr, "ready&gt; ");
1153 getNextToken();
1154
1155 MainLoop();
1156 return 0;
1157}
1158</pre>
1159</div>
1160</div>
1161
1162<!-- *********************************************************************** -->
1163<hr>
1164<address>
1165 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1166 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1167 <a href="http://validator.w3.org/check/referer"><img
1168 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!" /></a>
1169
1170 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1171 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1172 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1173</address>
1174</body>
1175</html>