blob: d364f4de3e7c3b893d79bc235569b8e82a78ec96 [file] [log] [blame]
Chris Lattnere6c91042007-10-22 06:34:15 +00001<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
3
4<html>
5<head>
6 <title>Kaleidoscope: Implementing a Parser and AST</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
10</head>
11
12<body>
13
14<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
15
16<div class="doc_author">
17 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
18</div>
19
20<!-- *********************************************************************** -->
21<div class="doc_section"><a name="intro">Part 2 Introduction</a></div>
22<!-- *********************************************************************** -->
23
24<div class="doc_text">
25
26<p>Welcome to part 2 of the "<a href="index.html">Implementing a language with
27LLVM</a>" tutorial. This chapter shows you how to use the <a
28href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
29href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
30our Kaleidoscope language and build an <a
31href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
32Tree</a> (AST).</p>
33
34<p>The parser we will build uses a combination of <a
35href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
36Parsing</a> and <a href=
37"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
38Parsing</a> to parse the Kaleidoscope language (the later for binary expression
39and the former for everything else). Before we get to parsing though, lets talk
40about the output of the parser: the Abstract Syntax Tree.</p>
41
42</div>
43
44<!-- *********************************************************************** -->
45<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
46<!-- *********************************************************************** -->
47
48<div class="doc_text">
49
50<p>The AST for a program captures its behavior in a way that it is easy for
51later stages of the compiler (e.g. code generation) to interpret. We basically
52want one object for each construct in the language, and the AST should closely
53model the language. In Kaleidoscope, we have expressions, a prototype, and a
54function object. We'll start with expressions first:</p>
55
56<div class="doc_code">
57<pre>
58/// ExprAST - Base class for all expression nodes.
59class ExprAST {
60public:
61 virtual ~ExprAST() {}
62};
63
64/// NumberExprAST - Expression class for numeric literals like "1.0".
65class NumberExprAST : public ExprAST {
66 double Val;
67public:
Chris Lattner28571ed2007-10-23 04:27:44 +000068 explicit NumberExprAST(double val) : Val(val) {}
Chris Lattnere6c91042007-10-22 06:34:15 +000069};
70</pre>
71</div>
72
73<p>The code above shows the definition of the base ExprAST class and one
74subclass which we use for numeric literals. The important thing about this is
75that the NumberExprAST class captures the numeric value of the literal in the
76class, so that later phases of the compiler can know what it is.</p>
77
78<p>Right now we only create the AST, so there are no useful accessor methods on
79them. It would be very easy to add a virtual method to pretty print the code,
80for example. Here are the other expression AST node definitions that we'll use
81in the basic form of the Kaleidoscope language.
82</p>
83
84<div class="doc_code">
85<pre>
86/// VariableExprAST - Expression class for referencing a variable, like "a".
87class VariableExprAST : public ExprAST {
88 std::string Name;
89public:
Chris Lattner28571ed2007-10-23 04:27:44 +000090 explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
Chris Lattnere6c91042007-10-22 06:34:15 +000091};
92
93/// BinaryExprAST - Expression class for a binary operator.
94class BinaryExprAST : public ExprAST {
95 char Op;
96 ExprAST *LHS, *RHS;
97public:
98 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
99 : Op(op), LHS(lhs), RHS(rhs) {}
100};
101
102/// CallExprAST - Expression class for function calls.
103class CallExprAST : public ExprAST {
104 std::string Callee;
105 std::vector&lt;ExprAST*&gt; Args;
106public:
107 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
108 : Callee(callee), Args(args) {}
109};
110</pre>
111</div>
112
113<p>This is all (intentially) rather straight-forward: variables capture the
114variable name, binary operators capture their opcode (e.g. '+'), and calls
115capture a function name and list of argument expressions. One thing that is
116nice about our AST is that it captures the language features without talking
117about the syntax of the language. Note that there is no discussion about
118precedence of binary operators, lexical structure etc.</p>
119
120<p>For our basic language, these are all of the expression nodes we'll define.
Owen Andersonc6311b92007-10-22 06:48:28 +0000121Because it doesn't have conditional control flow, it isn't Turing-complete;
Chris Lattnere6c91042007-10-22 06:34:15 +0000122we'll fix that in a later installment. The two things we need next are a way
123to talk about the interface to a function, and a way to talk about functions
124themselves:</p>
125
126<div class="doc_code">
127<pre>
128/// PrototypeAST - This class represents the "prototype" for a function,
129/// which captures its argument names as well as if it is an operator.
130class PrototypeAST {
131 std::string Name;
132 std::vector&lt;std::string&gt; Args;
133public:
134 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
135 : Name(name), Args(args) {}
136};
137
138/// FunctionAST - This class represents a function definition itself.
139class FunctionAST {
140 PrototypeAST *Proto;
141 ExprAST *Body;
142public:
143 FunctionAST(PrototypeAST *proto, ExprAST *body)
144 : Proto(proto), Body(body) {}
145};
146</pre>
147</div>
148
149<p>In Kaleidoscope, functions are typed with just a count of their arguments.
150Since all values are double precision floating point, this fact doesn't need to
151be captured anywhere. In a more aggressive and realistic language, the
152"ExprAST" class would probably have a type field.</p>
153
154<p>With this scaffolding, we can now talk about parsing expressions and function
155bodies in Kaleidoscope.</p>
156
157</div>
158
159<!-- *********************************************************************** -->
160<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
161<!-- *********************************************************************** -->
162
163<div class="doc_text">
164
165<p>Now that we have an AST to build, we need to define the parser code to build
166it. The idea here is that we want to parse something like "x+y" (which is
167returned as three tokens by the lexer) into an AST that could be generated with
168calls like this:</p>
169
170<div class="doc_code">
171<pre>
172 ExprAST *X = new VariableExprAST("x");
173 ExprAST *Y = new VariableExprAST("y");
174 ExprAST *Result = new BinaryExprAST('+', X, Y);
175</pre>
176</div>
177
178<p>In order to do this, we'll start by defining some basic helper routines:</p>
179
180<div class="doc_code">
181<pre>
182/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
183/// token the parser it looking at. getNextToken reads another token from the
184/// lexer and updates CurTok with its results.
185static int CurTok;
186static int getNextToken() {
187 return CurTok = gettok();
188}
189</pre>
190</div>
191
192<p>
193This implements a simple token buffer around the lexer. This allows
194us to look one token ahead at what the lexer is returning. Every function in
Chris Lattnere9495122007-10-25 18:05:29 +0000195our parser will assume that CurTok is the current token that needs to be
Chris Lattnere6c91042007-10-22 06:34:15 +0000196parsed.</p>
197
Chris Lattnerbee175b2007-10-22 16:44:31 +0000198<p>Again, we define these with global variables; it would be better design to
199wrap the entire parser in a class and use instance variables for these.
Chris Lattnere6c91042007-10-22 06:34:15 +0000200</p>
201
202<div class="doc_code">
203<pre>
204
205/// Error* - These are little helper functions for error handling.
206ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
207PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
208FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
209</pre>
210</div>
211
212<p>
213The <tt>Error</tt> routines are simple helper routines that our parser will use
214to handle errors. The error recovery in our parser will not be the best and
215are not particular user-friendly, but it will be enough for our tutorial. These
216routines make it easier to handle errors in routines that have various return
217types: they always return null.</p>
218
219<p>With these basic helper functions implemented, we can implement the first
220piece of our grammar: we'll start with numeric literals.</p>
221
222</div>
223
224<!-- *********************************************************************** -->
225<div class="doc_section"><a name="parserprimexprs">Basic Expression
226 Parsing</a></div>
227<!-- *********************************************************************** -->
228
229<div class="doc_text">
230
231<p>We start with numeric literals, because they are the simplest to process.
232For each production in our grammar, we'll define a function which parses that
233production. For numeric literals, we have:
234</p>
235
236<div class="doc_code">
237<pre>
238/// numberexpr ::= number
239static ExprAST *ParseNumberExpr() {
240 ExprAST *Result = new NumberExprAST(NumVal);
241 getNextToken(); // consume the number
242 return Result;
243}
244</pre>
245</div>
246
247<p>This routine is very simple: it expects to be called when the current token
248is a <tt>tok_number</tt> token. It takes the current number value, creates
249a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
250returns.</p>
251
252<p>There are some interesting aspects of this. The most important one is that
253this routine eats all of the tokens that correspond to the production, and
254returns the lexer buffer with the next token (which is not part of the grammar
255production) ready to go. This is a fairly standard way to go for recursive
256descent parsers. For a better example, the parenthesis operator is defined like
257this:</p>
258
259<div class="doc_code">
260<pre>
261/// parenexpr ::= '(' expression ')'
262static ExprAST *ParseParenExpr() {
263 getNextToken(); // eat (.
264 ExprAST *V = ParseExpression();
265 if (!V) return 0;
266
267 if (CurTok != ')')
268 return Error("expected ')'");
269 getNextToken(); // eat ).
270 return V;
271}
272</pre>
273</div>
274
275<p>This function illustrates a number of interesting things about the parser:
2761) it shows how we use the Error routines. When called, this function expects
277that the current token is a '(' token, but after parsing the subexpression, it
278is possible that there is not a ')' waiting. For example, if the user types in
279"(4 x" instead of "(4)", the parser should emit an error. Because errors can
280occur, the parser needs a way to indicate that they happened: in our parser, we
281return null on an error.</p>
282
283<p>Another interesting aspect of this function is that it uses recursion by
Owen Andersonc6311b92007-10-22 06:48:28 +0000284calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
Chris Lattnere6c91042007-10-22 06:34:15 +0000285<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
286recursive grammars, and keeps each production very simple. Note that
287parenthesis do not cause construction of AST nodes themselves. While we could
288do this, the most important role of parens are to guide the parser and provide
289grouping. Once the parser constructs the AST, parens are not needed.</p>
290
291<p>The next simple production is for handling variable references and function
292calls:</p>
293
294<div class="doc_code">
295<pre>
296/// identifierexpr
297/// ::= identifer
298/// ::= identifer '(' expression* ')'
299static ExprAST *ParseIdentifierExpr() {
300 std::string IdName = IdentifierStr;
301
302 getNextToken(); // eat identifer.
303
304 if (CurTok != '(') // Simple variable ref.
305 return new VariableExprAST(IdName);
306
307 // Call.
308 getNextToken(); // eat (
309 std::vector&lt;ExprAST*&gt; Args;
310 while (1) {
311 ExprAST *Arg = ParseExpression();
312 if (!Arg) return 0;
313 Args.push_back(Arg);
314
315 if (CurTok == ')') break;
316
317 if (CurTok != ',')
318 return Error("Expected ')'");
319 getNextToken();
320 }
321
322 // Eat the ')'.
323 getNextToken();
324
325 return new CallExprAST(IdName, Args);
326}
327</pre>
328</div>
329
330<p>This routine follows the same style as the other routines (it expects to be
331called if the current token is a <tt>tok_identifier</tt> token). It also has
332recursion and error handling. One interesting aspect of this is that it uses
333<em>look-ahead</em> to determine if the current identifier is a stand alone
334variable reference or if it is a function call expression. It handles this by
335checking to see if the token after the identifier is a '(' token, and constructs
336either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
337</p>
338
339<p>Now that we have all of our simple expression parsing logic in place, we can
340define a helper function to wrap them up in a class. We call this class of
341expressions "primary" expressions, for reasons that will become more clear
342later. In order to parse a primary expression, we need to determine what sort
343of expression it is:</p>
344
345<div class="doc_code">
346<pre>
347/// primary
348/// ::= identifierexpr
349/// ::= numberexpr
350/// ::= parenexpr
351static ExprAST *ParsePrimary() {
352 switch (CurTok) {
353 default: return Error("unknown token when expecting an expression");
354 case tok_identifier: return ParseIdentifierExpr();
355 case tok_number: return ParseNumberExpr();
356 case '(': return ParseParenExpr();
357 }
358}
359</pre>
360</div>
361
362<p>Now that you see the definition of this function, it makes it more obvious
363why we can assume the state of CurTok in the various functions. This uses
364look-ahead to determine which sort of expression is being inspected, and parses
365it with a function call.</p>
366
367<p>Now that basic expressions are handled, we need to handle binary expressions,
368which are a bit more complex.</p>
369
370</div>
371
372<!-- *********************************************************************** -->
373<div class="doc_section"><a name="parserbinops">Binary Expression
374 Parsing</a></div>
375<!-- *********************************************************************** -->
376
377<div class="doc_text">
378
379<p>Binary expressions are significantly harder to parse because they are often
380ambiguous. For example, when given the string "x+y*z", the parser can choose
381to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
382mathematics, we expect the later parse, because "*" (multiplication) has
383higher <em>precedence</em> than "+" (addition).</p>
384
385<p>There are many ways to handle this, but an elegant and efficient way is to
386use <a href=
387"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
388Parsing</a>. This parsing technique uses the precedence of binary operators to
389guide recursion. To start with, we need a table of precedences:</p>
390
391<div class="doc_code">
392<pre>
393/// BinopPrecedence - This holds the precedence for each binary operator that is
394/// defined.
395static std::map&lt;char, int&gt; BinopPrecedence;
396
397/// GetTokPrecedence - Get the precedence of the pending binary operator token.
398static int GetTokPrecedence() {
399 if (!isascii(CurTok))
400 return -1;
401
402 // Make sure it's a declared binop.
403 int TokPrec = BinopPrecedence[CurTok];
404 if (TokPrec &lt;= 0) return -1;
405 return TokPrec;
406}
407
408int main() {
409 // Install standard binary operators.
410 // 1 is lowest precedence.
411 BinopPrecedence['&lt;'] = 10;
412 BinopPrecedence['+'] = 20;
413 BinopPrecedence['-'] = 20;
414 BinopPrecedence['*'] = 40; // highest.
415 ...
416}
417</pre>
418</div>
419
420<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
421(this can obviously be extended by you, the reader). The
422<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
423or -1 if the token is not a binary operator. Having a map makes it easy to add
424new operators and makes it clear that the algorithm doesn't depend on the
425specific operators involved, but it would be easy enough to eliminate the map
426and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
427
428<p>With the helper above defined, we can now start parsing binary expressions.
429The basic idea of operator precedence parsing is to break down an expression
430with potentially ambiguous binary operators into pieces. Consider for example
431the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
432as a stream of primary expressions separated by binary operators. As such,
433it will first parse the leading primary expression "a", then it will see the
434pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
435are primary expressions that the binary expression parser doesn't need to worry
436about nested subexpressions like (c+d) at all.
437</p>
438
439<p>
440To start, an expression is a primary expression potentially followed by a
441sequence of [binop,primaryexpr] pairs:</p>
442
443<div class="doc_code">
444<pre>
445/// expression
446/// ::= primary binoprhs
447///
448static ExprAST *ParseExpression() {
449 ExprAST *LHS = ParsePrimary();
450 if (!LHS) return 0;
451
452 return ParseBinOpRHS(0, LHS);
453}
454</pre>
455</div>
456
457<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
458us. It takes a precedence and a pointer to an expression for the part parsed
459so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
460allowed to be empty, in which case it returns the expression that is passed into
461it. In our example above, the code passes the expression for "a" into
462<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
463
464<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
465minimal operator precedence</em> that the function is allowed to eat. For
466example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
467passed in a precedence of 40, it will not consume any tokens (because the
468precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
469with:</p>
470
471<div class="doc_code">
472<pre>
473/// binoprhs
474/// ::= ('+' primary)*
475static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
476 // If this is a binop, find its precedence.
477 while (1) {
478 int TokPrec = GetTokPrecedence();
479
480 // If this is a binop that binds at least as tightly as the current binop,
481 // consume it, otherwise we are done.
482 if (TokPrec &lt; ExprPrec)
483 return LHS;
484</pre>
485</div>
486
487<p>This code gets the precedence of the current token and checks to see if if is
488too low. Because we defined invalid tokens to have a precedence of -1, this
489check implicitly knows that the pair-stream ends when the token stream runs out
490of binary operators. If this check succeeds, we know that the token is a binary
491operator and that it will be included in this expression:</p>
492
493<div class="doc_code">
494<pre>
495 // Okay, we know this is a binop.
496 int BinOp = CurTok;
497 getNextToken(); // eat binop
498
499 // Parse the primary expression after the binary operator.
500 ExprAST *RHS = ParsePrimary();
501 if (!RHS) return 0;
502</pre>
503</div>
504
505<p>As such, this code eats (and remembers) the binary operator and then parses
506the following primary expression. This builds up the whole pair, the first of
507which is [+, b] for the running example.</p>
508
509<p>Now that we parsed the left-hand side of an expression and one pair of the
510RHS sequence, we have to decide which way the expression associates. In
511particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
512To determine this, we look ahead at "binop" to determine its precedence and
513compare it to BinOp's precedence (which is '+' in this case):</p>
514
515<div class="doc_code">
516<pre>
517 // If BinOp binds less tightly with RHS than the operator after RHS, let
518 // the pending operator take RHS as its LHS.
519 int NextPrec = GetTokPrecedence();
520 if (TokPrec &lt; NextPrec) {
521</pre>
522</div>
523
524<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
525precedence of our current operator, then we know that the parentheses associate
526as "(a+b) binop ...". In our example, since the next operator is "+" and so is
527our current one, we know that they have the same precedence. In this case we'll
528create the AST node for "a+b", and then continue parsing:</p>
529
530<div class="doc_code">
531<pre>
532 ... if body omitted ...
533 }
534
535 // Merge LHS/RHS.
536 LHS = new BinaryExprAST(BinOp, LHS, RHS);
537 } // loop around to the top of the while loop.
538}
539</pre>
540</div>
541
542<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
543iteration of the loop, with "+" as the current token. The code above will eat
544and remember it and parse "(c+d)" as the primary expression, which makes the
545current pair be [+, (c+d)]. It will then enter the 'if' above with "*" as the
546binop to the right of the primary. In this case, the precedence of "*" is
547higher than the precedence of "+" so the if condition will be entered.</p>
548
549<p>The critical question left here is "how can the if condition parse the right
550hand side in full"? In particular, to build the AST correctly for our example,
551it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
552do this is surprisingly simple (code from the above two blocks duplicated for
553context):</p>
554
555<div class="doc_code">
556<pre>
557 // If BinOp binds less tightly with RHS than the operator after RHS, let
558 // the pending operator take RHS as its LHS.
559 int NextPrec = GetTokPrecedence();
560 if (TokPrec &lt; NextPrec) {
561 RHS = ParseBinOpRHS(TokPrec+1, RHS);
562 if (RHS == 0) return 0;
563 }
564 // Merge LHS/RHS.
565 LHS = new BinaryExprAST(BinOp, LHS, RHS);
566 } // loop around to the top of the while loop.
567}
568</pre>
569</div>
570
571<p>At this point, we know that the binary operator to the RHS of our primary
572has higher precedence than the binop we are currently parsing. As such, we know
573that any sequence of pairs whose operators are all higher precedence than "+"
574should be parsed together and returned as "RHS". To do this, we recursively
575invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
576precedence required for it to continue. In our example above, this will cause
577it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
578of the '+' expression.</p>
579
580<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
581and added to the AST. With this little bit of code (14 non-trivial lines), we
582correctly handle fully general binary expression parsing in a very elegant way.
583</p>
584
585<p>This wraps up handling of expressions. At this point, we can point the
586parser at an arbitrary token stream and build an expression from them, stopping
587at the first token that is not part of the expression. Next up we need to
588handle function definitions etc.</p>
589
590</div>
591
592<!-- *********************************************************************** -->
593<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
594<!-- *********************************************************************** -->
595
596<div class="doc_text">
597
598<p>
599The first basic thing missing is that of function prototypes. In Kaleidoscope,
600these are used both for 'extern' function declarations as well as function body
601definitions. The code to do this is straight-forward and not very interesting
602(once you've survived expressions):
603</p>
604
605<div class="doc_code">
606<pre>
607/// prototype
608/// ::= id '(' id* ')'
609static PrototypeAST *ParsePrototype() {
610 if (CurTok != tok_identifier)
611 return ErrorP("Expected function name in prototype");
612
613 std::string FnName = IdentifierStr;
614 getNextToken();
615
616 if (CurTok != '(')
617 return ErrorP("Expected '(' in prototype");
618
619 std::vector&lt;std::string&gt; ArgNames;
620 while (getNextToken() == tok_identifier)
621 ArgNames.push_back(IdentifierStr);
622 if (CurTok != ')')
623 return ErrorP("Expected ')' in prototype");
624
625 // success.
626 getNextToken(); // eat ')'.
627
628 return new PrototypeAST(FnName, ArgNames);
629}
630</pre>
631</div>
632
633<p>Given this, a function definition is very simple, just a prototype plus
634and expression to implement the body:</p>
635
636<div class="doc_code">
637<pre>
638/// definition ::= 'def' prototype expression
639static FunctionAST *ParseDefinition() {
640 getNextToken(); // eat def.
641 PrototypeAST *Proto = ParsePrototype();
642 if (Proto == 0) return 0;
643
644 if (ExprAST *E = ParseExpression())
645 return new FunctionAST(Proto, E);
646 return 0;
647}
648</pre>
649</div>
650
651<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
652well as to support forward declaration of user functions. 'externs' are just
653prototypes with no body:</p>
654
655<div class="doc_code">
656<pre>
657/// external ::= 'extern' prototype
658static PrototypeAST *ParseExtern() {
659 getNextToken(); // eat extern.
660 return ParsePrototype();
661}
662</pre>
663</div>
664
665<p>Finally, we'll also let the user type in arbitrary top-level expressions and
666evaluate them on the fly. We will handle this by defining anonymous nullary
667(zero argument) functions for them:</p>
668
669<div class="doc_code">
670<pre>
671/// toplevelexpr ::= expression
672static FunctionAST *ParseTopLevelExpr() {
673 if (ExprAST *E = ParseExpression()) {
674 // Make an anonymous proto.
675 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
676 return new FunctionAST(Proto, E);
677 }
678 return 0;
679}
680</pre>
681</div>
682
683<p>Now that we have all the pieces, lets build a little driver that will let us
684actually <em>execute</em> this code we've built!</p>
685
686</div>
687
688<!-- *********************************************************************** -->
689<div class="doc_section"><a name="driver">The Driver</a></div>
690<!-- *********************************************************************** -->
691
692<div class="doc_text">
693
694<p>The driver for this simply invokes all of the parsing pieces with a top-level
695dispatch loop. There isn't much interesting here, so I'll just include the
696top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
697Parsing" section.</p>
698
699<div class="doc_code">
700<pre>
701/// top ::= definition | external | expression | ';'
702static void MainLoop() {
703 while (1) {
704 fprintf(stderr, "ready&gt; ");
705 switch (CurTok) {
706 case tok_eof: return;
707 case ';': getNextToken(); break; // ignore top level semicolons.
708 case tok_def: HandleDefinition(); break;
709 case tok_extern: HandleExtern(); break;
710 default: HandleTopLevelExpression(); break;
711 }
712 }
713}
714</pre>
715</div>
716
717<p>The most interesting part of this is that we ignore top-level semi colons.
Owen Andersonc6311b92007-10-22 06:48:28 +0000718Why is this, you ask? The basic reason is that if you type "4 + 5" at the
Chris Lattnere6c91042007-10-22 06:34:15 +0000719command line, the parser doesn't know that that is the end of what you will
720type. For example, on the next line you could type "def foo..." in which case
7214+5 is the end of a top-level expression. Alternatively you could type "* 6",
722which would continue the expression. Having top-level semicolons allows you to
723type "4+5;" and the parser will know you are done.</p>
724
725</div>
726
727<!-- *********************************************************************** -->
728<div class="doc_section"><a name="code">Conclusions and the Full Code</a></div>
729<!-- *********************************************************************** -->
730
731<div class="doc_text">
732
733<p>With just under 400 lines of commented code, we fully defined our minimal
734language, including a lexer, parser and AST builder. With this done, the
735executable will validate code and tell us if it is gramatically invalid. For
736example, here is a sample interaction:</p>
737
738<div class="doc_code">
739<pre>
740$ ./a.out
Chris Lattner35abbf52007-10-23 06:23:57 +0000741ready&gt; def foo(x y) x+foo(y, 4.0);
742ready&gt; Parsed an function definition.
743ready&gt; def foo(x y) x+y y;
744ready&gt; Parsed an function definition.
745ready&gt; Parsed a top-level expr
746ready&gt; def foo(x y) x+y );
747ready&gt; Parsed an function definition.
748ready&gt; Error: unknown token when expecting an expression
749ready&gt; extern sin(a);
750ready&gt; Parsed an extern
751ready&gt; ^D
Chris Lattnere6c91042007-10-22 06:34:15 +0000752$
753</pre>
754</div>
755
Chris Lattnerd93a5842007-10-23 05:43:01 +0000756<p>There is a lot of room for extension here. You can define new AST nodes,
757extend the language in many ways, etc. In the <a href="LangImpl3.html">next
758installment</a>, we will describe how to generate LLVM IR from the AST.</p>
759
760</div>
761
762<!-- *********************************************************************** -->
763<div class="doc_section"><a name="code">Full Code Listing</a></div>
764<!-- *********************************************************************** -->
765
766<div class="doc_text">
767
Chris Lattnere6c91042007-10-22 06:34:15 +0000768<p>
Chris Lattnerd93a5842007-10-23 05:43:01 +0000769Here is the complete code listing for this and the previous chapter.
770Note that it is fully self-contained: you don't need LLVM or any external
771libraries at all for this (other than the C and C++ standard libraries of
772course). To build this, just compile with:</p>
Chris Lattnere6c91042007-10-22 06:34:15 +0000773
774<div class="doc_code">
775<pre>
Chris Lattnerd93a5842007-10-23 05:43:01 +0000776 # Compile
777 g++ -g toy.cpp
778 # Run
779 ./a.out
780</pre>
781</div>
Chris Lattnere6c91042007-10-22 06:34:15 +0000782
Chris Lattnerd93a5842007-10-23 05:43:01 +0000783<p>Here is the code:</p>
784
785<div class="doc_code">
786<pre>
Chris Lattnere6c91042007-10-22 06:34:15 +0000787#include &lt;cstdio&gt;
788#include &lt;string&gt;
Chris Lattnerd93a5842007-10-23 05:43:01 +0000789#include &lt;map&gt;
Chris Lattnere6c91042007-10-22 06:34:15 +0000790#include &lt;vector&gt;
791
792//===----------------------------------------------------------------------===//
793// Lexer
794//===----------------------------------------------------------------------===//
795
796// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
797// of these for known things.
798enum Token {
799 tok_eof = -1,
800
801 // commands
802 tok_def = -2, tok_extern = -3,
803
804 // primary
805 tok_identifier = -4, tok_number = -5,
806};
807
808static std::string IdentifierStr; // Filled in if tok_identifier
809static double NumVal; // Filled in if tok_number
810
811/// gettok - Return the next token from standard input.
812static int gettok() {
813 static int LastChar = ' ';
814
815 // Skip any whitespace.
816 while (isspace(LastChar))
817 LastChar = getchar();
818
819 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
820 IdentifierStr = LastChar;
821 while (isalnum((LastChar = getchar())))
822 IdentifierStr += LastChar;
823
824 if (IdentifierStr == "def") return tok_def;
825 if (IdentifierStr == "extern") return tok_extern;
826 return tok_identifier;
827 }
828
829 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
830 std::string NumStr;
831 do {
832 NumStr += LastChar;
833 LastChar = getchar();
834 } while (isdigit(LastChar) || LastChar == '.');
835
836 NumVal = strtod(NumStr.c_str(), 0);
837 return tok_number;
838 }
839
840 if (LastChar == '#') {
841 // Comment until end of line.
842 do LastChar = getchar();
843 while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
844
845 if (LastChar != EOF)
846 return gettok();
847 }
848
849 // Check for end of file. Don't eat the EOF.
850 if (LastChar == EOF)
851 return tok_eof;
852
853 // Otherwise, just return the character as its ascii value.
854 int ThisChar = LastChar;
855 LastChar = getchar();
856 return ThisChar;
857}
858
859//===----------------------------------------------------------------------===//
860// Abstract Syntax Tree (aka Parse Tree)
861//===----------------------------------------------------------------------===//
862
863/// ExprAST - Base class for all expression nodes.
864class ExprAST {
865public:
866 virtual ~ExprAST() {}
867};
868
869/// NumberExprAST - Expression class for numeric literals like "1.0".
870class NumberExprAST : public ExprAST {
871 double Val;
872public:
Chris Lattner28571ed2007-10-23 04:27:44 +0000873 explicit NumberExprAST(double val) : Val(val) {}
Chris Lattnere6c91042007-10-22 06:34:15 +0000874};
875
876/// VariableExprAST - Expression class for referencing a variable, like "a".
877class VariableExprAST : public ExprAST {
878 std::string Name;
879public:
Chris Lattner28571ed2007-10-23 04:27:44 +0000880 explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
Chris Lattnere6c91042007-10-22 06:34:15 +0000881};
882
883/// BinaryExprAST - Expression class for a binary operator.
884class BinaryExprAST : public ExprAST {
885 char Op;
886 ExprAST *LHS, *RHS;
887public:
888 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
889 : Op(op), LHS(lhs), RHS(rhs) {}
890};
891
892/// CallExprAST - Expression class for function calls.
893class CallExprAST : public ExprAST {
894 std::string Callee;
895 std::vector&lt;ExprAST*&gt; Args;
896public:
897 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
898 : Callee(callee), Args(args) {}
899};
900
901/// PrototypeAST - This class represents the "prototype" for a function,
902/// which captures its argument names as well as if it is an operator.
903class PrototypeAST {
904 std::string Name;
905 std::vector&lt; Args;
906public:
907 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
908 : Name(name), Args(args) {}
909
910};
911
912/// FunctionAST - This class represents a function definition itself.
913class FunctionAST {
914 PrototypeAST *Proto;
915 ExprAST *Body;
916public:
917 FunctionAST(PrototypeAST *proto, ExprAST *body)
918 : Proto(proto), Body(body) {}
919
920};
921
922//===----------------------------------------------------------------------===//
923// Parser
924//===----------------------------------------------------------------------===//
925
926/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
927/// token the parser it looking at. getNextToken reads another token from the
928/// lexer and updates CurTok with its results.
929static int CurTok;
930static int getNextToken() {
931 return CurTok = gettok();
932}
933
934/// BinopPrecedence - This holds the precedence for each binary operator that is
935/// defined.
936static std::map&lt;char, int&gt; BinopPrecedence;
937
938/// GetTokPrecedence - Get the precedence of the pending binary operator token.
939static int GetTokPrecedence() {
940 if (!isascii(CurTok))
941 return -1;
942
943 // Make sure it's a declared binop.
944 int TokPrec = BinopPrecedence[CurTok];
945 if (TokPrec &lt;= 0) return -1;
946 return TokPrec;
947}
948
949/// Error* - These are little helper functions for error handling.
950ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
951PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
952FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
953
954static ExprAST *ParseExpression();
955
956/// identifierexpr
957/// ::= identifer
958/// ::= identifer '(' expression* ')'
959static ExprAST *ParseIdentifierExpr() {
960 std::string IdName = IdentifierStr;
961
962 getNextToken(); // eat identifer.
963
964 if (CurTok != '(') // Simple variable ref.
965 return new VariableExprAST(IdName);
966
967 // Call.
968 getNextToken(); // eat (
969 std::vector&lt;ExprAST*&gt; Args;
970 while (1) {
971 ExprAST *Arg = ParseExpression();
972 if (!Arg) return 0;
973 Args.push_back(Arg);
974
975 if (CurTok == ')') break;
976
977 if (CurTok != ',')
978 return Error("Expected ')'");
979 getNextToken();
980 }
981
982 // Eat the ')'.
983 getNextToken();
984
985 return new CallExprAST(IdName, Args);
986}
987
988/// numberexpr ::= number
989static ExprAST *ParseNumberExpr() {
990 ExprAST *Result = new NumberExprAST(NumVal);
991 getNextToken(); // consume the number
992 return Result;
993}
994
995/// parenexpr ::= '(' expression ')'
996static ExprAST *ParseParenExpr() {
997 getNextToken(); // eat (.
998 ExprAST *V = ParseExpression();
999 if (!V) return 0;
1000
1001 if (CurTok != ')')
1002 return Error("expected ')'");
1003 getNextToken(); // eat ).
1004 return V;
1005}
1006
1007/// primary
1008/// ::= identifierexpr
1009/// ::= numberexpr
1010/// ::= parenexpr
1011static ExprAST *ParsePrimary() {
1012 switch (CurTok) {
1013 default: return Error("unknown token when expecting an expression");
1014 case tok_identifier: return ParseIdentifierExpr();
1015 case tok_number: return ParseNumberExpr();
1016 case '(': return ParseParenExpr();
1017 }
1018}
1019
1020/// binoprhs
1021/// ::= ('+' primary)*
1022static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1023 // If this is a binop, find its precedence.
1024 while (1) {
1025 int TokPrec = GetTokPrecedence();
1026
1027 // If this is a binop that binds at least as tightly as the current binop,
1028 // consume it, otherwise we are done.
1029 if (TokPrec &lt; ExprPrec)
1030 return LHS;
1031
1032 // Okay, we know this is a binop.
1033 int BinOp = CurTok;
1034 getNextToken(); // eat binop
1035
1036 // Parse the primary expression after the binary operator.
1037 ExprAST *RHS = ParsePrimary();
1038 if (!RHS) return 0;
1039
1040 // If BinOp binds less tightly with RHS than the operator after RHS, let
1041 // the pending operator take RHS as its LHS.
1042 int NextPrec = GetTokPrecedence();
1043 if (TokPrec &lt; NextPrec) {
1044 RHS = ParseBinOpRHS(TokPrec+1, RHS);
1045 if (RHS == 0) return 0;
1046 }
1047
1048 // Merge LHS/RHS.
1049 LHS = new BinaryExprAST(BinOp, LHS, RHS);
1050 }
1051}
1052
1053/// expression
1054/// ::= primary binoprhs
1055///
1056static ExprAST *ParseExpression() {
1057 ExprAST *LHS = ParsePrimary();
1058 if (!LHS) return 0;
1059
1060 return ParseBinOpRHS(0, LHS);
1061}
1062
1063/// prototype
1064/// ::= id '(' id* ')'
1065static PrototypeAST *ParsePrototype() {
1066 if (CurTok != tok_identifier)
1067 return ErrorP("Expected function name in prototype");
1068
1069 std::string FnName = IdentifierStr;
1070 getNextToken();
1071
1072 if (CurTok != '(')
1073 return ErrorP("Expected '(' in prototype");
1074
1075 std::vector&lt;std::string&gt; ArgNames;
1076 while (getNextToken() == tok_identifier)
1077 ArgNames.push_back(IdentifierStr);
1078 if (CurTok != ')')
1079 return ErrorP("Expected ')' in prototype");
1080
1081 // success.
1082 getNextToken(); // eat ')'.
1083
1084 return new PrototypeAST(FnName, ArgNames);
1085}
1086
1087/// definition ::= 'def' prototype expression
1088static FunctionAST *ParseDefinition() {
1089 getNextToken(); // eat def.
1090 PrototypeAST *Proto = ParsePrototype();
1091 if (Proto == 0) return 0;
1092
1093 if (ExprAST *E = ParseExpression())
1094 return new FunctionAST(Proto, E);
1095 return 0;
1096}
1097
1098/// toplevelexpr ::= expression
1099static FunctionAST *ParseTopLevelExpr() {
1100 if (ExprAST *E = ParseExpression()) {
1101 // Make an anonymous proto.
1102 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;());
1103 return new FunctionAST(Proto, E);
1104 }
1105 return 0;
1106}
1107
1108/// external ::= 'extern' prototype
1109static PrototypeAST *ParseExtern() {
1110 getNextToken(); // eat extern.
1111 return ParsePrototype();
1112}
1113
1114//===----------------------------------------------------------------------===//
1115// Top-Level parsing
1116//===----------------------------------------------------------------------===//
1117
1118static void HandleDefinition() {
1119 if (FunctionAST *F = ParseDefinition()) {
1120 fprintf(stderr, "Parsed a function definition.\n");
1121 } else {
1122 // Skip token for error recovery.
1123 getNextToken();
1124 }
1125}
1126
1127static void HandleExtern() {
1128 if (PrototypeAST *P = ParseExtern()) {
1129 fprintf(stderr, "Parsed an extern\n");
1130 } else {
1131 // Skip token for error recovery.
1132 getNextToken();
1133 }
1134}
1135
1136static void HandleTopLevelExpression() {
1137 // Evaluate a top level expression into an anonymous function.
1138 if (FunctionAST *F = ParseTopLevelExpr()) {
1139 fprintf(stderr, "Parsed a top-level expr\n");
1140 } else {
1141 // Skip token for error recovery.
1142 getNextToken();
1143 }
1144}
1145
1146/// top ::= definition | external | expression | ';'
1147static void MainLoop() {
1148 while (1) {
1149 fprintf(stderr, "ready&gt; ");
1150 switch (CurTok) {
1151 case tok_eof: return;
1152 case ';': getNextToken(); break; // ignore top level semicolons.
1153 case tok_def: HandleDefinition(); break;
1154 case tok_extern: HandleExtern(); break;
1155 default: HandleTopLevelExpression(); break;
1156 }
1157 }
1158}
1159
1160//===----------------------------------------------------------------------===//
1161// Main driver code.
1162//===----------------------------------------------------------------------===//
1163
1164int main() {
1165 // Install standard binary operators.
1166 // 1 is lowest precedence.
1167 BinopPrecedence['&lt;'] = 10;
1168 BinopPrecedence['+'] = 20;
1169 BinopPrecedence['-'] = 20;
1170 BinopPrecedence['*'] = 40; // highest.
1171
1172 // Prime the first token.
1173 fprintf(stderr, "ready&gt; ");
1174 getNextToken();
1175
1176 MainLoop();
1177 return 0;
1178}
1179</pre>
1180</div>
1181</div>
1182
1183<!-- *********************************************************************** -->
1184<hr>
1185<address>
1186 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1187 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1188 <a href="http://validator.w3.org/check/referer"><img
Chris Lattner8eef4b22007-10-23 06:30:50 +00001189 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
Chris Lattnere6c91042007-10-22 06:34:15 +00001190
1191 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1192 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1193 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1194</address>
1195</body>
1196</html>