blob: b8b6112884163ef50fa8471cbaf8877c9ad824b0 [file] [log] [blame]
Chris Lattnere6c91042007-10-22 06:34:15 +00001<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
3
4<html>
5<head>
6 <title>Kaleidoscope: Implementing a Parser and AST</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
10</head>
11
12<body>
13
14<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
15
Chris Lattner128eb862007-11-05 19:06:59 +000016<ul>
Chris Lattner0e555b12007-11-05 20:04:56 +000017<li><a href="index.html">Up to Tutorial Index</a></li>
Chris Lattner128eb862007-11-05 19:06:59 +000018<li>Chapter 2
19 <ol>
20 <li><a href="#intro">Chapter 2 Introduction</a></li>
21 <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
22 <li><a href="#parserbasics">Parser Basics</a></li>
23 <li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
24 <li><a href="#parserbinops">Binary Expression Parsing</a></li>
25 <li><a href="#parsertop">Parsing the Rest</a></li>
26 <li><a href="#driver">The Driver</a></li>
27 <li><a href="#conclusions">Conclusions</a></li>
28 <li><a href="#code">Full Code Listing</a></li>
29 </ol>
30</li>
Chris Lattner0e555b12007-11-05 20:04:56 +000031<li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
Chris Lattner128eb862007-11-05 19:06:59 +000032</ul>
33
Chris Lattnere6c91042007-10-22 06:34:15 +000034<div class="doc_author">
35 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
36</div>
37
38<!-- *********************************************************************** -->
Chris Lattner128eb862007-11-05 19:06:59 +000039<div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
Chris Lattnere6c91042007-10-22 06:34:15 +000040<!-- *********************************************************************** -->
41
42<div class="doc_text">
43
Chris Lattner128eb862007-11-05 19:06:59 +000044<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
45with LLVM</a>" tutorial. This chapter shows you how to use the <a
Chris Lattnere6c91042007-10-22 06:34:15 +000046href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
47href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
48our Kaleidoscope language and build an <a
49href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
50Tree</a> (AST).</p>
51
52<p>The parser we will build uses a combination of <a
53href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
54Parsing</a> and <a href=
55"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
56Parsing</a> to parse the Kaleidoscope language (the later for binary expression
57and the former for everything else). Before we get to parsing though, lets talk
58about the output of the parser: the Abstract Syntax Tree.</p>
59
60</div>
61
62<!-- *********************************************************************** -->
63<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
64<!-- *********************************************************************** -->
65
66<div class="doc_text">
67
68<p>The AST for a program captures its behavior in a way that it is easy for
69later stages of the compiler (e.g. code generation) to interpret. We basically
70want one object for each construct in the language, and the AST should closely
71model the language. In Kaleidoscope, we have expressions, a prototype, and a
72function object. We'll start with expressions first:</p>
73
74<div class="doc_code">
75<pre>
76/// ExprAST - Base class for all expression nodes.
77class ExprAST {
78public:
79 virtual ~ExprAST() {}
80};
81
82/// NumberExprAST - Expression class for numeric literals like "1.0".
83class NumberExprAST : public ExprAST {
84 double Val;
85public:
Chris Lattner28571ed2007-10-23 04:27:44 +000086 explicit NumberExprAST(double val) : Val(val) {}
Chris Lattnere6c91042007-10-22 06:34:15 +000087};
88</pre>
89</div>
90
91<p>The code above shows the definition of the base ExprAST class and one
92subclass which we use for numeric literals. The important thing about this is
93that the NumberExprAST class captures the numeric value of the literal in the
94class, so that later phases of the compiler can know what it is.</p>
95
96<p>Right now we only create the AST, so there are no useful accessor methods on
97them. It would be very easy to add a virtual method to pretty print the code,
98for example. Here are the other expression AST node definitions that we'll use
99in the basic form of the Kaleidoscope language.
100</p>
101
102<div class="doc_code">
103<pre>
104/// VariableExprAST - Expression class for referencing a variable, like "a".
105class VariableExprAST : public ExprAST {
106 std::string Name;
107public:
Chris Lattner28571ed2007-10-23 04:27:44 +0000108 explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
Chris Lattnere6c91042007-10-22 06:34:15 +0000109};
110
111/// BinaryExprAST - Expression class for a binary operator.
112class BinaryExprAST : public ExprAST {
113 char Op;
114 ExprAST *LHS, *RHS;
115public:
116 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
117 : Op(op), LHS(lhs), RHS(rhs) {}
118};
119
120/// CallExprAST - Expression class for function calls.
121class CallExprAST : public ExprAST {
122 std::string Callee;
123 std::vector&lt;ExprAST*&gt; Args;
124public:
125 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
126 : Callee(callee), Args(args) {}
127};
128</pre>
129</div>
130
131<p>This is all (intentially) rather straight-forward: variables capture the
132variable name, binary operators capture their opcode (e.g. '+'), and calls
133capture a function name and list of argument expressions. One thing that is
134nice about our AST is that it captures the language features without talking
135about the syntax of the language. Note that there is no discussion about
136precedence of binary operators, lexical structure etc.</p>
137
138<p>For our basic language, these are all of the expression nodes we'll define.
Owen Andersonc6311b92007-10-22 06:48:28 +0000139Because it doesn't have conditional control flow, it isn't Turing-complete;
Chris Lattnere6c91042007-10-22 06:34:15 +0000140we'll fix that in a later installment. The two things we need next are a way
141to talk about the interface to a function, and a way to talk about functions
142themselves:</p>
143
144<div class="doc_code">
145<pre>
146/// PrototypeAST - This class represents the "prototype" for a function,
147/// which captures its argument names as well as if it is an operator.
148class PrototypeAST {
149 std::string Name;
150 std::vector&lt;std::string&gt; Args;
151public:
152 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
153 : Name(name), Args(args) {}
154};
155
156/// FunctionAST - This class represents a function definition itself.
157class FunctionAST {
158 PrototypeAST *Proto;
159 ExprAST *Body;
160public:
161 FunctionAST(PrototypeAST *proto, ExprAST *body)
162 : Proto(proto), Body(body) {}
163};
164</pre>
165</div>
166
167<p>In Kaleidoscope, functions are typed with just a count of their arguments.
168Since all values are double precision floating point, this fact doesn't need to
169be captured anywhere. In a more aggressive and realistic language, the
170"ExprAST" class would probably have a type field.</p>
171
172<p>With this scaffolding, we can now talk about parsing expressions and function
173bodies in Kaleidoscope.</p>
174
175</div>
176
177<!-- *********************************************************************** -->
178<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
179<!-- *********************************************************************** -->
180
181<div class="doc_text">
182
183<p>Now that we have an AST to build, we need to define the parser code to build
184it. The idea here is that we want to parse something like "x+y" (which is
185returned as three tokens by the lexer) into an AST that could be generated with
186calls like this:</p>
187
188<div class="doc_code">
189<pre>
190 ExprAST *X = new VariableExprAST("x");
191 ExprAST *Y = new VariableExprAST("y");
192 ExprAST *Result = new BinaryExprAST('+', X, Y);
193</pre>
194</div>
195
196<p>In order to do this, we'll start by defining some basic helper routines:</p>
197
198<div class="doc_code">
199<pre>
200/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
201/// token the parser it looking at. getNextToken reads another token from the
202/// lexer and updates CurTok with its results.
203static int CurTok;
204static int getNextToken() {
205 return CurTok = gettok();
206}
207</pre>
208</div>
209
210<p>
211This implements a simple token buffer around the lexer. This allows
212us to look one token ahead at what the lexer is returning. Every function in
Chris Lattnere9495122007-10-25 18:05:29 +0000213our parser will assume that CurTok is the current token that needs to be
Chris Lattnere6c91042007-10-22 06:34:15 +0000214parsed.</p>
215
Chris Lattnerbee175b2007-10-22 16:44:31 +0000216<p>Again, we define these with global variables; it would be better design to
217wrap the entire parser in a class and use instance variables for these.
Chris Lattnere6c91042007-10-22 06:34:15 +0000218</p>
219
220<div class="doc_code">
221<pre>
222
223/// Error* - These are little helper functions for error handling.
224ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
225PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
226FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
227</pre>
228</div>
229
230<p>
231The <tt>Error</tt> routines are simple helper routines that our parser will use
232to handle errors. The error recovery in our parser will not be the best and
Duncan Sands72261ff2007-11-05 16:04:58 +0000233is not particular user-friendly, but it will be enough for our tutorial. These
Chris Lattnere6c91042007-10-22 06:34:15 +0000234routines make it easier to handle errors in routines that have various return
235types: they always return null.</p>
236
237<p>With these basic helper functions implemented, we can implement the first
238piece of our grammar: we'll start with numeric literals.</p>
239
240</div>
241
242<!-- *********************************************************************** -->
243<div class="doc_section"><a name="parserprimexprs">Basic Expression
244 Parsing</a></div>
245<!-- *********************************************************************** -->
246
247<div class="doc_text">
248
249<p>We start with numeric literals, because they are the simplest to process.
250For each production in our grammar, we'll define a function which parses that
251production. For numeric literals, we have:
252</p>
253
254<div class="doc_code">
255<pre>
256/// numberexpr ::= number
257static ExprAST *ParseNumberExpr() {
258 ExprAST *Result = new NumberExprAST(NumVal);
259 getNextToken(); // consume the number
260 return Result;
261}
262</pre>
263</div>
264
265<p>This routine is very simple: it expects to be called when the current token
266is a <tt>tok_number</tt> token. It takes the current number value, creates
267a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
268returns.</p>
269
270<p>There are some interesting aspects of this. The most important one is that
271this routine eats all of the tokens that correspond to the production, and
272returns the lexer buffer with the next token (which is not part of the grammar
273production) ready to go. This is a fairly standard way to go for recursive
274descent parsers. For a better example, the parenthesis operator is defined like
275this:</p>
276
277<div class="doc_code">
278<pre>
279/// parenexpr ::= '(' expression ')'
280static ExprAST *ParseParenExpr() {
281 getNextToken(); // eat (.
282 ExprAST *V = ParseExpression();
283 if (!V) return 0;
284
285 if (CurTok != ')')
286 return Error("expected ')'");
287 getNextToken(); // eat ).
288 return V;
289}
290</pre>
291</div>
292
293<p>This function illustrates a number of interesting things about the parser:
2941) it shows how we use the Error routines. When called, this function expects
295that the current token is a '(' token, but after parsing the subexpression, it
296is possible that there is not a ')' waiting. For example, if the user types in
297"(4 x" instead of "(4)", the parser should emit an error. Because errors can
298occur, the parser needs a way to indicate that they happened: in our parser, we
299return null on an error.</p>
300
301<p>Another interesting aspect of this function is that it uses recursion by
Owen Andersonc6311b92007-10-22 06:48:28 +0000302calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
Chris Lattnere6c91042007-10-22 06:34:15 +0000303<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
304recursive grammars, and keeps each production very simple. Note that
Duncan Sands72261ff2007-11-05 16:04:58 +0000305parentheses do not cause construction of AST nodes themselves. While we could
Chris Lattnere6c91042007-10-22 06:34:15 +0000306do this, the most important role of parens are to guide the parser and provide
307grouping. Once the parser constructs the AST, parens are not needed.</p>
308
309<p>The next simple production is for handling variable references and function
310calls:</p>
311
312<div class="doc_code">
313<pre>
314/// identifierexpr
Chris Lattner20a0c802007-11-05 17:54:34 +0000315/// ::= identifier
316/// ::= identifier '(' expression* ')'
Chris Lattnere6c91042007-10-22 06:34:15 +0000317static ExprAST *ParseIdentifierExpr() {
318 std::string IdName = IdentifierStr;
319
Chris Lattner20a0c802007-11-05 17:54:34 +0000320 getNextToken(); // eat identifier.
Chris Lattnere6c91042007-10-22 06:34:15 +0000321
322 if (CurTok != '(') // Simple variable ref.
323 return new VariableExprAST(IdName);
324
325 // Call.
326 getNextToken(); // eat (
327 std::vector&lt;ExprAST*&gt; Args;
Chris Lattner71155212007-11-06 01:39:12 +0000328 if (CurTok != ')') {
329 while (1) {
330 ExprAST *Arg = ParseExpression();
331 if (!Arg) return 0;
332 Args.push_back(Arg);
Chris Lattnere6c91042007-10-22 06:34:15 +0000333
Chris Lattner71155212007-11-06 01:39:12 +0000334 if (CurTok == ')') break;
Chris Lattnere6c91042007-10-22 06:34:15 +0000335
Chris Lattner71155212007-11-06 01:39:12 +0000336 if (CurTok != ',')
337 return Error("Expected ')'");
338 getNextToken();
339 }
Chris Lattnere6c91042007-10-22 06:34:15 +0000340 }
341
342 // Eat the ')'.
343 getNextToken();
344
345 return new CallExprAST(IdName, Args);
346}
347</pre>
348</div>
349
350<p>This routine follows the same style as the other routines (it expects to be
351called if the current token is a <tt>tok_identifier</tt> token). It also has
352recursion and error handling. One interesting aspect of this is that it uses
353<em>look-ahead</em> to determine if the current identifier is a stand alone
354variable reference or if it is a function call expression. It handles this by
355checking to see if the token after the identifier is a '(' token, and constructs
356either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
357</p>
358
359<p>Now that we have all of our simple expression parsing logic in place, we can
360define a helper function to wrap them up in a class. We call this class of
361expressions "primary" expressions, for reasons that will become more clear
362later. In order to parse a primary expression, we need to determine what sort
363of expression it is:</p>
364
365<div class="doc_code">
366<pre>
367/// primary
368/// ::= identifierexpr
369/// ::= numberexpr
370/// ::= parenexpr
371static ExprAST *ParsePrimary() {
372 switch (CurTok) {
373 default: return Error("unknown token when expecting an expression");
374 case tok_identifier: return ParseIdentifierExpr();
375 case tok_number: return ParseNumberExpr();
376 case '(': return ParseParenExpr();
377 }
378}
379</pre>
380</div>
381
382<p>Now that you see the definition of this function, it makes it more obvious
383why we can assume the state of CurTok in the various functions. This uses
384look-ahead to determine which sort of expression is being inspected, and parses
385it with a function call.</p>
386
387<p>Now that basic expressions are handled, we need to handle binary expressions,
388which are a bit more complex.</p>
389
390</div>
391
392<!-- *********************************************************************** -->
393<div class="doc_section"><a name="parserbinops">Binary Expression
394 Parsing</a></div>
395<!-- *********************************************************************** -->
396
397<div class="doc_text">
398
399<p>Binary expressions are significantly harder to parse because they are often
400ambiguous. For example, when given the string "x+y*z", the parser can choose
401to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
402mathematics, we expect the later parse, because "*" (multiplication) has
403higher <em>precedence</em> than "+" (addition).</p>
404
405<p>There are many ways to handle this, but an elegant and efficient way is to
406use <a href=
407"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
408Parsing</a>. This parsing technique uses the precedence of binary operators to
409guide recursion. To start with, we need a table of precedences:</p>
410
411<div class="doc_code">
412<pre>
413/// BinopPrecedence - This holds the precedence for each binary operator that is
414/// defined.
415static std::map&lt;char, int&gt; BinopPrecedence;
416
417/// GetTokPrecedence - Get the precedence of the pending binary operator token.
418static int GetTokPrecedence() {
419 if (!isascii(CurTok))
420 return -1;
421
422 // Make sure it's a declared binop.
423 int TokPrec = BinopPrecedence[CurTok];
424 if (TokPrec &lt;= 0) return -1;
425 return TokPrec;
426}
427
428int main() {
429 // Install standard binary operators.
430 // 1 is lowest precedence.
431 BinopPrecedence['&lt;'] = 10;
432 BinopPrecedence['+'] = 20;
433 BinopPrecedence['-'] = 20;
434 BinopPrecedence['*'] = 40; // highest.
435 ...
436}
437</pre>
438</div>
439
440<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
441(this can obviously be extended by you, the reader). The
442<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
443or -1 if the token is not a binary operator. Having a map makes it easy to add
444new operators and makes it clear that the algorithm doesn't depend on the
445specific operators involved, but it would be easy enough to eliminate the map
446and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
447
448<p>With the helper above defined, we can now start parsing binary expressions.
449The basic idea of operator precedence parsing is to break down an expression
450with potentially ambiguous binary operators into pieces. Consider for example
451the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
452as a stream of primary expressions separated by binary operators. As such,
453it will first parse the leading primary expression "a", then it will see the
454pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
Duncan Sands72261ff2007-11-05 16:04:58 +0000455are primary expressions, the binary expression parser doesn't need to worry
Chris Lattnere6c91042007-10-22 06:34:15 +0000456about nested subexpressions like (c+d) at all.
457</p>
458
459<p>
460To start, an expression is a primary expression potentially followed by a
461sequence of [binop,primaryexpr] pairs:</p>
462
463<div class="doc_code">
464<pre>
465/// expression
466/// ::= primary binoprhs
467///
468static ExprAST *ParseExpression() {
469 ExprAST *LHS = ParsePrimary();
470 if (!LHS) return 0;
471
472 return ParseBinOpRHS(0, LHS);
473}
474</pre>
475</div>
476
477<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
478us. It takes a precedence and a pointer to an expression for the part parsed
479so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
480allowed to be empty, in which case it returns the expression that is passed into
481it. In our example above, the code passes the expression for "a" into
482<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
483
484<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
485minimal operator precedence</em> that the function is allowed to eat. For
486example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
487passed in a precedence of 40, it will not consume any tokens (because the
488precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
489with:</p>
490
491<div class="doc_code">
492<pre>
493/// binoprhs
494/// ::= ('+' primary)*
495static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
496 // If this is a binop, find its precedence.
497 while (1) {
498 int TokPrec = GetTokPrecedence();
499
500 // If this is a binop that binds at least as tightly as the current binop,
501 // consume it, otherwise we are done.
502 if (TokPrec &lt; ExprPrec)
503 return LHS;
504</pre>
505</div>
506
507<p>This code gets the precedence of the current token and checks to see if if is
508too low. Because we defined invalid tokens to have a precedence of -1, this
509check implicitly knows that the pair-stream ends when the token stream runs out
510of binary operators. If this check succeeds, we know that the token is a binary
511operator and that it will be included in this expression:</p>
512
513<div class="doc_code">
514<pre>
515 // Okay, we know this is a binop.
516 int BinOp = CurTok;
517 getNextToken(); // eat binop
518
519 // Parse the primary expression after the binary operator.
520 ExprAST *RHS = ParsePrimary();
521 if (!RHS) return 0;
522</pre>
523</div>
524
525<p>As such, this code eats (and remembers) the binary operator and then parses
526the following primary expression. This builds up the whole pair, the first of
527which is [+, b] for the running example.</p>
528
529<p>Now that we parsed the left-hand side of an expression and one pair of the
530RHS sequence, we have to decide which way the expression associates. In
531particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
532To determine this, we look ahead at "binop" to determine its precedence and
533compare it to BinOp's precedence (which is '+' in this case):</p>
534
535<div class="doc_code">
536<pre>
537 // If BinOp binds less tightly with RHS than the operator after RHS, let
538 // the pending operator take RHS as its LHS.
539 int NextPrec = GetTokPrecedence();
540 if (TokPrec &lt; NextPrec) {
541</pre>
542</div>
543
544<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
545precedence of our current operator, then we know that the parentheses associate
546as "(a+b) binop ...". In our example, since the next operator is "+" and so is
547our current one, we know that they have the same precedence. In this case we'll
548create the AST node for "a+b", and then continue parsing:</p>
549
550<div class="doc_code">
551<pre>
552 ... if body omitted ...
553 }
554
555 // Merge LHS/RHS.
556 LHS = new BinaryExprAST(BinOp, LHS, RHS);
557 } // loop around to the top of the while loop.
558}
559</pre>
560</div>
561
562<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
563iteration of the loop, with "+" as the current token. The code above will eat
564and remember it and parse "(c+d)" as the primary expression, which makes the
565current pair be [+, (c+d)]. It will then enter the 'if' above with "*" as the
566binop to the right of the primary. In this case, the precedence of "*" is
567higher than the precedence of "+" so the if condition will be entered.</p>
568
569<p>The critical question left here is "how can the if condition parse the right
570hand side in full"? In particular, to build the AST correctly for our example,
571it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
572do this is surprisingly simple (code from the above two blocks duplicated for
573context):</p>
574
575<div class="doc_code">
576<pre>
577 // If BinOp binds less tightly with RHS than the operator after RHS, let
578 // the pending operator take RHS as its LHS.
579 int NextPrec = GetTokPrecedence();
580 if (TokPrec &lt; NextPrec) {
581 RHS = ParseBinOpRHS(TokPrec+1, RHS);
582 if (RHS == 0) return 0;
583 }
584 // Merge LHS/RHS.
585 LHS = new BinaryExprAST(BinOp, LHS, RHS);
586 } // loop around to the top of the while loop.
587}
588</pre>
589</div>
590
591<p>At this point, we know that the binary operator to the RHS of our primary
592has higher precedence than the binop we are currently parsing. As such, we know
593that any sequence of pairs whose operators are all higher precedence than "+"
594should be parsed together and returned as "RHS". To do this, we recursively
595invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
596precedence required for it to continue. In our example above, this will cause
597it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
598of the '+' expression.</p>
599
600<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
601and added to the AST. With this little bit of code (14 non-trivial lines), we
602correctly handle fully general binary expression parsing in a very elegant way.
603</p>
604
605<p>This wraps up handling of expressions. At this point, we can point the
606parser at an arbitrary token stream and build an expression from them, stopping
607at the first token that is not part of the expression. Next up we need to
608handle function definitions etc.</p>
609
610</div>
611
612<!-- *********************************************************************** -->
613<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
614<!-- *********************************************************************** -->
615
616<div class="doc_text">
617
618<p>
619The first basic thing missing is that of function prototypes. In Kaleidoscope,
620these are used both for 'extern' function declarations as well as function body
621definitions. The code to do this is straight-forward and not very interesting
622(once you've survived expressions):
623</p>
624
625<div class="doc_code">
626<pre>
627/// prototype
628/// ::= id '(' id* ')'
629static PrototypeAST *ParsePrototype() {
630 if (CurTok != tok_identifier)
631 return ErrorP("Expected function name in prototype");
632
633 std::string FnName = IdentifierStr;
634 getNextToken();
635
636 if (CurTok != '(')
637 return ErrorP("Expected '(' in prototype");
638
639 std::vector&lt;std::string&gt; ArgNames;
640 while (getNextToken() == tok_identifier)
641 ArgNames.push_back(IdentifierStr);
642 if (CurTok != ')')
643 return ErrorP("Expected ')' in prototype");
644
645 // success.
646 getNextToken(); // eat ')'.
647
648 return new PrototypeAST(FnName, ArgNames);
649}
650</pre>
651</div>
652
653<p>Given this, a function definition is very simple, just a prototype plus
Duncan Sands72261ff2007-11-05 16:04:58 +0000654an expression to implement the body:</p>
Chris Lattnere6c91042007-10-22 06:34:15 +0000655
656<div class="doc_code">
657<pre>
658/// definition ::= 'def' prototype expression
659static FunctionAST *ParseDefinition() {
660 getNextToken(); // eat def.
661 PrototypeAST *Proto = ParsePrototype();
662 if (Proto == 0) return 0;
663
664 if (ExprAST *E = ParseExpression())
665 return new FunctionAST(Proto, E);
666 return 0;
667}
668</pre>
669</div>
670
671<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
672well as to support forward declaration of user functions. 'externs' are just
673prototypes with no body:</p>
674
675<div class="doc_code">
676<pre>
677/// external ::= 'extern' prototype
678static PrototypeAST *ParseExtern() {
679 getNextToken(); // eat extern.
680 return ParsePrototype();
681}
682</pre>
683</div>
684
685<p>Finally, we'll also let the user type in arbitrary top-level expressions and
686evaluate them on the fly. We will handle this by defining anonymous nullary
687(zero argument) functions for them:</p>
688
689<div class="doc_code">
690<pre>
691/// toplevelexpr ::= expression
692static FunctionAST *ParseTopLevelExpr() {
693 if (ExprAST *E = ParseExpression()) {
694 // Make an anonymous proto.
695 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
696 return new FunctionAST(Proto, E);
697 }
698 return 0;
699}
700</pre>
701</div>
702
703<p>Now that we have all the pieces, lets build a little driver that will let us
704actually <em>execute</em> this code we've built!</p>
705
706</div>
707
708<!-- *********************************************************************** -->
709<div class="doc_section"><a name="driver">The Driver</a></div>
710<!-- *********************************************************************** -->
711
712<div class="doc_text">
713
714<p>The driver for this simply invokes all of the parsing pieces with a top-level
715dispatch loop. There isn't much interesting here, so I'll just include the
716top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
717Parsing" section.</p>
718
719<div class="doc_code">
720<pre>
721/// top ::= definition | external | expression | ';'
722static void MainLoop() {
723 while (1) {
724 fprintf(stderr, "ready&gt; ");
725 switch (CurTok) {
726 case tok_eof: return;
727 case ';': getNextToken(); break; // ignore top level semicolons.
728 case tok_def: HandleDefinition(); break;
729 case tok_extern: HandleExtern(); break;
730 default: HandleTopLevelExpression(); break;
731 }
732 }
733}
734</pre>
735</div>
736
737<p>The most interesting part of this is that we ignore top-level semi colons.
Owen Andersonc6311b92007-10-22 06:48:28 +0000738Why is this, you ask? The basic reason is that if you type "4 + 5" at the
Chris Lattnere6c91042007-10-22 06:34:15 +0000739command line, the parser doesn't know that that is the end of what you will
740type. For example, on the next line you could type "def foo..." in which case
7414+5 is the end of a top-level expression. Alternatively you could type "* 6",
742which would continue the expression. Having top-level semicolons allows you to
743type "4+5;" and the parser will know you are done.</p>
744
745</div>
746
747<!-- *********************************************************************** -->
Chris Lattner128eb862007-11-05 19:06:59 +0000748<div class="doc_section"><a name="conclusions">Conclusions</a></div>
Chris Lattnere6c91042007-10-22 06:34:15 +0000749<!-- *********************************************************************** -->
750
751<div class="doc_text">
752
753<p>With just under 400 lines of commented code, we fully defined our minimal
754language, including a lexer, parser and AST builder. With this done, the
755executable will validate code and tell us if it is gramatically invalid. For
756example, here is a sample interaction:</p>
757
758<div class="doc_code">
759<pre>
760$ ./a.out
Chris Lattner35abbf52007-10-23 06:23:57 +0000761ready&gt; def foo(x y) x+foo(y, 4.0);
Chris Lattner01fcc042007-11-05 17:38:34 +0000762ready&gt; Parsed a function definition.
Chris Lattner35abbf52007-10-23 06:23:57 +0000763ready&gt; def foo(x y) x+y y;
Chris Lattner01fcc042007-11-05 17:38:34 +0000764ready&gt; Parsed a function definition.
Chris Lattner35abbf52007-10-23 06:23:57 +0000765ready&gt; Parsed a top-level expr
766ready&gt; def foo(x y) x+y );
Chris Lattner01fcc042007-11-05 17:38:34 +0000767ready&gt; Parsed a function definition.
Chris Lattner35abbf52007-10-23 06:23:57 +0000768ready&gt; Error: unknown token when expecting an expression
769ready&gt; extern sin(a);
770ready&gt; Parsed an extern
771ready&gt; ^D
Chris Lattnere6c91042007-10-22 06:34:15 +0000772$
773</pre>
774</div>
775
Chris Lattnerd93a5842007-10-23 05:43:01 +0000776<p>There is a lot of room for extension here. You can define new AST nodes,
777extend the language in many ways, etc. In the <a href="LangImpl3.html">next
778installment</a>, we will describe how to generate LLVM IR from the AST.</p>
779
780</div>
781
782<!-- *********************************************************************** -->
783<div class="doc_section"><a name="code">Full Code Listing</a></div>
784<!-- *********************************************************************** -->
785
786<div class="doc_text">
787
Chris Lattnere6c91042007-10-22 06:34:15 +0000788<p>
Chris Lattnerd93a5842007-10-23 05:43:01 +0000789Here is the complete code listing for this and the previous chapter.
790Note that it is fully self-contained: you don't need LLVM or any external
791libraries at all for this (other than the C and C++ standard libraries of
792course). To build this, just compile with:</p>
Chris Lattnere6c91042007-10-22 06:34:15 +0000793
794<div class="doc_code">
795<pre>
Chris Lattnerd93a5842007-10-23 05:43:01 +0000796 # Compile
797 g++ -g toy.cpp
798 # Run
799 ./a.out
800</pre>
801</div>
Chris Lattnere6c91042007-10-22 06:34:15 +0000802
Chris Lattnerd93a5842007-10-23 05:43:01 +0000803<p>Here is the code:</p>
804
805<div class="doc_code">
806<pre>
Chris Lattnere6c91042007-10-22 06:34:15 +0000807#include &lt;cstdio&gt;
808#include &lt;string&gt;
Chris Lattnerd93a5842007-10-23 05:43:01 +0000809#include &lt;map&gt;
Chris Lattnere6c91042007-10-22 06:34:15 +0000810#include &lt;vector&gt;
811
812//===----------------------------------------------------------------------===//
813// Lexer
814//===----------------------------------------------------------------------===//
815
816// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
817// of these for known things.
818enum Token {
819 tok_eof = -1,
820
821 // commands
822 tok_def = -2, tok_extern = -3,
823
824 // primary
825 tok_identifier = -4, tok_number = -5,
826};
827
828static std::string IdentifierStr; // Filled in if tok_identifier
829static double NumVal; // Filled in if tok_number
830
831/// gettok - Return the next token from standard input.
832static int gettok() {
833 static int LastChar = ' ';
834
835 // Skip any whitespace.
836 while (isspace(LastChar))
837 LastChar = getchar();
838
839 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
840 IdentifierStr = LastChar;
841 while (isalnum((LastChar = getchar())))
842 IdentifierStr += LastChar;
843
844 if (IdentifierStr == "def") return tok_def;
845 if (IdentifierStr == "extern") return tok_extern;
846 return tok_identifier;
847 }
848
849 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
850 std::string NumStr;
851 do {
852 NumStr += LastChar;
853 LastChar = getchar();
854 } while (isdigit(LastChar) || LastChar == '.');
855
856 NumVal = strtod(NumStr.c_str(), 0);
857 return tok_number;
858 }
859
860 if (LastChar == '#') {
861 // Comment until end of line.
862 do LastChar = getchar();
863 while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
864
865 if (LastChar != EOF)
866 return gettok();
867 }
868
869 // Check for end of file. Don't eat the EOF.
870 if (LastChar == EOF)
871 return tok_eof;
872
873 // Otherwise, just return the character as its ascii value.
874 int ThisChar = LastChar;
875 LastChar = getchar();
876 return ThisChar;
877}
878
879//===----------------------------------------------------------------------===//
880// Abstract Syntax Tree (aka Parse Tree)
881//===----------------------------------------------------------------------===//
882
883/// ExprAST - Base class for all expression nodes.
884class ExprAST {
885public:
886 virtual ~ExprAST() {}
887};
888
889/// NumberExprAST - Expression class for numeric literals like "1.0".
890class NumberExprAST : public ExprAST {
891 double Val;
892public:
Chris Lattner28571ed2007-10-23 04:27:44 +0000893 explicit NumberExprAST(double val) : Val(val) {}
Chris Lattnere6c91042007-10-22 06:34:15 +0000894};
895
896/// VariableExprAST - Expression class for referencing a variable, like "a".
897class VariableExprAST : public ExprAST {
898 std::string Name;
899public:
Chris Lattner28571ed2007-10-23 04:27:44 +0000900 explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
Chris Lattnere6c91042007-10-22 06:34:15 +0000901};
902
903/// BinaryExprAST - Expression class for a binary operator.
904class BinaryExprAST : public ExprAST {
905 char Op;
906 ExprAST *LHS, *RHS;
907public:
908 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
909 : Op(op), LHS(lhs), RHS(rhs) {}
910};
911
912/// CallExprAST - Expression class for function calls.
913class CallExprAST : public ExprAST {
914 std::string Callee;
915 std::vector&lt;ExprAST*&gt; Args;
916public:
917 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
918 : Callee(callee), Args(args) {}
919};
920
921/// PrototypeAST - This class represents the "prototype" for a function,
922/// which captures its argument names as well as if it is an operator.
923class PrototypeAST {
924 std::string Name;
925 std::vector&lt; Args;
926public:
927 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
928 : Name(name), Args(args) {}
929
930};
931
932/// FunctionAST - This class represents a function definition itself.
933class FunctionAST {
934 PrototypeAST *Proto;
935 ExprAST *Body;
936public:
937 FunctionAST(PrototypeAST *proto, ExprAST *body)
938 : Proto(proto), Body(body) {}
939
940};
941
942//===----------------------------------------------------------------------===//
943// Parser
944//===----------------------------------------------------------------------===//
945
946/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
947/// token the parser it looking at. getNextToken reads another token from the
948/// lexer and updates CurTok with its results.
949static int CurTok;
950static int getNextToken() {
951 return CurTok = gettok();
952}
953
954/// BinopPrecedence - This holds the precedence for each binary operator that is
955/// defined.
956static std::map&lt;char, int&gt; BinopPrecedence;
957
958/// GetTokPrecedence - Get the precedence of the pending binary operator token.
959static int GetTokPrecedence() {
960 if (!isascii(CurTok))
961 return -1;
962
963 // Make sure it's a declared binop.
964 int TokPrec = BinopPrecedence[CurTok];
965 if (TokPrec &lt;= 0) return -1;
966 return TokPrec;
967}
968
969/// Error* - These are little helper functions for error handling.
970ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
971PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
972FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
973
974static ExprAST *ParseExpression();
975
976/// identifierexpr
Chris Lattner20a0c802007-11-05 17:54:34 +0000977/// ::= identifier
978/// ::= identifier '(' expression* ')'
Chris Lattnere6c91042007-10-22 06:34:15 +0000979static ExprAST *ParseIdentifierExpr() {
980 std::string IdName = IdentifierStr;
981
Chris Lattner20a0c802007-11-05 17:54:34 +0000982 getNextToken(); // eat identifier.
Chris Lattnere6c91042007-10-22 06:34:15 +0000983
984 if (CurTok != '(') // Simple variable ref.
985 return new VariableExprAST(IdName);
986
987 // Call.
988 getNextToken(); // eat (
989 std::vector&lt;ExprAST*&gt; Args;
Chris Lattner71155212007-11-06 01:39:12 +0000990 if (CurTok != ')') {
991 while (1) {
992 ExprAST *Arg = ParseExpression();
993 if (!Arg) return 0;
994 Args.push_back(Arg);
Chris Lattnere6c91042007-10-22 06:34:15 +0000995
Chris Lattner71155212007-11-06 01:39:12 +0000996 if (CurTok == ')') break;
Chris Lattnere6c91042007-10-22 06:34:15 +0000997
Chris Lattner71155212007-11-06 01:39:12 +0000998 if (CurTok != ',')
999 return Error("Expected ')'");
1000 getNextToken();
1001 }
Chris Lattnere6c91042007-10-22 06:34:15 +00001002 }
1003
1004 // Eat the ')'.
1005 getNextToken();
1006
1007 return new CallExprAST(IdName, Args);
1008}
1009
1010/// numberexpr ::= number
1011static ExprAST *ParseNumberExpr() {
1012 ExprAST *Result = new NumberExprAST(NumVal);
1013 getNextToken(); // consume the number
1014 return Result;
1015}
1016
1017/// parenexpr ::= '(' expression ')'
1018static ExprAST *ParseParenExpr() {
1019 getNextToken(); // eat (.
1020 ExprAST *V = ParseExpression();
1021 if (!V) return 0;
1022
1023 if (CurTok != ')')
1024 return Error("expected ')'");
1025 getNextToken(); // eat ).
1026 return V;
1027}
1028
1029/// primary
1030/// ::= identifierexpr
1031/// ::= numberexpr
1032/// ::= parenexpr
1033static ExprAST *ParsePrimary() {
1034 switch (CurTok) {
1035 default: return Error("unknown token when expecting an expression");
1036 case tok_identifier: return ParseIdentifierExpr();
1037 case tok_number: return ParseNumberExpr();
1038 case '(': return ParseParenExpr();
1039 }
1040}
1041
1042/// binoprhs
1043/// ::= ('+' primary)*
1044static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1045 // If this is a binop, find its precedence.
1046 while (1) {
1047 int TokPrec = GetTokPrecedence();
1048
1049 // If this is a binop that binds at least as tightly as the current binop,
1050 // consume it, otherwise we are done.
1051 if (TokPrec &lt; ExprPrec)
1052 return LHS;
1053
1054 // Okay, we know this is a binop.
1055 int BinOp = CurTok;
1056 getNextToken(); // eat binop
1057
1058 // Parse the primary expression after the binary operator.
1059 ExprAST *RHS = ParsePrimary();
1060 if (!RHS) return 0;
1061
1062 // If BinOp binds less tightly with RHS than the operator after RHS, let
1063 // the pending operator take RHS as its LHS.
1064 int NextPrec = GetTokPrecedence();
1065 if (TokPrec &lt; NextPrec) {
1066 RHS = ParseBinOpRHS(TokPrec+1, RHS);
1067 if (RHS == 0) return 0;
1068 }
1069
1070 // Merge LHS/RHS.
1071 LHS = new BinaryExprAST(BinOp, LHS, RHS);
1072 }
1073}
1074
1075/// expression
1076/// ::= primary binoprhs
1077///
1078static ExprAST *ParseExpression() {
1079 ExprAST *LHS = ParsePrimary();
1080 if (!LHS) return 0;
1081
1082 return ParseBinOpRHS(0, LHS);
1083}
1084
1085/// prototype
1086/// ::= id '(' id* ')'
1087static PrototypeAST *ParsePrototype() {
1088 if (CurTok != tok_identifier)
1089 return ErrorP("Expected function name in prototype");
1090
1091 std::string FnName = IdentifierStr;
1092 getNextToken();
1093
1094 if (CurTok != '(')
1095 return ErrorP("Expected '(' in prototype");
1096
1097 std::vector&lt;std::string&gt; ArgNames;
1098 while (getNextToken() == tok_identifier)
1099 ArgNames.push_back(IdentifierStr);
1100 if (CurTok != ')')
1101 return ErrorP("Expected ')' in prototype");
1102
1103 // success.
1104 getNextToken(); // eat ')'.
1105
1106 return new PrototypeAST(FnName, ArgNames);
1107}
1108
1109/// definition ::= 'def' prototype expression
1110static FunctionAST *ParseDefinition() {
1111 getNextToken(); // eat def.
1112 PrototypeAST *Proto = ParsePrototype();
1113 if (Proto == 0) return 0;
1114
1115 if (ExprAST *E = ParseExpression())
1116 return new FunctionAST(Proto, E);
1117 return 0;
1118}
1119
1120/// toplevelexpr ::= expression
1121static FunctionAST *ParseTopLevelExpr() {
1122 if (ExprAST *E = ParseExpression()) {
1123 // Make an anonymous proto.
1124 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;());
1125 return new FunctionAST(Proto, E);
1126 }
1127 return 0;
1128}
1129
1130/// external ::= 'extern' prototype
1131static PrototypeAST *ParseExtern() {
1132 getNextToken(); // eat extern.
1133 return ParsePrototype();
1134}
1135
1136//===----------------------------------------------------------------------===//
1137// Top-Level parsing
1138//===----------------------------------------------------------------------===//
1139
1140static void HandleDefinition() {
1141 if (FunctionAST *F = ParseDefinition()) {
1142 fprintf(stderr, "Parsed a function definition.\n");
1143 } else {
1144 // Skip token for error recovery.
1145 getNextToken();
1146 }
1147}
1148
1149static void HandleExtern() {
1150 if (PrototypeAST *P = ParseExtern()) {
1151 fprintf(stderr, "Parsed an extern\n");
1152 } else {
1153 // Skip token for error recovery.
1154 getNextToken();
1155 }
1156}
1157
1158static void HandleTopLevelExpression() {
1159 // Evaluate a top level expression into an anonymous function.
1160 if (FunctionAST *F = ParseTopLevelExpr()) {
1161 fprintf(stderr, "Parsed a top-level expr\n");
1162 } else {
1163 // Skip token for error recovery.
1164 getNextToken();
1165 }
1166}
1167
1168/// top ::= definition | external | expression | ';'
1169static void MainLoop() {
1170 while (1) {
1171 fprintf(stderr, "ready&gt; ");
1172 switch (CurTok) {
1173 case tok_eof: return;
1174 case ';': getNextToken(); break; // ignore top level semicolons.
1175 case tok_def: HandleDefinition(); break;
1176 case tok_extern: HandleExtern(); break;
1177 default: HandleTopLevelExpression(); break;
1178 }
1179 }
1180}
1181
1182//===----------------------------------------------------------------------===//
1183// Main driver code.
1184//===----------------------------------------------------------------------===//
1185
1186int main() {
1187 // Install standard binary operators.
1188 // 1 is lowest precedence.
1189 BinopPrecedence['&lt;'] = 10;
1190 BinopPrecedence['+'] = 20;
1191 BinopPrecedence['-'] = 20;
1192 BinopPrecedence['*'] = 40; // highest.
1193
1194 // Prime the first token.
1195 fprintf(stderr, "ready&gt; ");
1196 getNextToken();
1197
1198 MainLoop();
1199 return 0;
1200}
1201</pre>
1202</div>
1203</div>
1204
1205<!-- *********************************************************************** -->
1206<hr>
1207<address>
1208 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1209 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1210 <a href="http://validator.w3.org/check/referer"><img
Chris Lattner8eef4b22007-10-23 06:30:50 +00001211 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
Chris Lattnere6c91042007-10-22 06:34:15 +00001212
1213 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1214 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1215 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1216</address>
1217</body>
1218</html>