blob: 7eed8d644b29e8e72abc8a7977a4e6f0b529c7c0 [file] [log] [blame]
Chris Lattnere6c91042007-10-22 06:34:15 +00001<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
3
4<html>
5<head>
6 <title>Kaleidoscope: Implementing a Parser and AST</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
10</head>
11
12<body>
13
14<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
15
Chris Lattner128eb862007-11-05 19:06:59 +000016<ul>
Chris Lattner0e555b12007-11-05 20:04:56 +000017<li><a href="index.html">Up to Tutorial Index</a></li>
Chris Lattner128eb862007-11-05 19:06:59 +000018<li>Chapter 2
19 <ol>
20 <li><a href="#intro">Chapter 2 Introduction</a></li>
21 <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
22 <li><a href="#parserbasics">Parser Basics</a></li>
23 <li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
24 <li><a href="#parserbinops">Binary Expression Parsing</a></li>
25 <li><a href="#parsertop">Parsing the Rest</a></li>
26 <li><a href="#driver">The Driver</a></li>
27 <li><a href="#conclusions">Conclusions</a></li>
28 <li><a href="#code">Full Code Listing</a></li>
29 </ol>
30</li>
Chris Lattner0e555b12007-11-05 20:04:56 +000031<li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
Chris Lattner128eb862007-11-05 19:06:59 +000032</ul>
33
Chris Lattnere6c91042007-10-22 06:34:15 +000034<div class="doc_author">
35 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
36</div>
37
38<!-- *********************************************************************** -->
Chris Lattner128eb862007-11-05 19:06:59 +000039<div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
Chris Lattnere6c91042007-10-22 06:34:15 +000040<!-- *********************************************************************** -->
41
42<div class="doc_text">
43
Chris Lattner128eb862007-11-05 19:06:59 +000044<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
45with LLVM</a>" tutorial. This chapter shows you how to use the <a
Chris Lattnere6c91042007-10-22 06:34:15 +000046href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
47href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
48our Kaleidoscope language and build an <a
49href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
50Tree</a> (AST).</p>
51
52<p>The parser we will build uses a combination of <a
53href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
54Parsing</a> and <a href=
55"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
56Parsing</a> to parse the Kaleidoscope language (the later for binary expression
57and the former for everything else). Before we get to parsing though, lets talk
58about the output of the parser: the Abstract Syntax Tree.</p>
59
60</div>
61
62<!-- *********************************************************************** -->
63<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
64<!-- *********************************************************************** -->
65
66<div class="doc_text">
67
68<p>The AST for a program captures its behavior in a way that it is easy for
69later stages of the compiler (e.g. code generation) to interpret. We basically
70want one object for each construct in the language, and the AST should closely
71model the language. In Kaleidoscope, we have expressions, a prototype, and a
72function object. We'll start with expressions first:</p>
73
74<div class="doc_code">
75<pre>
76/// ExprAST - Base class for all expression nodes.
77class ExprAST {
78public:
79 virtual ~ExprAST() {}
80};
81
82/// NumberExprAST - Expression class for numeric literals like "1.0".
83class NumberExprAST : public ExprAST {
84 double Val;
85public:
Chris Lattner28571ed2007-10-23 04:27:44 +000086 explicit NumberExprAST(double val) : Val(val) {}
Chris Lattnere6c91042007-10-22 06:34:15 +000087};
88</pre>
89</div>
90
91<p>The code above shows the definition of the base ExprAST class and one
92subclass which we use for numeric literals. The important thing about this is
93that the NumberExprAST class captures the numeric value of the literal in the
94class, so that later phases of the compiler can know what it is.</p>
95
96<p>Right now we only create the AST, so there are no useful accessor methods on
97them. It would be very easy to add a virtual method to pretty print the code,
98for example. Here are the other expression AST node definitions that we'll use
99in the basic form of the Kaleidoscope language.
100</p>
101
102<div class="doc_code">
103<pre>
104/// VariableExprAST - Expression class for referencing a variable, like "a".
105class VariableExprAST : public ExprAST {
106 std::string Name;
107public:
Chris Lattner28571ed2007-10-23 04:27:44 +0000108 explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
Chris Lattnere6c91042007-10-22 06:34:15 +0000109};
110
111/// BinaryExprAST - Expression class for a binary operator.
112class BinaryExprAST : public ExprAST {
113 char Op;
114 ExprAST *LHS, *RHS;
115public:
116 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
117 : Op(op), LHS(lhs), RHS(rhs) {}
118};
119
120/// CallExprAST - Expression class for function calls.
121class CallExprAST : public ExprAST {
122 std::string Callee;
123 std::vector&lt;ExprAST*&gt; Args;
124public:
125 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
126 : Callee(callee), Args(args) {}
127};
128</pre>
129</div>
130
131<p>This is all (intentially) rather straight-forward: variables capture the
132variable name, binary operators capture their opcode (e.g. '+'), and calls
133capture a function name and list of argument expressions. One thing that is
134nice about our AST is that it captures the language features without talking
135about the syntax of the language. Note that there is no discussion about
136precedence of binary operators, lexical structure etc.</p>
137
138<p>For our basic language, these are all of the expression nodes we'll define.
Owen Andersonc6311b92007-10-22 06:48:28 +0000139Because it doesn't have conditional control flow, it isn't Turing-complete;
Chris Lattnere6c91042007-10-22 06:34:15 +0000140we'll fix that in a later installment. The two things we need next are a way
141to talk about the interface to a function, and a way to talk about functions
142themselves:</p>
143
144<div class="doc_code">
145<pre>
146/// PrototypeAST - This class represents the "prototype" for a function,
147/// which captures its argument names as well as if it is an operator.
148class PrototypeAST {
149 std::string Name;
150 std::vector&lt;std::string&gt; Args;
151public:
152 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
153 : Name(name), Args(args) {}
154};
155
156/// FunctionAST - This class represents a function definition itself.
157class FunctionAST {
158 PrototypeAST *Proto;
159 ExprAST *Body;
160public:
161 FunctionAST(PrototypeAST *proto, ExprAST *body)
162 : Proto(proto), Body(body) {}
163};
164</pre>
165</div>
166
167<p>In Kaleidoscope, functions are typed with just a count of their arguments.
168Since all values are double precision floating point, this fact doesn't need to
169be captured anywhere. In a more aggressive and realistic language, the
170"ExprAST" class would probably have a type field.</p>
171
172<p>With this scaffolding, we can now talk about parsing expressions and function
173bodies in Kaleidoscope.</p>
174
175</div>
176
177<!-- *********************************************************************** -->
178<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
179<!-- *********************************************************************** -->
180
181<div class="doc_text">
182
183<p>Now that we have an AST to build, we need to define the parser code to build
184it. The idea here is that we want to parse something like "x+y" (which is
185returned as three tokens by the lexer) into an AST that could be generated with
186calls like this:</p>
187
188<div class="doc_code">
189<pre>
190 ExprAST *X = new VariableExprAST("x");
191 ExprAST *Y = new VariableExprAST("y");
192 ExprAST *Result = new BinaryExprAST('+', X, Y);
193</pre>
194</div>
195
196<p>In order to do this, we'll start by defining some basic helper routines:</p>
197
198<div class="doc_code">
199<pre>
200/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
201/// token the parser it looking at. getNextToken reads another token from the
202/// lexer and updates CurTok with its results.
203static int CurTok;
204static int getNextToken() {
205 return CurTok = gettok();
206}
207</pre>
208</div>
209
210<p>
211This implements a simple token buffer around the lexer. This allows
212us to look one token ahead at what the lexer is returning. Every function in
Chris Lattnere9495122007-10-25 18:05:29 +0000213our parser will assume that CurTok is the current token that needs to be
Chris Lattnere6c91042007-10-22 06:34:15 +0000214parsed.</p>
215
Chris Lattnerbee175b2007-10-22 16:44:31 +0000216<p>Again, we define these with global variables; it would be better design to
217wrap the entire parser in a class and use instance variables for these.
Chris Lattnere6c91042007-10-22 06:34:15 +0000218</p>
219
220<div class="doc_code">
221<pre>
222
223/// Error* - These are little helper functions for error handling.
224ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
225PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
226FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
227</pre>
228</div>
229
230<p>
231The <tt>Error</tt> routines are simple helper routines that our parser will use
232to handle errors. The error recovery in our parser will not be the best and
Duncan Sands72261ff2007-11-05 16:04:58 +0000233is not particular user-friendly, but it will be enough for our tutorial. These
Chris Lattnere6c91042007-10-22 06:34:15 +0000234routines make it easier to handle errors in routines that have various return
235types: they always return null.</p>
236
237<p>With these basic helper functions implemented, we can implement the first
238piece of our grammar: we'll start with numeric literals.</p>
239
240</div>
241
242<!-- *********************************************************************** -->
243<div class="doc_section"><a name="parserprimexprs">Basic Expression
244 Parsing</a></div>
245<!-- *********************************************************************** -->
246
247<div class="doc_text">
248
249<p>We start with numeric literals, because they are the simplest to process.
250For each production in our grammar, we'll define a function which parses that
251production. For numeric literals, we have:
252</p>
253
254<div class="doc_code">
255<pre>
256/// numberexpr ::= number
257static ExprAST *ParseNumberExpr() {
258 ExprAST *Result = new NumberExprAST(NumVal);
259 getNextToken(); // consume the number
260 return Result;
261}
262</pre>
263</div>
264
265<p>This routine is very simple: it expects to be called when the current token
266is a <tt>tok_number</tt> token. It takes the current number value, creates
267a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
268returns.</p>
269
270<p>There are some interesting aspects of this. The most important one is that
271this routine eats all of the tokens that correspond to the production, and
272returns the lexer buffer with the next token (which is not part of the grammar
273production) ready to go. This is a fairly standard way to go for recursive
274descent parsers. For a better example, the parenthesis operator is defined like
275this:</p>
276
277<div class="doc_code">
278<pre>
279/// parenexpr ::= '(' expression ')'
280static ExprAST *ParseParenExpr() {
281 getNextToken(); // eat (.
282 ExprAST *V = ParseExpression();
283 if (!V) return 0;
284
285 if (CurTok != ')')
286 return Error("expected ')'");
287 getNextToken(); // eat ).
288 return V;
289}
290</pre>
291</div>
292
293<p>This function illustrates a number of interesting things about the parser:
2941) it shows how we use the Error routines. When called, this function expects
295that the current token is a '(' token, but after parsing the subexpression, it
296is possible that there is not a ')' waiting. For example, if the user types in
297"(4 x" instead of "(4)", the parser should emit an error. Because errors can
298occur, the parser needs a way to indicate that they happened: in our parser, we
299return null on an error.</p>
300
301<p>Another interesting aspect of this function is that it uses recursion by
Owen Andersonc6311b92007-10-22 06:48:28 +0000302calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
Chris Lattnere6c91042007-10-22 06:34:15 +0000303<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
304recursive grammars, and keeps each production very simple. Note that
Duncan Sands72261ff2007-11-05 16:04:58 +0000305parentheses do not cause construction of AST nodes themselves. While we could
Chris Lattnere6c91042007-10-22 06:34:15 +0000306do this, the most important role of parens are to guide the parser and provide
307grouping. Once the parser constructs the AST, parens are not needed.</p>
308
309<p>The next simple production is for handling variable references and function
310calls:</p>
311
312<div class="doc_code">
313<pre>
314/// identifierexpr
Chris Lattner20a0c802007-11-05 17:54:34 +0000315/// ::= identifier
316/// ::= identifier '(' expression* ')'
Chris Lattnere6c91042007-10-22 06:34:15 +0000317static ExprAST *ParseIdentifierExpr() {
318 std::string IdName = IdentifierStr;
319
Chris Lattner20a0c802007-11-05 17:54:34 +0000320 getNextToken(); // eat identifier.
Chris Lattnere6c91042007-10-22 06:34:15 +0000321
322 if (CurTok != '(') // Simple variable ref.
323 return new VariableExprAST(IdName);
324
325 // Call.
326 getNextToken(); // eat (
327 std::vector&lt;ExprAST*&gt; Args;
328 while (1) {
329 ExprAST *Arg = ParseExpression();
330 if (!Arg) return 0;
331 Args.push_back(Arg);
332
333 if (CurTok == ')') break;
334
335 if (CurTok != ',')
336 return Error("Expected ')'");
337 getNextToken();
338 }
339
340 // Eat the ')'.
341 getNextToken();
342
343 return new CallExprAST(IdName, Args);
344}
345</pre>
346</div>
347
348<p>This routine follows the same style as the other routines (it expects to be
349called if the current token is a <tt>tok_identifier</tt> token). It also has
350recursion and error handling. One interesting aspect of this is that it uses
351<em>look-ahead</em> to determine if the current identifier is a stand alone
352variable reference or if it is a function call expression. It handles this by
353checking to see if the token after the identifier is a '(' token, and constructs
354either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
355</p>
356
357<p>Now that we have all of our simple expression parsing logic in place, we can
358define a helper function to wrap them up in a class. We call this class of
359expressions "primary" expressions, for reasons that will become more clear
360later. In order to parse a primary expression, we need to determine what sort
361of expression it is:</p>
362
363<div class="doc_code">
364<pre>
365/// primary
366/// ::= identifierexpr
367/// ::= numberexpr
368/// ::= parenexpr
369static ExprAST *ParsePrimary() {
370 switch (CurTok) {
371 default: return Error("unknown token when expecting an expression");
372 case tok_identifier: return ParseIdentifierExpr();
373 case tok_number: return ParseNumberExpr();
374 case '(': return ParseParenExpr();
375 }
376}
377</pre>
378</div>
379
380<p>Now that you see the definition of this function, it makes it more obvious
381why we can assume the state of CurTok in the various functions. This uses
382look-ahead to determine which sort of expression is being inspected, and parses
383it with a function call.</p>
384
385<p>Now that basic expressions are handled, we need to handle binary expressions,
386which are a bit more complex.</p>
387
388</div>
389
390<!-- *********************************************************************** -->
391<div class="doc_section"><a name="parserbinops">Binary Expression
392 Parsing</a></div>
393<!-- *********************************************************************** -->
394
395<div class="doc_text">
396
397<p>Binary expressions are significantly harder to parse because they are often
398ambiguous. For example, when given the string "x+y*z", the parser can choose
399to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
400mathematics, we expect the later parse, because "*" (multiplication) has
401higher <em>precedence</em> than "+" (addition).</p>
402
403<p>There are many ways to handle this, but an elegant and efficient way is to
404use <a href=
405"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
406Parsing</a>. This parsing technique uses the precedence of binary operators to
407guide recursion. To start with, we need a table of precedences:</p>
408
409<div class="doc_code">
410<pre>
411/// BinopPrecedence - This holds the precedence for each binary operator that is
412/// defined.
413static std::map&lt;char, int&gt; BinopPrecedence;
414
415/// GetTokPrecedence - Get the precedence of the pending binary operator token.
416static int GetTokPrecedence() {
417 if (!isascii(CurTok))
418 return -1;
419
420 // Make sure it's a declared binop.
421 int TokPrec = BinopPrecedence[CurTok];
422 if (TokPrec &lt;= 0) return -1;
423 return TokPrec;
424}
425
426int main() {
427 // Install standard binary operators.
428 // 1 is lowest precedence.
429 BinopPrecedence['&lt;'] = 10;
430 BinopPrecedence['+'] = 20;
431 BinopPrecedence['-'] = 20;
432 BinopPrecedence['*'] = 40; // highest.
433 ...
434}
435</pre>
436</div>
437
438<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
439(this can obviously be extended by you, the reader). The
440<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
441or -1 if the token is not a binary operator. Having a map makes it easy to add
442new operators and makes it clear that the algorithm doesn't depend on the
443specific operators involved, but it would be easy enough to eliminate the map
444and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
445
446<p>With the helper above defined, we can now start parsing binary expressions.
447The basic idea of operator precedence parsing is to break down an expression
448with potentially ambiguous binary operators into pieces. Consider for example
449the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
450as a stream of primary expressions separated by binary operators. As such,
451it will first parse the leading primary expression "a", then it will see the
452pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
Duncan Sands72261ff2007-11-05 16:04:58 +0000453are primary expressions, the binary expression parser doesn't need to worry
Chris Lattnere6c91042007-10-22 06:34:15 +0000454about nested subexpressions like (c+d) at all.
455</p>
456
457<p>
458To start, an expression is a primary expression potentially followed by a
459sequence of [binop,primaryexpr] pairs:</p>
460
461<div class="doc_code">
462<pre>
463/// expression
464/// ::= primary binoprhs
465///
466static ExprAST *ParseExpression() {
467 ExprAST *LHS = ParsePrimary();
468 if (!LHS) return 0;
469
470 return ParseBinOpRHS(0, LHS);
471}
472</pre>
473</div>
474
475<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
476us. It takes a precedence and a pointer to an expression for the part parsed
477so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
478allowed to be empty, in which case it returns the expression that is passed into
479it. In our example above, the code passes the expression for "a" into
480<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
481
482<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
483minimal operator precedence</em> that the function is allowed to eat. For
484example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
485passed in a precedence of 40, it will not consume any tokens (because the
486precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
487with:</p>
488
489<div class="doc_code">
490<pre>
491/// binoprhs
492/// ::= ('+' primary)*
493static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
494 // If this is a binop, find its precedence.
495 while (1) {
496 int TokPrec = GetTokPrecedence();
497
498 // If this is a binop that binds at least as tightly as the current binop,
499 // consume it, otherwise we are done.
500 if (TokPrec &lt; ExprPrec)
501 return LHS;
502</pre>
503</div>
504
505<p>This code gets the precedence of the current token and checks to see if if is
506too low. Because we defined invalid tokens to have a precedence of -1, this
507check implicitly knows that the pair-stream ends when the token stream runs out
508of binary operators. If this check succeeds, we know that the token is a binary
509operator and that it will be included in this expression:</p>
510
511<div class="doc_code">
512<pre>
513 // Okay, we know this is a binop.
514 int BinOp = CurTok;
515 getNextToken(); // eat binop
516
517 // Parse the primary expression after the binary operator.
518 ExprAST *RHS = ParsePrimary();
519 if (!RHS) return 0;
520</pre>
521</div>
522
523<p>As such, this code eats (and remembers) the binary operator and then parses
524the following primary expression. This builds up the whole pair, the first of
525which is [+, b] for the running example.</p>
526
527<p>Now that we parsed the left-hand side of an expression and one pair of the
528RHS sequence, we have to decide which way the expression associates. In
529particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
530To determine this, we look ahead at "binop" to determine its precedence and
531compare it to BinOp's precedence (which is '+' in this case):</p>
532
533<div class="doc_code">
534<pre>
535 // If BinOp binds less tightly with RHS than the operator after RHS, let
536 // the pending operator take RHS as its LHS.
537 int NextPrec = GetTokPrecedence();
538 if (TokPrec &lt; NextPrec) {
539</pre>
540</div>
541
542<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
543precedence of our current operator, then we know that the parentheses associate
544as "(a+b) binop ...". In our example, since the next operator is "+" and so is
545our current one, we know that they have the same precedence. In this case we'll
546create the AST node for "a+b", and then continue parsing:</p>
547
548<div class="doc_code">
549<pre>
550 ... if body omitted ...
551 }
552
553 // Merge LHS/RHS.
554 LHS = new BinaryExprAST(BinOp, LHS, RHS);
555 } // loop around to the top of the while loop.
556}
557</pre>
558</div>
559
560<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
561iteration of the loop, with "+" as the current token. The code above will eat
562and remember it and parse "(c+d)" as the primary expression, which makes the
563current pair be [+, (c+d)]. It will then enter the 'if' above with "*" as the
564binop to the right of the primary. In this case, the precedence of "*" is
565higher than the precedence of "+" so the if condition will be entered.</p>
566
567<p>The critical question left here is "how can the if condition parse the right
568hand side in full"? In particular, to build the AST correctly for our example,
569it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
570do this is surprisingly simple (code from the above two blocks duplicated for
571context):</p>
572
573<div class="doc_code">
574<pre>
575 // If BinOp binds less tightly with RHS than the operator after RHS, let
576 // the pending operator take RHS as its LHS.
577 int NextPrec = GetTokPrecedence();
578 if (TokPrec &lt; NextPrec) {
579 RHS = ParseBinOpRHS(TokPrec+1, RHS);
580 if (RHS == 0) return 0;
581 }
582 // Merge LHS/RHS.
583 LHS = new BinaryExprAST(BinOp, LHS, RHS);
584 } // loop around to the top of the while loop.
585}
586</pre>
587</div>
588
589<p>At this point, we know that the binary operator to the RHS of our primary
590has higher precedence than the binop we are currently parsing. As such, we know
591that any sequence of pairs whose operators are all higher precedence than "+"
592should be parsed together and returned as "RHS". To do this, we recursively
593invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
594precedence required for it to continue. In our example above, this will cause
595it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
596of the '+' expression.</p>
597
598<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
599and added to the AST. With this little bit of code (14 non-trivial lines), we
600correctly handle fully general binary expression parsing in a very elegant way.
601</p>
602
603<p>This wraps up handling of expressions. At this point, we can point the
604parser at an arbitrary token stream and build an expression from them, stopping
605at the first token that is not part of the expression. Next up we need to
606handle function definitions etc.</p>
607
608</div>
609
610<!-- *********************************************************************** -->
611<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
612<!-- *********************************************************************** -->
613
614<div class="doc_text">
615
616<p>
617The first basic thing missing is that of function prototypes. In Kaleidoscope,
618these are used both for 'extern' function declarations as well as function body
619definitions. The code to do this is straight-forward and not very interesting
620(once you've survived expressions):
621</p>
622
623<div class="doc_code">
624<pre>
625/// prototype
626/// ::= id '(' id* ')'
627static PrototypeAST *ParsePrototype() {
628 if (CurTok != tok_identifier)
629 return ErrorP("Expected function name in prototype");
630
631 std::string FnName = IdentifierStr;
632 getNextToken();
633
634 if (CurTok != '(')
635 return ErrorP("Expected '(' in prototype");
636
637 std::vector&lt;std::string&gt; ArgNames;
638 while (getNextToken() == tok_identifier)
639 ArgNames.push_back(IdentifierStr);
640 if (CurTok != ')')
641 return ErrorP("Expected ')' in prototype");
642
643 // success.
644 getNextToken(); // eat ')'.
645
646 return new PrototypeAST(FnName, ArgNames);
647}
648</pre>
649</div>
650
651<p>Given this, a function definition is very simple, just a prototype plus
Duncan Sands72261ff2007-11-05 16:04:58 +0000652an expression to implement the body:</p>
Chris Lattnere6c91042007-10-22 06:34:15 +0000653
654<div class="doc_code">
655<pre>
656/// definition ::= 'def' prototype expression
657static FunctionAST *ParseDefinition() {
658 getNextToken(); // eat def.
659 PrototypeAST *Proto = ParsePrototype();
660 if (Proto == 0) return 0;
661
662 if (ExprAST *E = ParseExpression())
663 return new FunctionAST(Proto, E);
664 return 0;
665}
666</pre>
667</div>
668
669<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
670well as to support forward declaration of user functions. 'externs' are just
671prototypes with no body:</p>
672
673<div class="doc_code">
674<pre>
675/// external ::= 'extern' prototype
676static PrototypeAST *ParseExtern() {
677 getNextToken(); // eat extern.
678 return ParsePrototype();
679}
680</pre>
681</div>
682
683<p>Finally, we'll also let the user type in arbitrary top-level expressions and
684evaluate them on the fly. We will handle this by defining anonymous nullary
685(zero argument) functions for them:</p>
686
687<div class="doc_code">
688<pre>
689/// toplevelexpr ::= expression
690static FunctionAST *ParseTopLevelExpr() {
691 if (ExprAST *E = ParseExpression()) {
692 // Make an anonymous proto.
693 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
694 return new FunctionAST(Proto, E);
695 }
696 return 0;
697}
698</pre>
699</div>
700
701<p>Now that we have all the pieces, lets build a little driver that will let us
702actually <em>execute</em> this code we've built!</p>
703
704</div>
705
706<!-- *********************************************************************** -->
707<div class="doc_section"><a name="driver">The Driver</a></div>
708<!-- *********************************************************************** -->
709
710<div class="doc_text">
711
712<p>The driver for this simply invokes all of the parsing pieces with a top-level
713dispatch loop. There isn't much interesting here, so I'll just include the
714top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
715Parsing" section.</p>
716
717<div class="doc_code">
718<pre>
719/// top ::= definition | external | expression | ';'
720static void MainLoop() {
721 while (1) {
722 fprintf(stderr, "ready&gt; ");
723 switch (CurTok) {
724 case tok_eof: return;
725 case ';': getNextToken(); break; // ignore top level semicolons.
726 case tok_def: HandleDefinition(); break;
727 case tok_extern: HandleExtern(); break;
728 default: HandleTopLevelExpression(); break;
729 }
730 }
731}
732</pre>
733</div>
734
735<p>The most interesting part of this is that we ignore top-level semi colons.
Owen Andersonc6311b92007-10-22 06:48:28 +0000736Why is this, you ask? The basic reason is that if you type "4 + 5" at the
Chris Lattnere6c91042007-10-22 06:34:15 +0000737command line, the parser doesn't know that that is the end of what you will
738type. For example, on the next line you could type "def foo..." in which case
7394+5 is the end of a top-level expression. Alternatively you could type "* 6",
740which would continue the expression. Having top-level semicolons allows you to
741type "4+5;" and the parser will know you are done.</p>
742
743</div>
744
745<!-- *********************************************************************** -->
Chris Lattner128eb862007-11-05 19:06:59 +0000746<div class="doc_section"><a name="conclusions">Conclusions</a></div>
Chris Lattnere6c91042007-10-22 06:34:15 +0000747<!-- *********************************************************************** -->
748
749<div class="doc_text">
750
751<p>With just under 400 lines of commented code, we fully defined our minimal
752language, including a lexer, parser and AST builder. With this done, the
753executable will validate code and tell us if it is gramatically invalid. For
754example, here is a sample interaction:</p>
755
756<div class="doc_code">
757<pre>
758$ ./a.out
Chris Lattner35abbf52007-10-23 06:23:57 +0000759ready&gt; def foo(x y) x+foo(y, 4.0);
Chris Lattner01fcc042007-11-05 17:38:34 +0000760ready&gt; Parsed a function definition.
Chris Lattner35abbf52007-10-23 06:23:57 +0000761ready&gt; def foo(x y) x+y y;
Chris Lattner01fcc042007-11-05 17:38:34 +0000762ready&gt; Parsed a function definition.
Chris Lattner35abbf52007-10-23 06:23:57 +0000763ready&gt; Parsed a top-level expr
764ready&gt; def foo(x y) x+y );
Chris Lattner01fcc042007-11-05 17:38:34 +0000765ready&gt; Parsed a function definition.
Chris Lattner35abbf52007-10-23 06:23:57 +0000766ready&gt; Error: unknown token when expecting an expression
767ready&gt; extern sin(a);
768ready&gt; Parsed an extern
769ready&gt; ^D
Chris Lattnere6c91042007-10-22 06:34:15 +0000770$
771</pre>
772</div>
773
Chris Lattnerd93a5842007-10-23 05:43:01 +0000774<p>There is a lot of room for extension here. You can define new AST nodes,
775extend the language in many ways, etc. In the <a href="LangImpl3.html">next
776installment</a>, we will describe how to generate LLVM IR from the AST.</p>
777
778</div>
779
780<!-- *********************************************************************** -->
781<div class="doc_section"><a name="code">Full Code Listing</a></div>
782<!-- *********************************************************************** -->
783
784<div class="doc_text">
785
Chris Lattnere6c91042007-10-22 06:34:15 +0000786<p>
Chris Lattnerd93a5842007-10-23 05:43:01 +0000787Here is the complete code listing for this and the previous chapter.
788Note that it is fully self-contained: you don't need LLVM or any external
789libraries at all for this (other than the C and C++ standard libraries of
790course). To build this, just compile with:</p>
Chris Lattnere6c91042007-10-22 06:34:15 +0000791
792<div class="doc_code">
793<pre>
Chris Lattnerd93a5842007-10-23 05:43:01 +0000794 # Compile
795 g++ -g toy.cpp
796 # Run
797 ./a.out
798</pre>
799</div>
Chris Lattnere6c91042007-10-22 06:34:15 +0000800
Chris Lattnerd93a5842007-10-23 05:43:01 +0000801<p>Here is the code:</p>
802
803<div class="doc_code">
804<pre>
Chris Lattnere6c91042007-10-22 06:34:15 +0000805#include &lt;cstdio&gt;
806#include &lt;string&gt;
Chris Lattnerd93a5842007-10-23 05:43:01 +0000807#include &lt;map&gt;
Chris Lattnere6c91042007-10-22 06:34:15 +0000808#include &lt;vector&gt;
809
810//===----------------------------------------------------------------------===//
811// Lexer
812//===----------------------------------------------------------------------===//
813
814// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
815// of these for known things.
816enum Token {
817 tok_eof = -1,
818
819 // commands
820 tok_def = -2, tok_extern = -3,
821
822 // primary
823 tok_identifier = -4, tok_number = -5,
824};
825
826static std::string IdentifierStr; // Filled in if tok_identifier
827static double NumVal; // Filled in if tok_number
828
829/// gettok - Return the next token from standard input.
830static int gettok() {
831 static int LastChar = ' ';
832
833 // Skip any whitespace.
834 while (isspace(LastChar))
835 LastChar = getchar();
836
837 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
838 IdentifierStr = LastChar;
839 while (isalnum((LastChar = getchar())))
840 IdentifierStr += LastChar;
841
842 if (IdentifierStr == "def") return tok_def;
843 if (IdentifierStr == "extern") return tok_extern;
844 return tok_identifier;
845 }
846
847 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
848 std::string NumStr;
849 do {
850 NumStr += LastChar;
851 LastChar = getchar();
852 } while (isdigit(LastChar) || LastChar == '.');
853
854 NumVal = strtod(NumStr.c_str(), 0);
855 return tok_number;
856 }
857
858 if (LastChar == '#') {
859 // Comment until end of line.
860 do LastChar = getchar();
861 while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
862
863 if (LastChar != EOF)
864 return gettok();
865 }
866
867 // Check for end of file. Don't eat the EOF.
868 if (LastChar == EOF)
869 return tok_eof;
870
871 // Otherwise, just return the character as its ascii value.
872 int ThisChar = LastChar;
873 LastChar = getchar();
874 return ThisChar;
875}
876
877//===----------------------------------------------------------------------===//
878// Abstract Syntax Tree (aka Parse Tree)
879//===----------------------------------------------------------------------===//
880
881/// ExprAST - Base class for all expression nodes.
882class ExprAST {
883public:
884 virtual ~ExprAST() {}
885};
886
887/// NumberExprAST - Expression class for numeric literals like "1.0".
888class NumberExprAST : public ExprAST {
889 double Val;
890public:
Chris Lattner28571ed2007-10-23 04:27:44 +0000891 explicit NumberExprAST(double val) : Val(val) {}
Chris Lattnere6c91042007-10-22 06:34:15 +0000892};
893
894/// VariableExprAST - Expression class for referencing a variable, like "a".
895class VariableExprAST : public ExprAST {
896 std::string Name;
897public:
Chris Lattner28571ed2007-10-23 04:27:44 +0000898 explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
Chris Lattnere6c91042007-10-22 06:34:15 +0000899};
900
901/// BinaryExprAST - Expression class for a binary operator.
902class BinaryExprAST : public ExprAST {
903 char Op;
904 ExprAST *LHS, *RHS;
905public:
906 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
907 : Op(op), LHS(lhs), RHS(rhs) {}
908};
909
910/// CallExprAST - Expression class for function calls.
911class CallExprAST : public ExprAST {
912 std::string Callee;
913 std::vector&lt;ExprAST*&gt; Args;
914public:
915 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
916 : Callee(callee), Args(args) {}
917};
918
919/// PrototypeAST - This class represents the "prototype" for a function,
920/// which captures its argument names as well as if it is an operator.
921class PrototypeAST {
922 std::string Name;
923 std::vector&lt; Args;
924public:
925 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
926 : Name(name), Args(args) {}
927
928};
929
930/// FunctionAST - This class represents a function definition itself.
931class FunctionAST {
932 PrototypeAST *Proto;
933 ExprAST *Body;
934public:
935 FunctionAST(PrototypeAST *proto, ExprAST *body)
936 : Proto(proto), Body(body) {}
937
938};
939
940//===----------------------------------------------------------------------===//
941// Parser
942//===----------------------------------------------------------------------===//
943
944/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
945/// token the parser it looking at. getNextToken reads another token from the
946/// lexer and updates CurTok with its results.
947static int CurTok;
948static int getNextToken() {
949 return CurTok = gettok();
950}
951
952/// BinopPrecedence - This holds the precedence for each binary operator that is
953/// defined.
954static std::map&lt;char, int&gt; BinopPrecedence;
955
956/// GetTokPrecedence - Get the precedence of the pending binary operator token.
957static int GetTokPrecedence() {
958 if (!isascii(CurTok))
959 return -1;
960
961 // Make sure it's a declared binop.
962 int TokPrec = BinopPrecedence[CurTok];
963 if (TokPrec &lt;= 0) return -1;
964 return TokPrec;
965}
966
967/// Error* - These are little helper functions for error handling.
968ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
969PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
970FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
971
972static ExprAST *ParseExpression();
973
974/// identifierexpr
Chris Lattner20a0c802007-11-05 17:54:34 +0000975/// ::= identifier
976/// ::= identifier '(' expression* ')'
Chris Lattnere6c91042007-10-22 06:34:15 +0000977static ExprAST *ParseIdentifierExpr() {
978 std::string IdName = IdentifierStr;
979
Chris Lattner20a0c802007-11-05 17:54:34 +0000980 getNextToken(); // eat identifier.
Chris Lattnere6c91042007-10-22 06:34:15 +0000981
982 if (CurTok != '(') // Simple variable ref.
983 return new VariableExprAST(IdName);
984
985 // Call.
986 getNextToken(); // eat (
987 std::vector&lt;ExprAST*&gt; Args;
988 while (1) {
989 ExprAST *Arg = ParseExpression();
990 if (!Arg) return 0;
991 Args.push_back(Arg);
992
993 if (CurTok == ')') break;
994
995 if (CurTok != ',')
996 return Error("Expected ')'");
997 getNextToken();
998 }
999
1000 // Eat the ')'.
1001 getNextToken();
1002
1003 return new CallExprAST(IdName, Args);
1004}
1005
1006/// numberexpr ::= number
1007static ExprAST *ParseNumberExpr() {
1008 ExprAST *Result = new NumberExprAST(NumVal);
1009 getNextToken(); // consume the number
1010 return Result;
1011}
1012
1013/// parenexpr ::= '(' expression ')'
1014static ExprAST *ParseParenExpr() {
1015 getNextToken(); // eat (.
1016 ExprAST *V = ParseExpression();
1017 if (!V) return 0;
1018
1019 if (CurTok != ')')
1020 return Error("expected ')'");
1021 getNextToken(); // eat ).
1022 return V;
1023}
1024
1025/// primary
1026/// ::= identifierexpr
1027/// ::= numberexpr
1028/// ::= parenexpr
1029static ExprAST *ParsePrimary() {
1030 switch (CurTok) {
1031 default: return Error("unknown token when expecting an expression");
1032 case tok_identifier: return ParseIdentifierExpr();
1033 case tok_number: return ParseNumberExpr();
1034 case '(': return ParseParenExpr();
1035 }
1036}
1037
1038/// binoprhs
1039/// ::= ('+' primary)*
1040static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1041 // If this is a binop, find its precedence.
1042 while (1) {
1043 int TokPrec = GetTokPrecedence();
1044
1045 // If this is a binop that binds at least as tightly as the current binop,
1046 // consume it, otherwise we are done.
1047 if (TokPrec &lt; ExprPrec)
1048 return LHS;
1049
1050 // Okay, we know this is a binop.
1051 int BinOp = CurTok;
1052 getNextToken(); // eat binop
1053
1054 // Parse the primary expression after the binary operator.
1055 ExprAST *RHS = ParsePrimary();
1056 if (!RHS) return 0;
1057
1058 // If BinOp binds less tightly with RHS than the operator after RHS, let
1059 // the pending operator take RHS as its LHS.
1060 int NextPrec = GetTokPrecedence();
1061 if (TokPrec &lt; NextPrec) {
1062 RHS = ParseBinOpRHS(TokPrec+1, RHS);
1063 if (RHS == 0) return 0;
1064 }
1065
1066 // Merge LHS/RHS.
1067 LHS = new BinaryExprAST(BinOp, LHS, RHS);
1068 }
1069}
1070
1071/// expression
1072/// ::= primary binoprhs
1073///
1074static ExprAST *ParseExpression() {
1075 ExprAST *LHS = ParsePrimary();
1076 if (!LHS) return 0;
1077
1078 return ParseBinOpRHS(0, LHS);
1079}
1080
1081/// prototype
1082/// ::= id '(' id* ')'
1083static PrototypeAST *ParsePrototype() {
1084 if (CurTok != tok_identifier)
1085 return ErrorP("Expected function name in prototype");
1086
1087 std::string FnName = IdentifierStr;
1088 getNextToken();
1089
1090 if (CurTok != '(')
1091 return ErrorP("Expected '(' in prototype");
1092
1093 std::vector&lt;std::string&gt; ArgNames;
1094 while (getNextToken() == tok_identifier)
1095 ArgNames.push_back(IdentifierStr);
1096 if (CurTok != ')')
1097 return ErrorP("Expected ')' in prototype");
1098
1099 // success.
1100 getNextToken(); // eat ')'.
1101
1102 return new PrototypeAST(FnName, ArgNames);
1103}
1104
1105/// definition ::= 'def' prototype expression
1106static FunctionAST *ParseDefinition() {
1107 getNextToken(); // eat def.
1108 PrototypeAST *Proto = ParsePrototype();
1109 if (Proto == 0) return 0;
1110
1111 if (ExprAST *E = ParseExpression())
1112 return new FunctionAST(Proto, E);
1113 return 0;
1114}
1115
1116/// toplevelexpr ::= expression
1117static FunctionAST *ParseTopLevelExpr() {
1118 if (ExprAST *E = ParseExpression()) {
1119 // Make an anonymous proto.
1120 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;());
1121 return new FunctionAST(Proto, E);
1122 }
1123 return 0;
1124}
1125
1126/// external ::= 'extern' prototype
1127static PrototypeAST *ParseExtern() {
1128 getNextToken(); // eat extern.
1129 return ParsePrototype();
1130}
1131
1132//===----------------------------------------------------------------------===//
1133// Top-Level parsing
1134//===----------------------------------------------------------------------===//
1135
1136static void HandleDefinition() {
1137 if (FunctionAST *F = ParseDefinition()) {
1138 fprintf(stderr, "Parsed a function definition.\n");
1139 } else {
1140 // Skip token for error recovery.
1141 getNextToken();
1142 }
1143}
1144
1145static void HandleExtern() {
1146 if (PrototypeAST *P = ParseExtern()) {
1147 fprintf(stderr, "Parsed an extern\n");
1148 } else {
1149 // Skip token for error recovery.
1150 getNextToken();
1151 }
1152}
1153
1154static void HandleTopLevelExpression() {
1155 // Evaluate a top level expression into an anonymous function.
1156 if (FunctionAST *F = ParseTopLevelExpr()) {
1157 fprintf(stderr, "Parsed a top-level expr\n");
1158 } else {
1159 // Skip token for error recovery.
1160 getNextToken();
1161 }
1162}
1163
1164/// top ::= definition | external | expression | ';'
1165static void MainLoop() {
1166 while (1) {
1167 fprintf(stderr, "ready&gt; ");
1168 switch (CurTok) {
1169 case tok_eof: return;
1170 case ';': getNextToken(); break; // ignore top level semicolons.
1171 case tok_def: HandleDefinition(); break;
1172 case tok_extern: HandleExtern(); break;
1173 default: HandleTopLevelExpression(); break;
1174 }
1175 }
1176}
1177
1178//===----------------------------------------------------------------------===//
1179// Main driver code.
1180//===----------------------------------------------------------------------===//
1181
1182int main() {
1183 // Install standard binary operators.
1184 // 1 is lowest precedence.
1185 BinopPrecedence['&lt;'] = 10;
1186 BinopPrecedence['+'] = 20;
1187 BinopPrecedence['-'] = 20;
1188 BinopPrecedence['*'] = 40; // highest.
1189
1190 // Prime the first token.
1191 fprintf(stderr, "ready&gt; ");
1192 getNextToken();
1193
1194 MainLoop();
1195 return 0;
1196}
1197</pre>
1198</div>
1199</div>
1200
1201<!-- *********************************************************************** -->
1202<hr>
1203<address>
1204 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1205 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1206 <a href="http://validator.w3.org/check/referer"><img
Chris Lattner8eef4b22007-10-23 06:30:50 +00001207 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
Chris Lattnere6c91042007-10-22 06:34:15 +00001208
1209 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1210 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1211 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1212</address>
1213</body>
1214</html>