blob: 2339d96a91292e380746dc6915e6f5d291e74052 [file] [log] [blame]
Chris Lattnere6c91042007-10-22 06:34:15 +00001<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
3
4<html>
5<head>
6 <title>Kaleidoscope: Implementing a Parser and AST</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
10</head>
11
12<body>
13
14<div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
15
Chris Lattner128eb862007-11-05 19:06:59 +000016<ul>
17<li>Chapter 2
18 <ol>
19 <li><a href="#intro">Chapter 2 Introduction</a></li>
20 <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
21 <li><a href="#parserbasics">Parser Basics</a></li>
22 <li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
23 <li><a href="#parserbinops">Binary Expression Parsing</a></li>
24 <li><a href="#parsertop">Parsing the Rest</a></li>
25 <li><a href="#driver">The Driver</a></li>
26 <li><a href="#conclusions">Conclusions</a></li>
27 <li><a href="#code">Full Code Listing</a></li>
28 </ol>
29</li>
30</ul>
31
Chris Lattnere6c91042007-10-22 06:34:15 +000032<div class="doc_author">
33 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
34</div>
35
36<!-- *********************************************************************** -->
Chris Lattner128eb862007-11-05 19:06:59 +000037<div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
Chris Lattnere6c91042007-10-22 06:34:15 +000038<!-- *********************************************************************** -->
39
40<div class="doc_text">
41
Chris Lattner128eb862007-11-05 19:06:59 +000042<p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
43with LLVM</a>" tutorial. This chapter shows you how to use the <a
Chris Lattnere6c91042007-10-22 06:34:15 +000044href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a
45href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
46our Kaleidoscope language and build an <a
47href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
48Tree</a> (AST).</p>
49
50<p>The parser we will build uses a combination of <a
51href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
52Parsing</a> and <a href=
53"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
54Parsing</a> to parse the Kaleidoscope language (the later for binary expression
55and the former for everything else). Before we get to parsing though, lets talk
56about the output of the parser: the Abstract Syntax Tree.</p>
57
58</div>
59
60<!-- *********************************************************************** -->
61<div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
62<!-- *********************************************************************** -->
63
64<div class="doc_text">
65
66<p>The AST for a program captures its behavior in a way that it is easy for
67later stages of the compiler (e.g. code generation) to interpret. We basically
68want one object for each construct in the language, and the AST should closely
69model the language. In Kaleidoscope, we have expressions, a prototype, and a
70function object. We'll start with expressions first:</p>
71
72<div class="doc_code">
73<pre>
74/// ExprAST - Base class for all expression nodes.
75class ExprAST {
76public:
77 virtual ~ExprAST() {}
78};
79
80/// NumberExprAST - Expression class for numeric literals like "1.0".
81class NumberExprAST : public ExprAST {
82 double Val;
83public:
Chris Lattner28571ed2007-10-23 04:27:44 +000084 explicit NumberExprAST(double val) : Val(val) {}
Chris Lattnere6c91042007-10-22 06:34:15 +000085};
86</pre>
87</div>
88
89<p>The code above shows the definition of the base ExprAST class and one
90subclass which we use for numeric literals. The important thing about this is
91that the NumberExprAST class captures the numeric value of the literal in the
92class, so that later phases of the compiler can know what it is.</p>
93
94<p>Right now we only create the AST, so there are no useful accessor methods on
95them. It would be very easy to add a virtual method to pretty print the code,
96for example. Here are the other expression AST node definitions that we'll use
97in the basic form of the Kaleidoscope language.
98</p>
99
100<div class="doc_code">
101<pre>
102/// VariableExprAST - Expression class for referencing a variable, like "a".
103class VariableExprAST : public ExprAST {
104 std::string Name;
105public:
Chris Lattner28571ed2007-10-23 04:27:44 +0000106 explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
Chris Lattnere6c91042007-10-22 06:34:15 +0000107};
108
109/// BinaryExprAST - Expression class for a binary operator.
110class BinaryExprAST : public ExprAST {
111 char Op;
112 ExprAST *LHS, *RHS;
113public:
114 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
115 : Op(op), LHS(lhs), RHS(rhs) {}
116};
117
118/// CallExprAST - Expression class for function calls.
119class CallExprAST : public ExprAST {
120 std::string Callee;
121 std::vector&lt;ExprAST*&gt; Args;
122public:
123 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
124 : Callee(callee), Args(args) {}
125};
126</pre>
127</div>
128
129<p>This is all (intentially) rather straight-forward: variables capture the
130variable name, binary operators capture their opcode (e.g. '+'), and calls
131capture a function name and list of argument expressions. One thing that is
132nice about our AST is that it captures the language features without talking
133about the syntax of the language. Note that there is no discussion about
134precedence of binary operators, lexical structure etc.</p>
135
136<p>For our basic language, these are all of the expression nodes we'll define.
Owen Andersonc6311b92007-10-22 06:48:28 +0000137Because it doesn't have conditional control flow, it isn't Turing-complete;
Chris Lattnere6c91042007-10-22 06:34:15 +0000138we'll fix that in a later installment. The two things we need next are a way
139to talk about the interface to a function, and a way to talk about functions
140themselves:</p>
141
142<div class="doc_code">
143<pre>
144/// PrototypeAST - This class represents the "prototype" for a function,
145/// which captures its argument names as well as if it is an operator.
146class PrototypeAST {
147 std::string Name;
148 std::vector&lt;std::string&gt; Args;
149public:
150 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
151 : Name(name), Args(args) {}
152};
153
154/// FunctionAST - This class represents a function definition itself.
155class FunctionAST {
156 PrototypeAST *Proto;
157 ExprAST *Body;
158public:
159 FunctionAST(PrototypeAST *proto, ExprAST *body)
160 : Proto(proto), Body(body) {}
161};
162</pre>
163</div>
164
165<p>In Kaleidoscope, functions are typed with just a count of their arguments.
166Since all values are double precision floating point, this fact doesn't need to
167be captured anywhere. In a more aggressive and realistic language, the
168"ExprAST" class would probably have a type field.</p>
169
170<p>With this scaffolding, we can now talk about parsing expressions and function
171bodies in Kaleidoscope.</p>
172
173</div>
174
175<!-- *********************************************************************** -->
176<div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
177<!-- *********************************************************************** -->
178
179<div class="doc_text">
180
181<p>Now that we have an AST to build, we need to define the parser code to build
182it. The idea here is that we want to parse something like "x+y" (which is
183returned as three tokens by the lexer) into an AST that could be generated with
184calls like this:</p>
185
186<div class="doc_code">
187<pre>
188 ExprAST *X = new VariableExprAST("x");
189 ExprAST *Y = new VariableExprAST("y");
190 ExprAST *Result = new BinaryExprAST('+', X, Y);
191</pre>
192</div>
193
194<p>In order to do this, we'll start by defining some basic helper routines:</p>
195
196<div class="doc_code">
197<pre>
198/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
199/// token the parser it looking at. getNextToken reads another token from the
200/// lexer and updates CurTok with its results.
201static int CurTok;
202static int getNextToken() {
203 return CurTok = gettok();
204}
205</pre>
206</div>
207
208<p>
209This implements a simple token buffer around the lexer. This allows
210us to look one token ahead at what the lexer is returning. Every function in
Chris Lattnere9495122007-10-25 18:05:29 +0000211our parser will assume that CurTok is the current token that needs to be
Chris Lattnere6c91042007-10-22 06:34:15 +0000212parsed.</p>
213
Chris Lattnerbee175b2007-10-22 16:44:31 +0000214<p>Again, we define these with global variables; it would be better design to
215wrap the entire parser in a class and use instance variables for these.
Chris Lattnere6c91042007-10-22 06:34:15 +0000216</p>
217
218<div class="doc_code">
219<pre>
220
221/// Error* - These are little helper functions for error handling.
222ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
223PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
224FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
225</pre>
226</div>
227
228<p>
229The <tt>Error</tt> routines are simple helper routines that our parser will use
230to handle errors. The error recovery in our parser will not be the best and
Duncan Sands72261ff2007-11-05 16:04:58 +0000231is not particular user-friendly, but it will be enough for our tutorial. These
Chris Lattnere6c91042007-10-22 06:34:15 +0000232routines make it easier to handle errors in routines that have various return
233types: they always return null.</p>
234
235<p>With these basic helper functions implemented, we can implement the first
236piece of our grammar: we'll start with numeric literals.</p>
237
238</div>
239
240<!-- *********************************************************************** -->
241<div class="doc_section"><a name="parserprimexprs">Basic Expression
242 Parsing</a></div>
243<!-- *********************************************************************** -->
244
245<div class="doc_text">
246
247<p>We start with numeric literals, because they are the simplest to process.
248For each production in our grammar, we'll define a function which parses that
249production. For numeric literals, we have:
250</p>
251
252<div class="doc_code">
253<pre>
254/// numberexpr ::= number
255static ExprAST *ParseNumberExpr() {
256 ExprAST *Result = new NumberExprAST(NumVal);
257 getNextToken(); // consume the number
258 return Result;
259}
260</pre>
261</div>
262
263<p>This routine is very simple: it expects to be called when the current token
264is a <tt>tok_number</tt> token. It takes the current number value, creates
265a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then
266returns.</p>
267
268<p>There are some interesting aspects of this. The most important one is that
269this routine eats all of the tokens that correspond to the production, and
270returns the lexer buffer with the next token (which is not part of the grammar
271production) ready to go. This is a fairly standard way to go for recursive
272descent parsers. For a better example, the parenthesis operator is defined like
273this:</p>
274
275<div class="doc_code">
276<pre>
277/// parenexpr ::= '(' expression ')'
278static ExprAST *ParseParenExpr() {
279 getNextToken(); // eat (.
280 ExprAST *V = ParseExpression();
281 if (!V) return 0;
282
283 if (CurTok != ')')
284 return Error("expected ')'");
285 getNextToken(); // eat ).
286 return V;
287}
288</pre>
289</div>
290
291<p>This function illustrates a number of interesting things about the parser:
2921) it shows how we use the Error routines. When called, this function expects
293that the current token is a '(' token, but after parsing the subexpression, it
294is possible that there is not a ')' waiting. For example, if the user types in
295"(4 x" instead of "(4)", the parser should emit an error. Because errors can
296occur, the parser needs a way to indicate that they happened: in our parser, we
297return null on an error.</p>
298
299<p>Another interesting aspect of this function is that it uses recursion by
Owen Andersonc6311b92007-10-22 06:48:28 +0000300calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
Chris Lattnere6c91042007-10-22 06:34:15 +0000301<tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
302recursive grammars, and keeps each production very simple. Note that
Duncan Sands72261ff2007-11-05 16:04:58 +0000303parentheses do not cause construction of AST nodes themselves. While we could
Chris Lattnere6c91042007-10-22 06:34:15 +0000304do this, the most important role of parens are to guide the parser and provide
305grouping. Once the parser constructs the AST, parens are not needed.</p>
306
307<p>The next simple production is for handling variable references and function
308calls:</p>
309
310<div class="doc_code">
311<pre>
312/// identifierexpr
Chris Lattner20a0c802007-11-05 17:54:34 +0000313/// ::= identifier
314/// ::= identifier '(' expression* ')'
Chris Lattnere6c91042007-10-22 06:34:15 +0000315static ExprAST *ParseIdentifierExpr() {
316 std::string IdName = IdentifierStr;
317
Chris Lattner20a0c802007-11-05 17:54:34 +0000318 getNextToken(); // eat identifier.
Chris Lattnere6c91042007-10-22 06:34:15 +0000319
320 if (CurTok != '(') // Simple variable ref.
321 return new VariableExprAST(IdName);
322
323 // Call.
324 getNextToken(); // eat (
325 std::vector&lt;ExprAST*&gt; Args;
326 while (1) {
327 ExprAST *Arg = ParseExpression();
328 if (!Arg) return 0;
329 Args.push_back(Arg);
330
331 if (CurTok == ')') break;
332
333 if (CurTok != ',')
334 return Error("Expected ')'");
335 getNextToken();
336 }
337
338 // Eat the ')'.
339 getNextToken();
340
341 return new CallExprAST(IdName, Args);
342}
343</pre>
344</div>
345
346<p>This routine follows the same style as the other routines (it expects to be
347called if the current token is a <tt>tok_identifier</tt> token). It also has
348recursion and error handling. One interesting aspect of this is that it uses
349<em>look-ahead</em> to determine if the current identifier is a stand alone
350variable reference or if it is a function call expression. It handles this by
351checking to see if the token after the identifier is a '(' token, and constructs
352either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
353</p>
354
355<p>Now that we have all of our simple expression parsing logic in place, we can
356define a helper function to wrap them up in a class. We call this class of
357expressions "primary" expressions, for reasons that will become more clear
358later. In order to parse a primary expression, we need to determine what sort
359of expression it is:</p>
360
361<div class="doc_code">
362<pre>
363/// primary
364/// ::= identifierexpr
365/// ::= numberexpr
366/// ::= parenexpr
367static ExprAST *ParsePrimary() {
368 switch (CurTok) {
369 default: return Error("unknown token when expecting an expression");
370 case tok_identifier: return ParseIdentifierExpr();
371 case tok_number: return ParseNumberExpr();
372 case '(': return ParseParenExpr();
373 }
374}
375</pre>
376</div>
377
378<p>Now that you see the definition of this function, it makes it more obvious
379why we can assume the state of CurTok in the various functions. This uses
380look-ahead to determine which sort of expression is being inspected, and parses
381it with a function call.</p>
382
383<p>Now that basic expressions are handled, we need to handle binary expressions,
384which are a bit more complex.</p>
385
386</div>
387
388<!-- *********************************************************************** -->
389<div class="doc_section"><a name="parserbinops">Binary Expression
390 Parsing</a></div>
391<!-- *********************************************************************** -->
392
393<div class="doc_text">
394
395<p>Binary expressions are significantly harder to parse because they are often
396ambiguous. For example, when given the string "x+y*z", the parser can choose
397to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
398mathematics, we expect the later parse, because "*" (multiplication) has
399higher <em>precedence</em> than "+" (addition).</p>
400
401<p>There are many ways to handle this, but an elegant and efficient way is to
402use <a href=
403"http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
404Parsing</a>. This parsing technique uses the precedence of binary operators to
405guide recursion. To start with, we need a table of precedences:</p>
406
407<div class="doc_code">
408<pre>
409/// BinopPrecedence - This holds the precedence for each binary operator that is
410/// defined.
411static std::map&lt;char, int&gt; BinopPrecedence;
412
413/// GetTokPrecedence - Get the precedence of the pending binary operator token.
414static int GetTokPrecedence() {
415 if (!isascii(CurTok))
416 return -1;
417
418 // Make sure it's a declared binop.
419 int TokPrec = BinopPrecedence[CurTok];
420 if (TokPrec &lt;= 0) return -1;
421 return TokPrec;
422}
423
424int main() {
425 // Install standard binary operators.
426 // 1 is lowest precedence.
427 BinopPrecedence['&lt;'] = 10;
428 BinopPrecedence['+'] = 20;
429 BinopPrecedence['-'] = 20;
430 BinopPrecedence['*'] = 40; // highest.
431 ...
432}
433</pre>
434</div>
435
436<p>For the basic form of Kaleidoscope, we will only support 4 binary operators
437(this can obviously be extended by you, the reader). The
438<tt>GetTokPrecedence</tt> function returns the precedence for the current token,
439or -1 if the token is not a binary operator. Having a map makes it easy to add
440new operators and makes it clear that the algorithm doesn't depend on the
441specific operators involved, but it would be easy enough to eliminate the map
442and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p>
443
444<p>With the helper above defined, we can now start parsing binary expressions.
445The basic idea of operator precedence parsing is to break down an expression
446with potentially ambiguous binary operators into pieces. Consider for example
447the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
448as a stream of primary expressions separated by binary operators. As such,
449it will first parse the leading primary expression "a", then it will see the
450pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
Duncan Sands72261ff2007-11-05 16:04:58 +0000451are primary expressions, the binary expression parser doesn't need to worry
Chris Lattnere6c91042007-10-22 06:34:15 +0000452about nested subexpressions like (c+d) at all.
453</p>
454
455<p>
456To start, an expression is a primary expression potentially followed by a
457sequence of [binop,primaryexpr] pairs:</p>
458
459<div class="doc_code">
460<pre>
461/// expression
462/// ::= primary binoprhs
463///
464static ExprAST *ParseExpression() {
465 ExprAST *LHS = ParsePrimary();
466 if (!LHS) return 0;
467
468 return ParseBinOpRHS(0, LHS);
469}
470</pre>
471</div>
472
473<p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
474us. It takes a precedence and a pointer to an expression for the part parsed
475so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
476allowed to be empty, in which case it returns the expression that is passed into
477it. In our example above, the code passes the expression for "a" into
478<tt>ParseBinOpRHS</tt> and the current token is "+".</p>
479
480<p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
481minimal operator precedence</em> that the function is allowed to eat. For
482example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
483passed in a precedence of 40, it will not consume any tokens (because the
484precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
485with:</p>
486
487<div class="doc_code">
488<pre>
489/// binoprhs
490/// ::= ('+' primary)*
491static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
492 // If this is a binop, find its precedence.
493 while (1) {
494 int TokPrec = GetTokPrecedence();
495
496 // If this is a binop that binds at least as tightly as the current binop,
497 // consume it, otherwise we are done.
498 if (TokPrec &lt; ExprPrec)
499 return LHS;
500</pre>
501</div>
502
503<p>This code gets the precedence of the current token and checks to see if if is
504too low. Because we defined invalid tokens to have a precedence of -1, this
505check implicitly knows that the pair-stream ends when the token stream runs out
506of binary operators. If this check succeeds, we know that the token is a binary
507operator and that it will be included in this expression:</p>
508
509<div class="doc_code">
510<pre>
511 // Okay, we know this is a binop.
512 int BinOp = CurTok;
513 getNextToken(); // eat binop
514
515 // Parse the primary expression after the binary operator.
516 ExprAST *RHS = ParsePrimary();
517 if (!RHS) return 0;
518</pre>
519</div>
520
521<p>As such, this code eats (and remembers) the binary operator and then parses
522the following primary expression. This builds up the whole pair, the first of
523which is [+, b] for the running example.</p>
524
525<p>Now that we parsed the left-hand side of an expression and one pair of the
526RHS sequence, we have to decide which way the expression associates. In
527particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
528To determine this, we look ahead at "binop" to determine its precedence and
529compare it to BinOp's precedence (which is '+' in this case):</p>
530
531<div class="doc_code">
532<pre>
533 // If BinOp binds less tightly with RHS than the operator after RHS, let
534 // the pending operator take RHS as its LHS.
535 int NextPrec = GetTokPrecedence();
536 if (TokPrec &lt; NextPrec) {
537</pre>
538</div>
539
540<p>If the precedence of the binop to the right of "RHS" is lower or equal to the
541precedence of our current operator, then we know that the parentheses associate
542as "(a+b) binop ...". In our example, since the next operator is "+" and so is
543our current one, we know that they have the same precedence. In this case we'll
544create the AST node for "a+b", and then continue parsing:</p>
545
546<div class="doc_code">
547<pre>
548 ... if body omitted ...
549 }
550
551 // Merge LHS/RHS.
552 LHS = new BinaryExprAST(BinOp, LHS, RHS);
553 } // loop around to the top of the while loop.
554}
555</pre>
556</div>
557
558<p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
559iteration of the loop, with "+" as the current token. The code above will eat
560and remember it and parse "(c+d)" as the primary expression, which makes the
561current pair be [+, (c+d)]. It will then enter the 'if' above with "*" as the
562binop to the right of the primary. In this case, the precedence of "*" is
563higher than the precedence of "+" so the if condition will be entered.</p>
564
565<p>The critical question left here is "how can the if condition parse the right
566hand side in full"? In particular, to build the AST correctly for our example,
567it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
568do this is surprisingly simple (code from the above two blocks duplicated for
569context):</p>
570
571<div class="doc_code">
572<pre>
573 // If BinOp binds less tightly with RHS than the operator after RHS, let
574 // the pending operator take RHS as its LHS.
575 int NextPrec = GetTokPrecedence();
576 if (TokPrec &lt; NextPrec) {
577 RHS = ParseBinOpRHS(TokPrec+1, RHS);
578 if (RHS == 0) return 0;
579 }
580 // Merge LHS/RHS.
581 LHS = new BinaryExprAST(BinOp, LHS, RHS);
582 } // loop around to the top of the while loop.
583}
584</pre>
585</div>
586
587<p>At this point, we know that the binary operator to the RHS of our primary
588has higher precedence than the binop we are currently parsing. As such, we know
589that any sequence of pairs whose operators are all higher precedence than "+"
590should be parsed together and returned as "RHS". To do this, we recursively
591invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
592precedence required for it to continue. In our example above, this will cause
593it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
594of the '+' expression.</p>
595
596<p>Finally, on the next iteration of the while loop, the "+g" piece is parsed.
597and added to the AST. With this little bit of code (14 non-trivial lines), we
598correctly handle fully general binary expression parsing in a very elegant way.
599</p>
600
601<p>This wraps up handling of expressions. At this point, we can point the
602parser at an arbitrary token stream and build an expression from them, stopping
603at the first token that is not part of the expression. Next up we need to
604handle function definitions etc.</p>
605
606</div>
607
608<!-- *********************************************************************** -->
609<div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
610<!-- *********************************************************************** -->
611
612<div class="doc_text">
613
614<p>
615The first basic thing missing is that of function prototypes. In Kaleidoscope,
616these are used both for 'extern' function declarations as well as function body
617definitions. The code to do this is straight-forward and not very interesting
618(once you've survived expressions):
619</p>
620
621<div class="doc_code">
622<pre>
623/// prototype
624/// ::= id '(' id* ')'
625static PrototypeAST *ParsePrototype() {
626 if (CurTok != tok_identifier)
627 return ErrorP("Expected function name in prototype");
628
629 std::string FnName = IdentifierStr;
630 getNextToken();
631
632 if (CurTok != '(')
633 return ErrorP("Expected '(' in prototype");
634
635 std::vector&lt;std::string&gt; ArgNames;
636 while (getNextToken() == tok_identifier)
637 ArgNames.push_back(IdentifierStr);
638 if (CurTok != ')')
639 return ErrorP("Expected ')' in prototype");
640
641 // success.
642 getNextToken(); // eat ')'.
643
644 return new PrototypeAST(FnName, ArgNames);
645}
646</pre>
647</div>
648
649<p>Given this, a function definition is very simple, just a prototype plus
Duncan Sands72261ff2007-11-05 16:04:58 +0000650an expression to implement the body:</p>
Chris Lattnere6c91042007-10-22 06:34:15 +0000651
652<div class="doc_code">
653<pre>
654/// definition ::= 'def' prototype expression
655static FunctionAST *ParseDefinition() {
656 getNextToken(); // eat def.
657 PrototypeAST *Proto = ParsePrototype();
658 if (Proto == 0) return 0;
659
660 if (ExprAST *E = ParseExpression())
661 return new FunctionAST(Proto, E);
662 return 0;
663}
664</pre>
665</div>
666
667<p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
668well as to support forward declaration of user functions. 'externs' are just
669prototypes with no body:</p>
670
671<div class="doc_code">
672<pre>
673/// external ::= 'extern' prototype
674static PrototypeAST *ParseExtern() {
675 getNextToken(); // eat extern.
676 return ParsePrototype();
677}
678</pre>
679</div>
680
681<p>Finally, we'll also let the user type in arbitrary top-level expressions and
682evaluate them on the fly. We will handle this by defining anonymous nullary
683(zero argument) functions for them:</p>
684
685<div class="doc_code">
686<pre>
687/// toplevelexpr ::= expression
688static FunctionAST *ParseTopLevelExpr() {
689 if (ExprAST *E = ParseExpression()) {
690 // Make an anonymous proto.
691 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
692 return new FunctionAST(Proto, E);
693 }
694 return 0;
695}
696</pre>
697</div>
698
699<p>Now that we have all the pieces, lets build a little driver that will let us
700actually <em>execute</em> this code we've built!</p>
701
702</div>
703
704<!-- *********************************************************************** -->
705<div class="doc_section"><a name="driver">The Driver</a></div>
706<!-- *********************************************************************** -->
707
708<div class="doc_text">
709
710<p>The driver for this simply invokes all of the parsing pieces with a top-level
711dispatch loop. There isn't much interesting here, so I'll just include the
712top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
713Parsing" section.</p>
714
715<div class="doc_code">
716<pre>
717/// top ::= definition | external | expression | ';'
718static void MainLoop() {
719 while (1) {
720 fprintf(stderr, "ready&gt; ");
721 switch (CurTok) {
722 case tok_eof: return;
723 case ';': getNextToken(); break; // ignore top level semicolons.
724 case tok_def: HandleDefinition(); break;
725 case tok_extern: HandleExtern(); break;
726 default: HandleTopLevelExpression(); break;
727 }
728 }
729}
730</pre>
731</div>
732
733<p>The most interesting part of this is that we ignore top-level semi colons.
Owen Andersonc6311b92007-10-22 06:48:28 +0000734Why is this, you ask? The basic reason is that if you type "4 + 5" at the
Chris Lattnere6c91042007-10-22 06:34:15 +0000735command line, the parser doesn't know that that is the end of what you will
736type. For example, on the next line you could type "def foo..." in which case
7374+5 is the end of a top-level expression. Alternatively you could type "* 6",
738which would continue the expression. Having top-level semicolons allows you to
739type "4+5;" and the parser will know you are done.</p>
740
741</div>
742
743<!-- *********************************************************************** -->
Chris Lattner128eb862007-11-05 19:06:59 +0000744<div class="doc_section"><a name="conclusions">Conclusions</a></div>
Chris Lattnere6c91042007-10-22 06:34:15 +0000745<!-- *********************************************************************** -->
746
747<div class="doc_text">
748
749<p>With just under 400 lines of commented code, we fully defined our minimal
750language, including a lexer, parser and AST builder. With this done, the
751executable will validate code and tell us if it is gramatically invalid. For
752example, here is a sample interaction:</p>
753
754<div class="doc_code">
755<pre>
756$ ./a.out
Chris Lattner35abbf52007-10-23 06:23:57 +0000757ready&gt; def foo(x y) x+foo(y, 4.0);
Chris Lattner01fcc042007-11-05 17:38:34 +0000758ready&gt; Parsed a function definition.
Chris Lattner35abbf52007-10-23 06:23:57 +0000759ready&gt; def foo(x y) x+y y;
Chris Lattner01fcc042007-11-05 17:38:34 +0000760ready&gt; Parsed a function definition.
Chris Lattner35abbf52007-10-23 06:23:57 +0000761ready&gt; Parsed a top-level expr
762ready&gt; def foo(x y) x+y );
Chris Lattner01fcc042007-11-05 17:38:34 +0000763ready&gt; Parsed a function definition.
Chris Lattner35abbf52007-10-23 06:23:57 +0000764ready&gt; Error: unknown token when expecting an expression
765ready&gt; extern sin(a);
766ready&gt; Parsed an extern
767ready&gt; ^D
Chris Lattnere6c91042007-10-22 06:34:15 +0000768$
769</pre>
770</div>
771
Chris Lattnerd93a5842007-10-23 05:43:01 +0000772<p>There is a lot of room for extension here. You can define new AST nodes,
773extend the language in many ways, etc. In the <a href="LangImpl3.html">next
774installment</a>, we will describe how to generate LLVM IR from the AST.</p>
775
776</div>
777
778<!-- *********************************************************************** -->
779<div class="doc_section"><a name="code">Full Code Listing</a></div>
780<!-- *********************************************************************** -->
781
782<div class="doc_text">
783
Chris Lattnere6c91042007-10-22 06:34:15 +0000784<p>
Chris Lattnerd93a5842007-10-23 05:43:01 +0000785Here is the complete code listing for this and the previous chapter.
786Note that it is fully self-contained: you don't need LLVM or any external
787libraries at all for this (other than the C and C++ standard libraries of
788course). To build this, just compile with:</p>
Chris Lattnere6c91042007-10-22 06:34:15 +0000789
790<div class="doc_code">
791<pre>
Chris Lattnerd93a5842007-10-23 05:43:01 +0000792 # Compile
793 g++ -g toy.cpp
794 # Run
795 ./a.out
796</pre>
797</div>
Chris Lattnere6c91042007-10-22 06:34:15 +0000798
Chris Lattnerd93a5842007-10-23 05:43:01 +0000799<p>Here is the code:</p>
800
801<div class="doc_code">
802<pre>
Chris Lattnere6c91042007-10-22 06:34:15 +0000803#include &lt;cstdio&gt;
804#include &lt;string&gt;
Chris Lattnerd93a5842007-10-23 05:43:01 +0000805#include &lt;map&gt;
Chris Lattnere6c91042007-10-22 06:34:15 +0000806#include &lt;vector&gt;
807
808//===----------------------------------------------------------------------===//
809// Lexer
810//===----------------------------------------------------------------------===//
811
812// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
813// of these for known things.
814enum Token {
815 tok_eof = -1,
816
817 // commands
818 tok_def = -2, tok_extern = -3,
819
820 // primary
821 tok_identifier = -4, tok_number = -5,
822};
823
824static std::string IdentifierStr; // Filled in if tok_identifier
825static double NumVal; // Filled in if tok_number
826
827/// gettok - Return the next token from standard input.
828static int gettok() {
829 static int LastChar = ' ';
830
831 // Skip any whitespace.
832 while (isspace(LastChar))
833 LastChar = getchar();
834
835 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
836 IdentifierStr = LastChar;
837 while (isalnum((LastChar = getchar())))
838 IdentifierStr += LastChar;
839
840 if (IdentifierStr == "def") return tok_def;
841 if (IdentifierStr == "extern") return tok_extern;
842 return tok_identifier;
843 }
844
845 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
846 std::string NumStr;
847 do {
848 NumStr += LastChar;
849 LastChar = getchar();
850 } while (isdigit(LastChar) || LastChar == '.');
851
852 NumVal = strtod(NumStr.c_str(), 0);
853 return tok_number;
854 }
855
856 if (LastChar == '#') {
857 // Comment until end of line.
858 do LastChar = getchar();
859 while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
860
861 if (LastChar != EOF)
862 return gettok();
863 }
864
865 // Check for end of file. Don't eat the EOF.
866 if (LastChar == EOF)
867 return tok_eof;
868
869 // Otherwise, just return the character as its ascii value.
870 int ThisChar = LastChar;
871 LastChar = getchar();
872 return ThisChar;
873}
874
875//===----------------------------------------------------------------------===//
876// Abstract Syntax Tree (aka Parse Tree)
877//===----------------------------------------------------------------------===//
878
879/// ExprAST - Base class for all expression nodes.
880class ExprAST {
881public:
882 virtual ~ExprAST() {}
883};
884
885/// NumberExprAST - Expression class for numeric literals like "1.0".
886class NumberExprAST : public ExprAST {
887 double Val;
888public:
Chris Lattner28571ed2007-10-23 04:27:44 +0000889 explicit NumberExprAST(double val) : Val(val) {}
Chris Lattnere6c91042007-10-22 06:34:15 +0000890};
891
892/// VariableExprAST - Expression class for referencing a variable, like "a".
893class VariableExprAST : public ExprAST {
894 std::string Name;
895public:
Chris Lattner28571ed2007-10-23 04:27:44 +0000896 explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
Chris Lattnere6c91042007-10-22 06:34:15 +0000897};
898
899/// BinaryExprAST - Expression class for a binary operator.
900class BinaryExprAST : public ExprAST {
901 char Op;
902 ExprAST *LHS, *RHS;
903public:
904 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
905 : Op(op), LHS(lhs), RHS(rhs) {}
906};
907
908/// CallExprAST - Expression class for function calls.
909class CallExprAST : public ExprAST {
910 std::string Callee;
911 std::vector&lt;ExprAST*&gt; Args;
912public:
913 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
914 : Callee(callee), Args(args) {}
915};
916
917/// PrototypeAST - This class represents the "prototype" for a function,
918/// which captures its argument names as well as if it is an operator.
919class PrototypeAST {
920 std::string Name;
921 std::vector&lt; Args;
922public:
923 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
924 : Name(name), Args(args) {}
925
926};
927
928/// FunctionAST - This class represents a function definition itself.
929class FunctionAST {
930 PrototypeAST *Proto;
931 ExprAST *Body;
932public:
933 FunctionAST(PrototypeAST *proto, ExprAST *body)
934 : Proto(proto), Body(body) {}
935
936};
937
938//===----------------------------------------------------------------------===//
939// Parser
940//===----------------------------------------------------------------------===//
941
942/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
943/// token the parser it looking at. getNextToken reads another token from the
944/// lexer and updates CurTok with its results.
945static int CurTok;
946static int getNextToken() {
947 return CurTok = gettok();
948}
949
950/// BinopPrecedence - This holds the precedence for each binary operator that is
951/// defined.
952static std::map&lt;char, int&gt; BinopPrecedence;
953
954/// GetTokPrecedence - Get the precedence of the pending binary operator token.
955static int GetTokPrecedence() {
956 if (!isascii(CurTok))
957 return -1;
958
959 // Make sure it's a declared binop.
960 int TokPrec = BinopPrecedence[CurTok];
961 if (TokPrec &lt;= 0) return -1;
962 return TokPrec;
963}
964
965/// Error* - These are little helper functions for error handling.
966ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
967PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
968FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
969
970static ExprAST *ParseExpression();
971
972/// identifierexpr
Chris Lattner20a0c802007-11-05 17:54:34 +0000973/// ::= identifier
974/// ::= identifier '(' expression* ')'
Chris Lattnere6c91042007-10-22 06:34:15 +0000975static ExprAST *ParseIdentifierExpr() {
976 std::string IdName = IdentifierStr;
977
Chris Lattner20a0c802007-11-05 17:54:34 +0000978 getNextToken(); // eat identifier.
Chris Lattnere6c91042007-10-22 06:34:15 +0000979
980 if (CurTok != '(') // Simple variable ref.
981 return new VariableExprAST(IdName);
982
983 // Call.
984 getNextToken(); // eat (
985 std::vector&lt;ExprAST*&gt; Args;
986 while (1) {
987 ExprAST *Arg = ParseExpression();
988 if (!Arg) return 0;
989 Args.push_back(Arg);
990
991 if (CurTok == ')') break;
992
993 if (CurTok != ',')
994 return Error("Expected ')'");
995 getNextToken();
996 }
997
998 // Eat the ')'.
999 getNextToken();
1000
1001 return new CallExprAST(IdName, Args);
1002}
1003
1004/// numberexpr ::= number
1005static ExprAST *ParseNumberExpr() {
1006 ExprAST *Result = new NumberExprAST(NumVal);
1007 getNextToken(); // consume the number
1008 return Result;
1009}
1010
1011/// parenexpr ::= '(' expression ')'
1012static ExprAST *ParseParenExpr() {
1013 getNextToken(); // eat (.
1014 ExprAST *V = ParseExpression();
1015 if (!V) return 0;
1016
1017 if (CurTok != ')')
1018 return Error("expected ')'");
1019 getNextToken(); // eat ).
1020 return V;
1021}
1022
1023/// primary
1024/// ::= identifierexpr
1025/// ::= numberexpr
1026/// ::= parenexpr
1027static ExprAST *ParsePrimary() {
1028 switch (CurTok) {
1029 default: return Error("unknown token when expecting an expression");
1030 case tok_identifier: return ParseIdentifierExpr();
1031 case tok_number: return ParseNumberExpr();
1032 case '(': return ParseParenExpr();
1033 }
1034}
1035
1036/// binoprhs
1037/// ::= ('+' primary)*
1038static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1039 // If this is a binop, find its precedence.
1040 while (1) {
1041 int TokPrec = GetTokPrecedence();
1042
1043 // If this is a binop that binds at least as tightly as the current binop,
1044 // consume it, otherwise we are done.
1045 if (TokPrec &lt; ExprPrec)
1046 return LHS;
1047
1048 // Okay, we know this is a binop.
1049 int BinOp = CurTok;
1050 getNextToken(); // eat binop
1051
1052 // Parse the primary expression after the binary operator.
1053 ExprAST *RHS = ParsePrimary();
1054 if (!RHS) return 0;
1055
1056 // If BinOp binds less tightly with RHS than the operator after RHS, let
1057 // the pending operator take RHS as its LHS.
1058 int NextPrec = GetTokPrecedence();
1059 if (TokPrec &lt; NextPrec) {
1060 RHS = ParseBinOpRHS(TokPrec+1, RHS);
1061 if (RHS == 0) return 0;
1062 }
1063
1064 // Merge LHS/RHS.
1065 LHS = new BinaryExprAST(BinOp, LHS, RHS);
1066 }
1067}
1068
1069/// expression
1070/// ::= primary binoprhs
1071///
1072static ExprAST *ParseExpression() {
1073 ExprAST *LHS = ParsePrimary();
1074 if (!LHS) return 0;
1075
1076 return ParseBinOpRHS(0, LHS);
1077}
1078
1079/// prototype
1080/// ::= id '(' id* ')'
1081static PrototypeAST *ParsePrototype() {
1082 if (CurTok != tok_identifier)
1083 return ErrorP("Expected function name in prototype");
1084
1085 std::string FnName = IdentifierStr;
1086 getNextToken();
1087
1088 if (CurTok != '(')
1089 return ErrorP("Expected '(' in prototype");
1090
1091 std::vector&lt;std::string&gt; ArgNames;
1092 while (getNextToken() == tok_identifier)
1093 ArgNames.push_back(IdentifierStr);
1094 if (CurTok != ')')
1095 return ErrorP("Expected ')' in prototype");
1096
1097 // success.
1098 getNextToken(); // eat ')'.
1099
1100 return new PrototypeAST(FnName, ArgNames);
1101}
1102
1103/// definition ::= 'def' prototype expression
1104static FunctionAST *ParseDefinition() {
1105 getNextToken(); // eat def.
1106 PrototypeAST *Proto = ParsePrototype();
1107 if (Proto == 0) return 0;
1108
1109 if (ExprAST *E = ParseExpression())
1110 return new FunctionAST(Proto, E);
1111 return 0;
1112}
1113
1114/// toplevelexpr ::= expression
1115static FunctionAST *ParseTopLevelExpr() {
1116 if (ExprAST *E = ParseExpression()) {
1117 // Make an anonymous proto.
1118 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;());
1119 return new FunctionAST(Proto, E);
1120 }
1121 return 0;
1122}
1123
1124/// external ::= 'extern' prototype
1125static PrototypeAST *ParseExtern() {
1126 getNextToken(); // eat extern.
1127 return ParsePrototype();
1128}
1129
1130//===----------------------------------------------------------------------===//
1131// Top-Level parsing
1132//===----------------------------------------------------------------------===//
1133
1134static void HandleDefinition() {
1135 if (FunctionAST *F = ParseDefinition()) {
1136 fprintf(stderr, "Parsed a function definition.\n");
1137 } else {
1138 // Skip token for error recovery.
1139 getNextToken();
1140 }
1141}
1142
1143static void HandleExtern() {
1144 if (PrototypeAST *P = ParseExtern()) {
1145 fprintf(stderr, "Parsed an extern\n");
1146 } else {
1147 // Skip token for error recovery.
1148 getNextToken();
1149 }
1150}
1151
1152static void HandleTopLevelExpression() {
1153 // Evaluate a top level expression into an anonymous function.
1154 if (FunctionAST *F = ParseTopLevelExpr()) {
1155 fprintf(stderr, "Parsed a top-level expr\n");
1156 } else {
1157 // Skip token for error recovery.
1158 getNextToken();
1159 }
1160}
1161
1162/// top ::= definition | external | expression | ';'
1163static void MainLoop() {
1164 while (1) {
1165 fprintf(stderr, "ready&gt; ");
1166 switch (CurTok) {
1167 case tok_eof: return;
1168 case ';': getNextToken(); break; // ignore top level semicolons.
1169 case tok_def: HandleDefinition(); break;
1170 case tok_extern: HandleExtern(); break;
1171 default: HandleTopLevelExpression(); break;
1172 }
1173 }
1174}
1175
1176//===----------------------------------------------------------------------===//
1177// Main driver code.
1178//===----------------------------------------------------------------------===//
1179
1180int main() {
1181 // Install standard binary operators.
1182 // 1 is lowest precedence.
1183 BinopPrecedence['&lt;'] = 10;
1184 BinopPrecedence['+'] = 20;
1185 BinopPrecedence['-'] = 20;
1186 BinopPrecedence['*'] = 40; // highest.
1187
1188 // Prime the first token.
1189 fprintf(stderr, "ready&gt; ");
1190 getNextToken();
1191
1192 MainLoop();
1193 return 0;
1194}
1195</pre>
1196</div>
1197</div>
1198
1199<!-- *********************************************************************** -->
1200<hr>
1201<address>
1202 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1203 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1204 <a href="http://validator.w3.org/check/referer"><img
Chris Lattner8eef4b22007-10-23 06:30:50 +00001205 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
Chris Lattnere6c91042007-10-22 06:34:15 +00001206
1207 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1208 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1209 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1210</address>
1211</body>
1212</html>