blob: 4b28e02a523e96b925e20999671454ceee61df46 [file] [log] [blame]
Sean Silvad7fb3962012-12-05 00:26:32 +00001===========================================
2Kaleidoscope: Implementing a Parser and AST
3===========================================
4
5.. contents::
6 :local:
7
Sean Silvad7fb3962012-12-05 00:26:32 +00008Chapter 2 Introduction
9======================
10
11Welcome to Chapter 2 of the "`Implementing a language with
12LLVM <index.html>`_" tutorial. This chapter shows you how to use the
13lexer, built in `Chapter 1 <LangImpl1.html>`_, to build a full
14`parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
15language. Once we have a parser, we'll define and build an `Abstract
16Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
17
18The parser we will build uses a combination of `Recursive Descent
19Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
20`Operator-Precedence
21Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
22parse the Kaleidoscope language (the latter for binary expressions and
23the former for everything else). Before we get to parsing though, lets
24talk about the output of the parser: the Abstract Syntax Tree.
25
26The Abstract Syntax Tree (AST)
27==============================
28
29The AST for a program captures its behavior in such a way that it is
30easy for later stages of the compiler (e.g. code generation) to
31interpret. We basically want one object for each construct in the
32language, and the AST should closely model the language. In
33Kaleidoscope, we have expressions, a prototype, and a function object.
34We'll start with expressions first:
35
36.. code-block:: c++
37
38 /// ExprAST - Base class for all expression nodes.
39 class ExprAST {
40 public:
41 virtual ~ExprAST() {}
42 };
43
44 /// NumberExprAST - Expression class for numeric literals like "1.0".
45 class NumberExprAST : public ExprAST {
46 double Val;
47 public:
Lang Hames09bf4c12015-08-18 18:11:06 +000048 NumberExprAST(double Val) : Val(Val) {}
Sean Silvad7fb3962012-12-05 00:26:32 +000049 };
50
51The code above shows the definition of the base ExprAST class and one
52subclass which we use for numeric literals. The important thing to note
53about this code is that the NumberExprAST class captures the numeric
54value of the literal as an instance variable. This allows later phases
55of the compiler to know what the stored numeric value is.
56
57Right now we only create the AST, so there are no useful accessor
58methods on them. It would be very easy to add a virtual method to pretty
59print the code, for example. Here are the other expression AST node
60definitions that we'll use in the basic form of the Kaleidoscope
61language:
62
63.. code-block:: c++
64
65 /// VariableExprAST - Expression class for referencing a variable, like "a".
66 class VariableExprAST : public ExprAST {
67 std::string Name;
68 public:
Lang Hames09bf4c12015-08-18 18:11:06 +000069 VariableExprAST(const std::string &Name) : Name(Name) {}
Sean Silvad7fb3962012-12-05 00:26:32 +000070 };
71
72 /// BinaryExprAST - Expression class for a binary operator.
73 class BinaryExprAST : public ExprAST {
74 char Op;
Lang Hames09bf4c12015-08-18 18:11:06 +000075 std::unique_ptr<ExprAST> LHS, RHS;
Sean Silvad7fb3962012-12-05 00:26:32 +000076 public:
Lang Hames09bf4c12015-08-18 18:11:06 +000077 BinaryExprAST(char op, std::unique_ptr<ExprAST> LHS,
78 std::unique_ptr<ExprAST> RHS)
79 : Op(op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
Sean Silvad7fb3962012-12-05 00:26:32 +000080 };
81
82 /// CallExprAST - Expression class for function calls.
83 class CallExprAST : public ExprAST {
84 std::string Callee;
85 std::vector<ExprAST*> Args;
86 public:
Lang Hames09bf4c12015-08-18 18:11:06 +000087 CallExprAST(const std::string &Callee,
88 std::vector<std::unique_ptr<ExprAST>> Args)
89 : Callee(Callee), Args(std::move(Args)) {}
Sean Silvad7fb3962012-12-05 00:26:32 +000090 };
91
92This is all (intentionally) rather straight-forward: variables capture
93the variable name, binary operators capture their opcode (e.g. '+'), and
94calls capture a function name as well as a list of any argument
95expressions. One thing that is nice about our AST is that it captures
96the language features without talking about the syntax of the language.
97Note that there is no discussion about precedence of binary operators,
98lexical structure, etc.
99
100For our basic language, these are all of the expression nodes we'll
101define. Because it doesn't have conditional control flow, it isn't
102Turing-complete; we'll fix that in a later installment. The two things
103we need next are a way to talk about the interface to a function, and a
104way to talk about functions themselves:
105
106.. code-block:: c++
107
108 /// PrototypeAST - This class represents the "prototype" for a function,
109 /// which captures its name, and its argument names (thus implicitly the number
110 /// of arguments the function takes).
111 class PrototypeAST {
112 std::string Name;
113 std::vector<std::string> Args;
114 public:
Lang Hames09bf4c12015-08-18 18:11:06 +0000115 PrototypeAST(const std::string &name, std::vector<std::string> Args)
116 : Name(name), Args(std::move(Args)) {}
Sean Silvad7fb3962012-12-05 00:26:32 +0000117 };
118
119 /// FunctionAST - This class represents a function definition itself.
120 class FunctionAST {
Lang Hames09bf4c12015-08-18 18:11:06 +0000121 std::unique_ptr<PrototypeAST> Proto;
122 std::unique_ptr<ExprAST> Body;
Sean Silvad7fb3962012-12-05 00:26:32 +0000123 public:
Lang Hames09bf4c12015-08-18 18:11:06 +0000124 FunctionAST(std::unique_ptr<PrototypeAST> Proto,
125 std::unique_ptr<ExprAST> Body)
126 : Proto(std::move(Proto)), Body(std::move(Body)) {}
Sean Silvad7fb3962012-12-05 00:26:32 +0000127 };
128
129In Kaleidoscope, functions are typed with just a count of their
130arguments. Since all values are double precision floating point, the
131type of each argument doesn't need to be stored anywhere. In a more
132aggressive and realistic language, the "ExprAST" class would probably
133have a type field.
134
135With this scaffolding, we can now talk about parsing expressions and
136function bodies in Kaleidoscope.
137
138Parser Basics
139=============
140
141Now that we have an AST to build, we need to define the parser code to
142build it. The idea here is that we want to parse something like "x+y"
143(which is returned as three tokens by the lexer) into an AST that could
144be generated with calls like this:
145
146.. code-block:: c++
147
Lang Hames09bf4c12015-08-18 18:11:06 +0000148 auto LHS = llvm::make_unique<VariableExprAST>("x");
149 auto RHS = llvm::make_unique<VariableExprAST>("y");
150 auto Result = std::make_unique<BinaryExprAST>('+', std::move(LHS),
151 std::move(RHS));
Sean Silvad7fb3962012-12-05 00:26:32 +0000152
153In order to do this, we'll start by defining some basic helper routines:
154
155.. code-block:: c++
156
157 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
158 /// token the parser is looking at. getNextToken reads another token from the
159 /// lexer and updates CurTok with its results.
160 static int CurTok;
161 static int getNextToken() {
162 return CurTok = gettok();
163 }
164
165This implements a simple token buffer around the lexer. This allows us
166to look one token ahead at what the lexer is returning. Every function
167in our parser will assume that CurTok is the current token that needs to
168be parsed.
169
170.. code-block:: c++
171
172
173 /// Error* - These are little helper functions for error handling.
174 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
175 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
176 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
177
178The ``Error`` routines are simple helper routines that our parser will
179use to handle errors. The error recovery in our parser will not be the
180best and is not particular user-friendly, but it will be enough for our
181tutorial. These routines make it easier to handle errors in routines
182that have various return types: they always return null.
183
184With these basic helper functions, we can implement the first piece of
185our grammar: numeric literals.
186
187Basic Expression Parsing
188========================
189
190We start with numeric literals, because they are the simplest to
191process. For each production in our grammar, we'll define a function
192which parses that production. For numeric literals, we have:
193
194.. code-block:: c++
195
196 /// numberexpr ::= number
Lang Hames09bf4c12015-08-18 18:11:06 +0000197 static std::unique_ptr<ExprAST> ParseNumberExpr() {
198 auto Result = llvm::make_unique<NumberExprAST>(NumVal);
Sean Silvad7fb3962012-12-05 00:26:32 +0000199 getNextToken(); // consume the number
Lang Hames09bf4c12015-08-18 18:11:06 +0000200 return std::move(Result);
Sean Silvad7fb3962012-12-05 00:26:32 +0000201 }
202
203This routine is very simple: it expects to be called when the current
204token is a ``tok_number`` token. It takes the current number value,
205creates a ``NumberExprAST`` node, advances the lexer to the next token,
206and finally returns.
207
208There are some interesting aspects to this. The most important one is
209that this routine eats all of the tokens that correspond to the
210production and returns the lexer buffer with the next token (which is
211not part of the grammar production) ready to go. This is a fairly
212standard way to go for recursive descent parsers. For a better example,
213the parenthesis operator is defined like this:
214
215.. code-block:: c++
216
217 /// parenexpr ::= '(' expression ')'
Lang Hames09bf4c12015-08-18 18:11:06 +0000218 static std::unique_ptr<ExprAST> ParseParenExpr() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000219 getNextToken(); // eat (.
Lang Hames09bf4c12015-08-18 18:11:06 +0000220 auto V = ParseExpression();
221 if (!V) return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000222
223 if (CurTok != ')')
224 return Error("expected ')'");
225 getNextToken(); // eat ).
226 return V;
227 }
228
229This function illustrates a number of interesting things about the
230parser:
231
2321) It shows how we use the Error routines. When called, this function
233expects that the current token is a '(' token, but after parsing the
234subexpression, it is possible that there is no ')' waiting. For example,
235if the user types in "(4 x" instead of "(4)", the parser should emit an
236error. Because errors can occur, the parser needs a way to indicate that
237they happened: in our parser, we return null on an error.
238
2392) Another interesting aspect of this function is that it uses recursion
240by calling ``ParseExpression`` (we will soon see that
241``ParseExpression`` can call ``ParseParenExpr``). This is powerful
242because it allows us to handle recursive grammars, and keeps each
243production very simple. Note that parentheses do not cause construction
244of AST nodes themselves. While we could do it this way, the most
245important role of parentheses are to guide the parser and provide
246grouping. Once the parser constructs the AST, parentheses are not
247needed.
248
249The next simple production is for handling variable references and
250function calls:
251
252.. code-block:: c++
253
254 /// identifierexpr
255 /// ::= identifier
256 /// ::= identifier '(' expression* ')'
Lang Hames09bf4c12015-08-18 18:11:06 +0000257 static std::unique_ptr<ExprAST> ParseIdentifierExpr() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000258 std::string IdName = IdentifierStr;
259
260 getNextToken(); // eat identifier.
261
262 if (CurTok != '(') // Simple variable ref.
Lang Hames09bf4c12015-08-18 18:11:06 +0000263 return llvm::make_unique<VariableExprAST>(IdName);
Sean Silvad7fb3962012-12-05 00:26:32 +0000264
265 // Call.
266 getNextToken(); // eat (
Lang Hames09bf4c12015-08-18 18:11:06 +0000267 std::vector<std::unique_ptr<ExprAST>> Args;
Sean Silvad7fb3962012-12-05 00:26:32 +0000268 if (CurTok != ')') {
269 while (1) {
Lang Hames09bf4c12015-08-18 18:11:06 +0000270 auto Arg = ParseExpression();
271 if (!Arg) return nullptr;
272 Args.push_back(std::move(Arg));
Sean Silvad7fb3962012-12-05 00:26:32 +0000273
274 if (CurTok == ')') break;
275
276 if (CurTok != ',')
277 return Error("Expected ')' or ',' in argument list");
278 getNextToken();
279 }
280 }
281
282 // Eat the ')'.
283 getNextToken();
284
Lang Hames09bf4c12015-08-18 18:11:06 +0000285 return llvm::make_unique<CallExprAST>(IdName, std::move(Args));
Sean Silvad7fb3962012-12-05 00:26:32 +0000286 }
287
288This routine follows the same style as the other routines. (It expects
289to be called if the current token is a ``tok_identifier`` token). It
290also has recursion and error handling. One interesting aspect of this is
291that it uses *look-ahead* to determine if the current identifier is a
292stand alone variable reference or if it is a function call expression.
293It handles this by checking to see if the token after the identifier is
294a '(' token, constructing either a ``VariableExprAST`` or
295``CallExprAST`` node as appropriate.
296
297Now that we have all of our simple expression-parsing logic in place, we
298can define a helper function to wrap it together into one entry point.
299We call this class of expressions "primary" expressions, for reasons
300that will become more clear `later in the
301tutorial <LangImpl6.html#unary>`_. In order to parse an arbitrary
302primary expression, we need to determine what sort of expression it is:
303
304.. code-block:: c++
305
306 /// primary
307 /// ::= identifierexpr
308 /// ::= numberexpr
309 /// ::= parenexpr
Lang Hames09bf4c12015-08-18 18:11:06 +0000310 static std::unique_ptr<ExprAST> ParsePrimary() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000311 switch (CurTok) {
312 default: return Error("unknown token when expecting an expression");
313 case tok_identifier: return ParseIdentifierExpr();
314 case tok_number: return ParseNumberExpr();
315 case '(': return ParseParenExpr();
316 }
317 }
318
319Now that you see the definition of this function, it is more obvious why
320we can assume the state of CurTok in the various functions. This uses
321look-ahead to determine which sort of expression is being inspected, and
322then parses it with a function call.
323
324Now that basic expressions are handled, we need to handle binary
325expressions. They are a bit more complex.
326
327Binary Expression Parsing
328=========================
329
330Binary expressions are significantly harder to parse because they are
331often ambiguous. For example, when given the string "x+y\*z", the parser
332can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
333definitions from mathematics, we expect the later parse, because "\*"
334(multiplication) has higher *precedence* than "+" (addition).
335
336There are many ways to handle this, but an elegant and efficient way is
337to use `Operator-Precedence
338Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
339This parsing technique uses the precedence of binary operators to guide
340recursion. To start with, we need a table of precedences:
341
342.. code-block:: c++
343
344 /// BinopPrecedence - This holds the precedence for each binary operator that is
345 /// defined.
346 static std::map<char, int> BinopPrecedence;
347
348 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
349 static int GetTokPrecedence() {
350 if (!isascii(CurTok))
351 return -1;
352
353 // Make sure it's a declared binop.
354 int TokPrec = BinopPrecedence[CurTok];
355 if (TokPrec <= 0) return -1;
356 return TokPrec;
357 }
358
359 int main() {
360 // Install standard binary operators.
361 // 1 is lowest precedence.
362 BinopPrecedence['<'] = 10;
363 BinopPrecedence['+'] = 20;
364 BinopPrecedence['-'] = 20;
365 BinopPrecedence['*'] = 40; // highest.
366 ...
367 }
368
369For the basic form of Kaleidoscope, we will only support 4 binary
370operators (this can obviously be extended by you, our brave and intrepid
371reader). The ``GetTokPrecedence`` function returns the precedence for
372the current token, or -1 if the token is not a binary operator. Having a
373map makes it easy to add new operators and makes it clear that the
374algorithm doesn't depend on the specific operators involved, but it
375would be easy enough to eliminate the map and do the comparisons in the
376``GetTokPrecedence`` function. (Or just use a fixed-size array).
377
378With the helper above defined, we can now start parsing binary
379expressions. The basic idea of operator precedence parsing is to break
380down an expression with potentially ambiguous binary operators into
381pieces. Consider ,for example, the expression "a+b+(c+d)\*e\*f+g".
382Operator precedence parsing considers this as a stream of primary
383expressions separated by binary operators. As such, it will first parse
384the leading primary expression "a", then it will see the pairs [+, b]
385[+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
386primary expressions, the binary expression parser doesn't need to worry
387about nested subexpressions like (c+d) at all.
388
389To start, an expression is a primary expression potentially followed by
390a sequence of [binop,primaryexpr] pairs:
391
392.. code-block:: c++
393
394 /// expression
395 /// ::= primary binoprhs
396 ///
Lang Hames09bf4c12015-08-18 18:11:06 +0000397 static std::unique_ptr<ExprAST> ParseExpression() {
398 auto LHS = ParsePrimary();
399 if (!LHS) return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000400
Lang Hames09bf4c12015-08-18 18:11:06 +0000401 return ParseBinOpRHS(0, std::move(LHS));
Sean Silvad7fb3962012-12-05 00:26:32 +0000402 }
403
404``ParseBinOpRHS`` is the function that parses the sequence of pairs for
405us. It takes a precedence and a pointer to an expression for the part
406that has been parsed so far. Note that "x" is a perfectly valid
407expression: As such, "binoprhs" is allowed to be empty, in which case it
408returns the expression that is passed into it. In our example above, the
409code passes the expression for "a" into ``ParseBinOpRHS`` and the
410current token is "+".
411
412The precedence value passed into ``ParseBinOpRHS`` indicates the
413*minimal operator precedence* that the function is allowed to eat. For
414example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
415passed in a precedence of 40, it will not consume any tokens (because
416the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
417starts with:
418
419.. code-block:: c++
420
421 /// binoprhs
422 /// ::= ('+' primary)*
Lang Hames09bf4c12015-08-18 18:11:06 +0000423 static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
424 std::unique_ptr<ExprAST> LHS) {
Sean Silvad7fb3962012-12-05 00:26:32 +0000425 // If this is a binop, find its precedence.
426 while (1) {
427 int TokPrec = GetTokPrecedence();
428
429 // If this is a binop that binds at least as tightly as the current binop,
430 // consume it, otherwise we are done.
431 if (TokPrec < ExprPrec)
432 return LHS;
433
434This code gets the precedence of the current token and checks to see if
435if is too low. Because we defined invalid tokens to have a precedence of
436-1, this check implicitly knows that the pair-stream ends when the token
437stream runs out of binary operators. If this check succeeds, we know
438that the token is a binary operator and that it will be included in this
439expression:
440
441.. code-block:: c++
442
443 // Okay, we know this is a binop.
444 int BinOp = CurTok;
445 getNextToken(); // eat binop
446
447 // Parse the primary expression after the binary operator.
Lang Hames09bf4c12015-08-18 18:11:06 +0000448 auto RHS = ParsePrimary();
449 if (!RHS) return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000450
451As such, this code eats (and remembers) the binary operator and then
452parses the primary expression that follows. This builds up the whole
453pair, the first of which is [+, b] for the running example.
454
455Now that we parsed the left-hand side of an expression and one pair of
456the RHS sequence, we have to decide which way the expression associates.
457In particular, we could have "(a+b) binop unparsed" or "a + (b binop
458unparsed)". To determine this, we look ahead at "binop" to determine its
459precedence and compare it to BinOp's precedence (which is '+' in this
460case):
461
462.. code-block:: c++
463
464 // If BinOp binds less tightly with RHS than the operator after RHS, let
465 // the pending operator take RHS as its LHS.
466 int NextPrec = GetTokPrecedence();
467 if (TokPrec < NextPrec) {
468
469If the precedence of the binop to the right of "RHS" is lower or equal
470to the precedence of our current operator, then we know that the
471parentheses associate as "(a+b) binop ...". In our example, the current
472operator is "+" and the next operator is "+", we know that they have the
473same precedence. In this case we'll create the AST node for "a+b", and
474then continue parsing:
475
476.. code-block:: c++
477
478 ... if body omitted ...
479 }
480
481 // Merge LHS/RHS.
Lang Hames09bf4c12015-08-18 18:11:06 +0000482 LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
483 std::move(RHS));
Sean Silvad7fb3962012-12-05 00:26:32 +0000484 } // loop around to the top of the while loop.
485 }
486
487In our example above, this will turn "a+b+" into "(a+b)" and execute the
488next iteration of the loop, with "+" as the current token. The code
489above will eat, remember, and parse "(c+d)" as the primary expression,
490which makes the current pair equal to [+, (c+d)]. It will then evaluate
491the 'if' conditional above with "\*" as the binop to the right of the
492primary. In this case, the precedence of "\*" is higher than the
493precedence of "+" so the if condition will be entered.
494
495The critical question left here is "how can the if condition parse the
496right hand side in full"? In particular, to build the AST correctly for
497our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
498variable. The code to do this is surprisingly simple (code from the
499above two blocks duplicated for context):
500
501.. code-block:: c++
502
503 // If BinOp binds less tightly with RHS than the operator after RHS, let
504 // the pending operator take RHS as its LHS.
505 int NextPrec = GetTokPrecedence();
506 if (TokPrec < NextPrec) {
Lang Hames09bf4c12015-08-18 18:11:06 +0000507 RHS = ParseBinOpRHS(TokPrec+1, std::move(RHS));
Sean Silvad7fb3962012-12-05 00:26:32 +0000508 if (RHS == 0) return 0;
509 }
510 // Merge LHS/RHS.
Lang Hames09bf4c12015-08-18 18:11:06 +0000511 LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
512 std::move(RHS));
Sean Silvad7fb3962012-12-05 00:26:32 +0000513 } // loop around to the top of the while loop.
514 }
515
516At this point, we know that the binary operator to the RHS of our
517primary has higher precedence than the binop we are currently parsing.
518As such, we know that any sequence of pairs whose operators are all
519higher precedence than "+" should be parsed together and returned as
520"RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
521specifying "TokPrec+1" as the minimum precedence required for it to
522continue. In our example above, this will cause it to return the AST
523node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+'
524expression.
525
526Finally, on the next iteration of the while loop, the "+g" piece is
527parsed and added to the AST. With this little bit of code (14
528non-trivial lines), we correctly handle fully general binary expression
529parsing in a very elegant way. This was a whirlwind tour of this code,
530and it is somewhat subtle. I recommend running through it with a few
531tough examples to see how it works.
532
533This wraps up handling of expressions. At this point, we can point the
534parser at an arbitrary token stream and build an expression from it,
535stopping at the first token that is not part of the expression. Next up
536we need to handle function definitions, etc.
537
538Parsing the Rest
539================
540
541The next thing missing is handling of function prototypes. In
542Kaleidoscope, these are used both for 'extern' function declarations as
543well as function body definitions. The code to do this is
544straight-forward and not very interesting (once you've survived
545expressions):
546
547.. code-block:: c++
548
549 /// prototype
550 /// ::= id '(' id* ')'
Lang Hames09bf4c12015-08-18 18:11:06 +0000551 static std::unique_ptr<PrototypeAST> ParsePrototype() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000552 if (CurTok != tok_identifier)
553 return ErrorP("Expected function name in prototype");
554
555 std::string FnName = IdentifierStr;
556 getNextToken();
557
558 if (CurTok != '(')
559 return ErrorP("Expected '(' in prototype");
560
561 // Read the list of argument names.
562 std::vector<std::string> ArgNames;
563 while (getNextToken() == tok_identifier)
564 ArgNames.push_back(IdentifierStr);
565 if (CurTok != ')')
566 return ErrorP("Expected ')' in prototype");
567
568 // success.
569 getNextToken(); // eat ')'.
570
Lang Hames09bf4c12015-08-18 18:11:06 +0000571 return llvm::make_unique<PrototypeAST>(FnName, std::move(ArgNames));
Sean Silvad7fb3962012-12-05 00:26:32 +0000572 }
573
574Given this, a function definition is very simple, just a prototype plus
575an expression to implement the body:
576
577.. code-block:: c++
578
579 /// definition ::= 'def' prototype expression
Lang Hames09bf4c12015-08-18 18:11:06 +0000580 static std::unique_ptr<FunctionAST> ParseDefinition() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000581 getNextToken(); // eat def.
Lang Hames09bf4c12015-08-18 18:11:06 +0000582 auto Proto = ParsePrototype();
583 if (!Proto) return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000584
Lang Hames09bf4c12015-08-18 18:11:06 +0000585 if (auto E = ParseExpression())
586 return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
587 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000588 }
589
590In addition, we support 'extern' to declare functions like 'sin' and
591'cos' as well as to support forward declaration of user functions. These
592'extern's are just prototypes with no body:
593
594.. code-block:: c++
595
596 /// external ::= 'extern' prototype
Lang Hames09bf4c12015-08-18 18:11:06 +0000597 static std::unique_ptr<PrototypeAST> ParseExtern() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000598 getNextToken(); // eat extern.
599 return ParsePrototype();
600 }
601
602Finally, we'll also let the user type in arbitrary top-level expressions
603and evaluate them on the fly. We will handle this by defining anonymous
604nullary (zero argument) functions for them:
605
606.. code-block:: c++
607
608 /// toplevelexpr ::= expression
Lang Hames09bf4c12015-08-18 18:11:06 +0000609 static std::unique_ptr<FunctionAST> ParseTopLevelExpr() {
610 if (auto E = ParseExpression()) {
Sean Silvad7fb3962012-12-05 00:26:32 +0000611 // Make an anonymous proto.
Lang Hames09bf4c12015-08-18 18:11:06 +0000612 auto Proto = llvm::make_unique<PrototypeAST>("", std::vector<std::string>());
613 return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
Sean Silvad7fb3962012-12-05 00:26:32 +0000614 }
Lang Hames09bf4c12015-08-18 18:11:06 +0000615 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000616 }
617
618Now that we have all the pieces, let's build a little driver that will
619let us actually *execute* this code we've built!
620
621The Driver
622==========
623
624The driver for this simply invokes all of the parsing pieces with a
625top-level dispatch loop. There isn't much interesting here, so I'll just
626include the top-level loop. See `below <#code>`_ for full code in the
627"Top-Level Parsing" section.
628
629.. code-block:: c++
630
631 /// top ::= definition | external | expression | ';'
632 static void MainLoop() {
633 while (1) {
634 fprintf(stderr, "ready> ");
635 switch (CurTok) {
636 case tok_eof: return;
637 case ';': getNextToken(); break; // ignore top-level semicolons.
638 case tok_def: HandleDefinition(); break;
639 case tok_extern: HandleExtern(); break;
640 default: HandleTopLevelExpression(); break;
641 }
642 }
643 }
644
645The most interesting part of this is that we ignore top-level
646semicolons. Why is this, you ask? The basic reason is that if you type
647"4 + 5" at the command line, the parser doesn't know whether that is the
648end of what you will type or not. For example, on the next line you
649could type "def foo..." in which case 4+5 is the end of a top-level
650expression. Alternatively you could type "\* 6", which would continue
651the expression. Having top-level semicolons allows you to type "4+5;",
652and the parser will know you are done.
653
654Conclusions
655===========
656
657With just under 400 lines of commented code (240 lines of non-comment,
658non-blank code), we fully defined our minimal language, including a
659lexer, parser, and AST builder. With this done, the executable will
660validate Kaleidoscope code and tell us if it is grammatically invalid.
661For example, here is a sample interaction:
662
663.. code-block:: bash
664
665 $ ./a.out
666 ready> def foo(x y) x+foo(y, 4.0);
667 Parsed a function definition.
668 ready> def foo(x y) x+y y;
669 Parsed a function definition.
670 Parsed a top-level expr
671 ready> def foo(x y) x+y );
672 Parsed a function definition.
673 Error: unknown token when expecting an expression
674 ready> extern sin(a);
675 ready> Parsed an extern
676 ready> ^D
677 $
678
679There is a lot of room for extension here. You can define new AST nodes,
680extend the language in many ways, etc. In the `next
681installment <LangImpl3.html>`_, we will describe how to generate LLVM
682Intermediate Representation (IR) from the AST.
683
684Full Code Listing
685=================
686
687Here is the complete code listing for this and the previous chapter.
688Note that it is fully self-contained: you don't need LLVM or any
689external libraries at all for this. (Besides the C and C++ standard
690libraries, of course.) To build this, just compile with:
691
692.. code-block:: bash
693
694 # Compile
695 clang++ -g -O3 toy.cpp
696 # Run
697 ./a.out
698
699Here is the code:
700
Logan Chien855b17d2013-06-08 09:03:03 +0000701.. literalinclude:: ../../examples/Kaleidoscope/Chapter2/toy.cpp
702 :language: c++
Sean Silvad7fb3962012-12-05 00:26:32 +0000703
704`Next: Implementing Code Generation to LLVM IR <LangImpl3.html>`_
705