blob: 4be447eb5ba35da9c675e0729ad066b1523dbfd8 [file] [log] [blame]
Sean Silvad7fb3962012-12-05 00:26:32 +00001===========================================
2Kaleidoscope: Implementing a Parser and AST
3===========================================
4
5.. contents::
6 :local:
7
Sean Silvad7fb3962012-12-05 00:26:32 +00008Chapter 2 Introduction
9======================
10
11Welcome to Chapter 2 of the "`Implementing a language with
12LLVM <index.html>`_" tutorial. This chapter shows you how to use the
13lexer, built in `Chapter 1 <LangImpl1.html>`_, to build a full
14`parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
15language. Once we have a parser, we'll define and build an `Abstract
16Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
17
18The parser we will build uses a combination of `Recursive Descent
19Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
20`Operator-Precedence
21Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
22parse the Kaleidoscope language (the latter for binary expressions and
23the former for everything else). Before we get to parsing though, lets
24talk about the output of the parser: the Abstract Syntax Tree.
25
26The Abstract Syntax Tree (AST)
27==============================
28
29The AST for a program captures its behavior in such a way that it is
30easy for later stages of the compiler (e.g. code generation) to
31interpret. We basically want one object for each construct in the
32language, and the AST should closely model the language. In
33Kaleidoscope, we have expressions, a prototype, and a function object.
34We'll start with expressions first:
35
36.. code-block:: c++
37
38 /// ExprAST - Base class for all expression nodes.
39 class ExprAST {
40 public:
41 virtual ~ExprAST() {}
42 };
43
44 /// NumberExprAST - Expression class for numeric literals like "1.0".
45 class NumberExprAST : public ExprAST {
46 double Val;
Lang Hames59b0da82015-08-19 18:15:58 +000047
Sean Silvad7fb3962012-12-05 00:26:32 +000048 public:
Lang Hames09bf4c12015-08-18 18:11:06 +000049 NumberExprAST(double Val) : Val(Val) {}
Sean Silvad7fb3962012-12-05 00:26:32 +000050 };
51
52The code above shows the definition of the base ExprAST class and one
53subclass which we use for numeric literals. The important thing to note
54about this code is that the NumberExprAST class captures the numeric
55value of the literal as an instance variable. This allows later phases
56of the compiler to know what the stored numeric value is.
57
58Right now we only create the AST, so there are no useful accessor
59methods on them. It would be very easy to add a virtual method to pretty
60print the code, for example. Here are the other expression AST node
61definitions that we'll use in the basic form of the Kaleidoscope
62language:
63
64.. code-block:: c++
65
66 /// VariableExprAST - Expression class for referencing a variable, like "a".
67 class VariableExprAST : public ExprAST {
68 std::string Name;
Lang Hames59b0da82015-08-19 18:15:58 +000069
Sean Silvad7fb3962012-12-05 00:26:32 +000070 public:
Lang Hames09bf4c12015-08-18 18:11:06 +000071 VariableExprAST(const std::string &Name) : Name(Name) {}
Sean Silvad7fb3962012-12-05 00:26:32 +000072 };
73
74 /// BinaryExprAST - Expression class for a binary operator.
75 class BinaryExprAST : public ExprAST {
76 char Op;
Lang Hames09bf4c12015-08-18 18:11:06 +000077 std::unique_ptr<ExprAST> LHS, RHS;
Lang Hames59b0da82015-08-19 18:15:58 +000078
Sean Silvad7fb3962012-12-05 00:26:32 +000079 public:
Lang Hames09bf4c12015-08-18 18:11:06 +000080 BinaryExprAST(char op, std::unique_ptr<ExprAST> LHS,
81 std::unique_ptr<ExprAST> RHS)
82 : Op(op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
Sean Silvad7fb3962012-12-05 00:26:32 +000083 };
84
85 /// CallExprAST - Expression class for function calls.
86 class CallExprAST : public ExprAST {
87 std::string Callee;
Lang Hames2d789c32015-08-26 03:07:41 +000088 std::vector<std::unique_ptr<ExprAST>> Args;
Lang Hames59b0da82015-08-19 18:15:58 +000089
Sean Silvad7fb3962012-12-05 00:26:32 +000090 public:
Lang Hames09bf4c12015-08-18 18:11:06 +000091 CallExprAST(const std::string &Callee,
92 std::vector<std::unique_ptr<ExprAST>> Args)
93 : Callee(Callee), Args(std::move(Args)) {}
Sean Silvad7fb3962012-12-05 00:26:32 +000094 };
95
96This is all (intentionally) rather straight-forward: variables capture
97the variable name, binary operators capture their opcode (e.g. '+'), and
98calls capture a function name as well as a list of any argument
99expressions. One thing that is nice about our AST is that it captures
100the language features without talking about the syntax of the language.
101Note that there is no discussion about precedence of binary operators,
102lexical structure, etc.
103
104For our basic language, these are all of the expression nodes we'll
105define. Because it doesn't have conditional control flow, it isn't
106Turing-complete; we'll fix that in a later installment. The two things
107we need next are a way to talk about the interface to a function, and a
108way to talk about functions themselves:
109
110.. code-block:: c++
111
112 /// PrototypeAST - This class represents the "prototype" for a function,
113 /// which captures its name, and its argument names (thus implicitly the number
114 /// of arguments the function takes).
115 class PrototypeAST {
116 std::string Name;
117 std::vector<std::string> Args;
Lang Hames59b0da82015-08-19 18:15:58 +0000118
Sean Silvad7fb3962012-12-05 00:26:32 +0000119 public:
Lang Hames09bf4c12015-08-18 18:11:06 +0000120 PrototypeAST(const std::string &name, std::vector<std::string> Args)
121 : Name(name), Args(std::move(Args)) {}
Mehdi Aminibb6805d2017-02-11 21:26:52 +0000122
123 const std::string &getName() const { return Name; }
Sean Silvad7fb3962012-12-05 00:26:32 +0000124 };
125
126 /// FunctionAST - This class represents a function definition itself.
127 class FunctionAST {
Lang Hames09bf4c12015-08-18 18:11:06 +0000128 std::unique_ptr<PrototypeAST> Proto;
129 std::unique_ptr<ExprAST> Body;
Lang Hames59b0da82015-08-19 18:15:58 +0000130
Sean Silvad7fb3962012-12-05 00:26:32 +0000131 public:
Lang Hames09bf4c12015-08-18 18:11:06 +0000132 FunctionAST(std::unique_ptr<PrototypeAST> Proto,
133 std::unique_ptr<ExprAST> Body)
134 : Proto(std::move(Proto)), Body(std::move(Body)) {}
Sean Silvad7fb3962012-12-05 00:26:32 +0000135 };
136
137In Kaleidoscope, functions are typed with just a count of their
138arguments. Since all values are double precision floating point, the
139type of each argument doesn't need to be stored anywhere. In a more
140aggressive and realistic language, the "ExprAST" class would probably
141have a type field.
142
143With this scaffolding, we can now talk about parsing expressions and
144function bodies in Kaleidoscope.
145
146Parser Basics
147=============
148
149Now that we have an AST to build, we need to define the parser code to
150build it. The idea here is that we want to parse something like "x+y"
151(which is returned as three tokens by the lexer) into an AST that could
152be generated with calls like this:
153
154.. code-block:: c++
155
Lang Hames09bf4c12015-08-18 18:11:06 +0000156 auto LHS = llvm::make_unique<VariableExprAST>("x");
157 auto RHS = llvm::make_unique<VariableExprAST>("y");
158 auto Result = std::make_unique<BinaryExprAST>('+', std::move(LHS),
159 std::move(RHS));
Sean Silvad7fb3962012-12-05 00:26:32 +0000160
161In order to do this, we'll start by defining some basic helper routines:
162
163.. code-block:: c++
164
165 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
166 /// token the parser is looking at. getNextToken reads another token from the
167 /// lexer and updates CurTok with its results.
168 static int CurTok;
169 static int getNextToken() {
170 return CurTok = gettok();
171 }
172
173This implements a simple token buffer around the lexer. This allows us
174to look one token ahead at what the lexer is returning. Every function
175in our parser will assume that CurTok is the current token that needs to
176be parsed.
177
178.. code-block:: c++
179
180
Lang Hames5d045a92016-03-25 17:41:26 +0000181 /// LogError* - These are little helper functions for error handling.
182 std::unique_ptr<ExprAST> LogError(const char *Str) {
183 fprintf(stderr, "LogError: %s\n", Str);
Lang Hames59b0da82015-08-19 18:15:58 +0000184 return nullptr;
185 }
Lang Hames5d045a92016-03-25 17:41:26 +0000186 std::unique_ptr<PrototypeAST> LogErrorP(const char *Str) {
187 LogError(Str);
Lang Hames59b0da82015-08-19 18:15:58 +0000188 return nullptr;
189 }
Sean Silvad7fb3962012-12-05 00:26:32 +0000190
Lang Hames5d045a92016-03-25 17:41:26 +0000191The ``LogError`` routines are simple helper routines that our parser will
Sean Silvad7fb3962012-12-05 00:26:32 +0000192use to handle errors. The error recovery in our parser will not be the
193best and is not particular user-friendly, but it will be enough for our
194tutorial. These routines make it easier to handle errors in routines
195that have various return types: they always return null.
196
197With these basic helper functions, we can implement the first piece of
198our grammar: numeric literals.
199
200Basic Expression Parsing
201========================
202
203We start with numeric literals, because they are the simplest to
204process. For each production in our grammar, we'll define a function
205which parses that production. For numeric literals, we have:
206
207.. code-block:: c++
208
209 /// numberexpr ::= number
Lang Hames09bf4c12015-08-18 18:11:06 +0000210 static std::unique_ptr<ExprAST> ParseNumberExpr() {
211 auto Result = llvm::make_unique<NumberExprAST>(NumVal);
Sean Silvad7fb3962012-12-05 00:26:32 +0000212 getNextToken(); // consume the number
Lang Hames09bf4c12015-08-18 18:11:06 +0000213 return std::move(Result);
Sean Silvad7fb3962012-12-05 00:26:32 +0000214 }
215
216This routine is very simple: it expects to be called when the current
217token is a ``tok_number`` token. It takes the current number value,
218creates a ``NumberExprAST`` node, advances the lexer to the next token,
219and finally returns.
220
221There are some interesting aspects to this. The most important one is
222that this routine eats all of the tokens that correspond to the
223production and returns the lexer buffer with the next token (which is
224not part of the grammar production) ready to go. This is a fairly
225standard way to go for recursive descent parsers. For a better example,
226the parenthesis operator is defined like this:
227
228.. code-block:: c++
229
230 /// parenexpr ::= '(' expression ')'
Lang Hames09bf4c12015-08-18 18:11:06 +0000231 static std::unique_ptr<ExprAST> ParseParenExpr() {
Lang Hames59b0da82015-08-19 18:15:58 +0000232 getNextToken(); // eat (.
Lang Hames09bf4c12015-08-18 18:11:06 +0000233 auto V = ParseExpression();
Lang Hames59b0da82015-08-19 18:15:58 +0000234 if (!V)
235 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000236
237 if (CurTok != ')')
Lang Hames5d045a92016-03-25 17:41:26 +0000238 return LogError("expected ')'");
Lang Hames59b0da82015-08-19 18:15:58 +0000239 getNextToken(); // eat ).
Sean Silvad7fb3962012-12-05 00:26:32 +0000240 return V;
241 }
242
243This function illustrates a number of interesting things about the
244parser:
245
Lang Hames5d045a92016-03-25 17:41:26 +00002461) It shows how we use the LogError routines. When called, this function
Sean Silvad7fb3962012-12-05 00:26:32 +0000247expects that the current token is a '(' token, but after parsing the
248subexpression, it is possible that there is no ')' waiting. For example,
249if the user types in "(4 x" instead of "(4)", the parser should emit an
250error. Because errors can occur, the parser needs a way to indicate that
251they happened: in our parser, we return null on an error.
252
2532) Another interesting aspect of this function is that it uses recursion
254by calling ``ParseExpression`` (we will soon see that
255``ParseExpression`` can call ``ParseParenExpr``). This is powerful
256because it allows us to handle recursive grammars, and keeps each
257production very simple. Note that parentheses do not cause construction
258of AST nodes themselves. While we could do it this way, the most
259important role of parentheses are to guide the parser and provide
260grouping. Once the parser constructs the AST, parentheses are not
261needed.
262
263The next simple production is for handling variable references and
264function calls:
265
266.. code-block:: c++
267
268 /// identifierexpr
269 /// ::= identifier
270 /// ::= identifier '(' expression* ')'
Lang Hames09bf4c12015-08-18 18:11:06 +0000271 static std::unique_ptr<ExprAST> ParseIdentifierExpr() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000272 std::string IdName = IdentifierStr;
273
274 getNextToken(); // eat identifier.
275
276 if (CurTok != '(') // Simple variable ref.
Lang Hames09bf4c12015-08-18 18:11:06 +0000277 return llvm::make_unique<VariableExprAST>(IdName);
Sean Silvad7fb3962012-12-05 00:26:32 +0000278
279 // Call.
280 getNextToken(); // eat (
Lang Hames09bf4c12015-08-18 18:11:06 +0000281 std::vector<std::unique_ptr<ExprAST>> Args;
Sean Silvad7fb3962012-12-05 00:26:32 +0000282 if (CurTok != ')') {
283 while (1) {
Lang Hames59b0da82015-08-19 18:15:58 +0000284 if (auto Arg = ParseExpression())
285 Args.push_back(std::move(Arg));
286 else
287 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000288
Lang Hames59b0da82015-08-19 18:15:58 +0000289 if (CurTok == ')')
290 break;
Sean Silvad7fb3962012-12-05 00:26:32 +0000291
292 if (CurTok != ',')
Lang Hames5d045a92016-03-25 17:41:26 +0000293 return LogError("Expected ')' or ',' in argument list");
Sean Silvad7fb3962012-12-05 00:26:32 +0000294 getNextToken();
295 }
296 }
297
298 // Eat the ')'.
299 getNextToken();
300
Lang Hames09bf4c12015-08-18 18:11:06 +0000301 return llvm::make_unique<CallExprAST>(IdName, std::move(Args));
Sean Silvad7fb3962012-12-05 00:26:32 +0000302 }
303
304This routine follows the same style as the other routines. (It expects
305to be called if the current token is a ``tok_identifier`` token). It
306also has recursion and error handling. One interesting aspect of this is
307that it uses *look-ahead* to determine if the current identifier is a
308stand alone variable reference or if it is a function call expression.
309It handles this by checking to see if the token after the identifier is
310a '(' token, constructing either a ``VariableExprAST`` or
311``CallExprAST`` node as appropriate.
312
313Now that we have all of our simple expression-parsing logic in place, we
314can define a helper function to wrap it together into one entry point.
315We call this class of expressions "primary" expressions, for reasons
316that will become more clear `later in the
Alex Denisov596e9792015-12-15 20:50:29 +0000317tutorial <LangImpl6.html#user-defined-unary-operators>`_. In order to parse an arbitrary
Sean Silvad7fb3962012-12-05 00:26:32 +0000318primary expression, we need to determine what sort of expression it is:
319
320.. code-block:: c++
321
322 /// primary
323 /// ::= identifierexpr
324 /// ::= numberexpr
325 /// ::= parenexpr
Lang Hames09bf4c12015-08-18 18:11:06 +0000326 static std::unique_ptr<ExprAST> ParsePrimary() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000327 switch (CurTok) {
Lang Hames59b0da82015-08-19 18:15:58 +0000328 default:
Lang Hames5d045a92016-03-25 17:41:26 +0000329 return LogError("unknown token when expecting an expression");
Lang Hames59b0da82015-08-19 18:15:58 +0000330 case tok_identifier:
331 return ParseIdentifierExpr();
332 case tok_number:
333 return ParseNumberExpr();
334 case '(':
335 return ParseParenExpr();
Sean Silvad7fb3962012-12-05 00:26:32 +0000336 }
337 }
338
339Now that you see the definition of this function, it is more obvious why
340we can assume the state of CurTok in the various functions. This uses
341look-ahead to determine which sort of expression is being inspected, and
342then parses it with a function call.
343
344Now that basic expressions are handled, we need to handle binary
345expressions. They are a bit more complex.
346
347Binary Expression Parsing
348=========================
349
350Binary expressions are significantly harder to parse because they are
351often ambiguous. For example, when given the string "x+y\*z", the parser
352can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
353definitions from mathematics, we expect the later parse, because "\*"
354(multiplication) has higher *precedence* than "+" (addition).
355
356There are many ways to handle this, but an elegant and efficient way is
357to use `Operator-Precedence
358Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
359This parsing technique uses the precedence of binary operators to guide
360recursion. To start with, we need a table of precedences:
361
362.. code-block:: c++
363
364 /// BinopPrecedence - This holds the precedence for each binary operator that is
365 /// defined.
366 static std::map<char, int> BinopPrecedence;
367
368 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
369 static int GetTokPrecedence() {
370 if (!isascii(CurTok))
371 return -1;
372
373 // Make sure it's a declared binop.
374 int TokPrec = BinopPrecedence[CurTok];
375 if (TokPrec <= 0) return -1;
376 return TokPrec;
377 }
378
379 int main() {
380 // Install standard binary operators.
381 // 1 is lowest precedence.
382 BinopPrecedence['<'] = 10;
383 BinopPrecedence['+'] = 20;
384 BinopPrecedence['-'] = 20;
385 BinopPrecedence['*'] = 40; // highest.
386 ...
387 }
388
389For the basic form of Kaleidoscope, we will only support 4 binary
390operators (this can obviously be extended by you, our brave and intrepid
391reader). The ``GetTokPrecedence`` function returns the precedence for
392the current token, or -1 if the token is not a binary operator. Having a
393map makes it easy to add new operators and makes it clear that the
394algorithm doesn't depend on the specific operators involved, but it
395would be easy enough to eliminate the map and do the comparisons in the
396``GetTokPrecedence`` function. (Or just use a fixed-size array).
397
398With the helper above defined, we can now start parsing binary
399expressions. The basic idea of operator precedence parsing is to break
400down an expression with potentially ambiguous binary operators into
Alex Denisovbc769d42015-11-15 14:13:24 +0000401pieces. Consider, for example, the expression "a+b+(c+d)\*e\*f+g".
Sean Silvad7fb3962012-12-05 00:26:32 +0000402Operator precedence parsing considers this as a stream of primary
403expressions separated by binary operators. As such, it will first parse
404the leading primary expression "a", then it will see the pairs [+, b]
405[+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
406primary expressions, the binary expression parser doesn't need to worry
407about nested subexpressions like (c+d) at all.
408
409To start, an expression is a primary expression potentially followed by
410a sequence of [binop,primaryexpr] pairs:
411
412.. code-block:: c++
413
414 /// expression
415 /// ::= primary binoprhs
416 ///
Lang Hames09bf4c12015-08-18 18:11:06 +0000417 static std::unique_ptr<ExprAST> ParseExpression() {
418 auto LHS = ParsePrimary();
Lang Hames59b0da82015-08-19 18:15:58 +0000419 if (!LHS)
420 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000421
Lang Hames09bf4c12015-08-18 18:11:06 +0000422 return ParseBinOpRHS(0, std::move(LHS));
Sean Silvad7fb3962012-12-05 00:26:32 +0000423 }
424
425``ParseBinOpRHS`` is the function that parses the sequence of pairs for
426us. It takes a precedence and a pointer to an expression for the part
427that has been parsed so far. Note that "x" is a perfectly valid
428expression: As such, "binoprhs" is allowed to be empty, in which case it
429returns the expression that is passed into it. In our example above, the
430code passes the expression for "a" into ``ParseBinOpRHS`` and the
431current token is "+".
432
433The precedence value passed into ``ParseBinOpRHS`` indicates the
434*minimal operator precedence* that the function is allowed to eat. For
435example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
436passed in a precedence of 40, it will not consume any tokens (because
437the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
438starts with:
439
440.. code-block:: c++
441
442 /// binoprhs
443 /// ::= ('+' primary)*
Lang Hames09bf4c12015-08-18 18:11:06 +0000444 static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
445 std::unique_ptr<ExprAST> LHS) {
Sean Silvad7fb3962012-12-05 00:26:32 +0000446 // If this is a binop, find its precedence.
447 while (1) {
448 int TokPrec = GetTokPrecedence();
449
450 // If this is a binop that binds at least as tightly as the current binop,
451 // consume it, otherwise we are done.
452 if (TokPrec < ExprPrec)
453 return LHS;
454
455This code gets the precedence of the current token and checks to see if
456if is too low. Because we defined invalid tokens to have a precedence of
457-1, this check implicitly knows that the pair-stream ends when the token
458stream runs out of binary operators. If this check succeeds, we know
459that the token is a binary operator and that it will be included in this
460expression:
461
462.. code-block:: c++
463
464 // Okay, we know this is a binop.
465 int BinOp = CurTok;
466 getNextToken(); // eat binop
467
468 // Parse the primary expression after the binary operator.
Lang Hames09bf4c12015-08-18 18:11:06 +0000469 auto RHS = ParsePrimary();
Lang Hames59b0da82015-08-19 18:15:58 +0000470 if (!RHS)
471 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000472
473As such, this code eats (and remembers) the binary operator and then
474parses the primary expression that follows. This builds up the whole
475pair, the first of which is [+, b] for the running example.
476
477Now that we parsed the left-hand side of an expression and one pair of
478the RHS sequence, we have to decide which way the expression associates.
479In particular, we could have "(a+b) binop unparsed" or "a + (b binop
480unparsed)". To determine this, we look ahead at "binop" to determine its
481precedence and compare it to BinOp's precedence (which is '+' in this
482case):
483
484.. code-block:: c++
485
486 // If BinOp binds less tightly with RHS than the operator after RHS, let
487 // the pending operator take RHS as its LHS.
488 int NextPrec = GetTokPrecedence();
489 if (TokPrec < NextPrec) {
490
491If the precedence of the binop to the right of "RHS" is lower or equal
492to the precedence of our current operator, then we know that the
493parentheses associate as "(a+b) binop ...". In our example, the current
494operator is "+" and the next operator is "+", we know that they have the
495same precedence. In this case we'll create the AST node for "a+b", and
496then continue parsing:
497
498.. code-block:: c++
499
500 ... if body omitted ...
501 }
502
503 // Merge LHS/RHS.
Lang Hames09bf4c12015-08-18 18:11:06 +0000504 LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
505 std::move(RHS));
Sean Silvad7fb3962012-12-05 00:26:32 +0000506 } // loop around to the top of the while loop.
507 }
508
509In our example above, this will turn "a+b+" into "(a+b)" and execute the
510next iteration of the loop, with "+" as the current token. The code
511above will eat, remember, and parse "(c+d)" as the primary expression,
512which makes the current pair equal to [+, (c+d)]. It will then evaluate
513the 'if' conditional above with "\*" as the binop to the right of the
514primary. In this case, the precedence of "\*" is higher than the
515precedence of "+" so the if condition will be entered.
516
517The critical question left here is "how can the if condition parse the
518right hand side in full"? In particular, to build the AST correctly for
519our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
520variable. The code to do this is surprisingly simple (code from the
521above two blocks duplicated for context):
522
523.. code-block:: c++
524
525 // If BinOp binds less tightly with RHS than the operator after RHS, let
526 // the pending operator take RHS as its LHS.
527 int NextPrec = GetTokPrecedence();
528 if (TokPrec < NextPrec) {
Lang Hames09bf4c12015-08-18 18:11:06 +0000529 RHS = ParseBinOpRHS(TokPrec+1, std::move(RHS));
Lang Hames59b0da82015-08-19 18:15:58 +0000530 if (!RHS)
531 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000532 }
533 // Merge LHS/RHS.
Lang Hames09bf4c12015-08-18 18:11:06 +0000534 LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
535 std::move(RHS));
Sean Silvad7fb3962012-12-05 00:26:32 +0000536 } // loop around to the top of the while loop.
537 }
538
539At this point, we know that the binary operator to the RHS of our
540primary has higher precedence than the binop we are currently parsing.
541As such, we know that any sequence of pairs whose operators are all
542higher precedence than "+" should be parsed together and returned as
543"RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
544specifying "TokPrec+1" as the minimum precedence required for it to
545continue. In our example above, this will cause it to return the AST
546node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+'
547expression.
548
549Finally, on the next iteration of the while loop, the "+g" piece is
550parsed and added to the AST. With this little bit of code (14
551non-trivial lines), we correctly handle fully general binary expression
552parsing in a very elegant way. This was a whirlwind tour of this code,
553and it is somewhat subtle. I recommend running through it with a few
554tough examples to see how it works.
555
556This wraps up handling of expressions. At this point, we can point the
557parser at an arbitrary token stream and build an expression from it,
558stopping at the first token that is not part of the expression. Next up
559we need to handle function definitions, etc.
560
561Parsing the Rest
562================
563
564The next thing missing is handling of function prototypes. In
565Kaleidoscope, these are used both for 'extern' function declarations as
566well as function body definitions. The code to do this is
567straight-forward and not very interesting (once you've survived
568expressions):
569
570.. code-block:: c++
571
572 /// prototype
573 /// ::= id '(' id* ')'
Lang Hames09bf4c12015-08-18 18:11:06 +0000574 static std::unique_ptr<PrototypeAST> ParsePrototype() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000575 if (CurTok != tok_identifier)
Lang Hames5d045a92016-03-25 17:41:26 +0000576 return LogErrorP("Expected function name in prototype");
Sean Silvad7fb3962012-12-05 00:26:32 +0000577
578 std::string FnName = IdentifierStr;
579 getNextToken();
580
581 if (CurTok != '(')
Lang Hames5d045a92016-03-25 17:41:26 +0000582 return LogErrorP("Expected '(' in prototype");
Sean Silvad7fb3962012-12-05 00:26:32 +0000583
584 // Read the list of argument names.
585 std::vector<std::string> ArgNames;
586 while (getNextToken() == tok_identifier)
587 ArgNames.push_back(IdentifierStr);
588 if (CurTok != ')')
Lang Hames5d045a92016-03-25 17:41:26 +0000589 return LogErrorP("Expected ')' in prototype");
Sean Silvad7fb3962012-12-05 00:26:32 +0000590
591 // success.
592 getNextToken(); // eat ')'.
593
Lang Hames09bf4c12015-08-18 18:11:06 +0000594 return llvm::make_unique<PrototypeAST>(FnName, std::move(ArgNames));
Sean Silvad7fb3962012-12-05 00:26:32 +0000595 }
596
597Given this, a function definition is very simple, just a prototype plus
598an expression to implement the body:
599
600.. code-block:: c++
601
602 /// definition ::= 'def' prototype expression
Lang Hames09bf4c12015-08-18 18:11:06 +0000603 static std::unique_ptr<FunctionAST> ParseDefinition() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000604 getNextToken(); // eat def.
Lang Hames09bf4c12015-08-18 18:11:06 +0000605 auto Proto = ParsePrototype();
606 if (!Proto) return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000607
Lang Hames09bf4c12015-08-18 18:11:06 +0000608 if (auto E = ParseExpression())
609 return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
610 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000611 }
612
613In addition, we support 'extern' to declare functions like 'sin' and
614'cos' as well as to support forward declaration of user functions. These
615'extern's are just prototypes with no body:
616
617.. code-block:: c++
618
619 /// external ::= 'extern' prototype
Lang Hames09bf4c12015-08-18 18:11:06 +0000620 static std::unique_ptr<PrototypeAST> ParseExtern() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000621 getNextToken(); // eat extern.
622 return ParsePrototype();
623 }
624
625Finally, we'll also let the user type in arbitrary top-level expressions
626and evaluate them on the fly. We will handle this by defining anonymous
627nullary (zero argument) functions for them:
628
629.. code-block:: c++
630
631 /// toplevelexpr ::= expression
Lang Hames09bf4c12015-08-18 18:11:06 +0000632 static std::unique_ptr<FunctionAST> ParseTopLevelExpr() {
633 if (auto E = ParseExpression()) {
Sean Silvad7fb3962012-12-05 00:26:32 +0000634 // Make an anonymous proto.
Lang Hames09bf4c12015-08-18 18:11:06 +0000635 auto Proto = llvm::make_unique<PrototypeAST>("", std::vector<std::string>());
636 return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
Sean Silvad7fb3962012-12-05 00:26:32 +0000637 }
Lang Hames09bf4c12015-08-18 18:11:06 +0000638 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000639 }
640
641Now that we have all the pieces, let's build a little driver that will
642let us actually *execute* this code we've built!
643
644The Driver
645==========
646
647The driver for this simply invokes all of the parsing pieces with a
648top-level dispatch loop. There isn't much interesting here, so I'll just
Alex Denisov596e9792015-12-15 20:50:29 +0000649include the top-level loop. See `below <#full-code-listing>`_ for full code in the
Sean Silvad7fb3962012-12-05 00:26:32 +0000650"Top-Level Parsing" section.
651
652.. code-block:: c++
653
654 /// top ::= definition | external | expression | ';'
655 static void MainLoop() {
656 while (1) {
657 fprintf(stderr, "ready> ");
658 switch (CurTok) {
Lang Hames59b0da82015-08-19 18:15:58 +0000659 case tok_eof:
660 return;
661 case ';': // ignore top-level semicolons.
662 getNextToken();
663 break;
664 case tok_def:
665 HandleDefinition();
666 break;
667 case tok_extern:
668 HandleExtern();
669 break;
670 default:
671 HandleTopLevelExpression();
672 break;
Sean Silvad7fb3962012-12-05 00:26:32 +0000673 }
674 }
675 }
676
677The most interesting part of this is that we ignore top-level
678semicolons. Why is this, you ask? The basic reason is that if you type
679"4 + 5" at the command line, the parser doesn't know whether that is the
680end of what you will type or not. For example, on the next line you
681could type "def foo..." in which case 4+5 is the end of a top-level
682expression. Alternatively you could type "\* 6", which would continue
683the expression. Having top-level semicolons allows you to type "4+5;",
684and the parser will know you are done.
685
686Conclusions
687===========
688
689With just under 400 lines of commented code (240 lines of non-comment,
690non-blank code), we fully defined our minimal language, including a
691lexer, parser, and AST builder. With this done, the executable will
692validate Kaleidoscope code and tell us if it is grammatically invalid.
693For example, here is a sample interaction:
694
695.. code-block:: bash
696
697 $ ./a.out
698 ready> def foo(x y) x+foo(y, 4.0);
699 Parsed a function definition.
700 ready> def foo(x y) x+y y;
701 Parsed a function definition.
702 Parsed a top-level expr
703 ready> def foo(x y) x+y );
704 Parsed a function definition.
705 Error: unknown token when expecting an expression
706 ready> extern sin(a);
707 ready> Parsed an extern
708 ready> ^D
709 $
710
711There is a lot of room for extension here. You can define new AST nodes,
712extend the language in many ways, etc. In the `next
Davide Italiano11cfa452016-09-13 06:31:37 +0000713installment <LangImpl03.html>`_, we will describe how to generate LLVM
Sean Silvad7fb3962012-12-05 00:26:32 +0000714Intermediate Representation (IR) from the AST.
715
716Full Code Listing
717=================
718
719Here is the complete code listing for this and the previous chapter.
720Note that it is fully self-contained: you don't need LLVM or any
721external libraries at all for this. (Besides the C and C++ standard
722libraries, of course.) To build this, just compile with:
723
724.. code-block:: bash
725
726 # Compile
727 clang++ -g -O3 toy.cpp
728 # Run
729 ./a.out
730
731Here is the code:
732
Logan Chien855b17d2013-06-08 09:03:03 +0000733.. literalinclude:: ../../examples/Kaleidoscope/Chapter2/toy.cpp
734 :language: c++
Sean Silvad7fb3962012-12-05 00:26:32 +0000735
Wilfred Hughes945f43e2016-07-02 17:01:59 +0000736`Next: Implementing Code Generation to LLVM IR <LangImpl03.html>`_
Sean Silvad7fb3962012-12-05 00:26:32 +0000737