blob: f01e3f90f20d2ce6c11a1659890bc080eabec134 [file] [log] [blame]
Sean Silvad7fb3962012-12-05 00:26:32 +00001===========================================
2Kaleidoscope: Implementing a Parser and AST
3===========================================
4
5.. contents::
6 :local:
7
Sean Silvad7fb3962012-12-05 00:26:32 +00008Chapter 2 Introduction
9======================
10
11Welcome to Chapter 2 of the "`Implementing a language with
12LLVM <index.html>`_" tutorial. This chapter shows you how to use the
13lexer, built in `Chapter 1 <LangImpl1.html>`_, to build a full
14`parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
15language. Once we have a parser, we'll define and build an `Abstract
16Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
17
18The parser we will build uses a combination of `Recursive Descent
19Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
20`Operator-Precedence
21Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
22parse the Kaleidoscope language (the latter for binary expressions and
23the former for everything else). Before we get to parsing though, lets
24talk about the output of the parser: the Abstract Syntax Tree.
25
26The Abstract Syntax Tree (AST)
27==============================
28
29The AST for a program captures its behavior in such a way that it is
30easy for later stages of the compiler (e.g. code generation) to
31interpret. We basically want one object for each construct in the
32language, and the AST should closely model the language. In
33Kaleidoscope, we have expressions, a prototype, and a function object.
34We'll start with expressions first:
35
36.. code-block:: c++
37
38 /// ExprAST - Base class for all expression nodes.
39 class ExprAST {
40 public:
41 virtual ~ExprAST() {}
42 };
43
44 /// NumberExprAST - Expression class for numeric literals like "1.0".
45 class NumberExprAST : public ExprAST {
46 double Val;
Lang Hames59b0da82015-08-19 18:15:58 +000047
Sean Silvad7fb3962012-12-05 00:26:32 +000048 public:
Lang Hames09bf4c12015-08-18 18:11:06 +000049 NumberExprAST(double Val) : Val(Val) {}
Sean Silvad7fb3962012-12-05 00:26:32 +000050 };
51
52The code above shows the definition of the base ExprAST class and one
53subclass which we use for numeric literals. The important thing to note
54about this code is that the NumberExprAST class captures the numeric
55value of the literal as an instance variable. This allows later phases
56of the compiler to know what the stored numeric value is.
57
58Right now we only create the AST, so there are no useful accessor
59methods on them. It would be very easy to add a virtual method to pretty
60print the code, for example. Here are the other expression AST node
61definitions that we'll use in the basic form of the Kaleidoscope
62language:
63
64.. code-block:: c++
65
66 /// VariableExprAST - Expression class for referencing a variable, like "a".
67 class VariableExprAST : public ExprAST {
68 std::string Name;
Lang Hames59b0da82015-08-19 18:15:58 +000069
Sean Silvad7fb3962012-12-05 00:26:32 +000070 public:
Lang Hames09bf4c12015-08-18 18:11:06 +000071 VariableExprAST(const std::string &Name) : Name(Name) {}
Sean Silvad7fb3962012-12-05 00:26:32 +000072 };
73
74 /// BinaryExprAST - Expression class for a binary operator.
75 class BinaryExprAST : public ExprAST {
76 char Op;
Lang Hames09bf4c12015-08-18 18:11:06 +000077 std::unique_ptr<ExprAST> LHS, RHS;
Lang Hames59b0da82015-08-19 18:15:58 +000078
Sean Silvad7fb3962012-12-05 00:26:32 +000079 public:
Lang Hames09bf4c12015-08-18 18:11:06 +000080 BinaryExprAST(char op, std::unique_ptr<ExprAST> LHS,
81 std::unique_ptr<ExprAST> RHS)
82 : Op(op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
Sean Silvad7fb3962012-12-05 00:26:32 +000083 };
84
85 /// CallExprAST - Expression class for function calls.
86 class CallExprAST : public ExprAST {
87 std::string Callee;
Lang Hames2d789c32015-08-26 03:07:41 +000088 std::vector<std::unique_ptr<ExprAST>> Args;
Lang Hames59b0da82015-08-19 18:15:58 +000089
Sean Silvad7fb3962012-12-05 00:26:32 +000090 public:
Lang Hames09bf4c12015-08-18 18:11:06 +000091 CallExprAST(const std::string &Callee,
92 std::vector<std::unique_ptr<ExprAST>> Args)
93 : Callee(Callee), Args(std::move(Args)) {}
Sean Silvad7fb3962012-12-05 00:26:32 +000094 };
95
96This is all (intentionally) rather straight-forward: variables capture
97the variable name, binary operators capture their opcode (e.g. '+'), and
98calls capture a function name as well as a list of any argument
99expressions. One thing that is nice about our AST is that it captures
100the language features without talking about the syntax of the language.
101Note that there is no discussion about precedence of binary operators,
102lexical structure, etc.
103
104For our basic language, these are all of the expression nodes we'll
105define. Because it doesn't have conditional control flow, it isn't
106Turing-complete; we'll fix that in a later installment. The two things
107we need next are a way to talk about the interface to a function, and a
108way to talk about functions themselves:
109
110.. code-block:: c++
111
112 /// PrototypeAST - This class represents the "prototype" for a function,
113 /// which captures its name, and its argument names (thus implicitly the number
114 /// of arguments the function takes).
115 class PrototypeAST {
116 std::string Name;
117 std::vector<std::string> Args;
Lang Hames59b0da82015-08-19 18:15:58 +0000118
Sean Silvad7fb3962012-12-05 00:26:32 +0000119 public:
Lang Hames09bf4c12015-08-18 18:11:06 +0000120 PrototypeAST(const std::string &name, std::vector<std::string> Args)
121 : Name(name), Args(std::move(Args)) {}
Sean Silvad7fb3962012-12-05 00:26:32 +0000122 };
123
124 /// FunctionAST - This class represents a function definition itself.
125 class FunctionAST {
Lang Hames09bf4c12015-08-18 18:11:06 +0000126 std::unique_ptr<PrototypeAST> Proto;
127 std::unique_ptr<ExprAST> Body;
Lang Hames59b0da82015-08-19 18:15:58 +0000128
Sean Silvad7fb3962012-12-05 00:26:32 +0000129 public:
Lang Hames09bf4c12015-08-18 18:11:06 +0000130 FunctionAST(std::unique_ptr<PrototypeAST> Proto,
131 std::unique_ptr<ExprAST> Body)
132 : Proto(std::move(Proto)), Body(std::move(Body)) {}
Sean Silvad7fb3962012-12-05 00:26:32 +0000133 };
134
135In Kaleidoscope, functions are typed with just a count of their
136arguments. Since all values are double precision floating point, the
137type of each argument doesn't need to be stored anywhere. In a more
138aggressive and realistic language, the "ExprAST" class would probably
139have a type field.
140
141With this scaffolding, we can now talk about parsing expressions and
142function bodies in Kaleidoscope.
143
144Parser Basics
145=============
146
147Now that we have an AST to build, we need to define the parser code to
148build it. The idea here is that we want to parse something like "x+y"
149(which is returned as three tokens by the lexer) into an AST that could
150be generated with calls like this:
151
152.. code-block:: c++
153
Lang Hames09bf4c12015-08-18 18:11:06 +0000154 auto LHS = llvm::make_unique<VariableExprAST>("x");
155 auto RHS = llvm::make_unique<VariableExprAST>("y");
156 auto Result = std::make_unique<BinaryExprAST>('+', std::move(LHS),
157 std::move(RHS));
Sean Silvad7fb3962012-12-05 00:26:32 +0000158
159In order to do this, we'll start by defining some basic helper routines:
160
161.. code-block:: c++
162
163 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
164 /// token the parser is looking at. getNextToken reads another token from the
165 /// lexer and updates CurTok with its results.
166 static int CurTok;
167 static int getNextToken() {
168 return CurTok = gettok();
169 }
170
171This implements a simple token buffer around the lexer. This allows us
172to look one token ahead at what the lexer is returning. Every function
173in our parser will assume that CurTok is the current token that needs to
174be parsed.
175
176.. code-block:: c++
177
178
Lang Hames5d045a92016-03-25 17:41:26 +0000179 /// LogError* - These are little helper functions for error handling.
180 std::unique_ptr<ExprAST> LogError(const char *Str) {
181 fprintf(stderr, "LogError: %s\n", Str);
Lang Hames59b0da82015-08-19 18:15:58 +0000182 return nullptr;
183 }
Lang Hames5d045a92016-03-25 17:41:26 +0000184 std::unique_ptr<PrototypeAST> LogErrorP(const char *Str) {
185 LogError(Str);
Lang Hames59b0da82015-08-19 18:15:58 +0000186 return nullptr;
187 }
Sean Silvad7fb3962012-12-05 00:26:32 +0000188
Lang Hames5d045a92016-03-25 17:41:26 +0000189The ``LogError`` routines are simple helper routines that our parser will
Sean Silvad7fb3962012-12-05 00:26:32 +0000190use to handle errors. The error recovery in our parser will not be the
191best and is not particular user-friendly, but it will be enough for our
192tutorial. These routines make it easier to handle errors in routines
193that have various return types: they always return null.
194
195With these basic helper functions, we can implement the first piece of
196our grammar: numeric literals.
197
198Basic Expression Parsing
199========================
200
201We start with numeric literals, because they are the simplest to
202process. For each production in our grammar, we'll define a function
203which parses that production. For numeric literals, we have:
204
205.. code-block:: c++
206
207 /// numberexpr ::= number
Lang Hames09bf4c12015-08-18 18:11:06 +0000208 static std::unique_ptr<ExprAST> ParseNumberExpr() {
209 auto Result = llvm::make_unique<NumberExprAST>(NumVal);
Sean Silvad7fb3962012-12-05 00:26:32 +0000210 getNextToken(); // consume the number
Lang Hames09bf4c12015-08-18 18:11:06 +0000211 return std::move(Result);
Sean Silvad7fb3962012-12-05 00:26:32 +0000212 }
213
214This routine is very simple: it expects to be called when the current
215token is a ``tok_number`` token. It takes the current number value,
216creates a ``NumberExprAST`` node, advances the lexer to the next token,
217and finally returns.
218
219There are some interesting aspects to this. The most important one is
220that this routine eats all of the tokens that correspond to the
221production and returns the lexer buffer with the next token (which is
222not part of the grammar production) ready to go. This is a fairly
223standard way to go for recursive descent parsers. For a better example,
224the parenthesis operator is defined like this:
225
226.. code-block:: c++
227
228 /// parenexpr ::= '(' expression ')'
Lang Hames09bf4c12015-08-18 18:11:06 +0000229 static std::unique_ptr<ExprAST> ParseParenExpr() {
Lang Hames59b0da82015-08-19 18:15:58 +0000230 getNextToken(); // eat (.
Lang Hames09bf4c12015-08-18 18:11:06 +0000231 auto V = ParseExpression();
Lang Hames59b0da82015-08-19 18:15:58 +0000232 if (!V)
233 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000234
235 if (CurTok != ')')
Lang Hames5d045a92016-03-25 17:41:26 +0000236 return LogError("expected ')'");
Lang Hames59b0da82015-08-19 18:15:58 +0000237 getNextToken(); // eat ).
Sean Silvad7fb3962012-12-05 00:26:32 +0000238 return V;
239 }
240
241This function illustrates a number of interesting things about the
242parser:
243
Lang Hames5d045a92016-03-25 17:41:26 +00002441) It shows how we use the LogError routines. When called, this function
Sean Silvad7fb3962012-12-05 00:26:32 +0000245expects that the current token is a '(' token, but after parsing the
246subexpression, it is possible that there is no ')' waiting. For example,
247if the user types in "(4 x" instead of "(4)", the parser should emit an
248error. Because errors can occur, the parser needs a way to indicate that
249they happened: in our parser, we return null on an error.
250
2512) Another interesting aspect of this function is that it uses recursion
252by calling ``ParseExpression`` (we will soon see that
253``ParseExpression`` can call ``ParseParenExpr``). This is powerful
254because it allows us to handle recursive grammars, and keeps each
255production very simple. Note that parentheses do not cause construction
256of AST nodes themselves. While we could do it this way, the most
257important role of parentheses are to guide the parser and provide
258grouping. Once the parser constructs the AST, parentheses are not
259needed.
260
261The next simple production is for handling variable references and
262function calls:
263
264.. code-block:: c++
265
266 /// identifierexpr
267 /// ::= identifier
268 /// ::= identifier '(' expression* ')'
Lang Hames09bf4c12015-08-18 18:11:06 +0000269 static std::unique_ptr<ExprAST> ParseIdentifierExpr() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000270 std::string IdName = IdentifierStr;
271
272 getNextToken(); // eat identifier.
273
274 if (CurTok != '(') // Simple variable ref.
Lang Hames09bf4c12015-08-18 18:11:06 +0000275 return llvm::make_unique<VariableExprAST>(IdName);
Sean Silvad7fb3962012-12-05 00:26:32 +0000276
277 // Call.
278 getNextToken(); // eat (
Lang Hames09bf4c12015-08-18 18:11:06 +0000279 std::vector<std::unique_ptr<ExprAST>> Args;
Sean Silvad7fb3962012-12-05 00:26:32 +0000280 if (CurTok != ')') {
281 while (1) {
Lang Hames59b0da82015-08-19 18:15:58 +0000282 if (auto Arg = ParseExpression())
283 Args.push_back(std::move(Arg));
284 else
285 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000286
Lang Hames59b0da82015-08-19 18:15:58 +0000287 if (CurTok == ')')
288 break;
Sean Silvad7fb3962012-12-05 00:26:32 +0000289
290 if (CurTok != ',')
Lang Hames5d045a92016-03-25 17:41:26 +0000291 return LogError("Expected ')' or ',' in argument list");
Sean Silvad7fb3962012-12-05 00:26:32 +0000292 getNextToken();
293 }
294 }
295
296 // Eat the ')'.
297 getNextToken();
298
Lang Hames09bf4c12015-08-18 18:11:06 +0000299 return llvm::make_unique<CallExprAST>(IdName, std::move(Args));
Sean Silvad7fb3962012-12-05 00:26:32 +0000300 }
301
302This routine follows the same style as the other routines. (It expects
303to be called if the current token is a ``tok_identifier`` token). It
304also has recursion and error handling. One interesting aspect of this is
305that it uses *look-ahead* to determine if the current identifier is a
306stand alone variable reference or if it is a function call expression.
307It handles this by checking to see if the token after the identifier is
308a '(' token, constructing either a ``VariableExprAST`` or
309``CallExprAST`` node as appropriate.
310
311Now that we have all of our simple expression-parsing logic in place, we
312can define a helper function to wrap it together into one entry point.
313We call this class of expressions "primary" expressions, for reasons
314that will become more clear `later in the
Alex Denisov596e9792015-12-15 20:50:29 +0000315tutorial <LangImpl6.html#user-defined-unary-operators>`_. In order to parse an arbitrary
Sean Silvad7fb3962012-12-05 00:26:32 +0000316primary expression, we need to determine what sort of expression it is:
317
318.. code-block:: c++
319
320 /// primary
321 /// ::= identifierexpr
322 /// ::= numberexpr
323 /// ::= parenexpr
Lang Hames09bf4c12015-08-18 18:11:06 +0000324 static std::unique_ptr<ExprAST> ParsePrimary() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000325 switch (CurTok) {
Lang Hames59b0da82015-08-19 18:15:58 +0000326 default:
Lang Hames5d045a92016-03-25 17:41:26 +0000327 return LogError("unknown token when expecting an expression");
Lang Hames59b0da82015-08-19 18:15:58 +0000328 case tok_identifier:
329 return ParseIdentifierExpr();
330 case tok_number:
331 return ParseNumberExpr();
332 case '(':
333 return ParseParenExpr();
Sean Silvad7fb3962012-12-05 00:26:32 +0000334 }
335 }
336
337Now that you see the definition of this function, it is more obvious why
338we can assume the state of CurTok in the various functions. This uses
339look-ahead to determine which sort of expression is being inspected, and
340then parses it with a function call.
341
342Now that basic expressions are handled, we need to handle binary
343expressions. They are a bit more complex.
344
345Binary Expression Parsing
346=========================
347
348Binary expressions are significantly harder to parse because they are
349often ambiguous. For example, when given the string "x+y\*z", the parser
350can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
351definitions from mathematics, we expect the later parse, because "\*"
352(multiplication) has higher *precedence* than "+" (addition).
353
354There are many ways to handle this, but an elegant and efficient way is
355to use `Operator-Precedence
356Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
357This parsing technique uses the precedence of binary operators to guide
358recursion. To start with, we need a table of precedences:
359
360.. code-block:: c++
361
362 /// BinopPrecedence - This holds the precedence for each binary operator that is
363 /// defined.
364 static std::map<char, int> BinopPrecedence;
365
366 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
367 static int GetTokPrecedence() {
368 if (!isascii(CurTok))
369 return -1;
370
371 // Make sure it's a declared binop.
372 int TokPrec = BinopPrecedence[CurTok];
373 if (TokPrec <= 0) return -1;
374 return TokPrec;
375 }
376
377 int main() {
378 // Install standard binary operators.
379 // 1 is lowest precedence.
380 BinopPrecedence['<'] = 10;
381 BinopPrecedence['+'] = 20;
382 BinopPrecedence['-'] = 20;
383 BinopPrecedence['*'] = 40; // highest.
384 ...
385 }
386
387For the basic form of Kaleidoscope, we will only support 4 binary
388operators (this can obviously be extended by you, our brave and intrepid
389reader). The ``GetTokPrecedence`` function returns the precedence for
390the current token, or -1 if the token is not a binary operator. Having a
391map makes it easy to add new operators and makes it clear that the
392algorithm doesn't depend on the specific operators involved, but it
393would be easy enough to eliminate the map and do the comparisons in the
394``GetTokPrecedence`` function. (Or just use a fixed-size array).
395
396With the helper above defined, we can now start parsing binary
397expressions. The basic idea of operator precedence parsing is to break
398down an expression with potentially ambiguous binary operators into
Alex Denisovbc769d42015-11-15 14:13:24 +0000399pieces. Consider, for example, the expression "a+b+(c+d)\*e\*f+g".
Sean Silvad7fb3962012-12-05 00:26:32 +0000400Operator precedence parsing considers this as a stream of primary
401expressions separated by binary operators. As such, it will first parse
402the leading primary expression "a", then it will see the pairs [+, b]
403[+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
404primary expressions, the binary expression parser doesn't need to worry
405about nested subexpressions like (c+d) at all.
406
407To start, an expression is a primary expression potentially followed by
408a sequence of [binop,primaryexpr] pairs:
409
410.. code-block:: c++
411
412 /// expression
413 /// ::= primary binoprhs
414 ///
Lang Hames09bf4c12015-08-18 18:11:06 +0000415 static std::unique_ptr<ExprAST> ParseExpression() {
416 auto LHS = ParsePrimary();
Lang Hames59b0da82015-08-19 18:15:58 +0000417 if (!LHS)
418 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000419
Lang Hames09bf4c12015-08-18 18:11:06 +0000420 return ParseBinOpRHS(0, std::move(LHS));
Sean Silvad7fb3962012-12-05 00:26:32 +0000421 }
422
423``ParseBinOpRHS`` is the function that parses the sequence of pairs for
424us. It takes a precedence and a pointer to an expression for the part
425that has been parsed so far. Note that "x" is a perfectly valid
426expression: As such, "binoprhs" is allowed to be empty, in which case it
427returns the expression that is passed into it. In our example above, the
428code passes the expression for "a" into ``ParseBinOpRHS`` and the
429current token is "+".
430
431The precedence value passed into ``ParseBinOpRHS`` indicates the
432*minimal operator precedence* that the function is allowed to eat. For
433example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
434passed in a precedence of 40, it will not consume any tokens (because
435the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
436starts with:
437
438.. code-block:: c++
439
440 /// binoprhs
441 /// ::= ('+' primary)*
Lang Hames09bf4c12015-08-18 18:11:06 +0000442 static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
443 std::unique_ptr<ExprAST> LHS) {
Sean Silvad7fb3962012-12-05 00:26:32 +0000444 // If this is a binop, find its precedence.
445 while (1) {
446 int TokPrec = GetTokPrecedence();
447
448 // If this is a binop that binds at least as tightly as the current binop,
449 // consume it, otherwise we are done.
450 if (TokPrec < ExprPrec)
451 return LHS;
452
453This code gets the precedence of the current token and checks to see if
454if is too low. Because we defined invalid tokens to have a precedence of
455-1, this check implicitly knows that the pair-stream ends when the token
456stream runs out of binary operators. If this check succeeds, we know
457that the token is a binary operator and that it will be included in this
458expression:
459
460.. code-block:: c++
461
462 // Okay, we know this is a binop.
463 int BinOp = CurTok;
464 getNextToken(); // eat binop
465
466 // Parse the primary expression after the binary operator.
Lang Hames09bf4c12015-08-18 18:11:06 +0000467 auto RHS = ParsePrimary();
Lang Hames59b0da82015-08-19 18:15:58 +0000468 if (!RHS)
469 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000470
471As such, this code eats (and remembers) the binary operator and then
472parses the primary expression that follows. This builds up the whole
473pair, the first of which is [+, b] for the running example.
474
475Now that we parsed the left-hand side of an expression and one pair of
476the RHS sequence, we have to decide which way the expression associates.
477In particular, we could have "(a+b) binop unparsed" or "a + (b binop
478unparsed)". To determine this, we look ahead at "binop" to determine its
479precedence and compare it to BinOp's precedence (which is '+' in this
480case):
481
482.. code-block:: c++
483
484 // If BinOp binds less tightly with RHS than the operator after RHS, let
485 // the pending operator take RHS as its LHS.
486 int NextPrec = GetTokPrecedence();
487 if (TokPrec < NextPrec) {
488
489If the precedence of the binop to the right of "RHS" is lower or equal
490to the precedence of our current operator, then we know that the
491parentheses associate as "(a+b) binop ...". In our example, the current
492operator is "+" and the next operator is "+", we know that they have the
493same precedence. In this case we'll create the AST node for "a+b", and
494then continue parsing:
495
496.. code-block:: c++
497
498 ... if body omitted ...
499 }
500
501 // Merge LHS/RHS.
Lang Hames09bf4c12015-08-18 18:11:06 +0000502 LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
503 std::move(RHS));
Sean Silvad7fb3962012-12-05 00:26:32 +0000504 } // loop around to the top of the while loop.
505 }
506
507In our example above, this will turn "a+b+" into "(a+b)" and execute the
508next iteration of the loop, with "+" as the current token. The code
509above will eat, remember, and parse "(c+d)" as the primary expression,
510which makes the current pair equal to [+, (c+d)]. It will then evaluate
511the 'if' conditional above with "\*" as the binop to the right of the
512primary. In this case, the precedence of "\*" is higher than the
513precedence of "+" so the if condition will be entered.
514
515The critical question left here is "how can the if condition parse the
516right hand side in full"? In particular, to build the AST correctly for
517our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
518variable. The code to do this is surprisingly simple (code from the
519above two blocks duplicated for context):
520
521.. code-block:: c++
522
523 // If BinOp binds less tightly with RHS than the operator after RHS, let
524 // the pending operator take RHS as its LHS.
525 int NextPrec = GetTokPrecedence();
526 if (TokPrec < NextPrec) {
Lang Hames09bf4c12015-08-18 18:11:06 +0000527 RHS = ParseBinOpRHS(TokPrec+1, std::move(RHS));
Lang Hames59b0da82015-08-19 18:15:58 +0000528 if (!RHS)
529 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000530 }
531 // Merge LHS/RHS.
Lang Hames09bf4c12015-08-18 18:11:06 +0000532 LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
533 std::move(RHS));
Sean Silvad7fb3962012-12-05 00:26:32 +0000534 } // loop around to the top of the while loop.
535 }
536
537At this point, we know that the binary operator to the RHS of our
538primary has higher precedence than the binop we are currently parsing.
539As such, we know that any sequence of pairs whose operators are all
540higher precedence than "+" should be parsed together and returned as
541"RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
542specifying "TokPrec+1" as the minimum precedence required for it to
543continue. In our example above, this will cause it to return the AST
544node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+'
545expression.
546
547Finally, on the next iteration of the while loop, the "+g" piece is
548parsed and added to the AST. With this little bit of code (14
549non-trivial lines), we correctly handle fully general binary expression
550parsing in a very elegant way. This was a whirlwind tour of this code,
551and it is somewhat subtle. I recommend running through it with a few
552tough examples to see how it works.
553
554This wraps up handling of expressions. At this point, we can point the
555parser at an arbitrary token stream and build an expression from it,
556stopping at the first token that is not part of the expression. Next up
557we need to handle function definitions, etc.
558
559Parsing the Rest
560================
561
562The next thing missing is handling of function prototypes. In
563Kaleidoscope, these are used both for 'extern' function declarations as
564well as function body definitions. The code to do this is
565straight-forward and not very interesting (once you've survived
566expressions):
567
568.. code-block:: c++
569
570 /// prototype
571 /// ::= id '(' id* ')'
Lang Hames09bf4c12015-08-18 18:11:06 +0000572 static std::unique_ptr<PrototypeAST> ParsePrototype() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000573 if (CurTok != tok_identifier)
Lang Hames5d045a92016-03-25 17:41:26 +0000574 return LogErrorP("Expected function name in prototype");
Sean Silvad7fb3962012-12-05 00:26:32 +0000575
576 std::string FnName = IdentifierStr;
577 getNextToken();
578
579 if (CurTok != '(')
Lang Hames5d045a92016-03-25 17:41:26 +0000580 return LogErrorP("Expected '(' in prototype");
Sean Silvad7fb3962012-12-05 00:26:32 +0000581
582 // Read the list of argument names.
583 std::vector<std::string> ArgNames;
584 while (getNextToken() == tok_identifier)
585 ArgNames.push_back(IdentifierStr);
586 if (CurTok != ')')
Lang Hames5d045a92016-03-25 17:41:26 +0000587 return LogErrorP("Expected ')' in prototype");
Sean Silvad7fb3962012-12-05 00:26:32 +0000588
589 // success.
590 getNextToken(); // eat ')'.
591
Lang Hames09bf4c12015-08-18 18:11:06 +0000592 return llvm::make_unique<PrototypeAST>(FnName, std::move(ArgNames));
Sean Silvad7fb3962012-12-05 00:26:32 +0000593 }
594
595Given this, a function definition is very simple, just a prototype plus
596an expression to implement the body:
597
598.. code-block:: c++
599
600 /// definition ::= 'def' prototype expression
Lang Hames09bf4c12015-08-18 18:11:06 +0000601 static std::unique_ptr<FunctionAST> ParseDefinition() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000602 getNextToken(); // eat def.
Lang Hames09bf4c12015-08-18 18:11:06 +0000603 auto Proto = ParsePrototype();
604 if (!Proto) return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000605
Lang Hames09bf4c12015-08-18 18:11:06 +0000606 if (auto E = ParseExpression())
607 return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
608 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000609 }
610
611In addition, we support 'extern' to declare functions like 'sin' and
612'cos' as well as to support forward declaration of user functions. These
613'extern's are just prototypes with no body:
614
615.. code-block:: c++
616
617 /// external ::= 'extern' prototype
Lang Hames09bf4c12015-08-18 18:11:06 +0000618 static std::unique_ptr<PrototypeAST> ParseExtern() {
Sean Silvad7fb3962012-12-05 00:26:32 +0000619 getNextToken(); // eat extern.
620 return ParsePrototype();
621 }
622
623Finally, we'll also let the user type in arbitrary top-level expressions
624and evaluate them on the fly. We will handle this by defining anonymous
625nullary (zero argument) functions for them:
626
627.. code-block:: c++
628
629 /// toplevelexpr ::= expression
Lang Hames09bf4c12015-08-18 18:11:06 +0000630 static std::unique_ptr<FunctionAST> ParseTopLevelExpr() {
631 if (auto E = ParseExpression()) {
Sean Silvad7fb3962012-12-05 00:26:32 +0000632 // Make an anonymous proto.
Lang Hames09bf4c12015-08-18 18:11:06 +0000633 auto Proto = llvm::make_unique<PrototypeAST>("", std::vector<std::string>());
634 return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
Sean Silvad7fb3962012-12-05 00:26:32 +0000635 }
Lang Hames09bf4c12015-08-18 18:11:06 +0000636 return nullptr;
Sean Silvad7fb3962012-12-05 00:26:32 +0000637 }
638
639Now that we have all the pieces, let's build a little driver that will
640let us actually *execute* this code we've built!
641
642The Driver
643==========
644
645The driver for this simply invokes all of the parsing pieces with a
646top-level dispatch loop. There isn't much interesting here, so I'll just
Alex Denisov596e9792015-12-15 20:50:29 +0000647include the top-level loop. See `below <#full-code-listing>`_ for full code in the
Sean Silvad7fb3962012-12-05 00:26:32 +0000648"Top-Level Parsing" section.
649
650.. code-block:: c++
651
652 /// top ::= definition | external | expression | ';'
653 static void MainLoop() {
654 while (1) {
655 fprintf(stderr, "ready> ");
656 switch (CurTok) {
Lang Hames59b0da82015-08-19 18:15:58 +0000657 case tok_eof:
658 return;
659 case ';': // ignore top-level semicolons.
660 getNextToken();
661 break;
662 case tok_def:
663 HandleDefinition();
664 break;
665 case tok_extern:
666 HandleExtern();
667 break;
668 default:
669 HandleTopLevelExpression();
670 break;
Sean Silvad7fb3962012-12-05 00:26:32 +0000671 }
672 }
673 }
674
675The most interesting part of this is that we ignore top-level
676semicolons. Why is this, you ask? The basic reason is that if you type
677"4 + 5" at the command line, the parser doesn't know whether that is the
678end of what you will type or not. For example, on the next line you
679could type "def foo..." in which case 4+5 is the end of a top-level
680expression. Alternatively you could type "\* 6", which would continue
681the expression. Having top-level semicolons allows you to type "4+5;",
682and the parser will know you are done.
683
684Conclusions
685===========
686
687With just under 400 lines of commented code (240 lines of non-comment,
688non-blank code), we fully defined our minimal language, including a
689lexer, parser, and AST builder. With this done, the executable will
690validate Kaleidoscope code and tell us if it is grammatically invalid.
691For example, here is a sample interaction:
692
693.. code-block:: bash
694
695 $ ./a.out
696 ready> def foo(x y) x+foo(y, 4.0);
697 Parsed a function definition.
698 ready> def foo(x y) x+y y;
699 Parsed a function definition.
700 Parsed a top-level expr
701 ready> def foo(x y) x+y );
702 Parsed a function definition.
703 Error: unknown token when expecting an expression
704 ready> extern sin(a);
705 ready> Parsed an extern
706 ready> ^D
707 $
708
709There is a lot of room for extension here. You can define new AST nodes,
710extend the language in many ways, etc. In the `next
711installment <LangImpl3.html>`_, we will describe how to generate LLVM
712Intermediate Representation (IR) from the AST.
713
714Full Code Listing
715=================
716
717Here is the complete code listing for this and the previous chapter.
718Note that it is fully self-contained: you don't need LLVM or any
719external libraries at all for this. (Besides the C and C++ standard
720libraries, of course.) To build this, just compile with:
721
722.. code-block:: bash
723
724 # Compile
725 clang++ -g -O3 toy.cpp
726 # Run
727 ./a.out
728
729Here is the code:
730
Logan Chien855b17d2013-06-08 09:03:03 +0000731.. literalinclude:: ../../examples/Kaleidoscope/Chapter2/toy.cpp
732 :language: c++
Sean Silvad7fb3962012-12-05 00:26:32 +0000733
734`Next: Implementing Code Generation to LLVM IR <LangImpl3.html>`_
735