| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" | 
|  | 2 | "http://www.w3.org/TR/html4/strict.dtd"> | 
|  | 3 |  | 
|  | 4 | <html> | 
|  | 5 | <head> | 
|  | 6 | <title>Kaleidoscope: Implementing a Parser and AST</title> | 
|  | 7 | <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> | 
|  | 8 | <meta name="author" content="Chris Lattner"> | 
|  | 9 | <link rel="stylesheet" href="../llvm.css" type="text/css"> | 
|  | 10 | </head> | 
|  | 11 |  | 
|  | 12 | <body> | 
|  | 13 |  | 
|  | 14 | <div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div> | 
|  | 15 |  | 
|  | 16 | <div class="doc_author"> | 
|  | 17 | <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p> | 
|  | 18 | </div> | 
|  | 19 |  | 
|  | 20 | <!-- *********************************************************************** --> | 
|  | 21 | <div class="doc_section"><a name="intro">Part 2 Introduction</a></div> | 
|  | 22 | <!-- *********************************************************************** --> | 
|  | 23 |  | 
|  | 24 | <div class="doc_text"> | 
|  | 25 |  | 
|  | 26 | <p>Welcome to part 2 of the "<a href="index.html">Implementing a language with | 
|  | 27 | LLVM</a>" tutorial.  This chapter shows you how to use the <a | 
|  | 28 | href="LangImpl1.html">Lexer built in Chapter 1</a> to build a full <a | 
|  | 29 | href="http://en.wikipedia.org/wiki/Parsing">parser</a> for | 
|  | 30 | our Kaleidoscope language and build an <a | 
|  | 31 | href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax | 
|  | 32 | Tree</a> (AST).</p> | 
|  | 33 |  | 
|  | 34 | <p>The parser we will build uses a combination of <a | 
|  | 35 | href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent | 
|  | 36 | Parsing</a> and <a href= | 
|  | 37 | "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence | 
|  | 38 | Parsing</a> to parse the Kaleidoscope language (the later for binary expression | 
|  | 39 | and the former for everything else).  Before we get to parsing though, lets talk | 
|  | 40 | about the output of the parser: the Abstract Syntax Tree.</p> | 
|  | 41 |  | 
|  | 42 | </div> | 
|  | 43 |  | 
|  | 44 | <!-- *********************************************************************** --> | 
|  | 45 | <div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div> | 
|  | 46 | <!-- *********************************************************************** --> | 
|  | 47 |  | 
|  | 48 | <div class="doc_text"> | 
|  | 49 |  | 
|  | 50 | <p>The AST for a program captures its behavior in a way that it is easy for | 
|  | 51 | later stages of the compiler (e.g. code generation) to interpret.  We basically | 
|  | 52 | want one object for each construct in the language, and the AST should closely | 
|  | 53 | model the language.  In Kaleidoscope, we have expressions, a prototype, and a | 
|  | 54 | function object.  We'll start with expressions first:</p> | 
|  | 55 |  | 
|  | 56 | <div class="doc_code"> | 
|  | 57 | <pre> | 
|  | 58 | /// ExprAST - Base class for all expression nodes. | 
|  | 59 | class ExprAST { | 
|  | 60 | public: | 
|  | 61 | virtual ~ExprAST() {} | 
|  | 62 | }; | 
|  | 63 |  | 
|  | 64 | /// NumberExprAST - Expression class for numeric literals like "1.0". | 
|  | 65 | class NumberExprAST : public ExprAST { | 
|  | 66 | double Val; | 
|  | 67 | public: | 
| Chris Lattner | 61b4ec7 | 2007-10-23 04:27:44 +0000 | [diff] [blame] | 68 | explicit NumberExprAST(double val) : Val(val) {} | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 69 | }; | 
|  | 70 | </pre> | 
|  | 71 | </div> | 
|  | 72 |  | 
|  | 73 | <p>The code above shows the definition of the base ExprAST class and one | 
|  | 74 | subclass which we use for numeric literals.  The important thing about this is | 
|  | 75 | that the NumberExprAST class captures the numeric value of the literal in the | 
|  | 76 | class, so that later phases of the compiler can know what it is.</p> | 
|  | 77 |  | 
|  | 78 | <p>Right now we only create the AST,  so there are no useful accessor methods on | 
|  | 79 | them.  It would be very easy to add a virtual method to pretty print the code, | 
|  | 80 | for example.  Here are the other expression AST node definitions that we'll use | 
|  | 81 | in the basic form of the Kaleidoscope language. | 
|  | 82 | </p> | 
|  | 83 |  | 
|  | 84 | <div class="doc_code"> | 
|  | 85 | <pre> | 
|  | 86 | /// VariableExprAST - Expression class for referencing a variable, like "a". | 
|  | 87 | class VariableExprAST : public ExprAST { | 
|  | 88 | std::string Name; | 
|  | 89 | public: | 
| Chris Lattner | 61b4ec7 | 2007-10-23 04:27:44 +0000 | [diff] [blame] | 90 | explicit VariableExprAST(const std::string &name) : Name(name) {} | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 91 | }; | 
|  | 92 |  | 
|  | 93 | /// BinaryExprAST - Expression class for a binary operator. | 
|  | 94 | class BinaryExprAST : public ExprAST { | 
|  | 95 | char Op; | 
|  | 96 | ExprAST *LHS, *RHS; | 
|  | 97 | public: | 
|  | 98 | BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) | 
|  | 99 | : Op(op), LHS(lhs), RHS(rhs) {} | 
|  | 100 | }; | 
|  | 101 |  | 
|  | 102 | /// CallExprAST - Expression class for function calls. | 
|  | 103 | class CallExprAST : public ExprAST { | 
|  | 104 | std::string Callee; | 
|  | 105 | std::vector<ExprAST*> Args; | 
|  | 106 | public: | 
|  | 107 | CallExprAST(const std::string &callee, std::vector<ExprAST*> &args) | 
|  | 108 | : Callee(callee), Args(args) {} | 
|  | 109 | }; | 
|  | 110 | </pre> | 
|  | 111 | </div> | 
|  | 112 |  | 
|  | 113 | <p>This is all (intentially) rather straight-forward: variables capture the | 
|  | 114 | variable name, binary operators capture their opcode (e.g. '+'), and calls | 
|  | 115 | capture a function name and list of argument expressions.  One thing that is | 
|  | 116 | nice about our AST is that it captures the language features without talking | 
|  | 117 | about the syntax of the language.  Note that there is no discussion about | 
|  | 118 | precedence of binary operators, lexical structure etc.</p> | 
|  | 119 |  | 
|  | 120 | <p>For our basic language, these are all of the expression nodes we'll define. | 
| Owen Anderson | c2b2fc0 | 2007-10-22 06:48:28 +0000 | [diff] [blame] | 121 | Because it doesn't have conditional control flow, it isn't Turing-complete; | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 122 | we'll fix that in a later installment.  The two things we need next are a way | 
|  | 123 | to talk about the interface to a function, and a way to talk about functions | 
|  | 124 | themselves:</p> | 
|  | 125 |  | 
|  | 126 | <div class="doc_code"> | 
|  | 127 | <pre> | 
|  | 128 | /// PrototypeAST - This class represents the "prototype" for a function, | 
|  | 129 | /// which captures its argument names as well as if it is an operator. | 
|  | 130 | class PrototypeAST { | 
|  | 131 | std::string Name; | 
|  | 132 | std::vector<std::string> Args; | 
|  | 133 | public: | 
|  | 134 | PrototypeAST(const std::string &name, const std::vector<std::string> &args) | 
|  | 135 | : Name(name), Args(args) {} | 
|  | 136 | }; | 
|  | 137 |  | 
|  | 138 | /// FunctionAST - This class represents a function definition itself. | 
|  | 139 | class FunctionAST { | 
|  | 140 | PrototypeAST *Proto; | 
|  | 141 | ExprAST *Body; | 
|  | 142 | public: | 
|  | 143 | FunctionAST(PrototypeAST *proto, ExprAST *body) | 
|  | 144 | : Proto(proto), Body(body) {} | 
|  | 145 | }; | 
|  | 146 | </pre> | 
|  | 147 | </div> | 
|  | 148 |  | 
|  | 149 | <p>In Kaleidoscope, functions are typed with just a count of their arguments. | 
|  | 150 | Since all values are double precision floating point, this fact doesn't need to | 
|  | 151 | be captured anywhere.  In a more aggressive and realistic language, the | 
|  | 152 | "ExprAST" class would probably have a type field.</p> | 
|  | 153 |  | 
|  | 154 | <p>With this scaffolding, we can now talk about parsing expressions and function | 
|  | 155 | bodies in Kaleidoscope.</p> | 
|  | 156 |  | 
|  | 157 | </div> | 
|  | 158 |  | 
|  | 159 | <!-- *********************************************************************** --> | 
|  | 160 | <div class="doc_section"><a name="parserbasics">Parser Basics</a></div> | 
|  | 161 | <!-- *********************************************************************** --> | 
|  | 162 |  | 
|  | 163 | <div class="doc_text"> | 
|  | 164 |  | 
|  | 165 | <p>Now that we have an AST to build, we need to define the parser code to build | 
|  | 166 | it.  The idea here is that we want to parse something like "x+y" (which is | 
|  | 167 | returned as three tokens by the lexer) into an AST that could be generated with | 
|  | 168 | calls like this:</p> | 
|  | 169 |  | 
|  | 170 | <div class="doc_code"> | 
|  | 171 | <pre> | 
|  | 172 | ExprAST *X = new VariableExprAST("x"); | 
|  | 173 | ExprAST *Y = new VariableExprAST("y"); | 
|  | 174 | ExprAST *Result = new BinaryExprAST('+', X, Y); | 
|  | 175 | </pre> | 
|  | 176 | </div> | 
|  | 177 |  | 
|  | 178 | <p>In order to do this, we'll start by defining some basic helper routines:</p> | 
|  | 179 |  | 
|  | 180 | <div class="doc_code"> | 
|  | 181 | <pre> | 
|  | 182 | /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current | 
|  | 183 | /// token the parser it looking at.  getNextToken reads another token from the | 
|  | 184 | /// lexer and updates CurTok with its results. | 
|  | 185 | static int CurTok; | 
|  | 186 | static int getNextToken() { | 
|  | 187 | return CurTok = gettok(); | 
|  | 188 | } | 
|  | 189 | </pre> | 
|  | 190 | </div> | 
|  | 191 |  | 
|  | 192 | <p> | 
|  | 193 | This implements a simple token buffer around the lexer.  This allows | 
|  | 194 | us to look one token ahead at what the lexer is returning.  Every function in | 
| Chris Lattner | 991b6b9 | 2007-10-25 18:05:29 +0000 | [diff] [blame] | 195 | our parser will assume that CurTok is the current token that needs to be | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 196 | parsed.</p> | 
|  | 197 |  | 
| Chris Lattner | ce2c3a4 | 2007-10-22 16:44:31 +0000 | [diff] [blame] | 198 | <p>Again, we define these with global variables; it would be better design to | 
|  | 199 | wrap the entire parser in a class and use instance variables for these. | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 200 | </p> | 
|  | 201 |  | 
|  | 202 | <div class="doc_code"> | 
|  | 203 | <pre> | 
|  | 204 |  | 
|  | 205 | /// Error* - These are little helper functions for error handling. | 
|  | 206 | ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;} | 
|  | 207 | PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; } | 
|  | 208 | FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; } | 
|  | 209 | </pre> | 
|  | 210 | </div> | 
|  | 211 |  | 
|  | 212 | <p> | 
|  | 213 | The <tt>Error</tt> routines are simple helper routines that our parser will use | 
|  | 214 | to handle errors.  The error recovery in our parser will not be the best and | 
| Duncan Sands | d6f131b | 2007-11-05 16:04:58 +0000 | [diff] [blame] | 215 | is not particular user-friendly, but it will be enough for our tutorial.  These | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 216 | routines make it easier to handle errors in routines that have various return | 
|  | 217 | types: they always return null.</p> | 
|  | 218 |  | 
|  | 219 | <p>With these basic helper functions implemented, we can implement the first | 
|  | 220 | piece of our grammar: we'll start with numeric literals.</p> | 
|  | 221 |  | 
|  | 222 | </div> | 
|  | 223 |  | 
|  | 224 | <!-- *********************************************************************** --> | 
|  | 225 | <div class="doc_section"><a name="parserprimexprs">Basic Expression | 
|  | 226 | Parsing</a></div> | 
|  | 227 | <!-- *********************************************************************** --> | 
|  | 228 |  | 
|  | 229 | <div class="doc_text"> | 
|  | 230 |  | 
|  | 231 | <p>We start with numeric literals, because they are the simplest to process. | 
|  | 232 | For each production in our grammar, we'll define a function which parses that | 
|  | 233 | production.  For numeric literals, we have: | 
|  | 234 | </p> | 
|  | 235 |  | 
|  | 236 | <div class="doc_code"> | 
|  | 237 | <pre> | 
|  | 238 | /// numberexpr ::= number | 
|  | 239 | static ExprAST *ParseNumberExpr() { | 
|  | 240 | ExprAST *Result = new NumberExprAST(NumVal); | 
|  | 241 | getNextToken(); // consume the number | 
|  | 242 | return Result; | 
|  | 243 | } | 
|  | 244 | </pre> | 
|  | 245 | </div> | 
|  | 246 |  | 
|  | 247 | <p>This routine is very simple: it expects to be called when the current token | 
|  | 248 | is a <tt>tok_number</tt> token.  It takes the current number value, creates | 
|  | 249 | a <tt>NumberExprAST</tt> node, advances the lexer to the next token, then | 
|  | 250 | returns.</p> | 
|  | 251 |  | 
|  | 252 | <p>There are some interesting aspects of this.  The most important one is that | 
|  | 253 | this routine eats all of the tokens that correspond to the production, and | 
|  | 254 | returns the lexer buffer with the next token (which is not part of the grammar | 
|  | 255 | production) ready to go.  This is a fairly standard way to go for recursive | 
|  | 256 | descent parsers.  For a better example, the parenthesis operator is defined like | 
|  | 257 | this:</p> | 
|  | 258 |  | 
|  | 259 | <div class="doc_code"> | 
|  | 260 | <pre> | 
|  | 261 | /// parenexpr ::= '(' expression ')' | 
|  | 262 | static ExprAST *ParseParenExpr() { | 
|  | 263 | getNextToken();  // eat (. | 
|  | 264 | ExprAST *V = ParseExpression(); | 
|  | 265 | if (!V) return 0; | 
|  | 266 |  | 
|  | 267 | if (CurTok != ')') | 
|  | 268 | return Error("expected ')'"); | 
|  | 269 | getNextToken();  // eat ). | 
|  | 270 | return V; | 
|  | 271 | } | 
|  | 272 | </pre> | 
|  | 273 | </div> | 
|  | 274 |  | 
|  | 275 | <p>This function illustrates a number of interesting things about the parser: | 
|  | 276 | 1) it shows how we use the Error routines.  When called, this function expects | 
|  | 277 | that the current token is a '(' token, but after parsing the subexpression, it | 
|  | 278 | is possible that there is not a ')' waiting.  For example, if the user types in | 
|  | 279 | "(4 x" instead of "(4)", the parser should emit an error.  Because errors can | 
|  | 280 | occur, the parser needs a way to indicate that they happened: in our parser, we | 
|  | 281 | return null on an error.</p> | 
|  | 282 |  | 
|  | 283 | <p>Another interesting aspect of this function is that it uses recursion by | 
| Owen Anderson | c2b2fc0 | 2007-10-22 06:48:28 +0000 | [diff] [blame] | 284 | calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 285 | <tt>ParseParenExpr</tt>).  This is powerful because it allows us to handle | 
|  | 286 | recursive grammars, and keeps each production very simple.  Note that | 
| Duncan Sands | d6f131b | 2007-11-05 16:04:58 +0000 | [diff] [blame] | 287 | parentheses do not cause construction of AST nodes themselves.  While we could | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 288 | do this, the most important role of parens are to guide the parser and provide | 
|  | 289 | grouping.  Once the parser constructs the AST, parens are not needed.</p> | 
|  | 290 |  | 
|  | 291 | <p>The next simple production is for handling variable references and function | 
|  | 292 | calls:</p> | 
|  | 293 |  | 
|  | 294 | <div class="doc_code"> | 
|  | 295 | <pre> | 
|  | 296 | /// identifierexpr | 
| Chris Lattner | 9b2f777 | 2007-11-05 17:54:34 +0000 | [diff] [blame^] | 297 | ///   ::= identifier | 
|  | 298 | ///   ::= identifier '(' expression* ')' | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 299 | static ExprAST *ParseIdentifierExpr() { | 
|  | 300 | std::string IdName = IdentifierStr; | 
|  | 301 |  | 
| Chris Lattner | 9b2f777 | 2007-11-05 17:54:34 +0000 | [diff] [blame^] | 302 | getNextToken();  // eat identifier. | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 303 |  | 
|  | 304 | if (CurTok != '(') // Simple variable ref. | 
|  | 305 | return new VariableExprAST(IdName); | 
|  | 306 |  | 
|  | 307 | // Call. | 
|  | 308 | getNextToken();  // eat ( | 
|  | 309 | std::vector<ExprAST*> Args; | 
|  | 310 | while (1) { | 
|  | 311 | ExprAST *Arg = ParseExpression(); | 
|  | 312 | if (!Arg) return 0; | 
|  | 313 | Args.push_back(Arg); | 
|  | 314 |  | 
|  | 315 | if (CurTok == ')') break; | 
|  | 316 |  | 
|  | 317 | if (CurTok != ',') | 
|  | 318 | return Error("Expected ')'"); | 
|  | 319 | getNextToken(); | 
|  | 320 | } | 
|  | 321 |  | 
|  | 322 | // Eat the ')'. | 
|  | 323 | getNextToken(); | 
|  | 324 |  | 
|  | 325 | return new CallExprAST(IdName, Args); | 
|  | 326 | } | 
|  | 327 | </pre> | 
|  | 328 | </div> | 
|  | 329 |  | 
|  | 330 | <p>This routine follows the same style as the other routines (it expects to be | 
|  | 331 | called if the current token is a <tt>tok_identifier</tt> token).  It also has | 
|  | 332 | recursion and error handling.  One interesting aspect of this is that it uses | 
|  | 333 | <em>look-ahead</em> to determine if the current identifier is a stand alone | 
|  | 334 | variable reference or if it is a function call expression.  It handles this by | 
|  | 335 | checking to see if the token after the identifier is a '(' token, and constructs | 
|  | 336 | either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate. | 
|  | 337 | </p> | 
|  | 338 |  | 
|  | 339 | <p>Now that we have all of our simple expression parsing logic in place, we can | 
|  | 340 | define a helper function to wrap them up in a class.  We call this class of | 
|  | 341 | expressions "primary" expressions, for reasons that will become more clear | 
|  | 342 | later.  In order to parse a primary expression, we need to determine what sort | 
|  | 343 | of expression it is:</p> | 
|  | 344 |  | 
|  | 345 | <div class="doc_code"> | 
|  | 346 | <pre> | 
|  | 347 | /// primary | 
|  | 348 | ///   ::= identifierexpr | 
|  | 349 | ///   ::= numberexpr | 
|  | 350 | ///   ::= parenexpr | 
|  | 351 | static ExprAST *ParsePrimary() { | 
|  | 352 | switch (CurTok) { | 
|  | 353 | default: return Error("unknown token when expecting an expression"); | 
|  | 354 | case tok_identifier: return ParseIdentifierExpr(); | 
|  | 355 | case tok_number:     return ParseNumberExpr(); | 
|  | 356 | case '(':            return ParseParenExpr(); | 
|  | 357 | } | 
|  | 358 | } | 
|  | 359 | </pre> | 
|  | 360 | </div> | 
|  | 361 |  | 
|  | 362 | <p>Now that you see the definition of this function, it makes it more obvious | 
|  | 363 | why we can assume the state of CurTok in the various functions.  This uses | 
|  | 364 | look-ahead to determine which sort of expression is being inspected, and parses | 
|  | 365 | it with a function call.</p> | 
|  | 366 |  | 
|  | 367 | <p>Now that basic expressions are handled, we need to handle binary expressions, | 
|  | 368 | which are a bit more complex.</p> | 
|  | 369 |  | 
|  | 370 | </div> | 
|  | 371 |  | 
|  | 372 | <!-- *********************************************************************** --> | 
|  | 373 | <div class="doc_section"><a name="parserbinops">Binary Expression | 
|  | 374 | Parsing</a></div> | 
|  | 375 | <!-- *********************************************************************** --> | 
|  | 376 |  | 
|  | 377 | <div class="doc_text"> | 
|  | 378 |  | 
|  | 379 | <p>Binary expressions are significantly harder to parse because they are often | 
|  | 380 | ambiguous.  For example, when given the string "x+y*z", the parser can choose | 
|  | 381 | to parse it as either "(x+y)*z" or "x+(y*z)".  With common definitions from | 
|  | 382 | mathematics, we expect the later parse, because "*" (multiplication) has | 
|  | 383 | higher <em>precedence</em> than "+" (addition).</p> | 
|  | 384 |  | 
|  | 385 | <p>There are many ways to handle this, but an elegant and efficient way is to | 
|  | 386 | use <a href= | 
|  | 387 | "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence | 
|  | 388 | Parsing</a>.  This parsing technique uses the precedence of binary operators to | 
|  | 389 | guide recursion.  To start with, we need a table of precedences:</p> | 
|  | 390 |  | 
|  | 391 | <div class="doc_code"> | 
|  | 392 | <pre> | 
|  | 393 | /// BinopPrecedence - This holds the precedence for each binary operator that is | 
|  | 394 | /// defined. | 
|  | 395 | static std::map<char, int> BinopPrecedence; | 
|  | 396 |  | 
|  | 397 | /// GetTokPrecedence - Get the precedence of the pending binary operator token. | 
|  | 398 | static int GetTokPrecedence() { | 
|  | 399 | if (!isascii(CurTok)) | 
|  | 400 | return -1; | 
|  | 401 |  | 
|  | 402 | // Make sure it's a declared binop. | 
|  | 403 | int TokPrec = BinopPrecedence[CurTok]; | 
|  | 404 | if (TokPrec <= 0) return -1; | 
|  | 405 | return TokPrec; | 
|  | 406 | } | 
|  | 407 |  | 
|  | 408 | int main() { | 
|  | 409 | // Install standard binary operators. | 
|  | 410 | // 1 is lowest precedence. | 
|  | 411 | BinopPrecedence['<'] = 10; | 
|  | 412 | BinopPrecedence['+'] = 20; | 
|  | 413 | BinopPrecedence['-'] = 20; | 
|  | 414 | BinopPrecedence['*'] = 40;  // highest. | 
|  | 415 | ... | 
|  | 416 | } | 
|  | 417 | </pre> | 
|  | 418 | </div> | 
|  | 419 |  | 
|  | 420 | <p>For the basic form of Kaleidoscope, we will only support 4 binary operators | 
|  | 421 | (this can obviously be extended by you, the reader).  The | 
|  | 422 | <tt>GetTokPrecedence</tt> function returns the precedence for the current token, | 
|  | 423 | or -1 if the token is not a binary operator.  Having a map makes it easy to add | 
|  | 424 | new operators and makes it clear that the algorithm doesn't depend on the | 
|  | 425 | specific operators involved, but it would be easy enough to eliminate the map | 
|  | 426 | and do the comparisons in the <tt>GetTokPrecedence</tt> function.</p> | 
|  | 427 |  | 
|  | 428 | <p>With the helper above defined, we can now start parsing binary expressions. | 
|  | 429 | The basic idea of operator precedence parsing is to break down an expression | 
|  | 430 | with potentially ambiguous binary operators into pieces.  Consider for example | 
|  | 431 | the expression "a+b+(c+d)*e*f+g".  Operator precedence parsing considers this | 
|  | 432 | as a stream of primary expressions separated by binary operators.  As such, | 
|  | 433 | it will first parse the leading primary expression "a", then it will see the | 
|  | 434 | pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g].  Note that because parentheses | 
| Duncan Sands | d6f131b | 2007-11-05 16:04:58 +0000 | [diff] [blame] | 435 | are primary expressions, the binary expression parser doesn't need to worry | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 436 | about nested subexpressions like (c+d) at all. | 
|  | 437 | </p> | 
|  | 438 |  | 
|  | 439 | <p> | 
|  | 440 | To start, an expression is a primary expression potentially followed by a | 
|  | 441 | sequence of [binop,primaryexpr] pairs:</p> | 
|  | 442 |  | 
|  | 443 | <div class="doc_code"> | 
|  | 444 | <pre> | 
|  | 445 | /// expression | 
|  | 446 | ///   ::= primary binoprhs | 
|  | 447 | /// | 
|  | 448 | static ExprAST *ParseExpression() { | 
|  | 449 | ExprAST *LHS = ParsePrimary(); | 
|  | 450 | if (!LHS) return 0; | 
|  | 451 |  | 
|  | 452 | return ParseBinOpRHS(0, LHS); | 
|  | 453 | } | 
|  | 454 | </pre> | 
|  | 455 | </div> | 
|  | 456 |  | 
|  | 457 | <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for | 
|  | 458 | us.  It takes a precedence and a pointer to an expression for the part parsed | 
|  | 459 | so far.   Note that "x" is a perfectly valid expression: As such, "binoprhs" is | 
|  | 460 | allowed to be empty, in which case it returns the expression that is passed into | 
|  | 461 | it. In our example above, the code passes the expression for "a" into | 
|  | 462 | <tt>ParseBinOpRHS</tt> and the current token is "+".</p> | 
|  | 463 |  | 
|  | 464 | <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em> | 
|  | 465 | minimal operator precedence</em> that the function is allowed to eat.  For | 
|  | 466 | example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is | 
|  | 467 | passed in a precedence of 40, it will not consume any tokens (because the | 
|  | 468 | precedence of '+' is only 20).  With this in mind, <tt>ParseBinOpRHS</tt> starts | 
|  | 469 | with:</p> | 
|  | 470 |  | 
|  | 471 | <div class="doc_code"> | 
|  | 472 | <pre> | 
|  | 473 | /// binoprhs | 
|  | 474 | ///   ::= ('+' primary)* | 
|  | 475 | static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) { | 
|  | 476 | // If this is a binop, find its precedence. | 
|  | 477 | while (1) { | 
|  | 478 | int TokPrec = GetTokPrecedence(); | 
|  | 479 |  | 
|  | 480 | // If this is a binop that binds at least as tightly as the current binop, | 
|  | 481 | // consume it, otherwise we are done. | 
|  | 482 | if (TokPrec < ExprPrec) | 
|  | 483 | return LHS; | 
|  | 484 | </pre> | 
|  | 485 | </div> | 
|  | 486 |  | 
|  | 487 | <p>This code gets the precedence of the current token and checks to see if if is | 
|  | 488 | too low.  Because we defined invalid tokens to have a precedence of -1, this | 
|  | 489 | check implicitly knows that the pair-stream ends when the token stream runs out | 
|  | 490 | of binary operators.  If this check succeeds, we know that the token is a binary | 
|  | 491 | operator and that it will be included in this expression:</p> | 
|  | 492 |  | 
|  | 493 | <div class="doc_code"> | 
|  | 494 | <pre> | 
|  | 495 | // Okay, we know this is a binop. | 
|  | 496 | int BinOp = CurTok; | 
|  | 497 | getNextToken();  // eat binop | 
|  | 498 |  | 
|  | 499 | // Parse the primary expression after the binary operator. | 
|  | 500 | ExprAST *RHS = ParsePrimary(); | 
|  | 501 | if (!RHS) return 0; | 
|  | 502 | </pre> | 
|  | 503 | </div> | 
|  | 504 |  | 
|  | 505 | <p>As such, this code eats (and remembers) the binary operator and then parses | 
|  | 506 | the following primary expression.  This builds up the whole pair, the first of | 
|  | 507 | which is [+, b] for the running example.</p> | 
|  | 508 |  | 
|  | 509 | <p>Now that we parsed the left-hand side of an expression and one pair of the | 
|  | 510 | RHS sequence, we have to decide which way the expression associates.  In | 
|  | 511 | particular, we could have "(a+b) binop unparsed"  or "a + (b binop unparsed)". | 
|  | 512 | To determine this, we look ahead at "binop" to determine its precedence and | 
|  | 513 | compare it to BinOp's precedence (which is '+' in this case):</p> | 
|  | 514 |  | 
|  | 515 | <div class="doc_code"> | 
|  | 516 | <pre> | 
|  | 517 | // If BinOp binds less tightly with RHS than the operator after RHS, let | 
|  | 518 | // the pending operator take RHS as its LHS. | 
|  | 519 | int NextPrec = GetTokPrecedence(); | 
|  | 520 | if (TokPrec < NextPrec) { | 
|  | 521 | </pre> | 
|  | 522 | </div> | 
|  | 523 |  | 
|  | 524 | <p>If the precedence of the binop to the right of "RHS" is lower or equal to the | 
|  | 525 | precedence of our current operator, then we know that the parentheses associate | 
|  | 526 | as "(a+b) binop ...".  In our example, since the next operator is "+" and so is | 
|  | 527 | our current one, we know that they have the same precedence.  In this case we'll | 
|  | 528 | create the AST node for "a+b", and then continue parsing:</p> | 
|  | 529 |  | 
|  | 530 | <div class="doc_code"> | 
|  | 531 | <pre> | 
|  | 532 | ... if body omitted ... | 
|  | 533 | } | 
|  | 534 |  | 
|  | 535 | // Merge LHS/RHS. | 
|  | 536 | LHS = new BinaryExprAST(BinOp, LHS, RHS); | 
|  | 537 | }  // loop around to the top of the while loop. | 
|  | 538 | } | 
|  | 539 | </pre> | 
|  | 540 | </div> | 
|  | 541 |  | 
|  | 542 | <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next | 
|  | 543 | iteration of the loop, with "+" as the current token.  The code above will eat | 
|  | 544 | and remember it and parse "(c+d)" as the primary expression, which makes the | 
|  | 545 | current pair be [+, (c+d)].  It will then enter the 'if' above with "*" as the | 
|  | 546 | binop to the right of the primary.  In this case, the precedence of "*" is | 
|  | 547 | higher than the precedence of "+" so the if condition will be entered.</p> | 
|  | 548 |  | 
|  | 549 | <p>The critical question left here is "how can the if condition parse the right | 
|  | 550 | hand side in full"?  In particular, to build the AST correctly for our example, | 
|  | 551 | it needs to get all of "(c+d)*e*f" as the RHS expression variable.  The code to | 
|  | 552 | do this is surprisingly simple (code from the above two blocks duplicated for | 
|  | 553 | context):</p> | 
|  | 554 |  | 
|  | 555 | <div class="doc_code"> | 
|  | 556 | <pre> | 
|  | 557 | // If BinOp binds less tightly with RHS than the operator after RHS, let | 
|  | 558 | // the pending operator take RHS as its LHS. | 
|  | 559 | int NextPrec = GetTokPrecedence(); | 
|  | 560 | if (TokPrec < NextPrec) { | 
|  | 561 | RHS = ParseBinOpRHS(TokPrec+1, RHS); | 
|  | 562 | if (RHS == 0) return 0; | 
|  | 563 | } | 
|  | 564 | // Merge LHS/RHS. | 
|  | 565 | LHS = new BinaryExprAST(BinOp, LHS, RHS); | 
|  | 566 | }  // loop around to the top of the while loop. | 
|  | 567 | } | 
|  | 568 | </pre> | 
|  | 569 | </div> | 
|  | 570 |  | 
|  | 571 | <p>At this point, we know that the binary operator to the RHS of our primary | 
|  | 572 | has higher precedence than the binop we are currently parsing.  As such, we know | 
|  | 573 | that any sequence of pairs whose operators are all higher precedence than "+" | 
|  | 574 | should be parsed together and returned as "RHS".  To do this, we recursively | 
|  | 575 | invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum | 
|  | 576 | precedence required for it to continue.  In our example above, this will cause | 
|  | 577 | it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS | 
|  | 578 | of the '+' expression.</p> | 
|  | 579 |  | 
|  | 580 | <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed. | 
|  | 581 | and added to the AST.  With this little bit of code (14 non-trivial lines), we | 
|  | 582 | correctly handle fully general binary expression parsing in a very elegant way. | 
|  | 583 | </p> | 
|  | 584 |  | 
|  | 585 | <p>This wraps up handling of expressions.  At this point, we can point the | 
|  | 586 | parser at an arbitrary token stream and build an expression from them, stopping | 
|  | 587 | at the first token that is not part of the expression.  Next up we need to | 
|  | 588 | handle function definitions etc.</p> | 
|  | 589 |  | 
|  | 590 | </div> | 
|  | 591 |  | 
|  | 592 | <!-- *********************************************************************** --> | 
|  | 593 | <div class="doc_section"><a name="parsertop">Parsing the Rest</a></div> | 
|  | 594 | <!-- *********************************************************************** --> | 
|  | 595 |  | 
|  | 596 | <div class="doc_text"> | 
|  | 597 |  | 
|  | 598 | <p> | 
|  | 599 | The first basic thing missing is that of function prototypes.  In Kaleidoscope, | 
|  | 600 | these are used both for 'extern' function declarations as well as function body | 
|  | 601 | definitions.  The code to do this is straight-forward and not very interesting | 
|  | 602 | (once you've survived expressions): | 
|  | 603 | </p> | 
|  | 604 |  | 
|  | 605 | <div class="doc_code"> | 
|  | 606 | <pre> | 
|  | 607 | /// prototype | 
|  | 608 | ///   ::= id '(' id* ')' | 
|  | 609 | static PrototypeAST *ParsePrototype() { | 
|  | 610 | if (CurTok != tok_identifier) | 
|  | 611 | return ErrorP("Expected function name in prototype"); | 
|  | 612 |  | 
|  | 613 | std::string FnName = IdentifierStr; | 
|  | 614 | getNextToken(); | 
|  | 615 |  | 
|  | 616 | if (CurTok != '(') | 
|  | 617 | return ErrorP("Expected '(' in prototype"); | 
|  | 618 |  | 
|  | 619 | std::vector<std::string> ArgNames; | 
|  | 620 | while (getNextToken() == tok_identifier) | 
|  | 621 | ArgNames.push_back(IdentifierStr); | 
|  | 622 | if (CurTok != ')') | 
|  | 623 | return ErrorP("Expected ')' in prototype"); | 
|  | 624 |  | 
|  | 625 | // success. | 
|  | 626 | getNextToken();  // eat ')'. | 
|  | 627 |  | 
|  | 628 | return new PrototypeAST(FnName, ArgNames); | 
|  | 629 | } | 
|  | 630 | </pre> | 
|  | 631 | </div> | 
|  | 632 |  | 
|  | 633 | <p>Given this, a function definition is very simple, just a prototype plus | 
| Duncan Sands | d6f131b | 2007-11-05 16:04:58 +0000 | [diff] [blame] | 634 | an expression to implement the body:</p> | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 635 |  | 
|  | 636 | <div class="doc_code"> | 
|  | 637 | <pre> | 
|  | 638 | /// definition ::= 'def' prototype expression | 
|  | 639 | static FunctionAST *ParseDefinition() { | 
|  | 640 | getNextToken();  // eat def. | 
|  | 641 | PrototypeAST *Proto = ParsePrototype(); | 
|  | 642 | if (Proto == 0) return 0; | 
|  | 643 |  | 
|  | 644 | if (ExprAST *E = ParseExpression()) | 
|  | 645 | return new FunctionAST(Proto, E); | 
|  | 646 | return 0; | 
|  | 647 | } | 
|  | 648 | </pre> | 
|  | 649 | </div> | 
|  | 650 |  | 
|  | 651 | <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as | 
|  | 652 | well as to support forward declaration of user functions.  'externs' are just | 
|  | 653 | prototypes with no body:</p> | 
|  | 654 |  | 
|  | 655 | <div class="doc_code"> | 
|  | 656 | <pre> | 
|  | 657 | /// external ::= 'extern' prototype | 
|  | 658 | static PrototypeAST *ParseExtern() { | 
|  | 659 | getNextToken();  // eat extern. | 
|  | 660 | return ParsePrototype(); | 
|  | 661 | } | 
|  | 662 | </pre> | 
|  | 663 | </div> | 
|  | 664 |  | 
|  | 665 | <p>Finally, we'll also let the user type in arbitrary top-level expressions and | 
|  | 666 | evaluate them on the fly.  We will handle this by defining anonymous nullary | 
|  | 667 | (zero argument) functions for them:</p> | 
|  | 668 |  | 
|  | 669 | <div class="doc_code"> | 
|  | 670 | <pre> | 
|  | 671 | /// toplevelexpr ::= expression | 
|  | 672 | static FunctionAST *ParseTopLevelExpr() { | 
|  | 673 | if (ExprAST *E = ParseExpression()) { | 
|  | 674 | // Make an anonymous proto. | 
|  | 675 | PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>()); | 
|  | 676 | return new FunctionAST(Proto, E); | 
|  | 677 | } | 
|  | 678 | return 0; | 
|  | 679 | } | 
|  | 680 | </pre> | 
|  | 681 | </div> | 
|  | 682 |  | 
|  | 683 | <p>Now that we have all the pieces, lets build a little driver that will let us | 
|  | 684 | actually <em>execute</em> this code we've built!</p> | 
|  | 685 |  | 
|  | 686 | </div> | 
|  | 687 |  | 
|  | 688 | <!-- *********************************************************************** --> | 
|  | 689 | <div class="doc_section"><a name="driver">The Driver</a></div> | 
|  | 690 | <!-- *********************************************************************** --> | 
|  | 691 |  | 
|  | 692 | <div class="doc_text"> | 
|  | 693 |  | 
|  | 694 | <p>The driver for this simply invokes all of the parsing pieces with a top-level | 
|  | 695 | dispatch loop.  There isn't much interesting here, so I'll just include the | 
|  | 696 | top-level loop.  See <a href="#code">below</a> for full code in the "Top-Level | 
|  | 697 | Parsing" section.</p> | 
|  | 698 |  | 
|  | 699 | <div class="doc_code"> | 
|  | 700 | <pre> | 
|  | 701 | /// top ::= definition | external | expression | ';' | 
|  | 702 | static void MainLoop() { | 
|  | 703 | while (1) { | 
|  | 704 | fprintf(stderr, "ready> "); | 
|  | 705 | switch (CurTok) { | 
|  | 706 | case tok_eof:    return; | 
|  | 707 | case ';':        getNextToken(); break;  // ignore top level semicolons. | 
|  | 708 | case tok_def:    HandleDefinition(); break; | 
|  | 709 | case tok_extern: HandleExtern(); break; | 
|  | 710 | default:         HandleTopLevelExpression(); break; | 
|  | 711 | } | 
|  | 712 | } | 
|  | 713 | } | 
|  | 714 | </pre> | 
|  | 715 | </div> | 
|  | 716 |  | 
|  | 717 | <p>The most interesting part of this is that we ignore top-level semi colons. | 
| Owen Anderson | c2b2fc0 | 2007-10-22 06:48:28 +0000 | [diff] [blame] | 718 | Why is this, you ask?  The basic reason is that if you type "4 + 5" at the | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 719 | command line, the parser doesn't know that that is the end of what you will | 
|  | 720 | type.  For example, on the next line you could type "def foo..." in which case | 
|  | 721 | 4+5 is the end of a top-level expression.  Alternatively you could type "* 6", | 
|  | 722 | which would continue the expression.  Having top-level semicolons allows you to | 
|  | 723 | type "4+5;" and the parser will know you are done.</p> | 
|  | 724 |  | 
|  | 725 | </div> | 
|  | 726 |  | 
|  | 727 | <!-- *********************************************************************** --> | 
|  | 728 | <div class="doc_section"><a name="code">Conclusions and the Full Code</a></div> | 
|  | 729 | <!-- *********************************************************************** --> | 
|  | 730 |  | 
|  | 731 | <div class="doc_text"> | 
|  | 732 |  | 
|  | 733 | <p>With just under 400 lines of commented code, we fully defined our minimal | 
|  | 734 | language, including a lexer, parser and AST builder.  With this done, the | 
|  | 735 | executable will validate code and tell us if it is gramatically invalid.  For | 
|  | 736 | example, here is a sample interaction:</p> | 
|  | 737 |  | 
|  | 738 | <div class="doc_code"> | 
|  | 739 | <pre> | 
|  | 740 | $ ./a.out | 
| Chris Lattner | a02ab55 | 2007-10-23 06:23:57 +0000 | [diff] [blame] | 741 | ready> def foo(x y) x+foo(y, 4.0); | 
| Chris Lattner | 3a39ad6 | 2007-11-05 17:38:34 +0000 | [diff] [blame] | 742 | ready> Parsed a function definition. | 
| Chris Lattner | a02ab55 | 2007-10-23 06:23:57 +0000 | [diff] [blame] | 743 | ready> def foo(x y) x+y y; | 
| Chris Lattner | 3a39ad6 | 2007-11-05 17:38:34 +0000 | [diff] [blame] | 744 | ready> Parsed a function definition. | 
| Chris Lattner | a02ab55 | 2007-10-23 06:23:57 +0000 | [diff] [blame] | 745 | ready> Parsed a top-level expr | 
|  | 746 | ready> def foo(x y) x+y ); | 
| Chris Lattner | 3a39ad6 | 2007-11-05 17:38:34 +0000 | [diff] [blame] | 747 | ready> Parsed a function definition. | 
| Chris Lattner | a02ab55 | 2007-10-23 06:23:57 +0000 | [diff] [blame] | 748 | ready> Error: unknown token when expecting an expression | 
|  | 749 | ready> extern sin(a); | 
|  | 750 | ready> Parsed an extern | 
|  | 751 | ready> ^D | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 752 | $ | 
|  | 753 | </pre> | 
|  | 754 | </div> | 
|  | 755 |  | 
| Chris Lattner | 61353b4 | 2007-10-23 05:43:01 +0000 | [diff] [blame] | 756 | <p>There is a lot of room for extension here.  You can define new AST nodes, | 
|  | 757 | extend the language in many ways, etc.  In the <a href="LangImpl3.html">next | 
|  | 758 | installment</a>, we will describe how to generate LLVM IR from the AST.</p> | 
|  | 759 |  | 
|  | 760 | </div> | 
|  | 761 |  | 
|  | 762 | <!-- *********************************************************************** --> | 
|  | 763 | <div class="doc_section"><a name="code">Full Code Listing</a></div> | 
|  | 764 | <!-- *********************************************************************** --> | 
|  | 765 |  | 
|  | 766 | <div class="doc_text"> | 
|  | 767 |  | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 768 | <p> | 
| Chris Lattner | 61353b4 | 2007-10-23 05:43:01 +0000 | [diff] [blame] | 769 | Here is the complete code listing for this and the previous chapter. | 
|  | 770 | Note that it is fully self-contained: you don't need LLVM or any external | 
|  | 771 | libraries at all for this (other than the C and C++ standard libraries of | 
|  | 772 | course).  To build this, just compile with:</p> | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 773 |  | 
|  | 774 | <div class="doc_code"> | 
|  | 775 | <pre> | 
| Chris Lattner | 61353b4 | 2007-10-23 05:43:01 +0000 | [diff] [blame] | 776 | # Compile | 
|  | 777 | g++ -g toy.cpp | 
|  | 778 | # Run | 
|  | 779 | ./a.out | 
|  | 780 | </pre> | 
|  | 781 | </div> | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 782 |  | 
| Chris Lattner | 61353b4 | 2007-10-23 05:43:01 +0000 | [diff] [blame] | 783 | <p>Here is the code:</p> | 
|  | 784 |  | 
|  | 785 | <div class="doc_code"> | 
|  | 786 | <pre> | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 787 | #include <cstdio> | 
|  | 788 | #include <string> | 
| Chris Lattner | 61353b4 | 2007-10-23 05:43:01 +0000 | [diff] [blame] | 789 | #include <map> | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 790 | #include <vector> | 
|  | 791 |  | 
|  | 792 | //===----------------------------------------------------------------------===// | 
|  | 793 | // Lexer | 
|  | 794 | //===----------------------------------------------------------------------===// | 
|  | 795 |  | 
|  | 796 | // The lexer returns tokens [0-255] if it is an unknown character, otherwise one | 
|  | 797 | // of these for known things. | 
|  | 798 | enum Token { | 
|  | 799 | tok_eof = -1, | 
|  | 800 |  | 
|  | 801 | // commands | 
|  | 802 | tok_def = -2, tok_extern = -3, | 
|  | 803 |  | 
|  | 804 | // primary | 
|  | 805 | tok_identifier = -4, tok_number = -5, | 
|  | 806 | }; | 
|  | 807 |  | 
|  | 808 | static std::string IdentifierStr;  // Filled in if tok_identifier | 
|  | 809 | static double NumVal;              // Filled in if tok_number | 
|  | 810 |  | 
|  | 811 | /// gettok - Return the next token from standard input. | 
|  | 812 | static int gettok() { | 
|  | 813 | static int LastChar = ' '; | 
|  | 814 |  | 
|  | 815 | // Skip any whitespace. | 
|  | 816 | while (isspace(LastChar)) | 
|  | 817 | LastChar = getchar(); | 
|  | 818 |  | 
|  | 819 | if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]* | 
|  | 820 | IdentifierStr = LastChar; | 
|  | 821 | while (isalnum((LastChar = getchar()))) | 
|  | 822 | IdentifierStr += LastChar; | 
|  | 823 |  | 
|  | 824 | if (IdentifierStr == "def") return tok_def; | 
|  | 825 | if (IdentifierStr == "extern") return tok_extern; | 
|  | 826 | return tok_identifier; | 
|  | 827 | } | 
|  | 828 |  | 
|  | 829 | if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+ | 
|  | 830 | std::string NumStr; | 
|  | 831 | do { | 
|  | 832 | NumStr += LastChar; | 
|  | 833 | LastChar = getchar(); | 
|  | 834 | } while (isdigit(LastChar) || LastChar == '.'); | 
|  | 835 |  | 
|  | 836 | NumVal = strtod(NumStr.c_str(), 0); | 
|  | 837 | return tok_number; | 
|  | 838 | } | 
|  | 839 |  | 
|  | 840 | if (LastChar == '#') { | 
|  | 841 | // Comment until end of line. | 
|  | 842 | do LastChar = getchar(); | 
|  | 843 | while (LastChar != EOF && LastChar != '\n' & LastChar != '\r'); | 
|  | 844 |  | 
|  | 845 | if (LastChar != EOF) | 
|  | 846 | return gettok(); | 
|  | 847 | } | 
|  | 848 |  | 
|  | 849 | // Check for end of file.  Don't eat the EOF. | 
|  | 850 | if (LastChar == EOF) | 
|  | 851 | return tok_eof; | 
|  | 852 |  | 
|  | 853 | // Otherwise, just return the character as its ascii value. | 
|  | 854 | int ThisChar = LastChar; | 
|  | 855 | LastChar = getchar(); | 
|  | 856 | return ThisChar; | 
|  | 857 | } | 
|  | 858 |  | 
|  | 859 | //===----------------------------------------------------------------------===// | 
|  | 860 | // Abstract Syntax Tree (aka Parse Tree) | 
|  | 861 | //===----------------------------------------------------------------------===// | 
|  | 862 |  | 
|  | 863 | /// ExprAST - Base class for all expression nodes. | 
|  | 864 | class ExprAST { | 
|  | 865 | public: | 
|  | 866 | virtual ~ExprAST() {} | 
|  | 867 | }; | 
|  | 868 |  | 
|  | 869 | /// NumberExprAST - Expression class for numeric literals like "1.0". | 
|  | 870 | class NumberExprAST : public ExprAST { | 
|  | 871 | double Val; | 
|  | 872 | public: | 
| Chris Lattner | 61b4ec7 | 2007-10-23 04:27:44 +0000 | [diff] [blame] | 873 | explicit NumberExprAST(double val) : Val(val) {} | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 874 | }; | 
|  | 875 |  | 
|  | 876 | /// VariableExprAST - Expression class for referencing a variable, like "a". | 
|  | 877 | class VariableExprAST : public ExprAST { | 
|  | 878 | std::string Name; | 
|  | 879 | public: | 
| Chris Lattner | 61b4ec7 | 2007-10-23 04:27:44 +0000 | [diff] [blame] | 880 | explicit VariableExprAST(const std::string &name) : Name(name) {} | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 881 | }; | 
|  | 882 |  | 
|  | 883 | /// BinaryExprAST - Expression class for a binary operator. | 
|  | 884 | class BinaryExprAST : public ExprAST { | 
|  | 885 | char Op; | 
|  | 886 | ExprAST *LHS, *RHS; | 
|  | 887 | public: | 
|  | 888 | BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs) | 
|  | 889 | : Op(op), LHS(lhs), RHS(rhs) {} | 
|  | 890 | }; | 
|  | 891 |  | 
|  | 892 | /// CallExprAST - Expression class for function calls. | 
|  | 893 | class CallExprAST : public ExprAST { | 
|  | 894 | std::string Callee; | 
|  | 895 | std::vector<ExprAST*> Args; | 
|  | 896 | public: | 
|  | 897 | CallExprAST(const std::string &callee, std::vector<ExprAST*> &args) | 
|  | 898 | : Callee(callee), Args(args) {} | 
|  | 899 | }; | 
|  | 900 |  | 
|  | 901 | /// PrototypeAST - This class represents the "prototype" for a function, | 
|  | 902 | /// which captures its argument names as well as if it is an operator. | 
|  | 903 | class PrototypeAST { | 
|  | 904 | std::string Name; | 
|  | 905 | std::vector< Args; | 
|  | 906 | public: | 
|  | 907 | PrototypeAST(const std::string &name, const std::vector<std::string> &args) | 
|  | 908 | : Name(name), Args(args) {} | 
|  | 909 |  | 
|  | 910 | }; | 
|  | 911 |  | 
|  | 912 | /// FunctionAST - This class represents a function definition itself. | 
|  | 913 | class FunctionAST { | 
|  | 914 | PrototypeAST *Proto; | 
|  | 915 | ExprAST *Body; | 
|  | 916 | public: | 
|  | 917 | FunctionAST(PrototypeAST *proto, ExprAST *body) | 
|  | 918 | : Proto(proto), Body(body) {} | 
|  | 919 |  | 
|  | 920 | }; | 
|  | 921 |  | 
|  | 922 | //===----------------------------------------------------------------------===// | 
|  | 923 | // Parser | 
|  | 924 | //===----------------------------------------------------------------------===// | 
|  | 925 |  | 
|  | 926 | /// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current | 
|  | 927 | /// token the parser it looking at.  getNextToken reads another token from the | 
|  | 928 | /// lexer and updates CurTok with its results. | 
|  | 929 | static int CurTok; | 
|  | 930 | static int getNextToken() { | 
|  | 931 | return CurTok = gettok(); | 
|  | 932 | } | 
|  | 933 |  | 
|  | 934 | /// BinopPrecedence - This holds the precedence for each binary operator that is | 
|  | 935 | /// defined. | 
|  | 936 | static std::map<char, int> BinopPrecedence; | 
|  | 937 |  | 
|  | 938 | /// GetTokPrecedence - Get the precedence of the pending binary operator token. | 
|  | 939 | static int GetTokPrecedence() { | 
|  | 940 | if (!isascii(CurTok)) | 
|  | 941 | return -1; | 
|  | 942 |  | 
|  | 943 | // Make sure it's a declared binop. | 
|  | 944 | int TokPrec = BinopPrecedence[CurTok]; | 
|  | 945 | if (TokPrec <= 0) return -1; | 
|  | 946 | return TokPrec; | 
|  | 947 | } | 
|  | 948 |  | 
|  | 949 | /// Error* - These are little helper functions for error handling. | 
|  | 950 | ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;} | 
|  | 951 | PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; } | 
|  | 952 | FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; } | 
|  | 953 |  | 
|  | 954 | static ExprAST *ParseExpression(); | 
|  | 955 |  | 
|  | 956 | /// identifierexpr | 
| Chris Lattner | 9b2f777 | 2007-11-05 17:54:34 +0000 | [diff] [blame^] | 957 | ///   ::= identifier | 
|  | 958 | ///   ::= identifier '(' expression* ')' | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 959 | static ExprAST *ParseIdentifierExpr() { | 
|  | 960 | std::string IdName = IdentifierStr; | 
|  | 961 |  | 
| Chris Lattner | 9b2f777 | 2007-11-05 17:54:34 +0000 | [diff] [blame^] | 962 | getNextToken();  // eat identifier. | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 963 |  | 
|  | 964 | if (CurTok != '(') // Simple variable ref. | 
|  | 965 | return new VariableExprAST(IdName); | 
|  | 966 |  | 
|  | 967 | // Call. | 
|  | 968 | getNextToken();  // eat ( | 
|  | 969 | std::vector<ExprAST*> Args; | 
|  | 970 | while (1) { | 
|  | 971 | ExprAST *Arg = ParseExpression(); | 
|  | 972 | if (!Arg) return 0; | 
|  | 973 | Args.push_back(Arg); | 
|  | 974 |  | 
|  | 975 | if (CurTok == ')') break; | 
|  | 976 |  | 
|  | 977 | if (CurTok != ',') | 
|  | 978 | return Error("Expected ')'"); | 
|  | 979 | getNextToken(); | 
|  | 980 | } | 
|  | 981 |  | 
|  | 982 | // Eat the ')'. | 
|  | 983 | getNextToken(); | 
|  | 984 |  | 
|  | 985 | return new CallExprAST(IdName, Args); | 
|  | 986 | } | 
|  | 987 |  | 
|  | 988 | /// numberexpr ::= number | 
|  | 989 | static ExprAST *ParseNumberExpr() { | 
|  | 990 | ExprAST *Result = new NumberExprAST(NumVal); | 
|  | 991 | getNextToken(); // consume the number | 
|  | 992 | return Result; | 
|  | 993 | } | 
|  | 994 |  | 
|  | 995 | /// parenexpr ::= '(' expression ')' | 
|  | 996 | static ExprAST *ParseParenExpr() { | 
|  | 997 | getNextToken();  // eat (. | 
|  | 998 | ExprAST *V = ParseExpression(); | 
|  | 999 | if (!V) return 0; | 
|  | 1000 |  | 
|  | 1001 | if (CurTok != ')') | 
|  | 1002 | return Error("expected ')'"); | 
|  | 1003 | getNextToken();  // eat ). | 
|  | 1004 | return V; | 
|  | 1005 | } | 
|  | 1006 |  | 
|  | 1007 | /// primary | 
|  | 1008 | ///   ::= identifierexpr | 
|  | 1009 | ///   ::= numberexpr | 
|  | 1010 | ///   ::= parenexpr | 
|  | 1011 | static ExprAST *ParsePrimary() { | 
|  | 1012 | switch (CurTok) { | 
|  | 1013 | default: return Error("unknown token when expecting an expression"); | 
|  | 1014 | case tok_identifier: return ParseIdentifierExpr(); | 
|  | 1015 | case tok_number:     return ParseNumberExpr(); | 
|  | 1016 | case '(':            return ParseParenExpr(); | 
|  | 1017 | } | 
|  | 1018 | } | 
|  | 1019 |  | 
|  | 1020 | /// binoprhs | 
|  | 1021 | ///   ::= ('+' primary)* | 
|  | 1022 | static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) { | 
|  | 1023 | // If this is a binop, find its precedence. | 
|  | 1024 | while (1) { | 
|  | 1025 | int TokPrec = GetTokPrecedence(); | 
|  | 1026 |  | 
|  | 1027 | // If this is a binop that binds at least as tightly as the current binop, | 
|  | 1028 | // consume it, otherwise we are done. | 
|  | 1029 | if (TokPrec < ExprPrec) | 
|  | 1030 | return LHS; | 
|  | 1031 |  | 
|  | 1032 | // Okay, we know this is a binop. | 
|  | 1033 | int BinOp = CurTok; | 
|  | 1034 | getNextToken();  // eat binop | 
|  | 1035 |  | 
|  | 1036 | // Parse the primary expression after the binary operator. | 
|  | 1037 | ExprAST *RHS = ParsePrimary(); | 
|  | 1038 | if (!RHS) return 0; | 
|  | 1039 |  | 
|  | 1040 | // If BinOp binds less tightly with RHS than the operator after RHS, let | 
|  | 1041 | // the pending operator take RHS as its LHS. | 
|  | 1042 | int NextPrec = GetTokPrecedence(); | 
|  | 1043 | if (TokPrec < NextPrec) { | 
|  | 1044 | RHS = ParseBinOpRHS(TokPrec+1, RHS); | 
|  | 1045 | if (RHS == 0) return 0; | 
|  | 1046 | } | 
|  | 1047 |  | 
|  | 1048 | // Merge LHS/RHS. | 
|  | 1049 | LHS = new BinaryExprAST(BinOp, LHS, RHS); | 
|  | 1050 | } | 
|  | 1051 | } | 
|  | 1052 |  | 
|  | 1053 | /// expression | 
|  | 1054 | ///   ::= primary binoprhs | 
|  | 1055 | /// | 
|  | 1056 | static ExprAST *ParseExpression() { | 
|  | 1057 | ExprAST *LHS = ParsePrimary(); | 
|  | 1058 | if (!LHS) return 0; | 
|  | 1059 |  | 
|  | 1060 | return ParseBinOpRHS(0, LHS); | 
|  | 1061 | } | 
|  | 1062 |  | 
|  | 1063 | /// prototype | 
|  | 1064 | ///   ::= id '(' id* ')' | 
|  | 1065 | static PrototypeAST *ParsePrototype() { | 
|  | 1066 | if (CurTok != tok_identifier) | 
|  | 1067 | return ErrorP("Expected function name in prototype"); | 
|  | 1068 |  | 
|  | 1069 | std::string FnName = IdentifierStr; | 
|  | 1070 | getNextToken(); | 
|  | 1071 |  | 
|  | 1072 | if (CurTok != '(') | 
|  | 1073 | return ErrorP("Expected '(' in prototype"); | 
|  | 1074 |  | 
|  | 1075 | std::vector<std::string> ArgNames; | 
|  | 1076 | while (getNextToken() == tok_identifier) | 
|  | 1077 | ArgNames.push_back(IdentifierStr); | 
|  | 1078 | if (CurTok != ')') | 
|  | 1079 | return ErrorP("Expected ')' in prototype"); | 
|  | 1080 |  | 
|  | 1081 | // success. | 
|  | 1082 | getNextToken();  // eat ')'. | 
|  | 1083 |  | 
|  | 1084 | return new PrototypeAST(FnName, ArgNames); | 
|  | 1085 | } | 
|  | 1086 |  | 
|  | 1087 | /// definition ::= 'def' prototype expression | 
|  | 1088 | static FunctionAST *ParseDefinition() { | 
|  | 1089 | getNextToken();  // eat def. | 
|  | 1090 | PrototypeAST *Proto = ParsePrototype(); | 
|  | 1091 | if (Proto == 0) return 0; | 
|  | 1092 |  | 
|  | 1093 | if (ExprAST *E = ParseExpression()) | 
|  | 1094 | return new FunctionAST(Proto, E); | 
|  | 1095 | return 0; | 
|  | 1096 | } | 
|  | 1097 |  | 
|  | 1098 | /// toplevelexpr ::= expression | 
|  | 1099 | static FunctionAST *ParseTopLevelExpr() { | 
|  | 1100 | if (ExprAST *E = ParseExpression()) { | 
|  | 1101 | // Make an anonymous proto. | 
|  | 1102 | PrototypeAST *Proto = new PrototypeAST("", std::vector<()); | 
|  | 1103 | return new FunctionAST(Proto, E); | 
|  | 1104 | } | 
|  | 1105 | return 0; | 
|  | 1106 | } | 
|  | 1107 |  | 
|  | 1108 | /// external ::= 'extern' prototype | 
|  | 1109 | static PrototypeAST *ParseExtern() { | 
|  | 1110 | getNextToken();  // eat extern. | 
|  | 1111 | return ParsePrototype(); | 
|  | 1112 | } | 
|  | 1113 |  | 
|  | 1114 | //===----------------------------------------------------------------------===// | 
|  | 1115 | // Top-Level parsing | 
|  | 1116 | //===----------------------------------------------------------------------===// | 
|  | 1117 |  | 
|  | 1118 | static void HandleDefinition() { | 
|  | 1119 | if (FunctionAST *F = ParseDefinition()) { | 
|  | 1120 | fprintf(stderr, "Parsed a function definition.\n"); | 
|  | 1121 | } else { | 
|  | 1122 | // Skip token for error recovery. | 
|  | 1123 | getNextToken(); | 
|  | 1124 | } | 
|  | 1125 | } | 
|  | 1126 |  | 
|  | 1127 | static void HandleExtern() { | 
|  | 1128 | if (PrototypeAST *P = ParseExtern()) { | 
|  | 1129 | fprintf(stderr, "Parsed an extern\n"); | 
|  | 1130 | } else { | 
|  | 1131 | // Skip token for error recovery. | 
|  | 1132 | getNextToken(); | 
|  | 1133 | } | 
|  | 1134 | } | 
|  | 1135 |  | 
|  | 1136 | static void HandleTopLevelExpression() { | 
|  | 1137 | // Evaluate a top level expression into an anonymous function. | 
|  | 1138 | if (FunctionAST *F = ParseTopLevelExpr()) { | 
|  | 1139 | fprintf(stderr, "Parsed a top-level expr\n"); | 
|  | 1140 | } else { | 
|  | 1141 | // Skip token for error recovery. | 
|  | 1142 | getNextToken(); | 
|  | 1143 | } | 
|  | 1144 | } | 
|  | 1145 |  | 
|  | 1146 | /// top ::= definition | external | expression | ';' | 
|  | 1147 | static void MainLoop() { | 
|  | 1148 | while (1) { | 
|  | 1149 | fprintf(stderr, "ready> "); | 
|  | 1150 | switch (CurTok) { | 
|  | 1151 | case tok_eof:    return; | 
|  | 1152 | case ';':        getNextToken(); break;  // ignore top level semicolons. | 
|  | 1153 | case tok_def:    HandleDefinition(); break; | 
|  | 1154 | case tok_extern: HandleExtern(); break; | 
|  | 1155 | default:         HandleTopLevelExpression(); break; | 
|  | 1156 | } | 
|  | 1157 | } | 
|  | 1158 | } | 
|  | 1159 |  | 
|  | 1160 | //===----------------------------------------------------------------------===// | 
|  | 1161 | // Main driver code. | 
|  | 1162 | //===----------------------------------------------------------------------===// | 
|  | 1163 |  | 
|  | 1164 | int main() { | 
|  | 1165 | // Install standard binary operators. | 
|  | 1166 | // 1 is lowest precedence. | 
|  | 1167 | BinopPrecedence['<'] = 10; | 
|  | 1168 | BinopPrecedence['+'] = 20; | 
|  | 1169 | BinopPrecedence['-'] = 20; | 
|  | 1170 | BinopPrecedence['*'] = 40;  // highest. | 
|  | 1171 |  | 
|  | 1172 | // Prime the first token. | 
|  | 1173 | fprintf(stderr, "ready> "); | 
|  | 1174 | getNextToken(); | 
|  | 1175 |  | 
|  | 1176 | MainLoop(); | 
|  | 1177 | return 0; | 
|  | 1178 | } | 
|  | 1179 | </pre> | 
|  | 1180 | </div> | 
|  | 1181 | </div> | 
|  | 1182 |  | 
|  | 1183 | <!-- *********************************************************************** --> | 
|  | 1184 | <hr> | 
|  | 1185 | <address> | 
|  | 1186 | <a href="http://jigsaw.w3.org/css-validator/check/referer"><img | 
|  | 1187 | src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a> | 
|  | 1188 | <a href="http://validator.w3.org/check/referer"><img | 
| Chris Lattner | c3def15 | 2007-10-23 06:30:50 +0000 | [diff] [blame] | 1189 | src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a> | 
| Chris Lattner | 3a48708 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 1190 |  | 
|  | 1191 | <a href="mailto:sabre@nondot.org">Chris Lattner</a><br> | 
|  | 1192 | <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br> | 
|  | 1193 | Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $ | 
|  | 1194 | </address> | 
|  | 1195 | </body> | 
|  | 1196 | </html> |