<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
                      "http://www.w3.org/TR/html4/strict.dtd">

<html>
<head>
  <title>Kaleidoscope: Tutorial Introduction and the Lexer</title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <meta name="author" content="Chris Lattner">
  <link rel="stylesheet" href="../llvm.css" type="text/css">
</head>

<body>

<div class="doc_title">Kaleidoscope: Tutorial Introduction and the Lexer</div>

<ul>
<li><a href="index.html">Up to Tutorial Index</a></li>
<li>Chapter 1
  <ol>
    <li><a href="#intro">Tutorial Introduction</a></li>
    <li><a href="#language">The Basic Language</a></li>
    <li><a href="#lexer">The Lexer</a></li>
  </ol>
</li>
<li><a href="LangImpl2.html">Chapter 2</a>: Implementing a Parser and AST</li>
</ul>

<div class="doc_author">
  <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
</div>

<!-- *********************************************************************** -->
<div class="doc_section"><a name="intro">Tutorial Introduction</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>Welcome to the "Implementing a language with LLVM" tutorial. This tutorial
runs through the implementation of a simple language, showing how fun and
easy it can be. It will get you up and running, and help you build a
framework that you can extend to other languages and use as a starting point
for playing with other LLVM-specific features.
</p>

<p>
The goal of this tutorial is to progressively unveil our language, describing
how it is built up over time. This will let us cover a fairly broad range of
language design and LLVM-specific usage issues, showing and explaining the code
for it all along the way, without overwhelming you with tons of details up
front.</p>

<p>It is useful to point out ahead of time that this tutorial is really about
teaching compiler techniques and LLVM specifically, <em>not</em> about teaching
modern and sane software engineering principles. In practice, this means that
we'll take a number of shortcuts to simplify the exposition. For example, the
code leaks memory, uses global variables all over the place, and doesn't use
nice design patterns like visitors, but it is very simple. If you dig in and
use the code as a basis for future projects, fixing these deficiencies shouldn't
be hard.</p>

<p>I've tried to put this tutorial together in a way that makes chapters easy to
skip over if you are already familiar with, or uninterested in, the pieces they
cover. The structure of the tutorial is:
</p>

<ul>
<li><b><a href="#language">Chapter #1</a>: Introduction to the Kaleidoscope
language, and the definition of its Lexer</b> - This shows where we are going
and the basic functionality that we want the language to have. In order to make
this tutorial maximally understandable and hackable, we choose to implement
everything in C++ instead of using lexer and parser generators. LLVM obviously
works just fine with such tools; feel free to use one if you prefer.</li>
<li><b><a href="LangImpl2.html">Chapter #2</a>: Implementing a Parser and
AST</b> - With the lexer in place, we can talk about parsing techniques and
basic AST construction. This tutorial describes recursive descent parsing and
operator precedence parsing. Nothing in Chapters 1 or 2 is LLVM-specific;
the code doesn't even link in LLVM at this point. :)</li>
<li><b><a href="LangImpl3.html">Chapter #3</a>: Code generation to LLVM IR</b> -
With the AST ready, we can show off how easy generation of LLVM IR really
is.</li>
<li><b><a href="LangImpl4.html">Chapter #4</a>: Adding JIT and Optimizer
Support</b> - Because a lot of people are interested in using LLVM as a JIT,
we'll dive right into it and show you the 3 lines it takes to add JIT support.
LLVM is also useful in many other ways, but this is one simple and "sexy" way
that shows off its power. :)</li>
<li><b><a href="LangImpl5.html">Chapter #5</a>: Extending the Language: Control
Flow</b> - With the language up and running, we show how to extend it with
control flow operations (if/then/else and a for loop). This gives us a chance
to talk about simple SSA construction and control flow.</li>
<li><b><a href="LangImpl6.html">Chapter #6</a>: Extending the Language:
User-defined Operators</b> - This is a silly but fun chapter that talks about
extending the language to let user programs define their own arbitrary
unary and binary operators (with assignable precedence!). This lets us build a
significant piece of the "language" as library routines.</li>
<li><b><a href="LangImpl7.html">Chapter #7</a>: Extending the Language: Mutable
Variables</b> - This chapter talks about adding user-defined local variables
along with a variable assignment operator. The interesting part about this is
how easy and trivial it is to construct SSA form in LLVM (no, LLVM does
<em>not</em> require your front-end to construct SSA form!).</li>
<li><b><a href="LangImpl8.html">Chapter #8</a>: Conclusion and other useful LLVM
tidbits</b> - This chapter wraps up the series by talking about potential
ways to extend the language, and also includes a bunch of pointers to info about
"special topics" like adding garbage collection support, exceptions, debugging,
support for "spaghetti stacks", and a bunch of other tips and tricks.</li>

</ul>

<p>By the end of the tutorial, we'll have written about 700 non-comment,
non-blank lines of code. With this small amount of code, we'll
have built up a very reasonable compiler for a non-trivial language, including
a hand-written lexer, parser, and AST, as well as code generation support with
a JIT compiler. While other systems may have interesting "hello world"
tutorials, I think the breadth of this tutorial is a great testament to the
strengths of LLVM and why you should consider it if you're interested in
language or compiler design.</p>

<p>A note about this tutorial: we expect you to extend the language and play
with it on your own. Take the code and go crazy hacking away at it. It can be
a lot of fun to play with languages! In any case, let's get into the code!</p>

</div>

<!-- *********************************************************************** -->
<div class="doc_section"><a name="language">The Basic Language</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>This tutorial will be illustrated with a toy language that we'll call
"<a href="http://en.wikipedia.org/wiki/Kaleidoscope">Kaleidoscope</a>".
Kaleidoscope is a procedural language that allows you to define functions, use
conditionals, math, etc. Over the course of the tutorial, we'll extend
Kaleidoscope to support the if/then/else construct, a for loop, user-defined
operators, JIT compilation with a simple command line interface, etc.</p>

<p>Because we want to keep things simple, the only datatype in Kaleidoscope is a
64-bit floating point type (aka 'double' in C parlance). As such, all values
are implicitly double precision and the language doesn't require type
declarations. This gives the language a very nice and simple syntax. For
example, the following code computes <a
href="http://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers</a>:</p>

<div class="doc_code">
<pre>
# Compute the x'th fibonacci number.
def fib(x)
  if x &lt; 3 then
    1
  else
    fib(x-1)+fib(x-2)

# This expression will compute the 40th number.
fib(40)
</pre>
</div>

<p>We also allow Kaleidoscope to call into standard library functions (the LLVM
JIT makes this completely trivial). This means that you can use the 'extern'
keyword to declare a function before you use it (this is also useful for
mutually recursive functions). For example:</p>

<div class="doc_code">
<pre>
extern sin(arg);
extern cos(arg);
extern atan2(arg1 arg2);

atan2(sin(.4), cos(42))
</pre>
</div>

<p>A more interesting example is included in Chapter 6 where we show the code
used to <a href="LangImpl6.html#example">implement a Mandelbrot Set viewer</a>
in Kaleidoscope.</p>

</div>

<!-- *********************************************************************** -->
<div class="doc_section"><a name="lexer">The Lexer</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>When it comes to implementing a language, the first thing needed is
the ability to process a text file and recognize what it says. The traditional
way to do this is to use a "<a
href="http://en.wikipedia.org/wiki/Lexical_analysis">lexer</a>" (aka 'scanner')
to break the input up into "tokens". Each token returned by the lexer includes
a token code and potentially some metadata (e.g. the numeric value of a number).
First, we define the possibilities:
</p>

<div class="doc_code">
<pre>
// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
// of these for known things.
enum Token {
  tok_eof = -1,

  // commands
  tok_def = -2, tok_extern = -3,

  // primary
  tok_identifier = -4, tok_number = -5,
};

static std::string IdentifierStr;  // Filled in if tok_identifier
static double NumVal;              // Filled in if tok_number
</pre>
</div>

<p>Each token returned by our lexer will either be one of the Token enum values
or it will be an 'unknown' character like '+', which is returned as its ASCII
value. If the current token is an identifier, the <tt>IdentifierStr</tt>
global variable holds the name of the identifier. If the current token is a
numeric literal (like 1.0), <tt>NumVal</tt> holds its value. Note that we use
global variables for simplicity; this is not the best choice for a real language
implementation :).
</p>

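<p>To make the split between negative enum values and raw character codes
concrete, here is a minimal sketch of a debugging helper that turns a token code
into a readable string. <tt>DescribeToken</tt> is a hypothetical name introduced
only for illustration; it is not part of the tutorial code.</p>

```cpp
#include <string>

// Token codes as defined above.
enum Token {
  tok_eof = -1,
  tok_def = -2, tok_extern = -3,
  tok_identifier = -4, tok_number = -5,
};

// DescribeToken - Hypothetical helper (not in the tutorial): map a token code
// to a printable description.
static std::string DescribeToken(int Tok) {
  switch (Tok) {
  case tok_eof:        return "tok_eof";
  case tok_def:        return "tok_def";
  case tok_extern:     return "tok_extern";
  case tok_identifier: return "tok_identifier";
  case tok_number:     return "tok_number";
  default:
    // Anything else is an 'unknown' character returned as its own value.
    return std::string("'") + static_cast<char>(Tok) + "'";
  }
}
```

<p>Because every named token is negative and every plain character is in
[0-255], the two ranges can never collide, which is why a single <tt>int</tt>
return value is enough.</p>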
<p>The actual implementation of the lexer is a single function named
<tt>gettok</tt>. The <tt>gettok</tt> function is called to return the next token
from standard input. Its definition starts as:</p>

<div class="doc_code">
<pre>
/// gettok - Return the next token from standard input.
static int gettok() {
  static int LastChar = ' ';

  // Skip any whitespace.
  while (isspace(LastChar))
    LastChar = getchar();
</pre>
</div>

<p>
<tt>gettok</tt> works by calling the C <tt>getchar()</tt> function to read
characters one at a time from standard input. It eats them as it recognizes
them and stores the last character read, but not yet processed, in
<tt>LastChar</tt>. The first thing that it has to do is ignore whitespace
between tokens. This is accomplished with the loop above.</p>

<p>The next thing <tt>gettok</tt> needs to do is recognize identifiers and
specific keywords like "def". Kaleidoscope does this with this simple loop:</p>

<div class="doc_code">
<pre>
  if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
    IdentifierStr = LastChar;
    while (isalnum((LastChar = getchar())))
      IdentifierStr += LastChar;

    if (IdentifierStr == "def") return tok_def;
    if (IdentifierStr == "extern") return tok_extern;
    return tok_identifier;
  }
</pre>
</div>

<p>Note that this code sets the '<tt>IdentifierStr</tt>' global whenever it
lexes an identifier. Also, since language keywords are matched by the same
loop, we handle them here inline. Numeric values are similar:</p>

<div class="doc_code">
<pre>
  if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
    std::string NumStr;
    do {
      NumStr += LastChar;
      LastChar = getchar();
    } while (isdigit(LastChar) || LastChar == '.');

    NumVal = strtod(NumStr.c_str(), 0);
    return tok_number;
  }
</pre>
</div>


<p>This is all pretty straightforward code for processing input. When reading
a numeric value from input, we use the C <tt>strtod</tt> function to convert it
to a numeric value that we store in <tt>NumVal</tt>. Note that this isn't doing
sufficient error checking: it will incorrectly read "1.23.45.67" and handle it
as if you typed in "1.23". Feel free to extend it :). Next we handle comments:
</p>

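<p>As one possible way to extend the number handling, the sketch below uses the
end pointer that <tt>strtod</tt> reports to reject strings it could not consume
in full. <tt>ParseNumber</tt> is a hypothetical helper introduced for this
illustration, not something the tutorial defines, and it assumes the default "C"
locale.</p>

```cpp
#include <cstdlib>
#include <string>

// ParseNumber - Hypothetical helper (not in the tutorial). Returns true and
// stores the value in Out only if strtod consumed all of NumStr. A string like
// "1.23.45.67" is rejected because conversion stops at the second '.'.
static bool ParseNumber(const std::string &NumStr, double &Out) {
  const char *Begin = NumStr.c_str();
  char *End = nullptr;
  double Val = std::strtod(Begin, &End);

  // End == Begin means nothing was converted (e.g. a lone ".");
  // *End != '\0' means trailing characters were left over.
  if (End == Begin || *End != '\0')
    return false;

  Out = Val;
  return true;
}
```

<p>A lexer using this could report an error token when the check fails instead
of silently truncating the literal. The tutorial lexer keeps the simpler
behavior; its comment handling comes next:</p>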
<div class="doc_code">
<pre>
  if (LastChar == '#') {
    // Comment until end of line.
    do LastChar = getchar();
    while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp;&amp; LastChar != '\r');

    if (LastChar != EOF)
      return gettok();
  }
</pre>
</div>


<p>We handle comments by skipping to the end of the line and then returning the
next token. Finally, if the input doesn't match one of the above cases, it is
either an operator character like '+' or the end of the file. These are handled
with this code:</p>

<div class="doc_code">
<pre>
  // Check for end of file.  Don't eat the EOF.
  if (LastChar == EOF)
    return tok_eof;

  // Otherwise, just return the character as its ascii value.
  int ThisChar = LastChar;
  LastChar = getchar();
  return ThisChar;
}
</pre>
</div>

<p>With this, we have the complete lexer for the basic Kaleidoscope language
(the <a href="LangImpl2.html#code">full code listing</a> for the Lexer is
available in the <a href="LangImpl2.html">next chapter</a> of the tutorial).
Next we'll <a href="LangImpl2.html">build a simple parser that uses this to
build an Abstract Syntax Tree</a>. When we have that, we'll include a driver
so that you can use the lexer and parser together.
</p>

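<p>If you want to experiment with the lexer on its own before the parser
exists, one option is a variant that reads from a <tt>std::istream</tt> instead
of <tt>getchar()</tt>, so it can be run on a string. The sketch below is such a
variant under that assumption: the <tt>Lexer</tt> struct and the
<tt>LexAll</tt> function are hypothetical names, and moving the globals into a
struct is a departure from the tutorial code made only to keep the example
self-contained.</p>

```cpp
#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

enum Token {
  tok_eof = -1,
  tok_def = -2, tok_extern = -3,
  tok_identifier = -4, tok_number = -5,
};

// Hypothetical wrapper: the tutorial keeps this state in globals and a
// function-local static, and reads with getchar().
struct Lexer {
  std::istream &In;
  int LastChar = ' ';
  std::string IdentifierStr; // Filled in if tok_identifier
  double NumVal = 0;         // Filled in if tok_number

  explicit Lexer(std::istream &S) : In(S) {}

  int gettok() {
    // Skip any whitespace.
    while (std::isspace(LastChar))
      LastChar = In.get();

    if (std::isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
      IdentifierStr = static_cast<char>(LastChar);
      while (std::isalnum((LastChar = In.get())))
        IdentifierStr += static_cast<char>(LastChar);
      if (IdentifierStr == "def") return tok_def;
      if (IdentifierStr == "extern") return tok_extern;
      return tok_identifier;
    }

    if (std::isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
      std::string NumStr;
      do {
        NumStr += static_cast<char>(LastChar);
        LastChar = In.get();
      } while (std::isdigit(LastChar) || LastChar == '.');
      NumVal = std::strtod(NumStr.c_str(), nullptr);
      return tok_number;
    }

    if (LastChar == '#') { // Comment until end of line.
      do
        LastChar = In.get();
      while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
      if (LastChar != EOF)
        return gettok();
    }

    // Check for end of file.  Don't eat the EOF.
    if (LastChar == EOF)
      return tok_eof;

    // Otherwise, just return the character as its ascii value.
    int ThisChar = LastChar;
    LastChar = In.get();
    return ThisChar;
  }
};

// LexAll - Lex all of Source and collect the token codes, ending with tok_eof.
static std::vector<int> LexAll(const std::string &Source) {
  std::istringstream In(Source);
  Lexer L(In);
  std::vector<int> Toks;
  int Tok;
  do {
    Tok = L.gettok();
    Toks.push_back(Tok);
  } while (Tok != tok_eof);
  return Toks;
}
```

<p>For instance, calling <tt>LexAll("def foo(x) x+1.5 # c\n")</tt> walks through
every branch of <tt>gettok</tt>: a keyword, identifiers, single-character
tokens, a number, and a comment that runs to the end of the input.</p>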
</div>

<!-- *********************************************************************** -->
<hr>
<address>
  <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
  src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
  <a href="http://validator.w3.org/check/referer"><img
  src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>

  <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
  <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
  Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
</address>
</body>
</html>