<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
                      "http://www.w3.org/TR/html4/strict.dtd">

<html>
<head>
  <title>Kaleidoscope: Tutorial Introduction and the Lexer</title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <meta name="author" content="Chris Lattner">
  <meta name="author" content="Erick Tryzelaar">
  <link rel="stylesheet" href="../llvm.css" type="text/css">
</head>

<body>

<div class="doc_title">Kaleidoscope: Tutorial Introduction and the Lexer</div>

<ul>
<li><a href="index.html">Up to Tutorial Index</a></li>
<li>Chapter 1
  <ol>
    <li><a href="#intro">Tutorial Introduction</a></li>
    <li><a href="#language">The Basic Language</a></li>
    <li><a href="#lexer">The Lexer</a></li>
  </ol>
</li>
<li><a href="OCamlLangImpl2.html">Chapter 2</a>: Implementing a Parser and
AST</li>
</ul>

<div class="doc_author">
  <p>
    Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a>
    and <a href="mailto:idadesub@users.sourceforge.net">Erick Tryzelaar</a>
  </p>
</div>

<!-- *********************************************************************** -->
<div class="doc_section"><a name="intro">Tutorial Introduction</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>Welcome to the "Implementing a language with LLVM" tutorial. This tutorial
runs through the implementation of a simple language, showing how fun and
easy it can be. It will get you up and running, and help you build a
framework that you can extend to other languages. The code in this tutorial
can also be used as a playground to hack on other LLVM-specific things.
</p>

<p>
The goal of this tutorial is to progressively unveil our language, describing
how it is built up over time. This will let us cover a fairly broad range of
language design and LLVM-specific usage issues, showing and explaining the code
for it all along the way, without overwhelming you with tons of details up
front.</p>

<p>It is useful to point out ahead of time that this tutorial is really about
teaching compiler techniques and LLVM specifically, <em>not</em> about teaching
modern and sane software engineering principles. In practice, this means that
we'll take a number of shortcuts to simplify the exposition. For example, the
code leaks memory, uses global variables all over the place, and doesn't use
nice design patterns like <a
href="http://en.wikipedia.org/wiki/Visitor_pattern">visitors</a>, but it
is very simple. If you dig in and use the code as a basis for future projects,
fixing these deficiencies shouldn't be hard.</p>

<p>I've tried to put this tutorial together in a way that makes chapters easy to
skip over if you are already familiar with or are uninterested in the various
pieces. The structure of the tutorial is:
</p>

<ul>
<li><b><a href="#language">Chapter #1</a>: Introduction to the Kaleidoscope
language, and the definition of its Lexer</b> - This shows where we are going
and the basic functionality that we want the language to have. In order to make
this tutorial maximally understandable and hackable, we choose to implement
everything in Objective Caml instead of using lexer and parser generators.
LLVM obviously works just fine with such tools, so feel free to use one if you
prefer.</li>
<li><b><a href="OCamlLangImpl2.html">Chapter #2</a>: Implementing a Parser and
AST</b> - With the lexer in place, we can talk about parsing techniques and
basic AST construction. This tutorial describes recursive descent parsing and
operator precedence parsing. Nothing in Chapters 1 or 2 is LLVM-specific;
the code doesn't even link in LLVM at this point. :)</li>
<li><b><a href="OCamlLangImpl3.html">Chapter #3</a>: Code generation to LLVM
IR</b> - With the AST ready, we can show off how easy generation of LLVM IR
really is.</li>
<li><b><a href="OCamlLangImpl4.html">Chapter #4</a>: Adding JIT and Optimizer
Support</b> - Because a lot of people are interested in using LLVM as a JIT,
we'll dive right into it and show you the 3 lines it takes to add JIT support.
LLVM is also useful in many other ways, but this is one simple and "sexy" way
to show off its power. :)</li>
<li><b><a href="OCamlLangImpl5.html">Chapter #5</a>: Extending the Language:
Control Flow</b> - With the language up and running, we show how to extend it
with control flow operations (if/then/else and a 'for' loop). This gives us a
chance to talk about simple SSA construction and control flow.</li>
<li><b><a href="OCamlLangImpl6.html">Chapter #6</a>: Extending the Language:
User-defined Operators</b> - This is a silly but fun chapter that talks about
extending the language to let the user's program define its own arbitrary
unary and binary operators (with assignable precedence!). This lets us build a
significant piece of the "language" as library routines.</li>
<li><b><a href="OCamlLangImpl7.html">Chapter #7</a>: Extending the Language:
Mutable Variables</b> - This chapter talks about adding user-defined local
variables along with an assignment operator. The interesting part about this
is how easy and trivial it is to construct SSA form in LLVM: no, LLVM does
<em>not</em> require your front-end to construct SSA form!</li>
<li><b><a href="OCamlLangImpl8.html">Chapter #8</a>: Conclusion and other
useful LLVM tidbits</b> - This chapter wraps up the series by talking about
potential ways to extend the language, but also includes a bunch of pointers to
info about "special topics" like adding garbage collection support, exceptions,
debugging, support for "spaghetti stacks", and a bunch of other tips and
tricks.</li>

</ul>

<p>By the end of the tutorial, we'll have written a bit less than 700
non-comment, non-blank lines of code. With this small amount of code, we'll
have built up a very reasonable compiler for a non-trivial language, including
a hand-written lexer, parser, and AST, as well as code generation support with
a JIT compiler. While other systems may have interesting "hello world"
tutorials, I think the breadth of this tutorial is a great testament to the
strengths of LLVM and why you should consider it if you're interested in
language or compiler design.</p>

<p>A note about this tutorial: we expect you to extend the language and play
with it on your own. Take the code and go crazy hacking away at it. Compilers
don't need to be scary creatures; it can be a lot of fun to play with
languages!</p>

</div>

<!-- *********************************************************************** -->
<div class="doc_section"><a name="language">The Basic Language</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>This tutorial will be illustrated with a toy language that we'll call
"<a href="http://en.wikipedia.org/wiki/Kaleidoscope">Kaleidoscope</a>" (derived
from Greek roots meaning "beautiful, form, and view").
Kaleidoscope is a procedural language that allows you to define functions, use
conditionals, math, etc. Over the course of the tutorial, we'll extend
Kaleidoscope to support the if/then/else construct, a for loop, user-defined
operators, JIT compilation with a simple command line interface, etc.</p>

<p>Because we want to keep things simple, the only datatype in Kaleidoscope is a
64-bit floating point type (aka 'float' in OCaml parlance). As such, all
values are implicitly double precision and the language doesn't require type
declarations. This gives the language a very nice and simple syntax. For
example, the following code computes <a
href="http://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers</a>:</p>

<div class="doc_code">
<pre>
# Compute the x'th fibonacci number.
def fib(x)
  if x < 3 then
    1
  else
    fib(x-1)+fib(x-2)

# This expression will compute the 40th number.
fib(40)
</pre>
</div>

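<p>To make the "every value is a double" rule concrete, here is a rough OCaml
rendering of the same function. It is only an illustration of the semantics and
is not part of the tutorial's code; note how every literal and operator has to
be the floating point form.</p>

```ocaml
(* Hypothetical OCaml rendering of the Kaleidoscope fib example.
   Every value is a float, mirroring Kaleidoscope's single 64-bit type. *)
let rec fib (x : float) : float =
  if x < 3.0 then 1.0
  else fib (x -. 1.0) +. fib (x -. 2.0)

let () = Printf.printf "fib(10) = %.0f\n" (fib 10.0)
```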
<p>We also allow Kaleidoscope to call into standard library functions (the LLVM
JIT makes this completely trivial). This means that you can use the 'extern'
keyword to declare a function before you use it (this is also useful for
mutually recursive functions). For example:</p>

<div class="doc_code">
<pre>
extern sin(arg);
extern cos(arg);
extern atan2(arg1 arg2);

atan2(sin(.4), cos(42))
</pre>
</div>

<p>A more interesting example is included in Chapter 6 where we write a little
Kaleidoscope application that <a href="OCamlLangImpl6.html#example">displays
a Mandelbrot Set</a> at various levels of magnification.</p>

<p>Let's dive into the implementation of this language!</p>

</div>

<!-- *********************************************************************** -->
<div class="doc_section"><a name="lexer">The Lexer</a></div>
<!-- *********************************************************************** -->

<div class="doc_text">

<p>When it comes to implementing a language, the first thing needed is
the ability to process a text file and recognize what it says. The traditional
way to do this is to use a "<a
href="http://en.wikipedia.org/wiki/Lexical_analysis">lexer</a>" (aka 'scanner')
to break the input up into "tokens". Each token returned by the lexer includes
a token code and potentially some metadata (e.g. the numeric value of a number).
First, we define the possibilities:
</p>

<div class="doc_code">
<pre>
(* The lexer returns 'Kwd' if it sees an unknown character, otherwise one of
 * the other variants for known things. *)
type token =
  (* commands *)
  | Def | Extern

  (* primary *)
  | Ident of string | Number of float

  (* unknown *)
  | Kwd of char
</pre>
</div>

<p>Each token returned by our lexer will be one of the token variant values.
An unknown character like '+' will be returned as <tt>Token.Kwd '+'</tt>. If
the current token is an identifier, the value will be <tt>Token.Ident s</tt>,
where <tt>s</tt> is the identifier's name. If the current token is a numeric
literal (like 1.0), the value will be <tt>Token.Number 1.0</tt>.
</p>

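<p>As a minimal sketch of how these variants carry their metadata, the block
below declares the same token type at top level (rather than inside a
<tt>Token</tt> module, an assumption made to keep it self-contained) and adds a
hypothetical <tt>string_of_token</tt> helper for debugging. It is plain OCaml
and does not need Camlp4.</p>

```ocaml
(* Same shape as the tutorial's token type, declared at top level here
   instead of inside a Token module. *)
type token =
  | Def
  | Extern
  | Ident of string
  | Number of float
  | Kwd of char

(* string_of_token is a hypothetical debugging helper, not tutorial code. *)
let string_of_token = function
  | Def -> "Def"
  | Extern -> "Extern"
  | Ident s -> Printf.sprintf "Ident %S" s
  | Number n -> Printf.sprintf "Number %g" n
  | Kwd c -> Printf.sprintf "Kwd %C" c

let () =
  (* Tokens a lexer might produce for the input: def f(x) x + 1.0 *)
  [ Def; Ident "f"; Kwd '('; Ident "x"; Kwd ')';
    Ident "x"; Kwd '+'; Number 1.0 ]
  |> List.map string_of_token
  |> String.concat " "
  |> print_endline
```

<p>Pattern matching on the variant is how later stages, such as the parser,
would get at the identifier name or numeric value attached to a token.</p>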
<p>The actual implementation of the lexer is a collection of functions driven
by a function named <tt>Lexer.lex</tt>. The <tt>Lexer.lex</tt> function is
called to return the next token from standard input. We will use
<a href="http://caml.inria.fr/pub/docs/manual-camlp4/index.html">Camlp4</a>
to simplify the tokenization of the standard input. The definition of
<tt>Lexer.lex</tt> starts as:</p>

<div class="doc_code">
<pre>
(*===----------------------------------------------------------------------===
 * Lexer
 *===----------------------------------------------------------------------===*)

let rec lex = parser
  (* Skip any whitespace. *)
  | [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream
</pre>
</div>

<p>
<tt>Lexer.lex</tt> works by recursing over a <tt>char Stream.t</tt> to read
characters one at a time from the standard input. It eats them as it recognizes
them and stores them in a <tt>Token.token</tt> variant. The first thing that
it has to do is ignore whitespace between tokens. This is accomplished with the
recursive call above.</p>

<p>The next thing <tt>Lexer.lex</tt> needs to do is recognize identifiers and
specific keywords like "def". Kaleidoscope does this with a pattern match
and a helper function.</p>

<div class="doc_code">
<pre>
  (* identifier: [a-zA-Z][a-zA-Z0-9]* *)
  | [< ' ('A' .. 'Z' | 'a' .. 'z' as c); stream >] ->
      let buffer = Buffer.create 1 in
      Buffer.add_char buffer c;
      lex_ident buffer stream

...

and lex_ident buffer = parser
  | [< ' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' as c); stream >] ->
      Buffer.add_char buffer c;
      lex_ident buffer stream
  | [< stream=lex >] ->
      match Buffer.contents buffer with
      | "def" -> [< 'Token.Def; stream >]
      | "extern" -> [< 'Token.Extern; stream >]
      | id -> [< 'Token.Ident id; stream >]
</pre>
</div>

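<p>The same idea can be sketched without Camlp4 by scanning a string by index.
This is an assumption-laden illustration rather than the tutorial's
implementation: <tt>lex_ident_at</tt>, <tt>token_of_ident</tt>,
<tt>is_alpha</tt> and <tt>is_alnum</tt> are hypothetical names, and the token
type is redeclared locally so the block stands alone.</p>

```ocaml
type token = Def | Extern | Ident of string | Number of float | Kwd of char

let is_alpha c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let is_alnum c = is_alpha c || (c >= '0' && c <= '9')

(* Classify a finished identifier: keywords first, Ident otherwise. *)
let token_of_ident = function
  | "def" -> Def
  | "extern" -> Extern
  | id -> Ident id

(* lex_ident_at s i assumes s.[i] is a letter. It accumulates characters in a
   Buffer, as the tutorial does, and returns the token together with the
   index of the first character after the identifier. *)
let lex_ident_at (s : string) (i : int) : token * int =
  let buffer = Buffer.create 16 in
  let rec go j =
    if j < String.length s && is_alnum s.[j] then begin
      Buffer.add_char buffer s.[j];
      go (j + 1)
    end else j
  in
  let stop = go i in
  (token_of_ident (Buffer.contents buffer), stop)
```

<p>Deciding between keyword and identifier only after the whole word has been
read is what keeps a name such as <tt>extern1</tt> from being mistaken for the
<tt>extern</tt> keyword.</p>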
<p>Numeric values are similar:</p>

<div class="doc_code">
<pre>
  (* number: [0-9.]+ *)
  | [< ' ('0' .. '9' as c); stream >] ->
      let buffer = Buffer.create 1 in
      Buffer.add_char buffer c;
      lex_number buffer stream

...

and lex_number buffer = parser
  | [< ' ('0' .. '9' | '.' as c); stream >] ->
      Buffer.add_char buffer c;
      lex_number buffer stream
  | [< stream=lex >] ->
      [< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >]
</pre>
</div>

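<p>The rule above hands whatever digits and dots it collected straight to
<tt>float_of_string</tt>, which raises <tt>Failure</tt> on malformed text. One
possible way to tighten this, sketched here under the assumption of OCaml 4.05
or later (for <tt>float_of_string_opt</tt>), is to convert through a helper
that reports the problem as a value. <tt>number_of_lexeme</tt> is a
hypothetical name, not part of the tutorial.</p>

```ocaml
(* number_of_lexeme is a hypothetical helper: it turns the text collected by
   a number rule into a result instead of letting Failure escape. *)
let number_of_lexeme (text : string) : (float, string) result =
  match float_of_string_opt text with
  | Some value -> Ok value
  | None -> Error (Printf.sprintf "malformed number %S" text)

let () =
  List.iter
    (fun text ->
      match number_of_lexeme text with
      | Ok value -> Printf.printf "%s -> %g\n" text value
      | Error msg -> Printf.printf "%s -> error: %s\n" text msg)
    [ "1.0"; "42"; "1.23.45.67" ]
```

<p>A lexer built this way could turn the <tt>Error</tt> case into a diagnostic
that names the offending text, rather than aborting with an uncaught
exception.</p>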
<p>This is all pretty straightforward code for processing input. When reading
a numeric value from input, we use the OCaml <tt>float_of_string</tt> function
to convert it to a numeric value that we store in <tt>Token.Number</tt>. Note
that this isn't doing sufficient error checking: it will raise <tt>Failure</tt>
if given the string "1.23.45.67". Feel free to extend it :). Next we handle
comments:
</p>

<div class="doc_code">
<pre>
  (* Comment until end of line. *)
  | [< ' ('#'); stream >] ->
      lex_comment stream

...

and lex_comment = parser
  | [< ' ('\n'); stream=lex >] -> stream
  | [< 'c; e=lex_comment >] -> e
  | [< >] -> [< >]
</pre>
</div>

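<p>The three cases of <tt>lex_comment</tt> (newline found, any other character,
end of input) can be mirrored over a plain string. The following is a small
sketch, not the tutorial's code: <tt>skip_comment</tt> is a hypothetical name,
and <tt>String.index_from_opt</tt> assumes OCaml 4.05 or later.</p>

```ocaml
(* skip_comment s i assumes s.[i] = '#'. It returns the index just past the
   newline that ends the comment, or String.length s if the input ends first. *)
let skip_comment (s : string) (i : int) : int =
  match String.index_from_opt s i '\n' with
  | Some newline -> newline + 1
  | None -> String.length s

let () =
  let source = "# Compute something.\nfib(40)" in
  let next = skip_comment source 0 in
  (* Show the text that lexing would resume on after the comment. *)
  print_endline (String.sub source next (String.length source - next))
```

<p>The end-of-input case matters because a comment on the last line of a file
need not be followed by a newline.</p>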
<p>We handle comments by skipping to the end of the line and then returning the
next token. Finally, if the input doesn't match one of the above cases, it is
either an operator character like '+' or the end of the file. These are handled
with this code:</p>

<div class="doc_code">
<pre>
  (* Otherwise, just return the character as a Token.Kwd. *)
  | [< 'c; stream >] ->
      [< 'Token.Kwd c; lex stream >]

  (* end of stream. *)
  | [< >] -> [< >]
</pre>
</div>

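<p>Putting the rules together, here is a compact sketch of an equivalent lexer
in plain OCaml. It departs from the tutorial in ways that should be read as
assumptions: it lexes a whole string eagerly into a list instead of producing a
lazy Camlp4 stream, it redeclares the token type at top level, and
<tt>tokenize</tt> and <tt>span</tt> are hypothetical names.</p>

```ocaml
type token = Def | Extern | Ident of string | Number of float | Kwd of char

let is_space c = c = ' ' || c = '\n' || c = '\r' || c = '\t'
let is_alpha c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let is_digit c = c >= '0' && c <= '9'

(* Index of the first position at or after i where pred fails. *)
let rec span pred s i =
  if i < String.length s && pred s.[i] then span pred s (i + 1) else i

(* tokenize is a hypothetical entry point: it lexes a whole string eagerly and
   returns the tokens in order, where the tutorial builds a lazy stream. *)
let tokenize (s : string) : token list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      match s.[i] with
      | c when is_space c -> go (i + 1) acc
      | c when is_alpha c ->
          let stop = span (fun c -> is_alpha c || is_digit c) s i in
          let tok =
            match String.sub s i (stop - i) with
            | "def" -> Def
            | "extern" -> Extern
            | id -> Ident id
          in
          go stop (tok :: acc)
      | c when is_digit c ->
          let stop = span (fun c -> is_digit c || c = '.') s i in
          (* Like the tutorial, this raises Failure on text such as 1.2.3. *)
          go stop (Number (float_of_string (String.sub s i (stop - i))) :: acc)
      | '#' ->
          (* Skip to the newline; the whitespace case then consumes it. *)
          go (span (fun c -> c <> '\n') s i) acc
      | c -> go (i + 1) (Kwd c :: acc)
  in
  go 0 []
```

<p>With these definitions, <tt>tokenize "extern sin(arg);"</tt> evaluates to
<tt>[Extern; Ident "sin"; Kwd '('; Ident "arg"; Kwd ')'; Kwd ';']</tt>. The
order of the match cases mirrors the order of the rules in
<tt>Lexer.lex</tt>, with the catch-all <tt>Kwd</tt> case last.</p>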
<p>With this, we have the complete lexer for the basic Kaleidoscope language
(the <a href="OCamlLangImpl2.html#code">full code listing</a> for the Lexer is
available in the <a href="OCamlLangImpl2.html">next chapter</a> of the
tutorial). Next we'll <a href="OCamlLangImpl2.html">build a simple parser that
uses this to build an Abstract Syntax Tree</a>. When we have that, we'll
include a driver so that you can use the lexer and parser together.
</p>

<a href="OCamlLangImpl2.html">Next: Implementing a Parser and AST</a>
</div>

<!-- *********************************************************************** -->
<hr>
<address>
  <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
  src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
  <a href="http://validator.w3.org/check/referer"><img
  src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>

  <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
  <a href="mailto:idadesub@users.sourceforge.net">Erick Tryzelaar</a><br>
  <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
  Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
</address>
</body>
</html>