Sean Silva | ee47edf | 2012-12-05 00:26:32 +0000 | [diff] [blame] | 1 | ================================================= |
| 2 | Kaleidoscope: Tutorial Introduction and the Lexer |
| 3 | ================================================= |
| 4 | |
| 5 | .. contents:: |
| 6 | :local: |
| 7 | |
| 8 | Written by `Chris Lattner <mailto:sabre@nondot.org>`_ and `Erick |
| 9 | Tryzelaar <mailto:idadesub@users.sourceforge.net>`_ |
| 10 | |
| 11 | Tutorial Introduction |
| 12 | ===================== |
| 13 | |
| 14 | Welcome to the "Implementing a language with LLVM" tutorial. This |
| 15 | tutorial runs through the implementation of a simple language, showing |
| 16 | how fun and easy it can be. This tutorial will get you up and started as |
| 17 | well as help to build a framework you can extend to other languages. The |
| 18 | code in this tutorial can also be used as a playground to hack on other |
| 19 | LLVM specific things. |
| 20 | |
| 21 | The goal of this tutorial is to progressively unveil our language, |
| 22 | describing how it is built up over time. This will let us cover a fairly |
| 23 | broad range of language design and LLVM-specific usage issues, showing |
| 24 | and explaining the code for it all along the way, without overwhelming |
| 25 | you with tons of details up front. |
| 26 | |
| 27 | It is useful to point out ahead of time that this tutorial is really |
| 28 | about teaching compiler techniques and LLVM specifically, *not* about |
| 29 | teaching modern and sane software engineering principles. In practice, |
| 30 | this means that we'll take a number of shortcuts to simplify the |
| 31 | exposition. For example, the code leaks memory, uses global variables |
| 32 | all over the place, doesn't use nice design patterns like |
| 33 | `visitors <http://en.wikipedia.org/wiki/Visitor_pattern>`_, etc... but |
| 34 | it is very simple. If you dig in and use the code as a basis for future |
| 35 | projects, fixing these deficiencies shouldn't be hard. |
| 36 | |
| 37 | I've tried to put this tutorial together in a way that makes chapters |
| 38 | easy to skip over if you are already familiar with or are uninterested |
| 39 | in the various pieces. The structure of the tutorial is: |
| 40 | |
| 41 | - `Chapter #1 <#language>`_: Introduction to the Kaleidoscope |
| 42 | language, and the definition of its Lexer - This shows where we are |
| 43 | going and the basic functionality that we want it to do. In order to |
| 44 | make this tutorial maximally understandable and hackable, we choose |
| 45 | to implement everything in Objective Caml instead of using lexer and |
| 46 | parser generators. LLVM obviously works just fine with such tools, |
| 47 | feel free to use one if you prefer. |
| 48 | - `Chapter #2 <OCamlLangImpl2.html>`_: Implementing a Parser and |
| 49 | AST - With the lexer in place, we can talk about parsing techniques |
| 50 | and basic AST construction. This tutorial describes recursive descent |
| 51 | parsing and operator precedence parsing. Nothing in Chapters 1 or 2 |
| 52 | is LLVM-specific, the code doesn't even link in LLVM at this point. |
| 53 | :) |
| 54 | - `Chapter #3 <OCamlLangImpl3.html>`_: Code generation to LLVM IR - |
| 55 | With the AST ready, we can show off how easy generation of LLVM IR |
| 56 | really is. |
| 57 | - `Chapter #4 <OCamlLangImpl4.html>`_: Adding JIT and Optimizer |
| 58 | Support - Because a lot of people are interested in using LLVM as a |
| 59 | JIT, we'll dive right into it and show you the 3 lines it takes to |
| 60 | add JIT support. LLVM is also useful in many other ways, but this is |
| 61 | one simple and "sexy" way to shows off its power. :) |
| 62 | - `Chapter #5 <OCamlLangImpl5.html>`_: Extending the Language: |
| 63 | Control Flow - With the language up and running, we show how to |
| 64 | extend it with control flow operations (if/then/else and a 'for' |
| 65 | loop). This gives us a chance to talk about simple SSA construction |
| 66 | and control flow. |
| 67 | - `Chapter #6 <OCamlLangImpl6.html>`_: Extending the Language: |
| 68 | User-defined Operators - This is a silly but fun chapter that talks |
| 69 | about extending the language to let the user program define their own |
| 70 | arbitrary unary and binary operators (with assignable precedence!). |
| 71 | This lets us build a significant piece of the "language" as library |
| 72 | routines. |
| 73 | - `Chapter #7 <OCamlLangImpl7.html>`_: Extending the Language: |
| 74 | Mutable Variables - This chapter talks about adding user-defined |
| 75 | local variables along with an assignment operator. The interesting |
| 76 | part about this is how easy and trivial it is to construct SSA form |
| 77 | in LLVM: no, LLVM does *not* require your front-end to construct SSA |
| 78 | form! |
| 79 | - `Chapter #8 <OCamlLangImpl8.html>`_: Conclusion and other useful |
| 80 | LLVM tidbits - This chapter wraps up the series by talking about |
| 81 | potential ways to extend the language, but also includes a bunch of |
| 82 | pointers to info about "special topics" like adding garbage |
| 83 | collection support, exceptions, debugging, support for "spaghetti |
| 84 | stacks", and a bunch of other tips and tricks. |
| 85 | |
| 86 | By the end of the tutorial, we'll have written a bit less than 700 lines |
| 87 | of non-comment, non-blank, lines of code. With this small amount of |
| 88 | code, we'll have built up a very reasonable compiler for a non-trivial |
| 89 | language including a hand-written lexer, parser, AST, as well as code |
| 90 | generation support with a JIT compiler. While other systems may have |
| 91 | interesting "hello world" tutorials, I think the breadth of this |
| 92 | tutorial is a great testament to the strengths of LLVM and why you |
| 93 | should consider it if you're interested in language or compiler design. |
| 94 | |
| 95 | A note about this tutorial: we expect you to extend the language and |
| 96 | play with it on your own. Take the code and go crazy hacking away at it, |
| 97 | compilers don't need to be scary creatures - it can be a lot of fun to |
| 98 | play with languages! |
| 99 | |
| 100 | The Basic Language |
| 101 | ================== |
| 102 | |
| 103 | This tutorial will be illustrated with a toy language that we'll call |
| 104 | "`Kaleidoscope <http://en.wikipedia.org/wiki/Kaleidoscope>`_" (derived |
| 105 | from "meaning beautiful, form, and view"). Kaleidoscope is a procedural |
| 106 | language that allows you to define functions, use conditionals, math, |
| 107 | etc. Over the course of the tutorial, we'll extend Kaleidoscope to |
| 108 | support the if/then/else construct, a for loop, user defined operators, |
| 109 | JIT compilation with a simple command line interface, etc. |
| 110 | |
| 111 | Because we want to keep things simple, the only datatype in Kaleidoscope |
| 112 | is a 64-bit floating point type (aka 'float' in O'Caml parlance). As |
| 113 | such, all values are implicitly double precision and the language |
| 114 | doesn't require type declarations. This gives the language a very nice |
| 115 | and simple syntax. For example, the following simple example computes |
| 116 | `Fibonacci numbers: <http://en.wikipedia.org/wiki/Fibonacci_number>`_ |
| 117 | |
| 118 | :: |
| 119 | |
| 120 | # Compute the x'th fibonacci number. |
| 121 | def fib(x) |
| 122 | if x < 3 then |
| 123 | 1 |
| 124 | else |
| 125 | fib(x-1)+fib(x-2) |
| 126 | |
| 127 | # This expression will compute the 40th number. |
| 128 | fib(40) |
| 129 | |
| 130 | We also allow Kaleidoscope to call into standard library functions (the |
| 131 | LLVM JIT makes this completely trivial). This means that you can use the |
| 132 | 'extern' keyword to define a function before you use it (this is also |
| 133 | useful for mutually recursive functions). For example: |
| 134 | |
| 135 | :: |
| 136 | |
| 137 | extern sin(arg); |
| 138 | extern cos(arg); |
| 139 | extern atan2(arg1 arg2); |
| 140 | |
| 141 | atan2(sin(.4), cos(42)) |
| 142 | |
| 143 | A more interesting example is included in Chapter 6 where we write a |
| 144 | little Kaleidoscope application that `displays a Mandelbrot |
| 145 | Set <OCamlLangImpl6.html#example>`_ at various levels of magnification. |
| 146 | |
| 147 | Lets dive into the implementation of this language! |
| 148 | |
| 149 | The Lexer |
| 150 | ========= |
| 151 | |
| 152 | When it comes to implementing a language, the first thing needed is the |
| 153 | ability to process a text file and recognize what it says. The |
| 154 | traditional way to do this is to use a |
| 155 | "`lexer <http://en.wikipedia.org/wiki/Lexical_analysis>`_" (aka |
| 156 | 'scanner') to break the input up into "tokens". Each token returned by |
| 157 | the lexer includes a token code and potentially some metadata (e.g. the |
| 158 | numeric value of a number). First, we define the possibilities: |
| 159 | |
| 160 | .. code-block:: ocaml |
| 161 | |
| 162 | (* The lexer returns these 'Kwd' if it is an unknown character, otherwise one of |
| 163 | * these others for known things. *) |
| 164 | type token = |
| 165 | (* commands *) |
| 166 | | Def | Extern |
| 167 | |
| 168 | (* primary *) |
| 169 | | Ident of string | Number of float |
| 170 | |
| 171 | (* unknown *) |
| 172 | | Kwd of char |
| 173 | |
| 174 | Each token returned by our lexer will be one of the token variant |
| 175 | values. An unknown character like '+' will be returned as |
| 176 | ``Token.Kwd '+'``. If the curr token is an identifier, the value will be |
| 177 | ``Token.Ident s``. If the current token is a numeric literal (like 1.0), |
| 178 | the value will be ``Token.Number 1.0``. |
| 179 | |
| 180 | The actual implementation of the lexer is a collection of functions |
| 181 | driven by a function named ``Lexer.lex``. The ``Lexer.lex`` function is |
| 182 | called to return the next token from standard input. We will use |
| 183 | `Camlp4 <http://caml.inria.fr/pub/docs/manual-camlp4/index.html>`_ to |
| 184 | simplify the tokenization of the standard input. Its definition starts |
| 185 | as: |
| 186 | |
| 187 | .. code-block:: ocaml |
| 188 | |
| 189 | (*===----------------------------------------------------------------------=== |
| 190 | * Lexer |
| 191 | *===----------------------------------------------------------------------===*) |
| 192 | |
| 193 | let rec lex = parser |
| 194 | (* Skip any whitespace. *) |
| 195 | | [< ' (' ' | '\n' | '\r' | '\t'); stream >] -> lex stream |
| 196 | |
| 197 | ``Lexer.lex`` works by recursing over a ``char Stream.t`` to read |
| 198 | characters one at a time from the standard input. It eats them as it |
| 199 | recognizes them and stores them in in a ``Token.token`` variant. The |
| 200 | first thing that it has to do is ignore whitespace between tokens. This |
| 201 | is accomplished with the recursive call above. |
| 202 | |
| 203 | The next thing ``Lexer.lex`` needs to do is recognize identifiers and |
| 204 | specific keywords like "def". Kaleidoscope does this with a pattern |
| 205 | match and a helper function. |
| 206 | |
| 207 | .. code-block:: ocaml |
| 208 | |
| 209 | (* identifier: [a-zA-Z][a-zA-Z0-9] *) |
| 210 | | [< ' ('A' .. 'Z' | 'a' .. 'z' as c); stream >] -> |
| 211 | let buffer = Buffer.create 1 in |
| 212 | Buffer.add_char buffer c; |
| 213 | lex_ident buffer stream |
| 214 | |
| 215 | ... |
| 216 | |
| 217 | and lex_ident buffer = parser |
| 218 | | [< ' ('A' .. 'Z' | 'a' .. 'z' | '0' .. '9' as c); stream >] -> |
| 219 | Buffer.add_char buffer c; |
| 220 | lex_ident buffer stream |
| 221 | | [< stream=lex >] -> |
| 222 | match Buffer.contents buffer with |
| 223 | | "def" -> [< 'Token.Def; stream >] |
| 224 | | "extern" -> [< 'Token.Extern; stream >] |
| 225 | | id -> [< 'Token.Ident id; stream >] |
| 226 | |
| 227 | Numeric values are similar: |
| 228 | |
| 229 | .. code-block:: ocaml |
| 230 | |
| 231 | (* number: [0-9.]+ *) |
| 232 | | [< ' ('0' .. '9' as c); stream >] -> |
| 233 | let buffer = Buffer.create 1 in |
| 234 | Buffer.add_char buffer c; |
| 235 | lex_number buffer stream |
| 236 | |
| 237 | ... |
| 238 | |
| 239 | and lex_number buffer = parser |
| 240 | | [< ' ('0' .. '9' | '.' as c); stream >] -> |
| 241 | Buffer.add_char buffer c; |
| 242 | lex_number buffer stream |
| 243 | | [< stream=lex >] -> |
| 244 | [< 'Token.Number (float_of_string (Buffer.contents buffer)); stream >] |
| 245 | |
| 246 | This is all pretty straight-forward code for processing input. When |
| 247 | reading a numeric value from input, we use the ocaml ``float_of_string`` |
| 248 | function to convert it to a numeric value that we store in |
| 249 | ``Token.Number``. Note that this isn't doing sufficient error checking: |
| 250 | it will raise ``Failure`` if the string "1.23.45.67". Feel free to |
| 251 | extend it :). Next we handle comments: |
| 252 | |
| 253 | .. code-block:: ocaml |
| 254 | |
| 255 | (* Comment until end of line. *) |
| 256 | | [< ' ('#'); stream >] -> |
| 257 | lex_comment stream |
| 258 | |
| 259 | ... |
| 260 | |
| 261 | and lex_comment = parser |
| 262 | | [< ' ('\n'); stream=lex >] -> stream |
| 263 | | [< 'c; e=lex_comment >] -> e |
| 264 | | [< >] -> [< >] |
| 265 | |
| 266 | We handle comments by skipping to the end of the line and then return |
| 267 | the next token. Finally, if the input doesn't match one of the above |
| 268 | cases, it is either an operator character like '+' or the end of the |
| 269 | file. These are handled with this code: |
| 270 | |
| 271 | .. code-block:: ocaml |
| 272 | |
| 273 | (* Otherwise, just return the character as its ascii value. *) |
| 274 | | [< 'c; stream >] -> |
| 275 | [< 'Token.Kwd c; lex stream >] |
| 276 | |
| 277 | (* end of stream. *) |
| 278 | | [< >] -> [< >] |
| 279 | |
| 280 | With this, we have the complete lexer for the basic Kaleidoscope |
| 281 | language (the `full code listing <OCamlLangImpl2.html#code>`_ for the |
| 282 | Lexer is available in the `next chapter <OCamlLangImpl2.html>`_ of the |
| 283 | tutorial). Next we'll `build a simple parser that uses this to build an |
| 284 | Abstract Syntax Tree <OCamlLangImpl2.html>`_. When we have that, we'll |
| 285 | include a driver so that you can use the lexer and parser together. |
| 286 | |
| 287 | `Next: Implementing a Parser and AST <OCamlLangImpl2.html>`_ |
| 288 | |