Chris Lattner | c38ef54 | 2007-10-22 04:32:37 +0000 | [diff] [blame] | 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| 2 | "http://www.w3.org/TR/html4/strict.dtd"> |
| 3 | |
| 4 | <html> |
| 5 | <head> |
| 6 | <title>Kaleidoscope: The basic language, with its lexer</title> |
| 7 | <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> |
| 8 | <meta name="author" content="Chris Lattner"> |
| 9 | <link rel="stylesheet" href="../llvm.css" type="text/css"> |
| 10 | </head> |
| 11 | |
| 12 | <body> |
| 13 | |
| 14 | <div class="doc_title">Kaleidoscope: The basic language, with its lexer</div> |
| 15 | |
| 16 | <div class="doc_author"> |
| 17 | <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p> |
| 18 | </div> |
| 19 | |
| 20 | <!-- *********************************************************************** --> |
| 21 | <div class="doc_section"><a name="intro">Tutorial Introduction</a></div> |
| 22 | <!-- *********************************************************************** --> |
| 23 | |
| 24 | <div class="doc_text"> |
| 25 | |
| 26 | <p>Welcome to the "Implementing a language with LLVM" tutorial. This tutorial |
| 27 | will run through implementation of a simple language, showing how fun and easy |
| 28 | it can be. This tutorial will get you up and started and build a framework you |
| 29 | can extend to other languages and to play with other things. |
| 30 | </p> |
| 31 | |
| 32 | </div> |
| 33 | |
| 34 | <!-- *********************************************************************** --> |
| 35 | <div class="doc_section"><a name="language">The basic language</a></div> |
| 36 | <!-- *********************************************************************** --> |
| 37 | |
| 38 | <div class="doc_text"> |
| 39 | |
| 40 | <p>This tutorial will be illustrated with a toy language that we'll call |
| 41 | "<a href="http://en.wikipedia.org/wiki/Kaleidoscope">Kaleidoscope</a>". |
| 42 | Kaleidoscope is a procedural language that allows you to define functions, use |
| 43 | conditionals, math, etc. Over the course of the tutorial, we'll extend |
| 44 | Kaleidoscope to support if/then/else, operator overloading, JIT compilation with |
| 45 | a simple command line interface, etc.</p> |
| 46 | |
| 47 | <p>Because we want to keep things simple, in Kaleidoscope the only datatype is a |
| 48 | 64-bit floating point type (aka 'double' in C parlance). As such, all values |
| 49 | are implicitly double precision and the language doesn't require type |
| 50 | declarations. This gives the language a very nice and simple syntax. For |
| 51 | example, A simple example computes <a |
| 52 | href="http://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers</a>, |
| 53 | which looks like this:</p> |
| 54 | |
| 55 | <div class="doc_code"> |
| 56 | <pre> |
| 57 | # Compute the x'th fibonacci number. |
| 58 | def fib(x) |
Chris Lattner | e6c9104 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 59 | if x < 3 then |
Chris Lattner | c38ef54 | 2007-10-22 04:32:37 +0000 | [diff] [blame] | 60 | 1 |
| 61 | else |
| 62 | fib(x-1)+fib(x-2) |
| 63 | |
| 64 | # This expression will compute the 40th number. |
| 65 | fib(40) |
| 66 | </pre> |
| 67 | </div> |
| 68 | |
Duncan Sands | 72261ff | 2007-11-05 16:04:58 +0000 | [diff] [blame^] | 69 | <p>We also allow Kaleidoscope to call into standard library functions (the LLVM |
Chris Lattner | c38ef54 | 2007-10-22 04:32:37 +0000 | [diff] [blame] | 70 | JIT makes this completely trivial). This means that you can use the 'extern' |
| 71 | keyword to define a function before you use it (this is also useful for mutually |
| 72 | recursive functions). For example:</p> |
| 73 | |
| 74 | <div class="doc_code"> |
| 75 | <pre> |
| 76 | extern sin(arg); |
| 77 | extern cos(arg); |
| 78 | extern atan2(arg1 arg2); |
| 79 | |
| 80 | atan2(sin(.4), cos(42)) |
| 81 | </pre> |
| 82 | </div> |
| 83 | |
| 84 | <p>In the first incarnation of the language, we will only support basic |
| 85 | arithmetic: if/then/else will be added in a future installment. Another |
| 86 | interesting aspect of the first implementation is that it is a completely |
| 87 | functional language, which does not allow you to have side-effects etc. We will |
| 88 | eventually add side effects for those who prefer them.</p> |
| 89 | |
| 90 | <p>In order to make this tutorial |
| 91 | maximally understandable and hackable, we choose to implement everything in C++ |
| 92 | instead of using lexer and parser generators. LLVM obviously works just fine |
Duncan Sands | 72261ff | 2007-11-05 16:04:58 +0000 | [diff] [blame^] | 93 | with such tools, and making use of them doesn't impact the overall design.</p> |
Chris Lattner | c38ef54 | 2007-10-22 04:32:37 +0000 | [diff] [blame] | 94 | |
| 95 | <p>A note about this tutorial: we expect you to extend the language and play |
| 96 | with it on your own. Take the code and go crazy hacking away at it. It can be |
| 97 | a lot of fun to play with languages!</p> |
| 98 | |
| 99 | </div> |
| 100 | |
| 101 | <!-- *********************************************************************** --> |
| 102 | <div class="doc_section"><a name="language">The Lexer</a></div> |
| 103 | <!-- *********************************************************************** --> |
| 104 | |
| 105 | <div class="doc_text"> |
| 106 | |
| 107 | <p>When it comes to implementing a language, the first thing needed is |
| 108 | the ability to process a text file and recognize what it says. The traditional |
| 109 | way to do this is to use a "<a |
| 110 | href="http://en.wikipedia.org/wiki/Lexical_analysis">lexer</a>" (aka 'scanner') |
| 111 | to break the input up into "tokens". Each token returned by the lexer includes |
| 112 | a token code and potentially some metadata (e.g. the numeric value of a number). |
| 113 | First, we define the possibilities: |
| 114 | </p> |
| 115 | |
| 116 | <div class="doc_code"> |
| 117 | <pre> |
| 118 | // The lexer returns tokens [0-255] if it is an unknown character, otherwise one |
| 119 | // of these for known things. |
| 120 | enum Token { |
| 121 | tok_eof = -1, |
| 122 | |
| 123 | // commands |
| 124 | tok_def = -2, tok_extern = -3, |
| 125 | |
| 126 | // primary |
| 127 | tok_identifier = -4, tok_number = -5, |
| 128 | }; |
| 129 | |
| 130 | static std::string IdentifierStr; // Filled in if tok_identifier |
| 131 | static double NumVal; // Filled in if tok_number |
| 132 | </pre> |
| 133 | </div> |
| 134 | |
| 135 | <p>Each token returned by our lexer will either be one of the Token enum values |
| 136 | or it will be an 'unknown' character like '+' which is returned as its ascii |
| 137 | value. If the current token is an identifier, the <tt>IdentifierStr</tt> |
| 138 | global variable holds the name of the identifier. If the current token is a |
| 139 | numeric literal (like 1.0), <tt>NumVal</tt> holds its value. Note that we use |
| 140 | global variables for simplicity, this is not the best choice for a real language |
| 141 | implementation :). |
| 142 | </p> |
| 143 | |
| 144 | <p>The actual implementation of the lexer is a single function <tt>gettok</tt>. |
| 145 | <tt>gettok</tt> is called to return the next token from standard input. Its |
| 146 | definition starts as:</p> |
| 147 | |
| 148 | <div class="doc_code"> |
| 149 | <pre> |
| 150 | /// gettok - Return the next token from standard input. |
| 151 | static int gettok() { |
| 152 | static int LastChar = ' '; |
| 153 | |
| 154 | // Skip any whitespace. |
| 155 | while (isspace(LastChar)) |
| 156 | LastChar = getchar(); |
| 157 | </pre> |
| 158 | </div> |
| 159 | |
| 160 | <p> |
| 161 | <tt>gettok</tt> works by calling the C <tt>getchar()</tt> function to read |
| 162 | characters one at a time from standard input. It eats them as it recognizes |
| 163 | them and stores the last character read but not processed in LastChar. The |
| 164 | first thing that it has to do is ignore whitespace between tokens. This is |
| 165 | accomplished with the loop above.</p> |
| 166 | |
| 167 | <p>The next thing it needs to do is recognize identifiers, and specific keywords |
| 168 | like "def". Kaleidoscope does this with this simple loop:</p> |
| 169 | |
| 170 | <div class="doc_code"> |
| 171 | <pre> |
| 172 | if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]* |
| 173 | IdentifierStr = LastChar; |
| 174 | while (isalnum((LastChar = getchar()))) |
| 175 | IdentifierStr += LastChar; |
| 176 | |
| 177 | if (IdentifierStr == "def") return tok_def; |
| 178 | if (IdentifierStr == "extern") return tok_extern; |
| 179 | return tok_identifier; |
| 180 | } |
| 181 | </pre> |
| 182 | </div> |
| 183 | |
| 184 | <p>Note that it sets the '<tt>IdentifierStr</tt>' global whenever it lexes an |
| 185 | identifier. Also, since language keywords are matched by the same loop, we |
| 186 | handle them here inline. Numeric values are similar:</p> |
| 187 | |
| 188 | <div class="doc_code"> |
| 189 | <pre> |
| 190 | if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+ |
| 191 | std::string NumStr; |
| 192 | do { |
| 193 | NumStr += LastChar; |
| 194 | LastChar = getchar(); |
| 195 | } while (isdigit(LastChar) || LastChar == '.'); |
| 196 | |
| 197 | NumVal = strtod(NumStr.c_str(), 0); |
| 198 | return tok_number; |
| 199 | } |
| 200 | </pre> |
| 201 | </div> |
| 202 | |
| 203 | <p>This is all pretty straight-forward code for processing input. When reading |
| 204 | a numeric value from input, we use the C <tt>strtod</tt> function to convert it |
| 205 | to a numeric value that we store in <tt>NumVal</tt>. Note that this isn't doing |
Duncan Sands | 72261ff | 2007-11-05 16:04:58 +0000 | [diff] [blame^] | 206 | sufficient error checking: it will incorrectly read "1.23.45.67" and handle it as |
Chris Lattner | c38ef54 | 2007-10-22 04:32:37 +0000 | [diff] [blame] | 207 | if you typed in "1.23". Feel free to extend it :). Next we handle comments: |
| 208 | </p> |
| 209 | |
| 210 | <div class="doc_code"> |
| 211 | <pre> |
| 212 | if (LastChar == '#') { |
| 213 | // Comment until end of line. |
| 214 | do LastChar = getchar(); |
| 215 | while (LastChar != EOF && LastChar != '\n' & LastChar != '\r'); |
| 216 | |
| 217 | if (LastChar != EOF) |
| 218 | return gettok(); |
| 219 | } |
| 220 | </pre> |
| 221 | </div> |
| 222 | |
Duncan Sands | 72261ff | 2007-11-05 16:04:58 +0000 | [diff] [blame^] | 223 | <p>We handle comments by skipping to the end of the line and then returning the |
Chris Lattner | c38ef54 | 2007-10-22 04:32:37 +0000 | [diff] [blame] | 224 | next comment. Finally, if the input doesn't match one of the above cases, it is |
Duncan Sands | 72261ff | 2007-11-05 16:04:58 +0000 | [diff] [blame^] | 225 | either an operator character like '+' or the end of the file. These are handled with |
Chris Lattner | c38ef54 | 2007-10-22 04:32:37 +0000 | [diff] [blame] | 226 | this code:</p> |
| 227 | |
| 228 | <div class="doc_code"> |
| 229 | <pre> |
| 230 | // Check for end of file. Don't eat the EOF. |
| 231 | if (LastChar == EOF) |
| 232 | return tok_eof; |
| 233 | |
| 234 | // Otherwise, just return the character as its ascii value. |
| 235 | int ThisChar = LastChar; |
| 236 | LastChar = getchar(); |
| 237 | return ThisChar; |
| 238 | } |
| 239 | </pre> |
| 240 | </div> |
| 241 | |
| 242 | <p>With this, we have the complete lexer for the basic Kaleidoscope language. |
| 243 | Next we'll <a href="LangImpl2.html">build a simple parser that uses this to |
Chris Lattner | e6c9104 | 2007-10-22 06:34:15 +0000 | [diff] [blame] | 244 | build an Abstract Syntax Tree</a>. When we have that, we'll include a driver |
| 245 | so that you can use the lexer and parser together. |
Chris Lattner | c38ef54 | 2007-10-22 04:32:37 +0000 | [diff] [blame] | 246 | </p> |
| 247 | |
| 248 | </div> |
| 249 | |
| 250 | <!-- *********************************************************************** --> |
| 251 | <hr> |
| 252 | <address> |
| 253 | <a href="http://jigsaw.w3.org/css-validator/check/referer"><img |
| 254 | src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a> |
| 255 | <a href="http://validator.w3.org/check/referer"><img |
Chris Lattner | 8eef4b2 | 2007-10-23 06:30:50 +0000 | [diff] [blame] | 256 | src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a> |
Chris Lattner | c38ef54 | 2007-10-22 04:32:37 +0000 | [diff] [blame] | 257 | |
| 258 | <a href="mailto:sabre@nondot.org">Chris Lattner</a><br> |
| 259 | <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br> |
| 260 | Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $ |
| 261 | </address> |
| 262 | </body> |
| 263 | </html> |