blob: 2ce59b98ebd963a10a3479580532c34577bb4012 [file] [log] [blame]
Chris Lattnerc38ef542007-10-22 04:32:37 +00001<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
3
4<html>
5<head>
6 <title>Kaleidoscope: The basic language, with its lexer</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
10</head>
11
12<body>
13
14<div class="doc_title">Kaleidoscope: The basic language, with its lexer</div>
15
Chris Lattner128eb862007-11-05 19:06:59 +000016<ul>
Chris Lattner0e555b12007-11-05 20:04:56 +000017<li><a href="index.html">Up to Tutorial Index</a></li>
Chris Lattner128eb862007-11-05 19:06:59 +000018<li>Chapter 1
19 <ol>
20 <li><a href="#intro">Tutorial Introduction</a></li>
21 <li><a href="#language">The Basic Language</a></li>
22 <li><a href="#lexer">The Lexer</a></li>
23 </ol>
24</li>
Chris Lattner0e555b12007-11-05 20:04:56 +000025<li><a href="LangImpl2.html">Chapter 2</a>: Implementing a Parser and AST</li>
Chris Lattner128eb862007-11-05 19:06:59 +000026</ul>
27
Chris Lattnerc38ef542007-10-22 04:32:37 +000028<div class="doc_author">
29 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
30</div>
31
32<!-- *********************************************************************** -->
33<div class="doc_section"><a name="intro">Tutorial Introduction</a></div>
34<!-- *********************************************************************** -->
35
36<div class="doc_text">
37
38<p>Welcome to the "Implementing a language with LLVM" tutorial. This tutorial
Chris Lattner619bc0a2007-11-05 20:13:56 +000039will run the through implementation of a simple language, showing how fun and
40easy it can be. This tutorial will get you up and started as well as help to
41build a framework you can extend to other languages. You can also use this
42tutorial to help you start playing with other LLVM specific things.
Chris Lattnerc38ef542007-10-22 04:32:37 +000043</p>
44
45</div>
46
47<!-- *********************************************************************** -->
Chris Lattner128eb862007-11-05 19:06:59 +000048<div class="doc_section"><a name="language">The Basic Language</a></div>
Chris Lattnerc38ef542007-10-22 04:32:37 +000049<!-- *********************************************************************** -->
50
51<div class="doc_text">
52
53<p>This tutorial will be illustrated with a toy language that we'll call
54"<a href="http://en.wikipedia.org/wiki/Kaleidoscope">Kaleidoscope</a>".
55Kaleidoscope is a procedural language that allows you to define functions, use
56conditionals, math, etc. Over the course of the tutorial, we'll extend
Chris Lattner619bc0a2007-11-05 20:13:56 +000057Kaleidoscope to support the if/then/else construct, a for loop, user defined
58operators, JIT compilation with a simple command line interface, etc.</p>
Chris Lattnerc38ef542007-10-22 04:32:37 +000059
Chris Lattner619bc0a2007-11-05 20:13:56 +000060<p>Because we want to keep things simple, the only datatype in Kaleidoscope is a
Chris Lattnerc38ef542007-10-22 04:32:37 +00006164-bit floating point type (aka 'double' in C parlance). As such, all values
62are implicitly double precision and the language doesn't require type
63declarations. This gives the language a very nice and simple syntax. For
Chris Lattner619bc0a2007-11-05 20:13:56 +000064example, the following simple example computes <a
65href="http://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers:</a></p>
Chris Lattnerc38ef542007-10-22 04:32:37 +000066
67<div class="doc_code">
68<pre>
69# Compute the x'th fibonacci number.
70def fib(x)
Chris Lattnere6c91042007-10-22 06:34:15 +000071 if x &lt; 3 then
Chris Lattnerc38ef542007-10-22 04:32:37 +000072 1
73 else
74 fib(x-1)+fib(x-2)
75
76# This expression will compute the 40th number.
77fib(40)
78</pre>
79</div>
80
Duncan Sands72261ff2007-11-05 16:04:58 +000081<p>We also allow Kaleidoscope to call into standard library functions (the LLVM
Chris Lattnerc38ef542007-10-22 04:32:37 +000082JIT makes this completely trivial). This means that you can use the 'extern'
83keyword to define a function before you use it (this is also useful for mutually
84recursive functions). For example:</p>
85
86<div class="doc_code">
87<pre>
88extern sin(arg);
89extern cos(arg);
90extern atan2(arg1 arg2);
91
92atan2(sin(.4), cos(42))
93</pre>
94</div>
95
96<p>In the first incarnation of the language, we will only support basic
97arithmetic: if/then/else will be added in a future installment. Another
98interesting aspect of the first implementation is that it is a completely
Chris Lattner619bc0a2007-11-05 20:13:56 +000099functional language, which does not allow you to have side-effects, etc. We
100will eventually add side effects for those who prefer them.</p>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000101
102<p>In order to make this tutorial
103maximally understandable and hackable, we choose to implement everything in C++
104instead of using lexer and parser generators. LLVM obviously works just fine
Duncan Sands72261ff2007-11-05 16:04:58 +0000105with such tools, and making use of them doesn't impact the overall design.</p>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000106
107<p>A note about this tutorial: we expect you to extend the language and play
108with it on your own. Take the code and go crazy hacking away at it. It can be
109a lot of fun to play with languages!</p>
110
111</div>
112
113<!-- *********************************************************************** -->
Chris Lattner128eb862007-11-05 19:06:59 +0000114<div class="doc_section"><a name="lexer">The Lexer</a></div>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000115<!-- *********************************************************************** -->
116
117<div class="doc_text">
118
119<p>When it comes to implementing a language, the first thing needed is
120the ability to process a text file and recognize what it says. The traditional
121way to do this is to use a "<a
122href="http://en.wikipedia.org/wiki/Lexical_analysis">lexer</a>" (aka 'scanner')
123to break the input up into "tokens". Each token returned by the lexer includes
124a token code and potentially some metadata (e.g. the numeric value of a number).
125First, we define the possibilities:
126</p>
127
128<div class="doc_code">
129<pre>
130// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
131// of these for known things.
132enum Token {
133 tok_eof = -1,
134
135 // commands
136 tok_def = -2, tok_extern = -3,
137
138 // primary
139 tok_identifier = -4, tok_number = -5,
140};
141
142static std::string IdentifierStr; // Filled in if tok_identifier
143static double NumVal; // Filled in if tok_number
144</pre>
145</div>
146
147<p>Each token returned by our lexer will either be one of the Token enum values
Chris Lattner619bc0a2007-11-05 20:13:56 +0000148or it will be an 'unknown' character like '+', which is returned as its ascii
Chris Lattnerc38ef542007-10-22 04:32:37 +0000149value. If the current token is an identifier, the <tt>IdentifierStr</tt>
150global variable holds the name of the identifier. If the current token is a
151numeric literal (like 1.0), <tt>NumVal</tt> holds its value. Note that we use
152global variables for simplicity, this is not the best choice for a real language
153implementation :).
154</p>
155
Chris Lattner619bc0a2007-11-05 20:13:56 +0000156<p>The actual implementation of the lexer is a single function named
157<tt>gettok</tt>. The <tt>gettok</tt> function is called to return the next token
158from standard input. Its definition starts as:</p>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000159
160<div class="doc_code">
161<pre>
162/// gettok - Return the next token from standard input.
163static int gettok() {
164 static int LastChar = ' ';
165
166 // Skip any whitespace.
167 while (isspace(LastChar))
168 LastChar = getchar();
169</pre>
170</div>
171
172<p>
173<tt>gettok</tt> works by calling the C <tt>getchar()</tt> function to read
174characters one at a time from standard input. It eats them as it recognizes
Chris Lattner619bc0a2007-11-05 20:13:56 +0000175them and stores the last character read, but not processed, in LastChar. The
Chris Lattnerc38ef542007-10-22 04:32:37 +0000176first thing that it has to do is ignore whitespace between tokens. This is
177accomplished with the loop above.</p>
178
Chris Lattner619bc0a2007-11-05 20:13:56 +0000179<p>The next thing <tt>gettok</tt> needs to do is recognize identifiers and
180specific keywords like "def". Kaleidoscope does this with this simple loop:</p>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000181
182<div class="doc_code">
183<pre>
184 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
185 IdentifierStr = LastChar;
186 while (isalnum((LastChar = getchar())))
187 IdentifierStr += LastChar;
188
189 if (IdentifierStr == "def") return tok_def;
190 if (IdentifierStr == "extern") return tok_extern;
191 return tok_identifier;
192 }
193</pre>
194</div>
195
Chris Lattner619bc0a2007-11-05 20:13:56 +0000196<p>Note that this code sets the '<tt>IdentifierStr</tt>' global whenever it
197lexes an identifier. Also, since language keywords are matched by the same
198loop, we handle them here inline. Numeric values are similar:</p>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000199
200<div class="doc_code">
201<pre>
202 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
203 std::string NumStr;
204 do {
205 NumStr += LastChar;
206 LastChar = getchar();
207 } while (isdigit(LastChar) || LastChar == '.');
208
209 NumVal = strtod(NumStr.c_str(), 0);
210 return tok_number;
211 }
212</pre>
213</div>
214
215<p>This is all pretty straight-forward code for processing input. When reading
216a numeric value from input, we use the C <tt>strtod</tt> function to convert it
217to a numeric value that we store in <tt>NumVal</tt>. Note that this isn't doing
Duncan Sands72261ff2007-11-05 16:04:58 +0000218sufficient error checking: it will incorrectly read "1.23.45.67" and handle it as
Chris Lattnerc38ef542007-10-22 04:32:37 +0000219if you typed in "1.23". Feel free to extend it :). Next we handle comments:
220</p>
221
222<div class="doc_code">
223<pre>
224 if (LastChar == '#') {
225 // Comment until end of line.
226 do LastChar = getchar();
227 while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
228
229 if (LastChar != EOF)
230 return gettok();
231 }
232</pre>
233</div>
234
Chris Lattner71155212007-11-06 01:39:12 +0000235<p>We handle comments by skipping to the end of the line and then return the
Chris Lattnerc38ef542007-10-22 04:32:37 +0000236next comment. Finally, if the input doesn't match one of the above cases, it is
Duncan Sands72261ff2007-11-05 16:04:58 +0000237either an operator character like '+' or the end of the file. These are handled with
Chris Lattnerc38ef542007-10-22 04:32:37 +0000238this code:</p>
239
240<div class="doc_code">
241<pre>
242 // Check for end of file. Don't eat the EOF.
243 if (LastChar == EOF)
244 return tok_eof;
245
246 // Otherwise, just return the character as its ascii value.
247 int ThisChar = LastChar;
248 LastChar = getchar();
249 return ThisChar;
250}
251</pre>
252</div>
253
Chris Lattner619bc0a2007-11-05 20:13:56 +0000254<p>With this, we have the complete lexer for the basic Kaleidoscope language
255(the <a href="LangImpl2.html#code">full code listing</a> for the Lexer is
256available in the <a href="LangImpl2.html">next chapter</a> of the tutorial).
Chris Lattnerc38ef542007-10-22 04:32:37 +0000257Next we'll <a href="LangImpl2.html">build a simple parser that uses this to
Chris Lattnere6c91042007-10-22 06:34:15 +0000258build an Abstract Syntax Tree</a>. When we have that, we'll include a driver
259so that you can use the lexer and parser together.
Chris Lattnerc38ef542007-10-22 04:32:37 +0000260</p>
261
262</div>
263
264<!-- *********************************************************************** -->
265<hr>
266<address>
267 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
268 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
269 <a href="http://validator.w3.org/check/referer"><img
Chris Lattner8eef4b22007-10-23 06:30:50 +0000270 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000271
272 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
273 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
274 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
275</address>
276</body>
277</html>