blob: 8d967508238ecf33cb54f172c52dc56fffcd69d0 [file] [log] [blame]
Chris Lattnerc38ef542007-10-22 04:32:37 +00001<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
3
4<html>
5<head>
6 <title>Kaleidoscope: The basic language, with its lexer</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
10</head>
11
12<body>
13
14<div class="doc_title">Kaleidoscope: The basic language, with its lexer</div>
15
Chris Lattner128eb862007-11-05 19:06:59 +000016<ul>
17<li>Chapter 1
18 <ol>
19 <li><a href="#intro">Tutorial Introduction</a></li>
20 <li><a href="#language">The Basic Language</a></li>
21 <li><a href="#lexer">The Lexer</a></li>
22 </ol>
23</li>
24</ul>
25
Chris Lattnerc38ef542007-10-22 04:32:37 +000026<div class="doc_author">
27 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
28</div>
29
30<!-- *********************************************************************** -->
31<div class="doc_section"><a name="intro">Tutorial Introduction</a></div>
32<!-- *********************************************************************** -->
33
34<div class="doc_text">
35
36<p>Welcome to the "Implementing a language with LLVM" tutorial. This tutorial
37will run through implementation of a simple language, showing how fun and easy
38it can be. This tutorial will get you up and started and build a framework you
39can extend to other languages and to play with other things.
40</p>
41
42</div>
43
44<!-- *********************************************************************** -->
Chris Lattner128eb862007-11-05 19:06:59 +000045<div class="doc_section"><a name="language">The Basic Language</a></div>
Chris Lattnerc38ef542007-10-22 04:32:37 +000046<!-- *********************************************************************** -->
47
48<div class="doc_text">
49
50<p>This tutorial will be illustrated with a toy language that we'll call
51"<a href="http://en.wikipedia.org/wiki/Kaleidoscope">Kaleidoscope</a>".
52Kaleidoscope is a procedural language that allows you to define functions, use
53conditionals, math, etc. Over the course of the tutorial, we'll extend
54Kaleidoscope to support if/then/else, operator overloading, JIT compilation with
55a simple command line interface, etc.</p>
56
57<p>Because we want to keep things simple, in Kaleidoscope the only datatype is a
5864-bit floating point type (aka 'double' in C parlance). As such, all values
59are implicitly double precision and the language doesn't require type
60declarations. This gives the language a very nice and simple syntax. For
61example, A simple example computes <a
62href="http://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers</a>,
63which looks like this:</p>
64
65<div class="doc_code">
66<pre>
67# Compute the x'th fibonacci number.
68def fib(x)
Chris Lattnere6c91042007-10-22 06:34:15 +000069 if x &lt; 3 then
Chris Lattnerc38ef542007-10-22 04:32:37 +000070 1
71 else
72 fib(x-1)+fib(x-2)
73
74# This expression will compute the 40th number.
75fib(40)
76</pre>
77</div>
78
Duncan Sands72261ff2007-11-05 16:04:58 +000079<p>We also allow Kaleidoscope to call into standard library functions (the LLVM
Chris Lattnerc38ef542007-10-22 04:32:37 +000080JIT makes this completely trivial). This means that you can use the 'extern'
81keyword to define a function before you use it (this is also useful for mutually
82recursive functions). For example:</p>
83
84<div class="doc_code">
85<pre>
86extern sin(arg);
87extern cos(arg);
88extern atan2(arg1 arg2);
89
90atan2(sin(.4), cos(42))
91</pre>
92</div>
93
94<p>In the first incarnation of the language, we will only support basic
95arithmetic: if/then/else will be added in a future installment. Another
96interesting aspect of the first implementation is that it is a completely
97functional language, which does not allow you to have side-effects etc. We will
98eventually add side effects for those who prefer them.</p>
99
100<p>In order to make this tutorial
101maximally understandable and hackable, we choose to implement everything in C++
102instead of using lexer and parser generators. LLVM obviously works just fine
Duncan Sands72261ff2007-11-05 16:04:58 +0000103with such tools, and making use of them doesn't impact the overall design.</p>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000104
105<p>A note about this tutorial: we expect you to extend the language and play
106with it on your own. Take the code and go crazy hacking away at it. It can be
107a lot of fun to play with languages!</p>
108
109</div>
110
111<!-- *********************************************************************** -->
Chris Lattner128eb862007-11-05 19:06:59 +0000112<div class="doc_section"><a name="lexer">The Lexer</a></div>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000113<!-- *********************************************************************** -->
114
115<div class="doc_text">
116
117<p>When it comes to implementing a language, the first thing needed is
118the ability to process a text file and recognize what it says. The traditional
119way to do this is to use a "<a
120href="http://en.wikipedia.org/wiki/Lexical_analysis">lexer</a>" (aka 'scanner')
121to break the input up into "tokens". Each token returned by the lexer includes
122a token code and potentially some metadata (e.g. the numeric value of a number).
123First, we define the possibilities:
124</p>
125
126<div class="doc_code">
127<pre>
128// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
129// of these for known things.
130enum Token {
131 tok_eof = -1,
132
133 // commands
134 tok_def = -2, tok_extern = -3,
135
136 // primary
137 tok_identifier = -4, tok_number = -5,
138};
139
140static std::string IdentifierStr; // Filled in if tok_identifier
141static double NumVal; // Filled in if tok_number
142</pre>
143</div>
144
145<p>Each token returned by our lexer will either be one of the Token enum values
146or it will be an 'unknown' character like '+' which is returned as its ascii
147value. If the current token is an identifier, the <tt>IdentifierStr</tt>
148global variable holds the name of the identifier. If the current token is a
149numeric literal (like 1.0), <tt>NumVal</tt> holds its value. Note that we use
150global variables for simplicity, this is not the best choice for a real language
151implementation :).
152</p>
153
154<p>The actual implementation of the lexer is a single function <tt>gettok</tt>.
155<tt>gettok</tt> is called to return the next token from standard input. Its
156definition starts as:</p>
157
158<div class="doc_code">
159<pre>
160/// gettok - Return the next token from standard input.
161static int gettok() {
162 static int LastChar = ' ';
163
164 // Skip any whitespace.
165 while (isspace(LastChar))
166 LastChar = getchar();
167</pre>
168</div>
169
170<p>
171<tt>gettok</tt> works by calling the C <tt>getchar()</tt> function to read
172characters one at a time from standard input. It eats them as it recognizes
173them and stores the last character read but not processed in LastChar. The
174first thing that it has to do is ignore whitespace between tokens. This is
175accomplished with the loop above.</p>
176
177<p>The next thing it needs to do is recognize identifiers, and specific keywords
178like "def". Kaleidoscope does this with this simple loop:</p>
179
180<div class="doc_code">
181<pre>
182 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
183 IdentifierStr = LastChar;
184 while (isalnum((LastChar = getchar())))
185 IdentifierStr += LastChar;
186
187 if (IdentifierStr == "def") return tok_def;
188 if (IdentifierStr == "extern") return tok_extern;
189 return tok_identifier;
190 }
191</pre>
192</div>
193
194<p>Note that it sets the '<tt>IdentifierStr</tt>' global whenever it lexes an
195identifier. Also, since language keywords are matched by the same loop, we
196handle them here inline. Numeric values are similar:</p>
197
198<div class="doc_code">
199<pre>
200 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
201 std::string NumStr;
202 do {
203 NumStr += LastChar;
204 LastChar = getchar();
205 } while (isdigit(LastChar) || LastChar == '.');
206
207 NumVal = strtod(NumStr.c_str(), 0);
208 return tok_number;
209 }
210</pre>
211</div>
212
213<p>This is all pretty straight-forward code for processing input. When reading
214a numeric value from input, we use the C <tt>strtod</tt> function to convert it
215to a numeric value that we store in <tt>NumVal</tt>. Note that this isn't doing
Duncan Sands72261ff2007-11-05 16:04:58 +0000216sufficient error checking: it will incorrectly read "1.23.45.67" and handle it as
Chris Lattnerc38ef542007-10-22 04:32:37 +0000217if you typed in "1.23". Feel free to extend it :). Next we handle comments:
218</p>
219
220<div class="doc_code">
221<pre>
222 if (LastChar == '#') {
223 // Comment until end of line.
224 do LastChar = getchar();
225 while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
226
227 if (LastChar != EOF)
228 return gettok();
229 }
230</pre>
231</div>
232
Duncan Sands72261ff2007-11-05 16:04:58 +0000233<p>We handle comments by skipping to the end of the line and then returning the
Chris Lattnerc38ef542007-10-22 04:32:37 +0000234next comment. Finally, if the input doesn't match one of the above cases, it is
Duncan Sands72261ff2007-11-05 16:04:58 +0000235either an operator character like '+' or the end of the file. These are handled with
Chris Lattnerc38ef542007-10-22 04:32:37 +0000236this code:</p>
237
238<div class="doc_code">
239<pre>
240 // Check for end of file. Don't eat the EOF.
241 if (LastChar == EOF)
242 return tok_eof;
243
244 // Otherwise, just return the character as its ascii value.
245 int ThisChar = LastChar;
246 LastChar = getchar();
247 return ThisChar;
248}
249</pre>
250</div>
251
252<p>With this, we have the complete lexer for the basic Kaleidoscope language.
253Next we'll <a href="LangImpl2.html">build a simple parser that uses this to
Chris Lattnere6c91042007-10-22 06:34:15 +0000254build an Abstract Syntax Tree</a>. When we have that, we'll include a driver
255so that you can use the lexer and parser together.
Chris Lattnerc38ef542007-10-22 04:32:37 +0000256</p>
257
258</div>
259
260<!-- *********************************************************************** -->
261<hr>
262<address>
263 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
264 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
265 <a href="http://validator.w3.org/check/referer"><img
Chris Lattner8eef4b22007-10-23 06:30:50 +0000266 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000267
268 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
269 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
270 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
271</address>
272</body>
273</html>