blob: a0c9b16184b118df8e5162ed773dc83b33e20c53 [file] [log] [blame]
Chris Lattnerc38ef542007-10-22 04:32:37 +00001<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
3
4<html>
5<head>
6 <title>Kaleidoscope: The basic language, with its lexer</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
10</head>
11
12<body>
13
14<div class="doc_title">Kaleidoscope: The basic language, with its lexer</div>
15
Chris Lattner128eb862007-11-05 19:06:59 +000016<ul>
Chris Lattner0e555b12007-11-05 20:04:56 +000017<li><a href="index.html">Up to Tutorial Index</a></li>
Chris Lattner128eb862007-11-05 19:06:59 +000018<li>Chapter 1
19 <ol>
20 <li><a href="#intro">Tutorial Introduction</a></li>
21 <li><a href="#language">The Basic Language</a></li>
22 <li><a href="#lexer">The Lexer</a></li>
23 </ol>
24</li>
Chris Lattner0e555b12007-11-05 20:04:56 +000025<li><a href="LangImpl2.html">Chapter 2</a>: Implementing a Parser and AST</li>
Chris Lattner128eb862007-11-05 19:06:59 +000026</ul>
27
Chris Lattnerc38ef542007-10-22 04:32:37 +000028<div class="doc_author">
29 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
30</div>
31
32<!-- *********************************************************************** -->
33<div class="doc_section"><a name="intro">Tutorial Introduction</a></div>
34<!-- *********************************************************************** -->
35
36<div class="doc_text">
37
38<p>Welcome to the "Implementing a language with LLVM" tutorial. This tutorial
39will run through implementation of a simple language, showing how fun and easy
40it can be. This tutorial will get you up and started and build a framework you
41can extend to other languages and to play with other things.
42</p>
43
44</div>
45
46<!-- *********************************************************************** -->
Chris Lattner128eb862007-11-05 19:06:59 +000047<div class="doc_section"><a name="language">The Basic Language</a></div>
Chris Lattnerc38ef542007-10-22 04:32:37 +000048<!-- *********************************************************************** -->
49
50<div class="doc_text">
51
52<p>This tutorial will be illustrated with a toy language that we'll call
53"<a href="http://en.wikipedia.org/wiki/Kaleidoscope">Kaleidoscope</a>".
54Kaleidoscope is a procedural language that allows you to define functions, use
55conditionals, math, etc. Over the course of the tutorial, we'll extend
56Kaleidoscope to support if/then/else, operator overloading, JIT compilation with
57a simple command line interface, etc.</p>
58
59<p>Because we want to keep things simple, in Kaleidoscope the only datatype is a
6064-bit floating point type (aka 'double' in C parlance). As such, all values
61are implicitly double precision and the language doesn't require type
62declarations. This gives the language a very nice and simple syntax. For
63example, A simple example computes <a
64href="http://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers</a>,
65which looks like this:</p>
66
67<div class="doc_code">
68<pre>
69# Compute the x'th fibonacci number.
70def fib(x)
Chris Lattnere6c91042007-10-22 06:34:15 +000071 if x &lt; 3 then
Chris Lattnerc38ef542007-10-22 04:32:37 +000072 1
73 else
74 fib(x-1)+fib(x-2)
75
76# This expression will compute the 40th number.
77fib(40)
78</pre>
79</div>
80
Duncan Sands72261ff2007-11-05 16:04:58 +000081<p>We also allow Kaleidoscope to call into standard library functions (the LLVM
Chris Lattnerc38ef542007-10-22 04:32:37 +000082JIT makes this completely trivial). This means that you can use the 'extern'
83keyword to define a function before you use it (this is also useful for mutually
84recursive functions). For example:</p>
85
86<div class="doc_code">
87<pre>
88extern sin(arg);
89extern cos(arg);
90extern atan2(arg1 arg2);
91
92atan2(sin(.4), cos(42))
93</pre>
94</div>
95
96<p>In the first incarnation of the language, we will only support basic
97arithmetic: if/then/else will be added in a future installment. Another
98interesting aspect of the first implementation is that it is a completely
99functional language, which does not allow you to have side-effects etc. We will
100eventually add side effects for those who prefer them.</p>
101
102<p>In order to make this tutorial
103maximally understandable and hackable, we choose to implement everything in C++
104instead of using lexer and parser generators. LLVM obviously works just fine
Duncan Sands72261ff2007-11-05 16:04:58 +0000105with such tools, and making use of them doesn't impact the overall design.</p>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000106
107<p>A note about this tutorial: we expect you to extend the language and play
108with it on your own. Take the code and go crazy hacking away at it. It can be
109a lot of fun to play with languages!</p>
110
111</div>
112
113<!-- *********************************************************************** -->
Chris Lattner128eb862007-11-05 19:06:59 +0000114<div class="doc_section"><a name="lexer">The Lexer</a></div>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000115<!-- *********************************************************************** -->
116
117<div class="doc_text">
118
119<p>When it comes to implementing a language, the first thing needed is
120the ability to process a text file and recognize what it says. The traditional
121way to do this is to use a "<a
122href="http://en.wikipedia.org/wiki/Lexical_analysis">lexer</a>" (aka 'scanner')
123to break the input up into "tokens". Each token returned by the lexer includes
124a token code and potentially some metadata (e.g. the numeric value of a number).
125First, we define the possibilities:
126</p>
127
128<div class="doc_code">
129<pre>
130// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
131// of these for known things.
132enum Token {
133 tok_eof = -1,
134
135 // commands
136 tok_def = -2, tok_extern = -3,
137
138 // primary
139 tok_identifier = -4, tok_number = -5,
140};
141
142static std::string IdentifierStr; // Filled in if tok_identifier
143static double NumVal; // Filled in if tok_number
144</pre>
145</div>
146
147<p>Each token returned by our lexer will either be one of the Token enum values
148or it will be an 'unknown' character like '+' which is returned as its ascii
149value. If the current token is an identifier, the <tt>IdentifierStr</tt>
150global variable holds the name of the identifier. If the current token is a
151numeric literal (like 1.0), <tt>NumVal</tt> holds its value. Note that we use
152global variables for simplicity, this is not the best choice for a real language
153implementation :).
154</p>
155
156<p>The actual implementation of the lexer is a single function <tt>gettok</tt>.
157<tt>gettok</tt> is called to return the next token from standard input. Its
158definition starts as:</p>
159
160<div class="doc_code">
161<pre>
162/// gettok - Return the next token from standard input.
163static int gettok() {
164 static int LastChar = ' ';
165
166 // Skip any whitespace.
167 while (isspace(LastChar))
168 LastChar = getchar();
169</pre>
170</div>
171
172<p>
173<tt>gettok</tt> works by calling the C <tt>getchar()</tt> function to read
174characters one at a time from standard input. It eats them as it recognizes
175them and stores the last character read but not processed in LastChar. The
176first thing that it has to do is ignore whitespace between tokens. This is
177accomplished with the loop above.</p>
178
179<p>The next thing it needs to do is recognize identifiers, and specific keywords
180like "def". Kaleidoscope does this with this simple loop:</p>
181
182<div class="doc_code">
183<pre>
184 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
185 IdentifierStr = LastChar;
186 while (isalnum((LastChar = getchar())))
187 IdentifierStr += LastChar;
188
189 if (IdentifierStr == "def") return tok_def;
190 if (IdentifierStr == "extern") return tok_extern;
191 return tok_identifier;
192 }
193</pre>
194</div>
195
196<p>Note that it sets the '<tt>IdentifierStr</tt>' global whenever it lexes an
197identifier. Also, since language keywords are matched by the same loop, we
198handle them here inline. Numeric values are similar:</p>
199
200<div class="doc_code">
201<pre>
202 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
203 std::string NumStr;
204 do {
205 NumStr += LastChar;
206 LastChar = getchar();
207 } while (isdigit(LastChar) || LastChar == '.');
208
209 NumVal = strtod(NumStr.c_str(), 0);
210 return tok_number;
211 }
212</pre>
213</div>
214
215<p>This is all pretty straight-forward code for processing input. When reading
216a numeric value from input, we use the C <tt>strtod</tt> function to convert it
217to a numeric value that we store in <tt>NumVal</tt>. Note that this isn't doing
Duncan Sands72261ff2007-11-05 16:04:58 +0000218sufficient error checking: it will incorrectly read "1.23.45.67" and handle it as
Chris Lattnerc38ef542007-10-22 04:32:37 +0000219if you typed in "1.23". Feel free to extend it :). Next we handle comments:
220</p>
221
222<div class="doc_code">
223<pre>
224 if (LastChar == '#') {
225 // Comment until end of line.
226 do LastChar = getchar();
227 while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
228
229 if (LastChar != EOF)
230 return gettok();
231 }
232</pre>
233</div>
234
Duncan Sands72261ff2007-11-05 16:04:58 +0000235<p>We handle comments by skipping to the end of the line and then returning the
Chris Lattnerc38ef542007-10-22 04:32:37 +0000236next comment. Finally, if the input doesn't match one of the above cases, it is
Duncan Sands72261ff2007-11-05 16:04:58 +0000237either an operator character like '+' or the end of the file. These are handled with
Chris Lattnerc38ef542007-10-22 04:32:37 +0000238this code:</p>
239
240<div class="doc_code">
241<pre>
242 // Check for end of file. Don't eat the EOF.
243 if (LastChar == EOF)
244 return tok_eof;
245
246 // Otherwise, just return the character as its ascii value.
247 int ThisChar = LastChar;
248 LastChar = getchar();
249 return ThisChar;
250}
251</pre>
252</div>
253
254<p>With this, we have the complete lexer for the basic Kaleidoscope language.
255Next we'll <a href="LangImpl2.html">build a simple parser that uses this to
Chris Lattnere6c91042007-10-22 06:34:15 +0000256build an Abstract Syntax Tree</a>. When we have that, we'll include a driver
257so that you can use the lexer and parser together.
Chris Lattnerc38ef542007-10-22 04:32:37 +0000258</p>
259
260</div>
261
262<!-- *********************************************************************** -->
263<hr>
264<address>
265 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
266 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
267 <a href="http://validator.w3.org/check/referer"><img
Chris Lattner8eef4b22007-10-23 06:30:50 +0000268 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000269
270 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
271 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
272 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
273</address>
274</body>
275</html>