blob: 6325d3c28970ad9c974e414abd535779dc35087c [file] [log] [blame]
Chris Lattnerc38ef542007-10-22 04:32:37 +00001<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
3
4<html>
5<head>
6 <title>Kaleidoscope: The basic language, with its lexer</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
10</head>
11
12<body>
13
14<div class="doc_title">Kaleidoscope: The basic language, with its lexer</div>
15
16<div class="doc_author">
17 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
18</div>
19
20<!-- *********************************************************************** -->
21<div class="doc_section"><a name="intro">Tutorial Introduction</a></div>
22<!-- *********************************************************************** -->
23
24<div class="doc_text">
25
26<p>Welcome to the "Implementing a language with LLVM" tutorial. This tutorial
27will run through implementation of a simple language, showing how fun and easy
28it can be. This tutorial will get you up and started and build a framework you
29can extend to other languages and to play with other things.
30</p>
31
32</div>
33
34<!-- *********************************************************************** -->
35<div class="doc_section"><a name="language">The basic language</a></div>
36<!-- *********************************************************************** -->
37
38<div class="doc_text">
39
40<p>This tutorial will be illustrated with a toy language that we'll call
41"<a href="http://en.wikipedia.org/wiki/Kaleidoscope">Kaleidoscope</a>".
42Kaleidoscope is a procedural language that allows you to define functions, use
43conditionals, math, etc. Over the course of the tutorial, we'll extend
44Kaleidoscope to support if/then/else, operator overloading, JIT compilation with
45a simple command line interface, etc.</p>
46
47<p>Because we want to keep things simple, in Kaleidoscope the only datatype is a
4864-bit floating point type (aka 'double' in C parlance). As such, all values
49are implicitly double precision and the language doesn't require type
50declarations. This gives the language a very nice and simple syntax. For
51example, A simple example computes <a
52href="http://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers</a>,
53which looks like this:</p>
54
55<div class="doc_code">
56<pre>
57# Compute the x'th fibonacci number.
58def fib(x)
Chris Lattnere6c91042007-10-22 06:34:15 +000059 if x &lt; 3 then
Chris Lattnerc38ef542007-10-22 04:32:37 +000060 1
61 else
62 fib(x-1)+fib(x-2)
63
64# This expression will compute the 40th number.
65fib(40)
66</pre>
67</div>
68
Duncan Sands72261ff2007-11-05 16:04:58 +000069<p>We also allow Kaleidoscope to call into standard library functions (the LLVM
Chris Lattnerc38ef542007-10-22 04:32:37 +000070JIT makes this completely trivial). This means that you can use the 'extern'
71keyword to define a function before you use it (this is also useful for mutually
72recursive functions). For example:</p>
73
74<div class="doc_code">
75<pre>
76extern sin(arg);
77extern cos(arg);
78extern atan2(arg1 arg2);
79
80atan2(sin(.4), cos(42))
81</pre>
82</div>
83
84<p>In the first incarnation of the language, we will only support basic
85arithmetic: if/then/else will be added in a future installment. Another
86interesting aspect of the first implementation is that it is a completely
87functional language, which does not allow you to have side-effects etc. We will
88eventually add side effects for those who prefer them.</p>
89
90<p>In order to make this tutorial
91maximally understandable and hackable, we choose to implement everything in C++
92instead of using lexer and parser generators. LLVM obviously works just fine
Duncan Sands72261ff2007-11-05 16:04:58 +000093with such tools, and making use of them doesn't impact the overall design.</p>
Chris Lattnerc38ef542007-10-22 04:32:37 +000094
95<p>A note about this tutorial: we expect you to extend the language and play
96with it on your own. Take the code and go crazy hacking away at it. It can be
97a lot of fun to play with languages!</p>
98
99</div>
100
101<!-- *********************************************************************** -->
102<div class="doc_section"><a name="language">The Lexer</a></div>
103<!-- *********************************************************************** -->
104
105<div class="doc_text">
106
107<p>When it comes to implementing a language, the first thing needed is
108the ability to process a text file and recognize what it says. The traditional
109way to do this is to use a "<a
110href="http://en.wikipedia.org/wiki/Lexical_analysis">lexer</a>" (aka 'scanner')
111to break the input up into "tokens". Each token returned by the lexer includes
112a token code and potentially some metadata (e.g. the numeric value of a number).
113First, we define the possibilities:
114</p>
115
116<div class="doc_code">
117<pre>
118// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
119// of these for known things.
120enum Token {
121 tok_eof = -1,
122
123 // commands
124 tok_def = -2, tok_extern = -3,
125
126 // primary
127 tok_identifier = -4, tok_number = -5,
128};
129
130static std::string IdentifierStr; // Filled in if tok_identifier
131static double NumVal; // Filled in if tok_number
132</pre>
133</div>
134
135<p>Each token returned by our lexer will either be one of the Token enum values
136or it will be an 'unknown' character like '+' which is returned as its ascii
137value. If the current token is an identifier, the <tt>IdentifierStr</tt>
138global variable holds the name of the identifier. If the current token is a
139numeric literal (like 1.0), <tt>NumVal</tt> holds its value. Note that we use
140global variables for simplicity, this is not the best choice for a real language
141implementation :).
142</p>
143
144<p>The actual implementation of the lexer is a single function <tt>gettok</tt>.
145<tt>gettok</tt> is called to return the next token from standard input. Its
146definition starts as:</p>
147
148<div class="doc_code">
149<pre>
150/// gettok - Return the next token from standard input.
151static int gettok() {
152 static int LastChar = ' ';
153
154 // Skip any whitespace.
155 while (isspace(LastChar))
156 LastChar = getchar();
157</pre>
158</div>
159
160<p>
161<tt>gettok</tt> works by calling the C <tt>getchar()</tt> function to read
162characters one at a time from standard input. It eats them as it recognizes
163them and stores the last character read but not processed in LastChar. The
164first thing that it has to do is ignore whitespace between tokens. This is
165accomplished with the loop above.</p>
166
167<p>The next thing it needs to do is recognize identifiers, and specific keywords
168like "def". Kaleidoscope does this with this simple loop:</p>
169
170<div class="doc_code">
171<pre>
172 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
173 IdentifierStr = LastChar;
174 while (isalnum((LastChar = getchar())))
175 IdentifierStr += LastChar;
176
177 if (IdentifierStr == "def") return tok_def;
178 if (IdentifierStr == "extern") return tok_extern;
179 return tok_identifier;
180 }
181</pre>
182</div>
183
184<p>Note that it sets the '<tt>IdentifierStr</tt>' global whenever it lexes an
185identifier. Also, since language keywords are matched by the same loop, we
186handle them here inline. Numeric values are similar:</p>
187
188<div class="doc_code">
189<pre>
190 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
191 std::string NumStr;
192 do {
193 NumStr += LastChar;
194 LastChar = getchar();
195 } while (isdigit(LastChar) || LastChar == '.');
196
197 NumVal = strtod(NumStr.c_str(), 0);
198 return tok_number;
199 }
200</pre>
201</div>
202
203<p>This is all pretty straight-forward code for processing input. When reading
204a numeric value from input, we use the C <tt>strtod</tt> function to convert it
205to a numeric value that we store in <tt>NumVal</tt>. Note that this isn't doing
Duncan Sands72261ff2007-11-05 16:04:58 +0000206sufficient error checking: it will incorrectly read "1.23.45.67" and handle it as
Chris Lattnerc38ef542007-10-22 04:32:37 +0000207if you typed in "1.23". Feel free to extend it :). Next we handle comments:
208</p>
209
210<div class="doc_code">
211<pre>
212 if (LastChar == '#') {
213 // Comment until end of line.
214 do LastChar = getchar();
215 while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp; LastChar != '\r');
216
217 if (LastChar != EOF)
218 return gettok();
219 }
220</pre>
221</div>
222
Duncan Sands72261ff2007-11-05 16:04:58 +0000223<p>We handle comments by skipping to the end of the line and then returning the
Chris Lattnerc38ef542007-10-22 04:32:37 +0000224next comment. Finally, if the input doesn't match one of the above cases, it is
Duncan Sands72261ff2007-11-05 16:04:58 +0000225either an operator character like '+' or the end of the file. These are handled with
Chris Lattnerc38ef542007-10-22 04:32:37 +0000226this code:</p>
227
228<div class="doc_code">
229<pre>
230 // Check for end of file. Don't eat the EOF.
231 if (LastChar == EOF)
232 return tok_eof;
233
234 // Otherwise, just return the character as its ascii value.
235 int ThisChar = LastChar;
236 LastChar = getchar();
237 return ThisChar;
238}
239</pre>
240</div>
241
242<p>With this, we have the complete lexer for the basic Kaleidoscope language.
243Next we'll <a href="LangImpl2.html">build a simple parser that uses this to
Chris Lattnere6c91042007-10-22 06:34:15 +0000244build an Abstract Syntax Tree</a>. When we have that, we'll include a driver
245so that you can use the lexer and parser together.
Chris Lattnerc38ef542007-10-22 04:32:37 +0000246</p>
247
248</div>
249
250<!-- *********************************************************************** -->
251<hr>
252<address>
253 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
254 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
255 <a href="http://validator.w3.org/check/referer"><img
Chris Lattner8eef4b22007-10-23 06:30:50 +0000256 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
Chris Lattnerc38ef542007-10-22 04:32:37 +0000257
258 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
259 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
260 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
261</address>
262</body>
263</html>