Blame - docs/tutorial/LangImpl1.html - fp2-dev/platform/external/llvm

blob: 8d967508238ecf33cb54f172c52dc56fffcd69d0 [file] [log] [blame]

Chris Lattner	c38ef54	2007-10-22 04:32:37 +0000	[diff] [blame]	1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
				2	"http://www.w3.org/TR/html4/strict.dtd">
				3
				4	<html>
				5	<head>
				6	<title>Kaleidoscope: The basic language, with its lexer</title>
				7	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
				8	<meta name="author" content="Chris Lattner">
				9	<link rel="stylesheet" href="../llvm.css" type="text/css">
				10	</head>
				11
				12	<body>
				13
				14	<div class="doc_title">Kaleidoscope: The basic language, with its lexer</div>
				15
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame^]	16	<ul>
				17	<li>Chapter 1
				18	<ol>
				19	<li><a href="#intro">Tutorial Introduction</a></li>
				20	<li><a href="#language">The Basic Language</a></li>
				21	<li><a href="#lexer">The Lexer</a></li>
				22	</ol>
				23	</li>
				24	</ul>
				25
Chris Lattner	c38ef54	2007-10-22 04:32:37 +0000	[diff] [blame]	26	<div class="doc_author">
				27	<p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
				28	</div>
				29
				30	<!-- *********************************************************************** -->
				31	<div class="doc_section"><a name="intro">Tutorial Introduction</a></div>
				32	<!-- *********************************************************************** -->
				33
				34	<div class="doc_text">
				35
				36	<p>Welcome to the "Implementing a language with LLVM" tutorial. This tutorial
				37	will run through implementation of a simple language, showing how fun and easy
				38	it can be. This tutorial will get you up and started and build a framework you
				39	can extend to other languages and to play with other things.
				40	</p>
				41
				42	</div>
				43
				44	<!-- *********************************************************************** -->
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame^]	45	<div class="doc_section"><a name="language">The Basic Language</a></div>
Chris Lattner	c38ef54	2007-10-22 04:32:37 +0000	[diff] [blame]	46	<!-- *********************************************************************** -->
				47
				48	<div class="doc_text">
				49
				50	<p>This tutorial will be illustrated with a toy language that we'll call
				51	"<a href="http://en.wikipedia.org/wiki/Kaleidoscope">Kaleidoscope</a>".
				52	Kaleidoscope is a procedural language that allows you to define functions, use
				53	conditionals, math, etc. Over the course of the tutorial, we'll extend
				54	Kaleidoscope to support if/then/else, operator overloading, JIT compilation with
				55	a simple command line interface, etc.</p>
				56
				57	<p>Because we want to keep things simple, in Kaleidoscope the only datatype is a
				58	64-bit floating point type (aka 'double' in C parlance). As such, all values
				59	are implicitly double precision and the language doesn't require type
				60	declarations. This gives the language a very nice and simple syntax. For
				61	example, A simple example computes <a
				62	href="http://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers</a>,
				63	which looks like this:</p>
				64
				65	<div class="doc_code">
				66	<pre>
				67	# Compute the x'th fibonacci number.
				68	def fib(x)
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	69	if x < 3 then
Chris Lattner	c38ef54	2007-10-22 04:32:37 +0000	[diff] [blame]	70	1
				71	else
				72	fib(x-1)+fib(x-2)
				73
				74	# This expression will compute the 40th number.
				75	fib(40)
				76	</pre>
				77	</div>
				78
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	79	<p>We also allow Kaleidoscope to call into standard library functions (the LLVM
Chris Lattner	c38ef54	2007-10-22 04:32:37 +0000	[diff] [blame]	80	JIT makes this completely trivial). This means that you can use the 'extern'
				81	keyword to define a function before you use it (this is also useful for mutually
				82	recursive functions). For example:</p>
				83
				84	<div class="doc_code">
				85	<pre>
				86	extern sin(arg);
				87	extern cos(arg);
				88	extern atan2(arg1 arg2);
				89
				90	atan2(sin(.4), cos(42))
				91	</pre>
				92	</div>
				93
				94	<p>In the first incarnation of the language, we will only support basic
				95	arithmetic: if/then/else will be added in a future installment. Another
				96	interesting aspect of the first implementation is that it is a completely
				97	functional language, which does not allow you to have side-effects etc. We will
				98	eventually add side effects for those who prefer them.</p>
				99
				100	<p>In order to make this tutorial
				101	maximally understandable and hackable, we choose to implement everything in C++
				102	instead of using lexer and parser generators. LLVM obviously works just fine
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	103	with such tools, and making use of them doesn't impact the overall design.</p>
Chris Lattner	c38ef54	2007-10-22 04:32:37 +0000	[diff] [blame]	104
				105	<p>A note about this tutorial: we expect you to extend the language and play
				106	with it on your own. Take the code and go crazy hacking away at it. It can be
				107	a lot of fun to play with languages!</p>
				108
				109	</div>
				110
				111	<!-- *********************************************************************** -->
Chris Lattner	128eb86	2007-11-05 19:06:59 +0000	[diff] [blame^]	112	<div class="doc_section"><a name="lexer">The Lexer</a></div>
Chris Lattner	c38ef54	2007-10-22 04:32:37 +0000	[diff] [blame]	113	<!-- *********************************************************************** -->
				114
				115	<div class="doc_text">
				116
				117	<p>When it comes to implementing a language, the first thing needed is
				118	the ability to process a text file and recognize what it says. The traditional
				119	way to do this is to use a "<a
				120	href="http://en.wikipedia.org/wiki/Lexical_analysis">lexer</a>" (aka 'scanner')
				121	to break the input up into "tokens". Each token returned by the lexer includes
				122	a token code and potentially some metadata (e.g. the numeric value of a number).
				123	First, we define the possibilities:
				124	</p>
				125
				126	<div class="doc_code">
				127	<pre>
				128	// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
				129	// of these for known things.
				130	enum Token {
				131	tok_eof = -1,
				132
				133	// commands
				134	tok_def = -2, tok_extern = -3,
				135
				136	// primary
				137	tok_identifier = -4, tok_number = -5,
				138	};
				139
				140	static std::string IdentifierStr; // Filled in if tok_identifier
				141	static double NumVal; // Filled in if tok_number
				142	</pre>
				143	</div>
				144
				145	<p>Each token returned by our lexer will either be one of the Token enum values
				146	or it will be an 'unknown' character like '+' which is returned as its ascii
				147	value. If the current token is an identifier, the <tt>IdentifierStr</tt>
				148	global variable holds the name of the identifier. If the current token is a
				149	numeric literal (like 1.0), <tt>NumVal</tt> holds its value. Note that we use
				150	global variables for simplicity, this is not the best choice for a real language
				151	implementation :).
				152	</p>
				153
				154	<p>The actual implementation of the lexer is a single function <tt>gettok</tt>.
				155	<tt>gettok</tt> is called to return the next token from standard input. Its
				156	definition starts as:</p>
				157
				158	<div class="doc_code">
				159	<pre>
				160	/// gettok - Return the next token from standard input.
				161	static int gettok() {
				162	static int LastChar = ' ';
				163
				164	// Skip any whitespace.
				165	while (isspace(LastChar))
				166	LastChar = getchar();
				167	</pre>
				168	</div>
				169
				170	<p>
				171	<tt>gettok</tt> works by calling the C <tt>getchar()</tt> function to read
				172	characters one at a time from standard input. It eats them as it recognizes
				173	them and stores the last character read but not processed in LastChar. The
				174	first thing that it has to do is ignore whitespace between tokens. This is
				175	accomplished with the loop above.</p>
				176
				177	<p>The next thing it needs to do is recognize identifiers, and specific keywords
				178	like "def". Kaleidoscope does this with this simple loop:</p>
				179
				180	<div class="doc_code">
				181	<pre>
				182	if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
				183	IdentifierStr = LastChar;
				184	while (isalnum((LastChar = getchar())))
				185	IdentifierStr += LastChar;
				186
				187	if (IdentifierStr == "def") return tok_def;
				188	if (IdentifierStr == "extern") return tok_extern;
				189	return tok_identifier;
				190	}
				191	</pre>
				192	</div>
				193
				194	<p>Note that it sets the '<tt>IdentifierStr</tt>' global whenever it lexes an
				195	identifier. Also, since language keywords are matched by the same loop, we
				196	handle them here inline. Numeric values are similar:</p>
				197
				198	<div class="doc_code">
				199	<pre>
				200	if (isdigit(LastChar) \|\| LastChar == '.') { // Number: [0-9.]+
				201	std::string NumStr;
				202	do {
				203	NumStr += LastChar;
				204	LastChar = getchar();
				205	} while (isdigit(LastChar) \|\| LastChar == '.');
				206
				207	NumVal = strtod(NumStr.c_str(), 0);
				208	return tok_number;
				209	}
				210	</pre>
				211	</div>
				212
				213	<p>This is all pretty straight-forward code for processing input. When reading
				214	a numeric value from input, we use the C <tt>strtod</tt> function to convert it
				215	to a numeric value that we store in <tt>NumVal</tt>. Note that this isn't doing
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	216	sufficient error checking: it will incorrectly read "1.23.45.67" and handle it as
Chris Lattner	c38ef54	2007-10-22 04:32:37 +0000	[diff] [blame]	217	if you typed in "1.23". Feel free to extend it :). Next we handle comments:
				218	</p>
				219
				220	<div class="doc_code">
				221	<pre>
				222	if (LastChar == '#') {
				223	// Comment until end of line.
				224	do LastChar = getchar();
				225	while (LastChar != EOF && LastChar != '\n' & LastChar != '\r');
				226
				227	if (LastChar != EOF)
				228	return gettok();
				229	}
				230	</pre>
				231	</div>
				232
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	233	<p>We handle comments by skipping to the end of the line and then returning the
Chris Lattner	c38ef54	2007-10-22 04:32:37 +0000	[diff] [blame]	234	next comment. Finally, if the input doesn't match one of the above cases, it is
Duncan Sands	72261ff	2007-11-05 16:04:58 +0000	[diff] [blame]	235	either an operator character like '+' or the end of the file. These are handled with
Chris Lattner	c38ef54	2007-10-22 04:32:37 +0000	[diff] [blame]	236	this code:</p>
				237
				238	<div class="doc_code">
				239	<pre>
				240	// Check for end of file. Don't eat the EOF.
				241	if (LastChar == EOF)
				242	return tok_eof;
				243
				244	// Otherwise, just return the character as its ascii value.
				245	int ThisChar = LastChar;
				246	LastChar = getchar();
				247	return ThisChar;
				248	}
				249	</pre>
				250	</div>
				251
				252	<p>With this, we have the complete lexer for the basic Kaleidoscope language.
				253	Next we'll <a href="LangImpl2.html">build a simple parser that uses this to
Chris Lattner	e6c9104	2007-10-22 06:34:15 +0000	[diff] [blame]	254	build an Abstract Syntax Tree</a>. When we have that, we'll include a driver
				255	so that you can use the lexer and parser together.
Chris Lattner	c38ef54	2007-10-22 04:32:37 +0000	[diff] [blame]	256	</p>
				257
				258	</div>
				259
				260	<!-- *********************************************************************** -->
				261	<hr>
				262	<address>
				263	<a href="http://jigsaw.w3.org/css-validator/check/referer"><img
				264	src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
				265	<a href="http://validator.w3.org/check/referer"><img
Chris Lattner	8eef4b2	2007-10-23 06:30:50 +0000	[diff] [blame]	266	src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
Chris Lattner	c38ef54	2007-10-22 04:32:37 +0000	[diff] [blame]	267
				268	<a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
				269	<a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
				270	Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
				271	</address>
				272	</body>
				273	</html>