=====================================================
Kaleidoscope: Kaleidoscope Introduction and the Lexer
=====================================================

.. contents::
   :local:

The Kaleidoscope Language
=========================

This tutorial is illustrated with a toy language called
"`Kaleidoscope <http://en.wikipedia.org/wiki/Kaleidoscope>`_" (a name derived
from Greek words meaning "beautiful", "form", and "view"). Kaleidoscope is a
procedural language that allows you to define functions, use conditionals,
math, etc. Over the course of the tutorial, we'll extend Kaleidoscope to
support the if/then/else construct, a for loop, user-defined operators,
JIT compilation with a simple command line interface, debug info, etc.

We want to keep things simple, so the only datatype in Kaleidoscope
is a 64-bit floating point type (aka 'double' in C parlance). As such,
all values are implicitly double precision and the language doesn't
require type declarations. This gives the language a very nice and
simple syntax. For example, the following code computes
`Fibonacci numbers <http://en.wikipedia.org/wiki/Fibonacci_number>`_:

::

    # Compute the x'th fibonacci number.
    def fib(x)
      if x < 3 then
        1
      else
        fib(x-1)+fib(x-2)

    # This expression will compute the 40th number.
    fib(40)

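Because every Kaleidoscope value is a double, the ``fib`` definition above
corresponds roughly to the C++ function below. This is only a comparison
sketch written by hand, not output produced by the tutorial's compiler; it
shows the ``double`` types that Kaleidoscope leaves implicit.

.. code-block:: c++

    // Hand-written C++ counterpart of the Kaleidoscope fib example.
    // Every parameter and result is a double, as in Kaleidoscope.
    double fib(double x) {
      if (x < 3)
        return 1;
      return fib(x - 1) + fib(x - 2);
    }
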
We also allow Kaleidoscope to call into standard library functions; the
LLVM JIT makes this really easy. This means that you can use the
'extern' keyword to declare a function before you use it (this is also
useful for mutually recursive functions). For example:

::

    extern sin(arg);
    extern cos(arg);
    extern atan2(arg1 arg2);

    atan2(sin(.4), cos(42))

A more interesting example is included in Chapter 6 where we write a
little Kaleidoscope application that `displays a Mandelbrot
Set <LangImpl06.html#kicking-the-tires>`_ at various levels of magnification.

Let's dive into the implementation of this language!

The Lexer
=========

When it comes to implementing a language, the first thing needed is the
ability to process a text file and recognize what it says. The
traditional way to do this is to use a
"`lexer <http://en.wikipedia.org/wiki/Lexical_analysis>`_" (aka
'scanner') to break the input up into "tokens". Each token returned by
the lexer includes a token code and potentially some metadata (e.g. the
numeric value of a number). First, we define the possibilities:

.. code-block:: c++

    // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
    // of these for known things.
    enum Token {
      tok_eof = -1,

      // commands
      tok_def = -2,
      tok_extern = -3,

      // primary
      tok_identifier = -4,
      tok_number = -5,
    };

    static std::string IdentifierStr; // Filled in if tok_identifier
    static double NumVal;             // Filled in if tok_number

Each token returned by our lexer will either be one of the Token enum
values or it will be an 'unknown' character like '+', which is returned
as its ASCII value. If the current token is an identifier, the
``IdentifierStr`` global variable holds the name of the identifier. If
the current token is a numeric literal (like 1.0), ``NumVal`` holds its
value. We use global variables for simplicity, but this is not the
best choice for a real language implementation :).

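To make the two kinds of token code concrete, here is a small sketch of a
helper that turns a code into a readable label. The name ``describeToken`` is
hypothetical and not part of the tutorial; the enum simply repeats the values
shown above so that the snippet stands alone.

.. code-block:: c++

    #include <string>

    // Same token codes as the tutorial's enum.
    enum Token {
      tok_eof = -1,
      tok_def = -2,
      tok_extern = -3,
      tok_identifier = -4,
      tok_number = -5,
    };

    // Hypothetical helper (not part of the tutorial): names a token code.
    std::string describeToken(int Tok) {
      switch (Tok) {
      case tok_eof:
        return "eof";
      case tok_def:
        return "def";
      case tok_extern:
        return "extern";
      case tok_identifier:
        return "identifier";
      case tok_number:
        return "number";
      default:
        // Any other code is a raw character such as '+' or '('.
        return std::string("char '") + static_cast<char>(Tok) + "'";
      }
    }

Because the named tokens are all negative, they can never collide with a
character code in the range 0 to 255, which is what lets one ``int`` carry
both kinds of result.
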
The actual implementation of the lexer is a single function named
``gettok``. The ``gettok`` function is called to return the next token
from standard input. Its definition starts as:

.. code-block:: c++

    /// gettok - Return the next token from standard input.
    static int gettok() {
      static int LastChar = ' ';

      // Skip any whitespace.
      while (isspace(LastChar))
        LastChar = getchar();

``gettok`` works by calling the C ``getchar()`` function to read
characters one at a time from standard input. It eats them as it
recognizes them and stores the last character read, but not processed,
in ``LastChar``. The first thing that it has to do is ignore whitespace
between tokens. This is accomplished with the loop above.

The next thing ``gettok`` needs to do is recognize identifiers and
specific keywords like "def". Kaleidoscope does this with this simple
loop:

.. code-block:: c++

      if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
        IdentifierStr = LastChar;
        while (isalnum((LastChar = getchar())))
          IdentifierStr += LastChar;

        if (IdentifierStr == "def")
          return tok_def;
        if (IdentifierStr == "extern")
          return tok_extern;
        return tok_identifier;
      }

Note that this code sets the ``IdentifierStr`` global whenever it
lexes an identifier. Also, since language keywords are matched by the
same loop, we handle them here inline. Numeric values are similar:

.. code-block:: c++

      if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
        std::string NumStr;
        do {
          NumStr += LastChar;
          LastChar = getchar();
        } while (isdigit(LastChar) || LastChar == '.');

        NumVal = strtod(NumStr.c_str(), 0);
        return tok_number;
      }

This is all pretty straightforward code for processing input. When
reading a numeric value from input, we use the C ``strtod`` function to
convert it to a numeric value that we store in ``NumVal``. Note that
this isn't doing sufficient error checking: it will incorrectly read
"1.23.45.67" and handle it as if you typed in "1.23". Feel free to
extend it! Next we handle comments:

.. code-block:: c++

      if (LastChar == '#') {
        // Comment until end of line.
        do
          LastChar = getchar();
        while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');

        if (LastChar != EOF)
          return gettok();
      }

We handle comments by skipping to the end of the line and then returning
the next token. Finally, if the input doesn't match one of the above
cases, it is either an operator character like '+' or the end of the
file. These are handled with this code:

.. code-block:: c++

      // Check for end of file.  Don't eat the EOF.
      if (LastChar == EOF)
        return tok_eof;

      // Otherwise, just return the character as its ascii value.
      int ThisChar = LastChar;
      LastChar = getchar();
      return ThisChar;
    }

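As noted for the numeric case, input such as "1.23.45.67" is silently read as
"1.23". One possible way to tighten this, sketched here as an assumption
rather than as the tutorial's approach, is to pass ``strtod`` an end pointer
and check that it consumed the whole string. The helper name
``parseNumberStrict`` is hypothetical, and what the lexer should do when the
check fails is left open.

.. code-block:: c++

    #include <cstdlib>
    #include <string>

    // Hypothetical helper (not part of the tutorial): converts NumStr to a
    // double and reports whether strtod consumed the entire string.
    bool parseNumberStrict(const std::string &NumStr, double &Out) {
      if (NumStr.empty())
        return false;
      char *End = nullptr;
      Out = std::strtod(NumStr.c_str(), &End);
      // strtod sets End to the first character it did not use. For
      // "1.23.45.67" that is the second '.', not the terminating '\0'.
      return End != NumStr.c_str() && *End == '\0';
    }

With this check, "1.5" and ".4" are accepted, while "1.23.45.67" and a lone
"." are rejected.
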
With this, we have the complete lexer for the basic Kaleidoscope
language (the `full code listing <LangImpl02.html#full-code-listing>`_ for the Lexer
is available in the `next chapter <LangImpl02.html>`_ of the tutorial).
Next we'll `build a simple parser that uses this to build an Abstract
Syntax Tree <LangImpl02.html>`_. When we have that, we'll include a
driver so that you can use the lexer and parser together.

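To experiment with the lexer on its own before the parser exists, the
fragments above can be assembled into one unit. The sketch below is an
assumption-laden variant, not the tutorial's listing: it reads from an
in-memory string instead of standard input and keeps its state in a struct
instead of globals. The names ``StringLexer``, ``nextChar`` and ``lexAll`` are
hypothetical.

.. code-block:: c++

    #include <cctype>
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include <utility>
    #include <vector>

    enum Token {
      tok_eof = -1,
      tok_def = -2,
      tok_extern = -3,
      tok_identifier = -4,
      tok_number = -5,
    };

    // Hypothetical wrapper: the same logic as gettok, but over a string.
    struct StringLexer {
      std::string Input;
      std::size_t Pos = 0;
      int LastChar = ' ';
      std::string IdentifierStr; // Filled in if tok_identifier
      double NumVal = 0;         // Filled in if tok_number

      explicit StringLexer(std::string Text) : Input(std::move(Text)) {}

      // Stand-in for getchar(): the next character, or EOF at the end.
      int nextChar() {
        return Pos < Input.size() ? static_cast<unsigned char>(Input[Pos++])
                                  : EOF;
      }

      int gettok() {
        // Skip any whitespace.
        while (isspace(LastChar))
          LastChar = nextChar();

        if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
          IdentifierStr = static_cast<char>(LastChar);
          while (isalnum((LastChar = nextChar())))
            IdentifierStr += static_cast<char>(LastChar);
          if (IdentifierStr == "def")
            return tok_def;
          if (IdentifierStr == "extern")
            return tok_extern;
          return tok_identifier;
        }

        if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
          std::string NumStr;
          do {
            NumStr += static_cast<char>(LastChar);
            LastChar = nextChar();
          } while (isdigit(LastChar) || LastChar == '.');
          NumVal = strtod(NumStr.c_str(), nullptr);
          return tok_number;
        }

        if (LastChar == '#') { // Comment until end of line.
          do
            LastChar = nextChar();
          while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
          if (LastChar != EOF)
            return gettok();
        }

        if (LastChar == EOF)
          return tok_eof;

        // Otherwise, return the character itself as the token code.
        int ThisChar = LastChar;
        LastChar = nextChar();
        return ThisChar;
      }
    };

    // Collects every token code up to and including tok_eof.
    std::vector<int> lexAll(const std::string &Text) {
      StringLexer Lex(Text);
      std::vector<int> Toks;
      int Tok;
      do {
        Tok = Lex.gettok();
        Toks.push_back(Tok);
      } while (Tok != tok_eof);
      return Toks;
    }

For instance, ``lexAll("def fib(x) # c\n 1.5")`` yields seven codes:
``tok_def``, ``tok_identifier``, ``'('``, ``tok_identifier``, ``')'``,
``tok_number`` and ``tok_eof``. The comment is skipped entirely.
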
`Next: Implementing a Parser and AST <LangImpl02.html>`_
