% Guido van Rossum, 1992-08-14 09:11:01 +0000
\chapter{Lexical analysis}

A Python program is read by a {\em parser}. Input to the parser is a
stream of {\em tokens}, generated by the {\em lexical analyzer}. This
chapter describes how the lexical analyzer breaks a file into tokens.
\index{lexical analysis}
\index{parser}
\index{token}

\section{Line structure}

A Python program is divided into a number of logical lines. The end of
a logical line is represented by the token NEWLINE. Statements cannot
cross logical line boundaries except where NEWLINE is allowed by the
syntax (e.g. between statements in compound statements).
\index{line structure}
\index{logical line}
\index{NEWLINE token}

\subsection{Comments}

A comment starts with a hash character (\verb\#\) that is not part of
a string literal, and ends at the end of the physical line. A comment
always signifies the end of the logical line. Comments are ignored by
the syntax.
\index{comment}
\index{logical line}
\index{physical line}
\index{hash character}

\subsection{Line joining}

Two or more physical lines may be joined into logical lines using
backslash characters (\verb/\/), as follows: when a physical line ends
in a backslash that is not part of a string literal or comment, it is
joined with the following line to form a single logical line, and the
backslash and the end-of-line character after it are deleted. For example:
\index{physical line}
\index{line joining}
\index{backslash character}
%
\begin{verbatim}
month_names = ['Januari', 'Februari', 'Maart', \
               'April', 'Mei', 'Juni', \
               'Juli', 'Augustus', 'September', \
               'Oktober', 'November', 'December']
\end{verbatim}
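
The joining rule can be sketched in present-day Python as follows.
This is an illustration only, not the interpreter's own code: the name
\verb/join_lines/ is made up for this example, and the sketch assumes
that a backslash at the end of a line is never part of a string
literal or comment.

```python
def join_lines(physical_lines):
    """Join physical lines that end in a backslash into logical lines.

    Simplifying assumption (not checked here): a trailing backslash is
    never inside a string literal or a comment.
    """
    logical = []
    pending = None  # text of a logical line that is still being continued
    for line in physical_lines:
        continued = line.endswith("\\")
        # Drop the backslash; the end-of-line is already absent from `line`.
        text = line[:-1] if continued else line
        pending = text if pending is None else pending + text
        if not continued:
            logical.append(pending)
            pending = None
    if pending is not None:
        # Input ended right after a backslash; keep what was collected.
        logical.append(pending)
    return logical
```

Each element of the argument is one physical line without its
end-of-line character; the result has one element per logical line.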

\subsection{Blank lines}

A logical line that contains only spaces, tabs, and possibly a
comment, is ignored (i.e., no NEWLINE token is generated), except that
during interactive input of statements, an entirely blank logical line
terminates a multi-line statement.
\index{blank line}

\subsection{Indentation}

Leading whitespace (spaces and tabs) at the beginning of a logical
line is used to compute the indentation level of the line, which in
turn is used to determine the grouping of statements.
\index{indentation}
\index{whitespace}
\index{leading whitespace}
\index{space}
\index{tab}
\index{grouping}
\index{statement grouping}

First, tabs are replaced (from left to right) by one to eight spaces
such that the total number of characters up to and including the
replacement is a multiple of eight (this is intended to be the same
rule as used by {\UNIX}). The total number of spaces preceding the
first non-blank character then determines the line's indentation.
Indentation cannot be split over multiple physical lines using
backslashes.
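
As a rough illustration of the tab rule, the indentation of a line
could be computed as below. The function name \verb/indentation/ and
the \verb/tabsize/ parameter are assumptions of this sketch; only the
value eight comes from the text above.

```python
def indentation(line, tabsize=8):
    """Return the number of columns before the first non-blank character."""
    column = 0
    for ch in line:
        if ch == " ":
            column += 1
        elif ch == "\t":
            # A tab stands for one to `tabsize` spaces: advance to the
            # next multiple of `tabsize`.
            column = (column // tabsize + 1) * tabsize
        else:
            break
    return column
```

For instance, a line starting with two spaces and a tab has the same
indentation as a line starting with a single tab, namely eight.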

The indentation levels of consecutive lines are used to generate
INDENT and DEDENT tokens, using a stack, as follows.
\index{INDENT token}
\index{DEDENT token}

Before the first line of the file is read, a single zero is pushed on
the stack; this will never be popped off again. The numbers pushed on
the stack will always be strictly increasing from bottom to top. At
the beginning of each logical line, the line's indentation level is
compared to the top of the stack. If it is equal, nothing happens.
If it is larger, it is pushed on the stack, and one INDENT token is
generated. If it is smaller, it {\em must} be one of the numbers
occurring on the stack; all numbers on the stack that are larger are
popped off, and for each number popped off a DEDENT token is
generated. At the end of the file, a DEDENT token is generated for
each number remaining on the stack that is larger than zero.
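
The stack discipline just described can be sketched in present-day
Python. This is a minimal model under stated assumptions, not the
actual lexical analyzer: the function name \verb/indent_tokens/, the
placeholder token \verb/LINE/ (standing for the contents of one
logical line) and the use of \verb/ValueError/ for an inconsistent
dedent are all choices made for this example.

```python
def indent_tokens(levels):
    """Turn the indentation levels of successive logical lines into tokens.

    "LINE" is a placeholder for the tokens of the line itself; it is an
    assumption of this sketch, not a token of the language.
    """
    stack = [0]  # the initial zero is never popped
    tokens = []
    for level in levels:
        if level > stack[-1]:
            stack.append(level)
            tokens.append("INDENT")
        elif level < stack[-1]:
            if level not in stack:
                raise ValueError("inconsistent dedent: %d" % level)
            while stack[-1] > level:
                stack.pop()
                tokens.append("DEDENT")
        tokens.append("LINE")
    # End of file: close every level that is still open.
    while stack[-1] > 0:
        stack.pop()
        tokens.append("DEDENT")
    return tokens
```

Calling \verb/indent_tokens([0, 4, 8])/ yields two INDENT tokens
while reading the lines and two DEDENT tokens at the end of the input,
whereas \verb/indent_tokens([0, 8, 4])/ fails because 4 is not on the
stack.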

Here is an example of a correctly (though confusingly) indented piece
of Python code:

\begin{verbatim}
def perm(l):
        # Compute the list of all permutations of l

    if len(l) <= 1:
                  return [l]
    r = []
    for i in range(len(l)):
             s = l[:i] + l[i+1:]
             p = perm(s)
             for x in p:
              r.append(l[i:i+1] + x)
    return r
\end{verbatim}

The following example shows various indentation errors:

\begin{verbatim}
 def perm(l):                       # error: first line indented
for i in range(len(l)):             # error: not indented
    s = l[:i] + l[i+1:]
        p = perm(l[:i] + l[i+1:])   # error: unexpected indent
        for x in p:
                r.append(l[i:i+1] + x)
            return r                # error: inconsistent dedent
\end{verbatim}

(Actually, the first three errors are detected by the parser; only the
last error is found by the lexical analyzer: the indentation of
\verb\return r\ does not match any level on the stack.)

\section{Other tokens}

Besides NEWLINE, INDENT and DEDENT, the following categories of tokens
exist: identifiers, keywords, literals, operators, and delimiters.
Spaces and tabs are not tokens, but serve to delimit tokens. Where
ambiguity exists, a token comprises the longest possible string that
forms a legal token, when read from left to right.
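
The longest-match rule can be illustrated for operator tokens with a
few lines of present-day Python. The list of operators is copied from
the operator table in this chapter; the function name
\verb/next_operator/ is an assumption of this sketch, which does not
model how the real lexical analyzer is written.

```python
# Operator tokens as listed in this chapter.
OPERATORS = ["+", "-", "*", "/", "%",
             "<<", ">>", "&", "|", "^", "~",
             "<", "==", ">", "<=", "<>", "!=", ">="]


def next_operator(text, pos=0):
    """Return the longest operator starting at text[pos], or None."""
    candidates = [op for op in OPERATORS if text.startswith(op, pos)]
    return max(candidates, key=len) if candidates else None
```

Reading \verb/<<x/ from the left thus gives the single token
\verb/<</ rather than two \verb/</ tokens.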

\section{Identifiers}

Identifiers (also referred to as names) are described by the following
lexical definitions:
\index{identifier}
\index{name}

\begin{verbatim}
identifier:     (letter|"_") (letter|digit|"_")*
letter:         lowercase | uppercase
lowercase:      "a"..."z"
uppercase:      "A"..."Z"
digit:          "0"..."9"
\end{verbatim}

Identifiers are unlimited in length. Case is significant.
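
The identifier grammar translates directly into a regular expression.
The following check is a sketch in present-day Python; the names
\verb/IDENTIFIER/ and \verb/is_identifier/ are made up here, and
keywords are deliberately not excluded.

```python
import re

# (letter | "_") (letter | digit | "_")*, with ASCII letters and digits only.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier(text):
    """Return True if the whole of `text` matches the identifier grammar."""
    return IDENTIFIER.fullmatch(text) is not None
```

Because case is significant, \verb/spam/ and \verb/Spam/ both match
but denote different identifiers.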

\subsection{Keywords}

The following identifiers are used as reserved words, or {\em
keywords} of the language, and cannot be used as ordinary
identifiers. They must be spelled exactly as written here:
\index{keyword}
\index{reserved word}

\begin{verbatim}
and        del        for        in         print
break      elif       from       is         raise
class      else       global     not        return
continue   except     if         or         try
def        finally    import     pass       while
\end{verbatim}

% # This Python program sorts and formats the above table
% import string
% l = []
% try:
%     while 1:
%         l = l + string.split(raw_input())
% except EOFError:
%     pass
% l.sort()
% for i in range((len(l)+4)/5):
%     for j in range(i, len(l), 5):
%         print string.ljust(l[j], 10),
%     print

\section{Literals} \label{literals}

Literals are notations for constant values of some built-in types.
\index{literal}
\index{constant}

\subsection{String literals}

String literals are described by the following lexical definitions:
\index{string literal}

\begin{verbatim}
stringliteral:  "'" stringitem* "'"
stringitem:     stringchar | escapeseq
stringchar:     <any ASCII character except newline or "\" or "'">
escapeseq:      "\" <any ASCII character except newline>
\end{verbatim}
\index{ASCII}

String literals cannot span physical line boundaries. Escape
sequences in strings are actually interpreted according to rules
similar to those used by Standard C. The recognized escape sequences
are:
\index{physical line}
\index{escape sequence}
\index{Standard C}
\index{C}

\begin{center}
\begin{tabular}{|l|l|}
\hline
\verb/\\/ & Backslash (\verb/\/) \\
\verb/\'/ & Single quote (\verb/'/) \\
\verb/\a/ & ASCII Bell (BEL) \\
\verb/\b/ & ASCII Backspace (BS) \\
%\verb/\E/ & ASCII Escape (ESC) \\
\verb/\f/ & ASCII Formfeed (FF) \\
\verb/\n/ & ASCII Linefeed (LF) \\
\verb/\r/ & ASCII Carriage Return (CR) \\
\verb/\t/ & ASCII Horizontal Tab (TAB) \\
\verb/\v/ & ASCII Vertical Tab (VT) \\
\verb/\/{\em ooo} & ASCII character with octal value {\em ooo} \\
\verb/\x/{\em xx...} & ASCII character with hex value {\em xx...} \\
\hline
\end{tabular}
\end{center}
\index{ASCII}

In strict compatibility with Standard C, up to three octal digits are
accepted, but an unlimited number of hex digits is taken to be part of
the hex escape (and then the lower 8 bits of the resulting hex number
are used in all current implementations...).

All unrecognized escape sequences are left in the string unchanged,
i.e., {\em the backslash is left in the string.} (This behavior is
useful when debugging: if an escape sequence is mistyped, the
resulting output is more easily recognized as broken. It also helps a
great deal for string literals used as regular expressions or
otherwise passed to other modules that do their own escape handling.)
\index{unrecognized escape sequence}
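
The escape rules can be modelled by a small decoder written in
present-day Python. This is a sketch under stated assumptions rather
than the interpreter's implementation: the name
\verb/decode_escapes/ is hypothetical, the argument is the text
between the quotes, and masking octal values to 8 bits is an
assumption (the text above only states it for hex escapes).

```python
SIMPLE = {"\\": "\\", "'": "'", "a": "\a", "b": "\b", "f": "\f",
          "n": "\n", "r": "\r", "t": "\t", "v": "\v"}
OCT_DIGITS = "01234567"
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_escapes(body):
    """Interpret the escape sequences in the body of a string literal."""
    out = []
    i, n = 0, len(body)
    while i < n:
        ch = body[i]
        if ch != "\\" or i + 1 == n:
            out.append(ch)
            i += 1
            continue
        nxt = body[i + 1]
        if nxt in SIMPLE:
            out.append(SIMPLE[nxt])
            i += 2
        elif nxt in OCT_DIGITS:
            # Up to three octal digits.
            j = i + 1
            while j < n and j < i + 4 and body[j] in OCT_DIGITS:
                j += 1
            # Assumption: keep the lower 8 bits, as for hex escapes.
            out.append(chr(int(body[i + 1:j], 8) & 0xFF))
            i = j
        elif nxt == "x" and i + 2 < n and body[i + 2] in HEX_DIGITS:
            # Any number of hex digits; only the lower 8 bits are used.
            j = i + 2
            while j < n and body[j] in HEX_DIGITS:
                j += 1
            out.append(chr(int(body[i + 2:j], 16) & 0xFF))
            i = j
        else:
            # Unrecognized escape: the backslash stays in the string.
            out.append("\\" + nxt)
            i += 2
    return "".join(out)
```

With this model, the bodies \verb/\101/, \verb/\x41/ and
\verb/\x141/ all decode to the letter A, while \verb/\d+/ is left
unchanged.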

\subsection{Numeric literals}

There are three types of numeric literals: plain integers, long
integers, and floating point numbers.
\index{number}
\index{numeric literal}
\index{integer literal}
\index{plain integer literal}
\index{long integer literal}
\index{floating point literal}
\index{hexadecimal literal}
\index{octal literal}
\index{decimal literal}

Integer and long integer literals are described by the following
lexical definitions:

\begin{verbatim}
longinteger:    integer ("l"|"L")
integer:        decimalinteger | octinteger | hexinteger
decimalinteger: nonzerodigit digit* | "0"
octinteger:     "0" octdigit+
hexinteger:     "0" ("x"|"X") hexdigit+

nonzerodigit:   "1"..."9"
octdigit:       "0"..."7"
hexdigit:       digit|"a"..."f"|"A"..."F"
\end{verbatim}

Although both lower case `l' and upper case `L' are allowed as a suffix
for long integers, it is strongly recommended to always use `L', since
the letter `l' looks too much like the digit `1'.

Plain integer decimal literals must be at most $2^{31} - 1$ (i.e., the
largest positive integer, assuming 32-bit arithmetic). Plain octal and
hexadecimal literals may be as large as $2^{32} - 1$, but values
larger than $2^{31} - 1$ are converted to a negative value by
subtracting $2^{32}$. There is no limit for long integer literals.

Some examples of plain and long integer literals:

\begin{verbatim}
7     2147483647                        0177    0x80000000
3L    79228162514264337593543950336L    0377L   0x100000000L
\end{verbatim}
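
The range rules for plain integers can be checked with a short sketch
in present-day Python, assuming 32-bit arithmetic as in the text. The
function name \verb/plain_integer_value/ and the choice of
\verb/OverflowError/ for an out-of-range literal are assumptions of
this example; the input is expected to be a well-formed plain integer
literal without the \verb/L/ suffix.

```python
def plain_integer_value(literal):
    """Return the value of a plain integer literal under 32-bit arithmetic."""
    if literal[:2] in ("0x", "0X"):
        value, is_decimal = int(literal[2:], 16), False
    elif len(literal) > 1 and literal[0] == "0":
        value, is_decimal = int(literal[1:], 8), False
    else:
        value, is_decimal = int(literal, 10), True

    # Decimal literals stop at 2**31 - 1; octal and hex may reach 2**32 - 1.
    limit = 2**31 - 1 if is_decimal else 2**32 - 1
    if value > limit:
        # Assumption: the text only says such a literal is not allowed.
        raise OverflowError("plain integer literal too large: " + literal)

    # Octal and hex values above 2**31 - 1 wrap around to negative values.
    if value > 2**31 - 1:
        value -= 2**32
    return value
```

Under these assumptions \verb/0x80000000/ evaluates to $-2^{31}$ and
\verb/0177/ to 127.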

Floating point literals are described by the following lexical
definitions:

\begin{verbatim}
floatnumber:    pointfloat | exponentfloat
pointfloat:     [intpart] fraction | intpart "."
exponentfloat:  (intpart | pointfloat) exponent
intpart:        digit+
fraction:       "." digit+
exponent:       ("e"|"E") ["+"|"-"] digit+
\end{verbatim}

The allowed range of floating point literals is
implementation-dependent.

Some examples of floating point literals:

\begin{verbatim}
3.14    10.    .001    1e100    3.14e-10
\end{verbatim}
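
The floating point grammar can be transcribed into a regular
expression, as in the following sketch in present-day Python. The
names \verb/POINTFLOAT/, \verb/EXPONENT/, \verb/FLOATNUMBER/ and
\verb/is_float_literal/ are chosen for this example only.

```python
import re

# pointfloat: [intpart] fraction | intpart "."
POINTFLOAT = r"(?:\d*\.\d+|\d+\.)"
# exponent: ("e"|"E") ["+"|"-"] digit+
EXPONENT = r"[eE][+-]?\d+"
# floatnumber: exponentfloat | pointfloat
FLOATNUMBER = re.compile(rf"(?:(?:\d+|{POINTFLOAT}){EXPONENT}|{POINTFLOAT})")


def is_float_literal(text):
    """Return True if the whole of `text` is a floating point literal."""
    return FLOATNUMBER.fullmatch(text) is not None
```

All five examples above are accepted, while \verb/10/ (a plain
integer) and \verb/1e/ (missing exponent digits) are rejected.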

Note that numeric literals do not include a sign; a phrase like
\verb\-1\ is actually an expression composed of the operator
\verb\-\ and the literal \verb\1\.

\section{Operators}

The following tokens are operators:
\index{operators}

\begin{verbatim}
+       -       *       /       %
<<      >>      &       |       ^       ~
<       ==      >       <=      <>      !=      >=
\end{verbatim}

The comparison operators \verb\<>\ and \verb\!=\ are alternate
spellings of the same operator.

\section{Delimiters}

The following tokens serve as delimiters or otherwise have a special
meaning:
\index{delimiters}

\begin{verbatim}
(       )       [       ]       {       }
;       ,       :       .       `       =
\end{verbatim}

The following printing ASCII characters are not used in Python. Their
occurrence outside string literals and comments is an unconditional
error:
\index{ASCII}

\begin{verbatim}
@       $       "       ?
\end{verbatim}

They may be used by future versions of the language though!