\chapter{Lexical analysis}

A Python program is read by a {\em parser}. Input to the parser is a
stream of {\em tokens}, generated by the {\em lexical analyzer}. This
chapter describes how the lexical analyzer breaks a file into tokens.
\index{lexical analysis}
\index{parser}
\index{token}

\section{Line structure}

A Python program is divided into a number of logical lines. The end of
a logical line is represented by the token NEWLINE. Statements cannot
cross logical line boundaries except where NEWLINE is allowed by the
syntax (e.g. between statements in compound statements).
\index{line structure}
\index{logical line}
\index{NEWLINE token}

\subsection{Comments}

A comment starts with a hash character (\verb@#@) that is not part of
a string literal, and ends at the end of the physical line. A comment
always signifies the end of the logical line. Comments are ignored by
the syntax.
\index{comment}
\index{logical line}
\index{physical line}
\index{hash character}

\subsection{Explicit line joining}

Two or more physical lines may be joined into a single logical line
using backslash characters (\verb/\/), as follows: when a physical line
ends in a backslash that is not part of a string literal or comment, it
is joined with the following line, forming a single logical line; the
backslash and the following end-of-line character are deleted. For
example:
\index{physical line}
\index{line joining}
\index{line continuation}
\index{backslash character}
%
\begin{verbatim}
if 1900 < year < 2100 and 1 <= month <= 12 \
   and 1 <= day <= 31 and 0 <= hour < 24 \
   and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
        return 1
\end{verbatim}

A line ending in a backslash cannot carry a comment; a backslash does
not continue a comment (but it does continue a string literal, see
below).

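The joining rule can be sketched in a few lines of present-day Python.
This is only an illustration, not the lexical analyzer's actual code:
the helper name \verb@join_physical_lines@ is hypothetical, and the
sketch deliberately ignores the requirement that the backslash must not
be part of a string literal or comment.

```python
def join_physical_lines(lines):
    """Join physical lines that end in a backslash into logical lines.

    Simplification (assumption): no check is made for backslashes
    inside string literals or comments.
    """
    logical, pending = [], ""
    for line in lines:
        if line.endswith("\\"):
            # Delete the backslash and the end-of-line; keep accumulating.
            pending += line[:-1]
        else:
            logical.append(pending + line)
            pending = ""
    if pending:
        # The last physical line ended in a backslash.
        logical.append(pending)
    return logical


physical = ["if a < b \\", "   and b < c:", "    return 1"]
for logical_line in join_physical_lines(physical):
    print(repr(logical_line))
```

The lines are given without their end-of-line characters, so deleting
the backslash is all that is needed to splice two of them together.
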
\subsection{Implicit line joining}

Expressions in parentheses, square brackets or curly braces can be
split over more than one physical line without using backslashes.
For example:

\begin{verbatim}
month_names = ['Januari', 'Februari', 'Maart',      # These are the
               'April',   'Mei',      'Juni',       # Dutch names
               'Juli',    'Augustus', 'September',  # for the months
               'Oktober', 'November', 'December']   # of the year
\end{verbatim}

Implicitly continued lines can carry comments. The indentation of the
continuation lines is not important. Blank continuation lines are
allowed.

\subsection{Blank lines}

A logical line that contains only spaces, tabs, and possibly a
comment, is ignored (i.e., no NEWLINE token is generated), except that
during interactive input of statements, an entirely blank logical line
terminates a multi-line statement.
\index{blank line}

\subsection{Indentation}

Leading whitespace (spaces and tabs) at the beginning of a logical
line is used to compute the indentation level of the line, which in
turn is used to determine the grouping of statements.
\index{indentation}
\index{whitespace}
\index{leading whitespace}
\index{space}
\index{tab}
\index{grouping}
\index{statement grouping}

First, tabs are replaced (from left to right) by one to eight spaces
such that the total number of characters up to and including the
replacement is a multiple of eight (this is intended to be the same
rule as used by {\UNIX}). The total number of spaces preceding the
first non-blank character then determines the line's indentation.
Indentation cannot be split over multiple physical lines using
backslashes.

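As a minimal sketch of the tab rule, the following present-day Python
function computes the indentation of a line. The name
\verb@indentation@ and its \verb@tabsize@ parameter are illustrative
assumptions, not part of the language definition.

```python
def indentation(line, tabsize=8):
    """Return the indentation of a line after replacing tabs by spaces."""
    column = 0
    for ch in line:
        if ch == " ":
            column += 1
        elif ch == "\t":
            # Advance to the next multiple of tabsize (one to eight spaces).
            column = (column // tabsize + 1) * tabsize
        else:
            break  # first non-blank character
    return column


for sample in ["x", "    x", "\tx", "   \tx", "        \tx"]:
    print(repr(sample), indentation(sample))
```

The built-in string method \verb@expandtabs(8)@ applies the same
column rule, so counting the leading spaces of its result should give
the same number.
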
The indentation levels of consecutive lines are used to generate
INDENT and DEDENT tokens, using a stack, as follows.
\index{INDENT token}
\index{DEDENT token}

Before the first line of the file is read, a single zero is pushed on
the stack; this will never be popped off again. The numbers pushed on
the stack will always be strictly increasing from bottom to top. At
the beginning of each logical line, the line's indentation level is
compared to the top of the stack. If it is equal, nothing happens.
If it is larger, it is pushed on the stack, and one INDENT token is
generated. If it is smaller, it {\em must} be one of the numbers
occurring on the stack; all numbers on the stack that are larger are
popped off, and for each number popped off a DEDENT token is
generated. At the end of the file, a DEDENT token is generated for
each number remaining on the stack that is larger than zero.

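The stack algorithm above can be sketched as follows in present-day
Python. The function name \verb@indent_tokens@ and the \verb@"LINE"@
placeholder (standing in for the tokens of the line itself) are
assumptions made for the illustration; the input is the list of
indentation levels of the logical lines.

```python
def indent_tokens(levels):
    """Return INDENT/DEDENT markers for a list of indentation levels."""
    stack = [0]  # the initial zero is never popped
    tokens = []
    for level in levels:
        if level > stack[-1]:
            stack.append(level)
            tokens.append("INDENT")
        elif level < stack[-1]:
            if level not in stack:
                # The level does not match any enclosing indentation.
                raise IndentationError("inconsistent dedent")
            while stack[-1] > level:
                stack.pop()
                tokens.append("DEDENT")
        tokens.append("LINE")  # placeholder for the line's own tokens
    # End of file: one DEDENT for each remaining level above zero.
    while stack[-1] > 0:
        stack.pop()
        tokens.append("DEDENT")
    return tokens


print(indent_tokens([0, 4, 8, 4, 0]))
```

Because levels are only pushed when they exceed the top of the stack,
the stack stays strictly increasing, which is what makes the membership
test for a dedent sufficient.
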
Here is an example of a correctly (though confusingly) indented piece
of Python code:

\begin{verbatim}
def perm(l):
        # Compute the list of all permutations of l

    if len(l) <= 1:
                  return [l]
    r = []
    for i in range(len(l)):
             s = l[:i] + l[i+1:]
             p = perm(s)
             for x in p:
              r.append(l[i:i+1] + x)
    return r
\end{verbatim}

The following example shows various indentation errors:

\begin{verbatim}
    def perm(l):                        # error: first line indented
    for i in range(len(l)):             # error: not indented
        s = l[:i] + l[i+1:]
            p = perm(l[:i] + l[i+1:])   # error: unexpected indent
            for x in p:
                    r.append(l[i:i+1] + x)
                return r                # error: inconsistent dedent
\end{verbatim}

(Actually, the first three errors are detected by the parser; only the
last error is found by the lexical analyzer: the indentation of
\verb@return r@ does not match a level popped off the stack.)

\section{Other tokens}

Besides NEWLINE, INDENT and DEDENT, the following categories of tokens
exist: identifiers, keywords, literals, operators, and delimiters.
Spaces and tabs are not tokens, but serve to delimit tokens. Where
ambiguity exists, a token comprises the longest possible string that
forms a legal token, when read from left to right.

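The longest-match rule can be illustrated with a small sketch in
present-day Python that splits a run of operator characters. The
operator list is the one given in the section on operators in this
chapter; the function name \verb@split_operators@ is a hypothetical
helper, and a real lexical analyzer applies the same rule to every
token category, not just operators.

```python
OPERATORS = ["<<", ">>", "<=", ">=", "==", "<>", "!=",
             "+", "-", "*", "/", "%", "&", "|", "^", "~", "<", ">"]


def split_operators(text):
    """Split a run of operator characters using the longest-match rule."""
    tokens, pos = [], 0
    while pos < len(text):
        # Try longer candidates first so that "<=" wins over "<".
        for op in sorted(OPERATORS, key=len, reverse=True):
            if text.startswith(op, pos):
                tokens.append(op)
                pos += len(op)
                break
        else:
            raise ValueError("no operator at position %d" % pos)
    return tokens


print(split_operators("<<<"))
print(split_operators("<=>"))
```

Reading from left to right, the first call takes \verb@<<@ before
\verb@<@, and the second takes \verb@<=@ before \verb@>@.
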
\section{Identifiers}

Identifiers (also referred to as names) are described by the following
lexical definitions:
\index{identifier}
\index{name}

\begin{verbatim}
identifier:     (letter|"_") (letter|digit|"_")*
letter:         lowercase | uppercase
lowercase:      "a"..."z"
uppercase:      "A"..."Z"
digit:          "0"..."9"
\end{verbatim}

Identifiers are unlimited in length. Case is significant.

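The identifier grammar translates directly into a regular expression.
The sketch below uses the \verb@re@ module of present-day Python; the
names \verb@IDENTIFIER@ and \verb@is_identifier@ are illustrative
assumptions. It tests only the lexical form and does not exclude
keywords.

```python
import re

# (letter|"_") (letter|digit|"_")*, with ASCII letters and digits only
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier(text):
    """Return True if text matches the identifier grammar exactly."""
    return IDENTIFIER.fullmatch(text) is not None


for candidate in ["spam", "_private", "Spam2", "2spam", "spam-eggs", ""]:
    print(repr(candidate), is_identifier(candidate))
```
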
\subsection{Keywords}

The following identifiers are used as reserved words, or {\em
keywords} of the language, and cannot be used as ordinary
identifiers. They must be spelled exactly as written here:
\index{keyword}
\index{reserved word}

\begin{verbatim}
access     del        from       lambda     return
and        elif       global     not        try
break      else       if         or         while
class      except     import     pass
continue   finally    in         print
def        for        is         raise
\end{verbatim}

% When adding keywords, pipe it through keywords.py for reformatting

\section{Literals} \label{literals}

Literals are notations for constant values of some built-in types.
\index{literal}
\index{constant}

\subsection{String literals}

String literals are described by the following lexical definitions:
\index{string literal}

\begin{verbatim}
stringliteral:   shortstring | longstring
shortstring:     "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring:      "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem: shortstringchar | escapeseq
longstringitem:  longstringchar | escapeseq
shortstringchar: <any ASCII character except "\" or newline or the quote>
longstringchar:  <any ASCII character except "\">
escapeseq:       "\" <any ASCII character>
\end{verbatim}
\index{ASCII}

In ``long strings'' (strings surrounded by sets of three quotes),
unescaped newlines and quotes are allowed (and are retained), except
that three unescaped quotes in a row terminate the string. (A
``quote'' is the character used to open the string, i.e. either
\verb/'/ or \verb/"/.)

Escape sequences in strings are interpreted according to rules similar
to those used by Standard C. The recognized escape sequences are:
\index{physical line}
\index{escape sequence}
\index{Standard C}
\index{C}

\begin{center}
\begin{tabular}{|l|l|}
\hline
\verb/\/{\em newline}	& Ignored \\
\verb/\\/	& Backslash (\verb/\/) \\
\verb/\'/	& Single quote (\verb/'/) \\
\verb/\"/	& Double quote (\verb/"/) \\
\verb/\a/	& ASCII Bell (BEL) \\
\verb/\b/	& ASCII Backspace (BS) \\
%\verb/\E/	& ASCII Escape (ESC) \\
\verb/\f/	& ASCII Formfeed (FF) \\
\verb/\n/	& ASCII Linefeed (LF) \\
\verb/\r/	& ASCII Carriage Return (CR) \\
\verb/\t/	& ASCII Horizontal Tab (TAB) \\
\verb/\v/	& ASCII Vertical Tab (VT) \\
\verb/\/{\em ooo}	& ASCII character with octal value {\em ooo} \\
\verb/\x/{\em xx...}	& ASCII character with hex value {\em xx...} \\
\hline
\end{tabular}
\end{center}
\index{ASCII}

In strict compatibility with Standard C, up to three octal digits are
accepted, but an unlimited number of hex digits is taken to be part of
the hex escape (and then the lower 8 bits of the resulting hex number
are used in all current implementations...).

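The digit-counting rules for octal and hex escapes can be sketched as
two small functions in present-day Python. Both names are hypothetical,
and both functions assume that their argument is the text immediately
following the backslash (or the \verb@\x@) and starts with at least one
valid digit.

```python
def octal_escape(text):
    """Consume up to three octal digits; return (value, remaining text)."""
    n = 0
    while n < 3 and n < len(text) and text[n] in "01234567":
        n += 1
    return int(text[:n], 8), text[n:]


def hex_escape(text):
    """Consume all leading hex digits; return (value, remaining text).

    Only the lower 8 bits of the number are kept, as described above.
    """
    n = 0
    while n < len(text) and text[n] in "0123456789abcdefABCDEF":
        n += 1
    return int(text[:n], 16) & 0xFF, text[n:]


print(octal_escape("1011"))   # stops after three digits
print(hex_escape("141z"))     # takes every hex digit, keeps 8 bits
```

In the first call the fourth digit is left over as an ordinary
character; in the second, all three hex digits are consumed and the
value \verb@0x141@ is reduced to \verb@0x41@.
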
All unrecognized escape sequences are left in the string unchanged,
i.e., {\em the backslash is left in the string.} (This behavior is
useful when debugging: if an escape sequence is mistyped, the
resulting output is more easily recognized as broken. It also helps a
great deal for string literals used as regular expressions or
otherwise passed to other modules that do their own escape handling.)
\index{unrecognized escape sequence}

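A minimal sketch of this behavior in present-day Python follows. The
names \verb@SIMPLE_ESCAPES@ and \verb@decode_simple_escapes@ are
assumptions for the illustration, and the sketch covers only the
single-character escapes from the table: octal escapes, hex escapes and
backslash-newline are deliberately left out and would need their own
cases.

```python
SIMPLE_ESCAPES = {"\\": "\\", "'": "'", '"': '"', "a": "\a", "b": "\b",
                  "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v"}


def decode_simple_escapes(body):
    """Decode single-character escapes; leave unrecognized ones unchanged."""
    out, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt in SIMPLE_ESCAPES:
                out.append(SIMPLE_ESCAPES[nxt])
            else:
                # Unrecognized: the backslash stays in the string.
                out.append("\\" + nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(repr(decode_simple_escapes(r"a\tb")))
print(repr(decode_simple_escapes(r"\d+\.")))
```

The second call shows why the rule suits regular expressions: the
pattern reaches the receiving module with its backslashes intact.
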
\subsection{Numeric literals}

There are three types of numeric literals: plain integers, long
integers, and floating point numbers.
\index{number}
\index{numeric literal}
\index{integer literal}
\index{plain integer literal}
\index{long integer literal}
\index{floating point literal}
\index{hexadecimal literal}
\index{octal literal}
\index{decimal literal}

Integer and long integer literals are described by the following
lexical definitions:

\begin{verbatim}
longinteger:    integer ("l"|"L")
integer:        decimalinteger | octinteger | hexinteger
decimalinteger: nonzerodigit digit* | "0"
octinteger:     "0" octdigit+
hexinteger:     "0" ("x"|"X") hexdigit+

nonzerodigit:   "1"..."9"
octdigit:       "0"..."7"
hexdigit:       digit|"a"..."f"|"A"..."F"
\end{verbatim}

Although both lower case `l' and upper case `L' are allowed as a suffix
for long integers, it is strongly recommended to always use `L', since
the letter `l' looks too much like the digit `1'.

Plain integer decimal literals must be at most $2^{31} - 1$ (i.e., the
largest positive integer, assuming 32-bit arithmetic). Plain octal and
hexadecimal literals may be as large as $2^{32} - 1$, but values
larger than $2^{31} - 1$ are converted to a negative value by
subtracting $2^{32}$. There is no limit for long integer literals.

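The range rules for plain integers can be worked through with a short
sketch in present-day Python, whose own integers are unbounded. The
function name \verb@plain_integer_value@ is hypothetical, the literal
is passed as a string because the octal notation described here is not
accepted by present-day Python, and 32-bit arithmetic is assumed as in
the text.

```python
def plain_integer_value(literal):
    """Value of a plain integer literal, assuming 32-bit arithmetic."""
    text = literal.lower()
    if text.startswith("0x"):
        value, decimal = int(text[2:], 16), False
    elif text.startswith("0") and len(text) > 1:
        value, decimal = int(text[1:], 8), False
    else:
        value, decimal = int(text, 10), True
    limit = 2**31 - 1 if decimal else 2**32 - 1
    if value > limit:
        raise OverflowError("plain integer literal too large: " + literal)
    if value > 2**31 - 1:
        # Octal and hexadecimal literals wrap to a negative value.
        value -= 2**32
    return value


for literal in ["7", "2147483647", "0177", "0x80000000", "0xFFFFFFFF"]:
    print(literal, plain_integer_value(literal))
```

For example, \verb@0x80000000@ is $2^{31}$, which exceeds
$2^{31} - 1$, so the result is $2^{31} - 2^{32} = -2^{31}$.
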
Some examples of plain and long integer literals:

\begin{verbatim}
7     2147483647                        0177    0x80000000
3L    79228162514264337593543950336L    0377L   0x100000000L
\end{verbatim}

Floating point literals are described by the following lexical
definitions:

\begin{verbatim}
floatnumber:    pointfloat | exponentfloat
pointfloat:     [intpart] fraction | intpart "."
exponentfloat:  (intpart | pointfloat) exponent
intpart:        digit+
fraction:       "." digit+
exponent:       ("e"|"E") ["+"|"-"] digit+
\end{verbatim}

The allowed range of floating point literals is
implementation-dependent.

Some examples of floating point literals:

\begin{verbatim}
3.14    10.    .001    1e100    3.14e-10
\end{verbatim}

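As a rough check of the floating point grammar, it can be transcribed
into a regular expression with the \verb@re@ module of present-day
Python. The names \verb@FLOATNUMBER@ and \verb@is_floatnumber@ are
illustrative assumptions; the expression tests only the lexical form,
not the implementation-dependent range.

```python
import re

FLOATNUMBER = re.compile(r"""
    (?: \d* \. \d+ | \d+ \. )      # pointfloat
    (?: [eE] [+-]? \d+ )?          # optional exponent
  | \d+ [eE] [+-]? \d+             # intpart followed by an exponent
""", re.VERBOSE)


def is_floatnumber(text):
    """Return True if text matches the floatnumber grammar exactly."""
    return FLOATNUMBER.fullmatch(text) is not None


for candidate in ["3.14", "10.", ".001", "1e100", "3.14e-10", "10", "1e"]:
    print(candidate, is_floatnumber(candidate))
```

A bare \verb@10@ is rejected because it has neither a point nor an
exponent and is therefore an integer literal.
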
Note that numeric literals do not include a sign; a phrase like
\verb@-1@ is actually an expression composed of the operator
\verb@-@ and the literal \verb@1@.

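This can be observed with the \verb@tokenize@ module of the present-day
standard library, which is a later tokenizer and not the lexical
analyzer described in this chapter; the helper \verb@token_pairs@ is a
hypothetical convenience wrapper.

```python
import io
import tokenize


def token_pairs(source):
    """Return (token type name, token text) pairs for a piece of source."""
    readline = io.StringIO(source).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)]


for pair in token_pairs("-1\n"):
    print(pair)
```

The sign and the digit are reported as two separate tokens, an operator
followed by a number.
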
\section{Operators}

The following tokens are operators:
\index{operators}

\begin{verbatim}
+       -       *       /       %
<<      >>      &       |       ^       ~
<       ==      >       <=      <>      !=      >=
\end{verbatim}

The comparison operators \verb@<>@ and \verb@!=@ are alternate
spellings of the same operator.

\section{Delimiters}

The following tokens serve as delimiters or otherwise have a special
meaning:
\index{delimiters}

\begin{verbatim}
(       )       [       ]       {       }
,       :       .       "       `       '
=       ;
\end{verbatim}

The following printing ASCII characters are not used in Python. Their
occurrence outside string literals and comments is an unconditional
error:
\index{ASCII}

\begin{verbatim}
@       $       ?
\end{verbatim}

They may, however, be used by future versions of the language!