Blame - Doc/reference/lexical_analysis.rst - platform/external/python/cpython3


Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1
				2	.. _lexical:
				3
				4	****************
				5	Lexical analysis
				6	****************
				7
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	8	.. index:: lexical analysis, parser, token
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	9
				10	A Python program is read by a parser. Input to the parser is a stream of
				11	tokens, generated by the lexical analyzer. This chapter describes how the
				12	lexical analyzer breaks a file into tokens.
				13
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	14	Python reads program text as Unicode code points; the encoding of a source file
				15	can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
				16	for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
				17	raised.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	18
				19
				20	.. _line-structure:
				21
				22	Line structure
				23	==============
				24
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	25	.. index:: line structure
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	26
				27	A Python program is divided into a number of logical lines.
				28
				29
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	30	.. _logical-lines:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	31
				32	Logical lines
				33	-------------
				34
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	35	.. index:: logical line, physical line, line joining, NEWLINE token
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	36
				37	The end of a logical line is represented by the token NEWLINE. Statements
				38	cannot cross logical line boundaries except where NEWLINE is allowed by the
				39	syntax (e.g., between statements in compound statements). A logical line is
				40	constructed from one or more physical lines by following the explicit or
				41	implicit line joining rules.
				42
				43
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	44	.. _physical-lines:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	45
				46	Physical lines
				47	--------------
				48
				49	A physical line is a sequence of characters terminated by an end-of-line
				50	sequence. In source files, any of the standard platform line termination
				51	sequences can be used - the Unix form using ASCII LF (linefeed), the Windows
				52	form using the ASCII sequence CR LF (return followed by linefeed), or the
				53	Macintosh form using the ASCII CR (return) character. All of these forms can be
				54	used equally, regardless of platform.
				55
				56	When embedding Python, source code strings should be passed to Python APIs using
				57	the standard C conventions for newline characters (the ``\n`` character,
				58	representing ASCII LF, is the line terminator).
				59
				60
				61	.. _comments:
				62
				63	Comments
				64	--------
				65
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	66	.. index:: comment, hash character
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	67
				68	A comment starts with a hash character (``#``) that is not part of a string
				69	literal, and ends at the end of the physical line. A comment signifies the end
				70	of the logical line unless the implicit line joining rules are invoked. Comments
				71	are ignored by the syntax; they are not tokens.
				72
				73
				74	.. _encodings:
				75
				76	Encoding declarations
				77	---------------------
				78
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	79	.. index:: source character set, encodings
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	80
				81	If a comment in the first or second line of the Python script matches the
				82	regular expression ``coding[=:]\s*([-\w.]+)``, this comment is processed as an
				83	encoding declaration; the first group of this expression names the encoding of
				84	the source code file. The recommended forms of this expression are ::
				85
   # -*- coding: <encoding-name> -*-
				87
				88	which is recognized also by GNU Emacs, and ::
				89
   # vim:fileencoding=<encoding-name>
				91
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	92	which is recognized by Bram Moolenaar's VIM.
				93
				94	If no encoding declaration is found, the default encoding is UTF-8. In
				95	addition, if the first bytes of the file are the UTF-8 byte-order mark
				96	(``b'\xef\xbb\xbf'``), the declared file encoding is UTF-8 (this is supported,
				97	among others, by Microsoft's :program:`notepad`).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	98
				99	If an encoding is declared, the encoding name must be recognized by Python. The
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	100	encoding is used for all lexical analysis, including string literals, comments
				101	and identifiers. The encoding declaration must appear on a line of its own.
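As an illustration, the standard library function
:func:`tokenize.detect_encoding` applies these rules to the first lines of a
byte stream. The sketch below is not part of the language definition; the
helper name ``declared_encoding`` is made up for this example, and the values
it returns are the normalized spellings reported by :mod:`tokenize`.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding that tokenize detects for *source_bytes*."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# Declaration in the Emacs style on the first line.
emacs_style = declared_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n")
# Declaration in the Vim style on the second line.
vim_style = declared_encoding(b"#!/usr/bin/python\n# vim:fileencoding=latin-1\n")
# No declaration at all: the default applies.
no_cookie = declared_encoding(b"x = 1\n")
# A UTF-8 byte-order mark at the start of the file.
with_bom = declared_encoding(b"\xef\xbb\xbfx = 1\n")

print(emacs_style, vim_style, no_cookie, with_bom)
```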
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	102
Christian Heimes	5b5e81c	2007-12-31 16:14:33 +0000	[diff] [blame]	103	.. XXX there should be a list of supported encodings.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	104
				105
				106	.. _explicit-joining:
				107
				108	Explicit line joining
				109	---------------------
				110
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	111	.. index:: physical line, line joining, line continuation, backslash character
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	112
				113	Two or more physical lines may be joined into logical lines using backslash
				114	characters (``\``), as follows: when a physical line ends in a backslash that is
				115	not part of a string literal or comment, it is joined with the following forming
				116	a single logical line, deleting the backslash and the following end-of-line
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	117	character. For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	118
   if 1900 < year < 2100 and 1 <= month <= 12 \
      and 1 <= day <= 31 and 0 <= hour < 24 \
      and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
           return 1
				123
				124	A line ending in a backslash cannot carry a comment. A backslash does not
				125	continue a comment. A backslash does not continue a token except for string
				126	literals (i.e., tokens other than string literals cannot be split across
				127	physical lines using a backslash). A backslash is illegal elsewhere on a line
				128	outside a string literal.
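The restrictions on backslash continuation can be checked with the built-in
:func:`compile`. This is a small sketch; the helper ``compiles`` and the sample
sources are invented for the illustration.

```python
def compiles(source):
    """Return True if *source* compiles as a module, False on SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# A trailing backslash joins the two physical lines into one logical line.
joined = "total = 1 + \\\n        2\n"

# Anything after the backslash, even a comment, makes the line invalid.
with_comment = "total = 1 + \\  # not allowed\n        2\n"

# A backslash cannot split a token such as a name.
split_name = "to\\\ntal = 3\n"

print(compiles(joined), compiles(with_comment), compiles(split_name))
```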
				129
				130
				131	.. _implicit-joining:
				132
				133	Implicit line joining
				134	---------------------
				135
				136	Expressions in parentheses, square brackets or curly braces can be split over
				137	more than one physical line without using backslashes. For example::
				138
   month_names = ['Januari', 'Februari', 'Maart',      # These are the
                  'April',   'Mei',      'Juni',       # Dutch names
                  'Juli',    'Augustus', 'September',  # for the months
                  'Oktober', 'November', 'December']   # of the year
				143
				144	Implicitly continued lines can carry comments. The indentation of the
				145	continuation lines is not important. Blank continuation lines are allowed.
				146	There is no NEWLINE token between implicit continuation lines. Implicitly
				147	continued lines can also occur within triple-quoted strings (see below); in that
				148	case they cannot carry comments.
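The absence of NEWLINE tokens inside brackets can be observed with the
:mod:`tokenize` module. Note that ``NL`` and ``COMMENT`` are token types
specific to that module, which reports non-logical line breaks and comments
for the benefit of tools; they are not tokens of the grammar described here.
The helper ``token_names`` is a name chosen for this sketch.

```python
import io
import tokenize


def token_names(source):
    """Return the names of the token types that tokenize yields for *source*."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]


# One logical line spread over three physical lines, one of them blank.
source = (
    "names = ['Januari',  # first\n"
    "\n"
    "         'Februari']\n"
)
names = token_names(source)
print(names)
```

Only one ``NEWLINE`` appears in the output, after the closing bracket.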
				149
				150
				151	.. _blank-lines:
				152
				153	Blank lines
				154	-----------
				155
				156	.. index:: single: blank line
				157
				158	A logical line that contains only spaces, tabs, formfeeds and possibly a
				159	comment, is ignored (i.e., no NEWLINE token is generated). During interactive
				160	input of statements, handling of a blank line may differ depending on the
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	161	implementation of the read-eval-print loop. In the standard interactive
				162	interpreter, an entirely blank logical line (i.e. one containing not even
				163	whitespace or a comment) terminates a multi-line statement.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	164
				165
				166	.. _indentation:
				167
				168	Indentation
				169	-----------
				170
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	171	.. index:: indentation, leading whitespace, space, tab, grouping, statement grouping
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	172
				173	Leading whitespace (spaces and tabs) at the beginning of a logical line is used
				174	to compute the indentation level of the line, which in turn is used to determine
				175	the grouping of statements.
				176
				177	First, tabs are replaced (from left to right) by one to eight spaces such that
				178	the total number of characters up to and including the replacement is a multiple
				179	of eight (this is intended to be the same rule as used by Unix). The total
				180	number of spaces preceding the first non-blank character then determines the
				181	line's indentation. Indentation cannot be split over multiple physical lines
				182	using backslashes; the whitespace up to the first backslash determines the
				183	indentation.
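The tab rule can be sketched in a few lines. The function below is only an
illustration of the column arithmetic described above (its name and the
``tabsize`` parameter are assumptions of this example); it ignores formfeeds
and does not reproduce the consistency checks a real implementation may apply
to mixed tabs and spaces.

```python
def indentation_width(line, tabsize=8):
    """Return the indentation of *line*, expanding tabs to multiples of tabsize."""
    width = 0
    for ch in line:
        if ch == " ":
            width += 1
        elif ch == "\t":
            # Advance to the next column that is a multiple of tabsize.
            width = (width // tabsize + 1) * tabsize
        else:
            break  # first non-blank character ends the indentation
    return width


print(indentation_width("\tx"), indentation_width("  \tx"),
      indentation_width("\t  x"))
```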
				184
				185	Cross-platform compatibility note: because of the nature of text editors on
				186	non-UNIX platforms, it is unwise to use a mixture of spaces and tabs for the
				187	indentation in a single source file. It should also be noted that different
				188	platforms may explicitly limit the maximum indentation level.
				189
				190	A formfeed character may be present at the start of the line; it will be ignored
				191	for the indentation calculations above. Formfeed characters occurring elsewhere
				192	in the leading whitespace have an undefined effect (for instance, they may reset
				193	the space count to zero).
				194
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	195	.. index:: INDENT token, DEDENT token
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	196
				197	The indentation levels of consecutive lines are used to generate INDENT and
				198	DEDENT tokens, using a stack, as follows.
				199
				200	Before the first line of the file is read, a single zero is pushed on the stack;
				201	this will never be popped off again. The numbers pushed on the stack will
				202	always be strictly increasing from bottom to top. At the beginning of each
				203	logical line, the line's indentation level is compared to the top of the stack.
				204	If it is equal, nothing happens. If it is larger, it is pushed on the stack, and
				205	one INDENT token is generated. If it is smaller, it must be one of the
				206	numbers occurring on the stack; all numbers on the stack that are larger are
				207	popped off, and for each number popped off a DEDENT token is generated. At the
				208	end of the file, a DEDENT token is generated for each number remaining on the
				209	stack that is larger than zero.
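The stack algorithm can be written down directly. The following sketch takes
the indentation level of each logical line and returns the resulting token
sequence, using the placeholder ``"LINE"`` for the tokens of the line itself;
the function name and this representation are choices made for the example,
not the interpreter's actual implementation.

```python
def indent_tokens(levels):
    """Translate per-line indentation levels into INDENT/DEDENT markers."""
    stack = [0]          # the initial zero is never popped
    tokens = []
    for level in levels:
        if level > stack[-1]:
            stack.append(level)
            tokens.append("INDENT")
        elif level < stack[-1]:
            if level not in stack:
                raise IndentationError("inconsistent dedent")
            while stack[-1] > level:
                stack.pop()
                tokens.append("DEDENT")
        tokens.append("LINE")
    # At end of file, close every level that is still open.
    while stack[-1] > 0:
        stack.pop()
        tokens.append("DEDENT")
    return tokens


print(indent_tokens([0, 4, 8, 4, 0]))
```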
				210
				211	Here is an example of a correctly (though confusingly) indented piece of Python
				212	code::
				213
   def perm(l):
           # Compute the list of all permutations of l
       if len(l) <= 1:
                     return [l]
       r = []
       for i in range(len(l)):
                s = l[:i] + l[i+1:]
                p = perm(s)
                for x in p:
                 r.append(l[i:i+1] + x)
       return r
				225
				226	The following example shows various indentation errors::
				227
    def perm(l):                       # error: first line indented
   for i in range(len(l)):             # error: not indented
       s = l[:i] + l[i+1:]
           p = perm(l[:i] + l[i+1:])   # error: unexpected indent
           for x in p:
                   r.append(l[i:i+1] + x)
               return r                # error: inconsistent dedent
				235
				236	(Actually, the first three errors are detected by the parser; only the last
				237	error is found by the lexical analyzer --- the indentation of ``return r`` does
				238	not match a level popped off the stack.)
				239
				240
				241	.. _whitespace:
				242
				243	Whitespace between tokens
				244	-------------------------
				245
				246	Except at the beginning of a logical line or in string literals, the whitespace
				247	characters space, tab and formfeed can be used interchangeably to separate
				248	tokens. Whitespace is needed between two tokens only if their concatenation
				249	could otherwise be interpreted as a different token (e.g., ab is one token, but
				250	a b is two tokens).
				251
				252
				253	.. _other-tokens:
				254
				255	Other tokens
				256	============
				257
				258	Besides NEWLINE, INDENT and DEDENT, the following categories of tokens exist:
				259	identifiers, keywords, literals, operators, and delimiters. Whitespace
				260	characters (other than line terminators, discussed earlier) are not tokens, but
				261	serve to delimit tokens. Where ambiguity exists, a token comprises the longest
				262	possible string that forms a legal token, when read from left to right.
				263
				264
				265	.. _identifiers:
				266
				267	Identifiers and keywords
				268	========================
				269
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	270	.. index:: identifier, name
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	271
				272	Identifiers (also referred to as names) are described by the following lexical
				273	definitions:
				274
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	275	The syntax of identifiers in Python is based on the Unicode standard annex
				276	UAX-31, with elaboration and changes as defined below.
				277
				278	Within the ASCII range (U+0001..U+007F), the valid characters for identifiers
				279	are the same as in Python 2.5; Python 3.0 introduces additional
				280	characters from outside the ASCII range (see :pep:`3131`). For other
				281	characters, the classification uses the version of the Unicode Character
				282	Database as included in the :mod:`unicodedata` module.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	283
				284	Identifiers are unlimited in length. Case is significant.
				285
.. productionlist::
   identifier: `id_start` `id_continue`*
   id_start: <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
   id_continue: <all characters in `id_start`, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	290
				291	The Unicode category codes mentioned above stand for:
				292
				293	* Lu - uppercase letters
				294	* Ll - lowercase letters
				295	* Lt - titlecase letters
				296	* Lm - modifier letters
				297	* Lo - other letters
				298	* Nl - letter numbers
				299	* Mn - nonspacing marks
				300	* Mc - spacing combining marks
				301	* Nd - decimal numbers
				302	* Pc - connector punctuations
				303
All identifiers are converted into the normal form NFKC while parsing; comparison
of identifiers is based on NFKC.
				306
				307	A non-normative HTML file listing all valid identifier characters for Unicode
				308	4.1 can be found at
				309	http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	310
Mark Summerfield	051d1dd	2007-11-20 13:22:19 +0000	[diff] [blame]	311	See :pep:`3131` for further details.
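A short sketch of these rules in practice: :meth:`str.isidentifier` reports
whether a string has the lexical form of an identifier (it also returns true
for keywords), and running a small program with :func:`exec` shows that two
spellings of the same name are normalized to one. The sample names are
invented for the illustration.

```python
candidates = ["spam", "_private", "caf\u00e9", "2fast", "with space"]
valid = [name for name in candidates if name.isidentifier()]

# The same name spelled with a combining accent ("e" + U+0301) and with the
# precomposed letter U+00E9 ends up as a single key in the namespace.
namespace = {}
exec("cafe\u0301 = 1\ncaf\u00e9 += 1", namespace)
value = namespace["caf\u00e9"]

print(valid, value)
```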
				312
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	313	.. _keywords:
				314
				315	Keywords
				316	--------
				317
.. index::
   single: keyword
   single: reserved word
				321
				322	The following identifiers are used as reserved words, or keywords of the
				323	language, and cannot be used as ordinary identifiers. They must be spelled
				324	exactly as written here::
				325
   False      class      finally    is         return
   None       continue   for        lambda     try
   True       def        from       nonlocal   while
   and        del        global     not        with
   as         elif       if         or         yield
   assert     else       import     pass
   break      except     in         raise
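At run time the reserved words can be checked with the :mod:`keyword` module.
The sketch below tests the words listed above; later Python versions may
reserve additional words, so ``keyword.kwlist`` can be longer than this list.

```python
import keyword

listed = """
False class finally is return
None continue for lambda try
True def from nonlocal while
and del global not with
as elif if or yield
assert else import pass
break except in raise
""".split()

# Every word in the table is reported as a keyword.
all_reserved = all(keyword.iskeyword(word) for word in listed)
# An ordinary name is not.
print_reserved = keyword.iskeyword("print")

print(len(listed), all_reserved, print_reserved)
```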
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	333
				334	.. _id-classes:
				335
				336	Reserved classes of identifiers
				337	-------------------------------
				338
				339	Certain classes of identifiers (besides keywords) have special meanings. These
				340	classes are identified by the patterns of leading and trailing underscore
				341	characters:
				342
``_*``
   Not imported by ``from module import *``. The special identifier ``_`` is used
   in the interactive interpreter to store the result of the last evaluation; it is
   stored in the :mod:`builtins` module. When not in interactive mode, ``_``
   has no special meaning and is not defined. See section :ref:`import`.

   .. note::

      The name ``_`` is often used in conjunction with internationalization;
      refer to the documentation for the :mod:`gettext` module for more
      information on this convention.

``__*__``
   System-defined names. These names are defined by the interpreter and its
   implementation (including the standard library); applications should not expect
   to define additional names using this convention. The set of names of this
   class defined by Python may be extended in future versions. See section
   :ref:`specialnames`.

``__*``
   Class-private names. Names in this category, when used within the context of a
   class definition, are re-written to use a mangled form to help avoid name
   clashes between "private" attributes of base and derived classes. See section
   :ref:`atom-identifiers`.
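The effect of the class-private form can be seen in a small example. The class
and attribute names are invented for the illustration.

```python
class Base:
    def __init__(self):
        self.__token = "base"        # stored as _Base__token


class Derived(Base):
    def __init__(self):
        super().__init__()
        self.__token = "derived"     # stored as _Derived__token


obj = Derived()
attribute_names = sorted(vars(obj))
print(attribute_names)
```

Both attributes coexist on the instance because each is rewritten with the
name of the class in which it appears.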
				367
				368
				369	.. _literals:
				370
				371	Literals
				372	========
				373
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	374	.. index:: literal, constant
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	375
				376	Literals are notations for constant values of some built-in types.
				377
				378
				379	.. _strings:
				380
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	381	String and Bytes literals
				382	-------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	383
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	384	.. index:: string literal, bytes literal, ASCII
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	385
				386	String literals are described by the following lexical definitions:
				387
.. productionlist::
   stringliteral: [`stringprefix`](`shortstring` | `longstring`)
   stringprefix: "r" | "R"
   shortstring: "'" `shortstringitem`* "'" | '"' `shortstringitem`* '"'
   longstring: "'''" `longstringitem`* "'''" | '"""' `longstringitem`* '"""'
   shortstringitem: `shortstringchar` | `stringescapeseq`
   longstringitem: `longstringchar` | `stringescapeseq`
   shortstringchar: <any source character except "\" or newline or the quote>
   longstringchar: <any source character except "\">
   stringescapeseq: "\" <any source character>

.. productionlist::
   bytesliteral: `bytesprefix`(`shortbytes` | `longbytes`)
   bytesprefix: "b" | "B"
   shortbytes: "'" `shortbytesitem`* "'" | '"' `shortbytesitem`* '"'
   longbytes: "'''" `longbytesitem`* "'''" | '"""' `longbytesitem`* '"""'
   shortbytesitem: `shortbyteschar` | `bytesescapeseq`
   longbytesitem: `longbyteschar` | `bytesescapeseq`
   shortbyteschar: <any ASCII character except "\" or newline or the quote>
   longbyteschar: <any ASCII character except "\">
   bytesescapeseq: "\" <any ASCII character>
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	409
				410	One syntactic restriction not indicated by these productions is that whitespace
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	411	is not allowed between the :token:`stringprefix` or :token:`bytesprefix` and the
				412	rest of the literal. The source character set is defined by the encoding
				413	declaration; it is UTF-8 if no encoding declaration is given in the source file;
				414	see section :ref:`encodings`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	415
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	416	.. index:: triple-quoted string, Unicode Consortium, raw string
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	417
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	418	In plain English: Both types of literals can be enclosed in matching single quotes
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	419	(``'``) or double quotes (``"``). They can also be enclosed in matching groups
				420	of three single or double quotes (these are generally referred to as
				421	triple-quoted strings). The backslash (``\``) character is used to escape
				422	characters that otherwise have a special meaning, such as newline, backslash
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	423	itself, or the quote character.
				424
				425	String literals may optionally be prefixed with a letter ``'r'`` or ``'R'``;
				426	such strings are called :dfn:`raw strings` and use different rules for
				427	interpreting backslash escape sequences.
				428
				429	Bytes literals are always prefixed with ``'b'`` or ``'B'``; they produce an
				430	instance of the :class:`bytes` type instead of the :class:`str` type. They
				431	may only contain ASCII characters; bytes with a numeric value of 128 or greater
				432	must be expressed with escapes.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	433
				434	In triple-quoted strings, unescaped newlines and quotes are allowed (and are
				435	retained), except that three unescaped quotes in a row terminate the string. (A
				436	"quote" is the character used to open the string, i.e. either ``'`` or ``"``.)
				437
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	438	.. index:: physical line, escape sequence, Standard C, C
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	439
				440	Unless an ``'r'`` or ``'R'`` prefix is present, escape sequences in strings are
				441	interpreted according to rules similar to those used by Standard C. The
				442	recognized escape sequences are:
				443
+-----------------+---------------------------------+-------+
| Escape Sequence | Meaning                         | Notes |
+=================+=================================+=======+
| ``\newline``    | Backslash and newline ignored   |       |
+-----------------+---------------------------------+-------+
| ``\\``          | Backslash (``\``)               |       |
+-----------------+---------------------------------+-------+
| ``\'``          | Single quote (``'``)            |       |
+-----------------+---------------------------------+-------+
| ``\"``          | Double quote (``"``)            |       |
+-----------------+---------------------------------+-------+
| ``\a``          | ASCII Bell (BEL)                |       |
+-----------------+---------------------------------+-------+
| ``\b``          | ASCII Backspace (BS)            |       |
+-----------------+---------------------------------+-------+
| ``\f``          | ASCII Formfeed (FF)             |       |
+-----------------+---------------------------------+-------+
| ``\n``          | ASCII Linefeed (LF)             |       |
+-----------------+---------------------------------+-------+
| ``\r``          | ASCII Carriage Return (CR)      |       |
+-----------------+---------------------------------+-------+
| ``\t``          | ASCII Horizontal Tab (TAB)      |       |
+-----------------+---------------------------------+-------+
| ``\v``          | ASCII Vertical Tab (VT)         |       |
+-----------------+---------------------------------+-------+
| ``\ooo``        | Character with octal value      | (1,3) |
|                 | *ooo*                           |       |
+-----------------+---------------------------------+-------+
| ``\xhh``        | Character with hex value *hh*   | (2,3) |
+-----------------+---------------------------------+-------+
				474
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	475	Escape sequences only recognized in string literals are:
				476
+-----------------+---------------------------------+-------+
| Escape Sequence | Meaning                         | Notes |
+=================+=================================+=======+
| ``\N{name}``    | Character named *name* in the   |       |
|                 | Unicode database                |       |
+-----------------+---------------------------------+-------+
| ``\uxxxx``      | Character with 16-bit hex value | \(4)  |
|                 | *xxxx*                          |       |
+-----------------+---------------------------------+-------+
| ``\Uxxxxxxxx``  | Character with 32-bit hex value | \(5)  |
|                 | *xxxxxxxx*                      |       |
+-----------------+---------------------------------+-------+
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	489
Notes:

(1)
   As in Standard C, up to three octal digits are accepted.

(2)
   Unlike in Standard C, exactly two hex digits are required.

(3)
   In a bytes literal, hexadecimal and octal escapes denote the byte with the
   given value. In a string literal, these escapes denote a Unicode character
   with the given value.

(4)
   Individual code units which form parts of a surrogate pair can be encoded using
   this escape sequence. Exactly four hex digits are required.

(5)
   Any Unicode character can be encoded this way, but characters outside the Basic
   Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is
   compiled to use 16-bit code units (the default). Individual code units which
   form parts of a surrogate pair can be encoded using this escape sequence.
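A few of the escapes from the tables, written out as literals. The variable
names are chosen for this example only.

```python
text_hex = "\x41"            # hex escape in a string: the character U+0041
text_octal = "\101"          # octal escape for the same character
bytes_hex = b"\x41"          # in a bytes literal the escape denotes a byte
named = "\N{LATIN SMALL LETTER B}"
short_unicode = "\u0062"     # four hex digits
long_unicode = "\U00000062"  # eight hex digits
joined_line = "ab\
cd"                          # backslash-newline is dropped from the string

print(text_hex, text_octal, bytes_hex, named, joined_line)
```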
				512
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	513
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	514	.. index:: unrecognized escape sequence
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	515
				516	Unlike Standard C, all unrecognized escape sequences are left in the string
				517	unchanged, i.e., the backslash is left in the string. (This behavior is
				518	useful when debugging: if an escape sequence is mistyped, the resulting output
				519	is more easily recognized as broken.) It is also important to note that the
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	520	escape sequences only recognized in string literals fall into the category of
				521	unrecognized escapes for bytes literals.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	522
When an ``'r'`` or ``'R'`` prefix is used in a string literal, a character
following a backslash is included in the string without change, and *all
backslashes are left in the string*. For example, the string literal
``r"\n"`` consists of two characters: a backslash and a lowercase ``'n'``.
Escape sequences such as ``\uXXXX`` and ``\UXXXXXXXX`` are likewise not
processed in raw strings.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	531
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	532	Even in a raw string, string quotes can be escaped with a backslash, but the
				533	backslash remains in the string; for example, ``r"\""`` is a valid string
				534	literal consisting of two characters: a backslash and a double quote; ``r"\"``
				535	is not a valid string literal (even a raw string cannot end in an odd number of
				536	backslashes). Specifically, a raw string cannot end in a single backslash
				537	(since the backslash would escape the following quote character). Note also
				538	that a single backslash followed by a newline is interpreted as those two
				539	characters as part of the string, not as a line continuation.
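The rules for quotes and trailing backslashes in raw strings can be checked as
follows. The helper ``compiles`` is a name made up for this sketch.

```python
raw_quote = r"\""      # backslash and double quote, both kept
raw_newline = r"\n"    # backslash and the letter n, not a linefeed


def compiles(source):
    """Return True if *source* compiles as a module, False on SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# The source text  s = r"\"  leaves the string literal unterminated.
ends_in_backslash = compiles('s = r"\\"\n')

print(len(raw_quote), len(raw_newline), ends_in_backslash)
```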
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	540
				541
				542	.. _string-catenation:
				543
				544	String literal concatenation
				545	----------------------------
				546
				547	Multiple adjacent string literals (delimited by whitespace), possibly using
				548	different quoting conventions, are allowed, and their meaning is the same as
				549	their concatenation. Thus, ``"hello" 'world'`` is equivalent to
				550	``"helloworld"``. This feature can be used to reduce the number of backslashes
				551	needed, to split long strings conveniently across long lines, or even to add
				552	comments to parts of strings, for example::
				553
   re.compile("[A-Za-z_]"       # letter or underscore
              "[A-Za-z0-9_]*"   # letter, digit or underscore
             )
				557
				558	Note that this feature is defined at the syntactical level, but implemented at
				559	compile time. The '+' operator must be used to concatenate string expressions
				560	at run time. Also note that literal concatenation can use different quoting
				561	styles for each component (even mixing raw strings and triple quoted strings).
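A runnable variant of the idea, including a mix of quoting styles. The
variable names and sample strings are invented for the illustration.

```python
import re

# Adjacent literals are joined into one pattern; each part carries a comment.
pattern = re.compile("[A-Za-z_]"       # letter or underscore
                     "[A-Za-z0-9_]*"   # letter, digit or underscore
                     )

greeting = "hello" 'world'             # different quote characters
mixed = r"\d+" """ items"""            # raw string next to a triple-quoted one

print(greeting, mixed, bool(pattern.fullmatch("_name1")))
```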
				562
				563
				564	.. _numbers:
				565
				566	Numeric literals
				567	----------------
				568
.. index:: number, numeric literal, integer literal
   floating point literal, hexadecimal literal
   octal literal, binary literal, decimal literal, imaginary literal, complex literal
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	572
There are three types of numeric literals: integers, floating point
numbers, and imaginary numbers. There are no complex literals
(complex numbers can be formed by adding a real number and an imaginary number).
				576
				577	Note that numeric literals do not include a sign; a phrase like ``-1`` is
				578	actually an expression composed of the unary operator '``-``' and the literal
				579	``1``.
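
One way to observe this is with the standard :mod:`tokenize` module. The helper below is a hypothetical illustration, not part of the reference text: it lists the significant tokens of a source line, showing that the sign and the digits arrive as separate tokens and that a complex value is built from a sum of two literals.

```python
import io
import tokenize


def token_pairs(source):
    """Return (token type name, text) pairs, skipping line-end markers."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
    readline = io.StringIO(source).readline
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)
            if tok.type not in skip]


# "-1" is an operator token followed by a number token.
sign_tokens = token_pairs("-1\n")

# "3+4j" is a number, an operator, and an imaginary number.
complex_tokens = token_pairs("3+4j\n")
```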
				580
				581
				582	.. _integers:
				583
				584	Integer literals
				585	----------------
				586
				587	Integer literals are described by the following lexical definitions:
				588
				589	.. productionlist::
				590	integer: `decimalinteger` | `octinteger` | `hexinteger` | `bininteger`
				591	decimalinteger: `nonzerodigit` `digit`* | "0"+
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	592	nonzerodigit: "1"..."9"
				593	digit: "0"..."9"
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	594	octinteger: "0" ("o" | "O") `octdigit`+
				595	hexinteger: "0" ("x" | "X") `hexdigit`+
				596	bininteger: "0" ("b" | "B") `bindigit`+
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	597	octdigit: "0"..."7"
				598	hexdigit: `digit` | "a"..."f" | "A"..."F"
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	599	bindigit: "0" | "1"
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	600
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	601	There is no limit for the length of integer literals apart from what can be
				602	stored in available memory.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	603
				604	Note that leading zeros in a non-zero decimal number are not allowed. This is
				605	for disambiguation with C-style octal literals, which Python used before version
				606	3.0.
				607
				608	Some examples of integer literals::
				609
				610	7     2147483647                        0o177    0b100110111
				611	3     79228162514264337593543950336     0o377    0x100000000
				612	      79228162514264337593543950336              0xdeadbeef
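
The following sketch checks a few of these literals against their decimal values and probes the leading-zero rule. The helper ``is_valid_literal`` is a hypothetical name introduced here for illustration; it simply asks the compiler whether a piece of text is accepted as an expression.

```python
# Literals in different bases denote ordinary integer values.
octal = 0o177
binary = 0b100110111
hexadecimal = 0xdeadbeef

# Length is limited only by available memory.
large = 79228162514264337593543950336


def is_valid_literal(text):
    """Return True if the compiler accepts text as an expression."""
    try:
        compile(text, "<literal>", "eval")
    except SyntaxError:
        return False
    return True


# A non-zero decimal literal may not start with a zero,
# but a run of zeros alone is allowed.
leading_zero_ok = is_valid_literal("017")
all_zeros_ok = is_valid_literal("000")
```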
				613
				614
				615	.. _floating:
				616
				617	Floating point literals
				618	-----------------------
				619
				620	Floating point literals are described by the following lexical definitions:
				621
				622	.. productionlist::
				623	floatnumber: `pointfloat` | `exponentfloat`
				624	pointfloat: [`intpart`] `fraction` | `intpart` "."
				625	exponentfloat: (`intpart` | `pointfloat`) `exponent`
				626	intpart: `digit`+
				627	fraction: "." `digit`+
				628	exponent: ("e" | "E") ["+" | "-"] `digit`+
				629
				630	Note that the integer and exponent parts are always interpreted using radix 10.
				631	For example, ``077e010`` is legal, and denotes the same number as ``77e10``. The
				632	allowed range of floating point literals is implementation-dependent. Some
				633	examples of floating point literals::
				634
				635	3.14 10. .001 1e100 3.14e-10 0e0
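
A short sketch of these points, using illustrative variable names. The value of ``sys.float_info.max`` depends on the implementation, so it is only inspected here, not assumed.

```python
import sys

# Leading zeros are allowed in floats; both parts are read in radix 10.
same_value = 077e010 == 77e10

# The integer part or the fraction may be omitted, but not both.
trailing_point = 10.
leading_point = .001
zero = 0e0

# The largest representable float is implementation-dependent.
largest = sys.float_info.max
```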
				636
				637	Note that numeric literals do not include a sign; a phrase like ``-1`` is
				638	actually an expression composed of the unary operator ``-`` and the literal
				639	``1``.
				640
				641
				642	.. _imaginary:
				643
				644	Imaginary literals
				645	------------------
				646
				647	Imaginary literals are described by the following lexical definitions:
				648
				649	.. productionlist::
				650	imagnumber: (`floatnumber` | `intpart`) ("j" | "J")
				651
				652	An imaginary literal yields a complex number with a real part of 0.0. Complex
				653	numbers are represented as a pair of floating point numbers and have the same
				654	restrictions on their range. To create a complex number with a nonzero real
				655	part, add a floating point number to it, e.g., ``(3+4j)``. Some examples of
				656	imaginary literals::
				657
				658	3.14j 10.j 10j .001j 1e100j 3.14e-10j
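
As a brief illustration (variable names are assumptions made for this example), an imaginary literal on its own has a zero real part, and adding a real number yields a general complex value:

```python
# An imaginary literal is a complex number whose real part is 0.0.
z = 10j

# A nonzero real part comes from adding a real number.
w = 3 + 4j

# The integer-style and float-style spellings denote the same value.
same = 10.j == 10j
```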
				659
				660
				661	.. _operators:
				662
				663	Operators
				664	=========
				665
				666	.. index:: single: operators
				667
				668	The following tokens are operators::
				669
				670	+ - * ** / // %
				671	<< >> & | ^ ~
				672	< > <= >= == !=
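
Multi-character operators such as ``//`` and ``**`` are single tokens rather than pairs of shorter ones. A minimal sketch using the standard :mod:`tokenize` module (the helper name ``operator_tokens`` is hypothetical):

```python
import io
import tokenize


def operator_tokens(source):
    """Return (exact token name, text) for each operator token."""
    readline = io.StringIO(source).readline
    return [(tokenize.tok_name[tok.exact_type], tok.string)
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.OP]


# Each multi-character operator is reported as one token.
ops = operator_tokens("a // b ** c << d != e\n")
op_texts = [text for _, text in ops]
```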
				673
				674
				675	.. _delimiters:
				676
				677	Delimiters
				678	==========
				679
				680	.. index:: single: delimiters
				681
				682	The following tokens serve as delimiters in the grammar::
				683
				684	( ) [ ] { } @
				685	, : . = ;
				686	+= -= *= /= //= %=
				687	&= |= ^= >>= <<= **=
				688
				689	The period can also occur in floating-point and imaginary literals. A sequence
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	690	of three periods has a special meaning as an ellipsis literal. The second half
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	691	of the list, the augmented assignment operators, serve lexically as delimiters,
				692	but also perform an operation.
				693
				694	The following printing ASCII characters have special meaning as part of other
				695	tokens or are otherwise significant to the lexical analyzer::
				696
				697	' " # \
				698
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	699	The following printing ASCII characters are not used in Python. Their
				700	occurrence outside string literals and comments is an unconditional error::
				701
				702	$ ? `
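
The sketch below illustrates three of the points in this section: the ellipsis literal, an augmented assignment that both delimits and operates, and the rejection of ``$`` and ``?`` outside strings and comments. The helper ``compiles`` is a hypothetical name used only for this example.

```python
def compiles(source):
    """Return True if source is accepted by the compiler."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# Three periods form the ellipsis literal.
dots = ...

# An augmented assignment also performs an operation.
x = 5
x **= 2

# '$' and '?' are errors in code, but harmless in strings and comments.
dollar_in_code = compiles("price = 1 $ 2")
question_in_code = compiles("x = a ? b")
in_string_and_comment = compiles("ok = '$ ?'  # $ ? in a comment")
```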