Blame - Doc/reference/lexical_analysis.rst - platform/external/python/cpython3

				2	.. _lexical:
				3
				4	****************
				5	Lexical analysis
				6	****************
				7
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	8	.. index:: lexical analysis, parser, token
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	9
				10	A Python program is read by a parser. Input to the parser is a stream of
				11	tokens, generated by the lexical analyzer. This chapter describes how the
				12	lexical analyzer breaks a file into tokens.
				13
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	14	Python reads program text as Unicode code points; the encoding of a source file
				15	can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
				16	for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
				17	raised.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	18
				19
				20	.. _line-structure:
				21
				22	Line structure
				23	==============
				24
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	25	.. index:: line structure
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	26
				27	A Python program is divided into a number of logical lines.
				28
				29
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	30	.. _logical-lines:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	31
				32	Logical lines
				33	-------------
				34
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	35	.. index:: logical line, physical line, line joining, NEWLINE token
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	36
				37	The end of a logical line is represented by the token NEWLINE. Statements
				38	cannot cross logical line boundaries except where NEWLINE is allowed by the
				39	syntax (e.g., between statements in compound statements). A logical line is
				40	constructed from one or more physical lines by following the explicit or
				41	implicit line joining rules.
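
The relation between physical and logical lines can be observed with the
standard :mod:`tokenize` module. The following is a minimal sketch, not part of
the language definition: the sample source is made up, and the ``NL`` token is
a convention of :mod:`tokenize` for line breaks that do not end a logical line.

```python
import io
import tokenize

# Three physical lines, two logical lines: the open parenthesis joins
# the first two physical lines (implicit line joining).
source = "x = (1 +\n     2)\ny = 3\n"

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# NEWLINE ends a logical line; NL marks a break inside a logical line.
newline_count = sum(1 for tok in tokens if tok.type == tokenize.NEWLINE)
nl_count = sum(1 for tok in tokens if tok.type == tokenize.NL)

print(newline_count, nl_count)
```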
				42
				43
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	44	.. _physical-lines:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	45
				46	Physical lines
				47	--------------
				48
				49	A physical line is a sequence of characters terminated by an end-of-line
				50	sequence. In source files, any of the standard platform line termination
				51	sequences can be used - the Unix form using ASCII LF (linefeed), the Windows
				52	form using the ASCII sequence CR LF (return followed by linefeed), or the
				53	Macintosh form using the ASCII CR (return) character. All of these forms can be
				54	used equally, regardless of platform.
				55
				56	When embedding Python, source code strings should be passed to Python APIs using
				57	the standard C conventions for newline characters (the ``\n`` character,
				58	representing ASCII LF, is the line terminator).
				59
				60
				61	.. _comments:
				62
				63	Comments
				64	--------
				65
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	66	.. index:: comment, hash character
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	67
				68	A comment starts with a hash character (``#``) that is not part of a string
				69	literal, and ends at the end of the physical line. A comment signifies the end
				70	of the logical line unless the implicit line joining rules are invoked. Comments
				71	are ignored by the syntax; they are not tokens.
				72
				73
				74	.. _encodings:
				75
				76	Encoding declarations
				77	---------------------
				78
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	79	.. index:: source character set, encodings
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	80
If a comment in the first or second line of the Python script matches the
regular expression ``coding[=:]\s*([-\w.]+)``, this comment is processed as an
encoding declaration; the first group of this expression names the encoding of
the source code file. The recommended forms of this expression are ::

   # -*- coding: <encoding-name> -*-

which is recognized also by GNU Emacs, and ::

   # vim:fileencoding=<encoding-name>

which is recognized by Bram Moolenaar's VIM.
				93
				94	If no encoding declaration is found, the default encoding is UTF-8. In
				95	addition, if the first bytes of the file are the UTF-8 byte-order mark
				96	(``b'\xef\xbb\xbf'``), the declared file encoding is UTF-8 (this is supported,
				97	among others, by Microsoft's :program:`notepad`).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	98
				99	If an encoding is declared, the encoding name must be recognized by Python. The
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	100	encoding is used for all lexical analysis, including string literals, comments
				101	and identifiers. The encoding declaration must appear on a line of its own.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	102
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	103	A list of standard encodings can be found in the section
				104	:ref:`standard-encodings`.
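
As an illustration, the regular expression quoted in this section can be
applied with the :mod:`re` module. This is only a sketch of the matching step:
the helper name ``declared_encoding`` is hypothetical, and it checks neither
the line position nor the byte-order mark.

```python
import re

# The pattern quoted above, applied to candidate first or second lines.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")


def declared_encoding(line):
    """Return the encoding named in a comment line, or None."""
    if not line.lstrip().startswith("#"):
        return None  # only comments can carry a declaration
    match = CODING_RE.search(line)
    return match.group(1) if match else None


emacs_style = declared_encoding("# -*- coding: latin-1 -*-")
vim_style = declared_encoding("# vim:fileencoding=utf-8")
no_declaration = declared_encoding("# just a comment")
```

The standard library function :func:`tokenize.detect_encoding` performs the
complete detection, including the byte-order mark.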
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	105
				106
				107	.. _explicit-joining:
				108
				109	Explicit line joining
				110	---------------------
				111
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	112	.. index:: physical line, line joining, line continuation, backslash character
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	113
Two or more physical lines may be joined into logical lines using backslash
characters (``\``), as follows: when a physical line ends in a backslash that is
not part of a string literal or comment, it is joined with the following line,
forming a single logical line, deleting the backslash and the following
end-of-line character. For example::

   if 1900 < year < 2100 and 1 <= month <= 12 \
      and 1 <= day <= 31 and 0 <= hour < 24 \
      and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
           return 1
				124
				125	A line ending in a backslash cannot carry a comment. A backslash does not
				126	continue a comment. A backslash does not continue a token except for string
				127	literals (i.e., tokens other than string literals cannot be split across
				128	physical lines using a backslash). A backslash is illegal elsewhere on a line
				129	outside a string literal.
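
The rules above can be checked with a small sketch. The helper ``compiles`` is
hypothetical; it only reports whether :func:`compile` accepts a piece of
source text.

```python
# A backslash at the very end of a physical line joins it with the next one.
total = 1 + 2 \
    + 3


def compiles(source):
    """Return True if source compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# Backslash immediately followed by the end of the line: accepted.
joined_ok = compiles("x = 1 + \\\n    2\n")

# A comment after the backslash: the backslash no longer ends the line.
comment_after_backslash = compiles("x = 1 + \\ # comment\n    2\n")
```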
				130
				131
				132	.. _implicit-joining:
				133
				134	Implicit line joining
				135	---------------------
				136
Expressions in parentheses, square brackets or curly braces can be split over
more than one physical line without using backslashes. For example::

   month_names = ['Januari', 'Februari', 'Maart',      # These are the
                  'April',   'Mei',      'Juni',       # Dutch names
                  'Juli',    'Augustus', 'September',  # for the months
                  'Oktober', 'November', 'December']   # of the year
				144
				145	Implicitly continued lines can carry comments. The indentation of the
				146	continuation lines is not important. Blank continuation lines are allowed.
				147	There is no NEWLINE token between implicit continuation lines. Implicitly
				148	continued lines can also occur within triple-quoted strings (see below); in that
				149	case they cannot carry comments.
				150
				151
				152	.. _blank-lines:
				153
				154	Blank lines
				155	-----------
				156
				157	.. index:: single: blank line
				158
				159	A logical line that contains only spaces, tabs, formfeeds and possibly a
				160	comment, is ignored (i.e., no NEWLINE token is generated). During interactive
				161	input of statements, handling of a blank line may differ depending on the
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	162	implementation of the read-eval-print loop. In the standard interactive
				163	interpreter, an entirely blank logical line (i.e. one containing not even
				164	whitespace or a comment) terminates a multi-line statement.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	165
				166
				167	.. _indentation:
				168
				169	Indentation
				170	-----------
				171
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	172	.. index:: indentation, leading whitespace, space, tab, grouping, statement grouping
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	173
				174	Leading whitespace (spaces and tabs) at the beginning of a logical line is used
				175	to compute the indentation level of the line, which in turn is used to determine
				176	the grouping of statements.
				177
				178	First, tabs are replaced (from left to right) by one to eight spaces such that
				179	the total number of characters up to and including the replacement is a multiple
				180	of eight (this is intended to be the same rule as used by Unix). The total
				181	number of spaces preceding the first non-blank character then determines the
				182	line's indentation. Indentation cannot be split over multiple physical lines
				183	using backslashes; the whitespace up to the first backslash determines the
				184	indentation.
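
The tab replacement rule can be sketched as follows. The function name
``indentation_width`` and the ``tabsize`` parameter are assumptions made for
this illustration; formfeed handling is left out.

```python
def indentation_width(line, tabsize=8):
    """Return the indentation of line, expanding tabs to multiples of tabsize."""
    column = 0
    for char in line:
        if char == " ":
            column += 1
        elif char == "\t":
            # Advance to the next multiple of tabsize (one to eight columns).
            column = (column // tabsize + 1) * tabsize
        else:
            break  # first non-blank character ends the indentation
    return column


single_tab = indentation_width("\tx")
spaces_then_tab = indentation_width("  \tx")
eight_spaces_then_tab = indentation_width("        \tx")
tab_between_spaces = indentation_width(" \t x")
```

Note that two spaces followed by a tab and a single tab both yield an
indentation of eight, which is one reason mixing the two is unwise.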
				185
				186	Cross-platform compatibility note: because of the nature of text editors on
				187	non-UNIX platforms, it is unwise to use a mixture of spaces and tabs for the
				188	indentation in a single source file. It should also be noted that different
				189	platforms may explicitly limit the maximum indentation level.
				190
				191	A formfeed character may be present at the start of the line; it will be ignored
				192	for the indentation calculations above. Formfeed characters occurring elsewhere
				193	in the leading whitespace have an undefined effect (for instance, they may reset
				194	the space count to zero).
				195
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	196	.. index:: INDENT token, DEDENT token
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	197
				198	The indentation levels of consecutive lines are used to generate INDENT and
				199	DEDENT tokens, using a stack, as follows.
				200
				201	Before the first line of the file is read, a single zero is pushed on the stack;
				202	this will never be popped off again. The numbers pushed on the stack will
				203	always be strictly increasing from bottom to top. At the beginning of each
				204	logical line, the line's indentation level is compared to the top of the stack.
				205	If it is equal, nothing happens. If it is larger, it is pushed on the stack, and
				206	one INDENT token is generated. If it is smaller, it must be one of the
				207	numbers occurring on the stack; all numbers on the stack that are larger are
				208	popped off, and for each number popped off a DEDENT token is generated. At the
				209	end of the file, a DEDENT token is generated for each number remaining on the
				210	stack that is larger than zero.
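
The stack algorithm described above can be sketched in a few lines. This is an
illustration under simplifying assumptions, not the tokenizer's actual code:
the input is a list of indentation levels, one per logical line, and the
marker ``"LINE"`` stands in for the tokens of each line.

```python
def indent_tokens(levels):
    """Yield INDENT/DEDENT markers for a sequence of indentation levels."""
    stack = [0]  # the initial zero is never popped
    for level in levels:
        if level > stack[-1]:
            stack.append(level)
            yield "INDENT"
        elif level < stack[-1]:
            if level not in stack:
                raise IndentationError("dedent does not match any outer level")
            while stack[-1] > level:
                stack.pop()
                yield "DEDENT"
        yield "LINE"
    # End of file: one DEDENT for each remaining number larger than zero.
    while stack[-1] > 0:
        stack.pop()
        yield "DEDENT"


result = list(indent_tokens([0, 4, 8, 4, 0, 4]))
```

A level such as ``4`` following the levels ``0, 8`` is smaller than the top of
the stack but does not occur on it, so the sketch raises
:exc:`IndentationError`.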
				211
Here is an example of a correctly (though confusingly) indented piece of Python
code::

   def perm(l):
           # Compute the list of all permutations of l
       if len(l) <= 1:
                     return [l]
       r = []
       for i in range(len(l)):
                s = l[:i] + l[i+1:]
                p = perm(s)
                for x in p:
                 r.append(l[i:i+1] + x)
       return r
				226
The following example shows various indentation errors::

    def perm(l):                       # error: first line indented
   for i in range(len(l)):             # error: not indented
       s = l[:i] + l[i+1:]
           p = perm(l[:i] + l[i+1:])   # error: unexpected indent
           for x in p:
                   r.append(l[i:i+1] + x)
               return r                # error: inconsistent dedent
				236
				237	(Actually, the first three errors are detected by the parser; only the last
				238	error is found by the lexical analyzer --- the indentation of ``return r`` does
				239	not match a level popped off the stack.)
				240
				241
				242	.. _whitespace:
				243
				244	Whitespace between tokens
				245	-------------------------
				246
				247	Except at the beginning of a logical line or in string literals, the whitespace
				248	characters space, tab and formfeed can be used interchangeably to separate
				249	tokens. Whitespace is needed between two tokens only if their concatenation
				250	could otherwise be interpreted as a different token (e.g., ab is one token, but
				251	a b is two tokens).
				252
				253
				254	.. _other-tokens:
				255
				256	Other tokens
				257	============
				258
				259	Besides NEWLINE, INDENT and DEDENT, the following categories of tokens exist:
				260	identifiers, keywords, literals, operators, and delimiters. Whitespace
				261	characters (other than line terminators, discussed earlier) are not tokens, but
				262	serve to delimit tokens. Where ambiguity exists, a token comprises the longest
				263	possible string that forms a legal token, when read from left to right.
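
The longest-match rule can be observed with the :mod:`tokenize` module. The
helper ``token_strings`` is a hypothetical name used only for this sketch.

```python
import io
import tokenize


def token_strings(source):
    """Return the text of every name, number and operator token in source."""
    wanted = (tokenize.NAME, tokenize.NUMBER, tokenize.OP)
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type in wanted]


one_name = token_strings("ab\n")           # no whitespace: a single token
two_names = token_strings("a b\n")         # whitespace separates two tokens
longest_operator = token_strings("a<<=b\n")  # "<<=" wins over "<", "<<"
```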
				264
				265
				266	.. _identifiers:
				267
				268	Identifiers and keywords
				269	========================
				270
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	271	.. index:: identifier, name
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	272
				273	Identifiers (also referred to as names) are described by the following lexical
				274	definitions:
				275
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	276	The syntax of identifiers in Python is based on the Unicode standard annex
				277	UAX-31, with elaboration and changes as defined below.
				278
				279	Within the ASCII range (U+0001..U+007F), the valid characters for identifiers
				280	are the same as in Python 2.5; Python 3.0 introduces additional
				281	characters from outside the ASCII range (see :pep:`3131`). For other
				282	characters, the classification uses the version of the Unicode Character
				283	Database as included in the :mod:`unicodedata` module.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	284
				285	Identifiers are unlimited in length. Case is significant.
				286
.. productionlist::
   identifier: `id_start` `id_continue`*
   id_start: <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
   id_continue: <all characters in `id_start`, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	291
				292	The Unicode category codes mentioned above stand for:
				293
				294	* Lu - uppercase letters
				295	* Ll - lowercase letters
				296	* Lt - titlecase letters
				297	* Lm - modifier letters
				298	* Lo - other letters
				299	* Nl - letter numbers
				300	* Mn - nonspacing marks
				301	* Mc - spacing combining marks
				302	* Nd - decimal numbers
				303	* Pc - connector punctuations
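
The category of an individual character can be looked up with
:func:`unicodedata.category`, and :meth:`str.isidentifier` reports whether a
whole string has the form of an identifier. The sample characters below are
chosen only for illustration.

```python
import unicodedata

samples = {
    "A": unicodedata.category("A"),            # uppercase letter
    "a": unicodedata.category("a"),            # lowercase letter
    "_": unicodedata.category("_"),            # connector punctuation
    "7": unicodedata.category("7"),            # decimal number
    "\u2167": unicodedata.category("\u2167"),  # ROMAN NUMERAL EIGHT
    "\u0301": unicodedata.category("\u0301"),  # COMBINING ACUTE ACCENT
}

starts_with_digit = "7up".isidentifier()        # Nd is not in id_start
non_ascii_name = "\u00e9t\u00e9".isidentifier()  # letters outside ASCII
```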
				304
All identifiers are converted into the normal form NFKC while parsing; comparison
of identifiers is based on NFKC.
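
A short sketch of the effect of normalization, assuming CPython behavior: two
spellings of the same name that differ only in composed versus decomposed
characters refer to the same identifier. The name used here is arbitrary.

```python
import unicodedata

decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
composed = "caf\u00e9"      # precomposed LATIN SMALL LETTER E WITH ACUTE

namespace = {}
exec(decomposed + " = 1", namespace)  # assign using the decomposed spelling

# The name is stored in normalized (composed) form.
stored_names = [name for name in namespace if name != "__builtins__"]
```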
				307
				308	A non-normative HTML file listing all valid identifier characters for Unicode
				309	4.1 can be found at
				310	http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	311
Mark Summerfield	051d1dd	2007-11-20 13:22:19 +0000	[diff] [blame]	312	See :pep:`3131` for further details.
				313
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	314	.. _keywords:
				315
				316	Keywords
				317	--------
				318
				319	.. index::
				320	single: keyword
				321	single: reserved word
				322
				323	The following identifiers are used as reserved words, or keywords of the
				324	language, and cannot be used as ordinary identifiers. They must be spelled
				325	exactly as written here::
				326
   False      class      finally    is         return
   None       continue   for        lambda     try
   True       def        from       nonlocal   while
   and        del        global     not        with
   as         elif       if         or         yield
   assert     else       import     pass
   break      except     in         raise
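
The list can be checked against a running interpreter with the :mod:`keyword`
module. This is a sketch; later Python versions may reserve further words in
addition to the ones listed here.

```python
import keyword

listed = """
False class finally is return
None continue for lambda try
True def from nonlocal while
and del global not with
as elif if or yield
assert else import pass
break except in raise
""".split()

all_reserved = all(keyword.iskeyword(word) for word in listed)
case_matters = keyword.iskeyword("none")  # spelling must match exactly
```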
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	334
				335	.. _id-classes:
				336
				337	Reserved classes of identifiers
				338	-------------------------------
				339
				340	Certain classes of identifiers (besides keywords) have special meanings. These
				341	classes are identified by the patterns of leading and trailing underscore
				342	characters:
				343
``_*``
   Not imported by ``from module import *``. The special identifier ``_`` is used
   in the interactive interpreter to store the result of the last evaluation; it is
   stored in the :mod:`builtins` module. When not in interactive mode, ``_``
   has no special meaning and is not defined. See section :ref:`import`.

   .. note::

      The name ``_`` is often used in conjunction with internationalization;
      refer to the documentation for the :mod:`gettext` module for more
      information on this convention.

``__*__``
   System-defined names. These names are defined by the interpreter and its
   implementation (including the standard library); applications should not expect
   to define additional names using this convention. The set of names of this
   class defined by Python may be extended in future versions. See section
   :ref:`specialnames`.

``__*``
   Class-private names. Names in this category, when used within the context of a
   class definition, are re-written to use a mangled form to help avoid name
   clashes between "private" attributes of base and derived classes. See section
   :ref:`atom-identifiers`.
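
A minimal sketch of the class-private rewriting; the class and attribute names
are invented for the example.

```python
class Base:
    def __init__(self):
        self.__token = "base"        # stored as _Base__token


class Derived(Base):
    def __init__(self):
        super().__init__()
        self.__token = "derived"     # stored as _Derived__token, no clash


instance_attributes = vars(Derived())
```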
				368
				369
				370	.. _literals:
				371
				372	Literals
				373	========
				374
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	375	.. index:: literal, constant
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	376
				377	Literals are notations for constant values of some built-in types.
				378
				379
				380	.. _strings:
				381
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	382	String and Bytes literals
				383	-------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	384
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	385	.. index:: string literal, bytes literal, ASCII
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	386
				387	String literals are described by the following lexical definitions:
				388
.. productionlist::
   stringliteral: [`stringprefix`](`shortstring` | `longstring`)
   stringprefix: "r" | "R"
   shortstring: "'" `shortstringitem`* "'" | '"' `shortstringitem`* '"'
   longstring: "'''" `longstringitem`* "'''" | '"""' `longstringitem`* '"""'
   shortstringitem: `shortstringchar` | `stringescapeseq`
   longstringitem: `longstringchar` | `stringescapeseq`
   shortstringchar: <any source character except "\" or newline or the quote>
   longstringchar: <any source character except "\">
   stringescapeseq: "\" <any source character>

.. productionlist::
   bytesliteral: `bytesprefix`(`shortbytes` | `longbytes`)
   bytesprefix: "b" | "B"
   shortbytes: "'" `shortbytesitem`* "'" | '"' `shortbytesitem`* '"'
   longbytes: "'''" `longbytesitem`* "'''" | '"""' `longbytesitem`* '"""'
   shortbytesitem: `shortbyteschar` | `bytesescapeseq`
   longbytesitem: `longbyteschar` | `bytesescapeseq`
   shortbyteschar: <any ASCII character except "\" or newline or the quote>
   longbyteschar: <any ASCII character except "\">
   bytesescapeseq: "\" <any ASCII character>
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	410
				411	One syntactic restriction not indicated by these productions is that whitespace
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	412	is not allowed between the :token:`stringprefix` or :token:`bytesprefix` and the
				413	rest of the literal. The source character set is defined by the encoding
				414	declaration; it is UTF-8 if no encoding declaration is given in the source file;
				415	see section :ref:`encodings`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	416
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	417	.. index:: triple-quoted string, Unicode Consortium, raw string
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	418
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	419	In plain English: Both types of literals can be enclosed in matching single quotes
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	420	(``'``) or double quotes (``"``). They can also be enclosed in matching groups
				421	of three single or double quotes (these are generally referred to as
				422	triple-quoted strings). The backslash (``\``) character is used to escape
				423	characters that otherwise have a special meaning, such as newline, backslash
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	424	itself, or the quote character.
				425
				426	String literals may optionally be prefixed with a letter ``'r'`` or ``'R'``;
				427	such strings are called :dfn:`raw strings` and use different rules for
				428	interpreting backslash escape sequences.
				429
				430	Bytes literals are always prefixed with ``'b'`` or ``'B'``; they produce an
				431	instance of the :class:`bytes` type instead of the :class:`str` type. They
				432	may only contain ASCII characters; bytes with a numeric value of 128 or greater
				433	must be expressed with escapes.
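
For instance, a byte with value 233 has to be written as an escape in a bytes
literal. The sketch below uses a hypothetical helper, ``compiles``, to show
that a non-ASCII character written directly in a bytes literal is rejected.

```python
data = b"caf\xe9"   # four bytes; the last one has the value 0xE9
text = "caf\xe9"    # in a string, the same escape denotes U+00E9


def compiles(source):
    """Return True if source compiles as an expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# The source text passed here is b'caf\u00e9' with a literal non-ASCII letter.
non_ascii_in_bytes = compiles("b'caf\u00e9'")
```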
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	434
				435	In triple-quoted strings, unescaped newlines and quotes are allowed (and are
				436	retained), except that three unescaped quotes in a row terminate the string. (A
				437	"quote" is the character used to open the string, i.e. either ``'`` or ``"``.)
				438
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	439	.. index:: physical line, escape sequence, Standard C, C
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	440
				441	Unless an ``'r'`` or ``'R'`` prefix is present, escape sequences in strings are
				442	interpreted according to rules similar to those used by Standard C. The
				443	recognized escape sequences are:
				444
+-----------------+---------------------------------+-------+
| Escape Sequence | Meaning                         | Notes |
+=================+=================================+=======+
| ``\newline``    | Backslash and newline ignored   |       |
+-----------------+---------------------------------+-------+
| ``\\``          | Backslash (``\``)               |       |
+-----------------+---------------------------------+-------+
| ``\'``          | Single quote (``'``)            |       |
+-----------------+---------------------------------+-------+
| ``\"``          | Double quote (``"``)            |       |
+-----------------+---------------------------------+-------+
| ``\a``          | ASCII Bell (BEL)                |       |
+-----------------+---------------------------------+-------+
| ``\b``          | ASCII Backspace (BS)            |       |
+-----------------+---------------------------------+-------+
| ``\f``          | ASCII Formfeed (FF)             |       |
+-----------------+---------------------------------+-------+
| ``\n``          | ASCII Linefeed (LF)             |       |
+-----------------+---------------------------------+-------+
| ``\r``          | ASCII Carriage Return (CR)      |       |
+-----------------+---------------------------------+-------+
| ``\t``          | ASCII Horizontal Tab (TAB)      |       |
+-----------------+---------------------------------+-------+
| ``\v``          | ASCII Vertical Tab (VT)         |       |
+-----------------+---------------------------------+-------+
| ``\ooo``        | Character with octal value      | (1,3) |
|                 | ooo                             |       |
+-----------------+---------------------------------+-------+
| ``\xhh``        | Character with hex value hh     | (2,3) |
+-----------------+---------------------------------+-------+
				475
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	476	Escape sequences only recognized in string literals are:
				477
+-----------------+---------------------------------+-------+
| Escape Sequence | Meaning                         | Notes |
+=================+=================================+=======+
| ``\N{name}``    | Character named name in the     |       |
|                 | Unicode database                |       |
+-----------------+---------------------------------+-------+
| ``\uxxxx``      | Character with 16-bit hex value | \(4)  |
|                 | xxxx                            |       |
+-----------------+---------------------------------+-------+
| ``\Uxxxxxxxx``  | Character with 32-bit hex value | \(5)  |
|                 | xxxxxxxx                        |       |
+-----------------+---------------------------------+-------+
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	490
				491	Notes:
				492
(1)
   As in Standard C, up to three octal digits are accepted.

(2)
   Unlike in Standard C, at most two hex digits are accepted.

(3)
   In a bytes literal, hexadecimal and octal escapes denote the byte with the
   given value. In a string literal, these escapes denote a Unicode character
   with the given value.

(4)
   Individual code units which form parts of a surrogate pair can be encoded using
   this escape sequence.

(5)
   Any Unicode character can be encoded this way, but characters outside the Basic
   Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is
   compiled to use 16-bit code units (the default). Individual code units which
   form parts of a surrogate pair can be encoded using this escape sequence.
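
A few of the escape sequences from the tables, written out as a sketch. Each
entry compares an escape with the character it is expected to denote; the
dictionary keys are labels chosen for this example.

```python
checks = {
    "octal": "\101" == "A",
    "hex": "\x41" == "A",
    "named": "\N{LATIN SMALL LETTER A}" == "a",
    "16-bit": "\u0041" == "A",
    "32-bit": "\U00000041" == "A",
    "linefeed": len("\n") == 1,
}

# Backslash followed by a newline: both are ignored.
continued = "ab\
cd"

# In a bytes literal, a hex escape denotes the byte with that value.
byte_from_hex = b"\x41"
```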
				513
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	514
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	515	.. index:: unrecognized escape sequence
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	516
				517	Unlike Standard C, all unrecognized escape sequences are left in the string
				518	unchanged, i.e., the backslash is left in the string. (This behavior is
				519	useful when debugging: if an escape sequence is mistyped, the resulting output
				520	is more easily recognized as broken.) It is also important to note that the
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	521	escape sequences only recognized in string literals fall into the category of
				522	unrecognized escapes for bytes literals.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	523
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	524	When an ``'r'`` or ``'R'`` prefix is used in a string literal, then the
				525	``\uXXXX`` and ``\UXXXXXXXX`` escape sequences are processed while *all other
				526	backslashes are left in the string*. For example, the string literal
				527	``r"\u0062\n"`` consists of three Unicode characters: 'LATIN SMALL LETTER B',
				528	'REVERSE SOLIDUS', and 'LATIN SMALL LETTER N'. Backslashes can be escaped with a
				529	preceding backslash; however, both remain in the string. As a result,
				530	``\uXXXX`` escape sequences are only recognized when there is an odd number of
				531	backslashes.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	532
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	533	Even in a raw string, string quotes can be escaped with a backslash, but the
				534	backslash remains in the string; for example, ``r"\""`` is a valid string
				535	literal consisting of two characters: a backslash and a double quote; ``r"\"``
				536	is not a valid string literal (even a raw string cannot end in an odd number of
				537	backslashes). Specifically, a raw string cannot end in a single backslash
				538	(since the backslash would escape the following quote character). Note also
				539	that a single backslash followed by a newline is interpreted as those two
				540	characters as part of the string, not as a line continuation.
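
The quoting rules for raw strings can be illustrated as follows. The helper
``compiles`` is hypothetical and only reports whether an expression is
accepted by :func:`compile`.

```python
quote_kept = r"\""      # two characters: a backslash and a double quote
newline_pair = r"\n"    # two characters: a backslash and the letter n


def compiles(source):
    """Return True if source compiles as an expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# The source text passed here is r"\" (a raw string ending in one backslash).
trailing_backslash = compiles('r"\\"')
```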
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	541
				542
				543	.. _string-catenation:
				544
				545	String literal concatenation
				546	----------------------------
				547
				548	Multiple adjacent string literals (delimited by whitespace), possibly using
				549	different quoting conventions, are allowed, and their meaning is the same as
				550	their concatenation. Thus, ``"hello" 'world'`` is equivalent to
				551	``"helloworld"``. This feature can be used to reduce the number of backslashes
				552	needed, to split long strings conveniently across long lines, or even to add
				553	comments to parts of strings, for example::
				554
   re.compile("[A-Za-z_]"       # letter or underscore
              "[A-Za-z0-9_]*"   # letter, digit or underscore
             )
				558
				559	Note that this feature is defined at the syntactical level, but implemented at
				560	compile time. The '+' operator must be used to concatenate string expressions
				561	at run time. Also note that literal concatenation can use different quoting
				562	styles for each component (even mixing raw strings and triple quoted strings).
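
A runnable sketch of literal concatenation, including a mix of quoting styles.
The variable names are chosen for the example.

```python
import re

greeting = "hello" 'world'

identifier = re.compile(
    "[A-Za-z_]"        # letter or underscore
    "[A-Za-z0-9_]*"    # letter, digit or underscore
)

# A raw string and a triple-quoted string combined into one literal.
mixed = r"\d+" """ items"""
```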
				563
				564
				565	.. _numbers:
				566
				567	Numeric literals
				568	----------------
				569
.. index:: number, numeric literal, integer literal
   floating point literal, hexadecimal literal
   octal literal, binary literal, decimal literal, imaginary literal, complex literal
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	573
Georg Brandl	ba956ae	2007-11-29 17:24:34 +0000	[diff] [blame^]	574	There are three types of numeric literals: plain integers, floating point
				575	numbers, and imaginary numbers. There are no complex literals
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	576	(complex numbers can be formed by adding a real number and an imaginary number).
				577
				578	Note that numeric literals do not include a sign; a phrase like ``-1`` is
				579	actually an expression composed of the unary operator '``-``' and the literal
				580	``1``.
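
One way to observe this is to inspect the syntax tree that the standard :mod:`ast` module builds. The sketch below assumes Python 3.8 or later, where numeric literals appear as ``ast.Constant`` nodes. The variable names are illustrative.

```python
import ast

# Parse "-1" as an expression and inspect the resulting syntax tree.
node = ast.parse("-1", mode="eval").body

# The tree is a unary minus applied to the literal 1, not a literal -1.
is_unary_minus = isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub)
operand_value = node.operand.value

# A practical consequence: ** binds more tightly than the unary minus,
# so -2 ** 2 means -(2 ** 2).
power_result = -2 ** 2

# Likewise, 3+4j is an addition of two literals, not one complex literal.
complex_node = ast.parse("3+4j", mode="eval").body
is_addition = (isinstance(complex_node, ast.BinOp)
               and isinstance(complex_node.op, ast.Add))
```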
				581
				582
				583	.. _integers:
				584
				585	Integer literals
				586	----------------
				587
				588	Integer literals are described by the following lexical definitions:
				589
.. productionlist::
   integer: `decimalinteger` | `octinteger` | `hexinteger` | `bininteger`
   decimalinteger: `nonzerodigit` `digit`* | "0"+
   nonzerodigit: "1"..."9"
   digit: "0"..."9"
   octinteger: "0" ("o" | "O") `octdigit`+
   hexinteger: "0" ("x" | "X") `hexdigit`+
   bininteger: "0" ("b" | "B") `bindigit`+
   octdigit: "0"..."7"
   hexdigit: `digit` | "a"..."f" | "A"..."F"
   bindigit: "0" | "1"
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	601
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	602	There is no limit for the length of integer literals apart from what can be
				603	stored in available memory.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	604
				605	Note that leading zeros in a non-zero decimal number are not allowed. This is
				606	for disambiguation with C-style octal literals, which Python used before version
				607	3.0.
				608
				609	Some examples of integer literals::
				610
   7     2147483647                        0o177    0b100110111
   3     79228162514264337593543950336     0o377    0x100000000
         79228162514264337593543950336              0xdeadbeef
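
The following sketch evaluates a few of these literals and checks the leading-zero rule. The helper ``is_valid_literal`` is a hypothetical name introduced here for illustration. It relies only on the built-in ``compile()`` function.

```python
# Each prefix selects a radix; the resulting values are ordinary integers.
octal_value = 0o177
hex_value = 0xdeadbeef
binary_value = 0b100110111

# Integer literals are not limited to the size of a machine word.
big_value = 79228162514264337593543950336

# A run of zeros is a valid decimal literal.
zeros_value = 000


def is_valid_literal(text):
    """Return True if text compiles as a Python expression."""
    try:
        compile(text, "<example>", "eval")
        return True
    except SyntaxError:
        return False


# A nonzero decimal literal with a leading zero is rejected, while the
# explicit octal spelling is accepted.
c_style_octal_ok = is_valid_literal("0177")
prefixed_octal_ok = is_valid_literal("0o177")
```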
				614
				615
				616	.. _floating:
				617
				618	Floating point literals
				619	-----------------------
				620
				621	Floating point literals are described by the following lexical definitions:
				622
.. productionlist::
   floatnumber: `pointfloat` | `exponentfloat`
   pointfloat: [`intpart`] `fraction` | `intpart` "."
   exponentfloat: (`intpart` | `pointfloat`) `exponent`
   intpart: `digit`+
   fraction: "." `digit`+
   exponent: ("e" | "E") ["+" | "-"] `digit`+
				630
				631	Note that the integer and exponent parts are always interpreted using radix 10.
				632	For example, ``077e010`` is legal, and denotes the same number as ``77e10``. The
				633	allowed range of floating point literals is implementation-dependent. Some
				634	examples of floating point literals::
				635
   3.14    10.    .001    1e100    3.14e-10    0e0
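
A short sketch of these rules follows. The overflow behaviour shown in the last assignment is an assumption about CPython on a platform with IEEE 754 double precision floats. Other implementations may handle out-of-range literals differently.

```python
import math

# Leading zeros do not select octal in a floating point literal:
# both the integer part and the exponent are read in radix 10.
padded = 077e010
plain = 77e10

# The integer part or the fraction may be omitted, but not both.
trailing_point = 10.
leading_point = .001

# Assumption: on CPython with IEEE 754 doubles, a literal beyond the
# representable range evaluates to infinity rather than raising an error.
overflowed = 1e400
```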
				637
				641
				642
				643	.. _imaginary:
				644
				645	Imaginary literals
				646	------------------
				647
				648	Imaginary literals are described by the following lexical definitions:
				649
.. productionlist::
   imagnumber: (`floatnumber` | `intpart`) ("j" | "J")
				652
				653	An imaginary literal yields a complex number with a real part of 0.0. Complex
				654	numbers are represented as a pair of floating point numbers and have the same
				655	restrictions on their range. To create a complex number with a nonzero real
				656	part, add a floating point number to it, e.g., ``(3+4j)``. Some examples of
				657	imaginary literals::
				658
   3.14j    10.j    10j    .001j    1e100j    3.14e-10j
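
The sketch below illustrates how imaginary literals behave as values. The variable names are illustrative only.

```python
# An imaginary literal is a complex number whose real part is 0.0.
imaginary = 4j
real_part = imaginary.real
imag_part = imaginary.imag

# A nonzero real part comes from an addition, not from the literal itself.
combined = 3 + 4j

# The suffix may be upper or lower case, and the numeric part may be
# written as an integer part or as a floating point number.
same_value = 10.j == 10j == 10J

# Both parts are stored as floating point numbers.
parts_are_floats = isinstance(combined.real, float) and isinstance(combined.imag, float)
```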
				660
				661
				662	.. _operators:
				663
				664	Operators
				665	=========
				666
				667	.. index:: single: operators
				668
				669	The following tokens are operators::
				670
   +       -       *       **      /       //      %
   <<      >>      &       |       ^       ~
   <       >       <=      >=      ==      !=
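
To see how source text is split into these tokens, the standard :mod:`tokenize` module can be used. This is a minimal sketch. ``operator_names`` is a hypothetical helper introduced here, and :mod:`tokenize` reports operators and delimiters alike under the generic ``OP`` type.

```python
import io
import tokenize


def operator_names(source):
    """Return the exact token names of the OP tokens found in source."""
    readline = io.StringIO(source).readline
    return [
        tokenize.tok_name[tok.exact_type]
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.OP
    ]


# Multi-character operators such as << and // are single tokens,
# not two consecutive one-character tokens.
names = operator_names("a << b | ~c != d // e\n")
```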
				674
				675
				676	.. _delimiters:
				677
				678	Delimiters
				679	==========
				680
				681	.. index:: single: delimiters
				682
				683	The following tokens serve as delimiters in the grammar::
				684
   (       )       [       ]       {       }       @
   ,       :       .       =       ;
   +=      -=      *=      /=      //=     %=
   &=      |=      ^=      >>=     <<=     **=

The period can also occur in floating-point and imaginary literals. A sequence
of three periods has a special meaning as an ellipsis literal. The second half
of the list, the augmented assignment operators, serve lexically as delimiters,
but also perform an operation.

The following printing ASCII characters have special meaning as part of other
tokens or are otherwise significant to the lexical analyzer::

   '       "       #       \

The following printing ASCII characters are not used in Python. Their
occurrence outside string literals and comments is an unconditional error::

   $       ?       `
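
The behaviour described in this section can be checked with the standard :mod:`tokenize` module and the built-in ``compile()`` function. The helpers ``delimiter_tokens`` and ``compiles`` are hypothetical names used only for this sketch.

```python
import io
import tokenize


def delimiter_tokens(source):
    """Return (text, exact token name) pairs for the OP tokens in source."""
    readline = io.StringIO(source).readline
    return [
        (tok.string, tokenize.tok_name[tok.exact_type])
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.OP
    ]


def compiles(source):
    """Return True if source compiles as a sequence of statements."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# An augmented assignment operator is one token, as is the ellipsis.
augmented = delimiter_tokens("x **= 2\n")
ellipsis_tokens = delimiter_tokens("x = ...\n")

# '$' and '?' are errors in code, but harmless in strings and comments.
dollar_in_code = compiles("x = 1 $ 2\n")
question_in_code = compiles("x = 1 ? 2\n")
inside_string_and_comment = compiles("s = '$ ?'  # $ ?\n")
```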