Blame - Doc/reference/lexical_analysis.rst - platform/external/python/cpython3


Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1
				2	.. _lexical:
				3
				4	****************
				5	Lexical analysis
				6	****************
				7
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	8	.. index:: lexical analysis, parser, token
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	9
				10	A Python program is read by a parser. Input to the parser is a stream of
				11	tokens, generated by the lexical analyzer. This chapter describes how the
				12	lexical analyzer breaks a file into tokens.
				13
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	14	Python reads program text as Unicode code points; the encoding of a source file
				15	can be given by an encoding declaration and defaults to UTF-8, see :pep:`3120`
				16	for details. If the source file cannot be decoded, a :exc:`SyntaxError` is
				17	raised.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	18
				19
				20	.. _line-structure:
				21
				22	Line structure
				23	==============
				24
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	25	.. index:: line structure
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	26
				27	A Python program is divided into a number of logical lines.
				28
				29
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	30	.. _logical-lines:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	31
				32	Logical lines
				33	-------------
				34
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	35	.. index:: logical line, physical line, line joining, NEWLINE token
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	36
				37	The end of a logical line is represented by the token NEWLINE. Statements
				38	cannot cross logical line boundaries except where NEWLINE is allowed by the
				39	syntax (e.g., between statements in compound statements). A logical line is
				40	constructed from one or more physical lines by following the explicit or
				41	implicit line joining rules.
				42
				43
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	44	.. _physical-lines:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	45
				46	Physical lines
				47	--------------
				48
				49	A physical line is a sequence of characters terminated by an end-of-line
				50	sequence. In source files, any of the standard platform line termination
				51	sequences can be used - the Unix form using ASCII LF (linefeed), the Windows
				52	form using the ASCII sequence CR LF (return followed by linefeed), or the
				53	Macintosh form using the ASCII CR (return) character. All of these forms can be
				54	used equally, regardless of platform.
				55
				56	When embedding Python, source code strings should be passed to Python APIs using
				57	the standard C conventions for newline characters (the ``\n`` character,
				58	representing ASCII LF, is the line terminator).
				59
				60
				61	.. _comments:
				62
				63	Comments
				64	--------
				65
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	66	.. index:: comment, hash character
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	67
				68	A comment starts with a hash character (``#``) that is not part of a string
				69	literal, and ends at the end of the physical line. A comment signifies the end
				70	of the logical line unless the implicit line joining rules are invoked. Comments
				71	are ignored by the syntax; they are not tokens.
				72
				73
				74	.. _encodings:
				75
				76	Encoding declarations
				77	---------------------
				78
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	79	.. index:: source character set, encodings
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	80
				81	If a comment in the first or second line of the Python script matches the
				82	regular expression ``coding[=:]\s*([-\w.]+)``, this comment is processed as an
				83	encoding declaration; the first group of this expression names the encoding of
				84	the source code file. The recommended forms of this expression are ::
				85
   # -*- coding: <encoding-name> -*-
				87
				88	which is recognized also by GNU Emacs, and ::
				89
   # vim:fileencoding=<encoding-name>
				91
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	92	which is recognized by Bram Moolenaar's VIM.
				93
				94	If no encoding declaration is found, the default encoding is UTF-8. In
				95	addition, if the first bytes of the file are the UTF-8 byte-order mark
				96	(``b'\xef\xbb\xbf'``), the declared file encoding is UTF-8 (this is supported,
				97	among others, by Microsoft's :program:`notepad`).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	98
				99	If an encoding is declared, the encoding name must be recognized by Python. The
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	100	encoding is used for all lexical analysis, including string literals, comments
				101	and identifiers. The encoding declaration must appear on a line of its own.
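As a rough illustration of the rules above, the following sketch applies the quoted regular expression to two comment lines and then asks the standard :mod:`tokenize` module which encoding it would choose. The sample comment strings and the helper name ``detect`` are assumptions made for this example only.

```python
import io
import re
import tokenize

# The pattern quoted in the text, applied to a single comment line.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")

emacs_style = "# -*- coding: latin-1 -*-"
vim_style = "# vim:fileencoding=utf-8"

# The first group names the encoding.
declared_emacs = CODING_RE.search(emacs_style).group(1)
declared_vim = CODING_RE.search(vim_style).group(1)


def detect(source_bytes):
    """Return the encoding tokenize.detect_encoding() picks for the bytes."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# No declaration: the default applies.
no_declaration = detect(b"x = 1\n")
# A UTF-8 byte-order mark at the start of the file.
with_bom = detect(b"\xef\xbb\xbfx = 1\n")
```

Note that ``detect_encoding`` reports a byte-order mark as ``'utf-8-sig'``, which is how the codec machinery distinguishes UTF-8 input with a BOM.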
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	102
Christian Heimes	5b5e81c	2007-12-31 16:14:33 +0000	[diff] [blame]	103	.. XXX there should be a list of supported encodings.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	104
				105
				106	.. _explicit-joining:
				107
				108	Explicit line joining
				109	---------------------
				110
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	111	.. index:: physical line, line joining, line continuation, backslash character
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	112
				113	Two or more physical lines may be joined into logical lines using backslash
				114	characters (``\``), as follows: when a physical line ends in a backslash that is
				115	not part of a string literal or comment, it is joined with the following forming
				116	a single logical line, deleting the backslash and the following end-of-line
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	117	character. For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	118
   if 1900 < year < 2100 and 1 <= month <= 12 \
      and 1 <= day <= 31 and 0 <= hour < 24 \
      and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
           return 1
				123
				124	A line ending in a backslash cannot carry a comment. A backslash does not
				125	continue a comment. A backslash does not continue a token except for string
				126	literals (i.e., tokens other than string literals cannot be split across
				127	physical lines using a backslash). A backslash is illegal elsewhere on a line
				128	outside a string literal.
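The effect of explicit joining can be observed with the standard :mod:`tokenize` module. The sketch below is only an illustration; the variable ``total`` and the sample source are made up for this example. Two physical lines joined by a backslash produce a single NEWLINE token.

```python
import io
import tokenize

# Two physical lines, joined by a trailing backslash.
source = (
    "total = 1 + \\\n"
    "    2\n"
)

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
type_names = [tokenize.tok_name[tok.type] for tok in tokens]

# Only one logical line ends here, so only one NEWLINE token appears.
newline_count = type_names.count("NEWLINE")

# The joined line executes as a single statement.
namespace = {}
exec(source, namespace)
```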
				129
				130
				131	.. _implicit-joining:
				132
				133	Implicit line joining
				134	---------------------
				135
				136	Expressions in parentheses, square brackets or curly braces can be split over
				137	more than one physical line without using backslashes. For example::
				138
   month_names = ['Januari', 'Februari', 'Maart',      # These are the
                  'April',   'Mei',      'Juni',       # Dutch names
                  'Juli',    'Augustus', 'September',  # for the months
                  'Oktober', 'November', 'December']   # of the year
				143
				144	Implicitly continued lines can carry comments. The indentation of the
				145	continuation lines is not important. Blank continuation lines are allowed.
				146	There is no NEWLINE token between implicit continuation lines. Implicitly
				147	continued lines can also occur within triple-quoted strings (see below); in that
				148	case they cannot carry comments.
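A small sketch using the standard :mod:`tokenize` module shows the absence of NEWLINE tokens inside brackets. The sample source is an assumption for illustration. Note that :mod:`tokenize` additionally reports ``COMMENT`` and ``NL`` entries for its own purposes; these mark text that the parser never sees as tokens.

```python
import io
import tokenize

# A list display split over two physical lines, with a comment on the first.
source = (
    "names = ['a',  # first\n"
    "         'b']\n"
)

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
type_names = [tokenize.tok_name[tok.type] for tok in tokens]

# One logical line, hence one NEWLINE; the inner line break is reported as NL.
newline_count = type_names.count("NEWLINE")
nl_count = type_names.count("NL")
comment_count = type_names.count("COMMENT")
```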
				149
				150
				151	.. _blank-lines:
				152
				153	Blank lines
				154	-----------
				155
				156	.. index:: single: blank line
				157
				158	A logical line that contains only spaces, tabs, formfeeds and possibly a
				159	comment, is ignored (i.e., no NEWLINE token is generated). During interactive
				160	input of statements, handling of a blank line may differ depending on the
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	161	implementation of the read-eval-print loop. In the standard interactive
				162	interpreter, an entirely blank logical line (i.e. one containing not even
				163	whitespace or a comment) terminates a multi-line statement.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	164
				165
				166	.. _indentation:
				167
				168	Indentation
				169	-----------
				170
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	171	.. index:: indentation, leading whitespace, space, tab, grouping, statement grouping
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	172
				173	Leading whitespace (spaces and tabs) at the beginning of a logical line is used
				174	to compute the indentation level of the line, which in turn is used to determine
				175	the grouping of statements.
				176
				177	First, tabs are replaced (from left to right) by one to eight spaces such that
				178	the total number of characters up to and including the replacement is a multiple
				179	of eight (this is intended to be the same rule as used by Unix). The total
				180	number of spaces preceding the first non-blank character then determines the
				181	line's indentation. Indentation cannot be split over multiple physical lines
				182	using backslashes; the whitespace up to the first backslash determines the
				183	indentation.
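The tab replacement rule can be sketched as a small function. This is an illustrative model rather than the interpreter's actual code; the name ``indentation_width`` is hypothetical, and formfeed handling is deliberately left out.

```python
def indentation_width(line, tabsize=8):
    """Compute the indentation of a line using the tab rule described above.

    Each tab advances the column to the next multiple of ``tabsize``;
    each space advances it by one. Scanning stops at the first character
    that is neither a space nor a tab.
    """
    width = 0
    for ch in line:
        if ch == " ":
            width += 1
        elif ch == "\t":
            # Replace the tab by one to eight spaces, up to the next multiple.
            width = (width // tabsize + 1) * tabsize
        else:
            break
    return width


only_tab = indentation_width("\tx")           # tab from column 0
spaces_then_tab = indentation_width("  \tx")  # tab absorbs the two spaces
tab_then_spaces = indentation_width("\t  x")  # spaces add after the tab
```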
				184
				185	Cross-platform compatibility note: because of the nature of text editors on
				186	non-UNIX platforms, it is unwise to use a mixture of spaces and tabs for the
				187	indentation in a single source file. It should also be noted that different
				188	platforms may explicitly limit the maximum indentation level.
				189
				190	A formfeed character may be present at the start of the line; it will be ignored
				191	for the indentation calculations above. Formfeed characters occurring elsewhere
				192	in the leading whitespace have an undefined effect (for instance, they may reset
				193	the space count to zero).
				194
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	195	.. index:: INDENT token, DEDENT token
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	196
				197	The indentation levels of consecutive lines are used to generate INDENT and
				198	DEDENT tokens, using a stack, as follows.
				199
				200	Before the first line of the file is read, a single zero is pushed on the stack;
				201	this will never be popped off again. The numbers pushed on the stack will
				202	always be strictly increasing from bottom to top. At the beginning of each
				203	logical line, the line's indentation level is compared to the top of the stack.
				204	If it is equal, nothing happens. If it is larger, it is pushed on the stack, and
				205	one INDENT token is generated. If it is smaller, it must be one of the
				206	numbers occurring on the stack; all numbers on the stack that are larger are
				207	popped off, and for each number popped off a DEDENT token is generated. At the
				208	end of the file, a DEDENT token is generated for each number remaining on the
				209	stack that is larger than zero.
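The stack algorithm described above can be modelled in a few lines. The sketch below is a simplified illustration, not the tokenizer's implementation: it takes a list of already computed indentation levels, one per logical line, and the function name and the ``LINE(n)`` placeholders are invented for this example.

```python
def indent_tokens(levels):
    """Yield INDENT/DEDENT markers for a sequence of indentation levels."""
    stack = [0]          # the initial zero is never popped
    out = []
    for level in levels:
        if level > stack[-1]:
            # Larger than the top of the stack: push and emit one INDENT.
            stack.append(level)
            out.append("INDENT")
        elif level < stack[-1]:
            # Smaller: it must match a level already on the stack.
            if level not in stack:
                raise IndentationError("dedent does not match any outer level")
            while stack[-1] > level:
                stack.pop()
                out.append("DEDENT")
        out.append(f"LINE({level})")
    # End of file: one DEDENT for each remaining level above zero.
    while stack[-1] > 0:
        stack.pop()
        out.append("DEDENT")
    return out


nested = indent_tokens([0, 4, 8, 0])
at_eof = indent_tokens([0, 4])
```

A level such as ``2`` following ``[0, 4]`` is rejected, because ``2`` never appeared on the stack.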
				210
				211	Here is an example of a correctly (though confusingly) indented piece of Python
				212	code::
				213
   def perm(l):
           # Compute the list of all permutations of l
       if len(l) <= 1:
                     return [l]
       r = []
       for i in range(len(l)):
                s = l[:i] + l[i+1:]
                p = perm(s)
                for x in p:
                 r.append(l[i:i+1] + x)
       return r
				225
				226	The following example shows various indentation errors::
				227
    def perm(l):                       # error: first line indented
   for i in range(len(l)):             # error: not indented
       s = l[:i] + l[i+1:]
           p = perm(l[:i] + l[i+1:])   # error: unexpected indent
           for x in p:
                   r.append(l[i:i+1] + x)
               return r                # error: inconsistent dedent
				235
				236	(Actually, the first three errors are detected by the parser; only the last
				237	error is found by the lexical analyzer --- the indentation of ``return r`` does
				238	not match a level popped off the stack.)
				239
				240
				241	.. _whitespace:
				242
				243	Whitespace between tokens
				244	-------------------------
				245
				246	Except at the beginning of a logical line or in string literals, the whitespace
				247	characters space, tab and formfeed can be used interchangeably to separate
				248	tokens. Whitespace is needed between two tokens only if their concatenation
				249	could otherwise be interpreted as a different token (e.g., ab is one token, but
				250	a b is two tokens).
				251
				252
				253	.. _other-tokens:
				254
				255	Other tokens
				256	============
				257
				258	Besides NEWLINE, INDENT and DEDENT, the following categories of tokens exist:
				259	identifiers, keywords, literals, operators, and delimiters. Whitespace
				260	characters (other than line terminators, discussed earlier) are not tokens, but
				261	serve to delimit tokens. Where ambiguity exists, a token comprises the longest
				262	possible string that forms a legal token, when read from left to right.
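Both the whitespace rule and the longest-match rule can be checked with the standard :mod:`tokenize` module. The helper ``significant_tokens`` below is a hypothetical name used only in this sketch.

```python
import io
import tokenize


def significant_tokens(source):
    """Return the strings of the NAME, NUMBER and OP tokens in source."""
    wanted = {tokenize.NAME, tokenize.NUMBER, tokenize.OP}
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type in wanted
    ]


one_name = significant_tokens("ab\n")        # no whitespace: a single name
two_names = significant_tokens("a b\n")      # whitespace separates two names
longest_match = significant_tokens("x<<=1\n")  # "<<=" wins over "<", "<<"
```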
				263
				264
				265	.. _identifiers:
				266
				267	Identifiers and keywords
				268	========================
				269
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	270	.. index:: identifier, name
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	271
				272	Identifiers (also referred to as names) are described by the following lexical
Georg Brandl	e06de8b	2008-05-05 21:42:51 +0000	[diff] [blame]	273	definitions.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	274
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	275	The syntax of identifiers in Python is based on the Unicode standard annex
Georg Brandl	e06de8b	2008-05-05 21:42:51 +0000	[diff] [blame]	276	UAX-31, with elaboration and changes as defined below; see also :pep:`3131` for
				277	further details.
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	278
				279	Within the ASCII range (U+0001..U+007F), the valid characters for identifiers
Georg Brandl	e06de8b	2008-05-05 21:42:51 +0000	[diff] [blame]	280	are the same as in Python 2.x: the uppercase and lowercase letters ``A`` through
				281	``Z``, the underscore ``_`` and, except for the first character, the digits
				282	``0`` through ``9``.
				283
				284	Python 3.0 introduces additional characters from outside the ASCII range (see
				285	:pep:`3131`). For these characters, the classification uses the version of the
				286	Unicode Character Database as included in the :mod:`unicodedata` module.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	287
				288	Identifiers are unlimited in length. Case is significant.
				289
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	290	.. productionlist::
   identifier: `id_start` `id_continue`*
   id_start: <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
   id_continue: <all characters in `id_start`, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	294
				295	The Unicode category codes mentioned above stand for:
				296
				297	* Lu - uppercase letters
				298	* Ll - lowercase letters
				299	* Lt - titlecase letters
				300	* Lm - modifier letters
				301	* Lo - other letters
				302	* Nl - letter numbers
				303	* Mn - nonspacing marks
				304	* Mc - spacing combining marks
				305	* Nd - decimal numbers
				306	* Pc - connector punctuations
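As an informal check of these rules, ``str.isidentifier()`` reports whether a string has the form of an identifier, and :func:`unicodedata.category` returns the general category of a character. The candidate names below are arbitrary examples chosen for this sketch.

```python
import unicodedata

candidates = ["spam", "_private", "Straße", "naïve", "2fast", "with space"]

# Does each string have the lexical form of an identifier?
valid = {name: name.isidentifier() for name in candidates}

# General category of the first character of each candidate.
first_char_category = {name: unicodedata.category(name[0]) for name in candidates}
```

Note that ``isidentifier()`` tests only the lexical form; it also returns true for keywords such as ``class``.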
				307
All identifiers are converted into the normal form NFKC while parsing; comparison
of identifiers is based on NFKC.
				310
				311	A non-normative HTML file listing all valid identifier characters for Unicode
				312	4.1 can be found at
				313	http://www.dcl.hpi.uni-potsdam.de/home/loewis/table-3131.html.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	314
Mark Summerfield	051d1dd	2007-11-20 13:22:19 +0000	[diff] [blame]	315
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	316	.. _keywords:
				317
				318	Keywords
				319	--------
				320
.. index::
   single: keyword
   single: reserved word
				324
				325	The following identifiers are used as reserved words, or keywords of the
				326	language, and cannot be used as ordinary identifiers. They must be spelled
				327	exactly as written here::
				328
   False      class      finally    is         return
   None       continue   for        lambda     try
   True       def        from       nonlocal   while
   and        del        global     not        with
   as         elif       if         or         yield
   assert     else       import     pass
   break      except     in         raise
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	336
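The standard :mod:`keyword` module exposes the keyword list of the running interpreter. The sketch below checks the words listed above against it; the variable names are illustrative, and a newer interpreter may well report additional keywords, which is why the difference is computed rather than assumed empty.

```python
import keyword

# The reserved words as listed in this section.
listed = """
False class finally is return
None continue for lambda try
True def from nonlocal while
and del global not with
as elif if or yield
assert else import pass
break except in raise
""".split()

# Every listed word should be reserved in the running interpreter.
all_reserved = all(keyword.iskeyword(word) for word in listed)

# Keywords known to this interpreter that the list above does not mention.
newer_keywords = sorted(set(keyword.kwlist) - set(listed))
```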
				337	.. _id-classes:
				338
				339	Reserved classes of identifiers
				340	-------------------------------
				341
				342	Certain classes of identifiers (besides keywords) have special meanings. These
				343	classes are identified by the patterns of leading and trailing underscore
				344	characters:
				345
``_*``
   Not imported by ``from module import *``. The special identifier ``_`` is used
   in the interactive interpreter to store the result of the last evaluation; it is
   stored in the :mod:`builtins` module. When not in interactive mode, ``_``
   has no special meaning and is not defined. See section :ref:`import`.

   .. note::

      The name ``_`` is often used in conjunction with internationalization;
      refer to the documentation for the :mod:`gettext` module for more
      information on this convention.

``__*__``
   System-defined names. These names are defined by the interpreter and its
   implementation (including the standard library); applications should not expect
   to define additional names using this convention. The set of names of this
   class defined by Python may be extended in future versions. See section
   :ref:`specialnames`.

``__*``
   Class-private names. Names in this category, when used within the context of a
   class definition, are re-written to use a mangled form to help avoid name
   clashes between "private" attributes of base and derived classes. See section
   :ref:`atom-identifiers`.
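The rewriting of class-private names can be seen by inspecting an instance's attributes. The classes ``Base`` and ``Derived`` and the attribute ``__token`` are hypothetical names for this sketch.

```python
class Base:
    def __init__(self):
        # Stored as _Base__token because it is written inside class Base.
        self.__token = "base"


class Derived(Base):
    def __init__(self):
        super().__init__()
        # Stored as _Derived__token, so it does not clash with Base's name.
        self.__token = "derived"


obj = Derived()
attribute_names = sorted(vars(obj))
```

Both attributes coexist on the same instance because each was mangled with the name of the class in which it textually appears.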
				370
				371
				372	.. _literals:
				373
				374	Literals
				375	========
				376
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	377	.. index:: literal, constant
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	378
				379	Literals are notations for constant values of some built-in types.
				380
				381
				382	.. _strings:
				383
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	384	String and Bytes literals
				385	-------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	386
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	387	.. index:: string literal, bytes literal, ASCII
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	388
				389	String literals are described by the following lexical definitions:
				390
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	391	.. productionlist::
   stringliteral: [`stringprefix`](`shortstring` | `longstring`)
   stringprefix: "r" | "R"
   shortstring: "'" `shortstringitem`* "'" | '"' `shortstringitem`* '"'
   longstring: "'''" `longstringitem`* "'''" | '"""' `longstringitem`* '"""'
   shortstringitem: `shortstringchar` | `stringescapeseq`
   longstringitem: `longstringchar` | `stringescapeseq`
   shortstringchar: <any source character except "\" or newline or the quote>
   longstringchar: <any source character except "\">
   stringescapeseq: "\" <any source character>
				401
				402	.. productionlist::
   bytesliteral: `bytesprefix`(`shortbytes` | `longbytes`)
   bytesprefix: "b" | "B"
   shortbytes: "'" `shortbytesitem`* "'" | '"' `shortbytesitem`* '"'
   longbytes: "'''" `longbytesitem`* "'''" | '"""' `longbytesitem`* '"""'
   shortbytesitem: `shortbyteschar` | `bytesescapeseq`
   longbytesitem: `longbyteschar` | `bytesescapeseq`
   shortbyteschar: <any ASCII character except "\" or newline or the quote>
   longbyteschar: <any ASCII character except "\">
   bytesescapeseq: "\" <any ASCII character>
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	412
				413	One syntactic restriction not indicated by these productions is that whitespace
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	414	is not allowed between the :token:`stringprefix` or :token:`bytesprefix` and the
				415	rest of the literal. The source character set is defined by the encoding
				416	declaration; it is UTF-8 if no encoding declaration is given in the source file;
				417	see section :ref:`encodings`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	418
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	419	.. index:: triple-quoted string, Unicode Consortium, raw string
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	420
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	421	In plain English: Both types of literals can be enclosed in matching single quotes
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	422	(``'``) or double quotes (``"``). They can also be enclosed in matching groups
				423	of three single or double quotes (these are generally referred to as
				424	triple-quoted strings). The backslash (``\``) character is used to escape
				425	characters that otherwise have a special meaning, such as newline, backslash
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	426	itself, or the quote character.
				427
				428	String literals may optionally be prefixed with a letter ``'r'`` or ``'R'``;
Benjamin Peterson	a2f837f	2008-04-28 21:05:10 +0000	[diff] [blame]	429	such strings are called :dfn:`raw strings` and treat backslashes as literal
				430	characters. As a result, ``'\U'`` and ``'\u'`` escapes in raw strings are not
				431	treated specially.
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	432
				433	Bytes literals are always prefixed with ``'b'`` or ``'B'``; they produce an
				434	instance of the :class:`bytes` type instead of the :class:`str` type. They
				435	may only contain ASCII characters; bytes with a numeric value of 128 or greater
				436	must be expressed with escapes.
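A short sketch of the difference between the two literal kinds follows. The sample values are arbitrary; the ``compile()`` call is used here only to show that a non-ASCII character written directly in a bytes literal is rejected.

```python
text = "caf\xe9"    # str: the escape denotes the code point U+00E9
data = b"caf\xe9"   # bytes: the escape denotes the single byte 0xE9

text_type = type(text).__name__
data_type = type(data).__name__

# A bytes literal containing a raw non-ASCII character does not compile.
try:
    compile("b'\u00e9'", "<example>", "eval")
    non_ascii_accepted = True
except SyntaxError:
    non_ascii_accepted = False
```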
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	437
				438	In triple-quoted strings, unescaped newlines and quotes are allowed (and are
				439	retained), except that three unescaped quotes in a row terminate the string. (A
				440	"quote" is the character used to open the string, i.e. either ``'`` or ``"``.)
				441
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	442	.. index:: physical line, escape sequence, Standard C, C
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	443
				444	Unless an ``'r'`` or ``'R'`` prefix is present, escape sequences in strings are
				445	interpreted according to rules similar to those used by Standard C. The
				446	recognized escape sequences are:
				447
+-----------------+---------------------------------+-------+
| Escape Sequence | Meaning                         | Notes |
+=================+=================================+=======+
| ``\newline``    | Backslash and newline ignored   |       |
+-----------------+---------------------------------+-------+
| ``\\``          | Backslash (``\``)               |       |
+-----------------+---------------------------------+-------+
| ``\'``          | Single quote (``'``)            |       |
+-----------------+---------------------------------+-------+
| ``\"``          | Double quote (``"``)            |       |
+-----------------+---------------------------------+-------+
| ``\a``          | ASCII Bell (BEL)                |       |
+-----------------+---------------------------------+-------+
| ``\b``          | ASCII Backspace (BS)            |       |
+-----------------+---------------------------------+-------+
| ``\f``          | ASCII Formfeed (FF)             |       |
+-----------------+---------------------------------+-------+
| ``\n``          | ASCII Linefeed (LF)             |       |
+-----------------+---------------------------------+-------+
| ``\r``          | ASCII Carriage Return (CR)      |       |
+-----------------+---------------------------------+-------+
| ``\t``          | ASCII Horizontal Tab (TAB)      |       |
+-----------------+---------------------------------+-------+
| ``\v``          | ASCII Vertical Tab (VT)         |       |
+-----------------+---------------------------------+-------+
| ``\ooo``        | Character with octal value      | (1,3) |
|                 | ooo                             |       |
+-----------------+---------------------------------+-------+
| ``\xhh``        | Character with hex value hh     | (2,3) |
+-----------------+---------------------------------+-------+
				478
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	479	Escape sequences only recognized in string literals are:
				480
+-----------------+---------------------------------+-------+
| Escape Sequence | Meaning                         | Notes |
+=================+=================================+=======+
| ``\N{name}``    | Character named name in the     |       |
|                 | Unicode database                |       |
+-----------------+---------------------------------+-------+
| ``\uxxxx``      | Character with 16-bit hex value | \(4)  |
|                 | xxxx                            |       |
+-----------------+---------------------------------+-------+
| ``\Uxxxxxxxx``  | Character with 32-bit hex value | \(5)  |
|                 | xxxxxxxx                        |       |
+-----------------+---------------------------------+-------+
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	493
				494	Notes:
				495
				496	(1)
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	497	As in Standard C, up to three octal digits are accepted.
				498
				499	(2)
   Unlike in Standard C, exactly two hex digits are required.
				501
				502	(3)
				503	In a bytes literal, hexadecimal and octal escapes denote the byte with the
				504	given value. In a string literal, these escapes denote a Unicode character
				505	with the given value.
				506
				507	(4)
   Individual code units which form parts of a surrogate pair can be encoded using
   this escape sequence. Exactly four hex digits are required.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	510
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	511	(5)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	512	Any Unicode character can be encoded this way, but characters outside the Basic
				513	Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is
				514	compiled to use 16-bit code units (the default). Individual code units which
				515	form parts of a surrogate pair can be encoded using this escape sequence.
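The escape sequences in the tables above can be compared directly. The following sketch spells the same character in several ways and contrasts a hexadecimal escape in a string literal with the same escape in a bytes literal; the variable names are illustrative.

```python
# Five spellings of the letter "A" in a string literal.
spellings = [
    "\x41",                          # two hex digits
    "\101",                          # up to three octal digits
    "\u0041",                        # four hex digits
    "\U00000041",                    # eight hex digits
    "\N{LATIN CAPITAL LETTER A}",    # name from the Unicode database
]
all_same = all(s == "A" for s in spellings)

# In a string the escape denotes a code point; in bytes it denotes a byte.
text_char = "\xe9"
byte_char = b"\xe9"
```

Encoding ``text_char`` as UTF-8 yields two bytes, whereas ``byte_char`` already is the single byte 0xE9.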
				516
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	517
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	518	.. index:: unrecognized escape sequence
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	519
				520	Unlike Standard C, all unrecognized escape sequences are left in the string
				521	unchanged, i.e., the backslash is left in the string. (This behavior is
				522	useful when debugging: if an escape sequence is mistyped, the resulting output
				523	is more easily recognized as broken.) It is also important to note that the
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	524	escape sequences only recognized in string literals fall into the category of
				525	unrecognized escapes for bytes literals.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	526
Georg Brandl	57e3b68	2007-08-31 08:07:45 +0000	[diff] [blame]	527	Even in a raw string, string quotes can be escaped with a backslash, but the
				528	backslash remains in the string; for example, ``r"\""`` is a valid string
				529	literal consisting of two characters: a backslash and a double quote; ``r"\"``
				530	is not a valid string literal (even a raw string cannot end in an odd number of
				531	backslashes). Specifically, a raw string cannot end in a single backslash
				532	(since the backslash would escape the following quote character). Note also
				533	that a single backslash followed by a newline is interpreted as those two
				534	characters as part of the string, not as a line continuation.
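The raw string rules can be verified with a short sketch. The ``compile()`` call is used here only to demonstrate that a raw string ending in a single backslash is rejected; the variable names are illustrative.

```python
# A backslash and a double quote: two characters, backslash retained.
two_chars = r"\""

# In a raw string, \n is a backslash followed by the letter n.
raw_n = r"\n"

# The source text r"\" is not a complete literal: the backslash
# escapes the closing quote, so the string never terminates.
try:
    compile('r"\\"', "<example>", "eval")
    ends_in_backslash_ok = True
except SyntaxError:
    ends_in_backslash_ok = False
```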
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	535
				536
				537	.. _string-catenation:
				538
				539	String literal concatenation
				540	----------------------------
				541
				542	Multiple adjacent string literals (delimited by whitespace), possibly using
				543	different quoting conventions, are allowed, and their meaning is the same as
				544	their concatenation. Thus, ``"hello" 'world'`` is equivalent to
				545	``"helloworld"``. This feature can be used to reduce the number of backslashes
				546	needed, to split long strings conveniently across long lines, or even to add
				547	comments to parts of strings, for example::
				548
   re.compile("[A-Za-z_]"       # letter or underscore
              "[A-Za-z0-9_]*"   # letter, digit or underscore
             )
				552
				553	Note that this feature is defined at the syntactical level, but implemented at
				554	compile time. The '+' operator must be used to concatenate string expressions
				555	at run time. Also note that literal concatenation can use different quoting
				556	styles for each component (even mixing raw strings and triple quoted strings).
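A brief sketch of literal concatenation follows. It also parses the example from the text with the standard :mod:`ast` module to show that, in current CPython, the adjacent literals already form a single constant in the syntax tree; the variable names are illustrative.

```python
import ast

# Adjacent literals with different quoting styles.
greeting = "hello" 'world'

# A raw string next to a triple-quoted string.
mixed = r"\d+" """ items"""

# The parser sees one constant, not an addition of two strings.
node = ast.parse("\"hello\" 'world'", mode="eval").body
is_single_constant = isinstance(node, ast.Constant)
```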
				557
				558
				559	.. _numbers:
				560
				561	Numeric literals
				562	----------------
				563
.. index:: number, numeric literal, integer literal
   floating point literal, hexadecimal literal
   octal literal, binary literal, decimal literal, imaginary literal, complex literal
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	567
Georg Brandl	95817b3	2008-05-11 14:30:18 +0000	[diff] [blame]	568	There are three types of numeric literals: integers, floating point numbers, and
				569	imaginary numbers. There are no complex literals (complex numbers can be formed
				570	by adding a real number and an imaginary number).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	571
				572	Note that numeric literals do not include a sign; a phrase like ``-1`` is
				573	actually an expression composed of the unary operator '``-``' and the literal
				574	``1``.
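
The point about signs can be observed with the standard :mod:`tokenize`
module. The sketch below is only an illustration; ``token_pairs`` is a
hypothetical helper written for this example, not part of any library::

   import io
   import tokenize

   def token_pairs(source):
       """Return (token type name, text) pairs, skipping layout-only tokens."""
       skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
       return [
           (tokenize.tok_name[tok.type], tok.string)
           for tok in tokenize.generate_tokens(io.StringIO(source).readline)
           if tok.type not in skip
       ]

   # The minus sign is an operator token of its own; the literal is just "1".
   signed = token_pairs("-1\n")
   assert signed == [("OP", "-"), ("NUMBER", "1")]

   # One visible consequence: ** binds more tightly than the unary minus,
   # so this is -(2 ** 2) rather than (-2) ** 2.
   assert -2 ** 2 == -4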


.. _integers:

Integer literals
----------------

Integer literals are described by the following lexical definitions:

.. productionlist::
   integer: `decimalinteger` | `octinteger` | `hexinteger` | `bininteger`
   decimalinteger: `nonzerodigit` `digit`* | "0"+
   nonzerodigit: "1"..."9"
   digit: "0"..."9"
   octinteger: "0" ("o" | "O") `octdigit`+
   hexinteger: "0" ("x" | "X") `hexdigit`+
   bininteger: "0" ("b" | "B") `bindigit`+
   octdigit: "0"..."7"
   hexdigit: `digit` | "a"..."f" | "A"..."F"
   bindigit: "0" | "1"

There is no limit for the length of integer literals apart from what can be
stored in available memory.

Note that leading zeros in a non-zero decimal number are not allowed. This is
for disambiguation with C-style octal literals, which Python used before version
3.0.

Some examples of integer literals::

   7     2147483647                        0o177    0b100110111
   3     79228162514264337593543950336     0o377    0x100000000
         79228162514264337593543950336              0xdeadbeef
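
A short sketch of how these rules behave in practice. The names are
illustrative, and ``compile()`` is used only so that the rejected literal does
not stop the example itself from running::

   # Each prefix selects a radix; the result is an ordinary int either way.
   assert 0o177 == 127
   assert 0b100110111 == 311
   assert 0xdeadbeef == 3735928559
   assert 0x100000000 == 2 ** 32
   assert 79228162514264337593543950336 == 2 ** 96

   # A run of zeros is a valid decimal literal ("0"+ in the grammar) ...
   assert eval("000") == 0

   # ... but a non-zero decimal literal may not start with a zero.
   try:
       compile("0177", "<example>", "eval")
   except SyntaxError:
       leading_zero_rejected = True
   else:
       leading_zero_rejected = False
   assert leading_zero_rejected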


.. _floating:

Floating point literals
-----------------------

Floating point literals are described by the following lexical definitions:

.. productionlist::
   floatnumber: `pointfloat` | `exponentfloat`
   pointfloat: [`intpart`] `fraction` | `intpart` "."
   exponentfloat: (`intpart` | `pointfloat`) `exponent`
   intpart: `digit`+
   fraction: "." `digit`+
   exponent: ("e" | "E") ["+" | "-"] `digit`+

Note that the integer and exponent parts are always interpreted using radix 10.
For example, ``077e010`` is legal, and denotes the same number as ``77e10``. The
allowed range of floating point literals is implementation-dependent. Some
examples of floating point literals::

   3.14    10.    .001    1e100    3.14e-10    0e0

Note that numeric literals do not include a sign; a phrase like ``-1`` is
actually an expression composed of the unary operator ``-`` and the literal
``1``.
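
The following sketch checks a few of these forms. It is only an illustration;
``largest`` is a name chosen for this example, and the value reported by
``sys.float_info`` depends on the implementation and platform::

   import sys

   # The integer part and the exponent are read in base 10, so leading
   # zeros are harmless here (unlike in integer literals).
   assert 077e010 == 77e10

   # Either side of the point may be omitted, but not both.
   assert 10. == 10.0
   assert .001 == 0.001
   assert 0e0 == 0.0

   # The largest finite float this implementation can represent.
   largest = sys.float_info.max
   assert isinstance(largest, float)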


.. _imaginary:

Imaginary literals
------------------

Imaginary literals are described by the following lexical definitions:

.. productionlist::
   imagnumber: (`floatnumber` | `intpart`) ("j" | "J")

An imaginary literal yields a complex number with a real part of 0.0. Complex
numbers are represented as a pair of floating point numbers and have the same
restrictions on their range. To create a complex number with a nonzero real
part, add a real number (an integer or floating point number) to it, e.g.,
``(3+4j)``. Some examples of imaginary literals::

   3.14j   10.j    10j     .001j   1e100j  3.14e-10j
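
As an illustration, the sketch below shows that an imaginary literal has a zero
real part and that ``3+4j`` is an addition rather than a single literal. The
variable names are illustrative, and the :mod:`ast` node types are a CPython
detail used here only to make the structure visible::

   import ast

   z = 10j
   assert isinstance(z, complex)
   assert (z.real, z.imag) == (0.0, 10.0)

   # Adding a real number gives a complex number with a nonzero real part.
   w = 3 + 4j
   assert (w.real, w.imag) == (3.0, 4.0)

   # The parser sees a binary "+" whose operands are the literals 3 and 4j.
   node = ast.parse("3+4j", mode="eval").body
   assert isinstance(node, ast.BinOp)
   assert isinstance(node.op, ast.Add)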


.. _operators:

Operators
=========

.. index:: single: operators

The following tokens are operators::

   +       -       *       **      /       //      %
   <<      >>      &       |       ^       ~
   <       >       <=      >=      ==      !=
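
Several operators share a leading character (``*`` and ``**``, ``<`` and
``<=``). The sketch below, which uses the standard :mod:`tokenize` module,
suggests how such text is split: the longest matching operator is taken as one
token. ``operator_tokens`` is a hypothetical helper written for this example::

   import io
   import tokenize

   def operator_tokens(source):
       """Return the text of every OP token found in *source*."""
       return [
           tok.string
           for tok in tokenize.generate_tokens(io.StringIO(source).readline)
           if tok.type == tokenize.OP
       ]

   # "**" and "//" are single tokens, not pairs of "*" or "/".
   power_ops = operator_tokens("x**y//z\n")
   assert power_ops == ["**", "//"]

   # The same holds for shifts and comparisons.
   compare_ops = operator_tokens("a<<b<=c!=d\n")
   assert compare_ops == ["<<", "<=", "!="]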


.. _delimiters:

Delimiters
==========

.. index:: single: delimiters

The following tokens serve as delimiters in the grammar::

   (       )       [       ]       {       }       @
   ,       :       .       =       ;
   +=      -=      *=      /=      //=     %=
   &=      |=      ^=      >>=     <<=     **=

The period can also occur in floating-point and imaginary literals. A sequence
of three periods has a special meaning as an ellipsis literal. The second half
of the list, the augmented assignment operators, serve lexically as delimiters,
but also perform an operation.
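
The different roles of the period, and the dual nature of augmented
assignment, can be seen in a small sketch based on the standard :mod:`tokenize`
module. ``token_pairs`` is a hypothetical helper written for this example::

   import io
   import tokenize

   def token_pairs(source):
       """Return (token type name, text) pairs, skipping layout-only tokens."""
       skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
       return [
           (tokenize.tok_name[tok.type], tok.string)
           for tok in tokenize.generate_tokens(io.StringIO(source).readline)
           if tok.type not in skip
       ]

   # A period between names is a delimiter token of its own ...
   assert token_pairs("z.real\n") == [("NAME", "z"), ("OP", "."), ("NAME", "real")]
   # ... inside a number it is part of the literal ...
   assert token_pairs("1.5\n") == [("NUMBER", "1.5")]
   # ... and three periods in a row form a single token.
   assert token_pairs("...\n") == [("OP", "...")]

   # "+=" is one token, and the statement both computes and assigns.
   assert token_pairs("total += 2\n")[1] == ("OP", "+=")
   total = 1
   total += 2
   assert total == 3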

The following printing ASCII characters have special meaning as part of other
tokens or are otherwise significant to the lexical analyzer::

   '       "       #       \

The following printing ASCII characters are not used in Python. Their
occurrence outside string literals and comments is an unconditional error::

   $       ?
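
A minimal sketch of this rule, using the built-in ``compile()`` function.
``compiles`` is a hypothetical helper written for this example::

   def compiles(source):
       """Return True if *source* compiles as a module, False on SyntaxError."""
       try:
           compile(source, "<example>", "exec")
       except SyntaxError:
           return False
       return True

   # Outside strings and comments, "$" and "?" are always rejected.
   assert not compiles("total$ = 1")
   assert not compiles("x = 1 ? 2")

   # Inside a string literal or a comment they are ordinary characters.
   assert compiles("s = 'cost: $5?'  # is $ allowed here?")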