| 1 | |
| 2 | .. _lexical: |
| 3 | |
| 4 | **************** |
| 5 | Lexical analysis |
| 6 | **************** |
| 7 | |
| 8 | .. index:: |
| 9 | single: lexical analysis |
| 10 | single: parser |
| 11 | single: token |
| 12 | |
| 13 | A Python program is read by a *parser*. Input to the parser is a stream of |
| 14 | *tokens*, generated by the *lexical analyzer*. This chapter describes how the |
| 15 | lexical analyzer breaks a file into tokens. |
| 16 | |
| 17 | Python uses the 7-bit ASCII character set for program text. |
| 18 | |
| 19 | .. versionadded:: 2.3 |
| 20 | An encoding declaration can be used to indicate that string literals and |
| 21 | comments use an encoding different from ASCII. |
| 22 | |
| 23 | For compatibility with older versions, Python only warns if it finds 8-bit |
| 24 | characters; those warnings should be corrected by either declaring an explicit |
| 25 | encoding, or using escape sequences if those bytes are binary data, instead of |
| 26 | characters. |
| 27 | |
| 28 | The run-time character set depends on the I/O devices connected to the program |
| 29 | but is generally a superset of ASCII. |
| 30 | |
| 31 | **Future compatibility note:** It may be tempting to assume that the character |
| 32 | set for 8-bit characters is ISO Latin-1 (an ASCII superset that covers most |
| 33 | western languages that use the Latin alphabet), but it is possible that in the |
| 34 | future Unicode text editors will become common. These generally use the UTF-8 |
| 35 | encoding, which is also an ASCII superset, but with very different use for the |
| 36 | characters with ordinals 128-255. While there is no consensus on this subject |
| 37 | yet, it is unwise to assume either Latin-1 or UTF-8, even though the current |
| 38 | implementation appears to favor Latin-1. This applies both to the source |
| 39 | character set and the run-time character set. |
| 40 | |
| 41 | |
| 42 | .. _line-structure: |
| 43 | |
| 44 | Line structure |
| 45 | ============== |
| 46 | |
| 47 | .. index:: single: line structure |
| 48 | |
| 49 | A Python program is divided into a number of *logical lines*. |
| 50 | |
| 51 | |
| 52 | .. _logical: |
| 53 | |
| 54 | Logical lines |
| 55 | ------------- |
| 56 | |
| 57 | .. index:: |
| 58 | single: logical line |
| 59 | single: physical line |
| 60 | single: line joining |
| 61 | single: NEWLINE token |
| 62 | |
| 63 | The end of a logical line is represented by the token NEWLINE. Statements |
| 64 | cannot cross logical line boundaries except where NEWLINE is allowed by the |
| 65 | syntax (e.g., between statements in compound statements). A logical line is |
| 66 | constructed from one or more *physical lines* by following the explicit or |
| 67 | implicit *line joining* rules. |
| 68 | |
| 69 | |
| 70 | .. _physical: |
| 71 | |
| 72 | Physical lines |
| 73 | -------------- |
| 74 | |
| 75 | A physical line is a sequence of characters terminated by an end-of-line |
| 76 | sequence. In source files, any of the standard platform line termination |
| 77 | sequences can be used - the Unix form using ASCII LF (linefeed), the Windows |
| 78 | form using the ASCII sequence CR LF (return followed by linefeed), or the old |
| 79 | Macintosh form using the ASCII CR (return) character. All of these forms can be |
| 80 | used equally, regardless of platform. |
| 81 | |
| 82 | When embedding Python, source code strings should be passed to Python APIs using |
| 83 | the standard C conventions for newline characters (the ``\n`` character, |
| 84 | representing ASCII LF, is the line terminator). |
| 85 | |
| 86 | |
| 87 | .. _comments: |
| 88 | |
| 89 | Comments |
| 90 | -------- |
| 91 | |
| 92 | .. index:: |
| 93 | single: comment |
| 94 | single: hash character |
| 95 | |
| 96 | A comment starts with a hash character (``#``) that is not part of a string |
| 97 | literal, and ends at the end of the physical line. A comment signifies the end |
| 98 | of the logical line unless the implicit line joining rules are invoked. Comments |
| 99 | are ignored by the syntax; they are not tokens. |
| 100 | |
| 101 | |
| 102 | .. _encodings: |
| 103 | |
| 104 | Encoding declarations |
| 105 | --------------------- |
| 106 | |
| 107 | .. index:: |
| 108 | single: source character set |
| 109 | single: encodings |
| 110 | |
| 111 | If a comment in the first or second line of the Python script matches the |
| 112 | regular expression ``coding[=:]\s*([-\w.]+)``, this comment is processed as an |
| 113 | encoding declaration; the first group of this expression names the encoding of |
| 114 | the source code file. The recommended forms of this expression are :: |
| 115 | |
| 116 | # -*- coding: <encoding-name> -*- |
| 117 | |
| 118 | which is recognized also by GNU Emacs, and :: |
| 119 | |
| 120 | # vim:fileencoding=<encoding-name> |
| 121 | |
| 122 | which is recognized by Bram Moolenaar's VIM. In addition, if the first bytes of |
| 123 | the file are the UTF-8 byte-order mark (``'\xef\xbb\xbf'``), the declared file |
| 124 | encoding is UTF-8 (this is supported, among others, by Microsoft's |
| 125 | :program:`notepad`). |
| 126 | |
| 127 | If an encoding is declared, the encoding name must be recognized by Python. The |
| 128 | encoding is used for all lexical analysis, in particular to find the end of a |
| 129 | string, and to interpret the contents of Unicode literals. String literals are |
| 130 | converted to Unicode for syntactical analysis, then converted back to their |
| 131 | original encoding before interpretation starts. The encoding declaration must |
| 132 | appear on a line of its own. |
| 133 | |
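The detection rule above can be sketched with the :mod:`re` module. This is an illustration only, not the interpreter's own code: the helper name ``declared_encoding`` is hypothetical, and finding the comment by splitting at the first ``#`` is a simplifying assumption.

```python
import re

# The pattern quoted in the text above.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")

def declared_encoding(source_lines):
    """Return the encoding named in the first two lines, or None.

    Assumption: everything after the first '#' on a line is treated as
    the comment; a real lexer would also rule out '#' inside strings.
    """
    for line in source_lines[:2]:
        _, hash_mark, comment = line.partition("#")
        if hash_mark:
            match = CODING_RE.search(comment)
            if match:
                return match.group(1)
    return None

emacs_style = ["#!/usr/bin/env python", "# -*- coding: latin-1 -*-"]
vim_style = ["# vim:fileencoding=utf-8"]
too_late = ["import os", "import sys", "# -*- coding: latin-1 -*-"]
```

Here ``declared_encoding(emacs_style)`` yields ``'latin-1'``, the Vim form yields ``'utf-8'``, and a declaration on the third line is not seen at all.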
| 134 | .. XXX there should be a list of supported encodings. |
| 135 | |
| 136 | |
| 137 | .. _explicit-joining: |
| 138 | |
| 139 | Explicit line joining |
| 140 | --------------------- |
| 141 | |
| 142 | .. index:: |
| 143 | single: physical line |
| 144 | single: line joining |
| 145 | single: line continuation |
| 146 | single: backslash character |
| 147 | |
| 148 | Two or more physical lines may be joined into logical lines using backslash |
| 149 | characters (``\``), as follows: when a physical line ends in a backslash that is |
| 150 | not part of a string literal or comment, it is joined with the following line, forming |
| 151 | a single logical line, deleting the backslash and the following end-of-line |
| 152 | character. For example:: |
| 153 | |
| 154 |    if 1900 < year < 2100 and 1 <= month <= 12 \ |
| 155 |       and 1 <= day <= 31 and 0 <= hour < 24 \ |
| 156 |       and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date |
| 157 |            return 1 |
| 158 | |
| 159 | A line ending in a backslash cannot carry a comment. A backslash does not |
| 160 | continue a comment. A backslash does not continue a token except for string |
| 161 | literals (i.e., tokens other than string literals cannot be split across |
| 162 | physical lines using a backslash). A backslash is illegal elsewhere on a line |
| 163 | outside a string literal. |
| 164 | |
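The rule that a backslash must be the last character on its line can be checked with the built-in :func:`compile`. The helper ``compiles`` and the two source strings are made up for this sketch, which was written against a current Python 3 interpreter.

```python
def compiles(source):
    """Return True if `source` is accepted by the compiler."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True

# Backslash immediately followed by the end of line: the lines are joined.
joined = "total = 1 + \\\n    2\n"
# A comment after the backslash: the backslash no longer ends the line.
commented = "total = 1 + \\ # note\n    2\n"

namespace = {}
exec(compile(joined, "<example>", "exec"), namespace)
```

The first string compiles and binds ``total`` to ``3``; the second is rejected with a :exc:`SyntaxError`.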
| 165 | |
| 166 | .. _implicit-joining: |
| 167 | |
| 168 | Implicit line joining |
| 169 | --------------------- |
| 170 | |
| 171 | Expressions in parentheses, square brackets or curly braces can be split over |
| 172 | more than one physical line without using backslashes. For example:: |
| 173 | |
| 174 |    month_names = ['Januari', 'Februari', 'Maart',      # These are the |
| 175 |                   'April',   'Mei',      'Juni',       # Dutch names |
| 176 |                   'Juli',    'Augustus', 'September',  # for the months |
| 177 |                   'Oktober', 'November', 'December']   # of the year |
| 178 | |
| 179 | Implicitly continued lines can carry comments. The indentation of the |
| 180 | continuation lines is not important. Blank continuation lines are allowed. |
| 181 | There is no NEWLINE token between implicit continuation lines. Implicitly |
| 182 | continued lines can also occur within triple-quoted strings (see below); in that |
| 183 | case they cannot carry comments. |
| 184 | |
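One way to observe implicit joining is the standard :mod:`tokenize` module. The sketch below assumes the Python 3 version of that module, whose tokens have named fields. Note that :mod:`tokenize` reports comments and non-logical line ends as ``COMMENT`` and ``NL`` items for the benefit of tools; these are not tokens in the sense used in this chapter.

```python
import io
import tokenize

# One logical line spread over three physical lines, one of them blank.
SOURCE = (
    "names = ['Januari',  # first\n"
    "\n"
    "         'Februari']\n"
)

# Assumes Python 3: generate_tokens() takes a readline returning str.
tokens = list(tokenize.generate_tokens(io.StringIO(SOURCE).readline))
kinds = [tokenize.tok_name[tok.type] for tok in tokens]
newline_count = kinds.count("NEWLINE")
```

Only one ``NEWLINE`` is produced for the whole bracketed expression, and the indentation of the continuation line produces no ``INDENT``.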
| 185 | |
| 186 | .. _blank-lines: |
| 187 | |
| 188 | Blank lines |
| 189 | ----------- |
| 190 | |
| 191 | .. index:: single: blank line |
| 192 | |
| 193 | A logical line that contains only spaces, tabs, formfeeds and possibly a |
| 194 | comment, is ignored (i.e., no NEWLINE token is generated). During interactive |
| 195 | input of statements, handling of a blank line may differ depending on the |
| 196 | implementation of the read-eval-print loop. In the standard implementation, an |
| 197 | entirely blank logical line (i.e. one containing not even whitespace or a |
| 198 | comment) terminates a multi-line statement. |
| 199 | |
| 200 | |
| 201 | .. _indentation: |
| 202 | |
| 203 | Indentation |
| 204 | ----------- |
| 205 | |
| 206 | .. index:: |
| 207 | single: indentation |
| 208 | single: whitespace |
| 209 | single: leading whitespace |
| 210 | single: space |
| 211 | single: tab |
| 212 | single: grouping |
| 213 | single: statement grouping |
| 214 | |
| 215 | Leading whitespace (spaces and tabs) at the beginning of a logical line is used |
| 216 | to compute the indentation level of the line, which in turn is used to determine |
| 217 | the grouping of statements. |
| 218 | |
| 219 | First, tabs are replaced (from left to right) by one to eight spaces such that |
| 220 | the total number of characters up to and including the replacement is a multiple |
| 221 | of eight (this is intended to be the same rule as used by Unix). The total |
| 222 | number of spaces preceding the first non-blank character then determines the |
| 223 | line's indentation. Indentation cannot be split over multiple physical lines |
| 224 | using backslashes; the whitespace up to the first backslash determines the |
| 225 | indentation. |
| 226 | |
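The tab rule amounts to advancing the column to the next multiple of eight. A minimal sketch, using a hypothetical helper ``indentation_width`` and leaving out the formfeed handling described further down:

```python
def indentation_width(line, tabsize=8):
    """Column reached by the leading spaces and tabs of `line`."""
    column = 0
    for char in line:
        if char == " ":
            column += 1
        elif char == "\t":
            # Advance to the next multiple of `tabsize`.
            column = (column // tabsize + 1) * tabsize
        else:
            break  # first non-blank character ends the indentation
    return column
```

With this rule a lone tab and two spaces followed by a tab both give an indentation of 8, while a tab followed by two spaces gives 10.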
| 227 | **Cross-platform compatibility note:** because of the nature of text editors on |
| 228 | non-UNIX platforms, it is unwise to use a mixture of spaces and tabs for the |
| 229 | indentation in a single source file. It should also be noted that different |
| 230 | platforms may explicitly limit the maximum indentation level. |
| 231 | |
| 232 | A formfeed character may be present at the start of the line; it will be ignored |
| 233 | for the indentation calculations above. Formfeed characters occurring elsewhere |
| 234 | in the leading whitespace have an undefined effect (for instance, they may reset |
| 235 | the space count to zero). |
| 236 | |
| 237 | .. index:: |
| 238 | single: INDENT token |
| 239 | single: DEDENT token |
| 240 | |
| 241 | The indentation levels of consecutive lines are used to generate INDENT and |
| 242 | DEDENT tokens, using a stack, as follows. |
| 243 | |
| 244 | Before the first line of the file is read, a single zero is pushed on the stack; |
| 245 | this will never be popped off again. The numbers pushed on the stack will |
| 246 | always be strictly increasing from bottom to top. At the beginning of each |
| 247 | logical line, the line's indentation level is compared to the top of the stack. |
| 248 | If it is equal, nothing happens. If it is larger, it is pushed on the stack, and |
| 249 | one INDENT token is generated. If it is smaller, it *must* be one of the |
| 250 | numbers occurring on the stack; all numbers on the stack that are larger are |
| 251 | popped off, and for each number popped off a DEDENT token is generated. At the |
| 252 | end of the file, a DEDENT token is generated for each number remaining on the |
| 253 | stack that is larger than zero. |
| 254 | |
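The stack procedure can be written down almost literally. The function below is an illustrative sketch rather than the tokenizer's source: ``indent_tokens`` is a made-up name, it takes one precomputed indentation level per logical line, and it emits the placeholder ``"LINE"`` where the tokens of the line itself would go.

```python
def indent_tokens(levels):
    """Turn per-line indentation levels into INDENT/DEDENT markers."""
    stack = [0]  # the initial zero is never popped
    tokens = []
    for level in levels:
        if level > stack[-1]:
            stack.append(level)
            tokens.append("INDENT")
        elif level < stack[-1]:
            # A smaller level must already be on the stack.
            if level not in stack:
                raise IndentationError("inconsistent dedent to %d" % level)
            while stack[-1] > level:
                stack.pop()
                tokens.append("DEDENT")
        tokens.append("LINE")
    # At end of file, close every level that is still open.
    tokens.extend("DEDENT" for _ in stack[1:])
    return tokens
```

For the levels ``[0, 4, 8]`` this gives ``LINE INDENT LINE INDENT LINE DEDENT DEDENT``, while ``[0, 8, 4]`` raises :exc:`IndentationError` because 4 was never pushed.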
| 255 | Here is an example of a correctly (though confusingly) indented piece of Python |
| 256 | code:: |
| 257 | |
| 258 |     def perm(l): |
| 259 |             # Compute the list of all permutations of l |
| 260 |         if len(l) <= 1: |
| 261 |                       return [l] |
| 262 |         r = [] |
| 263 |         for i in range(len(l)): |
| 264 |                  s = l[:i] + l[i+1:] |
| 265 |                  p = perm(s) |
| 266 |                  for x in p: |
| 267 |                   r.append(l[i:i+1] + x) |
| 268 |         return r |
| 269 | |
| 270 | The following example shows various indentation errors:: |
| 271 | |
| 272 |      def perm(l):                       # error: first line indented |
| 273 |     for i in range(len(l)):             # error: not indented |
| 274 |         s = l[:i] + l[i+1:] |
| 275 |             p = perm(l[:i] + l[i+1:])   # error: unexpected indent |
| 276 |             for x in p: |
| 277 |                     r.append(l[i:i+1] + x) |
| 278 |                 return r                # error: inconsistent dedent |
| 279 | |
| 280 | (Actually, the first three errors are detected by the parser; only the last |
| 281 | error is found by the lexical analyzer --- the indentation of ``return r`` does |
| 282 | not match a level popped off the stack.) |
| 283 | |
| 284 | |
| 285 | .. _whitespace: |
| 286 | |
| 287 | Whitespace between tokens |
| 288 | ------------------------- |
| 289 | |
| 290 | Except at the beginning of a logical line or in string literals, the whitespace |
| 291 | characters space, tab and formfeed can be used interchangeably to separate |
| 292 | tokens. Whitespace is needed between two tokens only if their concatenation |
| 293 | could otherwise be interpreted as a different token (e.g., ab is one token, but |
| 294 | a b is two tokens). |
| 295 | |
| 296 | |
| 297 | .. _other-tokens: |
| 298 | |
| 299 | Other tokens |
| 300 | ============ |
| 301 | |
| 302 | Besides NEWLINE, INDENT and DEDENT, the following categories of tokens exist: |
| 303 | *identifiers*, *keywords*, *literals*, *operators*, and *delimiters*. Whitespace |
| 304 | characters (other than line terminators, discussed earlier) are not tokens, but |
| 305 | serve to delimit tokens. Where ambiguity exists, a token comprises the longest |
| 306 | possible string that forms a legal token, when read from left to right. |
| 307 | |
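The longest-match rule can be illustrated with a deliberately tiny operator table. Both the table and the helper ``longest_operator`` are assumptions made for this sketch; the real lexer knows many more tokens.

```python
# A small, illustrative subset of operators and delimiters.
OPERATORS = ["<", "<<", "<=", "<<=", "=", "=="]

def longest_operator(text):
    """Return the longest operator that `text` starts with, or None."""
    candidates = [op for op in OPERATORS if text.startswith(op)]
    return max(candidates, key=len) if candidates else None
```

Reading ``<<=1`` from the left therefore yields the single token ``<<=``, not ``<`` followed by ``<=``.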
| 308 | |
| 309 | .. _identifiers: |
| 310 | |
| 311 | Identifiers and keywords |
| 312 | ======================== |
| 313 | |
| 314 | .. index:: |
| 315 | single: identifier |
| 316 | single: name |
| 317 | |
| 318 | Identifiers (also referred to as *names*) are described by the following lexical |
| 319 | definitions: |
| 320 | |
| 321 | .. productionlist:: |
| 322 | identifier: (`letter`|"_") (`letter` | `digit` | "_")* |
| 323 | letter: `lowercase` | `uppercase` |
| 324 | lowercase: "a"..."z" |
| 325 | uppercase: "A"..."Z" |
| 326 | digit: "0"..."9" |
| 327 | |
| 328 | Identifiers are unlimited in length. Case is significant. |
| 329 | |
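The production for ``identifier`` translates directly into a regular expression. A small sketch follows; the helper name is hypothetical, and it checks the shape of the name only, so keywords also pass.

```python
import re

# identifier ::= (letter | "_") (letter | digit | "_")*
IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")

def is_identifier(text):
    """True if `text` matches the ASCII-only identifier production."""
    return IDENTIFIER_RE.match(text) is not None
```

Under this grammar ``_private1`` is an identifier, while ``2fast`` and names containing non-ASCII letters are not.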
| 330 | |
| 331 | .. _keywords: |
| 332 | |
| 333 | Keywords |
| 334 | -------- |
| 335 | |
| 336 | .. index:: |
| 337 | single: keyword |
| 338 | single: reserved word |
| 339 | |
| 340 | The following identifiers are used as reserved words, or *keywords* of the |
| 341 | language, and cannot be used as ordinary identifiers. They must be spelled |
| 342 | exactly as written here: |
| 343 | |
| 344 | .. sourcecode:: text |
| 345 | |
| 346 |    and       del       from      not       while |
| 347 |    as        elif      global    or        with |
| 348 |    assert    else      if        pass      yield |
| 349 |    break     except    import    print |
| 350 |    class     exec      in        raise |
| 351 |    continue  finally   is        return |
| 352 |    def       for       lambda    try |
| 353 | |
| 354 | .. versionchanged:: 2.4 |
| 355 | :const:`None` became a constant and is now recognized by the compiler as a name |
| 356 | for the built-in object :const:`None`. Although it is not a keyword, you cannot |
| 357 | assign a different object to it. |
| 358 | |
| 359 | .. versionchanged:: 2.5 |
| 360 | Both :keyword:`as` and :keyword:`with` are only recognized when the |
| 361 | ``with_statement`` future feature has been enabled. It will always be enabled in |
| 362 | Python 2.6. See section :ref:`with` for details. Note that using :keyword:`as` |
| 363 | and :keyword:`with` as identifiers will always issue a warning, even when the |
| 364 | ``with_statement`` future directive is not in effect. |
| 365 | |
| 366 | |
| 367 | .. _id-classes: |
| 368 | |
| 369 | Reserved classes of identifiers |
| 370 | ------------------------------- |
| 371 | |
| 372 | Certain classes of identifiers (besides keywords) have special meanings. These |
| 373 | classes are identified by the patterns of leading and trailing underscore |
| 374 | characters: |
| 375 | |
| 376 | ``_*`` |
| 377 | Not imported by ``from module import *``. The special identifier ``_`` is used |
| 378 | in the interactive interpreter to store the result of the last evaluation; it is |
| 379 | stored in the :mod:`__builtin__` module. When not in interactive mode, ``_`` |
| 380 | has no special meaning and is not defined. See section :ref:`import`. |
| 381 | |
| 382 | .. note:: |
| 383 | |
| 384 | The name ``_`` is often used in conjunction with internationalization; |
| 385 | refer to the documentation for the :mod:`gettext` module for more |
| 386 | information on this convention. |
| 387 | |
| 388 | ``__*__`` |
| 389 | System-defined names. These names are defined by the interpreter and its |
| 390 | implementation (including the standard library); applications should not expect |
| 391 | to define additional names using this convention. The set of names of this |
| 392 | class defined by Python may be extended in future versions. See section |
| 393 | :ref:`specialnames`. |
| 394 | |
| 395 | ``__*`` |
| 396 | Class-private names. Names in this category, when used within the context of a |
| 397 | class definition, are re-written to use a mangled form to help avoid name |
| 398 | clashes between "private" attributes of base and derived classes. See section |
| 399 | :ref:`atom-identifiers`. |
| 400 | |
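The effect of the ``__*`` rule can be seen with two small classes. The class and attribute names are invented for this sketch.

```python
class Base(object):
    def __init__(self):
        self.__token = "base"      # stored as _Base__token

class Derived(Base):
    def __init__(self):
        Base.__init__(self)
        self.__token = "derived"   # stored as _Derived__token

obj = Derived()
base_value = obj._Base__token
derived_value = obj._Derived__token
```

Both attributes coexist on the same instance because each assignment was rewritten with the name of the class in which it appears.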
| 401 | |
| 402 | .. _literals: |
| 403 | |
| 404 | Literals |
| 405 | ======== |
| 406 | |
| 407 | .. index:: |
| 408 | single: literal |
| 409 | single: constant |
| 410 | |
| 411 | Literals are notations for constant values of some built-in types. |
| 412 | |
| 413 | |
| 414 | .. _strings: |
| 415 | |
| 416 | String literals |
| 417 | --------------- |
| 418 | |
| 419 | .. index:: single: string literal |
| 420 | |
| 421 | String literals are described by the following lexical definitions: |
| 422 | |
| 423 | .. index:: single: ASCII@ASCII |
| 424 | |
| 425 | .. productionlist:: |
| 426 | stringliteral: [`stringprefix`](`shortstring` | `longstring`) |
| 427 | stringprefix: "r" | "u" | "ur" | "R" | "U" | "UR" | "Ur" | "uR" |
| 428 | shortstring: "'" `shortstringitem`* "'" | '"' `shortstringitem`* '"' |
| 429 | longstring: "'''" `longstringitem`* "'''" |
| 430 | : | '"""' `longstringitem`* '"""' |
| 431 | shortstringitem: `shortstringchar` | `escapeseq` |
| 432 | longstringitem: `longstringchar` | `escapeseq` |
| 433 | shortstringchar: <any source character except "\" or newline or the quote> |
| 434 | longstringchar: <any source character except "\"> |
| 435 | escapeseq: "\" <any ASCII character> |
| 436 | |
| 437 | One syntactic restriction not indicated by these productions is that whitespace |
| 438 | is not allowed between the :token:`stringprefix` and the rest of the string |
| 439 | literal. The source character set is defined by the encoding declaration; it is |
| 440 | ASCII if no encoding declaration is given in the source file; see section |
| 441 | :ref:`encodings`. |
| 442 | |
| 443 | .. index:: |
| 444 | single: triple-quoted string |
| 445 | single: Unicode Consortium |
| 446 | single: string; Unicode |
| 447 | single: raw string |
| 448 | |
| 449 | In plain English: String literals can be enclosed in matching single quotes |
| 450 | (``'``) or double quotes (``"``). They can also be enclosed in matching groups |
| 451 | of three single or double quotes (these are generally referred to as |
| 452 | *triple-quoted strings*). The backslash (``\``) character is used to escape |
| 453 | characters that otherwise have a special meaning, such as newline, backslash |
| 454 | itself, or the quote character. String literals may optionally be prefixed with |
| 455 | a letter ``'r'`` or ``'R'``; such strings are called :dfn:`raw strings` and use |
| 456 | different rules for interpreting backslash escape sequences. A prefix of |
| 457 | ``'u'`` or ``'U'`` makes the string a Unicode string. Unicode strings use the |
| 458 | Unicode character set as defined by the Unicode Consortium and ISO 10646. Some |
| 459 | additional escape sequences, described below, are available in Unicode strings. |
| 460 | The two prefix characters may be combined; in this case, ``'u'`` must appear |
| 461 | before ``'r'``. |
| 462 | |
| 463 | In triple-quoted strings, unescaped newlines and quotes are allowed (and are |
| 464 | retained), except that three unescaped quotes in a row terminate the string. (A |
| 465 | "quote" is the character used to open the string, i.e. either ``'`` or ``"``.) |
| 466 | |
| 467 | .. index:: |
| 468 | single: physical line |
| 469 | single: escape sequence |
| 470 | single: Standard C |
| 471 | single: C |
| 472 | |
| 473 | Unless an ``'r'`` or ``'R'`` prefix is present, escape sequences in strings are |
| 474 | interpreted according to rules similar to those used by Standard C. The |
| 475 | recognized escape sequences are: |
| 476 | |
| 477 | +-----------------+---------------------------------+-------+ |
| 478 | | Escape Sequence | Meaning | Notes | |
| 479 | +=================+=================================+=======+ |
| 480 | | ``\newline`` | Ignored | | |
| 481 | +-----------------+---------------------------------+-------+ |
| 482 | | ``\\`` | Backslash (``\``) | | |
| 483 | +-----------------+---------------------------------+-------+ |
| 484 | | ``\'`` | Single quote (``'``) | | |
| 485 | +-----------------+---------------------------------+-------+ |
| 486 | | ``\"`` | Double quote (``"``) | | |
| 487 | +-----------------+---------------------------------+-------+ |
| 488 | | ``\a`` | ASCII Bell (BEL) | | |
| 489 | +-----------------+---------------------------------+-------+ |
| 490 | | ``\b`` | ASCII Backspace (BS) | | |
| 491 | +-----------------+---------------------------------+-------+ |
| 492 | | ``\f`` | ASCII Formfeed (FF) | | |
| 493 | +-----------------+---------------------------------+-------+ |
| 494 | | ``\n`` | ASCII Linefeed (LF) | | |
| 495 | +-----------------+---------------------------------+-------+ |
| 496 | | ``\N{name}`` | Character named *name* in the | | |
| 497 | | | Unicode database (Unicode only) | | |
| 498 | +-----------------+---------------------------------+-------+ |
| 499 | | ``\r`` | ASCII Carriage Return (CR) | | |
| 500 | +-----------------+---------------------------------+-------+ |
| 501 | | ``\t`` | ASCII Horizontal Tab (TAB) | | |
| 502 | +-----------------+---------------------------------+-------+ |
| 503 | | ``\uxxxx`` | Character with 16-bit hex value | \(1) | |
| 504 | | | *xxxx* (Unicode only) | | |
| 505 | +-----------------+---------------------------------+-------+ |
| 506 | | ``\Uxxxxxxxx`` | Character with 32-bit hex value | \(2) | |
| 507 | | | *xxxxxxxx* (Unicode only) | | |
| 508 | +-----------------+---------------------------------+-------+ |
| 509 | | ``\v`` | ASCII Vertical Tab (VT) | | |
| 510 | +-----------------+---------------------------------+-------+ |
| 511 | | ``\ooo`` | Character with octal value | (3,5) | |
| 512 | | | *ooo* | | |
| 513 | +-----------------+---------------------------------+-------+ |
| 514 | | ``\xhh`` | Character with hex value *hh* | (4,5) | |
| 515 | +-----------------+---------------------------------+-------+ |
| 516 | |
| 517 | .. index:: single: ASCII@ASCII |
| 518 | |
| 519 | Notes: |
| 520 | |
| 521 | (1) |
| 522 | Individual code units which form parts of a surrogate pair can be encoded using |
| 523 | this escape sequence. |
| 524 | |
| 525 | (2) |
| 526 | Any Unicode character can be encoded this way, but characters outside the Basic |
| 527 | Multilingual Plane (BMP) will be encoded using a surrogate pair if Python is |
| 528 | compiled to use 16-bit code units (the default). Individual code units which |
| 529 | form parts of a surrogate pair can be encoded using this escape sequence. |
| 530 | |
| 531 | (3) |
| 532 | As in Standard C, up to three octal digits are accepted. |
| 533 | |
| 534 | (4) |
| 535 | Unlike in Standard C, exactly two hex digits are required. |
| 536 | |
| 537 | (5) |
| 538 | In a string literal, hexadecimal and octal escapes denote the byte with the |
| 539 | given value; it is not necessary that the byte encodes a character in the source |
| 540 | character set. In a Unicode literal, these escapes denote a Unicode character |
| 541 | with the given value. |
| 542 | |
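Notes (3) and (4) can be checked with a few literals. The variable names are arbitrary. The sketch was run mentally against a current Python 3 interpreter, where an unprefixed literal is a text string, so the byte-versus-character distinction of note (5) is not visible here.

```python
octal = "\101"    # up to three octal digits: 0o101 == 65, i.e. "A"
hexa = "\x41"     # exactly two hex digits: 0x41 == 65, i.e. "A"
short = "\0"      # a single octal digit is accepted as well
mixed = "\1011"   # three digits are consumed; the final "1" is literal
```

The last line shows why the three-digit limit matters: ``"\1011"`` is the two-character string ``"A1"``.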
| 543 | .. index:: single: unrecognized escape sequence |
| 544 | |
| 545 | Unlike Standard C, all unrecognized escape sequences are left in the string |
| 546 | unchanged, i.e., *the backslash is left in the string*. (This behavior is |
| 547 | useful when debugging: if an escape sequence is mistyped, the resulting output |
| 548 | is more easily recognized as broken.) It is also important to note that the |
| 549 | escape sequences marked as "(Unicode only)" in the table above fall into the |
| 550 | category of unrecognized escapes for non-Unicode string literals. |
| 551 | |
| 552 | When an ``'r'`` or ``'R'`` prefix is present, a character following a backslash |
| 553 | is included in the string without change, and *all backslashes are left in the |
| 554 | string*. For example, the string literal ``r"\n"`` consists of two characters: |
| 555 | a backslash and a lowercase ``'n'``. String quotes can be escaped with a |
| 556 | backslash, but the backslash remains in the string; for example, ``r"\""`` is a |
| 557 | valid string literal consisting of two characters: a backslash and a double |
| 558 | quote; ``r"\"`` is not a valid string literal (even a raw string cannot end in |
| 559 | an odd number of backslashes). Specifically, *a raw string cannot end in a |
| 560 | single backslash* (since the backslash would escape the following quote |
| 561 | character). Note also that a single backslash followed by a newline is |
| 562 | interpreted as those two characters as part of the string, *not* as a line |
| 563 | continuation. |
| 564 | |
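The raw-string rules can be checked mechanically. The helper ``is_valid_literal`` is a hypothetical name introduced for this sketch; it only asks the compiler whether a piece of text is an acceptable expression.

```python
def is_valid_literal(text):
    """Return True if `text` compiles as a single expression."""
    try:
        compile(text, "<example>", "eval")
    except SyntaxError:
        return False
    return True

newline = "\n"    # one character: LF
raw = r"\n"       # two characters: backslash and "n"
quoted = r"\""    # two characters: backslash and double quote

# The source text r"\" (a raw string ending in a single backslash).
unterminated = 'r"\\"'
```

``is_valid_literal(unterminated)`` is false, because the backslash escapes the closing quote and the literal never ends.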
| 565 | When an ``'r'`` or ``'R'`` prefix is used in conjunction with a ``'u'`` or |
| 566 | ``'U'`` prefix, then the ``\uXXXX`` and ``\UXXXXXXXX`` escape sequences are |
| 567 | processed while *all other backslashes are left in the string*. For example, |
| 568 | the string literal ``ur"\u0062\n"`` consists of three Unicode characters: 'LATIN |
| 569 | SMALL LETTER B', 'REVERSE SOLIDUS', and 'LATIN SMALL LETTER N'. Backslashes can |
| 570 | be escaped with a preceding backslash; however, both remain in the string. As a |
| 571 | result, ``\uXXXX`` escape sequences are only recognized when there are an odd |
| 572 | number of backslashes. |
| 573 | |
| 574 | |
| 575 | .. _string-catenation: |
| 576 | |
| 577 | String literal concatenation |
| 578 | ---------------------------- |
| 579 | |
| 580 | Multiple adjacent string literals (delimited by whitespace), possibly using |
| 581 | different quoting conventions, are allowed, and their meaning is the same as |
| 582 | their concatenation. Thus, ``"hello" 'world'`` is equivalent to |
| 583 | ``"helloworld"``. This feature can be used to reduce the number of backslashes |
| 584 | needed, to split long strings conveniently across long lines, or even to add |
| 585 | comments to parts of strings, for example:: |
| 586 | |
| 587 |    re.compile("[A-Za-z_]"       # letter or underscore |
| 588 |               "[A-Za-z0-9_]*"   # letter, digit or underscore |
| 589 |              ) |
| 590 | |
| 591 | Note that this feature is defined at the syntactical level, but implemented at |
| 592 | compile time. The '+' operator must be used to concatenate string expressions |
| 593 | at run time. Also note that literal concatenation can use different quoting |
| 594 | styles for each component (even mixing raw strings and triple quoted strings). |
| 595 | |
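A short sketch of mixed quoting styles in one concatenation; the variable names are arbitrary.

```python
greeting = "hello" 'world'

pattern = (r"\d+"       # raw piece: the backslash is kept
           "\t"         # ordinary piece: a real tab character
           '''end''')   # triple-quoted piece
```

``greeting`` is ``"helloworld"``, and ``pattern`` consists of a backslash, ``d``, ``+``, a tab, and ``end``.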
| 596 | |
| 597 | .. _numbers: |
| 598 | |
| 599 | Numeric literals |
| 600 | ---------------- |
| 601 | |
| 602 | .. index:: |
| 603 | single: number |
| 604 | single: numeric literal |
| 605 | single: integer literal |
| 606 | single: plain integer literal |
| 607 | single: long integer literal |
| 608 | single: floating point literal |
| 609 | single: hexadecimal literal |
Benjamin Peterson | d79af0f | 2008-10-30 22:44:18 +0000 | [diff] [blame] | 610 | single: binary literal |
Georg Brandl | 8ec7f65 | 2007-08-15 14:28:01 +0000 | [diff] [blame] | 611 | single: octal literal |
| 612 | single: decimal literal |
| 613 | single: imaginary literal |
| 614 | single: complex; literal |
| 615 | |
| 616 | There are four types of numeric literals: plain integers, long integers, |
| 617 | floating point numbers, and imaginary numbers. There are no complex literals |
| 618 | (complex numbers can be formed by adding a real number and an imaginary number). |
| 619 | |
| 620 | Note that numeric literals do not include a sign; a phrase like ``-1`` is |
| 621 | actually an expression composed of the unary operator '``-``' and the literal |
| 622 | ``1``. |
| 623 | |
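That the sign is a separate token can be observed with the standard :mod:`tokenize` module. The sketch assumes the Python 3 version of the module, and ``token_pairs`` is a hypothetical helper.

```python
import io
import tokenize

def token_pairs(source):
    """Return (type name, text) pairs, skipping layout-only tokens."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
    # Assumes Python 3: tokens are named tuples with .type and .string.
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type not in skip
    ]

negative = token_pairs("-1\n")
complex_sum = token_pairs("3+4j\n")
```

``-1`` comes out as an operator followed by a number, and ``3+4j`` as two numbers joined by an operator, which is why there are no complex literals.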
| 624 | |
| 625 | .. _integers: |
| 626 | |
| 627 | Integer and long integer literals |
| 628 | --------------------------------- |
| 629 | |
| 630 | Integer and long integer literals are described by the following lexical |
| 631 | definitions: |
| 632 | |
| 633 | .. productionlist:: |
| 634 | longinteger: `integer` ("l" | "L") |
Benjamin Peterson | b5f8208 | 2008-10-30 22:39:25 +0000 | [diff] [blame] | 635 | integer: `decimalinteger` | `octinteger` | `hexinteger` | `bininteger` |
Georg Brandl | 8ec7f65 | 2007-08-15 14:28:01 +0000 | [diff] [blame] | 636 | decimalinteger: `nonzerodigit` `digit`* | "0" |
Benjamin Peterson | d79af0f | 2008-10-30 22:44:18 +0000 | [diff] [blame] | 637 | octinteger: "0" ("o" | "O") `octdigit`+ | "0" `octdigit`+ |
Georg Brandl | 8ec7f65 | 2007-08-15 14:28:01 +0000 | [diff] [blame] | 638 | hexinteger: "0" ("x" | "X") `hexdigit`+ |
Benjamin Peterson | b5f8208 | 2008-10-30 22:39:25 +0000 | [diff] [blame] | 639 | bininteger: "0" ("b" | "B") `bindigit`+ |
Georg Brandl | 8ec7f65 | 2007-08-15 14:28:01 +0000 | [diff] [blame] | 640 | nonzerodigit: "1"..."9" |
| 641 | octdigit: "0"..."7" |
Benjamin Peterson | d79af0f | 2008-10-30 22:44:18 +0000 | [diff] [blame] | 642 | bindigit: "0" | "1" |
Georg Brandl | 8ec7f65 | 2007-08-15 14:28:01 +0000 | [diff] [blame] | 643 | hexdigit: `digit` | "a"..."f" | "A"..."F" |
| 644 | |
| 645 | Although both lower case ``'l'`` and upper case ``'L'`` are allowed as suffix |
| 646 | for long integers, it is strongly recommended to always use ``'L'``, since the |
| 647 | letter ``'l'`` looks too much like the digit ``'1'``. |
| 648 | |
| 649 | Plain integer literals that are above the largest representable plain integer |
| 650 | (e.g., 2147483647 when using 32-bit arithmetic) are accepted as if they were |
| 651 | long integers instead. [#]_ There is no limit for long integer literals apart |
| 652 | from what can be stored in available memory. |
| 653 | |
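The radix prefixes in the productions above all denote ordinary integers. The sketch below sticks to the prefixed forms so that it also runs on a current Python 3 interpreter, where the bare ``0`` octal prefix and the ``L`` suffix are no longer accepted in source; the legacy octal spelling is therefore parsed from a string instead.

```python
decimal = 127
octal = 0o177          # "0o" prefix from the octinteger production
hexadecimal = 0x7F     # hexinteger
binary = 0b1111111     # bininteger

# The bare "0" prefix, parsed explicitly with base 8 (assumption: done
# via int() because Python 3 rejects the literal 0177 in source code).
legacy_octal = int("0177", 8)

# Too large for 32-bit arithmetic; accepted without any suffix.
big = 79228162514264337593543950336
```

All five small values are equal to 127, and ``big`` equals ``2 ** 96``.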
| 654 | Some examples of plain integer literals (first row) and long integer literals |
| 655 | (second and third rows):: |
| 656 | |
| 657 |    7     2147483647                        0177 |
| 658 |    3L    79228162514264337593543950336L    0377L   0x100000000L |
| 659 |          79228162514264337593543950336             0xdeadbeef |
| 660 | |
| 661 | |
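The integer productions and the promotion rule above can be sketched as a small classifier. This is only an illustration, not the interpreter's lexer: the names ``INTEGER``, ``LITERAL``, ``MAX_PLAIN``, ``value_of`` and ``classify`` are hypothetical, 32-bit plain integers are assumed, and the sketch is written to run on a current Python interpreter, so the literals are handled as strings rather than as source text.

```python
import re

# Regular-expression transcription of the integer productions.
INTEGER = (r"(?:0[xX][0-9a-fA-F]+"   # hexinteger
           r"|0[bB][01]+"            # bininteger
           r"|0[oO]?[0-7]+"          # octinteger, both spellings
           r"|[1-9][0-9]*|0)")       # decimalinteger
LITERAL = re.compile(r"(?P<body>" + INTEGER + r")(?P<suffix>[lL]?)")

MAX_PLAIN = 2147483647  # assumption: 32-bit plain integers


def value_of(body):
    """Return the numeric value of an integer literal without its suffix."""
    lower = body.lower()
    if lower.startswith("0x"):
        return int(lower[2:], 16)
    if lower.startswith("0b"):
        return int(lower[2:], 2)
    if lower.startswith("0o"):
        return int(lower[2:], 8)
    if len(lower) > 1 and lower.startswith("0"):
        return int(lower[1:], 8)     # old-style octal such as 0177
    return int(lower, 10)


def classify(text):
    """Return (kind, value) for a valid literal, or None if it does not match."""
    match = LITERAL.fullmatch(text)
    if match is None:
        return None
    value = value_of(match.group("body"))
    is_long = bool(match.group("suffix")) or value > MAX_PLAIN
    return ("long" if is_long else "plain", value)


for text in ["7", "2147483647", "0177", "3L", "0x100000000L", "0xdeadbeef"]:
    print(text, classify(text))
```

Under these assumptions ``0177`` is a plain integer with value 127, while ``0xdeadbeef`` is classified as long because its value exceeds ``MAX_PLAIN`` even though it carries no suffix.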
| 662 | .. _floating: |
| 663 | |
| 664 | Floating point literals |
| 665 | ----------------------- |
| 666 | |
| 667 | Floating point literals are described by the following lexical definitions: |
| 668 | |
| 669 | .. productionlist:: |
| 670 | floatnumber: `pointfloat` | `exponentfloat` |
| 671 | pointfloat: [`intpart`] `fraction` | `intpart` "." |
| 672 | exponentfloat: (`intpart` | `pointfloat`) `exponent` |
| 673 | intpart: `digit`+ |
| 674 | fraction: "." `digit`+ |
| 675 | exponent: ("e" | "E") ["+" | "-"] `digit`+ |
| 676 | |
| 677 | Note that the integer and exponent parts of floating point numbers can look like |
| 678 | octal integers, but are interpreted using radix 10. For example, ``077e010`` is |
| 679 | legal, and denotes the same number as ``77e10``. The allowed range of floating |
| 680 | point literals is implementation-dependent. Some examples of floating point |
| 681 | literals:: |
| 682 | |
| 683 | 3.14 10. .001 1e100 3.14e-10 0e0 |
| 684 | |
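The floating point productions can be transcribed into a regular expression in the same way. The sketch below is illustrative only: the pattern names mirror the grammar, ``is_float_literal`` is a hypothetical helper, and the built-in ``float()`` is used on strings to show that the integer and exponent parts are read in radix 10.

```python
import re

INTPART = r"[0-9]+"
FRACTION = r"\.[0-9]+"
EXPONENT = r"[eE][+-]?[0-9]+"
# pointfloat: [intpart] fraction | intpart "."
POINTFLOAT = r"(?:(?:%s)?%s|%s\.)" % (INTPART, FRACTION, INTPART)
# exponentfloat: (intpart | pointfloat) exponent
EXPONENTFLOAT = r"(?:(?:%s|%s)%s)" % (POINTFLOAT, INTPART, EXPONENT)
FLOATNUMBER = re.compile(r"(?:%s|%s)" % (EXPONENTFLOAT, POINTFLOAT))


def is_float_literal(text):
    """Return True if the whole string matches the floatnumber production."""
    return FLOATNUMBER.fullmatch(text) is not None


for text in ["3.14", "10.", ".001", "1e100", "3.14e-10", "0e0", "10", "-1.0"]:
    print(text, is_float_literal(text))

# Leading zeros do not switch to octal: both strings denote the same number.
print(float("077e010") == float("77e10"))
```

Note that ``10`` is rejected because it is an integer literal, and ``-1.0`` is rejected because the sign is not part of the literal.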
| 685 | Note that numeric literals do not include a sign; a phrase like ``-1`` is |
| 686 | actually an expression composed of the unary operator ``-`` and the literal |
| 687 | ``1``. |
| 688 | |
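One way to observe that the sign is a separate token is the standard ``tokenize`` module. The following is a minimal sketch, assuming a current Python interpreter; ``significant_tokens`` is a hypothetical helper that drops layout tokens such as NEWLINE and ENDMARKER.

```python
import io
import tokenize


def significant_tokens(source):
    """Return (token name, text) pairs for names, numbers and operators."""
    readline = io.StringIO(source).readline
    keep = (tokenize.NAME, tokenize.NUMBER, tokenize.OP)
    return [(tokenize.tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)
            if tok.type in keep]


# The minus sign and the digit arrive as two separate tokens.
print(significant_tokens("-1\n"))
```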
| 689 | |
| 690 | .. _imaginary: |
| 691 | |
| 692 | Imaginary literals |
| 693 | ------------------ |
| 694 | |
| 695 | Imaginary literals are described by the following lexical definitions: |
| 696 | |
| 697 | .. productionlist:: |
| 698 | imagnumber: (`floatnumber` | `intpart`) ("j" | "J") |
| 699 | |
| 700 | An imaginary literal yields a complex number with a real part of 0.0. Complex |
| 701 | numbers are represented as a pair of floating point numbers and have the same |
| 702 | restrictions on their range. To create a complex number with a nonzero real |
| 703 | part, add a floating point number to it, e.g., ``(3+4j)``. Some examples of |
| 704 | imaginary literals:: |
| 705 | |
| 706 | 3.14j 10.j 10j .001j 1e100j 3.14e-10j |
| 707 | |
| 708 | |
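A short illustration of the points above, using only built-in complex arithmetic; the variable names are arbitrary:

```python
# An imaginary literal yields a complex number whose real part is 0.0.
z = 3.14j
print(z.real, z.imag)

# Adding a real number gives a complex number with a nonzero real part.
w = 3 + 4j
print(w.real, w.imag)

# Both spellings denote the same imaginary literal.
print(10.j == 10j)
```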
| 709 | .. _operators: |
| 710 | |
| 711 | Operators |
| 712 | ========= |
| 713 | |
| 714 | .. index:: single: operators |
| 715 | |
| 716 | The following tokens are operators:: |
| 717 | |
| 718 | + - * ** / // % |
| 719 | << >> & | ^ ~ |
| 720 | < > <= >= == != <> |
| 721 | |
| 722 | The comparison operators ``<>`` and ``!=`` are alternate spellings of the same |
| 723 | operator. ``!=`` is the preferred spelling; ``<>`` is obsolescent. |
| 724 | |
| 725 | |
| 726 | .. _delimiters: |
| 727 | |
| 728 | Delimiters |
| 729 | ========== |
| 730 | |
| 731 | .. index:: single: delimiters |
| 732 | |
| 733 | The following tokens serve as delimiters in the grammar:: |
| 734 | |
| 735 | ( ) [ ] { } @ |
| 736 | , : . ` = ; |
| 737 | += -= *= /= //= %= |
| 738 | &= |= ^= >>= <<= **= |
| 739 | |
| 740 | The period can also occur in floating-point and imaginary literals. A sequence |
| 741 | of three periods has a special meaning as an ellipsis in slices. The augmented |
| 742 | assignment operators in the second half of the list serve lexically as |
| 743 | delimiters, but also perform an operation. |
| 744 | |
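The dual role of an augmented assignment operator can be sketched with the standard ``tokenize`` module: the operator is a single token, and executing the statement also performs the addition. This assumes a current Python interpreter, and ``operator_tokens`` is a hypothetical helper.

```python
import io
import tokenize


def operator_tokens(source):
    """Return the text of every operator or delimiter token in source."""
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.OP]


# "+=" is one token, not "+" followed by "=".
print(operator_tokens("x += 1\n"))

# Besides delimiting the statement, it performs an operation.
namespace = {"x": 41}
exec("x += 1", namespace)
print(namespace["x"])
```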
| 745 | The following printing ASCII characters have special meaning as part of other |
| 746 | tokens or are otherwise significant to the lexical analyzer:: |
| 747 | |
| 748 | ' " # \ |
| 749 | |
| 750 | .. index:: single: ASCII@ASCII |
| 751 | |
| 752 | The following printing ASCII characters are not used in Python. Their |
| 753 | occurrence outside string literals and comments is an unconditional error:: |
| 754 | |
| 755 | $ ? |
| 756 | |
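The rule can be checked with the built-in ``compile()`` function. This is a minimal sketch, assuming a current Python interpreter; ``compiles`` is a hypothetical helper, and the file name ``<example>`` is a placeholder.

```python
def compiles(source):
    """Return True if source compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


# Outside string literals and comments, "$" and "?" are rejected.
print(compiles("total = 3 $ 4\n"))
print(compiles("total = 3 ? 4\n"))

# Inside a string literal or a comment they are accepted.
print(compiles("text = 'cost: $5?'  # $ and ? are fine here\n"))
```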
| 757 | .. rubric:: Footnotes |
| 758 | |
| 759 | .. [#] In versions of Python prior to 2.4, octal and hexadecimal literals in the range |
| 760 | just above the largest representable plain integer but below 4294967296 (2**32, |
| 761 | one more than the largest unsigned 32-bit number, on a machine using 32-bit |
| 762 | arithmetic) were taken as the negative plain integer obtained by subtracting |
| 763 | 4294967296 from their unsigned value. |
| 764 | |
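The arithmetic described in the footnote can be worked through as follows. The function ``pre_24_value`` is a hypothetical name that only emulates the subtraction on ordinary integers, assuming 32-bit arithmetic; it does not reproduce the old lexer.

```python
MAX_PLAIN = 2147483647   # largest plain integer with 32-bit arithmetic
MODULUS = 4294967296     # 2**32


def pre_24_value(unsigned):
    """Emulate how a pre-2.4 octal or hexadecimal literal was interpreted."""
    if MAX_PLAIN < unsigned < MODULUS:
        return unsigned - MODULUS
    return unsigned


print(pre_24_value(0xffffffff))
print(pre_24_value(0x80000000))
print(pre_24_value(0x7fffffff))
```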