\chapter{Lexical analysis\label{lexical}}

A Python program is read by a \emph{parser}. Input to the parser is a
stream of \emph{tokens}, generated by the \emph{lexical analyzer}. This
chapter describes how the lexical analyzer breaks a file into tokens.
\index{lexical analysis}
\index{parser}
\index{token}

Python uses the 7-bit \ASCII{} character set for program text and string
literals. 8-bit characters may be used in string literals and comments
but their interpretation is platform dependent; the proper way to
insert 8-bit characters in string literals is by using octal or
hexadecimal escape sequences.

The run-time character set depends on the I/O devices connected to the
program but is generally a superset of \ASCII{}.

\strong{Future compatibility note:} It may be tempting to assume that the
character set for 8-bit characters is ISO Latin-1 (an \ASCII{}
superset that covers most western languages that use the Latin
alphabet), but it is possible that in the future Unicode text editors
will become common. These generally use the UTF-8 encoding, which is
also an \ASCII{} superset, but with very different use for the
characters with ordinals 128-255. While there is no consensus on this
subject yet, it is unwise to assume either Latin-1 or UTF-8, even
though the current implementation appears to favor Latin-1. This
applies both to the source character set and the run-time character
set.

\section{Line structure\label{line-structure}}

A Python program is divided into a number of \emph{logical lines}.
\index{line structure}

\subsection{Logical lines\label{logical}}

The end of a logical line is represented by the token NEWLINE.
Statements cannot cross logical line boundaries except where NEWLINE
is allowed by the syntax (e.g., between statements in compound
statements). A logical line is constructed from one or more
\emph{physical lines} by following the explicit or implicit
\emph{line joining} rules.
\index{logical line}
\index{physical line}
\index{line joining}
\index{NEWLINE token}

\subsection{Physical lines\label{physical}}

A physical line ends in whatever the current platform's convention is
for terminating lines. On \UNIX{}, this is the \ASCII{} LF (linefeed)
character. On DOS/Windows, it is the \ASCII{} sequence CR LF (return
followed by linefeed). On Macintosh, it is the \ASCII{} CR (return)
character.

\subsection{Comments\label{comments}}

A comment starts with a hash character (\code{\#}) that is not part of
a string literal, and ends at the end of the physical line. A comment
signifies the end of the logical line unless the implicit line joining
rules are invoked.
Comments are ignored by the syntax; they are not tokens.
\index{comment}
\index{hash character}

\subsection{Explicit line joining\label{explicit-joining}}

Two or more physical lines may be joined into logical lines using
backslash characters (\code{\e}), as follows: when a physical line ends
in a backslash that is not part of a string literal or comment, it is
joined with the following line, forming a single logical line and
deleting the backslash and the following end-of-line character. For
example:
\index{physical line}
\index{line joining}
\index{line continuation}
\index{backslash character}
%
\begin{verbatim}
if 1900 < year < 2100 and 1 <= month <= 12 \
   and 1 <= day <= 31 and 0 <= hour < 24 \
   and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
        return 1
\end{verbatim}

A line ending in a backslash cannot carry a comment. A backslash does
not continue a comment. A backslash does not continue a token except
for string literals (i.e., tokens other than string literals cannot be
split across physical lines using a backslash). A backslash is
illegal elsewhere on a line outside a string literal.

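The effect of explicit line joining can be observed with a short
experiment. The following is a minimal sketch, assuming a modern
Python 3 interpreter and its standard \code{tokenize} module, whose
behavior may differ in detail from the lexical analyzer described in
this chapter. The names \code{SOURCE} and \code{count_newline_tokens}
are invented for this illustration.

```python
import io
import tokenize

# Three physical lines joined by two trailing backslashes.
# The variable names inside the source text are illustrative only.
SOURCE = (
    "ok = 1900 < year < 2100 and 1 <= month <= 12 \\\n"
    "    and 1 <= day <= 31 and 0 <= hour < 24 \\\n"
    "    and 0 <= minute < 60\n"
)


def count_newline_tokens(source):
    """Count the NEWLINE tokens, i.e. the ends of logical lines."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type == tokenize.NEWLINE)


physical_lines = SOURCE.count("\n")
logical_lines = count_newline_tokens(SOURCE)
```

Under these assumptions the three physical lines form a single logical
line, so only one NEWLINE token is produced.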
\subsection{Implicit line joining\label{implicit-joining}}

Expressions in parentheses, square brackets or curly braces can be
split over more than one physical line without using backslashes.
For example:

\begin{verbatim}
month_names = ['Januari', 'Februari', 'Maart',      # These are the
               'April',   'Mei',      'Juni',       # Dutch names
               'Juli',    'Augustus', 'September',  # for the months
               'Oktober', 'November', 'December']   # of the year
\end{verbatim}

Implicitly continued lines can carry comments. The indentation of the
continuation lines is not important. Blank continuation lines are
allowed. There is no NEWLINE token between implicit continuation
lines. Implicitly continued lines can also occur within triple-quoted
strings (see below); in that case they cannot carry comments.

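The claims about implicit joining (comments allowed, indentation
irrelevant, blank continuation lines allowed, no NEWLINE token in
between) can be checked with a small sketch. It assumes a modern
Python 3 interpreter and its standard \code{tokenize} module, which
may differ in detail from the implementation described here;
\code{SOURCE} and \code{token_names} are hypothetical names.

```python
import io
import tokenize

# A bracketed expression spread over three physical lines, one of them
# blank, with comments and arbitrary indentation on the continuation.
SOURCE = (
    "month_names = ['Januari', 'Februari',  # Dutch names\n"
    "\n"
    "      'Maart']  # for the first months\n"
)


def token_names(source):
    """Return the symbolic name of every token found in source."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tokenize.tok_name[tok.type] for tok in tokens]


names = token_names(SOURCE)
```

In this sketch the list \code{names} contains exactly one NEWLINE
entry, at the end of the closing physical line, and no INDENT entry
even though the last line starts with spaces.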
\subsection{Blank lines\label{blank-lines}}

A logical line that contains only spaces, tabs, formfeeds and possibly a
comment, is ignored (i.e., no NEWLINE token is generated), except that
during interactive input of statements, an entirely blank logical line
(i.e., one containing not even whitespace or a comment)
terminates a multi-line statement.
\index{blank line}

\subsection{Indentation\label{indentation}}

Leading whitespace (spaces and tabs) at the beginning of a logical
line is used to compute the indentation level of the line, which in
turn is used to determine the grouping of statements.
\index{indentation}
\index{whitespace}
\index{leading whitespace}
\index{space}
\index{tab}
\index{grouping}
\index{statement grouping}

First, tabs are replaced (from left to right) by one to eight spaces
such that the total number of characters up to and including the
replacement is a multiple of
eight (this is intended to be the same rule as used by \UNIX{}). The
total number of spaces preceding the first non-blank character then
determines the line's indentation. Indentation cannot be split over
multiple physical lines using backslashes; the whitespace up to the
first backslash determines the indentation.

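The tab replacement rule can be written down directly. The following
is a minimal sketch of that rule only, not the implementation's code;
the function name \code{indentation_width} and its \code{tabsize}
parameter are invented for this illustration, and formfeed handling
is deliberately left out.

```python
def indentation_width(line, tabsize=8):
    """Return the indentation of line after expanding leading tabs."""
    column = 0
    for ch in line:
        if ch == " ":
            column += 1
        elif ch == "\t":
            # A tab stands for one to eight spaces: advance to the
            # next multiple of tabsize.
            column = (column // tabsize + 1) * tabsize
        else:
            # First non-blank character: the indentation is known.
            break
    return column


examples = {
    "tab": indentation_width("\tx"),
    "four spaces then tab": indentation_width("    \tx"),
    "tab then two spaces": indentation_width("\t  x"),
}
```

Note that four spaces followed by a tab give the same indentation as
a single tab, which is one reason why mixing the two is confusing.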
\strong{Cross-platform compatibility note:} because of the nature of
text editors on non-\UNIX{} platforms, it is unwise to use a mixture of
spaces and tabs for the indentation in a single source file.

A formfeed character may be present at the start of the line; it will
be ignored for the indentation calculations above. Formfeed
characters occurring elsewhere in the leading whitespace have an
undefined effect (for instance, they may reset the space count to
zero).

The indentation levels of consecutive lines are used to generate
INDENT and DEDENT tokens, using a stack, as follows.
\index{INDENT token}
\index{DEDENT token}

Before the first line of the file is read, a single zero is pushed on
the stack; this will never be popped off again. The numbers pushed on
the stack will always be strictly increasing from bottom to top. At
the beginning of each logical line, the line's indentation level is
compared to the top of the stack. If it is equal, nothing happens.
If it is larger, it is pushed on the stack, and one INDENT token is
generated. If it is smaller, it \emph{must} be one of the numbers
occurring on the stack; all numbers on the stack that are larger are
popped off, and for each number popped off a DEDENT token is
generated. At the end of the file, a DEDENT token is generated for
each number remaining on the stack that is larger than zero.

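The stack algorithm described above can be sketched in a few lines.
This is an illustration of the stated rules, not the actual lexical
analyzer; the function \code{indent_tokens}, its input format (one
indentation level per logical line) and the placeholder token
\code{LINE} are assumptions made for the example.

```python
def indent_tokens(levels):
    """Turn per-line indentation levels into INDENT/DEDENT token names."""
    stack = [0]                      # the initial zero is never popped
    tokens = []
    for level in levels:
        if level > stack[-1]:
            stack.append(level)
            tokens.append("INDENT")
        elif level < stack[-1]:
            # A smaller level must already occur on the stack.
            if level not in stack:
                raise IndentationError(
                    "inconsistent dedent to column %d" % level)
            while stack[-1] > level:
                stack.pop()
                tokens.append("DEDENT")
        tokens.append("LINE")        # stands for the line's own tokens
    # End of file: one DEDENT for each remaining level above zero.
    while stack[-1] > 0:
        stack.pop()
        tokens.append("DEDENT")
    return tokens


sample = indent_tokens([0, 4, 8, 8, 4, 0])
```

Because the stack is strictly increasing, every INDENT is eventually
matched by exactly one DEDENT, either at a later line or at the end
of the file.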
Here is an example of a correctly (though confusingly) indented piece
of Python code:

\begin{verbatim}
def perm(l):
        # Compute the list of all permutations of l
    if len(l) <= 1:
                  return [l]
    r = []
    for i in range(len(l)):
             s = l[:i] + l[i+1:]
             p = perm(s)
             for x in p:
              r.append(l[i:i+1] + x)
    return r
\end{verbatim}

The following example shows various indentation errors:

\begin{verbatim}
 def perm(l):                       # error: first line indented
for i in range(len(l)):             # error: not indented
    s = l[:i] + l[i+1:]
        p = perm(l[:i] + l[i+1:])   # error: unexpected indent
        for x in p:
                r.append(l[i:i+1] + x)
            return r                # error: inconsistent dedent
\end{verbatim}

(Actually, the first three errors are detected by the parser; only the
last error is found by the lexical analyzer, because the indentation of
\code{return r} does not match a level popped off the stack.)

\subsection{Whitespace between tokens\label{whitespace}}

Except at the beginning of a logical line or in string literals, the
whitespace characters space, tab and formfeed can be used
interchangeably to separate tokens. Whitespace is needed between two
tokens only if their concatenation could otherwise be interpreted as a
different token (e.g., \code{ab} is one token, but \code{a b} is two
tokens).

\section{Other tokens\label{other-tokens}}

Besides NEWLINE, INDENT and DEDENT, the following categories of tokens
exist: \emph{identifiers}, \emph{keywords}, \emph{literals},
\emph{operators}, and \emph{delimiters}.
Whitespace characters (other than line terminators, discussed earlier)
are not tokens, but serve to delimit tokens.
Where ambiguity exists, a token comprises the longest possible string
that forms a legal token, when read from left to right.

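The longest-match rule can be illustrated on operator characters
alone. The following is a minimal sketch, assuming the operator
spellings listed later in this chapter; the names \code{OPERATORS}
and \code{split_operators} are invented for the example and the code
is not the implementation's scanner.

```python
# Operator spellings as listed in this chapter.
OPERATORS = ["+", "-", "*", "**", "/", "%",
             "<<", ">>", "&", "|", "^", "~",
             "<", ">", "<=", ">=", "==", "!=", "<>"]


def split_operators(text):
    """Split a run of operator characters using the longest-match rule."""
    # Try longer spellings first so that "**" wins over "*".
    candidates = sorted(OPERATORS, key=len, reverse=True)
    result = []
    pos = 0
    while pos < len(text):
        for op in candidates:
            if text.startswith(op, pos):
                result.append(op)
                pos += len(op)
                break
        else:
            raise ValueError("no operator at position %d" % pos)
    return result


pieces = split_operators("***")
```

Reading from left to right, three asterisks are split into a power
operator followed by a multiplication operator, never the other way
around.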
\section{Identifiers and keywords\label{identifiers}}

Identifiers (also referred to as \emph{names}) are described by the following
lexical definitions:
\index{identifier}
\index{name}

\begin{verbatim}
identifier:     (letter|"_") (letter|digit|"_")*
letter:         lowercase | uppercase
lowercase:      "a"..."z"
uppercase:      "A"..."Z"
digit:          "0"..."9"
\end{verbatim}

Identifiers are unlimited in length. Case is significant.

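The lexical definition of identifiers translates directly into a
regular expression. The sketch below uses the standard \code{re}
module; \code{IDENTIFIER} and \code{is_identifier} are hypothetical
names, and the check covers only the grammar above (keywords match
it too and have to be excluded separately).

```python
import re

# identifier: (letter|"_") (letter|digit|"_")*
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def is_identifier(text):
    """Return True if text matches the identifier grammar."""
    return IDENTIFIER.match(text) is not None


checks = {
    "_private": is_identifier("_private"),
    "x1": is_identifier("x1"),
    "1x": is_identifier("1x"),      # may not start with a digit
    "a-b": is_identifier("a-b"),    # "-" is an operator, not a letter
}
```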
\subsection{Keywords\label{keywords}}

The following identifiers are used as reserved words, or
\emph{keywords} of the language, and cannot be used as ordinary
identifiers. They must be spelled exactly as written here:%
\index{keyword}%
\index{reserved word}

\begin{verbatim}
and       del       for       is        raise
assert    elif      from      lambda    return
break     else      global    not       try
class     except    if        or        while
continue  exec      import    pass
def       finally   in        print
\end{verbatim}

% When adding keywords, use reswords.py for reformatting

\subsection{Reserved classes of identifiers\label{id-classes}}

Certain classes of identifiers (besides keywords) have special
meanings. These are:

\begin{tableii}{l|l}{code}{Form}{Meaning}
\lineii{_*}{Not imported by \samp{from \var{module} import *}}
\lineii{__*__}{System-defined name}
\lineii{__*}{Class-private name mangling}
\end{tableii}

(XXX need section references here.)

\section{Literals\label{literals}}

Literals are notations for constant values of some built-in types.
\index{literal}
\index{constant}

\subsection{String literals\label{strings}}

String literals are described by the following lexical definitions:
\index{string literal}

\begin{verbatim}
stringliteral:   shortstring | longstring
shortstring:     "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring:      "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem: shortstringchar | escapeseq
longstringitem:  longstringchar | escapeseq
shortstringchar: <any ASCII character except "\" or newline or the quote>
longstringchar:  <any ASCII character except "\">
escapeseq:       "\" <any ASCII character>
\end{verbatim}
\index{ASCII@\ASCII{}}

In plain English: String literals can be enclosed in matching single
quotes (\code{'}) or double quotes (\code{"}). They can also be
enclosed in matching groups of three single or double quotes (these
are generally referred to as \emph{triple-quoted strings}). The
backslash (\code{\e}) character is used to escape characters that
otherwise have a special meaning, such as newline, backslash itself,
or the quote character. String literals may optionally be prefixed
with a letter `r' or `R'; such strings are called raw strings and use
different rules for backslash escape sequences.
\index{triple-quoted string}
\index{raw string}

In triple-quoted strings,
unescaped newlines and quotes are allowed (and are retained), except
that three unescaped quotes in a row terminate the string. (A
``quote'' is the character used to open the string, i.e., either
\code{'} or \code{"}.)

Unless an `r' or `R' prefix is present, escape sequences in strings
are interpreted according to rules similar
to those used by Standard \C{}. The recognized escape sequences are:
\index{physical line}
\index{escape sequence}
\index{Standard C}
\index{C}

\begin{tableii}{l|l}{code}{Escape Sequence}{Meaning}
\lineii{\e\var{newline}} {Ignored}
\lineii{\e\e} {Backslash (\code{\e})}
\lineii{\e'} {Single quote (\code{'})}
\lineii{\e"} {Double quote (\code{"})}
\lineii{\e a} {\ASCII{} Bell (BEL)}
\lineii{\e b} {\ASCII{} Backspace (BS)}
\lineii{\e f} {\ASCII{} Formfeed (FF)}
\lineii{\e n} {\ASCII{} Linefeed (LF)}
\lineii{\e r} {\ASCII{} Carriage Return (CR)}
\lineii{\e t} {\ASCII{} Horizontal Tab (TAB)}
\lineii{\e v} {\ASCII{} Vertical Tab (VT)}
\lineii{\e\var{ooo}} {\ASCII{} character with octal value \emph{ooo}}
\lineii{\e x\var{hh...}} {\ASCII{} character with hex value \emph{hh...}}
\end{tableii}
\index{ASCII@\ASCII{}}

In strict compatibility with Standard \C{}, up to three octal digits are
accepted, but an unlimited number of hex digits is taken to be part of
the hex escape (and then the lower 8 bits of the resulting hex number
are used in 8-bit implementations).

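The treatment of numeric escapes can be sketched as follows. This is
an illustration of the rule as stated above, not the implementation's
code, and it does not describe modern Python versions, which treat
hex escapes differently. The helper \code{decode_numeric_escape} is
hypothetical, and reducing octal values to 8 bits is an assumption:
the text above states the 8-bit reduction only for hex escapes.

```python
import string


def decode_numeric_escape(text):
    """Decode an octal or hex escape body (the text after the backslash).

    Returns (code, consumed): the resulting 8-bit character code and
    the number of characters of text that belong to the escape.
    """
    if text[:1] == "x":
        # Hex escape: take as many hex digits as are present.
        end = 1
        while end < len(text) and text[end] in string.hexdigits:
            end += 1
        if end == 1:
            raise ValueError("hex escape needs at least one digit")
        return int(text[1:end], 16) & 0xFF, end   # keep the lower 8 bits
    # Octal escape: take at most three octal digits.
    end = 0
    while end < min(3, len(text)) and text[end] in string.octdigits:
        end += 1
    if end == 0:
        raise ValueError("not a numeric escape")
    # Assumption: octal values are reduced to 8 bits as well.
    return int(text[:end], 8) & 0xFF, end


hex_result = decode_numeric_escape("x141")    # three hex digits consumed
octal_result = decode_numeric_escape("1011")  # only "101" is consumed
```

Both calls yield character code 65 (the letter A): the hex value 141
is cut down to its lower 8 bits, and the octal escape stops after
three digits.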
Unlike Standard \C{},
all unrecognized escape sequences are left in the string unchanged,
i.e., \emph{the backslash is left in the string.} (This behavior is
useful when debugging: if an escape sequence is mistyped, the
resulting output is more easily recognized as broken.)
\index{unrecognized escape sequence}

When an `r' or `R' prefix is present, backslashes are still used to
quote the following character, but \emph{all backslashes are left in
the string}. For example, the string literal \code{r"\e n"} consists
of two characters: a backslash and a lowercase `n'. String quotes can
be escaped with a backslash, but the backslash remains in the string;
for example, \code{r"\e""} is a valid string literal consisting of two
characters: a backslash and a double quote; \code{r"\e"} is not a valid
string literal (even a raw string cannot end in an odd number of
backslashes). Specifically, \emph{a raw string cannot end in a single
backslash} (since the backslash would escape the following quote
character).

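The difference between ordinary and raw string literals is easy to
verify by counting characters. The variable names in this small
sketch are chosen for illustration only.

```python
plain = "\n"       # one character: a linefeed
raw = r"\n"        # two characters: a backslash and a lowercase n
quoted = r"\""     # two characters: a backslash and a double quote

lengths = (len(plain), len(raw), len(quoted))
```

Here \code{lengths} is \code{(1, 2, 2)}: the raw literals keep their
backslashes, including the one that escapes the quote.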
\subsection{String literal concatenation\label{string-catenation}}

Multiple adjacent string literals (delimited by whitespace), possibly
using different quoting conventions, are allowed, and their meaning is
the same as their concatenation. Thus, \code{"hello" 'world'} is
equivalent to \code{"helloworld"}. This feature can be used to reduce
the number of backslashes needed, to split long strings conveniently
across long lines, or even to add comments to parts of strings, for
example:

\begin{verbatim}
re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )
\end{verbatim}

Note that this feature is defined at the syntactical level, but
implemented at compile time. The `+' operator must be used to
concatenate string expressions at run time. Also note that literal
concatenation can use different quoting styles for each component
(even mixing raw strings and triple quoted strings).

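The distinction between literal concatenation and run-time
concatenation can be shown in a few lines. This is a small sketch
with variable names chosen for illustration.

```python
# Adjacent literals with different quoting styles are joined by the
# compiler; no operator is involved.
greeting = "hello" 'world'

# Mixing an ordinary and a raw literal, with comments on each part.
pattern = ("[A-Za-z_]"          # letter or underscore
           r"[A-Za-z0-9_]*")    # letter, digit or underscore

# String expressions (here: variables) need the + operator.
first = "hello"
second = "world"
joined_at_run_time = first + second
```

Writing \code{first second} without an operator would be a syntax
error, because only literals are joined implicitly.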
\subsection{Numeric literals\label{numbers}}

There are four types of numeric literals: plain integers, long
integers, floating point numbers, and imaginary numbers. There are no
complex literals (complex numbers can be formed by adding a real
number and an imaginary number).
\index{number}
\index{numeric literal}
\index{integer literal}
\index{plain integer literal}
\index{long integer literal}
\index{floating point literal}
\index{hexadecimal literal}
\index{octal literal}
\index{decimal literal}
\index{imaginary literal}
\index{complex literal}

Note that numeric literals do not include a sign; a phrase like
\code{-1} is actually an expression composed of the unary operator
`\code{-}' and the literal \code{1}.

\subsection{Integer and long integer literals\label{integers}}

Integer and long integer literals are described by the following
lexical definitions:

\begin{verbatim}
longinteger:    integer ("l"|"L")
integer:        decimalinteger | octinteger | hexinteger
decimalinteger: nonzerodigit digit* | "0"
octinteger:     "0" octdigit+
hexinteger:     "0" ("x"|"X") hexdigit+
nonzerodigit:   "1"..."9"
octdigit:       "0"..."7"
hexdigit:       digit|"a"..."f"|"A"..."F"
\end{verbatim}

Although both lower case `l' and upper case `L' are allowed as a suffix
for long integers, it is strongly recommended to always use `L', since
the letter `l' looks too much like the digit `1'.

Plain integer decimal literals must be at most 2147483647 (i.e., the
largest positive integer, using 32-bit arithmetic). Plain octal and
hexadecimal literals may be as large as 4294967295, but values larger
than 2147483647 are converted to a negative value by subtracting
4294967296. There is no limit for long integer literals apart from
what can be stored in available memory.

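The range rules for plain integer literals amount to a little
arithmetic. The sketch below follows the limits stated above; it is
not the implementation's code and does not reflect modern Python
versions, whose integers are unbounded. The names
\code{plain_integer_value}, \code{MAX_PLAIN} and \code{WRAP} are
invented for the illustration, and raising \code{OverflowError} for
out-of-range literals is an assumption.

```python
MAX_PLAIN = 2147483647       # largest plain integer, 2**31 - 1
WRAP = 4294967296            # 2**32


def plain_integer_value(literal):
    """Value of a plain (non-long) integer literal under 32-bit rules."""
    if literal[:2].lower() == "0x":
        value, is_decimal = int(literal[2:], 16), False
    elif literal.startswith("0") and len(literal) > 1:
        value, is_decimal = int(literal[1:], 8), False
    else:
        value, is_decimal = int(literal, 10), True
    # Decimal literals stop at 2147483647; octal and hex at 4294967295.
    limit = MAX_PLAIN if is_decimal else WRAP - 1
    if value > limit:
        raise OverflowError("plain integer literal too large: " + literal)
    if value > MAX_PLAIN:
        value -= WRAP        # octal and hex literals wrap to negative
    return value


wrapped = plain_integer_value("0x80000000")
```

For instance, \code{0x80000000} is 2147483648, which exceeds
2147483647, so 4294967296 is subtracted and the result is
-2147483648.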
Some examples of plain and long integer literals:

\begin{verbatim}
7     2147483647                        0177    0x80000000
3L    79228162514264337593543950336L    0377L   0x100000000L
\end{verbatim}

\subsection{Floating point literals\label{floating}}

Floating point literals are described by the following lexical
definitions:

\begin{verbatim}
floatnumber:    pointfloat | exponentfloat
pointfloat:     [intpart] fraction | intpart "."
exponentfloat:  (nonzerodigit digit* | pointfloat) exponent
intpart:        nonzerodigit digit* | "0"
fraction:       "." digit+
exponent:       ("e"|"E") ["+"|"-"] digit+
\end{verbatim}

Note that the integer part of a floating point number cannot look like
an octal integer.
The allowed range of floating point literals is
implementation-dependent.
Some examples of floating point literals:

\begin{verbatim}
3.14    10.    .001    1e100    3.14e-10
\end{verbatim}

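The floating point grammar can be transcribed rule by rule into a
regular expression. The following sketch uses the standard \code{re}
module and encodes only the lexical definitions given in this
section, which modern Python versions have since relaxed; the pattern
names and the helper \code{is_float} are invented for the
illustration.

```python
import re

INTPART = r"(?:[1-9][0-9]*|0)"             # nonzerodigit digit* | "0"
FRACTION = r"(?:\.[0-9]+)"                 # "." digit+
EXPONENT = r"(?:[eE][+-]?[0-9]+)"          # ("e"|"E") ["+"|"-"] digit+
# pointfloat: [intpart] fraction | intpart "."
POINTFLOAT = r"(?:%s?%s|%s\.)" % (INTPART, FRACTION, INTPART)
# exponentfloat: (nonzerodigit digit* | pointfloat) exponent
EXPONENTFLOAT = r"(?:(?:[1-9][0-9]*|%s)%s)" % (POINTFLOAT, EXPONENT)
FLOATNUMBER = re.compile(r"(?:%s|%s)\Z" % (EXPONENTFLOAT, POINTFLOAT))


def is_float(text):
    """Return True if text matches the floatnumber grammar."""
    return FLOATNUMBER.match(text) is not None


accepted = [s for s in ["3.14", "10.", ".001", "1e100", "3.14e-10"]
            if is_float(s)]
rejected = [s for s in ["077.5", "1", "e5"] if not is_float(s)]
```

The string \code{077.5} is rejected because its integer part would
look like an octal integer, and \code{1} is rejected because it is a
plain integer rather than a floating point literal.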
\subsection{Imaginary literals\label{imaginary}}

Imaginary literals are described by the following lexical definitions:

\begin{verbatim}
imagnumber: (floatnumber | intpart) ("j"|"J")
\end{verbatim}

An imaginary literal yields a complex number with a real part of
0.0. Complex numbers are represented as a pair of floating point
numbers and have the same restrictions on their range. To create a
complex number with a nonzero real part, add a floating point number
to it, e.g., \code{(3+4j)}. Some examples of imaginary literals:

\begin{verbatim}
3.14j   10.j    10j     .001j   1e100j  3.14e-10j
\end{verbatim}


\section{Operators\label{operators}}

The following tokens are operators:
\index{operators}

\begin{verbatim}
+       -       *       **      /       %
<<      >>      &       |       ^       ~
<       >       <=      >=      ==      !=      <>
\end{verbatim}

The comparison operators \code{<>} and \code{!=} are alternate
spellings of the same operator. \code{!=} is the preferred spelling;
\code{<>} is obsolescent.

\section{Delimiters\label{delimiters}}

The following tokens serve as delimiters in the grammar:
\index{delimiters}

\begin{verbatim}
(       )       [       ]       {       }
,       :       .       `       =       ;
\end{verbatim}

The period can also occur in floating-point and imaginary literals. A
sequence of three periods has a special meaning as an ellipsis in
slices.

The following printing \ASCII{} characters have special meaning as part
of other tokens or are otherwise significant to the lexical analyzer:

\begin{verbatim}
'       "       #       \
\end{verbatim}

The following printing \ASCII{} characters are not used in Python. Their
occurrence outside string literals and comments is an unconditional
error:
\index{ASCII@\ASCII{}}

\begin{verbatim}
@       $       ?
\end{verbatim}