\chapter{Lexical analysis\label{lexical}}

A Python program is read by a \emph{parser}. Input to the parser is a
stream of \emph{tokens}, generated by the \emph{lexical analyzer}. This
chapter describes how the lexical analyzer breaks a file into tokens.
\index{lexical analysis}
\index{parser}
\index{token}

Python uses the 7-bit \ASCII{} character set for program text and string
literals. 8-bit characters may be used in string literals and comments
but their interpretation is platform dependent; the proper way to
insert 8-bit characters in string literals is by using octal or
hexadecimal escape sequences.

The run-time character set depends on the I/O devices connected to the
program but is generally a superset of \ASCII{}.

\strong{Future compatibility note:} It may be tempting to assume that the
character set for 8-bit characters is ISO Latin-1 (an \ASCII{}
superset that covers most western languages that use the Latin
alphabet), but it is possible that in the future Unicode text editors
will become common. These generally use the UTF-8 encoding, which is
also an \ASCII{} superset, but with very different use for the
characters with ordinals 128-255. While there is no consensus on this
subject yet, it is unwise to assume either Latin-1 or UTF-8, even
though the current implementation appears to favor Latin-1. This
applies both to the source character set and the run-time character
set.

\section{Line structure\label{line-structure}}

A Python program is divided into a number of \emph{logical lines}.
\index{line structure}

\subsection{Logical lines\label{logical}}

The end of a logical line is represented by the token NEWLINE.
Statements cannot cross logical line boundaries except where NEWLINE
is allowed by the syntax (e.g., between statements in compound
statements). A logical line is constructed from one or more
\emph{physical lines} by following the explicit or implicit
\emph{line joining} rules.
\index{logical line}
\index{physical line}
\index{line joining}
\index{NEWLINE token}

\subsection{Physical lines\label{physical}}

A physical line ends in whatever the current platform's convention is
for terminating lines. On \UNIX{}, this is the \ASCII{} LF (linefeed)
character. On DOS/Windows, it is the \ASCII{} sequence CR LF (return
followed by linefeed). On Macintosh, it is the \ASCII{} CR (return)
character.

\subsection{Comments\label{comments}}

A comment starts with a hash character (\code{\#}) that is not part of
a string literal, and ends at the end of the physical line. A comment
signifies the end of the logical line unless the implicit line joining
rules are invoked.
Comments are ignored by the syntax; they are not tokens.
\index{comment}
\index{hash character}

\subsection{Explicit line joining\label{explicit-joining}}

Two or more physical lines may be joined into logical lines using
backslash characters (\code{\e}), as follows: when a physical line ends
in a backslash that is not part of a string literal or comment, it is
joined with the following line, forming a single logical line; the
backslash and the following end-of-line character are deleted. For
example:
\index{physical line}
\index{line joining}
\index{line continuation}
\index{backslash character}
%
\begin{verbatim}
if 1900 < year < 2100 and 1 <= month <= 12 \
   and 1 <= day <= 31 and 0 <= hour < 24 \
   and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
        return 1
\end{verbatim}

A line ending in a backslash cannot carry a comment. A backslash does
not continue a comment. A backslash does not continue a token except
for string literals (i.e., tokens other than string literals cannot be
split across physical lines using a backslash). A backslash is
illegal elsewhere on a line outside a string literal.


\subsection{Implicit line joining\label{implicit-joining}}

Expressions in parentheses, square brackets or curly braces can be
split over more than one physical line without using backslashes.
For example:

\begin{verbatim}
month_names = ['Januari', 'Februari', 'Maart',      # These are the
               'April',   'Mei',      'Juni',       # Dutch names
               'Juli',    'Augustus', 'September',  # for the months
               'Oktober', 'November', 'December']   # of the year
\end{verbatim}

Implicitly continued lines can carry comments. The indentation of the
continuation lines is not important. Blank continuation lines are
allowed. There is no NEWLINE token between implicit continuation
lines. Implicitly continued lines can also occur within triple-quoted
strings (see below); in that case they cannot carry comments.
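
As an illustration only, the effect of both joining rules can be
observed with the \module{tokenize} module of a current Python
release. This is a sketch under assumptions: \module{tokenize} is not
described in this chapter, and its \code{NL} token type (reported for
a line break that does not end a logical line) is a convention of that
module rather than a token of the grammar. The helper name
\code{count_line_tokens} is hypothetical.

\begin{verbatim}
import io
import tokenize

def count_line_tokens(source):
    """Return (NEWLINE count, NL count) for a piece of source text."""
    newline = nl = 0
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NEWLINE:
            newline += 1      # end of a logical line
        elif tok.type == tokenize.NL:
            nl += 1           # line break inside a logical line

    return newline, nl

# Two physical lines joined explicitly with a backslash.
explicit = "total = 1 + \\\n        2\n"
# Two physical lines joined implicitly inside square brackets.
implicit = "total = [1,   # first\n         2]   # second\n"

explicit_counts = count_line_tokens(explicit)
implicit_counts = count_line_tokens(implicit)
\end{verbatim}

In both cases only one NEWLINE token is counted, which matches the
rule that joined physical lines form a single logical line.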

\subsection{Blank lines \index{blank line}\label{blank-lines}}

A logical line that contains only spaces, tabs, formfeeds and possibly
a comment is ignored (i.e., no NEWLINE token is generated). During
interactive input of statements, handling of a blank line may differ
depending on the implementation of the read-eval-print loop. In the
standard implementation, an entirely blank logical line (i.e.\ one
containing not even whitespace or a comment) terminates a multi-line
statement.


\subsection{Indentation\label{indentation}}

Leading whitespace (spaces and tabs) at the beginning of a logical
line is used to compute the indentation level of the line, which in
turn is used to determine the grouping of statements.
\index{indentation}
\index{whitespace}
\index{leading whitespace}
\index{space}
\index{tab}
\index{grouping}
\index{statement grouping}

First, tabs are replaced (from left to right) by one to eight spaces
such that the total number of characters up to and including the
replacement is a multiple of
eight (this is intended to be the same rule as used by \UNIX{}). The
total number of spaces preceding the first non-blank character then
determines the line's indentation. Indentation cannot be split over
multiple physical lines using backslashes; the whitespace up to the
first backslash determines the indentation.
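
The tab replacement rule can be sketched as a small function. This is
a minimal illustration, not the implementation's code; the name
\code{indentation_width} and the \code{tabsize} parameter are
hypothetical, and formfeed handling is deliberately left out.

\begin{verbatim}
def indentation_width(line, tabsize=8):
    """Return the indentation of a line after replacing leading tabs."""
    column = 0
    for ch in line:
        if ch == ' ':
            column += 1
        elif ch == '\t':
            # A tab stands for one to eight spaces, enough to reach
            # the next multiple of eight.
            column = (column // tabsize + 1) * tabsize
        else:
            break             # first non-blank character reached
    return column

widths = [indentation_width(s)
          for s in ("x", "    x", "\tx", "  \tx", "\t  x", "        \tx")]
\end{verbatim}

A tab that follows two spaces therefore contributes six spaces, while
a tab that follows exactly eight spaces contributes a full eight.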

\strong{Cross-platform compatibility note:} because of the nature of
text editors on non-\UNIX{} platforms, it is unwise to use a mixture of
spaces and tabs for the indentation in a single source file.

A formfeed character may be present at the start of the line; it will
be ignored for the indentation calculations above. Formfeed
characters occurring elsewhere in the leading whitespace have an
undefined effect (for instance, they may reset the space count to
zero).

The indentation levels of consecutive lines are used to generate
INDENT and DEDENT tokens, using a stack, as follows.
\index{INDENT token}
\index{DEDENT token}

Before the first line of the file is read, a single zero is pushed on
the stack; this will never be popped off again. The numbers pushed on
the stack will always be strictly increasing from bottom to top. At
the beginning of each logical line, the line's indentation level is
compared to the top of the stack. If it is equal, nothing happens.
If it is larger, it is pushed on the stack, and one INDENT token is
generated. If it is smaller, it \emph{must} be one of the numbers
occurring on the stack; all numbers on the stack that are larger are
popped off, and for each number popped off a DEDENT token is
generated. At the end of the file, a DEDENT token is generated for
each number remaining on the stack that is larger than zero.

Here is an example of a correctly (though confusingly) indented piece
of Python code:

\begin{verbatim}
def perm(l):
        # Compute the list of all permutations of l
    if len(l) <= 1:
                  return [l]
    r = []
    for i in range(len(l)):
             s = l[:i] + l[i+1:]
             p = perm(s)
             for x in p:
              r.append(l[i:i+1] + x)
    return r
\end{verbatim}

The following example shows various indentation errors:

\begin{verbatim}
 def perm(l):                       # error: first line indented
for i in range(len(l)):             # error: not indented
    s = l[:i] + l[i+1:]
        p = perm(l[:i] + l[i+1:])   # error: unexpected indent
        for x in p:
                r.append(l[i:i+1] + x)
            return r                # error: inconsistent dedent
\end{verbatim}

(Actually, the first three errors are detected by the parser; only the
last error is found by the lexical analyzer: the indentation of
\code{return r} does not match a level popped off the stack.)
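
The stack algorithm described in this section can be sketched as
follows. This is an illustration under assumptions, not the lexical
analyzer's actual code: the function name \code{indent_tokens} is
hypothetical, the input is a plain list of indentation levels (one per
logical line), and the sample levels are made-up column numbers.

\begin{verbatim}
def indent_tokens(levels):
    """Return the INDENT/DEDENT tokens for a list of indentation levels."""
    stack = [0]               # the initial zero is never popped
    tokens = []
    for level in levels:
        if level > stack[-1]:
            stack.append(level)
            tokens.append('INDENT')
        elif level < stack[-1]:
            if level not in stack:
                # The lexical analyzer reports an inconsistent dedent.
                raise ValueError('inconsistent dedent to %d' % level)
            while stack[-1] > level:
                stack.pop()
                tokens.append('DEDENT')
        # Equal levels generate nothing.
    # End of file: one DEDENT for each level still above zero.
    while stack[-1] > 0:
        stack.pop()
        tokens.append('DEDENT')
    return tokens

# Irregular but consistent levels are accepted.
good = indent_tokens([0, 4, 18, 4, 4, 13, 13, 13, 14, 4])

# Dedenting to a level that was never pushed is an error.
try:
    indent_tokens([0, 4, 8, 6])
    bad_accepted = True
except ValueError:
    bad_accepted = False
\end{verbatim}

Every INDENT is eventually balanced by a DEDENT, so the first list
yields four tokens of each kind.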

\subsection{Whitespace between tokens\label{whitespace}}

Except at the beginning of a logical line or in string literals, the
whitespace characters space, tab and formfeed can be used
interchangeably to separate tokens. Whitespace is needed between two
tokens only if their concatenation could otherwise be interpreted as a
different token (e.g., \code{ab} is one token, but \code{a b} is two
tokens).

\section{Other tokens\label{other-tokens}}

Besides NEWLINE, INDENT and DEDENT, the following categories of tokens
exist: \emph{identifiers}, \emph{keywords}, \emph{literals},
\emph{operators}, and \emph{delimiters}.
Whitespace characters (other than line terminators, discussed earlier)
are not tokens, but serve to delimit tokens.
Where
ambiguity exists, a token comprises the longest possible string that
forms a legal token, when read from left to right.
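
The longest-match rule can be illustrated for operator tokens with a
short sketch. The names \code{OPERATORS} and \code{split_operators}
are hypothetical, only operators are handled, and the approach (trying
longer candidates first) is one simple way to obtain the longest
match, not necessarily the one used by the implementation.

\begin{verbatim}
OPERATORS = ['<<', '>>', '<=', '>=', '==', '!=', '<>', '**',
             '+', '-', '*', '/', '%', '&', '|', '^', '~', '<', '>']

def split_operators(text):
    """Split a run of operator characters, longest token first."""
    # Longer candidates are tried first, so '<<' wins over '<' '<'.
    candidates = sorted(OPERATORS, key=len, reverse=True)
    tokens = []
    pos = 0
    while pos < len(text):
        for op in candidates:
            if text.startswith(op, pos):
                tokens.append(op)
                pos += len(op)
                break
        else:
            raise ValueError('no operator at position %d' % pos)
    return tokens

shift = split_operators('<<')
power = split_operators('***')
compare = split_operators('<=>')
\end{verbatim}

Reading from left to right, \code{***} is split into \code{**}
followed by \code{*}, never into three separate \code{*} tokens.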

\section{Identifiers and keywords\label{identifiers}}

Identifiers (also referred to as \emph{names}) are described by the following
lexical definitions:
\index{identifier}
\index{name}

\begin{verbatim}
identifier:     (letter|"_") (letter|digit|"_")*
letter:         lowercase | uppercase
lowercase:      "a"..."z"
uppercase:      "A"..."Z"
digit:          "0"..."9"
\end{verbatim}

Identifiers are unlimited in length. Case is significant.
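
The lexical definition of an identifier translates directly into a
regular expression. The following is a minimal sketch; the names
\code{IDENTIFIER} and \code{is_identifier} are hypothetical, and the
check covers only the form of the name (it does not reject keywords).

\begin{verbatim}
import re

# (letter|"_") (letter|digit|"_")*
IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')

def is_identifier(text):
    """Return true if text has the form of an identifier."""
    return IDENTIFIER.fullmatch(text) is not None

accepted = [s for s in ('spam', 'Spam', '_private', '__init__', 'x1')
            if is_identifier(s)]
rejected = [s for s in ('1x', 'a-b', 'a b', '')
            if not is_identifier(s)]
\end{verbatim}

Because case is significant, \code{spam} and \code{Spam} are both
accepted and denote different names.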

\subsection{Keywords\label{keywords}}

The following identifiers are used as reserved words, or
\emph{keywords} of the language, and cannot be used as ordinary
identifiers. They must be spelled exactly as written here:%
\index{keyword}%
\index{reserved word}

\begin{verbatim}
and       del       for       is        raise
assert    elif      from      lambda    return
break     else      global    not       try
class     except    if        or        while
continue  exec      import    pass
def       finally   in        print
\end{verbatim}

% When adding keywords, use reswords.py for reformatting

\subsection{Reserved classes of identifiers\label{id-classes}}

Certain classes of identifiers (besides keywords) have special
meanings. These are:

\begin{tableiii}{l|l|l}{code}{Form}{Meaning}{Notes}
\lineiii{_*}{Not imported by \samp{from \var{module} import *}}{(1)}
\lineiii{__*__}{System-defined name}{}
\lineiii{__*}{Class-private name mangling}{}
\end{tableiii}

(XXX need section references here.)

Note:

\begin{description}
\item[(1)] The special identifier \samp{_} is used in the interactive
interpreter to store the result of the last evaluation; it is stored
in the \module{__builtin__} module. When not in interactive mode,
\samp{_} has no special meaning and is not defined.
\end{description}


\section{Literals\label{literals}}

Literals are notations for constant values of some built-in types.
\index{literal}
\index{constant}

\subsection{String literals\label{strings}}

String literals are described by the following lexical definitions:
\index{string literal}

\begin{verbatim}
stringliteral:   shortstring | longstring
shortstring:     "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring:      "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem: shortstringchar | escapeseq
longstringitem:  longstringchar | escapeseq
shortstringchar: <any ASCII character except "\" or newline or the quote>
longstringchar:  <any ASCII character except "\">
escapeseq:       "\" <any ASCII character>
\end{verbatim}
\index{ASCII@\ASCII{}}

In plain English: String literals can be enclosed in matching single
quotes (\code{'}) or double quotes (\code{"}). They can also be
enclosed in matching groups of three single or double quotes (these
are generally referred to as \emph{triple-quoted strings}). The
backslash (\code{\e}) character is used to escape characters that
otherwise have a special meaning, such as newline, backslash itself,
or the quote character. String literals may optionally be prefixed
with a letter `r' or `R'; such strings are called raw strings and use
different rules for backslash escape sequences.
\index{triple-quoted string}
\index{raw string}

In triple-quoted strings,
unescaped newlines and quotes are allowed (and are retained), except
that three unescaped quotes in a row terminate the string. (A
``quote'' is the character used to open the string, i.e.\ either
\code{'} or \code{"}.)

Unless an `r' or `R' prefix is present, escape sequences in strings
are interpreted according to rules similar
to those used by Standard \C{}. The recognized escape sequences are:
\index{physical line}
\index{escape sequence}
\index{Standard C}
\index{C}

\begin{tableii}{l|l}{code}{Escape Sequence}{Meaning}
\lineii{\e\var{newline}} {Ignored}
\lineii{\e\e} {Backslash (\code{\e})}
\lineii{\e'} {Single quote (\code{'})}
\lineii{\e"} {Double quote (\code{"})}
\lineii{\e a} {\ASCII{} Bell (BEL)}
\lineii{\e b} {\ASCII{} Backspace (BS)}
\lineii{\e f} {\ASCII{} Formfeed (FF)}
\lineii{\e n} {\ASCII{} Linefeed (LF)}
\lineii{\e r} {\ASCII{} Carriage Return (CR)}
\lineii{\e t} {\ASCII{} Horizontal Tab (TAB)}
\lineii{\e v} {\ASCII{} Vertical Tab (VT)}
\lineii{\e\var{ooo}} {\ASCII{} character with octal value \emph{ooo}}
\lineii{\e x\var{hh...}} {\ASCII{} character with hex value \emph{hh...}}
\end{tableii}
\index{ASCII@\ASCII{}}

In strict compatibility with Standard \C{}, up to three octal digits are
accepted, but an unlimited number of hex digits is taken to be part of
the hex escape (and then the lower 8 bits of the resulting hex number
are used in 8-bit implementations).

Unlike Standard \C{},
all unrecognized escape sequences are left in the string unchanged,
i.e., \emph{the backslash is left in the string.} (This behavior is
useful when debugging: if an escape sequence is mistyped, the
resulting output is more easily recognized as broken.)
\index{unrecognized escape sequence}

When an `r' or `R' prefix is present, backslashes are still used to
quote the following character, but \emph{all backslashes are left in
the string}. For example, the string literal \code{r"\e n"} consists
of two characters: a backslash and a lowercase `n'. String quotes can
be escaped with a backslash, but the backslash remains in the string;
for example, \code{r"\e""} is a valid string literal consisting of two
characters: a backslash and a double quote; \code{r"\e"} is not a valid
string literal (even a raw string cannot end in an odd number of
backslashes). Specifically, \emph{a raw string cannot end in a single
backslash} (since the backslash would escape the following quote
character). Note also that a single backslash followed by a newline
is interpreted as those two characters as part of the string,
\emph{not} as a line continuation.
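
The raw string rules can be checked with a few literals. This is a
small sketch with hypothetical variable names; the call to the
built-in \function{compile()} is used here only to show that the
literal ending in a single backslash is rejected.

\begin{verbatim}
plain = "\n"          # one character: a linefeed
raw_n = r"\n"         # two characters: a backslash and 'n'
raw_quote = r"\""     # two characters: a backslash and a double quote

lengths = (len(plain), len(raw_n), len(raw_quote))

# The source text  r"\"  is not a complete string literal, because
# the backslash quotes the closing double quote.
try:
    compile('r"\\"', '<example>', 'eval')
    single_backslash_ok = True
except SyntaxError:
    single_backslash_ok = False
\end{verbatim}

The backslash that quotes the double quote in \code{raw_quote} remains
part of the string, which is why its length is two rather than one.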

\subsection{String literal concatenation\label{string-catenation}}

Multiple adjacent string literals (delimited by whitespace), possibly
using different quoting conventions, are allowed, and their meaning is
the same as their concatenation. Thus, \code{"hello" 'world'} is
equivalent to \code{"helloworld"}. This feature can be used to reduce
the number of backslashes needed, to split long strings conveniently
across long lines, or even to add comments to parts of strings, for
example:

\begin{verbatim}
re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )
\end{verbatim}

Note that this feature is defined at the syntactical level, but
implemented at compile time. The `+' operator must be used to
concatenate string expressions at run time. Also note that literal
concatenation can use different quoting styles for each component
(even mixing raw strings and triple-quoted strings).
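
A short sketch of these rules follows; all variable names are
hypothetical. It contrasts literal concatenation, which needs no
operator, with concatenation of string expressions, which needs
\code{+}.

\begin{verbatim}
import re

# Adjacent literals with different quoting conventions are joined.
greeting = "hello" 'world'

# A raw string, a triple-quoted string and a plain string combined.
mixed = r"\d+" '''-''' "x"

# Comments may be attached to each part of a long literal.
pattern = re.compile("[A-Za-z_]"       # letter or underscore
                     "[A-Za-z0-9_]*"   # letter, digit or underscore
                     )

# String expressions (here, a variable) require the + operator.
prefix = "hello"
joined = prefix + "world"
\end{verbatim}

Writing \code{prefix "world"} instead of the last line would be a
syntax error, since \code{prefix} is a name and not a string literal.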

\subsection{Numeric literals\label{numbers}}

There are four types of numeric literals: plain integers, long
integers, floating point numbers, and imaginary numbers. There are no
complex literals (complex numbers can be formed by adding a real
number and an imaginary number).
\index{number}
\index{numeric literal}
\index{integer literal}
\index{plain integer literal}
\index{long integer literal}
\index{floating point literal}
\index{hexadecimal literal}
\index{octal literal}
\index{decimal literal}
\index{imaginary literal}
\index{complex literal}

Note that numeric literals do not include a sign; a phrase like
\code{-1} is actually an expression composed of the unary operator
`\code{-}' and the literal \code{1}.

\subsection{Integer and long integer literals\label{integers}}

Integer and long integer literals are described by the following
lexical definitions:

\begin{verbatim}
longinteger:    integer ("l"|"L")
integer:        decimalinteger | octinteger | hexinteger
decimalinteger: nonzerodigit digit* | "0"
octinteger:     "0" octdigit+
hexinteger:     "0" ("x"|"X") hexdigit+
nonzerodigit:   "1"..."9"
octdigit:       "0"..."7"
hexdigit:       digit|"a"..."f"|"A"..."F"
\end{verbatim}

Although both lower case `l' and upper case `L' are allowed as a suffix
for long integers, it is strongly recommended to always use `L', since
the letter `l' looks too much like the digit `1'.

Plain integer decimal literals must be at most 2147483647 (i.e., the
largest positive integer, using 32-bit arithmetic). Plain octal and
hexadecimal literals may be as large as 4294967295, but values larger
than 2147483647 are converted to a negative value by subtracting
4294967296. There is no limit for long integer literals apart from
what can be stored in available memory.

Some examples of plain and long integer literals:

\begin{verbatim}
7     2147483647                        0177    0x80000000
3L    79228162514264337593543950336L    0377L   0x100000000L
\end{verbatim}
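
The conversion of large plain octal and hexadecimal literals can be
written out as arithmetic. The function below is a sketch that
simulates the rule on ordinary integers; the name
\code{plain_int_value} is hypothetical, and the code assumes an
interpreter whose integers do not overflow at 32 bits, so that the
magnitudes themselves can be represented.

\begin{verbatim}
def plain_int_value(magnitude):
    """Value of a plain octal or hex literal under 32-bit arithmetic."""
    if not 0 <= magnitude <= 4294967295:
        raise OverflowError('too large for a plain integer literal')
    if magnitude > 2147483647:
        # Values above the largest positive integer wrap around.
        return magnitude - 4294967296
    return magnitude

largest = plain_int_value(0x7FFFFFFF)     # still positive
wrapped = plain_int_value(0x80000000)     # becomes negative
all_ones = plain_int_value(0xFFFFFFFF)
octal = plain_int_value(int('177', 8))    # the literal 0177
\end{verbatim}

Under this rule the example literal \code{0x80000000} denotes the most
negative 32-bit integer, not a positive number.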

\subsection{Floating point literals\label{floating}}

Floating point literals are described by the following lexical
definitions:

\begin{verbatim}
floatnumber:    pointfloat | exponentfloat
pointfloat:     [intpart] fraction | intpart "."
exponentfloat:  (nonzerodigit digit* | pointfloat) exponent
intpart:        nonzerodigit digit* | "0"
fraction:       "." digit+
exponent:       ("e"|"E") ["+"|"-"] digit+
\end{verbatim}

Note that the integer part of a floating point number cannot look like
an octal integer, though the exponent may look like an octal literal
but will always be interpreted using radix 10. For example,
\samp{1e010} is legal, while \samp{07.1} is a syntax error.
The allowed range of floating point literals is
implementation-dependent.
Some examples of floating point literals:

\begin{verbatim}
3.14    10.    .001    1e100    3.14e-10
\end{verbatim}

Note that numeric literals do not include a sign; a phrase like
\code{-1} is actually an expression composed of the operator
\code{-} and the literal \code{1}.
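
The lexical definitions for floating point literals can be transcribed
rule by rule into a regular expression. This is a sketch for checking
the grammar as written in this section; the names are hypothetical and
the pattern is not taken from the implementation.

\begin{verbatim}
import re

INTPART = r'(?:[1-9][0-9]*|0)'           # nonzerodigit digit* | "0"
FRACTION = r'(?:\.[0-9]+)'               # "." digit+
EXPONENT = r'(?:[eE][+-]?[0-9]+)'        # ("e"|"E") ["+"|"-"] digit+
# [intpart] fraction | intpart "."
POINTFLOAT = r'(?:%s?%s|%s\.)' % (INTPART, FRACTION, INTPART)
# (nonzerodigit digit* | pointfloat) exponent
EXPONENTFLOAT = r'(?:(?:[1-9][0-9]*|%s)%s)' % (POINTFLOAT, EXPONENT)
FLOATNUMBER = re.compile(r'(?:%s|%s)' % (EXPONENTFLOAT, POINTFLOAT))

def is_float_literal(text):
    """Return true if text matches the floatnumber definition."""
    return FLOATNUMBER.fullmatch(text) is not None

legal = [s for s in ('3.14', '10.', '.001', '1e100', '3.14e-10', '1e010')
         if is_float_literal(s)]
illegal = [s for s in ('07.1', '3', '-1.0')
           if not is_float_literal(s)]
\end{verbatim}

The string \code{-1.0} is rejected because the sign is a separate
operator token, and \code{07.1} is rejected because its integer part
would look like an octal integer.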

\subsection{Imaginary literals\label{imaginary}}

Imaginary literals are described by the following lexical definitions:

\begin{verbatim}
imagnumber: (floatnumber | intpart) ("j"|"J")
\end{verbatim}

An imaginary literal yields a complex number with a real part of
0.0. Complex numbers are represented as a pair of floating point
numbers and have the same restrictions on their range. To create a
complex number with a nonzero real part, add a floating point number
to it, e.g., \code{(3+4j)}. Some examples of imaginary literals:

\begin{verbatim}
3.14j   10.j    10j     .001j   1e100j  3.14e-10j
\end{verbatim}
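
A brief sketch of how imaginary literals behave in expressions; the
variable names are hypothetical.

\begin{verbatim}
pure = 10j                # imaginary literal: the real part is 0.0
same = 10.j               # a floatnumber followed by "j"
z = 3 + 4j                # an expression, not a single literal

parts = (pure.real, pure.imag)
z_parts = (z.real, z.imag)
\end{verbatim}

Both parts of the result are floating point numbers, even when the
literal is written with integer digits.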


\section{Operators\label{operators}}

The following tokens are operators:
\index{operators}

\begin{verbatim}
+       -       *       **      /       %
<<      >>      &       |       ^       ~
<       >       <=      >=      ==      !=      <>
\end{verbatim}

The comparison operators \code{<>} and \code{!=} are alternate
spellings of the same operator. \code{!=} is the preferred spelling;
\code{<>} is obsolescent.

\section{Delimiters\label{delimiters}}

The following tokens serve as delimiters in the grammar:
\index{delimiters}

\begin{verbatim}
(       )       [       ]       {       }
,       :       .       `       =       ;
\end{verbatim}

The period can also occur in floating-point and imaginary literals. A
sequence of three periods has a special meaning as an ellipsis in slices.

The following printing \ASCII{} characters have special meaning as part
of other tokens or are otherwise significant to the lexical analyzer:

\begin{verbatim}
'       "       #       \
\end{verbatim}

The following printing \ASCII{} characters are not used in Python. Their
occurrence outside string literals and comments is an unconditional
error:
\index{ASCII@\ASCII{}}

\begin{verbatim}
@       $       ?
\end{verbatim}