\chapter{Lexical analysis\label{lexical}}

A Python program is read by a \emph{parser}. Input to the parser is a
stream of \emph{tokens}, generated by the \emph{lexical analyzer}. This
chapter describes how the lexical analyzer breaks a file into tokens.
\index{lexical analysis}
\index{parser}
\index{token}

Python uses the 7-bit \ASCII{} character set for program text and string
literals. 8-bit characters may be used in string literals and comments
but their interpretation is platform dependent; the proper way to
insert 8-bit characters in string literals is by using octal or
hexadecimal escape sequences.

The run-time character set depends on the I/O devices connected to the
program but is generally a superset of \ASCII{}.

\strong{Future compatibility note:} It may be tempting to assume that the
character set for 8-bit characters is ISO Latin-1 (an \ASCII{}
superset that covers most western languages that use the Latin
alphabet), but it is possible that in the future Unicode text editors
will become common. These generally use the UTF-8 encoding, which is
also an \ASCII{} superset, but with very different use for the
characters with ordinals 128--255. While there is no consensus on this
subject yet, it is unwise to assume either Latin-1 or UTF-8, even
though the current implementation appears to favor Latin-1. This
applies both to the source character set and the run-time character
set.


\section{Line structure\label{line-structure}}

A Python program is divided into a number of \emph{logical lines}.
\index{line structure}


\subsection{Logical lines\label{logical}}

The end of a logical line is represented by the token NEWLINE.
Statements cannot
cross logical line boundaries except where NEWLINE is allowed by the
syntax (e.g., between statements in compound statements).
A logical line is constructed from one or more \emph{physical lines}
by following the explicit or implicit \emph{line joining} rules.
\index{logical line}
\index{physical line}
\index{line joining}
\index{NEWLINE token}


\subsection{Physical lines\label{physical}}

A physical line ends in whatever the current platform's convention is
for terminating lines. On \UNIX{}, this is the \ASCII{} LF (linefeed)
character. On DOS/Windows, it is the \ASCII{} sequence CR LF (return
followed by linefeed). On Macintosh, it is the \ASCII{} CR (return)
character.


\subsection{Comments\label{comments}}

A comment starts with a hash character (\code{\#}) that is not part of
a string literal, and ends at the end of the physical line. A comment
signifies the end of the logical line unless the implicit line joining
rules are invoked.
Comments are ignored by the syntax; they are not tokens.
\index{comment}
\index{hash character}


\subsection{Explicit line joining\label{explicit-joining}}

Two or more physical lines may be joined into a single logical line
using backslash characters (\code{\e}), as follows: when a physical line
ends in a backslash that is not part of a string literal or comment, it
is joined with the following line to form a single logical line, deleting
the backslash and the following end-of-line character. For example:
\index{physical line}
\index{line joining}
\index{line continuation}
\index{backslash character}
%
\begin{verbatim}
if 1900 < year < 2100 and 1 <= month <= 12 \
   and 1 <= day <= 31 and 0 <= hour < 24 \
   and 0 <= minute < 60 and 0 <= second < 60:   # Looks like a valid date
        return 1
\end{verbatim}

A line ending in a backslash cannot carry a comment. A backslash does
not continue a comment. A backslash does not continue a token except
for string literals (i.e., tokens other than string literals cannot be
split across physical lines using a backslash). A backslash is
illegal elsewhere on a line outside a string literal.
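The joining rule can be observed with a short sketch. It is an
illustration only: it assumes a modern Python~3 interpreter and its
standard \module{tokenize} module (both postdate this chapter), and the
variable names are invented for the example. Three physical lines joined
by backslashes should yield exactly one NEWLINE token, that is, one
logical line.

```python
import io
import tokenize

# Three physical lines; the first two end in a backslash.
# (The names year, month, ... are placeholders; the code is only tokenized,
# never executed.)
source = (
    "ok = 1900 < year < 2100 and 1 <= month <= 12 \\\n"
    "    and 1 <= day <= 31 and 0 <= hour < 24 \\\n"
    "    and 0 <= minute < 60\n"
)

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Each logical line ends in one NEWLINE token.
newline_count = sum(1 for tok in tokens if tok.type == tokenize.NEWLINE)
print(newline_count)
```

The backslashes themselves do not appear in the token stream; only the
end of the third physical line is reported as NEWLINE.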


\subsection{Implicit line joining\label{implicit-joining}}

Expressions in parentheses, square brackets or curly braces can be
split over more than one physical line without using backslashes.
For example:

\begin{verbatim}
month_names = ['Januari', 'Februari', 'Maart',      # These are the
               'April',   'Mei',      'Juni',       # Dutch names
               'Juli',    'Augustus', 'September',  # for the months
               'Oktober', 'November', 'December']   # of the year
\end{verbatim}

Implicitly continued lines can carry comments. The indentation of the
continuation lines is not important. Blank continuation lines are
allowed. There is no NEWLINE token between implicit continuation
lines. Implicitly continued lines can also occur within triple-quoted
strings (see below); in that case they cannot carry comments.
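As a hedged illustration of these rules, the sketch below tokenizes a
bracketed expression that spans three physical lines, one of them blank.
It assumes a modern Python~3 interpreter and the standard
\module{tokenize} module, neither of which is described by this chapter;
the list contents are shortened from the example above.

```python
import io
import tokenize

# A list display split over three physical lines (the middle one is blank),
# with a comment on the first and last line.
source = (
    "month_names = ['Januari', 'Februari',   # first two\n"
    "\n"
    "        'Maart']                        # the third\n"
)

tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# NEWLINE marks the end of a logical line; line breaks inside the
# brackets are not reported as NEWLINE.
logical_line_ends = [tok for tok in tokens if tok.type == tokenize.NEWLINE]
comments = [tok.string for tok in tokens if tok.type == tokenize.COMMENT]
print(len(logical_line_ends), comments)
```

Both comments are seen by the tokenizer, yet only one logical line ends,
which matches the statement that implicitly continued lines may carry
comments and produce no NEWLINE token between them.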


\subsection{Blank lines \index{blank line}\label{blank-lines}}

A logical line that contains only spaces, tabs, formfeeds and possibly
a comment is ignored (i.e., no NEWLINE token is generated). During
interactive input of statements, handling of a blank line may differ
depending on the implementation of the read-eval-print loop. In the
standard implementation, an entirely blank logical line (i.e.\ one
containing not even whitespace or a comment) terminates a multi-line
statement.


\subsection{Indentation\label{indentation}}

Leading whitespace (spaces and tabs) at the beginning of a logical
line is used to compute the indentation level of the line, which in
turn is used to determine the grouping of statements.
\index{indentation}
\index{whitespace}
\index{leading whitespace}
\index{space}
\index{tab}
\index{grouping}
\index{statement grouping}

First, tabs are replaced (from left to right) by one to eight spaces
such that the total number of characters up to and including the
replacement is a multiple of
eight (this is intended to be the same rule as used by \UNIX{}). The
total number of spaces preceding the first non-blank character then
determines the line's indentation. Indentation cannot be split over
multiple physical lines using backslashes; the whitespace up to the
first backslash determines the indentation.
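The tab replacement rule amounts to advancing to the next multiple of
eight. A minimal sketch follows; the function name
\code{indentation_width} and its \code{tabsize} parameter are
hypothetical, and the code only models the rule as stated here, not the
behaviour of any particular interpreter.

```python
def indentation_width(line, tabsize=8):
    """Return the indentation of `line` after expanding leading tabs.

    Hypothetical helper: models the rule in the text, nothing more.
    """
    width = 0
    for ch in line:
        if ch == " ":
            width += 1
        elif ch == "\t":
            # A tab is replaced by one to `tabsize` spaces so that the
            # running total becomes a multiple of `tabsize`.
            width = (width // tabsize + 1) * tabsize
        else:
            break  # first non-blank character ends the indentation
    return width


print(indentation_width("\tx"))       # a lone tab
print(indentation_width("   \tx"))    # three spaces, then a tab
print(indentation_width("    x"))     # four spaces
```

A tab preceded by three spaces therefore yields the same indentation as
a tab alone, which is one reason mixing the two is easy to get wrong.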

\strong{Cross-platform compatibility note:} because of the nature of
text editors on non-\UNIX{} platforms, it is unwise to use a mixture of
spaces and tabs for the indentation in a single source file.

A formfeed character may be present at the start of the line; it will
be ignored for the indentation calculations above. Formfeed
characters occurring elsewhere in the leading whitespace have an
undefined effect (for instance, they may reset the space count to
zero).

The indentation levels of consecutive lines are used to generate
INDENT and DEDENT tokens, using a stack, as follows.
\index{INDENT token}
\index{DEDENT token}

Before the first line of the file is read, a single zero is pushed on
the stack; this will never be popped off again. The numbers pushed on
the stack will always be strictly increasing from bottom to top. At
the beginning of each logical line, the line's indentation level is
compared to the top of the stack. If it is equal, nothing happens.
If it is larger, it is pushed on the stack, and one INDENT token is
generated. If it is smaller, it \emph{must} be one of the numbers
occurring on the stack; all numbers on the stack that are larger are
popped off, and for each number popped off a DEDENT token is
generated. At the end of the file, a DEDENT token is generated for
each number remaining on the stack that is larger than zero.
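The stack discipline just described can be sketched as a small
function. This is an illustration under stated assumptions, not the
tokenizer's actual code: the function name \code{indent_tokens} and the
use of \exception{ValueError} to signal an inconsistent dedent are
invented for the example, and the input is simply a list of indentation
levels, one per logical line.

```python
def indent_tokens(levels):
    """Return the INDENT/DEDENT tokens for a list of indentation levels.

    Hypothetical helper: `levels` holds one indentation level per
    logical line, in file order.
    """
    stack = [0]          # the initial zero is never popped
    tokens = []
    for level in levels:
        if level > stack[-1]:
            stack.append(level)
            tokens.append("INDENT")
        elif level < stack[-1]:
            # A smaller level must already be on the stack.
            if level not in stack:
                raise ValueError("inconsistent dedent to level %d" % level)
            while stack[-1] > level:
                stack.pop()
                tokens.append("DEDENT")
        # equal levels generate nothing
    # End of file: one DEDENT per remaining level above zero.
    while stack[-1] > 0:
        stack.pop()
        tokens.append("DEDENT")
    return tokens


print(indent_tokens([0, 4, 8, 4, 0]))
```

Because the stack is strictly increasing, a single line can generate
several DEDENT tokens but never more than one INDENT token.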

Here is an example of a correctly (though confusingly) indented piece
of Python code:

\begin{verbatim}
def perm(l):
        # Compute the list of all permutations of l
    if len(l) <= 1:
                  return [l]
    r = []
    for i in range(len(l)):
             s = l[:i] + l[i+1:]
             p = perm(s)
             for x in p:
                     r.append(l[i:i+1] + x)
    return r
\end{verbatim}

The following example shows various indentation errors:

\begin{verbatim}
 def perm(l):                       # error: first line indented
for i in range(len(l)):             # error: not indented
    s = l[:i] + l[i+1:]
        p = perm(l[:i] + l[i+1:])   # error: unexpected indent
        for x in p:
                r.append(l[i:i+1] + x)
            return r                # error: inconsistent dedent
\end{verbatim}

(Actually, the first three errors are detected by the parser; only the
last error is found by the lexical analyzer: the indentation of
\code{return r} does not match a level popped off the stack.)


\subsection{Whitespace between tokens\label{whitespace}}

Except at the beginning of a logical line or in string literals, the
whitespace characters space, tab and formfeed can be used
interchangeably to separate tokens. Whitespace is needed between two
tokens only if their concatenation could otherwise be interpreted as a
different token (e.g., \code{ab} is one token, but \code{a b} is two
tokens).


\section{Other tokens\label{other-tokens}}

Besides NEWLINE, INDENT and DEDENT, the following categories of tokens
exist: \emph{identifiers}, \emph{keywords}, \emph{literals},
\emph{operators}, and \emph{delimiters}.
Whitespace characters (other than line terminators, discussed earlier)
are not tokens, but serve to delimit tokens.
Where
ambiguity exists, a token comprises the longest possible string that
forms a legal token, when read from left to right.
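The longest-match rule can be checked with a brief sketch. It assumes a
modern Python~3 interpreter and its standard \module{tokenize} module;
the helper name \code{token_strings} is invented for the example. The
same characters are tokenized once without spaces and once with spaces
that force a different split.

```python
import io
import tokenize


def token_strings(source):
    """Return the text of each NAME, NUMBER and OP token in `source`.

    Hypothetical helper built on the standard tokenize module.
    """
    wanted = (tokenize.NAME, tokenize.NUMBER, tokenize.OP)
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type in wanted]


# Read left to right, "<<=" is the longest string that forms a token.
joined = token_strings("a<<=b\n")
# Whitespace forces the shorter tokens "<" and "<=".
spaced = token_strings("a < <= b\n")
print(joined, spaced)
```

Only tokenization is shown here; whether the resulting token sequence
forms a valid statement is decided later, by the parser.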


\section{Identifiers and keywords\label{identifiers}}

Identifiers (also referred to as \emph{names}) are described by the following
lexical definitions:
\index{identifier}
\index{name}

\begin{verbatim}
identifier:     (letter|"_") (letter|digit|"_")*
letter:         lowercase | uppercase
lowercase:      "a"..."z"
uppercase:      "A"..."Z"
digit:          "0"..."9"
\end{verbatim}

Identifiers are unlimited in length. Case is significant.
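The lexical definition above translates directly into a regular
expression. The following sketch uses the standard \module{re} module;
the names \code{IDENTIFIER} and \code{is_identifier} are hypothetical,
and the check covers the lexical form only (a keyword also matches it).

```python
import re

# (letter|"_") (letter|digit|"_")*  with ASCII letters and digits only
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_identifier(text):
    """Return True if `text` has the lexical form of an identifier."""
    return IDENTIFIER.fullmatch(text) is not None


print(is_identifier("_spam1"), is_identifier("1spam"))
```

Since case is significant, \code{Spam} and \code{spam} both match but
denote different names.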


\subsection{Keywords\label{keywords}}

The following identifiers are used as reserved words, or
\emph{keywords} of the language, and cannot be used as ordinary
identifiers. They must be spelled exactly as written here:%
\index{keyword}%
\index{reserved word}

\begin{verbatim}
and       del       for       is        raise
assert    elif      from      lambda    return
break     else      global    not       try
class     except    if        or        yield
continue  exec      import    pass      while
def       finally   in        print
\end{verbatim}

% When adding keywords, use reswords.py for reformatting


\subsection{Reserved classes of identifiers\label{id-classes}}

Certain classes of identifiers (besides keywords) have special
meanings. These are:

\begin{tableiii}{l|l|l}{code}{Form}{Meaning}{Notes}
\lineiii{_*}{Not imported by \samp{from \var{module} import *}}{(1)}
\lineiii{__*__}{System-defined name}{}
\lineiii{__*}{Class-private name mangling}{}
\end{tableiii}

(XXX need section references here.)

Note:

\begin{description}
\item[(1)] The special identifier \samp{_} is used in the interactive
interpreter to store the result of the last evaluation; it is stored
in the \module{__builtin__} module. When not in interactive mode,
\samp{_} has no special meaning and is not defined.
\end{description}
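The third row of the table can be illustrated with a short sketch. The
class name \code{Account} and the attribute name are hypothetical, and
the example only shows the visible effect of class-private name
mangling; the precise rules are outside the scope of this chapter.

```python
class Account:
    """Hypothetical class used only to show name mangling."""

    def __init__(self):
        # Inside the class body, a name of the form __* is rewritten
        # to include the class name.
        self.__balance = 0

    def balance(self):
        return self.__balance


acct = Account()
# The attribute is stored under the mangled name.
stored_names = sorted(vars(acct))
print(stored_names)
```

Outside the class, the unmangled spelling does not name the attribute,
which is what makes the name effectively private to the class.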


\section{Literals\label{literals}}

Literals are notations for constant values of some built-in types.
\index{literal}
\index{constant}


\subsection{String literals\label{strings}}

String literals are described by the following lexical definitions:
\index{string literal}

\begin{verbatim}
stringliteral:   shortstring | longstring
shortstring:     "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring:      "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem: shortstringchar | escapeseq
longstringitem:  longstringchar | escapeseq
shortstringchar: <any ASCII character except "\" or newline or the quote>
longstringchar:  <any ASCII character except "\">
escapeseq:       "\" <any ASCII character>
\end{verbatim}
\index{ASCII@\ASCII{}}

\index{triple-quoted string}
\index{Unicode Consortium}
\index{string!Unicode}
In plain English: String literals can be enclosed in matching single
quotes (\code{'}) or double quotes (\code{"}). They can also be
enclosed in matching groups of three single or double quotes (these
are generally referred to as \emph{triple-quoted strings}). The
backslash (\code{\e}) character is used to escape characters that
otherwise have a special meaning, such as newline, backslash itself,
or the quote character. String literals may optionally be prefixed
with a letter `r' or `R'; such strings are called
\dfn{raw strings}\index{raw string} and use different rules for
backslash escape sequences. A prefix of `u' or `U' makes the string
a Unicode string. Unicode strings use the Unicode character set as
defined by the Unicode Consortium and ISO~10646. Some additional
escape sequences, described below, are available in Unicode strings.

In triple-quoted strings,
unescaped newlines and quotes are allowed (and are retained), except
that three unescaped quotes in a row terminate the string. (A
``quote'' is the character used to open the string, i.e.\ either
\code{'} or \code{"}.)

Unless an `r' or `R' prefix is present, escape sequences in strings
are interpreted according to rules similar
to those used by Standard \C{}. The recognized escape sequences are:
\index{physical line}
\index{escape sequence}
\index{Standard C}
\index{C}

\begin{tableii}{l|l}{code}{Escape Sequence}{Meaning}
\lineii{\e\var{newline}} {Ignored}
\lineii{\e\e} {Backslash (\code{\e})}
\lineii{\e'} {Single quote (\code{'})}
\lineii{\e"} {Double quote (\code{"})}
\lineii{\e a} {\ASCII{} Bell (BEL)}
\lineii{\e b} {\ASCII{} Backspace (BS)}
\lineii{\e f} {\ASCII{} Formfeed (FF)}
\lineii{\e n} {\ASCII{} Linefeed (LF)}
\lineii{\e N\{\var{name}\}}
       {Character named \var{name} in the Unicode database (Unicode only)}
\lineii{\e r} {\ASCII{} Carriage Return (CR)}
\lineii{\e t} {\ASCII{} Horizontal Tab (TAB)}
\lineii{\e u\var{xxxx}}
       {Character with 16-bit hex value \var{xxxx} (Unicode only)}
\lineii{\e U\var{xxxxxxxx}}
       {Character with 32-bit hex value \var{xxxxxxxx} (Unicode only)}
\lineii{\e v} {\ASCII{} Vertical Tab (VT)}
\lineii{\e\var{ooo}} {\ASCII{} character with octal value \var{ooo}}
\lineii{\e x\var{hh}} {\ASCII{} character with hex value \var{hh}}
\end{tableii}
\index{ASCII@\ASCII{}}

As in Standard \C{}, up to three octal digits are accepted. However,
exactly two hex digits are taken in hex escapes.
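The digit-count rules can be seen in a short sketch. The variable names
are invented for the example; the values rely only on the table above
and on the \ASCII{} code of the letter \code{A} (octal 101, hex 41).

```python
# Octal escapes take at most three digits: "\101" is "A" and the
# fourth digit is an ordinary character.
octal_three = "\1011"

# Fewer than three octal digits are also accepted: "\7" is BEL.
octal_one = "\7"

# Hex escapes take exactly two digits: "\x41" is "A" and the third
# digit is an ordinary character.
hex_two = "\x411"

print(repr(octal_three), repr(octal_one), repr(hex_two))
```

In both multi-digit cases the result is a two-character string, the
letter \code{A} followed by the digit \code{1}.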

Unlike Standard \index{unrecognized escape sequence}\C{},
all unrecognized escape sequences are left in the string unchanged,
i.e., \emph{the backslash is left in the string}. (This behavior is
useful when debugging: if an escape sequence is mistyped, the
resulting output is more easily recognized as broken.) It is also
important to note that the escape sequences marked as ``(Unicode
only)'' in the table above fall into the category of unrecognized
escapes for non-Unicode string literals.

When an `r' or `R' prefix is present, a character following a
backslash is included in the string without change, and \emph{all
backslashes are left in the string}. For example, the string literal
\code{r"\e n"} consists of two characters: a backslash and a lowercase
`n'. String quotes can be escaped with a backslash, but the backslash
remains in the string; for example, \code{r"\e""} is a valid string
literal consisting of two characters: a backslash and a double quote;
\code{r"\e"} is not a valid string literal (even a raw string cannot
end in an odd number of backslashes). Specifically, \emph{a raw
string cannot end in a single backslash} (since the backslash would
escape the following quote character). Note also that a single
backslash followed by a newline is interpreted as those two characters
as part of the string, \emph{not} as a line continuation.
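The raw-string rules above can be checked with a few literals. This is
a small sketch with invented variable names; each literal is taken from
or modelled on the cases discussed in the preceding paragraph.

```python
plain = "\n"        # one character: a linefeed
raw = r"\n"         # two characters: a backslash and a lowercase n

# The backslash escapes the quote for the tokenizer but stays in the string.
quoted = r"\""      # two characters: a backslash and a double quote

# In a raw string, backslash-newline is kept as two characters.
continued = r"""a\
b"""

print(len(plain), len(raw), len(quoted), len(continued))
```

The last literal contains four characters, since neither the backslash
nor the newline that follows it is removed.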


\subsection{String literal concatenation\label{string-catenation}}

Multiple adjacent string literals (delimited by whitespace), possibly
using different quoting conventions, are allowed, and their meaning is
the same as their concatenation. Thus, \code{"hello" 'world'} is
equivalent to \code{"helloworld"}. This feature can be used to reduce
the number of backslashes needed, to split long strings conveniently
across long lines, or even to add comments to parts of strings, for
example:

\begin{verbatim}
re.compile("[A-Za-z_]"       # letter or underscore
           "[A-Za-z0-9_]*"   # letter, digit or underscore
          )
\end{verbatim}

Note that this feature is defined at the syntactical level, but
implemented at compile time. The `+' operator must be used to
concatenate string expressions at run time. Also note that literal
concatenation can use different quoting styles for each component
(even mixing raw strings and triple quoted strings).


\subsection{Unicode literals \label{unicode}}

XXX explain more here...


\subsection{Numeric literals\label{numbers}}

There are four types of numeric literals: plain integers, long
integers, floating point numbers, and imaginary numbers. There are no
complex literals (complex numbers can be formed by adding a real
number and an imaginary number).
\index{number}
\index{numeric literal}
\index{integer literal}
\index{plain integer literal}
\index{long integer literal}
\index{floating point literal}
\index{hexadecimal literal}
\index{octal literal}
\index{decimal literal}
\index{imaginary literal}
\index{complex literal}

Note that numeric literals do not include a sign; a phrase like
\code{-1} is actually an expression composed of the unary operator
`\code{-}' and the literal \code{1}.
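That the sign is a separate token can be observed directly. The sketch
assumes a modern Python~3 interpreter and the standard
\module{tokenize} module, which this chapter does not describe; the
variable names are invented for the example.

```python
import io
import tokenize

readline = io.StringIO("-1\n").readline

# Keep only operator and number tokens, as (type name, text) pairs.
pairs = [(tokenize.tok_name[tok.type], tok.string)
         for tok in tokenize.generate_tokens(readline)
         if tok.type in (tokenize.OP, tokenize.NUMBER)]
print(pairs)
```

The minus sign is reported as an operator token and the digit as a
number token, so the negation is applied when the expression is
evaluated rather than being part of the literal.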


\subsection{Integer and long integer literals\label{integers}}

Integer and long integer literals are described by the following
lexical definitions:

\begin{verbatim}
longinteger:    integer ("l"|"L")
integer:        decimalinteger | octinteger | hexinteger
decimalinteger: nonzerodigit digit* | "0"
octinteger:     "0" octdigit+
hexinteger:     "0" ("x"|"X") hexdigit+
nonzerodigit:   "1"..."9"
octdigit:       "0"..."7"
hexdigit:       digit|"a"..."f"|"A"..."F"
\end{verbatim}

Although both lower case `l' and upper case `L' are allowed as a suffix
for long integers, it is strongly recommended to always use `L', since
the letter `l' looks too much like the digit `1'.

Plain integer decimal literals must be at most 2147483647 (i.e., the
largest positive integer, using 32-bit arithmetic). Plain octal and
hexadecimal literals may be as large as 4294967295, but values larger
than 2147483647 are converted to a negative value by subtracting
4294967296. There is no limit for long integer literals apart from
what can be stored in available memory.

Some examples of plain and long integer literals:

\begin{verbatim}
7     2147483647                        0177    0x80000000
3L    79228162514264337593543950336L    0377L   0x100000000L
\end{verbatim}
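The conversion rule for large octal and hexadecimal literals is plain
arithmetic and can be worked through in a sketch. The function name
\code{as_plain_integer} is hypothetical, and the function reproduces
the rule as stated above; integers in a modern Python~3 interpreter do
not wrap around by themselves.

```python
def as_plain_integer(value):
    """Apply the 32-bit rule for plain octal and hexadecimal literals.

    Hypothetical helper: `value` is the unsigned magnitude written
    in the literal.
    """
    if not 0 <= value <= 4294967295:
        raise ValueError("value does not fit in 32 bits")
    if value > 2147483647:
        # Larger values become negative by subtracting 2**32.
        return value - 4294967296
    return value


print(as_plain_integer(0x80000000))
print(as_plain_integer(0xFFFFFFFF))
print(as_plain_integer(0x7FFFFFFF))
```

Under this rule the literal \code{0x80000000} from the examples above
denotes the most negative 32-bit integer.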
| 483 | |
Fred Drake | f5eae66 | 2001-06-23 05:26:52 +0000 | [diff] [blame^] | 484 | |
\subsection{Floating point literals\label{floating}}

Floating point literals are described by the following lexical
definitions:

\begin{verbatim}
floatnumber:    pointfloat | exponentfloat
pointfloat:     [intpart] fraction | intpart "."
exponentfloat:  (nonzerodigit digit* | pointfloat) exponent
intpart:        nonzerodigit digit* | "0"
fraction:       "." digit+
exponent:       ("e"|"E") ["+"|"-"] digit+
\end{verbatim}

Note that the integer part of a floating point number cannot look like
an octal integer.  The exponent may look like an octal literal, but it
is always interpreted using radix 10.  For example, \samp{1e010} is
legal (and denotes the same value as \samp{1e10}), while \samp{07.1}
is a syntax error.
The allowed range of floating point literals is
implementation-dependent.
Some examples of floating point literals:

\begin{verbatim}
3.14    10.    .001    1e100    3.14e-10
\end{verbatim}
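
The lexical definitions above can be transcribed into a regular expression to test candidate spellings. This is only a sketch: it uses the \code{re} module of a current Python 3 interpreter, and the names \code{FLOATNUMBER} and \code{is\_float\_literal} are hypothetical. It checks a string against the grammar as written here and does not reproduce the behaviour of any particular tokenizer.

```python
import re

# Each pattern mirrors one production of the grammar above.
INTPART = r"(?:[1-9][0-9]*|0)"
FRACTION = r"\.[0-9]+"
EXPONENT = r"[eE][+-]?[0-9]+"
POINTFLOAT = rf"(?:(?:{INTPART})?{FRACTION}|{INTPART}\.)"
EXPONENTFLOAT = rf"(?:(?:[1-9][0-9]*|{POINTFLOAT}){EXPONENT})"
FLOATNUMBER = re.compile(rf"(?:{EXPONENTFLOAT}|{POINTFLOAT})")


def is_float_literal(text):
    """True if the whole of `text` matches the floatnumber production."""
    return FLOATNUMBER.fullmatch(text) is not None


for candidate in ["3.14", "10.", ".001", "1e100", "3.14e-10", "1e010", "07.1"]:
    print(candidate, is_float_literal(candidate))
```

All of the examples above match, as does \samp{1e010}, while \samp{07.1} is rejected because \samp{07} is not a valid \code{intpart}.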

Note that numeric literals do not include a sign; a phrase like
\code{-1} is actually an expression composed of the operator
\code{-} and the literal \code{1}.


\subsection{Imaginary literals\label{imaginary}}

Imaginary literals are described by the following lexical definitions:

\begin{verbatim}
imagnumber: (floatnumber | intpart) ("j"|"J")
\end{verbatim}

An imaginary literal yields a complex number with a real part of
0.0.  Complex numbers are represented as a pair of floating point
numbers and have the same restrictions on their range.  To create a
complex number with a nonzero real part, add a real number to an
imaginary literal, e.g., \code{(3+4j)}.  Some examples of imaginary
literals:

\begin{verbatim}
3.14j   10.j    10j     .001j   1e100j  3.14e-10j
\end{verbatim}
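
A brief illustration of the difference between an imaginary literal and a complex-valued expression follows. It was written against a current Python 3 interpreter; the variable names are chosen only for this example.

```python
imaginary = 4j        # a single imaginary literal; its real part is 0.0
combined = 3 + 4j     # an expression: the integer 3 plus the literal 4j

print(imaginary.real, imaginary.imag)   # 0.0 4.0
print(combined.real, combined.imag)     # 3.0 4.0

# Both components are stored as floating point numbers.
print(type(combined.real).__name__)     # float
```

The value \code{3+4j} is therefore produced by an addition at run time; only \code{4j} is a literal.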


\section{Operators\label{operators}}

The following tokens are operators:
\index{operators}

\begin{verbatim}
+       -       *       **      /       %
<<      >>      &       |       ^       ~
<       >       <=      >=      ==      !=      <>
\end{verbatim}

The comparison operators \code{<>} and \code{!=} are alternate
spellings of the same operator.  \code{!=} is the preferred spelling;
\code{<>} is obsolescent.
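
As a rough sketch of how these tokens might be recognized, the operator list can be turned into a regular expression that tries longer spellings first. This assumes a longest-match rule, so that \code{**} is not read as two \code{*} tokens. The names \code{OPERATORS}, \code{split\_operators} and \code{canonical} are hypothetical, and the code targets the \code{re} module of a current Python 3 interpreter, not any actual tokenizer.

```python
import re

# The operator tokens listed above.
OPERATORS = [
    "+", "-", "*", "**", "/", "%",
    "<<", ">>", "&", "|", "^", "~",
    "<", ">", "<=", ">=", "==", "!=", "<>",
]

# Longer spellings go first so that alternation prefers the longest match.
OPERATOR_RE = re.compile(
    "|".join(re.escape(op) for op in sorted(OPERATORS, key=len, reverse=True))
)


def split_operators(text):
    """Return the operator tokens found in `text`, in order of appearance."""
    return OPERATOR_RE.findall(text)


def canonical(op):
    """Map the obsolescent spelling <> to the preferred spelling !=."""
    return "!=" if op == "<>" else op


print(split_operators("a ** b"))                             # ['**']
print([canonical(op) for op in split_operators("x <> y")])   # ['!=']
```

Sorting by length is one simple way to get longest-match behaviour out of a regular expression alternation; a hand-written scanner would reach the same result by reading ahead one character.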


\section{Delimiters\label{delimiters}}

The following tokens serve as delimiters in the grammar:
\index{delimiters}

\begin{verbatim}
(       )       [       ]       {       }
,       :       .       `       =       ;
+=      -=      *=      /=      %=      **=
&=      |=      ^=      >>=     <<=
\end{verbatim}

The period can also occur in floating-point and imaginary literals.  A
sequence of three periods has a special meaning as an ellipsis in slices.
The second half of the list, the augmented assignment operators, serve
lexically as delimiters, but also perform an operation.
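
The dual role of the augmented assignment operators can be illustrated with a small sketch. It assumes a current Python 3 interpreter and its standard \code{tokenize} module, which postdate this text; the helper name \code{op\_tokens} is hypothetical.

```python
import io
import tokenize


def op_tokens(source):
    """Return the text of every operator/delimiter token in `source`."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type == tokenize.OP]


# Each augmented assignment operator is read as one token ...
print(op_tokens("total **= 2; n >>= 1\n"))   # ['**=', ';', '>>=']

# ... and it also performs an operation when the statement runs.
total = 3
total **= 2
print(total)                                 # 9
```

The first call shows that \code{**=} and \code{>>=} are not split into an operator followed by \code{=}; the second part shows the operation being carried out as part of the assignment.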

The following printing \ASCII{} characters have special meaning as part
of other tokens or are otherwise significant to the lexical analyzer:

\begin{verbatim}
'       "       #       \
\end{verbatim}

The following printing \ASCII{} characters are not used in Python.  Their
occurrence outside string literals and comments is an unconditional
error:
\index{ASCII@\ASCII{}}

\begin{verbatim}
@       $       ?
\end{verbatim}