First round of corrections (lexer only).
diff --git a/Doc/ref.tex b/Doc/ref.tex
index 6af7535..a2eb381 100644
--- a/Doc/ref.tex
+++ b/Doc/ref.tex
@@ -42,9 +42,8 @@
This reference manual describes the syntax and ``core semantics'' of
the language. It is terse, but exact and complete. The semantics of
non-essential built-in object types and of the built-in functions and
-modules are described in the {\em Library Reference} document. For an
-informal introduction to the language, see the {\em Tutorial}
-document.
+modules are described in the {\em Python Library Reference}. For an
+informal introduction to the language, see the {\em Python Tutorial}.
\end{abstract}
@@ -63,132 +62,119 @@
\chapter{Lexical analysis}
-A Python program is read by a {\em parser}.
-Input to the parser is a stream of {\em tokens}, generated
-by the {\em lexical analyzer}.
+A Python program is read by a {\em parser}. Input to the parser is a
+stream of {\em tokens}, generated by the {\em lexical analyzer}. This
+chapter describes how the lexical analyzer breaks a file into tokens.
\section{Line structure}
-A Python program is divided in a number of logical lines.
-Statements may not straddle logical line boundaries except where
-explicitly allowed by the syntax.
-To this purpose, the end of a logical line
-is represented by the token NEWLINE.
+A Python program is divided into a number of logical lines. Statements
+do not straddle logical line boundaries except where explicitly
+indicated by the syntax (i.e., for compound statements). To this
+purpose, the end of a logical line is represented by the token
+NEWLINE.
\subsection{Comments}
-A comment starts with a hash character (\verb/#/) and ends at the end
-of the physical line. Comments are ignored by the syntax.
-A hash character in a string literal does not start a comment.
+A comment starts with a hash character (\verb\#\) that is not part of
+a string literal, and ends at the end of the physical line. Comments
+are ignored by the syntax.
\subsection{Line joining}
-Physical lines may be joined into logical lines using backslash
-characters (\verb/\/), as follows.
-If a physical line ends in a backslash that is not part of a string
-literal or comment, it is joined with
-the following forming a single logical line, deleting the backslash
-and the following end-of-line character. More than two physical
-lines may be joined together in this way.
+Two or more physical lines may be joined into logical lines using
+backslash characters (\verb/\/), as follows: when a physical line ends
+in a backslash that is not part of a string literal or comment, it is
+joined with the following line, forming a single logical line and
+deleting the backslash and the following end-of-line character.
\subsection{Blank lines}
-A physical line that is not the continuation of the previous line
-and contains only spaces, tabs and possibly a comment, is ignored
-(i.e., no NEWLINE token is generated),
-except that during interactive input of statements, an empty
-physical line terminates a multi-line statement.
+A logical line that contains only spaces, tabs, and possibly a
+comment, is ignored (i.e., no NEWLINE token is generated), except that
+during interactive input of statements, an entirely blank logical line
+terminates a multi-line statement.
\subsection{Indentation}
-Spaces and tabs at the beginning of a line are used to compute
+Spaces and tabs at the beginning of a logical line are used to compute
the indentation level of the line, which in turn is used to determine
the grouping of statements.
-First, each tab is replaced by one to eight spaces such that the column number
-of the next character is a multiple of eight (counting from zero).
-The column number of the first non-space character then defines the
-line's indentation.
-Indentation cannot be split over multiple physical lines using
-backslashes.
+First, each tab is replaced by one to eight spaces such that the total
+number of spaces up to that point is a multiple of eight. The total
+number of spaces preceding the first non-blank character then
+determines the line's indentation. Indentation cannot be split over
+multiple physical lines using backslashes.
The indentation levels of consecutive lines are used to generate
INDENT and DEDENT tokens, using a stack, as follows.
Before the first line of the file is read, a single zero is pushed on
-the stack; this will never be popped off again. The numbers pushed
-on the stack will always be strictly increasing from bottom to top.
-At the beginning of each logical line, the line's indentation level
-is compared to the top of the stack.
-If it is equal, nothing happens.
-If it larger, it is pushed on the stack, and one INDENT token is generated.
-If it is smaller, it {\em must} be one of the numbers occurring on the
-stack; all numbers on the stack that are larger are popped off,
-and for each number popped off a DEDENT token is generated.
-At the end of the file, a DEDENT token is generated for each number
-remaining on the stack that is larger than zero.
+the stack; this will never be popped off again. The numbers pushed on
+the stack will always be strictly increasing from bottom to top. At
+the beginning of each logical line, the line's indentation level is
+compared to the top of the stack. If it is equal, nothing happens.
+If it is larger, it is pushed on the stack, and one INDENT token is
+generated. If it is smaller, it {\em must} be one of the numbers
+occurring on the stack; all numbers on the stack that are larger are
+popped off, and for each number popped off a DEDENT token is
+generated. At the end of the file, a DEDENT token is generated for
+each number remaining on the stack that is larger than zero.
\section{Other tokens}
Besides NEWLINE, INDENT and DEDENT, the following categories of tokens
exist: identifiers, keywords, literals, operators, and delimiters.
-Spaces and tabs are not tokens, but serve to delimit tokens.
-Where ambiguity exists, a token comprises the longest possible
-string that forms a legal token, when reading from left to right.
+Spaces and tabs are not tokens, but serve to delimit tokens. Where
+ambiguity exists, a token comprises the longest possible string that
+forms a legal token, when read from left to right.
Tokens are described using an extended regular expression notation.
This is similar to the extended BNF notation used later, except that
-the notation <...> is used to give an informal description of a character,
-and that spaces and tabs are not to be ignored.
+the notation \verb\<...>\ is used to give an informal description of a
+character, and that spaces and tabs are not to be ignored.
\section{Identifiers}
Identifiers are described by the following regular expressions:
\begin{verbatim}
-identifier: (letter|'_') (letter|digit|'_')*
+identifier: (letter|"_") (letter|digit|"_")*
letter: lowercase | uppercase
-lowercase: 'a'|'b'|...|'z'
-uppercase: 'A'|'B'|...|'Z'
-digit: '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'
+lowercase: "a"|"b"|...|"z"
+uppercase: "A"|"B"|...|"Z"
+digit: "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
\end{verbatim}
-Identifiers are unlimited in length.
-Upper and lower case letters are different.
+Identifiers are unlimited in length. Case is significant.
\section{Keywords}
-The following tokens are used as reserved words,
-or keywords of the language,
-and may not be used as ordinary identifiers.
-They must be spelled exactly as written here:
+The following identifiers are used as reserved words, or {\em
+keywords} of the language, and may not be used as ordinary
+identifiers. They must be spelled exactly as written here:
-{\tt
- and
- break
- class
- continue
- def
- del
- elif
- else
- except
- finally
- for
- from
- if
- import
- in
- is
- not
- or
- pass
- print
- raise
- return
- try
- while
-}
+\begin{verbatim}
+and del for is raise
+break elif from not return
+class else if or try
+continue except import pass while
+def finally in print
+\end{verbatim}
+
+% import string
+% l = []
+% try:
+% while 1:
+% l = l + string.split(raw_input())
+% except EOFError:
+% pass
+% l.sort()
+% for i in range((len(l)+4)/5):
+% for j in range(i, len(l), 5):
+% print string.ljust(l[j], 10),
+% print
\section{Literals}
@@ -197,24 +183,47 @@
String literals are described by the following regular expressions:
\begin{verbatim}
-stringliteral: '\'' stringitem* '\''
+stringliteral: "'" stringitem* "'"
stringitem: stringchar | escapeseq
-stringchar: <any character except newline or '\\' or '\''>
-escapeseq: '\\' <any character except newline>
+stringchar: <any character except newline or "\" or "'">
+escapeseq: "\" <any character except newline>
\end{verbatim}
-String literals cannot span physical line boundaries.
-Escape sequences in strings are actually interpreted according to almost the
-same rules as used by Standard C
-(XXX which should be made explicit here),
-except that \verb/\E/ is equivalent to \verb/\033/,
-\verb/\"/ is not recognized,
-newline characters cannot be escaped, and
-{\em all unrecognized escape sequences are left in the string unchanged}.
-(The latter rule is useful when debugging: if an escape sequence is
-mistyped, the resulting output is more easily recognized as broken.
-It also helps somewhat for string literals used as regular expressions
-or otherwise passed to other modules that do their own escape handling.)
+String literals cannot span physical line boundaries. Escape
+sequences in strings are actually interpreted according to rules
+similar to those used by Standard C. The recognized escape sequences
+are:
+
+\begin{center}
+\begin{tabular}{|l|l|}
+\hline
+\verb/\\/ & Backslash (\verb/\/) \\
+\verb/\'/ & Single quote (\verb/'/) \\
+\verb/\a/ & ASCII Bell (BEL) \\
+\verb/\b/ & ASCII Backspace (BS) \\
+\verb/\E/ & ASCII Escape (ESC) \\
+\verb/\f/ & ASCII Formfeed (FF) \\
+\verb/\n/ & ASCII Linefeed (LF) \\
+\verb/\r/ & ASCII Carriage Return (CR) \\
+\verb/\t/ & ASCII Horizontal Tab (TAB) \\
+\verb/\v/ & ASCII Vertical Tab (VT) \\
+\verb/\/{\em ooo} & ASCII character with octal value {\em ooo} \\
+\verb/\x/{\em xx...} & ASCII character with hex value {\em xx...} \\
+\hline
+\end{tabular}
+\end{center}
+
+For compatibility with Standard C, up to three octal digits are
+accepted, but an unlimited number of hex digits is taken to be part of
+the hex escape (and then the lower 8 bits of the resulting hex number
+are used...).
+
+All unrecognized escape sequences are left in the string {\em
+unchanged}, i.e., the backslash is left in the string. (This rule is
+useful when debugging: if an escape sequence is mistyped, the
+resulting output is more easily recognized as broken. It also helps
+somewhat for string literals used as regular expressions or otherwise
+passed to other modules that do their own escape handling.)
\subsection{Numeric literals}
@@ -224,24 +233,24 @@
Integers and long integers are described by the following regular expressions:
\begin{verbatim}
-longinteger: integer ('l'|'L')
+longinteger: integer ("l"|"L")
integer: decimalinteger | octinteger | hexinteger
-decimalinteger: nonzerodigit digit* | '0'
-octinteger: '0' octdigit+
-hexinteger: '0' ('x'|'X') hexdigit+
+decimalinteger: nonzerodigit digit* | "0"
+octinteger: "0" octdigit+
+hexinteger: "0" ("x"|"X") hexdigit+
-nonzerodigit: '1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'
-octdigit: '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'
-hexdigit: digit|'a'|'b'|'c'|'d'|'e'|'f'|'A'|'B'|'C'|'D'|'E'|'F'
+nonzerodigit: "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"
+octdigit: "0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"
+hexdigit: digit|"a"|"b"|"c"|"d"|"e"|"f"|"A"|"B"|"C"|"D"|"E"|"F"
\end{verbatim}
Floating point numbers are described by the following regular expressions:
\begin{verbatim}
-floatnumber: [intpart] fraction [exponent] | intpart ['.'] exponent
+floatnumber: [intpart] fraction [exponent] | intpart ["."] exponent
intpart: digit+
-fraction: '.' digit+
-exponent: ('e'|'E') ['+'|'-'] digit+
+fraction: "." digit+
+exponent: ("e"|"E") ["+"|"-"] digit+
\end{verbatim}
\section{Operators}
@@ -292,15 +301,15 @@
may be used where an expression is required by enclosing it in
parentheses. The only place where an unparenthesized condition
is not allowed is on the right-hand side of the assignment operator,
-because this operator is the same token (\verb/'='/) as used for
+because this operator is the same token (\verb\=\) as used for
compasisons.
The comma plays a somewhat special role in Python's syntax.
It is an operator with a lower precedence than all others, but
occasionally serves other purposes as well (e.g., it has special
semantics in print statements). When a comma is accepted by the
-syntax, one of the syntactic categories \verb/expression_list/
-or \verb/condition_list/ is always used.
+syntax, one of the syntactic categories \verb\expression_list\
+or \verb\condition_list\ is always used.
When (one alternative of) a syntax rule has the form
@@ -308,8 +317,8 @@
name: othername
\end{verbatim}
-and no semantics are given, the semantics of this form of \verb/name/
-are the same as for \verb/othername/.
+and no semantics are given, the semantics of this form of \verb\name\
+are the same as for \verb\othername\.
\section{Arithmetic conversions}
@@ -414,11 +423,11 @@
A string conversion evaluates the contained condition list and converts the
resulting object into a string according to rules specific to its type.
-If the object is a string, a number, \verb/None/, or a tuple, list or
+If the object is a string, a number, \verb\None\, or a tuple, list or
dictionary containing only objects whose type is in this list,
the resulting
string is a valid Python expression which can be passed to the
-built-in function \verb/eval()/ to yield an expression with the
+built-in function \verb\eval()\ to yield an expression with the
same value (or an approximation, if floating point numbers are
involved).
@@ -459,11 +468,11 @@
factor: primary | '-' factor | '+' factor | '~' factor
\end{verbatim}
-The unary \verb/'-'/ operator yields the negative of its numeric argument.
+The unary \verb\-\ operator yields the negative of its numeric argument.
-The unary \verb/'+'/ operator yields its numeric argument unchanged.
+The unary \verb\+\ operator yields its numeric argument unchanged.
-The unary \verb/'~'/ operator yields the bit-wise negation of its
+The unary \verb\~\ operator yields the bit-wise negation of its
integral numerical argument.
In all three cases, if the argument does not have the proper type,
@@ -477,7 +486,7 @@
term: factor | term '*' factor | term '/' factor | term '%' factor
\end{verbatim}
-The \verb/'*'/ operator yields the product of its arguments.
+The \verb\*\ operator yields the product of its arguments.
The arguments must either both be numbers, or one argument must be
a (short) integer and the other must be a string.
In the former case, the numbers are converted to a common type
@@ -572,7 +581,7 @@
a trailing comma doesn't create a tuple, but rather yields the
value of that expression).
-To create an empty tuple, use an empty pair of parentheses: \verb/()/.
+To create an empty tuple, use an empty pair of parentheses: \verb\()\.
\section{Comparisons}
@@ -597,8 +606,8 @@
between $e_0$ and $e_2$, e.g., $x < y > z$ is perfectly legal.
For the benefit of C programmers,
-the comparison operators \verb/=/ and \verb/==/ are equivalent,
-and so are \verb/<>/ and \verb/!=/.
+the comparison operators \verb\=\ and \verb\==\ are equivalent,
+and so are \verb\<>\ and \verb\!=\.
Use of the C variants is discouraged.
The operators {\tt '<', '>', '=', '>=', '<='}, and {\tt '<>'} compare
@@ -610,7 +619,7 @@
(This unusual
definition of comparison is done to simplify the definition of
-operations like sorting and the \verb/in/ and \verb/not in/ operators.)
+operations like sorting and the \verb\in\ and \verb\not in\ operators.)
Comparison of objects of the same type depends on the type:
@@ -869,12 +878,12 @@
unless the output system believes it is positioned at the beginning
of a line. This is the case: (1) when no characters have been written
to standard output; or (2) when the last character written to
-standard output is \verb/'\n'/;
+standard output is \verb/\n/;
or (3) when the last I/O operation
on standard output was not a \verb\print\ statement.
Finally,
-a \verb/'\n'/ character is written at the end,
+a \verb/\n/ character is written at the end,
unless the \verb\print\ statement ends with a comma.
This is the only action if the statement contains just the keyword
\verb\print\.