Merged revisions 53538-53622 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk
........
r53545 | andrew.kuchling | 2007-01-24 21:06:41 +0100 (Wed, 24 Jan 2007) | 1 line
Strengthen warning about using lock()
........
r53556 | thomas.heller | 2007-01-25 19:34:14 +0100 (Thu, 25 Jan 2007) | 3 lines
Fix for #1643874: When calling SysAllocString, create a PyCObject
which will eventually call SysFreeString to free the BSTR resource.
........
r53563 | andrew.kuchling | 2007-01-25 21:02:13 +0100 (Thu, 25 Jan 2007) | 1 line
Add item
........
r53564 | brett.cannon | 2007-01-25 21:22:02 +0100 (Thu, 25 Jan 2007) | 8 lines
Fix time.strptime's %U support. Basically rewrote the algorithm to be more
generic so that one only has to shift certain values based on whether the week
was specified to start on Monday or Sunday. Cut out a lot of edge case code
compared to the previous version. Also broke algorithm out into its own
function (that is private to the module).
Fixes bug #1643943 (thanks Biran Nahas for the report).
........
r53570 | brett.cannon | 2007-01-26 00:30:39 +0100 (Fri, 26 Jan 2007) | 4 lines
Remove specific mention of my name and email address from modules. Not really
needed and all bug reports should go to the bug tracker, not directly to me.
Plus I am not the only person to have edited these files at this point.
........
r53573 | fred.drake | 2007-01-26 17:28:44 +0100 (Fri, 26 Jan 2007) | 1 line
fix typo (extraneous ")")
........
r53575 | georg.brandl | 2007-01-27 18:43:02 +0100 (Sat, 27 Jan 2007) | 4 lines
Patch #1638243: the compiler package is now able to correctly compile
a with statement; previously, executing code containing a with statement
compiled by the compiler package crashed the interpreter.
........
r53578 | georg.brandl | 2007-01-27 18:59:42 +0100 (Sat, 27 Jan 2007) | 3 lines
Patch #1634778: add missing encoding aliases for iso8859_15 and
iso8859_16.
........
r53579 | georg.brandl | 2007-01-27 20:38:50 +0100 (Sat, 27 Jan 2007) | 2 lines
Bug #1645944: os.access now returns bool but docstring is not updated
........
r53590 | brett.cannon | 2007-01-28 21:58:00 +0100 (Sun, 28 Jan 2007) | 2 lines
Use the thread lock's context manager instead of a try/finally statement.
........
r53591 | brett.cannon | 2007-01-29 05:41:44 +0100 (Mon, 29 Jan 2007) | 2 lines
Add a test for slicing an exception.
........
r53594 | andrew.kuchling | 2007-01-29 21:21:43 +0100 (Mon, 29 Jan 2007) | 1 line
Minor edits to the curses HOWTO
........
r53596 | andrew.kuchling | 2007-01-29 21:55:40 +0100 (Mon, 29 Jan 2007) | 1 line
Various minor edits
........
r53597 | andrew.kuchling | 2007-01-29 22:28:48 +0100 (Mon, 29 Jan 2007) | 1 line
More edits
........
r53601 | tim.peters | 2007-01-30 04:03:46 +0100 (Tue, 30 Jan 2007) | 2 lines
Whitespace normalization.
........
r53603 | georg.brandl | 2007-01-30 21:21:30 +0100 (Tue, 30 Jan 2007) | 2 lines
Bug #1648191: typo in docs.
........
r53605 | brett.cannon | 2007-01-30 22:34:36 +0100 (Tue, 30 Jan 2007) | 8 lines
No more raising of string exceptions!
The next step of PEP 352 (for 2.6) causes raising a string exception to trigger
a TypeError. Trying to catch a string exception raises a DeprecationWarning.
References to string exceptions have been removed from the docs since they are
now just an error.
........
r53618 | raymond.hettinger | 2007-02-01 22:02:59 +0100 (Thu, 01 Feb 2007) | 1 line
Bug #1648179: set.update() not recognizing __iter__ overrides in dict subclasses.
........
diff --git a/Doc/howto/regex.tex b/Doc/howto/regex.tex
index 3c63b3a..62b6daf 100644
--- a/Doc/howto/regex.tex
+++ b/Doc/howto/regex.tex
@@ -34,17 +34,18 @@
The \module{re} module was added in Python 1.5, and provides
Perl-style regular expression patterns. Earlier versions of Python
came with the \module{regex} module, which provided Emacs-style
-patterns. \module{regex} module was removed in Python 2.5.
+patterns. The \module{regex} module was removed completely in Python 2.5.
-Regular expressions (or REs) are essentially a tiny, highly
-specialized programming language embedded inside Python and made
-available through the \module{re} module. Using this little language,
-you specify the rules for the set of possible strings that you want to
-match; this set might contain English sentences, or e-mail addresses,
-or TeX commands, or anything you like. You can then ask questions
-such as ``Does this string match the pattern?'', or ``Is there a match
-for the pattern anywhere in this string?''. You can also use REs to
-modify a string or to split it apart in various ways.
+Regular expressions (called REs, or regexes, or regex patterns) are
+essentially a tiny, highly specialized programming language embedded
+inside Python and made available through the \module{re} module.
+Using this little language, you specify the rules for the set of
+possible strings that you want to match; this set might contain
+English sentences, or e-mail addresses, or TeX commands, or anything
+you like. You can then ask questions such as ``Does this string match
+the pattern?'', or ``Is there a match for the pattern anywhere in this
+string?''. You can also use REs to modify a string or to split it
+apart in various ways.
Regular expression patterns are compiled into a series of bytecodes
which are then executed by a matching engine written in C. For
@@ -80,11 +81,12 @@
would let this RE match \samp{Test} or \samp{TEST} as well; more
about this later.)
-There are exceptions to this rule; some characters are
-special, and don't match themselves. Instead, they signal that some
-out-of-the-ordinary thing should be matched, or they affect other
-portions of the RE by repeating them. Much of this document is
-devoted to discussing various metacharacters and what they do.
+There are exceptions to this rule; some characters are special
+\dfn{metacharacters}, and don't match themselves. Instead, they
+signal that some out-of-the-ordinary thing should be matched, or they
+affect other portions of the RE by repeating them or changing their
+meaning. Much of this document is devoted to discussing various
+metacharacters and what they do.
Here's a complete list of the metacharacters; their meanings will be
discussed in the rest of this HOWTO.
@@ -111,9 +113,10 @@
usually a metacharacter, but inside a character class it's stripped of
its special nature.
-You can match the characters not within a range by \dfn{complementing}
-the set. This is indicated by including a \character{\^} as the first
-character of the class; \character{\^} elsewhere will simply match the
+You can match the characters not listed within the class by
+\dfn{complementing} the set. This is indicated by including a
+\character{\^} as the first character of the class; \character{\^}
+anywhere else in the class will simply match the
\character{\^} character. For example, \verb|[^5]| will match any
character except \character{5}.
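As a small illustration of complementing a class, the following sketch runs two patterns over made-up sample strings. The variable names and inputs are assumptions chosen for the example, not part of the HOWTO.

```python
import re

# [^5] is a complemented class: it matches any single character except '5'.
not_five = re.findall(r'[^5]', '1535')

# A '^' that is not the first character of the class is an ordinary member,
# so this class matches either '5' or a literal '^'.
five_or_caret = re.findall(r'[5^]', '1^5')

print(not_five)
print(five_or_caret)
```

Only a leading caret complements the set; placed anywhere else in the class, it is just another character to match.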
@@ -176,7 +179,7 @@
For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a}
characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a}
characters), and so forth. The RE engine has various internal
-limitations stemming from the size of C's \code{int} type, that will
+limitations stemming from the size of C's \code{int} type that will
prevent it from matching over 2 billion \samp{a} characters; you
probably don't have enough memory to construct a string that large, so
you shouldn't run into that limit.
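The repetition described above can be checked with a short sketch. The list of sample strings is an assumption made for illustration, and \code{fullmatch()} is used here (available in current Python 3 releases) so that the whole string must match.

```python
import re

pattern = re.compile(r'ca*t')

# 'a*' allows zero or more 'a' characters between 'c' and 't'.
samples = ['ct', 'cat', 'caaat', 'cbt']
results = {s: pattern.fullmatch(s) is not None for s in samples}

for sample, ok in results.items():
    print(sample, ok)
```

The last sample fails because \samp{b} is not an \samp{a}, so the repeated portion cannot absorb it.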
@@ -238,9 +241,9 @@
You can omit either \var{m} or \var{n}; in that case, a reasonable
value is assumed for the missing value. Omitting \var{m} is
-interpreted as a lower limit of 0, while omitting \var{n} results in an
-upper bound of infinity --- actually, the 2 billion limit mentioned
-earlier, but that might as well be infinity.
+interpreted as a lower limit of 0, while omitting \var{n} results in
+an upper bound of infinity --- actually, the upper bound is the
+2-billion limit mentioned earlier, but that might as well be infinity.
Readers of a reductionist bent may notice that the three other qualifiers
can all be expressed using this notation. \regexp{\{0,\}} is the same
@@ -285,7 +288,7 @@
no need to bloat the language specification by including them.)
Instead, the \module{re} module is simply a C extension module
included with Python, just like the \module{socket} or \module{zlib}
-module.
+modules.
Putting REs in strings keeps the Python language simpler, but has one
disadvantage which is the topic of the next section.
@@ -326,7 +329,7 @@
a string literal prefixed with \character{r}, so \code{r"\e n"} is a
two-character string containing \character{\e} and \character{n},
while \code{"\e n"} is a one-character string containing a newline.
-Frequently regular expressions will be expressed in Python
+Regular expressions will often be written in Python
code using this raw string notation.
\begin{tableii}{c|c}{code}{Regular String}{Raw string}
@@ -368,9 +371,9 @@
\file{redemo.py} can be quite useful when trying to debug a
complicated RE. Phil Schwartz's
\ulink{Kodos}{http://www.phil-schwartz.com/kodos.spy} is also an interactive
-tool for developing and testing RE patterns. This HOWTO will use the
-standard Python interpreter for its examples.
+tool for developing and testing RE patterns.
+This HOWTO uses the standard Python interpreter for its examples.
First, run the Python interpreter, import the \module{re} module, and
compile a RE:
@@ -401,7 +404,7 @@
later use.
\begin{verbatim}
->>> m = p.match( 'tempo')
+>>> m = p.match('tempo')
>>> print m
<_sre.SRE_Match object at 80c4f68>
\end{verbatim}
@@ -472,9 +475,9 @@
\end{verbatim}
\method{findall()} has to create the entire list before it can be
-returned as the result. In Python 2.2, the \method{finditer()} method
-is also available, returning a sequence of \class{MatchObject} instances
-as an iterator.
+returned as the result. The \method{finditer()} method returns a
+sequence of \class{MatchObject} instances as an
+iterator.\footnote{Introduced in Python 2.2.2.}
\begin{verbatim}
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
@@ -491,13 +494,13 @@
\subsection{Module-Level Functions}
-You don't have to produce a \class{RegexObject} and call its methods;
+You don't have to create a \class{RegexObject} and call its methods;
the \module{re} module also provides top-level functions called
-\function{match()}, \function{search()}, \function{sub()}, and so
-forth. These functions take the same arguments as the corresponding
-\class{RegexObject} method, with the RE string added as the first
-argument, and still return either \code{None} or a \class{MatchObject}
-instance.
+\function{match()}, \function{search()}, \function{findall()},
+\function{sub()}, and so forth. These functions take the same
+arguments as the corresponding \class{RegexObject} method, with the RE
+string added as the first argument, and still return either
+\code{None} or a \class{MatchObject} instance.
\begin{verbatim}
>>> print re.match(r'From\s+', 'Fromage amk')
@@ -514,7 +517,7 @@
Should you use these module-level functions, or should you get the
\class{RegexObject} and call its methods yourself? That choice
depends on how frequently the RE will be used, and on your personal
-coding style. If a RE is being used at only one point in the code,
+coding style. If the RE is being used at only one point in the code,
then the module functions are probably more convenient. If a program
contains a lot of regular expressions, or re-uses the same ones in
several locations, then it might be worthwhile to collect all the
@@ -537,7 +540,7 @@
Compilation flags let you modify some aspects of how regular
expressions work. Flags are available in the \module{re} module under
-two names, a long name such as \constant{IGNORECASE}, and a short,
+two names, a long name such as \constant{IGNORECASE} and a short,
one-letter form such as \constant{I}. (If you're familiar with Perl's
pattern modifiers, the one-letter forms use the same letters; the
short form of \constant{re.VERBOSE} is \constant{re.X}, for example.)
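A minimal sketch of the two spellings, assuming a current Python 3 interpreter; the pattern and the candidate strings are invented for the example.

```python
import re

# The long and short names refer to the same flag values.
same_flags = (re.I == re.IGNORECASE) and (re.X == re.VERBOSE)

# Compile with the short form; the long form would behave identically.
p = re.compile(r'test', re.I)
matched = [s for s in ('test', 'Test', 'TEST', 'toast') if p.match(s)]

print(same_flags)
print(matched)
```

Which spelling to use is a matter of taste; the long names tend to be easier to read in code that other people will maintain.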
@@ -617,7 +620,7 @@
format them. When this flag has been specified, whitespace within the
RE string is ignored, except when the whitespace is in a character
class or preceded by an unescaped backslash; this lets you organize
-and indent the RE more clearly. It also enables you to put comments
+and indent the RE more clearly. This flag also lets you put comments
within a RE that will be ignored by the engine; comments are marked by
a \character{\#} that's neither in a character class nor preceded by an
unescaped backslash.
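The whitespace and comment rules can be sketched with a deliberately small pattern. The pattern, its name \code{distance}, and the test strings are all hypothetical; they are here only to show which whitespace survives under \constant{re.VERBOSE}.

```python
import re

distance = re.compile(r"""
    \d+              # Integer part
    (?: \. \d+ )?    # Optional fractional part; the spaces here are ignored
    [ ]              # A space inside a character class is kept
    km               # Unit
""", re.VERBOSE)

# The same pattern written without the verbose flag.
compact = re.compile(r"\d+(?:\.\d+)?[ ]km")

with_space = distance.fullmatch('12.5 km') is not None
without_space = distance.fullmatch('12.5km') is not None
agree = all(
    (distance.fullmatch(s) is None) == (compact.fullmatch(s) is None)
    for s in ('12.5 km', '12.5km', '7 km', 'km')
)
print(with_space, without_space, agree)
```

Putting the significant space in a character class is one way to keep it; writing it as a backslash followed by a space would work as well.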
@@ -629,18 +632,19 @@
charref = re.compile(r"""
&[#] # Start of a numeric entity reference
(
- [0-9]+[^0-9] # Decimal form
- | 0[0-7]+[^0-7] # Octal form
- | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
+ 0[0-7]+ # Octal form
+ | [0-9]+ # Decimal form
+ | x[0-9a-fA-F]+ # Hexadecimal form
)
+ ; # Trailing semicolon
""", re.VERBOSE)
\end{verbatim}
Without the verbose setting, the RE would look like this:
\begin{verbatim}
-charref = re.compile("&#([0-9]+[^0-9]"
- "|0[0-7]+[^0-7]"
- "|x[0-9a-fA-F]+[^0-9a-fA-F])")
+charref = re.compile("&#(0[0-7]+"
+ "|[0-9]+"
+ "|x[0-9a-fA-F]+);")
\end{verbatim}
In the above example, Python's automatic concatenation of string
@@ -722,12 +726,12 @@
\item[\regexp{\e A}] Matches only at the start of the string. When
not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are
-effectively the same. In \constant{MULTILINE} mode, however, they're
-different; \regexp{\e A} still matches only at the beginning of the
+effectively the same. In \constant{MULTILINE} mode, they're
+different: \regexp{\e A} still matches only at the beginning of the
string, but \regexp{\^} may match at any location inside the string
that follows a newline character.
-\item[\regexp{\e Z}]Matches only at the end of the string.
+\item[\regexp{\e Z}] Matches only at the end of the string.
\item[\regexp{\e b}] Word boundary.
This is a zero-width assertion that matches only at the
@@ -782,14 +786,23 @@
strings by writing a RE divided into several subgroups which
match different components of interest. For example, an RFC-822
header line is divided into a header name and a value, separated by a
-\character{:}. This can be handled by writing a regular expression
+\character{:}, like this:
+
+\begin{verbatim}
+From: author@example.com
+User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
+MIME-Version: 1.0
+To: editor@example.com
+\end{verbatim}
+
+This can be handled by writing a regular expression
which matches an entire header line, and has one group which matches the
header name, and another group which matches the header's value.
Groups are marked by the \character{(}, \character{)} metacharacters.
\character{(} and \character{)} have much the same meaning as they do
in mathematical expressions; they group together the expressions
-contained inside them. For example, you can repeat the contents of a
+contained inside them, and you can repeat the contents of a
group with a repeating qualifier, such as \regexp{*}, \regexp{+},
\regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example,
\regexp{(ab)*} will match zero or more repetitions of \samp{ab}.
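Both ideas above, repeating a group and splitting a header line into two groups, can be tried out in a short sketch. The \code{header} pattern is an assumption written for this example rather than the expression the HOWTO goes on to develop, and the sample lines are taken from the header listing shown earlier.

```python
import re

# Repeating a group: (ab)* matches zero or more repetitions of 'ab'.
repeated = re.compile(r'(ab)*')
spans = {s: repeated.match(s).span() for s in ('', 'ab', 'ababab', 'abba')}

# Hypothetical header pattern: group 1 is the name, group 2 is the value.
header = re.compile(r'([^:\s]+):\s*(.*)')
lines = [
    'From: author@example.com',
    'User-Agent: Thunderbird 1.5.0.9 (X11/20061227)',
]
parsed = [header.match(line).groups() for line in lines]

print(spans)
print(parsed)
```

Note that \regexp{(ab)*} always matches, because zero repetitions are allowed; for \samp{abba} it stops after the first \samp{ab}.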
@@ -881,12 +894,13 @@
syntax for regular expression extensions, so we'll look at that first.
Perl 5 added several additional features to standard regular
-expressions, and the Python \module{re} module supports most of them.
-It would have been difficult to choose new single-keystroke
-metacharacters or new special sequences beginning with \samp{\e} to
-represent the new features without making Perl's regular expressions
-confusingly different from standard REs. If you chose \samp{\&} as a
-new metacharacter, for example, old expressions would be assuming that
+expressions, and the Python \module{re} module supports most of them.
+It would have been difficult to choose new
+single-keystroke metacharacters or new special sequences beginning
+with \samp{\e} to represent the new features without making Perl's
+regular expressions confusingly different from standard REs. If you
+chose \samp{\&} as a new metacharacter, for example, old expressions
+would be assuming that
\samp{\&} was a regular character and wouldn't have escaped it by
writing \regexp{\e \&} or \regexp{[\&]}.
@@ -913,15 +927,15 @@
to the features that simplify working with groups in complex REs.
Since groups are numbered from left to right and a complex expression
may use many groups, it can become difficult to keep track of the
-correct numbering, and modifying such a complex RE is annoying.
-Insert a new group near the beginning, and you change the numbers of
+correct numbering. Modifying such a complex RE is annoying, too:
+insert a new group near the beginning and you change the numbers of
everything that follows it.
-First, sometimes you'll want to use a group to collect a part of a
-regular expression, but aren't interested in retrieving the group's
-contents. You can make this fact explicit by using a non-capturing
-group: \regexp{(?:...)}, where you can put any other regular
-expression inside the parentheses.
+Sometimes you'll want to use a group to collect a part of a regular
+expression, but aren't interested in retrieving the group's contents.
+You can make this fact explicit by using a non-capturing group:
+\regexp{(?:...)}, where you can replace the \regexp{...}
+with any other regular expression.
\begin{verbatim}
>>> m = re.match("([abc])+", "abc")
@@ -937,23 +951,23 @@
capturing group; you can put anything inside it, repeat it with a
repetition metacharacter such as \samp{*}, and nest it within other
groups (capturing or non-capturing). \regexp{(?:...)} is particularly
-useful when modifying an existing group, since you can add new groups
+useful when modifying an existing pattern, since you can add new groups
without changing how all the other groups are numbered. It should be
mentioned that there's no performance difference in searching between
capturing and non-capturing groups; neither form is any faster than
the other.
-The second, and more significant, feature is named groups; instead of
+A more significant feature is named groups: instead of
referring to them by numbers, groups can be referenced by a name.
The syntax for a named group is one of the Python-specific extensions:
\regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of
-the group. Except for associating a name with a group, named groups
-also behave identically to capturing groups. The \class{MatchObject}
-methods that deal with capturing groups all accept either integers, to
-refer to groups by number, or a string containing the group name.
-Named groups are still given numbers, so you can retrieve information
-about a group in two ways:
+the group. Named groups also behave exactly like capturing groups,
+and additionally associate a name with a group. The
+\class{MatchObject} methods that deal with capturing groups all accept
+either integers that refer to the group by number or strings that
+contain the desired group's name. Named groups are still given
+numbers, so you can retrieve information about a group in two ways:
\begin{verbatim}
>>> p = re.compile(r'(?P<word>\b\w+\b)')
@@ -980,11 +994,11 @@
It's obviously much easier to retrieve \code{m.group('zonem')},
instead of having to remember to retrieve group 9.
-Since the syntax for backreferences, in an expression like
-\regexp{(...)\e 1}, refers to the number of the group there's
+The syntax for backreferences in an expression such as
+\regexp{(...)\e 1} refers to the number of the group. There's
naturally a variant that uses the group name instead of the number.
-This is also a Python extension: \regexp{(?P=\var{name})} indicates
-that the contents of the group called \var{name} should again be found
+This is another Python extension: \regexp{(?P=\var{name})} indicates
+that the contents of the group called \var{name} should again be matched
at the current point. The regular expression for finding doubled
words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as
\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
@@ -1014,11 +1028,11 @@
\emph{doesn't} match at the current position in the string.
\end{itemize}
-An example will help make this concrete by demonstrating a case
-where a lookahead is useful. Consider a simple pattern to match a
-filename and split it apart into a base name and an extension,
-separated by a \samp{.}. For example, in \samp{news.rc}, \samp{news}
-is the base name, and \samp{rc} is the filename's extension.
+To make this concrete, let's look at a case where a lookahead is
+useful. Consider a simple pattern to match a filename and split it
+apart into a base name and an extension, separated by a \samp{.}. For
+example, in \samp{news.rc}, \samp{news} is the base name, and
+\samp{rc} is the filename's extension.
The pattern to match this is quite simple:
@@ -1065,12 +1079,12 @@
exclude both \samp{bat} and \samp{exe} as extensions, the pattern
would get even more complicated and confusing.
-A negative lookahead cuts through all this:
+A negative lookahead cuts through all this confusion:
\regexp{.*[.](?!bat\$).*\$}
% $
-The lookahead means: if the expression \regexp{bat} doesn't match at
+The negative lookahead means: if the expression \regexp{bat} doesn't match at
this point, try the rest of the pattern; if \regexp{bat\$} does match,
the whole pattern will fail. The trailing \regexp{\$} is required to
ensure that something like \samp{sample.batch}, where the extension
@@ -1087,7 +1101,7 @@
\section{Modifying Strings}
Up to this point, we've simply performed searches against a static
-string. Regular expressions are also commonly used to modify a string
+string. Regular expressions are also commonly used to modify strings
in various ways, using the following \class{RegexObject} methods:
\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}