\section{\module{re} ---
         Regular expression operations}
\declaremodule{standard}{re}
\moduleauthor{Andrew M. Kuchling}{amk1@bigfoot.com}
\moduleauthor{Fredrik Lundh}{effbot@telia.com}
\sectionauthor{Andrew M. Kuchling}{amk1@bigfoot.com}


\modulesynopsis{Regular expression search and match operations with a
                Perl-style expression syntax.}


This module provides regular expression matching operations similar to
those found in Perl. Regular expression pattern strings may not
contain null bytes, but can specify the null byte using the
\code{\e\var{number}} notation. Both patterns and strings to be
searched can be Unicode strings as well as 8-bit strings. The
\module{re} module is always available.

Regular expressions use the backslash character (\character{\e}) to
indicate special forms or to allow special characters to be used
without invoking their special meaning. This collides with Python's
usage of the same character for the same purpose in string literals;
for example, to match a literal backslash, one might have to write
\code{'\e\e\e\e'} as the pattern string, because the regular expression
must be \samp{\e\e}, and each backslash must be expressed as
\samp{\e\e} inside a regular Python string literal.

The solution is to use Python's raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with \character{r}. So \code{r"\e n"} is a
two-character string containing \character{\e} and \character{n},
while \code{"\e n"} is a one-character string containing a newline.
Usually patterns will be expressed in Python code using this raw
string notation.
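
The difference is easy to check interactively. The following is only
a minimal sketch, assuming a current Python interpreter; the names
\code{plain}, \code{raw}, \code{m}, and \code{position} are chosen
here for illustration:

\begin{verbatim}
import re

plain = "\n"     # one character: a newline
raw = r"\n"      # two characters: a backslash and the letter n

# Both spellings below produce the same two-character pattern, which
# is the regular expression for one literal backslash.
assert "\\\\" == r"\\"

# The string being searched is: a, backslash, b.
m = re.search(r"\\", "a\\b")
position = m.start()
\end{verbatim}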

\strong{Implementation note:}
The \module{re}\refstmodindex{pre} module has two distinct
implementations: \module{sre} is the default implementation and
includes Unicode support, but may run into stack limitations for some
patterns. Though this will be fixed for a future release of Python,
the older implementation (without Unicode support) is still available
as the \module{pre}\refstmodindex{pre} module.


\begin{seealso}
  \seetitle{Mastering Regular Expressions}{Book on regular expressions
            by Jeffrey Friedl, published by O'Reilly. The Python
            material in this book dates from before the \refmodule{re}
            module, but it covers writing good regular expression
            patterns in great detail.}
\end{seealso}


\subsection{Regular Expression Syntax \label{re-syntax}}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression. If a string \emph{p}
matches A and another string \emph{q} matches B, the string \emph{pq}
will match AB. Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here. For
details of the theory and implementation of regular expressions,
consult the Friedl book referenced above, or almost any textbook about
compiler construction.

A brief explanation of the format of regular expressions follows. For
further information and a gentler presentation, consult the Regular
Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like \character{A}, \character{a}, or \character{0},
are the simplest regular expressions; they simply match themselves.
You can concatenate ordinary characters, so \regexp{last} matches the
string \code{'last'}. (In the rest of this section, we'll write REs in
\regexp{this special style}, usually without quotes, and strings to be
matched \code{'in single quotes'}.)

Some characters, like \character{|} or \character{(}, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

The special characters are:

\begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}

\item[\character{.}] (Dot.) In the default mode, this matches any
character except a newline. If the \constant{DOTALL} flag has been
specified, this matches any character including a newline.

\item[\character{\^}] (Caret.) Matches the start of the string, and in
\constant{MULTILINE} mode also matches immediately after each newline.

\item[\character{\$}] Matches the end of the string, and in
\constant{MULTILINE} mode also matches before a newline.
\regexp{foo} matches both 'foo' and 'foobar', while the regular
expression \regexp{foo\$} matches only 'foo'.

\item[\character{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE, as many repetitions
as are possible. \regexp{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.

\item[\character{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.

\item[\character{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE. \regexp{ab?} will
match either 'a' or 'ab'.
\item[\code{*?}, \code{+?}, \code{??}] The \character{*}, \character{+}, and
\character{?} qualifiers are all \dfn{greedy}; they match as much text as
possible. Sometimes this behaviour isn't desired; if the RE
\regexp{<.*>} is matched against \code{'<H1>title</H1>'}, it will match the
entire string, and not just \code{'<H1>'}.
Adding \character{?} after the qualifier makes it perform the match in
\dfn{non-greedy} or \dfn{minimal} fashion; as \emph{few} characters as
possible will be matched. Using \regexp{.*?} in the previous
expression will match only \code{'<H1>'}.

\item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
\var{m} to \var{n} repetitions of the preceding RE, attempting to
match as many repetitions as possible. For example, \regexp{a\{3,5\}}
will match from 3 to 5 \character{a} characters. Omitting \var{n}
specifies an infinite upper bound; you can't omit \var{m}.

\item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
match from \var{m} to \var{n} repetitions of the preceding RE,
attempting to match as \emph{few} repetitions as possible. This is
the non-greedy version of the previous qualifier. For example, on the
6-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5
\character{a} characters, while \regexp{a\{3,5\}?} will only match 3
characters.

\item[\character{\e}] Either escapes special characters (permitting
you to match characters like \character{*}, \character{?}, and so
forth), or signals a special sequence; special sequences are discussed
below.

If you're not using a raw string to
express the pattern, remember that Python also uses the
backslash as an escape sequence in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string. However,
if Python would recognize the resulting sequence, the backslash should
be repeated twice. This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the
simplest expressions.

\item[\code{[]}] Used to indicate a set of characters. Characters can
be listed individually, or a range of characters can be indicated by
giving two characters and separating them by a \character{-}. Special
characters are not active inside sets. For example, \regexp{[akm\$]}
will match any of the characters \character{a}, \character{k},
\character{m}, or \character{\$}; \regexp{[a-z]}
will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any
letter or digit. Character classes such as \code{\e w} or \code{\e S}
(defined below) are also acceptable inside a set. If you want to
include a \character{]} or a \character{-} inside a set, precede it with a
backslash, or place it as the first character. The
pattern \regexp{[]]} will match \code{']'}, for example.

You can match the characters not within a set by \dfn{complementing}
the set. This is indicated by including a
\character{\^} as the first character of the set; \character{\^} elsewhere will
simply match the \character{\^} character. For example, \regexp{[{\^}5]}
will match any character except \character{5}.

\item[\character{|}]\code{A|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B. An
arbitrary number of REs can be separated by the \character{|} in this
way. This can be used inside groups (see below) as well. REs
separated by \character{|} are tried from left to right, and the first
one that allows the complete pattern to match is considered the
accepted branch. This means that if \code{A} matches, \code{B} will
never be tested, even if it would produce a longer overall match. In
other words, the \character{|} operator is never greedy. To match a
literal \character{|}, use \regexp{\e|}, or enclose it inside a
character class, as in \regexp{[|]}.

\item[\code{(...)}] Matches whatever regular expression is inside the
parentheses, and indicates the start and end of a group; the contents
of a group can be retrieved after a match has been performed, and can
be matched later in the string with the \regexp{\e \var{number}} special
sequence, described below. To match the literals \character{(} or
\character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them
inside a character class: \regexp{[(] [)]}.

\item[\code{(?...)}] This is an extension notation (a \character{?}
following a \character{(} is not meaningful otherwise). The first
character after the \character{?}
determines what the meaning and further syntax of the construct is.
Extensions usually do not create a new group;
\regexp{(?P<\var{name}>...)} is the only exception to this rule.
Following are the currently supported extensions.

\item[\code{(?iLmsux)}] (One or more letters from the set \character{i},
\character{L}, \character{m}, \character{s}, \character{u},
\character{x}.) The group matches the empty string; the letters set
the corresponding flags (\constant{re.I}, \constant{re.L},
\constant{re.M}, \constant{re.S}, \constant{re.U}, \constant{re.X})
for the entire regular expression. This is useful if you wish to
include the flags as part of the regular expression, instead of
passing a \var{flag} argument to the \function{compile()} function.

Note that the \regexp{(?x)} flag changes how the expression is parsed.
It should be used first in the expression string, or after one or more
whitespace characters. If there are non-whitespace characters before
the flag, the results are undefined.

\item[\code{(?:...)}] A non-grouping version of regular parentheses.
Matches whatever regular expression is inside the parentheses, but the
substring matched by the
group \emph{cannot} be retrieved after performing a match or
referenced later in the pattern.

\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
the substring matched by the group is accessible via the symbolic group
name \var{name}. Group names must be valid Python identifiers. A
symbolic group is also a numbered group, just as if the group were not
named. So the group named 'id' in the example below can also be
referenced as the numbered group 1.

For example, if the pattern is
\regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
name in arguments to methods of match objects, such as \code{m.group('id')}
or \code{m.end('id')}, and also by name in pattern text
(e.g. \regexp{(?P=id)}) and replacement text (e.g. \code{\e g<id>}).

\item[\code{(?P=\var{name})}] Matches whatever text was matched by the
earlier group named \var{name}.

\item[\code{(?\#...)}] A comment; the contents of the parentheses are
simply ignored.

\item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
consume any of the string. This is called a lookahead assertion. For
example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
followed by \code{'Asimov'}.

\item[\code{(?!...)}] Matches if \regexp{...} doesn't match next. This
is a negative lookahead assertion. For example,
\regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not}
followed by \code{'Asimov'}.

\item[\code{(?<=...)}] Matches if the current position in the string
is preceded by a match for \regexp{...} that ends at the current
position. This is called a positive lookbehind assertion.
\regexp{(?<=abc)def} will match \samp{abcdef}, since the lookbehind
will back up 3 characters and check if the contained pattern matches.
The contained pattern must only match strings of some fixed length,
meaning that \regexp{abc} or \regexp{a|b} are allowed, but \regexp{a*}
isn't.

\item[\code{(?<!...)}] Matches if the current position in the string
is not preceded by a match for \regexp{...}. This
is called a negative lookbehind assertion. Similar to positive lookbehind
assertions, the contained pattern must only match strings of some
fixed length.

\end{list}
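
Several of the constructs listed above can be tried out as follows.
This is a sketch rather than an exhaustive test, assuming a current
Python interpreter; the variable names and the sample string
\code{'count = 2'} are illustrative only:

\begin{verbatim}
import re

html = '<H1>title</H1>'
greedy = re.match(r'<.*>', html).group()     # the whole string
minimal = re.match(r'<.*?>', html).group()   # stops at the first '>'

# {m,n} takes as many repetitions as it can; {m,n}? as few.
most = re.match(r'a{3,5}', 'aaaaaa').group()
fewest = re.match(r'a{3,5}?', 'aaaaaa').group()

# Alternatives are tried left to right; the first that works wins.
first_branch = re.match(r'a|ab', 'ab').group()

# A named group can also be retrieved by its number.
m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'count = 2')
by_name = m.group('id')
by_number = m.group(1)

# Lookahead and lookbehind assertions do not consume any text.
ahead = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov').group()
rejected = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov')
behind = re.search(r'(?<=abc)def', 'abcdef').group()
\end{verbatim}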

The special sequences consist of \character{\e} and a character from the
list below. If the ordinary character is not on the list, then the
resulting RE will match the second character. For example,
\regexp{\e\$} matches the character \character{\$}.

\begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}

\item[\code{\e \var{number}}] Matches the contents of the group of the
same number. Groups are numbered starting from 1. For example,
\regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not
\code{'the end'} (note
the space after the group). This special sequence can only be used to
match one of the first 99 groups. If the first digit of \var{number}
is 0, or \var{number} is 3 octal digits long, it will not be interpreted
as a group match, but as the character with octal value \var{number}.
Inside the \character{[} and \character{]} of a character class, all numeric
escapes are treated as characters.

\item[\code{\e A}] Matches only at the start of the string.

\item[\code{\e b}] Matches the empty string, but only at the
beginning or end of a word. A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character. Inside a character range,
\regexp{\e b} represents the backspace character, for compatibility with
Python's string literals.

\item[\code{\e B}] Matches the empty string, but only when it is
\emph{not} at the beginning or end of a word.

\item[\code{\e d}]Matches any decimal digit; this is
equivalent to the set \regexp{[0-9]}.

\item[\code{\e D}]Matches any non-digit character; this is
equivalent to the set \regexp{[{\^}0-9]}.

\item[\code{\e s}]Matches any whitespace character; this is
equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}.

\item[\code{\e S}]Matches any non-whitespace character; this is
equivalent to the set \regexp{[\^\ \e t\e n\e r\e f\e v]}.

\item[\code{\e w}]When the \constant{LOCALE} and \constant{UNICODE}
flags are not specified,
matches any alphanumeric character; this is equivalent to the set
\regexp{[a-zA-Z0-9_]}. With \constant{LOCALE}, it will match the set
\regexp{[0-9_]} plus whatever characters are defined as letters for
the current locale. If \constant{UNICODE} is set, this will match the
characters \regexp{[0-9_]} plus whatever is classified as alphanumeric
in the Unicode character properties database.

\item[\code{\e W}]When the \constant{LOCALE} and \constant{UNICODE}
flags are not specified, matches any non-alphanumeric character; this
is equivalent to the set \regexp{[{\^}a-zA-Z0-9_]}. With
\constant{LOCALE}, it will match any character not in the set
\regexp{[0-9_]}, and not defined as a letter for the current locale.
If \constant{UNICODE} is set, this will match anything other than
\regexp{[0-9_]} and characters marked as alphanumeric in the Unicode
character properties database.

\item[\code{\e Z}]Matches only at the end of the string.

\item[\code{\e \e}] Matches a literal backslash.

\end{list}
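
A few of the special sequences in action. This is a minimal sketch,
assuming a current Python interpreter; the sample strings and variable
names are made up for illustration:

\begin{verbatim}
import re

# \1 must repeat exactly what group 1 matched.
doubled = re.match(r'(.+) \1', 'the the')
not_doubled = re.match(r'(.+) \1', 'the end')

# \b only matches at the beginning or end of a word.
whole_word = re.search(r'\bcat\b', 'a cat sat')
inside_word = re.search(r'\bcat\b', 'concatenate')

# \d+ picks out runs of decimal digits.
digits = re.findall(r'\d+', 'rooms 12 and 305')
\end{verbatim}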


\subsection{Matching vs. Searching \label{matching-searching}}
\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}

Python offers two different primitive operations based on regular
expressions: match and search. If you are accustomed to Perl's
semantics, the search operation is what you're looking for. See the
\function{search()} function and corresponding method of compiled
regular expression objects.

Note that match may differ from search using a regular expression
beginning with \character{\^}: \character{\^} matches only at the
start of the string, or in \constant{MULTILINE} mode also immediately
following a newline. The ``match'' operation succeeds only if the
pattern matches at the start of the string regardless of mode, or at
the starting position given by the optional \var{pos} argument
regardless of whether a newline precedes it.

% Examples from Tim Peters:
\begin{verbatim}
re.compile("a").match("ba", 1)           # succeeds
re.compile("^a").search("ba", 1)         # fails; 'a' not at start
re.compile("^a").search("\na", 1)        # fails; 'a' not at start
re.compile("^a", re.M).search("\na", 1)  # succeeds
re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
\end{verbatim}


\subsection{Module Contents}
\nodename{Contents of Module re}

The module defines the following functions and constants, and an exception:


\begin{funcdesc}{compile}{pattern\optional{, flags}}
  Compile a regular expression pattern into a regular expression
  object, which can be used for matching using its \function{match()} and
  \function{search()} methods, described below.

  The expression's behaviour can be modified by specifying a
  \var{flags} value. Values can be any of the following variables,
  combined using bitwise OR (the \code{|} operator).

The sequence

\begin{verbatim}
prog = re.compile(pat)
result = prog.match(str)
\end{verbatim}

is equivalent to

\begin{verbatim}
result = re.match(pat, str)
\end{verbatim}

but the version using \function{compile()} is more efficient when the
expression will be used several times in a single program.
%(The compiled version of the last pattern passed to
%\function{re.match()} or \function{re.search()} is cached, so
%programs that use only a single regular expression at a time needn't
%worry about compiling regular expressions.)
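
As an illustration of combining flags, the following sketch compiles
one pattern with two flags joined by \code{|} and reuses the compiled
object. It assumes a current Python interpreter; the pattern and the
sample text are invented for the example:

\begin{verbatim}
import re

# re.I and re.M are combined with the | operator.
prog = re.compile(r'^b\w+', re.I | re.M)

# The compiled object can be reused for any number of searches.
found = prog.findall('Apple\nBanana\nberry')
\end{verbatim}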
\end{funcdesc}

\begin{datadesc}{I}
\dataline{IGNORECASE}
Perform case-insensitive matching; expressions like \regexp{[A-Z]} will match
lowercase letters, too. This is not affected by the current locale.
\end{datadesc}

\begin{datadesc}{L}
\dataline{LOCALE}
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
\regexp{\e B} dependent on the current locale.
\end{datadesc}

\begin{datadesc}{M}
\dataline{MULTILINE}
When specified, the pattern character \character{\^} matches at the
beginning of the string and at the beginning of each line
(immediately following each newline); and the pattern character
\character{\$} matches at the end of the string and at the end of each line
(immediately preceding each newline).
By default, \character{\^} matches only at the beginning of the string, and
\character{\$} only at the end of the string and immediately before the
newline (if any) at the end of the string.
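
The effect of the flag can be seen by running the same patterns with
and without it. This is a sketch only, assuming a current Python
interpreter; \code{text} and the other names are illustrative:

\begin{verbatim}
import re

text = 'first\nsecond\n'

# Without the flag, ^ matches only at the very start of the string,
# and $ only at the end or just before a final newline.
default_starts = re.compile(r'^\w+').findall(text)
default_ends = re.compile(r'\w+$').findall(text)

# With the flag, every line start and line end qualifies.
multiline_starts = re.compile(r'^\w+', re.M).findall(text)
multiline_ends = re.compile(r'\w+$', re.M).findall(text)
\end{verbatim}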
\end{datadesc}

\begin{datadesc}{S}
\dataline{DOTALL}
Make the \character{.} special character match any character at all,
including a newline; without this flag, \character{.} will match
anything \emph{except} a newline.
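
A short sketch of the difference, assuming a current Python
interpreter (the sample string is illustrative):

\begin{verbatim}
import re

text = 'one\ntwo'
without_flag = re.match(r'.+', text).group()      # stops at the newline
with_flag = re.match(r'.+', text, re.S).group()   # runs through it
\end{verbatim}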
\end{datadesc}

\begin{datadesc}{U}
\dataline{UNICODE}
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
\regexp{\e B} dependent on the Unicode character properties database.
\versionadded{2.0}
\end{datadesc}

\begin{datadesc}{X}
\dataline{VERBOSE}
This flag allows you to write regular expressions that look nicer.
Whitespace within the pattern is ignored,
except when in a character class or preceded by an unescaped
backslash, and, when a line contains a \character{\#} neither in a character
class nor preceded by an unescaped backslash, all characters from the
leftmost such \character{\#} through the end of the line are ignored.
% XXX should add an example here
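
For instance, the two compiled objects below should accept the same
strings. The pattern for a simple decimal number is a hypothetical
one chosen for illustration, and the sketch assumes a current Python
interpreter:

\begin{verbatim}
import re

# Whitespace and the text after each # are ignored under re.X.
number = re.compile(r"""
    \d+        # integer part
    \.         # the decimal point
    \d*        # optional fractional digits
    """, re.X)

# The same expression written without the flag.
compact = re.compile(r"\d+\.\d*")
\end{verbatim}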
\end{datadesc}


\begin{funcdesc}{search}{pattern, string\optional{, flags}}
  Scan through \var{string} looking for a location where the regular
  expression \var{pattern} produces a match, and return a
  corresponding \class{MatchObject} instance.
  Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.
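
The distinction between no match and a zero-length match can be
sketched as follows, assuming a current Python interpreter; the names
are illustrative:

\begin{verbatim}
import re

no_match = re.search(r'x', 'abc')    # None: 'x' occurs nowhere
empty = re.search(r'x*', 'abc')      # a match object of length zero

# Test against None explicitly to tell the two cases apart.
if empty is not None:
    span = empty.span()
\end{verbatim}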
\end{funcdesc}

\begin{funcdesc}{match}{pattern, string\optional{, flags}}
  If zero or more characters at the beginning of \var{string} match
  the regular expression \var{pattern}, return a corresponding
  \class{MatchObject} instance. Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.

  \strong{Note:} If you want to locate a match anywhere in
  \var{string}, use \method{search()} instead.
Guido van Rossum1acceb01997-08-14 23:12:18 +0000467\end{funcdesc}
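A short sketch of how \function{match()} differs from \function{search()}
on the same input (the strings are arbitrary examples):

```python
import re

# match() only succeeds at the beginning of the string.
at_start = re.match("c", "abcdef")

# search() scans forward and finds the "c" at index 2.
anywhere = re.search("c", "abcdef")
```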

\begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}}
  Split \var{string} by the occurrences of \var{pattern}. If
  capturing parentheses are used in \var{pattern}, then the text of all
  groups in the pattern is also returned as part of the resulting list.
  If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
  occur, and the remainder of the string is returned as the final
  element of the list. (Incompatibility note: in the original Python
  1.5 release, \var{maxsplit} was ignored. This has been fixed in
  later releases.)

\begin{verbatim}
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
\end{verbatim}

  This function combines and extends the functionality of
  the old \function{regsub.split()} and \function{regsub.splitx()}.
\end{funcdesc}

\begin{funcdesc}{findall}{pattern, string}
Return a list of all non-overlapping matches of \var{pattern} in
\var{string}. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result.
\versionadded{1.5.2}
\end{funcdesc}
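The shape of the result depends on how many groups the pattern contains.
The following sketch uses an invented sample string to show the three
cases:

```python
import re

text = "set width=20 and height=10"

# No groups: a list of the matched strings.
whole = re.findall(r"[a-z]+=", text)

# One group: a list of that group's text.
names = re.findall(r"([a-z]+)=", text)

# Several groups: a list of tuples, one tuple per match.
pairs = re.findall(r"([a-z]+)=(\d+)", text)
```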

\begin{funcdesc}{sub}{pattern, repl, string\optional{, count\code{ = 0}}}
Return the string obtained by replacing the leftmost non-overlapping
occurrences of \var{pattern} in \var{string} by the replacement
\var{repl}. If the pattern isn't found, \var{string} is returned
unchanged. \var{repl} can be a string or a function; if a function,
it is called for every non-overlapping occurrence of \var{pattern}.
The function takes a single match object argument, and returns the
replacement string. For example:

\begin{verbatim}
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
\end{verbatim}

The pattern may be a string or an RE object; if you need to specify
regular expression flags, you must use an RE object, or use
embedded modifiers in a pattern; e.g.
\samp{sub("(?i)b+", "x", "bbbb BBBB")} returns \code{'x x'}.

The optional argument \var{count} is the maximum number of pattern
occurrences to be replaced; \var{count} must be a non-negative integer, and
the default value of 0 means to replace all occurrences.

Empty matches for the pattern are replaced only when not adjacent to a
previous match, so \samp{sub('x*', '-', 'abc')} returns \code{'-a-b-c-'}.

If \var{repl} is a string, any backslash escapes in it are processed.
That is, \samp{\e n} is converted to a single newline character,
\samp{\e r} is converted to a carriage return, and so forth. Unknown escapes
such as \samp{\e j} are left alone. Backreferences, such as \samp{\e 6}, are
replaced with the substring matched by group 6 in the pattern.

In addition to character escapes and backreferences as described
above, \samp{\e g<name>} will use the substring matched by the group
named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax.
\samp{\e g<number>} uses the corresponding group number; \samp{\e
g<2>} is therefore equivalent to \samp{\e 2}, but isn't ambiguous in a
replacement such as \samp{\e g<2>0}. \samp{\e 20} would be
interpreted as a reference to group 20, not a reference to group 2
followed by the literal character \character{0}.
\end{funcdesc}
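The replacement syntax can be sketched as below. The patterns, input
strings and variable names are invented for illustration, and \var{count}
is passed by keyword here only for readability.

```python
import re

# Named backreferences swap the two words.
swapped = re.sub(r"(?P<first>\w+) (?P<second>\w+)",
                 r"\g<second> \g<first>",
                 "hello world")

# \g<1>0 means "group 1 followed by a literal 0"; writing \10 instead
# would be read as a reference to group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")

# count limits the number of replacements.
limited = re.sub("a", "-", "banana", count=2)
```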

\begin{funcdesc}{subn}{pattern, repl, string\optional{, count\code{ = 0}}}
Perform the same operation as \function{sub()}, but return a tuple
\code{(\var{new_string}, \var{number_of_subs_made})}.
\end{funcdesc}
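For instance, a minimal sketch that unpacks the returned tuple (the input
string is arbitrary):

```python
import re

# Replace every hyphen with a space and count the replacements.
new_string, number_of_subs_made = re.subn("-", " ", "a-b-c")
```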

\begin{funcdesc}{escape}{string}
  Return \var{string} with all non-alphanumerics backslashed; this is
  useful if you want to match an arbitrary literal string that may have
  regular expression metacharacters in it.
\end{funcdesc}
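A sketch of why escaping matters when the text to be found contains
metacharacters. The sample strings are made up, and the exact set of
characters that \function{escape()} backslashes may vary between Python
versions, so the example checks only the matching behaviour.

```python
import re

literal = "1+1=2 (really?)"
haystack = "proof: 1+1=2 (really?)"

# Escaped, the text is matched character for character.
escaped_hit = re.search(re.escape(literal), haystack)

# Unescaped, "+", "(", ")" and "?" are treated as operators,
# so the same text no longer matches itself.
unescaped_hit = re.search(literal, haystack)
```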

\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (e.g., unmatched parentheses) or
  when some other error occurs during compilation or matching. It is
  never an error if a string contains no match for a pattern.
\end{excdesc}
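A minimal sketch of handling the exception, using a deliberately malformed
pattern; the variable names are illustrative.

```python
import re

# An unmatched parenthesis makes the pattern invalid.
try:
    re.compile("(unclosed")
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    message = str(exc)

# A valid pattern that simply finds nothing raises no exception.
no_match = re.search("z", "abc")
```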


\subsection{Regular Expression Objects \label{re-objects}}

Compiled regular expression objects support the following methods and
attributes:

\begin{methoddesc}[RegexObject]{search}{string\optional{, pos\optional{,
                                        endpos}}}
  Scan through \var{string} looking for a location where this regular
  expression produces a match, and return a
  corresponding \class{MatchObject} instance. Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.

  The optional \var{pos} and \var{endpos} parameters have the same
  meaning as for the \method{match()} method.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{match}{string\optional{, pos\optional{,
                                       endpos}}}
  If zero or more characters at the beginning of \var{string} match
  this regular expression, return a corresponding
  \class{MatchObject} instance. Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.

  \strong{Note:} If you want to locate a match anywhere in
  \var{string}, use \method{search()} instead.

  The optional second parameter \var{pos} gives an index in the string
  where the search is to start; it defaults to \code{0}. This is not
  completely equivalent to slicing the string; the \code{'\^'} pattern
  character matches at the real beginning of the string and at positions
  just after a newline, but not necessarily at the index where the search
  is to start.

  The optional parameter \var{endpos} limits how far the string will
  be searched; it will be as if the string is \var{endpos} characters
  long, so only the characters from \var{pos} to \var{endpos} will be
  searched for a match.
\end{methoddesc}
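The following sketch illustrates why passing \var{pos} is not the same as
slicing, and how \var{endpos} bounds the match. The patterns and names
are chosen only for the example.

```python
import re

anchored = re.compile("^b")

# Slicing creates a new string whose real beginning is "b".
sliced = anchored.match("abc"[1:])

# With pos=1 the scan starts at "b", but "^" still refers to the
# real beginning of "abc", so nothing matches.
offset = anchored.match("abc", 1)

# Only the characters from index 1 up to (not including) 4 are considered.
word = re.compile(r"\w+")
limited = word.match("abcdef", 1, 4).group(0)
```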

\begin{methoddesc}[RegexObject]{split}{string\optional{,
                                       maxsplit\code{ = 0}}}
Identical to the \function{split()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{findall}{string}
Identical to the \function{findall()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}}
Identical to the \function{sub()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{subn}{repl, string\optional{,
                                      count\code{ = 0}}}
Identical to the \function{subn()} function, using the compiled pattern.
\end{methoddesc}


\begin{memberdesc}[RegexObject]{flags}
The flags argument used when the RE object was compiled, or
\code{0} if no flags were provided.
\end{memberdesc}

\begin{memberdesc}[RegexObject]{groupindex}
A dictionary mapping any symbolic group names defined by
\regexp{(?P<\var{id}>)} to group numbers. The dictionary is empty if no
symbolic groups were used in the pattern.
\end{memberdesc}

\begin{memberdesc}[RegexObject]{pattern}
The pattern string from which the RE object was compiled.
\end{memberdesc}
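A brief sketch of inspecting these attributes on a compiled pattern. The
pattern itself is invented, and because newer Python versions may set
additional flag bits implicitly, the example tests for a single flag
rather than comparing \member{flags} with a fixed number.

```python
import re

source = r"(?P<key>\w+)=(?P<value>\w+)"
compiled = re.compile(source, re.IGNORECASE)

# Test for one flag bit instead of relying on the exact integer value.
has_ignorecase = bool(compiled.flags & re.IGNORECASE)

# Copy the name-to-number mapping into an ordinary dictionary.
index = dict(compiled.groupindex)

# The original pattern string is kept on the object.
original = compiled.pattern
```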


\subsection{Match Objects \label{match-objects}}

\class{MatchObject} instances support the following methods and attributes:

\begin{methoddesc}[MatchObject]{expand}{template}
Return the string obtained by doing backslash substitution on the
template string \var{template}, as done by the \method{sub()} method.
Escapes such as \samp{\e n} are converted to the appropriate
characters, and numeric backreferences (\samp{\e 1}, \samp{\e 2}) and named
backreferences (\samp{\e g<1>}, \samp{\e g<name>}) are replaced by the contents of the
corresponding group.
\end{methoddesc}
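As a small sketch, the template below mixes a numeric backreference, a
named backreference and a character escape; the pattern and input are
illustrative.

```python
import re

m = re.match(r"(?P<user>\w+)@(\w+)", "guido@python")

# \2 and \g<user> are filled in from the match; \n becomes a newline.
expanded = m.expand(r"\2: \g<user>\n")
```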

\begin{methoddesc}[MatchObject]{group}{\optional{group1, \moreargs}}
Returns one or more subgroups of the match. If there is a single
argument, the result is a single string; if there are
multiple arguments, the result is a tuple with one item per argument.
Without arguments, \var{group1} defaults to zero (i.e. the whole match
is returned).
If a \var{groupN} argument is zero, the corresponding return value is the
entire matching string; if it is in the inclusive range [1..99], it is
the string matching the corresponding parenthesized group. If a
group number is negative or larger than the number of groups defined
in the pattern, an \exception{IndexError} exception is raised.
If a group is contained in a part of the pattern that did not match,
the corresponding result is \code{None}. If a group is contained in a
part of the pattern that matched multiple times, the last match is
returned.

If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax,
the \var{groupN} arguments may also be strings identifying groups by
their group name. If a string argument is not used as a group name in
the pattern, an \exception{IndexError} exception is raised.

A moderately complicated example:

\begin{verbatim}
m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
\end{verbatim}

After performing this match, \code{m.group(1)} is \code{'3'}, as is
\code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
\end{methoddesc}
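The remaining cases described for \method{group()} (several arguments, a
group that did not take part in the match, a repeated group, and an
out-of-range number) can be sketched as follows, with invented input
strings:

```python
import re

m = re.match(r"(\w+) (\w+)( \w+)?", "Isaac Newton")

whole = m.group()        # no arguments: the whole match
both = m.group(1, 2)     # several arguments: a tuple
missing = m.group(3)     # optional group 3 did not participate

# A group inside a repetition reports its last match.
last_repeat = re.match(r"(?:(\w)-)+", "a-b-c-").group(1)

# A group number beyond those defined in the pattern is an error.
try:
    m.group(4)
    out_of_range = False
except IndexError:
    out_of_range = True
```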

\begin{methoddesc}[MatchObject]{groups}{\optional{default}}
Return a tuple containing all the subgroups of the match, from 1 up to
however many groups are in the pattern. The \var{default} argument is
used for groups that did not participate in the match; it defaults to
\code{None}. (Incompatibility note: in the original Python 1.5
release, if the tuple was one element long, a string would be returned
instead. In later versions (from 1.5.1 on), a singleton tuple is
returned in such cases.)
\end{methoddesc}

\begin{methoddesc}[MatchObject]{groupdict}{\optional{default}}
Return a dictionary containing all the \emph{named} subgroups of the
match, keyed by the subgroup name. The \var{default} argument is
used for groups that did not participate in the match; it defaults to
\code{None}.
\end{methoddesc}
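A sketch of the effect of the \var{default} argument on both methods,
using a hypothetical pattern whose fractional part is optional:

```python
import re

m = re.match(r"(?P<int>\d+)(?:\.(?P<frac>\d+))?", "24")

# The "frac" group did not participate in the match.
all_groups = m.groups()
with_default = m.groups("0")

named = m.groupdict()
named_default = m.groupdict("0")
```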

\begin{methoddesc}[MatchObject]{start}{\optional{group}}
\funcline{end}{\optional{group}}
Return the indices of the start and end of the substring
matched by \var{group}; \var{group} defaults to zero (meaning the whole
matched substring).
Return \code{-1} if \var{group} exists but
did not contribute to the match. For a match object
\var{m}, and a group \var{g} that did contribute to the match, the
substring matched by group \var{g} (equivalent to
\code{\var{m}.group(\var{g})}) is

\begin{verbatim}
m.string[m.start(g):m.end(g)]
\end{verbatim}

Note that
\code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
\var{group} matched a null string. For example, after \code{\var{m} =
re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
\code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
\code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
an \exception{IndexError} exception.
\end{methoddesc}

\begin{methoddesc}[MatchObject]{span}{\optional{group}}
For \class{MatchObject} \var{m}, return the 2-tuple
\code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
Note that if \var{group} did not contribute to the match, this is
\code{(-1, -1)}. Again, \var{group} defaults to zero.
\end{methoddesc}
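The index methods can be sketched together on one match. The pattern adds
a hypothetical optional second group to show the \code{-1} case.

```python
import re

m = re.search("b(c?)(d)?", "cba")

whole_span = m.span()     # the "b" at index 1
empty_span = m.span(1)    # group 1 matched the empty string at index 2
unused_start = m.start(2) # group 2 exists but did not contribute
unused_span = m.span(2)

# Slicing the original string with start/end reproduces the group text.
slice_matches = m.string[m.start(1):m.end(1)] == m.group(1)
```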

\begin{memberdesc}[MatchObject]{pos}
The value of \var{pos} which was passed to the
\method{search()} or \method{match()} method of the
\class{RegexObject}. This is the index
into the string at which the RE engine started looking for a match.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{endpos}
The value of \var{endpos} which was passed to the
\method{search()} or \method{match()} method of the
\class{RegexObject}. This is the index
into the string beyond which the RE engine will not go.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{lastgroup}
The name of the last matched capturing group, or \code{None} if the
group didn't have a name, or if no group was matched at all.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{lastindex}
The integer index of the last matched capturing group, or \code{None}
if no group was matched at all.
\end{memberdesc}
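A sketch of how \member{lastindex} and \member{lastgroup} relate, using
three made-up patterns: one whose last group is unnamed, one whose last
group is named, and one with no groups at all.

```python
import re

unnamed_last = re.match(r"(?P<word>\w+) (\d+)", "item 42")
named_last = re.match(r"(\d+) (?P<word>\w+)", "42 item")
no_groups = re.match(r"\w+", "item")

results = [
    (unnamed_last.lastindex, unnamed_last.lastgroup),
    (named_last.lastindex, named_last.lastgroup),
    (no_groups.lastindex, no_groups.lastgroup),
]
```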

\begin{memberdesc}[MatchObject]{re}
The regular expression object whose \method{match()} or
\method{search()} method produced this \class{MatchObject} instance.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{string}
The string passed to \function{match()} or \function{search()}.
\end{memberdesc}
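The bookkeeping attributes of a match object can be sketched on a single
search; the pattern, subject string and bounds are arbitrary.

```python
import re

pattern = re.compile("b")
subject = "abcabc"

# Search only the characters from index 2 up to (not including) 5.
m = pattern.search(subject, 2, 5)

bounds = (m.pos, m.endpos)
same_pattern = m.re is pattern
same_string = m.string == subject
found_at = m.start()
```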