\section{\module{re} ---
         Regular expression operations}
\declaremodule{standard}{re}
\moduleauthor{Andrew M. Kuchling}{amk1@bigfoot.com}
\moduleauthor{Fredrik Lundh}{effbot@telia.com}
\sectionauthor{Andrew M. Kuchling}{amk1@bigfoot.com}


\modulesynopsis{Regular expression search and match operations with a
                Perl-style expression syntax.}


This module provides regular expression matching operations similar to
those found in Perl. Regular expression pattern strings may not
contain null bytes, but can specify the null byte using the
\code{\e\var{number}} notation. Both patterns and strings to be
searched can be Unicode strings as well as 8-bit strings. The
\module{re} module is always available.

Regular expressions use the backslash character (\character{\e}) to
indicate special forms or to allow special characters to be used
without invoking their special meaning. This collides with Python's
usage of the same character for the same purpose in string literals;
for example, to match a literal backslash, one might have to write
\code{'\e\e\e\e'} as the pattern string, because the regular expression
must be \samp{\e\e}, and each backslash must be expressed as
\samp{\e\e} inside a regular Python string literal.

The solution is to use Python's raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with \character{r}. So \code{r"\e n"} is a
two-character string containing \character{\e} and \character{n},
while \code{"\e n"} is a one-character string containing a newline.
Usually patterns will be expressed in Python code using this raw
string notation.

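As a minimal sketch of the point about raw strings, the following
compares the two spellings of a pattern that matches one literal
backslash. It assumes a current Python interpreter; the variable names
and the sample path are illustrative and are not part of the module.

```python
import re

# Without a raw string, every backslash is written twice for Python
# and twice again for the regular expression engine.
plain_pattern = '\\\\'
raw_pattern = r'\\'
same_pattern = (plain_pattern == raw_pattern)  # both are the 2-character RE

# A sample string containing exactly one backslash: C:\temp
path = 'C:\\temp'
backslash_position = re.search(raw_pattern, path).start()

# r"\n" holds a backslash and an 'n'; "\n" holds a single newline.
raw_length = len(r"\n")
cooked_length = len("\n")
```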
\strong{Implementation note:}
The \module{re}\refstmodindex{pre} module has two distinct
implementations: \module{sre} is the default implementation and
includes Unicode support, but may run into stack limitations for some
patterns. Though this will be fixed for a future release of Python,
the older implementation (without Unicode support) is still available
as the \module{pre}\refstmodindex{pre} module.


\begin{seealso}
  \seetitle{Mastering Regular Expressions}{Book on regular expressions
            by Jeffrey Friedl, published by O'Reilly. The Python
            material in this book dates from before the \refmodule{re}
            module, but it covers writing good regular expression
            patterns in great detail.}
\end{seealso}


\subsection{Regular Expression Syntax \label{re-syntax}}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression. If a string \emph{p}
matches A and another string \emph{q} matches B, the string \emph{pq}
will match AB. Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here. For
details of the theory and implementation of regular expressions,
consult the Friedl book referenced above, or almost any textbook about
compiler construction.

A brief explanation of the format of regular expressions follows. For
further information and a gentler presentation, consult the Regular
Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like \character{A}, \character{a}, or \character{0},
are the simplest regular expressions; they simply match themselves.
You can concatenate ordinary characters, so \regexp{last} matches the
string \code{'last'}. (In the rest of this section, we'll write REs in
\regexp{this special style}, usually without quotes, and strings to be
matched \code{'in single quotes'}.)

Some characters, like \character{|} or \character{(}, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

The special characters are:

\begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}

\item[\character{.}] (Dot.) In the default mode, this matches any
character except a newline. If the \constant{DOTALL} flag has been
specified, this matches any character including a newline.

\item[\character{\^}] (Caret.) Matches the start of the string, and in
\constant{MULTILINE} mode also matches immediately after each newline.

\item[\character{\$}] Matches the end of the string, and in
\constant{MULTILINE} mode also matches before a newline.
\regexp{foo} matches both 'foo' and 'foobar', while the regular
expression \regexp{foo\$} matches only 'foo'.

\item[\character{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE, as many repetitions
as are possible. \regexp{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.

\item[\character{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.

\item[\character{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE. \regexp{ab?} will
match either 'a' or 'ab'.

\item[\code{*?}, \code{+?}, \code{??}] The \character{*}, \character{+}, and
\character{?} qualifiers are all \dfn{greedy}; they match as much text as
possible. Sometimes this behaviour isn't desired; if the RE
\regexp{<.*>} is matched against \code{'<H1>title</H1>'}, it will match the
entire string, and not just \code{'<H1>'}.
Adding \character{?} after the qualifier makes it perform the match in
\dfn{non-greedy} or \dfn{minimal} fashion; as \emph{few} characters as
possible will be matched. Using \regexp{.*?} in the previous
expression will match only \code{'<H1>'}.

\item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
\var{m} to \var{n} repetitions of the preceding RE, attempting to
match as many repetitions as possible. For example, \regexp{a\{3,5\}}
will match from 3 to 5 \character{a} characters. Omitting \var{n}
specifies an infinite upper bound; you can't omit \var{m}.

\item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
match from \var{m} to \var{n} repetitions of the preceding RE,
attempting to match as \emph{few} repetitions as possible. This is
the non-greedy version of the previous qualifier. For example, on the
6-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5
\character{a} characters, while \regexp{a\{3,5\}?} will only match 3
characters.

\item[\character{\e}] Either escapes special characters (permitting
you to match characters like \character{*}, \character{?}, and so
forth), or signals a special sequence; special sequences are discussed
below.

If you're not using a raw string to
express the pattern, remember that Python also uses the
backslash as an escape character in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string. However,
if Python would recognize the resulting sequence, the backslash should
be doubled. This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the
simplest expressions.

\item[\code{[]}] Used to indicate a set of characters. Characters can
be listed individually, or a range of characters can be indicated by
giving two characters and separating them by a \character{-}. Special
characters are not active inside sets. For example, \regexp{[akm\$]}
will match any of the characters \character{a}, \character{k},
\character{m}, or \character{\$}; \regexp{[a-z]}
will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any
letter or digit. Character classes such as \code{\e w} or \code{\e S}
(defined below) are also acceptable inside a set. If you want to
include a \character{]} or a \character{-} inside a set, precede it with a
backslash, or place it as the first character. The
pattern \regexp{[]]} will match \code{']'}, for example.

You can match the characters not within a set by \dfn{complementing}
the set. This is indicated by including a
\character{\^} as the first character of the set; \character{\^} elsewhere will
simply match the \character{\^} character. For example, \regexp{[{\^}5]}
will match any character except \character{5}.

\item[\character{|}]\code{A|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B. An
arbitrary number of REs can be separated by the \character{|} in this
way. This can be used inside groups (see below) as well. REs
separated by \character{|} are tried from left to right, and the first
one that allows the complete pattern to match is considered the
accepted branch. This means that if \code{A} matches, \code{B} will
never be tested, even if it would produce a longer overall match. In
other words, the \character{|} operator is never greedy. To match a
literal \character{|}, use \regexp{\e|}, or enclose it inside a
character class, as in \regexp{[|]}.

\item[\code{(...)}] Matches whatever regular expression is inside the
parentheses, and indicates the start and end of a group; the contents
of a group can be retrieved after a match has been performed, and can
be matched later in the string with the \regexp{\e \var{number}} special
sequence, described below. To match the literals \character{(} or
\character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them
inside a character class: \regexp{[(] [)]}.

\item[\code{(?...)}] This is an extension notation (a \character{?}
following a \character{(} is not meaningful otherwise). The first
character after the \character{?}
determines what the meaning and further syntax of the construct is.
Extensions usually do not create a new group;
\regexp{(?P<\var{name}>...)} is the only exception to this rule.
Following are the currently supported extensions.

\item[\code{(?iLmsux)}] (One or more letters from the set \character{i},
\character{L}, \character{m}, \character{s}, \character{u},
\character{x}.) The group matches the empty string; the letters set
the corresponding flags (\constant{re.I}, \constant{re.L},
\constant{re.M}, \constant{re.S}, \constant{re.U}, \constant{re.X})
for the entire regular expression. This is useful if you wish to
include the flags as part of the regular expression, instead of
passing a \var{flag} argument to the \function{compile()} function.

Note that the \regexp{(?x)} flag changes how the expression is parsed.
It should be used first in the expression string, or after one or more
whitespace characters. If there are non-whitespace characters before
the flag, the results are undefined.

\item[\code{(?:...)}] A non-grouping version of regular parentheses.
Matches whatever regular expression is inside the parentheses, but the
substring matched by the
group \emph{cannot} be retrieved after performing a match or
referenced later in the pattern.

\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
the substring matched by the group is accessible via the symbolic group
name \var{name}. Group names must be valid Python identifiers. A
symbolic group is also a numbered group, just as if the group were not
named. So the group named 'id' in the example below can also be
referenced as the numbered group 1.

For example, if the pattern is
\regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
name in arguments to methods of match objects, such as
\code{m.group('id')} or \code{m.end('id')}, and also by name in
pattern text (for example, \regexp{(?P=id)}) and replacement text
(such as \code{\e g<id>}).

\item[\code{(?P=\var{name})}] Matches whatever text was matched by the
earlier group named \var{name}.

\item[\code{(?\#...)}] A comment; the contents of the parentheses are
simply ignored.

\item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
consume any of the string. This is called a lookahead assertion. For
example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
followed by \code{'Asimov'}.

\item[\code{(?!...)}] Matches if \regexp{...} doesn't match next. This
is a negative lookahead assertion. For example,
\regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not}
followed by \code{'Asimov'}.

\item[\code{(?<=...)}] Matches if the current position in the string
is preceded by a match for \regexp{...} that ends at the current
position. This is called a positive lookbehind assertion.
\regexp{(?<=abc)def} will find a match in \samp{abcdef}, since the
lookbehind will back up 3 characters and check if the contained pattern
matches.
The contained pattern must only match strings of some fixed length,
meaning that \regexp{abc} or \regexp{a|b} are allowed, but \regexp{a*}
isn't.

\item[\code{(?<!...)}] Matches if the current position in the string
is not preceded by a match for \regexp{...}. This
is called a negative lookbehind assertion. Similar to positive lookbehind
assertions, the contained pattern must only match strings of some
fixed length.

\end{list}

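Several of the special characters listed above can be tried out with a
short sketch such as the following. It assumes a current Python
interpreter; the variable names and sample strings are illustrative,
and only the patterns themselves are taken from the list.

```python
import re

# Greedy versus non-greedy repetition.
html = '<H1>title</H1>'
greedy = re.match(r'<.*>', html).group()      # consumes the whole string
minimal = re.match(r'<.*?>', html).group()    # stops at the first '>'

# Bounded repetition on a run of six 'a' characters.
run = 'aaaaaa'
most = re.match(r'a{3,5}', run).group()       # as many as allowed
fewest = re.match(r'a{3,5}?', run).group()    # as few as allowed

# Alternation tries its branches left to right and is never greedy.
first_branch = re.match(r'a|ab', 'ab').group()

# A named group can also be retrieved by its number.
m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
by_name, by_number = m.group('id'), m.group(1)

# Lookahead and lookbehind assertions consume no text.
ahead = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov').group()
not_ahead = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov')
behind = re.search(r'(?<=abc)def', 'abcdef')
```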
The special sequences consist of \character{\e} and a character from the
list below. If the ordinary character is not on the list, then the
resulting RE will match the second character. For example,
\regexp{\e\$} matches the character \character{\$}.

\begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}

\item[\code{\e \var{number}}] Matches the contents of the group of the
same number. Groups are numbered starting from 1. For example,
\regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not
\code{'the end'} (note
the space after the group). This special sequence can only be used to
match one of the first 99 groups. If the first digit of \var{number}
is 0, or \var{number} is 3 octal digits long, it will not be interpreted
as a group match, but as the character with octal value \var{number}.
Inside the \character{[} and \character{]} of a character class, all numeric
escapes are treated as characters.

\item[\code{\e A}] Matches only at the start of the string.

\item[\code{\e b}] Matches the empty string, but only at the
beginning or end of a word. A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character. Inside a character range,
\regexp{\e b} represents the backspace character, for compatibility with
Python's string literals.

\item[\code{\e B}] Matches the empty string, but only when it is
\emph{not} at the beginning or end of a word.

\item[\code{\e d}] Matches any decimal digit; this is
equivalent to the set \regexp{[0-9]}.

\item[\code{\e D}] Matches any non-digit character; this is
equivalent to the set \regexp{[{\^}0-9]}.

\item[\code{\e s}] Matches any whitespace character; this is
equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}.

\item[\code{\e S}] Matches any non-whitespace character; this is
equivalent to the set \regexp{[{\^} \e t\e n\e r\e f\e v]}.

\item[\code{\e w}] When the \constant{LOCALE} and \constant{UNICODE}
flags are not specified,
matches any alphanumeric character; this is equivalent to the set
\regexp{[a-zA-Z0-9_]}. With \constant{LOCALE}, it will match the set
\regexp{[0-9_]} plus whatever characters are defined as letters for
the current locale. If \constant{UNICODE} is set, this will match the
characters \regexp{[0-9_]} plus whatever is classified as alphanumeric
in the Unicode character properties database.

\item[\code{\e W}] When the \constant{LOCALE} and \constant{UNICODE}
flags are not specified, matches any non-alphanumeric character; this
is equivalent to the set \regexp{[{\^}a-zA-Z0-9_]}. With
\constant{LOCALE}, it will match any character not in the set
\regexp{[0-9_]}, and not defined as a letter for the current locale.
If \constant{UNICODE} is set, this will match anything other than
\regexp{[0-9_]} and characters marked as alphanumeric in the Unicode
character properties database.

\item[\code{\e Z}] Matches only at the end of the string.

\item[\code{\e \e}] Matches a literal backslash.

\end{list}


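The special sequences above can be exercised with a sketch along these
lines. It assumes a current Python interpreter and uses
\function{findall()}, which is not described in this part of the
section; the sample strings and variable names are illustrative.

```python
import re

# \1 must repeat exactly what group 1 matched.
repeated = re.match(r'(.+) \1', 'the the')
not_repeated = re.match(r'(.+) \1', 'the end')

# \b anchors a match to word boundaries, so 'cat' inside 'concat' is skipped.
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')

# \d matches decimal digits.
numbers = re.findall(r'\d+', 'room 101, floor 3')

# \Z matches only at the very end of the string, while $ also matches
# just before a trailing newline.
strict_end = re.search(r'end\Z', 'the end\n')
loose_end = re.search(r'end$', 'the end\n')
```

The contrast between the last two searches is the reason to prefer
\code{\e Z} when a trailing newline must not be tolerated.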
\subsection{Matching vs. Searching \label{matching-searching}}
\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}

Python offers two different primitive operations based on regular
expressions: match and search. If you are accustomed to Perl's
semantics, the search operation is what you're looking for. See the
\function{search()} function and corresponding method of compiled
regular expression objects.

Note that match may differ from search using a regular expression
beginning with \character{\^}: \character{\^} matches only at the
start of the string, or in \constant{MULTILINE} mode also immediately
following a newline. The ``match'' operation succeeds only if the
pattern matches at the start of the string regardless of mode, or at
the starting position given by the optional \var{pos} argument
regardless of whether a newline precedes it.

% Examples from Tim Peters:
\begin{verbatim}
re.compile("a").match("ba", 1)           # succeeds
re.compile("^a").search("ba", 1)         # fails; 'a' not at start
re.compile("^a").search("\na", 1)        # fails; 'a' not at start
re.compile("^a", re.M).search("\na", 1)  # succeeds
re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
\end{verbatim}


\subsection{Module Contents}
\nodename{Contents of Module re}

The module defines the following functions and constants, and an exception:


\begin{funcdesc}{compile}{pattern\optional{, flags}}
  Compile a regular expression pattern into a regular expression
  object, which can be used for matching using its \function{match()} and
  \function{search()} methods, described below.

  The expression's behaviour can be modified by specifying a
  \var{flags} value. Values can be any of the following variables,
  combined using bitwise OR (the \code{|} operator).

The sequence

\begin{verbatim}
prog = re.compile(pat)
result = prog.match(str)
\end{verbatim}

is equivalent to

\begin{verbatim}
result = re.match(pat, str)
\end{verbatim}

but the version using \function{compile()} is more efficient when the
expression will be used several times in a single program.
%(The compiled version of the last pattern passed to
%\function{re.match()} or \function{re.search()} is cached, so
%programs that use only a single regular expression at a time needn't
%worry about compiling regular expressions.)
\end{funcdesc}

\begin{datadesc}{I}
\dataline{IGNORECASE}
Perform case-insensitive matching; expressions like \regexp{[A-Z]} will match
lowercase letters, too. This is not affected by the current locale.
\end{datadesc}

\begin{datadesc}{L}
\dataline{LOCALE}
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
\regexp{\e B} dependent on the current locale.
\end{datadesc}

\begin{datadesc}{M}
\dataline{MULTILINE}
When specified, the pattern character \character{\^} matches at the
beginning of the string and at the beginning of each line
(immediately following each newline); and the pattern character
\character{\$} matches at the end of the string and at the end of each line
(immediately preceding each newline).
By default, \character{\^} matches only at the beginning of the string, and
\character{\$} only at the end of the string and immediately before the
newline (if any) at the end of the string.
\end{datadesc}

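The effect of \constant{MULTILINE} on the two anchors can be seen in a
small sketch like this one. It assumes a current Python interpreter and
uses \function{findall()} for brevity; the two-line sample text and the
variable names are illustrative.

```python
import re

text = 'first line\nsecond line\n'

# Default mode: ^ matches only at the very start of the string.
starts_default = re.findall(r'^\w+', text)
# MULTILINE: ^ also matches immediately after each newline.
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)

# Default mode: $ matches at the end of the string and just before a
# newline that ends the string.
ends_default = re.findall(r'\w+$', text)
# MULTILINE: $ also matches before every newline.
ends_multiline = re.findall(r'\w+$', text, re.MULTILINE)
```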
\begin{datadesc}{S}
\dataline{DOTALL}
Make the \character{.} special character match any character at all,
including a newline; without this flag, \character{.} will match
anything \emph{except} a newline.
\end{datadesc}

\begin{datadesc}{U}
\dataline{UNICODE}
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
\regexp{\e B} dependent on the Unicode character properties database.
\versionadded{2.0}
\end{datadesc}

\begin{datadesc}{X}
\dataline{VERBOSE}
This flag allows you to write regular expressions that look nicer.
Whitespace within the pattern is ignored,
except when in a character class or preceded by an unescaped
backslash, and, when a line contains a \character{\#} neither in a character
class nor preceded by an unescaped backslash, all characters from the
leftmost such \character{\#} through the end of the line are ignored.
% XXX should add an example here
\end{datadesc}


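One way to picture the \constant{VERBOSE} flag is to write the same
pattern twice, once laid out with comments and once in compact form.
The sketch below assumes a current Python interpreter; the pattern for
a simple decimal number and the names \code{number} and \code{compact}
are illustrative choices.

```python
import re

# The same pattern, spread over several lines with comments.
number = re.compile(r"""
    \d+        # integer part
    (?:        # optional fractional part, not captured
        \.     # the decimal point
        \d*    # fractional digits
    )?
    """, re.VERBOSE)

# The compact equivalent without the flag.
compact = re.compile(r'\d+(?:\.\d*)?')

verbose_result = number.match('3.14 rad').group()
compact_result = compact.match('3.14 rad').group()
integer_result = number.match('42').group()
```

Both objects should accept the same strings; the verbose form only
changes how the pattern is written, not what it matches.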
\begin{funcdesc}{search}{pattern, string\optional{, flags}}
  Scan through \var{string} looking for a location where the regular
  expression \var{pattern} produces a match, and return a
  corresponding \class{MatchObject} instance.
  Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.
\end{funcdesc}

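The difference between no match and a zero-length match can be
sketched as follows, assuming a current Python interpreter; the sample
strings and variable names are illustrative.

```python
import re

# A successful search reports where the match was found.
found = re.search(r'\d+', 'order 66 confirmed')
found_text, found_span = found.group(), found.span()

# No position matches: the result is None.
missing = re.search(r'\d+', 'no digits here')

# \d* can match the empty string, so this search succeeds at position 0
# with a zero-length match rather than returning None.
empty = re.search(r'\d*', 'no digits here')
empty_text, empty_span = empty.group(), empty.span()
```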
Fred Drake013ad981998-03-08 07:38:27 +0000459\begin{funcdesc}{match}{pattern, string\optional{, flags}}
Guido van Rossum1acceb01997-08-14 23:12:18 +0000460 If zero or more characters at the beginning of \var{string} match
461 the regular expression \var{pattern}, return a corresponding
Fred Drake20e01961998-02-19 15:09:35 +0000462 \class{MatchObject} instance. Return \code{None} if the string does not
Guido van Rossum1acceb01997-08-14 23:12:18 +0000463 match the pattern; note that this is different from a zero-length
464 match.
Fred Drake768ac6b1998-12-22 18:19:45 +0000465
466 \strong{Note:} If you want to locate a match anywhere in
467 \var{string}, use \method{search()} instead.
Guido van Rossum1acceb01997-08-14 23:12:18 +0000468\end{funcdesc}
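The difference between \function{match()} and \function{search()}, and
between \code{None} and a zero-length match, can be sketched as
follows.  The sample text and variable names are assumptions chosen
for illustration.

```python
import re

text = "say hello"

# match() only tries the beginning of the string, so this is None.
anchored = re.match("hello", text)

# search() scans forward and finds the word at index 4.
anywhere = re.search("hello", text)

# "x*" can match zero characters, so match() succeeds with an empty
# string; this is a match object, not None.
empty = re.match("x*", text)
```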

\begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}}
  Split \var{string} by the occurrences of \var{pattern}.  If
  capturing parentheses are used in \var{pattern}, then the text of all
  groups in the pattern is also returned as part of the resulting list.
  If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
  occur, and the remainder of the string is returned as the final
  element of the list.  (Incompatibility note: in the original Python
  1.5 release, \var{maxsplit} was ignored.  This has been fixed in
  later releases.)

\begin{verbatim}
>>> re.split('\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split('(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split('\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
\end{verbatim}

  This function combines and extends the functionality of
  the old \function{regsub.split()} and \function{regsub.splitx()}.
\end{funcdesc}

\begin{funcdesc}{findall}{pattern, string}
Return a list of all non-overlapping matches of \var{pattern} in
\var{string}.  If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group.  Empty matches are included in the result.
\versionadded{1.5.2}
\end{funcdesc}
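How the shape of the \function{findall()} result depends on the number
of groups can be sketched as follows; the sample text and patterns are
assumptions made for the example.

```python
import re

text = "width=80, height=24"

# No groups: a list of the matched strings.
whole = re.findall(r"\w+=\d+", text)

# One group: a list of the strings matched by that group.
values = re.findall(r"\w+=(\d+)", text)

# Two groups: a list of tuples, one item per group.
pairs = re.findall(r"(\w+)=(\d+)", text)
```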

\begin{funcdesc}{sub}{pattern, repl, string\optional{, count\code{ = 0}}}
Return the string obtained by replacing the leftmost non-overlapping
occurrences of \var{pattern} in \var{string} by the replacement
\var{repl}.  If the pattern isn't found, \var{string} is returned
unchanged.  \var{repl} can be a string or a function; if a function,
it is called for every non-overlapping occurrence of \var{pattern}.
The function takes a single match object argument, and returns the
replacement string.  For example:

\begin{verbatim}
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
\end{verbatim}

The pattern may be a string or an RE object; if you need to specify
regular expression flags, you must use an RE object, or use
embedded modifiers in a pattern; for example,
\samp{sub("(?i)b+", "x", "bbbb BBBB")} returns \code{'x x'}.

The optional argument \var{count} is the maximum number of pattern
occurrences to be replaced; \var{count} must be a non-negative integer, and
the default value of 0 means to replace all occurrences.

Empty matches for the pattern are replaced only when not adjacent to a
previous match, so \samp{sub('x*', '-', 'abc')} returns \code{'-a-b-c-'}.

If \var{repl} is a string, any backslash escapes in it are processed.
That is, \samp{\e n} is converted to a single newline character,
\samp{\e r} is converted to a carriage return, and so forth.  Unknown escapes
such as \samp{\e j} are left alone.  Backreferences, such as \samp{\e 6}, are
replaced with the substring matched by group 6 in the pattern.

In addition to character escapes and backreferences as described
above, \samp{\e g<name>} will use the substring matched by the group
named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax.
\samp{\e g<number>} uses the corresponding group number; \samp{\e
g<2>} is therefore equivalent to \samp{\e 2}, but isn't ambiguous in a
replacement such as \samp{\e g<2>0}.  \samp{\e 20} would be
interpreted as a reference to group 20, not a reference to group 2
followed by the literal character \character{0}.
\end{funcdesc}
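The replacement-string forms described for \function{sub()} can be
sketched as follows.  The input string, the group names \code{key} and
\code{val}, and the variable names are assumptions chosen for the
example.

```python
import re

text = "a=1 b=2"

# Numeric backreferences: swap the two sides of each "key=value" pair.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", text)

# The same replacement written with named groups and \g<name>.
named = re.sub(r"(?P<key>\w+)=(?P<val>\w+)", r"\g<val>=\g<key>", text)

# \g<2>0 means group 2 followed by a literal "0"; writing \20 instead
# would be read as a reference to group 20.
padded = re.sub(r"(\w+)=(\w+)", r"\1=\g<2>0", text)

# A count of 1 replaces only the first occurrence.
first_only = re.sub(r"\d", "#", text, 1)
```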

\begin{funcdesc}{subn}{pattern, repl, string\optional{, count\code{ = 0}}}
Perform the same operation as \function{sub()}, but return a tuple
\code{(\var{new_string}, \var{number_of_subs_made})}.
\end{funcdesc}
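A short sketch of unpacking the tuple returned by \function{subn()};
the input string is an assumption made for the example.

```python
import re

# Collapse each run of whitespace into a single space and count the runs.
result = re.subn(r"\s+", " ", "a  b   c")
collapsed, replacements = result
```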

\begin{funcdesc}{escape}{string}
  Return \var{string} with all non-alphanumerics backslashed; this is
  useful if you want to match an arbitrary literal string that may have
  regular expression metacharacters in it.
\end{funcdesc}
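The effect of \function{escape()} can be sketched by searching for a
literal string that happens to contain metacharacters.  The sample
strings are assumptions made for the example, and the exact escaped
text is not shown because it can vary between versions.

```python
import re

literal = "1+1=2 (really?)"
haystack = "we know 1+1=2 (really?)"

# Used directly, "+", "(", ")" and "?" act as metacharacters, so the
# pattern does not describe the literal text and nothing is found.
unescaped = re.search(literal, haystack)

# After escaping, the pattern matches the text character for character.
escaped = re.search(re.escape(literal), haystack)
```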

\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (for example, it might contain
  unmatched parentheses) or when some other error occurs during
  compilation or matching.  It is never an error if a string contains
  no match for a pattern.
\end{excdesc}
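One way to handle this exception is sketched below.  The helper
\code{compiles()} is a hypothetical name introduced for the example;
it is not part of the module.

```python
import re

def compiles(pattern):
    """Return True if pattern is a valid regular expression.

    Hypothetical helper for illustration only.
    """
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

bad = compiles("(abc")      # unmatched parenthesis
good = compiles("(abc)")

# A failed search is not an error; it simply returns None.
no_match = re.search("x", "abc")
```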


\subsection{Regular Expression Objects \label{re-objects}}

Compiled regular expression objects support the following methods and
attributes:

\begin{methoddesc}[RegexObject]{search}{string\optional{, pos\optional{,
                                        endpos}}}
  Scan through \var{string} looking for a location where this regular
  expression produces a match, and return a
  corresponding \class{MatchObject} instance.  Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.

  The optional \var{pos} and \var{endpos} parameters have the same
  meaning as for the \method{match()} method.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{match}{string\optional{, pos\optional{,
                                       endpos}}}
  If zero or more characters at the beginning of \var{string} match
  this regular expression, return a corresponding
  \class{MatchObject} instance.  Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.

  \strong{Note:} If you want to locate a match anywhere in
  \var{string}, use \method{search()} instead.

  The optional second parameter \var{pos} gives an index in the string
  where the search is to start; it defaults to \code{0}.  This is not
  completely equivalent to slicing the string; the \code{'\^'} pattern
  character matches at the real beginning of the string and at positions
  just after a newline, but not necessarily at the index where the search
  is to start.

  The optional parameter \var{endpos} limits how far the string will
  be searched; it will be as if the string is \var{endpos} characters
  long, so only the characters from \var{pos} to \var{endpos} will be
  searched for a match.
\end{methoddesc}
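The difference between passing \var{pos} and slicing the string, and
the effect of \var{endpos}, can be sketched as follows.  The patterns
and strings are assumptions chosen for the example.

```python
import re

caret = re.compile("^a")

# Slicing creates a new string whose real beginning is the "a".
sliced = caret.match("ba"[1:])

# With pos=1 the scan starts at the "a", but "^" still refers to the
# real beginning of "ba", so there is no match.
offset = caret.match("ba", 1)

# With endpos=4 the string behaves as if it were only "ab12".
digits = re.compile(r"\d+")
limited = digits.search("ab1234", 0, 4)
```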

\begin{methoddesc}[RegexObject]{split}{string\optional{,
                                       maxsplit\code{ = 0}}}
Identical to the \function{split()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{findall}{string}
Identical to the \function{findall()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}}
Identical to the \function{sub()} function, using the compiled pattern.
\end{methoddesc}

\begin{methoddesc}[RegexObject]{subn}{repl, string\optional{,
                                      count\code{ = 0}}}
Identical to the \function{subn()} function, using the compiled pattern.
\end{methoddesc}


\begin{memberdesc}[RegexObject]{flags}
The flags argument used when the RE object was compiled, or
\code{0} if no flags were provided.
\end{memberdesc}

\begin{memberdesc}[RegexObject]{groupindex}
A dictionary mapping any symbolic group names defined by
\regexp{(?P<\var{id}>)} to group numbers.  The dictionary is empty if no
symbolic groups were used in the pattern.
\end{memberdesc}

\begin{memberdesc}[RegexObject]{pattern}
The pattern string from which the RE object was compiled.
\end{memberdesc}
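A brief sketch of inspecting these attributes on a compiled object.
The pattern and the group names \code{int} and \code{frac} are
assumptions made for the example.  The flag is tested with a bit mask
instead of comparing \member{flags} to an exact number, since an
implementation may set additional bits.

```python
import re

source = r"(?P<int>\d+)\.(?P<frac>\d*)"
compiled = re.compile(source, re.IGNORECASE)

# Copy the mapping into a plain dictionary for easy comparison.
names = dict(compiled.groupindex)

# The original pattern string is kept on the object.
same_source = compiled.pattern == source

# Check that the IGNORECASE bit is present in the stored flags.
ignorecase_set = bool(compiled.flags & re.IGNORECASE)
```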


\subsection{Match Objects \label{match-objects}}

\class{MatchObject} instances support the following methods and attributes:

\begin{methoddesc}[MatchObject]{expand}{template}
Return the string obtained by doing backslash substitution on the
template string \var{template}, as done by the \method{sub()} method.
Escapes such as \samp{\e n} are converted to the appropriate
characters, and numeric backreferences (\samp{\e 1}, \samp{\e 2}) and named
backreferences (\samp{\e g<1>}, \samp{\e g<name>}) are replaced by the contents of the
corresponding group.
\end{methoddesc}
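A small sketch of \method{expand()}, mixing a named and a numeric
backreference in one template.  The pattern, the group names and the
input are assumptions made for the example.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Ada Lovelace")

# \g<last> refers to the group named "last"; \1 refers to group 1.
formal = m.expand(r"\g<last>, \1")
```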

\begin{methoddesc}[MatchObject]{group}{\optional{group1, \moreargs}}
Return one or more subgroups of the match.  If there is a single
argument, the result is a single string; if there are
multiple arguments, the result is a tuple with one item per argument.
Without arguments, \var{group1} defaults to zero (the whole match
is returned).
If a \var{groupN} argument is zero, the corresponding return value is the
entire matching string; if it is in the inclusive range [1..99], it is
the string matching the corresponding parenthesized group.  If a
group number is negative or larger than the number of groups defined
in the pattern, an \exception{IndexError} exception is raised.
If a group is contained in a part of the pattern that did not match,
the corresponding result is \code{None}.  If a group is contained in a
part of the pattern that matched multiple times, the last match is
returned.

If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax,
the \var{groupN} arguments may also be strings identifying groups by
their group name.  If a string argument is not used as a group name in
the pattern, an \exception{IndexError} exception is raised.

A moderately complicated example:

\begin{verbatim}
m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
\end{verbatim}

After performing this match, \code{m.group(1)} is \code{'3'}, as is
\code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
\end{methoddesc}
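The remaining cases described for \method{group()} (several arguments,
a group that did not take part in the match, a group that matched
repeatedly, and an out-of-range group number) can be sketched as
follows.  The patterns and inputs are assumptions made for the example.

```python
import re

m = re.match(r"(\w+)@(\w+)(\.\w+)?", "user@host")

whole = m.group()         # no arguments: the entire match
both = m.group(1, 2)      # several arguments: a tuple
missing = m.group(3)      # group 3 did not participate in the match

# A group inside a repeated part keeps only its last match.
repeated = re.match(r"(?:(\d),?)+", "1,2,3").group(1)

# Asking for a group the pattern does not define raises IndexError.
try:
    m.group(4)
    out_of_range = False
except IndexError:
    out_of_range = True
```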

\begin{methoddesc}[MatchObject]{groups}{\optional{default}}
Return a tuple containing all the subgroups of the match, from 1 up to
however many groups are in the pattern.  The \var{default} argument is
used for groups that did not participate in the match; it defaults to
\code{None}.  (Incompatibility note: in the original Python 1.5
release, if the tuple was one element long, a string would be returned
instead.  In later versions (from 1.5.1 on), a singleton tuple is
returned in such cases.)
\end{methoddesc}

\begin{methoddesc}[MatchObject]{groupdict}{\optional{default}}
Return a dictionary containing all the \emph{named} subgroups of the
match, keyed by the subgroup name.  The \var{default} argument is
used for groups that did not participate in the match; it defaults to
\code{None}.
\end{methoddesc}
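The effect of the \var{default} argument of \method{groups()} and
\method{groupdict()} can be sketched with a pattern whose second group
is optional.  The pattern, group names and input are assumptions made
for the example.

```python
import re

# The fractional part is optional and is absent from "42".
m = re.match(r"(?P<int>\d+)(?:\.(?P<frac>\d+))?", "42")

plain = m.groups()             # unmatched group reported as None
filled = m.groups("0")         # unmatched group replaced by "0"

by_name = m.groupdict()        # same information keyed by group name
by_name_filled = m.groupdict("0")
```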

\begin{methoddesc}[MatchObject]{start}{\optional{group}}
\funcline{end}{\optional{group}}
Return the indices of the start and end of the substring
matched by \var{group}; \var{group} defaults to zero (meaning the whole
matched substring).
Return \code{-1} if \var{group} exists but
did not contribute to the match.  For a match object
\var{m}, and a group \var{g} that did contribute to the match, the
substring matched by group \var{g} (equivalent to
\code{\var{m}.group(\var{g})}) is

\begin{verbatim}
m.string[m.start(g):m.end(g)]
\end{verbatim}

Note that
\code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
\var{group} matched a null string.  For example, after \code{\var{m} =
re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
\code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
\code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
an \exception{IndexError} exception.
\end{methoddesc}

\begin{methoddesc}[MatchObject]{span}{\optional{group}}
For \class{MatchObject} \var{m}, return the 2-tuple
\code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
Note that if \var{group} did not contribute to the match, this is
\code{(-1, -1)}.  Again, \var{group} defaults to zero.
\end{methoddesc}
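The relationship between \method{start()}, \method{end()} and
\method{span()} can be sketched as follows.  The pattern extends the
\code{'b(c?)'} example with a hypothetical second group that never
participates in the match.

```python
import re

m = re.search(r"b(c?)(d)?", "cba")

whole_span = m.span()     # the "b" at index 1
empty_span = m.span(1)    # group 1 matched the empty string at index 2
unused_span = m.span(2)   # group 2 exists but did not contribute

# For a group that contributed, slicing m.string gives the same text
# as group().
sliced = m.string[m.start(1):m.end(1)]
```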

\begin{memberdesc}[MatchObject]{pos}
The value of \var{pos} which was passed to the
\method{search()} or \method{match()} method of the
\class{RegexObject}.  This is the index
into the string at which the RE engine started looking for a match.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{endpos}
The value of \var{endpos} which was passed to the
\method{search()} or \method{match()} method of the
\class{RegexObject}.  This is the index
into the string beyond which the RE engine will not go.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{lastgroup}
The name of the last matched capturing group, or \code{None} if the
group didn't have a name, or if no group was matched at all.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{lastindex}
The integer index of the last matched capturing group, or \code{None}
if no group was matched at all.
\end{memberdesc}
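A short sketch of \member{lastindex} and \member{lastgroup}; the
patterns, the group name \code{word} and the inputs are assumptions
made for the example.

```python
import re

# The last group that matched is group 2, which is named "word".
named = re.match(r"(a)(?P<word>b)?", "ab")

# Only the unnamed group 1 matched, so there is an index but no name.
unnamed = re.match(r"(a)(?P<word>b)?", "a")

# The pattern has no groups at all, so both attributes are None.
no_groups = re.match(r"a", "a")
```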

\begin{memberdesc}[MatchObject]{re}
The regular expression object whose \method{match()} or
\method{search()} method produced this \class{MatchObject} instance.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{string}
The string passed to \function{match()} or \function{search()}.
\end{memberdesc}