\section{Built-in Module \sectcode{re}}
\label{module-re}

\bimodindex{re}

% XXX Remove before 1.5final release.
{\large\bf The \code{re} module is still in the process of being
developed, and more features will be added in future 1.5 alphas and
betas. This documentation is also preliminary and incomplete. If you
find a bug or documentation error, or just find something unclear,
please send a message to
\code{string-sig@python.org}, and we'll fix it.}

This module provides regular expression matching operations similar to
those found in Perl. It's 8-bit clean: both patterns and strings may
contain null bytes and characters whose high bit is set. It is always
available.

Regular expressions use the backslash character (\code{\e}) to
indicate special forms or to allow special characters to be used
without invoking their special meaning. This collides with Python's
usage of the same character for the same purpose in string literals;
for example, to match a literal backslash, one might have to write
\code{\e\e\e\e} as the pattern string, because the regular expression
must be \code{\e\e}, and each backslash must be expressed as
\code{\e\e} inside a regular Python string literal.

The solution is to use Python's raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So \code{r"\e n"} is a
two-character string containing a backslash and the letter 'n', while
\code{"\e n"} is a one-character string containing a newline. Usually
patterns will be expressed in Python code using this raw string notation.
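
As a minimal sketch of the difference (written for a current Python
interpreter, which is an assumption; the variable names are purely
illustrative), the following compares the two spellings of a pattern
that matches one literal backslash:
%
\bcode\begin{verbatim}
import re

# Without raw strings, matching one literal backslash takes four
# backslashes in the source text of the pattern.
plain = "\\\\"
# With a raw string, two backslashes are enough.
raw = r"\\"
assert plain == raw           # both spell the same two-character RE

text = "C:\\temp"             # the 7-character string C:\temp
m = re.search(raw, text)
assert m.start() == 2         # the backslash sits at index 2

# r"\n" is a backslash plus 'n'; "\n" is a single newline character.
assert len(r"\n") == 2
assert len("\n") == 1
\end{verbatim}\ecode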

% XXX Can the following section be dropped, or should it be boiled down?

%\strong{Please note:} There is a little-known fact about Python string
%literals which means that you don't usually have to worry about
%doubling backslashes, even though they are used to escape special
%characters in string literals as well as in regular expressions. This
%is because Python doesn't remove backslashes from string literals if
%they are followed by an unrecognized escape character.
%\emph{However}, if you want to include a literal \dfn{backslash} in a
%regular expression represented as a string literal, you have to
%\emph{quadruple} it or enclose it in a singleton character class.
%E.g.\ to extract \LaTeX\ \code{\e section\{{\rm
%\ldots}\}} headers from a document, you can use this pattern:
%\code{'[\e ] section\{\e (.*\e )\}'}. \emph{Another exception:}
%the escape sequence \code{\e b} is significant in string literals
%(where it means the ASCII bell character) as well as in Emacs regular
%expressions (where it stands for a word boundary), so in order to
%search for a word boundary, you should use the pattern \code{'\e \e b'}.
%Similarly, a backslash followed by a digit 0-7 should be doubled to
%avoid interpretation as an octal escape.

\subsection{Regular Expressions}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression. If a string \emph{p}
matches \emph{A} and another string \emph{q} matches \emph{B}, the
string \emph{pq} will match \emph{AB}. Thus, complex expressions can
easily be constructed
from simpler primitive expressions like the ones described here. For
details of the theory and implementation of regular expressions,
consult the Friedl book referenced below, or almost any textbook about
compiler construction.

A brief explanation of the format of regular expressions follows.
%For further information and a gentler presentation, consult XXX somewhere.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like '\code{A}', '\code{a}', or '\code{0}',
are the simplest regular expressions; they simply match themselves.
You can concatenate ordinary characters, so '\code{last}' matches the
characters 'last'. (In the rest of this section, we'll write RE's in
\code{this special font}, usually without quotes, and strings to be
matched 'in single quotes'.)

Some characters, like \code{|} or \code{(}, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

The special characters are:
\begin{itemize}
\item[\code{.}] (Dot.) In the default mode, this matches any
character except a newline. If the \code{DOTALL} flag has been
specified, this matches any character including a newline.
\item[\code{\^}] (Caret.) Matches the start of the string, and in
\code{MULTILINE} mode also immediately after each newline.
\item[\code{\$}] Matches the end of the string.
\code{foo} matches both 'foo' and 'foobar', while the regular
expression '\code{foo\$}' matches only 'foo'.
%
\item[\code{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE, as many repetitions
as are possible. \code{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.
%
\item[\code{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\code{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.
%
\item[\code{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE. \code{ab?} will
match either 'a' or 'ab'.
\item[\code{*?}, \code{+?}, \code{??}] The \code{*}, \code{+}, and
\code{?} qualifiers are all \dfn{greedy}; they match as much text as
possible. Sometimes this behaviour isn't desired; if the RE
\code{<.*>} is matched against \code{<H1>title</H1>}, it will match the
entire string, and not just \code{<H1>}.
Adding \code{?} after the qualifier makes it perform the match in
\dfn{non-greedy} or \dfn{minimal} fashion; as few characters as
possible will be matched. Using \code{.*?} in the previous
expression will match only \code{<H1>}.
%
\item[\code{\e}] Either escapes special characters (permitting you to match
characters like '*?+\&\$'), or signals a special sequence; special
sequences are discussed below.

If you're not using a raw string to
express the pattern, remember that Python also uses the
backslash as an escape character in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string. However,
if Python would recognize the resulting sequence, the backslash must
be doubled. This is complicated and hard to understand, so
it's highly recommended that you use raw strings.
%
\item[\code{[]}] Used to indicate a set of characters. Characters can
be listed individually, or a range can be indicated by giving two
characters and separating them by a '-'. Special characters are not
active inside sets. For example, \code{[akm\$]} will match any of the
characters 'a', 'k', 'm', or '\$'; \code{[a-z]} will match any
lowercase letter and \code{[a-zA-Z0-9]} matches any letter or digit.
Character classes of the form \code{\e \var{X}} defined below are also
acceptable.
If you want to include a \code{]} or a \code{-} inside a
set, precede it with a backslash.

Characters that are \emph{not} in a set can be matched by including a
\code{\^} as the first character of the set; \code{\^} elsewhere will
simply match the '\code{\^}' character.
%
\item[\code{|}]\code{A|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B. This can
be used inside groups (see below) as well. To match a literal '|',
use \code{\e|}, or enclose it inside a character class, like \code{[|]}.
%
\item[\code{(...)}] Matches whatever regular expression is inside the
parentheses, and indicates the start and end of a group; the
contents of a group can be retrieved after a match has been performed,
and can be matched later in the string with the
\code{\e \var{number}} special sequence, described below. To match the
literals '(' or ')',
use \code{\e(} or \code{\e)}, or enclose them inside a character
class: \code{[(] [)]}.
%
\item[\code{(?...)}] This is an extension notation (a '?' following a
'(' is not meaningful otherwise). The first character after the '?'
determines the meaning and further syntax of the construct.
Following are the currently supported extensions.
%
\item[\code{(?iLmsx)}] (One or more letters from the set 'i', 'L', 'm', 's',
'x'.) The group matches the empty string; the letters set the
corresponding flags (re.I, re.L, re.M, re.S, re.X) for the entire regular
expression. (The flag 'L' is uppercase because it is not in standard Perl.)
This is useful if you wish to include the flags as part of the regular
expression, instead of passing a \var{flags} argument to the
\code{compile} function.
%
\item[\code{(?:...)}] A non-grouping version of regular parentheses.
Matches whatever's inside the parentheses, but the text matched by the
group \emph{cannot} be retrieved after performing a match or
referenced later in the pattern.
%
\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
the text matched by the group is accessible via the symbolic group
name \var{name}. Group names must be valid Python identifiers. A
symbolic group is also a numbered group, just as if the group were not
named. So the group named 'id' in the example below can also be
referenced as the numbered group 1.

For example, if the pattern string is
\code{r'(?P<id>[a-zA-Z_]\e w*)'}, the group can be referenced by its
name in arguments to methods of match objects, such as \code{m.group('id')}
or \code{m.end('id')}, and also by name in pattern text (e.g. \code{(?P=id)}) and
replacement text (e.g. \code{\e g<id>}).
%
\item[\code{(?\#...)}] A comment; the contents of the parentheses are
simply ignored.
%
\item[\code{(?=...)}] Matches if \code{...} matches next, but doesn't
consume any of the string. This is called a lookahead assertion. For
example,
\code{Isaac (?=Asimov)} will match 'Isaac~' only if it's followed by 'Asimov'.
%
\item[\code{(?!...)}] Matches if \code{...} doesn't match next. This
is a negative lookahead assertion. For example,
\code{Isaac (?!Asimov)} will match 'Isaac~' only if it's \emph{not} followed by 'Asimov'.

\end{itemize}
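
The following sketch illustrates three of the constructs listed
above: greedy versus non-greedy repetition, a named group, and the
two lookahead assertions. It assumes a current Python interpreter,
and the sample strings and variable names are illustrative only.
%
\bcode\begin{verbatim}
import re

html = '<H1>title</H1>'
# Greedy: .* runs to the last '>' in the string.
greedy = re.match(r'<.*>', html).group(0)
# Non-greedy: .*? stops at the first '>'.
minimal = re.match(r'<.*?>', html).group(0)
assert greedy == '<H1>title</H1>'
assert minimal == '<H1>'

# A named group is also reachable by its number.
m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
assert m.group('id') == 'spam_1'
assert m.group(1) == 'spam_1'

# Lookahead assertions test the following text without consuming it.
ahead = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov')
assert ahead.group(0) == 'Isaac '
assert re.match(r'Isaac (?!Asimov)', 'Isaac Asimov') is None
assert re.match(r'Isaac (?!Asimov)', 'Isaac Newton') is not None
\end{verbatim}\ecode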

The special sequences consist of '\code{\e}' and a character from the
list below. If the ordinary character is not on the list, then the
resulting RE will match the second character. For example,
\code{\e\$} matches the character '\$'. Ones where the backslash
should be doubled are indicated.

\begin{itemize}

%
\item[\code{\e \var{number}}] Matches the contents of the group of the
same number. Groups are numbered starting from 1. For example,
\code{(.+) \e 1} matches 'the the' or '55 55', but not 'the end' (note
the space after the group). This special sequence can only be used to
match one of the first 99 groups. If the first digit of \var{number}
is 0, or \var{number} is 3 octal digits long, it will not be interpreted
as a group match, but as the character with octal value \var{number}.
%
\item[\code{\e A}] Matches only at the start of the string.
%
\item[\code{\e b}] Matches the empty string, but only at the
beginning or end of a word. A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character.
%
\item[\code{\e B}] Matches the empty string, but only when it is
\emph{not} at the beginning or end of a word.
%
\item[\code{\e d}] Matches any decimal digit; this is
equivalent to the set \code{[0-9]}.
%
\item[\code{\e D}] Matches any non-digit character; this is
equivalent to the set \code{[{\^}0-9]}.
%
\item[\code{\e s}] Matches any whitespace character; this is
equivalent to the set \code{[ \e t\e n\e r\e f\e v]}.
%
\item[\code{\e S}] Matches any non-whitespace character; this is
equivalent to the set \code{[{\^} \e t\e n\e r\e f\e v]}.
%
\item[\code{\e w}] When the LOCALE flag is not specified, matches any
alphanumeric character; this is
equivalent to the set \code{[a-zA-Z0-9_]}. With LOCALE, it will match
the set \code{[0-9_]} plus whatever characters are defined as letters
for the current locale.
%
\item[\code{\e W}] When the LOCALE flag is not specified, matches any
non-alphanumeric character; this is equivalent to the set
\code{[{\^}a-zA-Z0-9_]}. With LOCALE, it will match any character
not in the set \code{[0-9_]}, and not defined as a letter
for the current locale.

\item[\code{\e Z}] Matches only at the end of the string.
%

\item[\code{\e \e}] Matches a literal backslash.

\end{itemize}
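
A few of the special sequences can be tried out as follows. This is
only a sketch for a current Python interpreter (an assumption), and
the sample strings are illustrative. Note that \code{match()} is used
for the back-reference so that the pattern is anchored at the start
of the string.
%
\bcode\begin{verbatim}
import re

# \number refers back to the text matched by an earlier group.
doubled = re.match(r'(.+) \1', 'the the')
assert doubled.group(1) == 'the'
assert re.match(r'(.+) \1', 'the end') is None

# \b matches only at the beginning or end of a word.
assert re.search(r'\bfoo\b', 'foo bar') is not None
assert re.search(r'\bfoo\b', 'foobar') is None

# \d, \s and \w stand for sets of characters.
assert re.search(r'\d+', 'abc 123').group(0) == '123'
assert re.match(r'\w+\s\w+', 'spam eggs').group(0) == 'spam eggs'
\end{verbatim}\ecode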

\subsection{Module Contents}

The module defines the following functions and constants, and an exception:

\renewcommand{\indexsubitem}{(in module re)}

\begin{funcdesc}{compile}{pattern\optional{\, flags}}
  Compile a regular expression pattern into a regular expression
  object, which can be used for matching using its \code{match} and
  \code{search} methods, described below.

  The expression's behaviour can be modified by specifying a
  \var{flags} value. Values can be any of the following variables,
  combined using bitwise OR (the \code{|} operator).

\begin{itemize}

\item[I ] or IGNORECASE:
Perform case-insensitive matching; expressions like \code{[A-Z]} will match
lowercase letters, too.

\item[L ] or LOCALE:
Make \code{\e w}, \code{\e W}, \code{\e b}, and \code{\e B} dependent on
the current locale.

\item[M ] or MULTILINE:
When specified, the pattern character \code{\^} matches at the
beginning of the string and at the beginning of each line (immediately
following each newline); and the pattern character \code{\$} matches
at the end of the string and at the end of each line (immediately
preceding each newline).

By default, \code{\^} matches only at the beginning of the string, and
\code{\$} only at the end of the string and immediately before the
newline (if any) at the end of the string.

\item[S ] or DOTALL:
Make the \code{.} special character match a newline; without this
flag, \code{.} will match anything \emph{except} a newline.

\item[X ] or VERBOSE:
When specified, whitespace within the pattern string is ignored except
when in a character class or preceded by an unescaped backslash. In
addition, when a line contains a \code{\#} that is neither in a
character class nor preceded by an unescaped backslash, all characters
from the leftmost such \code{\#} through the end of the line are ignored.

\end{itemize}

  The sequence
%
\bcode\begin{verbatim}
prog = re.compile(pat)
result = prog.match(str)
\end{verbatim}\ecode
%
is equivalent to
%
\bcode\begin{verbatim}
result = re.match(pat, str)
\end{verbatim}\ecode
%
but the version using \code{compile()} is more efficient when multiple
regular expressions are used concurrently in a single program.
%(The compiled version of the last pattern passed to \code{regex.match()} or
%\code{regex.search()} is cached, so programs that use only a single
%regular expression at a time needn't worry about compiling regular
%expressions.)
\end{funcdesc}
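
The flags described for \code{compile()} can be combined as in the
following sketch. It assumes a current Python interpreter; the
pattern and the names \code{number} and \code{m} are illustrative
and not part of the module.
%
\bcode\begin{verbatim}
import re

# VERBOSE lets the pattern carry whitespace and comments;
# flags are combined with the | operator.
number = re.compile(r"""
    (\d+)    # integer part
    \.       # a literal dot
    (\d*)    # optional fractional digits
    """, re.VERBOSE | re.IGNORECASE)
m = number.match('3.14')
assert m.group(1) == '3' and m.group(2) == '14'

# MULTILINE: ^ also matches just after each newline.
assert re.compile(r'^b').search('a\nb') is None
assert re.compile(r'^b', re.MULTILINE).search('a\nb').start() == 2

# DOTALL: . also matches a newline.
assert re.compile(r'a.b').match('a\nb') is None
assert re.compile(r'a.b', re.DOTALL).match('a\nb') is not None
\end{verbatim}\ecode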

\begin{funcdesc}{escape}{string}
Return \var{string} with all non-alphanumerics backslashed; this is
useful if you want to match an arbitrary literal string which may have
regular expression metacharacters in it.
\end{funcdesc}
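
A small sketch of why \code{escape()} matters, assuming a current
Python interpreter (the strings are illustrative):
%
\bcode\begin{verbatim}
import re

literal = 'a.b*c'              # contains the metacharacters . and *
pattern = re.escape(literal)   # backslashes added where needed

# Unescaped, the text is read as an RE and matches other strings too.
assert re.match(literal, 'aXbbc') is not None
# Escaped, it matches only the literal text itself.
assert re.match(pattern, 'aXbbc') is None
assert re.match(pattern, 'a.b*c').group(0) == 'a.b*c'
\end{verbatim}\ecode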

\begin{funcdesc}{match}{pattern\, string\optional{\, flags}}
  If zero or more characters at the beginning of \var{string} match
  the regular expression \var{pattern}, return a corresponding
  \code{Match} object. Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.
\end{funcdesc}

\begin{funcdesc}{search}{pattern\, string\optional{\, flags}}
  Scan through \var{string} looking for a location where the regular
  expression \var{pattern} produces a match. Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.
\end{funcdesc}
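
The difference between \code{match()} and \code{search()}, and
between ``no match'' and a zero-length match, can be sketched as
follows (assuming a current Python interpreter; the strings are
illustrative):
%
\bcode\begin{verbatim}
import re

# match() only tries the beginning of the string ...
assert re.match('c', 'abcdef') is None
# ... while search() scans forward for the first match.
assert re.search('c', 'abcdef').start() == 2

# A zero-length match is still a match, not None.
empty = re.match('x*', 'abc')
assert empty is not None and empty.group(0) == ''
\end{verbatim}\ecode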

\begin{funcdesc}{split}{pattern\, string\optional{, maxsplit=0}}
  Split \var{string} by the occurrences of \var{pattern}. If
  capturing parentheses are used in pattern, then occurrences of
  patterns or subpatterns are also returned.
%
\bcode\begin{verbatim}
>>> re.split('[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split('([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
\end{verbatim}\ecode
%
  This function combines and extends the functionality of
  the old \code{regex.split()} and \code{regex.splitx()}.
\end{funcdesc}
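
The \var{maxsplit} argument is not described in detail above. As a
sketch of how it behaves in a current Python interpreter (an
assumption, since this documentation predates it), a non-zero value
limits the number of splits and leaves the remainder of the string as
the last list element:
%
\bcode\begin{verbatim}
import re

# At most one split; the rest of the string is returned unsplit.
parts = re.split(r'\W+', 'Words, words, words.', maxsplit=1)
assert parts == ['Words', 'words, words.']

# With capturing parentheses the separators are returned as well.
assert re.split(r'(\W+)', 'a, b') == ['a', ', ', 'b']
\end{verbatim}\ecode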

\begin{funcdesc}{sub}{pattern\, repl\, string\optional{, count=0}}
Return the string obtained by replacing the leftmost non-overlapping
occurrences of \var{pattern} in \var{string} by the replacement
\var{repl}. If the pattern isn't found, \var{string} is returned
unchanged. \var{repl} can be a string or a function; if a function,
it is called for every non-overlapping occurrence of \var{pattern}.
The function takes a single match object argument, and returns the
replacement string. For example:
%
\bcode\begin{verbatim}
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
\end{verbatim}\ecode
%
The pattern may be a string or a
regexp object; if you need to specify
regular expression flags, you must use a regexp object, or use
embedded modifiers in a pattern string; e.g.
%
\bcode\begin{verbatim}
sub("(?i)b+", "x", "bbbb BBBB") returns 'x x'.
\end{verbatim}\ecode
%
The optional argument \var{count} is the maximum number of pattern
occurrences to be replaced; \var{count} must be a non-negative integer, and
the default value of 0 means to replace all occurrences.

Empty matches for the pattern are replaced only when not adjacent to a
previous match, so \code{sub('x*', '-', 'abc')} returns '-a-b-c-'.
\end{funcdesc}

\begin{funcdesc}{subn}{pattern\, repl\, string\optional{, count=0}}
Perform the same operation as \code{sub()}, but return a tuple
\code{(new_string, number_of_subs_made)}.
\end{funcdesc}
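
The following sketch shows \var{count}, the tuple returned by
\code{subn()}, and a named group reference in the replacement text.
It assumes a current Python interpreter; the strings and the name
\code{swapped} are illustrative.
%
\bcode\begin{verbatim}
import re

# count=1 replaces only the leftmost occurrence.
assert re.sub('-', ' ', 'a-b-c', count=1) == 'a b-c'

# subn() also reports how many substitutions were made.
assert re.subn('-', ' ', 'a-b-c') == ('a b c', 2)

# \g<name> in the replacement refers to a named group.
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>', 'hello world')
assert swapped == 'world hello'
\end{verbatim}\ecode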

\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (e.g., unmatched parentheses) or
  when some other error occurs during compilation or matching. (It is
  never an error if a string contains no match for a pattern.)
\end{excdesc}

\subsection{Regular Expression Objects}
Compiled regular expression objects support the following methods and
attributes:

\renewcommand{\indexsubitem}{(re method)}
\begin{funcdesc}{match}{string\optional{\, pos}\optional{\, endpos}}
  If zero or more characters at the beginning of \var{string} match
  this regular expression, return a corresponding
  \code{Match} object. Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.

  The optional second parameter \var{pos} gives an index in the string
  where the search is to start; it defaults to \code{0}. This is not
  completely equivalent to slicing the string; the \code{'\^'} pattern
  character matches at the real beginning of the string and at positions
  just after a newline, not necessarily at the index where the search
  is to start.

  The optional parameter \var{endpos} limits how far the string will
  be searched; it will be as if the string is \var{endpos} characters
  long, so only the characters from \var{pos} to \var{endpos} will be
  searched for a match.
\end{funcdesc}

\begin{funcdesc}{search}{string\optional{\, pos}\optional{\, endpos}}
  Scan through \var{string} looking for a location where this regular
  expression produces a match. Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.

  The optional \var{pos} and \var{endpos} parameters have the same
  meaning as for the \code{match} method.
\end{funcdesc}
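
The effect of \var{pos} and \var{endpos}, and the way \var{pos}
differs from slicing, can be sketched as follows. This assumes a
current Python interpreter; the names \code{prog} and \code{letter}
are illustrative.
%
\bcode\begin{verbatim}
import re

prog = re.compile('^b')
# Slicing makes 'b' the first character, so ^ matches.
assert prog.search('abc'[1:]) is not None
# With pos=1 the search starts at 'b', but ^ still refers to the
# real beginning of the string, so there is no match.
assert prog.search('abc', 1) is None

letter = re.compile('c')
# Starting at index 3 skips the first 'c'.
assert letter.search('abcabc', 3).start() == 5
# endpos=2 limits the search to the characters 'ab'.
assert letter.search('abcabc', 0, 2) is None
\end{verbatim}\ecode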

\begin{funcdesc}{split}{string\optional{, maxsplit=0}}
Identical to the \code{split} function, using the compiled pattern.
\end{funcdesc}

\begin{funcdesc}{sub}{repl\, string\optional{, count=0}}
Identical to the \code{sub} function, using the compiled pattern.
\end{funcdesc}

\begin{funcdesc}{subn}{repl\, string\optional{, count=0}}
Identical to the \code{subn} function, using the compiled pattern.
\end{funcdesc}

\renewcommand{\indexsubitem}{(regex attribute)}

\begin{datadesc}{flags}
The flags argument used when the regex object was compiled, or 0 if no
flags were provided.
\end{datadesc}

\begin{datadesc}{groupindex}
A dictionary mapping any symbolic group names (defined by
\code{?P<\var{id}>}) to group numbers. The dictionary is empty if no
symbolic groups were used in the pattern.
\end{datadesc}

\begin{datadesc}{pattern}
The pattern string from which the regex object was compiled.
\end{datadesc}
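
A sketch of these attributes on a compiled object, assuming a
current Python interpreter (the pattern is illustrative). The value
of \code{flags} is only tested for the bit that was passed in, since
an interpreter may set additional bits of its own.
%
\bcode\begin{verbatim}
import re

prog = re.compile(r'(?P<id>[a-zA-Z_]\w*)', re.IGNORECASE)

assert prog.pattern == r'(?P<id>[a-zA-Z_]\w*)'
assert prog.groupindex['id'] == 1     # the name maps to group 1
assert prog.flags & re.IGNORECASE     # the flag passed in is set

# No symbolic groups: the mapping is empty.
assert len(re.compile('abc').groupindex) == 0
\end{verbatim}\ecode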

\subsection{Match Objects}
Match objects support the following methods and attributes:

\begin{funcdesc}{start}{group}
\end{funcdesc}

\begin{funcdesc}{end}{group}
Return the indices of the start and end of the substring
matched by \var{group}. Return \code{None} if \var{group} exists but
did not contribute to the match. Note that for a match object
\code{m}, and a group \code{g} that did contribute to the match, the
substring matched by group \code{g} is
\bcode\begin{verbatim}
    m.string[m.start(g):m.end(g)]
\end{verbatim}\ecode
%
Note too that \code{m.start(\var{group})} will equal
\code{m.end(\var{group})} if \var{group} matched a null string. For example,
after \code{m = re.search('b(c?)', 'cba')}, \code{m.start(0)} is 1,
\code{m.end(0)} is 2, \code{m.start(1)} and \code{m.end(1)} are both
2, and \code{m.start(2)} raises an
\code{IndexError} exception.
\end{funcdesc}

\begin{funcdesc}{span}{group}
Return the 2-tuple \code{(start(\var{group}), end(\var{group}))}.
Note that if \var{group} did not contribute to the match, this is
\code{(None, None)}.
\end{funcdesc}
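
The example from the description of \code{start()} and \code{end()}
can be written out as a short script. This is a sketch for a current
Python interpreter (an assumption); only groups that took part in the
match are inspected, and the name \code{raised} is illustrative.
%
\bcode\begin{verbatim}
import re

m = re.search('b(c?)', 'cba')

# Group 0 is the whole match: the 'b' at index 1.
assert (m.start(0), m.end(0)) == (1, 2)
assert m.span(0) == (1, 2)
assert m.string[m.start(0):m.end(0)] == 'b'

# Group 1 matched the empty string, so start and end coincide.
assert m.start(1) == m.end(1) == 2

# There is no group 2 in the pattern.
try:
    m.start(2)
    raised = False
except IndexError:
    raised = True
assert raised
\end{verbatim}\ecode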

\begin{funcdesc}{group}{\optional{g1, g2, ...}}
This method is only valid when the last call to the \code{match}
or \code{search} method found a match. It returns one or more
groups of the match. If there is a single \var{index} argument,
the result is a single string; if there are multiple arguments, the
result is a tuple with one item per argument. If the \var{index} is
zero, the corresponding return value is the entire matching string; if
it is in the inclusive range [1..99], it is the string matching the
corresponding parenthesized group (groups are delimited by
\code{(} and \code{)}, as described above). If no
such group exists, the corresponding result is \code{None}.

If the regular expression uses the \code{(?P<\var{name}>...)} syntax,
the \var{index} arguments may also be strings identifying groups by
their group name.
\end{funcdesc}
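
A sketch of the different ways to call \code{group()}, assuming a
current Python interpreter (the pattern and group names are
illustrative):
%
\bcode\begin{verbatim}
import re

m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Isaac Asimov')

assert m.group(0) == 'Isaac Asimov'           # the entire match
assert m.group(1) == 'Isaac'                  # one index: a string
assert m.group(1, 2) == ('Isaac', 'Asimov')   # several: a tuple
assert m.group('last') == 'Asimov'            # a group name
\end{verbatim}\ecode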

\begin{datadesc}{pos}
The value of \var{pos} which was passed to the
\code{search} or \code{match} function. This is the index into the
string at which the regex engine started looking for a match.
\end{datadesc}

\begin{datadesc}{endpos}
The value of \var{endpos} which was passed to the
\code{search} or \code{match} function. This is the index into the
string beyond which the regex engine will not go.
\end{datadesc}

\begin{datadesc}{re}
The regular expression object whose \code{match()} or \code{search()} method
produced this match object.
\end{datadesc}

\begin{datadesc}{string}
The string passed to \code{match()} or \code{search()}.
\end{datadesc}
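
These attributes can be inspected as in the following sketch, which
assumes a current Python interpreter (the names \code{prog} and
\code{m} are illustrative):
%
\bcode\begin{verbatim}
import re

prog = re.compile('b')
m = prog.search('abcabc', 2, 6)   # pos=2, endpos=6

assert m.start() == 4             # the first 'b' at or after index 2
assert m.pos == 2 and m.endpos == 6
assert m.re is prog               # the object that produced the match
assert m.string == 'abcabc'       # the full string, not a slice
\end{verbatim}\ecode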

\begin{seealso}
\seetext Jeffrey Friedl, \emph{Mastering Regular Expressions}.
\end{seealso}
