\section{Built-in Module \sectcode{re}}
\label{module-re}

\bimodindex{re}

This module provides regular expression matching operations similar to
those found in Perl. It's 8-bit clean: both patterns and strings may
contain null bytes and characters whose high bit is set. It is always
available.

Regular expressions use the backslash character (\code{\e}) to
indicate special forms or to allow special characters to be used
without invoking their special meaning. This collides with Python's
usage of the same character for the same purpose in string literals;
for example, to match a literal backslash, one might have to write
\code{\e\e\e\e} as the pattern string, because the regular expression
must be \code{\e\e}, and each backslash must be expressed as
\code{\e\e} inside a regular Python string literal.

The solution is to use Python's raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So \code{r"\e n"} is a
two-character string containing a backslash and the letter 'n', while
\code{"\e n"} is a one-character string containing a newline. Usually
patterns will be expressed in Python code using this raw string notation.
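
The difference between the two notations can be checked directly. The
following is a minimal sketch, assuming a current Python 3 interpreter;
the variable names and the sample string are illustrative only.

```python
import re

# Both literals denote the same two-character pattern: backslash, backslash.
plain_pattern = "\\\\"   # every backslash doubled for the string literal
raw_pattern = r"\\"      # raw string: written exactly as the RE needs it

# A sample string that contains one literal backslash (at index 2).
text = "C:\\temp"

match = re.search(raw_pattern, text)
backslash_index = match.start()
```

Both spellings produce the same pattern; the raw form simply avoids the
second level of escaping.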

\subsection{Regular Expression Syntax}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression. If a string \emph{p}
matches A and another string \emph{q} matches B, the string \emph{pq}
will match AB. Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here. For
details of the theory and implementation of regular expressions,
consult the Friedl book referenced below, or almost any textbook about
compiler construction.
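
The concatenation rule can be illustrated with ordinary string
concatenation of two patterns. This is only a sketch, assuming a
current Python 3 interpreter; the patterns and strings were chosen for
the illustration.

```python
import re

A = r"[0-9]+"    # matches a run of digits
B = r"[a-z]+"    # matches a run of lowercase letters
p = "2024"       # p matches A
q = "abc"        # q matches B

# The concatenated pattern AB matches the concatenated string pq.
combined = re.match(A + B, p + q)
```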

A brief explanation of the format of regular expressions follows.
%For further information and a gentler presentation, consult XXX somewhere.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like '\code{A}', '\code{a}', or '\code{0}',
are the simplest regular expressions; they simply match themselves.
You can concatenate ordinary characters, so '\code{last}' matches the
characters 'last'. (In the rest of this section, we'll write RE's in
\code{this special font}, usually without quotes, and strings to be
matched 'in single quotes'.)

Some characters, like \code{|} or \code{(}, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

The special characters are:
\begin{itemize}
\item[\code{.}] (Dot.) In the default mode, this matches any
character except a newline. If the \code{DOTALL} flag has been
specified, this matches any character including a newline.
\item[\code{\^}] (Caret.) Matches the start of the string, and in
\code{MULTILINE} mode also immediately after each newline.
\item[\code{\$}] Matches the end of the string, and in
\code{MULTILINE} mode also matches before a newline.
\code{foo} matches both 'foo' and 'foobar', while the regular
expression \code{foo\$} matches only 'foo'.
%
\item[\code{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE, as many repetitions
as are possible. \code{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.
%
\item[\code{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\code{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.
%
\item[\code{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE. \code{ab?} will
match either 'a' or 'ab'.
\item[\code{*?}, \code{+?}, \code{??}] The \code{*}, \code{+}, and
\code{?} qualifiers are all \dfn{greedy}; they match as much text as
possible. Sometimes this behaviour isn't desired; if the RE
\code{<.*>} is matched against \code{<H1>title</H1>}, it will match the
entire string, and not just \code{<H1>}.
Adding \code{?} after the qualifier makes it perform the match in
\dfn{non-greedy} or \dfn{minimal} fashion; as few characters as
possible will be matched. Using \code{.*?} in the previous
expression will match only \code{<H1>}.
%
\item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
\var{m} to \var{n} repetitions of the preceding RE, attempting to
match as many repetitions as possible. For example, \code{a\{3,5\}}
will match from 3 to 5 'a' characters.
%
\item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
match from \var{m} to \var{n} repetitions of the preceding RE,
attempting to match as \emph{few} repetitions as possible. This is
the non-greedy version of the previous qualifier. For example, on the
6-character string 'aaaaaa', \code{a\{3,5\}} will match 5 'a'
characters, while \code{a\{3,5\}?} will only match 3 characters.
%
\item[\code{\e}] Either escapes special characters (permitting you to match
characters like '*?+\&\$'), or signals a special sequence; special
sequences are discussed below.

If you're not using a raw string to
express the pattern, remember that Python also uses the
backslash as an escape sequence in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string. However,
if Python would recognize the resulting sequence, the backslash should
be repeated twice. This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the simplest expressions.
%
\item[\code{[]}] Used to indicate a set of characters. Characters can
be listed individually, or a range of characters can be indicated by
giving two characters and separating them by a '-'. Special
characters are not active inside sets. For example, \code{[akm\$]}
will match any of the characters 'a', 'k', 'm', or '\$'; \code{[a-z]}
will match any lowercase letter and \code{[a-zA-Z0-9]} matches any
letter or digit. Character classes such as \code{\e w} or \code{\e S}
(defined below) are also acceptable inside a range. If you want to
include a \code{]} or a \code{-} inside a set, precede it with a
backslash.

Characters \emph{not} within a range can be matched by including a
\code{\^} as the first character of the set; \code{\^} elsewhere will
simply match the '\code{\^}' character.
%
\item[\code{|}]\code{A|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B. This can
be used inside groups (see below) as well. To match a literal '\code{|}',
use \code{\e|}, or enclose it inside a character class, like \code{[|]}.
%
\item[\code{(...)}] Matches whatever regular expression is inside the
parentheses, and indicates the start and end of a group; the contents
of a group can be retrieved after a match has been performed, and can
be matched later in the string with the \code{\e \var{number}} special
sequence, described below. To match the literals '(' or ')',
use \code{\e(} or \code{\e)}, or enclose them inside a character
class: \code{[(] [)]}.
%
\item[\code{(?...)}] This is an extension notation (a '?' following a
'(' is not meaningful otherwise). The first character after the '?'
determines what the meaning and further syntax of the construct is.
Following are the currently supported extensions.
%
\item[\code{(?iLmsx)}] (One or more letters from the set 'i', 'L', 'm', 's',
'x'.) The group matches the empty string; the letters set the
corresponding flags (re.I, re.L, re.M, re.S, re.X) for the entire regular
expression. This is useful if you wish to include the flags as part of
the regular expression, instead of passing a \var{flags} argument to
the \code{compile} function.
%
\item[\code{(?:...)}] A non-grouping version of regular parentheses.
Matches whatever's inside the parentheses, but the text matched by the
group \emph{cannot} be retrieved after performing a match or
referenced later in the pattern.
%
\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
the text matched by the group is accessible via the symbolic group
name \var{name}. Group names must be valid Python identifiers. A
symbolic group is also a numbered group, just as if the group were not
named. So the group named 'id' in the example below can also be
referenced as the numbered group 1.

For example, if the pattern is
\code{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
name in arguments to methods of match objects, such as \code{m.group('id')}
or \code{m.end('id')}, and also by name in pattern text (e.g. \code{(?P=id)}) and
replacement text (e.g. \code{\e g<id>}).
%
\item[\code{(?P=\var{name})}] Matches whatever text was matched by the earlier group named \var{name}.
%
\item[\code{(?\#...)}] A comment; the contents of the parentheses are simply ignored.
%
\item[\code{(?=...)}] Matches if \code{...} matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example,
\code{Isaac (?=Asimov)} will match 'Isaac~' only if it's followed by 'Asimov'.
%
\item[\code{(?!...)}] Matches if \code{...} doesn't match next. This is a negative lookahead assertion. For example,
\code{Isaac (?!Asimov)} will match 'Isaac~' only if it's \emph{not} followed by 'Asimov'.

\end{itemize}
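
Several of the special characters above are easiest to understand by
trying them. The following sketch assumes a current Python 3
interpreter; it reuses the examples from the list, and the variable
names are illustrative only.

```python
import re

html = "<H1>title</H1>"
greedy = re.match(r"<.*>", html).group(0)     # as much text as possible
minimal = re.match(r"<.*?>", html).group(0)   # as little text as possible

most = re.match(r"a{3,5}", "aaaaaa").group(0)     # greedy repetition
fewest = re.match(r"a{3,5}?", "aaaaaa").group(0)  # non-greedy repetition

# Lookahead assertions test what follows without consuming it.
followed = re.match(r"Isaac (?=Asimov)", "Isaac Asimov")
not_followed = re.match(r"Isaac (?!Asimov)", "Isaac Asimov")

# A named group is reachable both by name and by number.
m = re.match(r"(?P<id>[a-zA-Z_]\w*)", "spam_1 = 2")
```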

The special sequences consist of '\code{\e}' and a character from the
list below. If the ordinary character is not on the list, then the
resulting RE will match the second character. For example,
\code{\e\$} matches the character '\$'.

\begin{itemize}

%
\item[\code{\e \var{number}}] Matches the contents of the group of the
same number. Groups are numbered starting from 1. For example,
\code{(.+) \e 1} matches 'the the' or '55 55', but not 'the end' (note
the space after the group). This special sequence can only be used to
match one of the first 99 groups. If the first digit of \var{number}
is 0, or \var{number} is 3 octal digits long, it will not be interpreted
as a group match, but as the character with octal value \var{number}.
%
\item[\code{\e A}] Matches only at the start of the string.
%
\item[\code{\e b}] Matches the empty string, but only at the
beginning or end of a word. A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character. Inside a character range,
\code{\e b} represents the backspace character, for compatibility with
Python's string literals.
%
\item[\code{\e B}] Matches the empty string, but only when it is
\emph{not} at the beginning or end of a word.
%
\item[\code{\e d}]Matches any decimal digit; this is
equivalent to the set \code{[0-9]}.
%
\item[\code{\e D}]Matches any non-digit character; this is
equivalent to the set \code{[{\^}0-9]}.
%
\item[\code{\e s}]Matches any whitespace character; this is
equivalent to the set \code{[ \e t\e n\e r\e f\e v]}.
%
\item[\code{\e S}]Matches any non-whitespace character; this is
equivalent to the set \code{[\^ \e t\e n\e r\e f\e v]}.
%
\item[\code{\e w}]When the LOCALE flag is not specified, matches any
alphanumeric character; this is
equivalent to the set \code{[a-zA-Z0-9_]}. With LOCALE, it will match
the set \code{[0-9_]} plus whatever characters are defined as letters
for the current locale.
%
\item[\code{\e W}]When the LOCALE flag is not specified, matches any
non-alphanumeric character; this is equivalent to the set
\code{[{\^}a-zA-Z0-9_]}. With LOCALE, it will match any character
not in the set \code{[0-9_]}, and not defined as a letter
for the current locale.

\item[\code{\e Z}]Matches only at the end of the string.
%

\item[\code{\e \e}] Matches a literal backslash.

\end{itemize}
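
A few of the special sequences in action, as a minimal sketch assuming
a current Python 3 interpreter. The sample strings other than
'the the' and 'the end' are made up for the illustration.

```python
import re

# \1 must repeat exactly the text that group 1 matched.
repeated = re.match(r"(.+) \1", "the the")
different = re.match(r"(.+) \1", "the end")

# \b anchors a match to a word boundary without consuming characters.
whole_word = re.search(r"\bfoo\b", "a foo b")
inside_word = re.search(r"\bfoo\b", "foobar")

# \d and \s are shorthands for character sets.
digits = re.search(r"\d+", "room 101").group(0)
first_space = re.search(r"\s", "room 101").start()
```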

\subsection{Module Contents}
\nodename{Contents of Module re}

The module defines the following functions and constants, and an exception:

\renewcommand{\indexsubitem}{(in module re)}

\begin{funcdesc}{compile}{pattern\optional{\, flags}}
  Compile a regular expression pattern into a regular expression
  object, which can be used for matching using its \code{match} and
  \code{search} methods, described below.

  The expression's behaviour can be modified by specifying a
  \var{flags} value. Values can be any of the following variables,
  combined using bitwise OR (the \code{|} operator).

\begin{description}

% The use of \quad in the item labels is ugly but adds enough space
% to the label that it doesn't get visually run-in with the text.

\item[I or IGNORECASE or \code{(?i)}\quad]

Perform case-insensitive matching; expressions like \code{[A-Z]} will match
lowercase letters, too. This is not affected by the current locale.

\item[L or LOCALE or \code{(?L)}\quad]

Make \code{\e w}, \code{\e W}, \code{\e b}, and
\code{\e B} dependent on the current locale.

\item[M or MULTILINE or \code{(?m)}\quad]

When specified, the pattern character \code{\^} matches at the
beginning of the string and at the beginning of each line
(immediately following each newline); and the pattern character
\code{\$} matches at the end of the string and at the end of each line
(immediately preceding each newline).
By default, \code{\^} matches only at the beginning of the string, and
\code{\$} only at the end of the string and immediately before the
newline (if any) at the end of the string.

\item[S or DOTALL or \code{(?s)}\quad]

Make the \code{.} special character match any character at all, including a
newline; without this flag, \code{.} will match anything \emph{except}
a newline.

\item[X or VERBOSE or \code{(?x)}\quad]

Ignore whitespace within the pattern
except when in a character class or preceded by an unescaped
backslash, and, when a line contains a \code{\#} neither in a character
class nor preceded by an unescaped backslash, all characters from the
leftmost such \code{\#} through the end of the line are ignored.

\end{description}

The sequence
%
\bcode\begin{verbatim}
prog = re.compile(pat)
result = prog.match(str)
\end{verbatim}\ecode
%
is equivalent to
%
\bcode\begin{verbatim}
result = re.match(pat, str)
\end{verbatim}\ecode
%
but the version using \code{compile()} is more efficient when the
expression will be used several times in a single program.
%(The compiled version of the last pattern passed to \code{regex.match()} or
%\code{regex.search()} is cached, so programs that use only a single
%regular expression at a time needn't worry about compiling regular
%expressions.)
\end{funcdesc}
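
The flags described for \code{compile()} can be tried out as follows.
This is a minimal sketch, assuming a current Python 3 interpreter; the
patterns, sample text, and variable names are illustrative only.

```python
import re

text = "ham\nSPAM-EGGS\nham"

# Combine flags with bitwise OR ...
prog = re.compile(r"^spam.eggs$", re.IGNORECASE | re.MULTILINE)
by_argument = prog.search(text)

# ... or embed the same flags at the start of the pattern.
by_modifier = re.search(r"(?im)^spam.eggs$", text)

# Without DOTALL, "." does not match a newline.
no_dotall = re.match(r"a.b", "a\nb")
with_dotall = re.match(r"a.b", "a\nb", re.DOTALL)

# VERBOSE lets a pattern carry whitespace and comments.
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d*      # fractional part
    """, re.VERBOSE)
verbose_match = number.match("3.14")
```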

\begin{funcdesc}{escape}{string}
  Return \var{string} with all non-alphanumerics backslashed; this is
  useful if you want to match an arbitrary literal string that may have
  regular expression metacharacters in it.
\end{funcdesc}
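
A typical use of \code{escape()} is sketched below, assuming a current
Python 3 interpreter. The exact set of characters that receives a
backslash may differ between Python versions, so the sketch only checks
how the escaped pattern behaves; the sample strings are illustrative.

```python
import re

literal = "a.b*c"              # contains the metacharacters . and *
escaped = re.escape(literal)   # safe to use as a pattern

# The escaped pattern matches only the literal text ...
exact = re.match(escaped, "a.b*c")
lookalike = re.match(escaped, "aXbbc")

# ... while the unescaped text is interpreted as a regular expression.
unescaped = re.match(literal, "aXbbc")
```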

\begin{funcdesc}{match}{pattern\, string\optional{\, flags}}
  If zero or more characters at the beginning of \var{string} match
  the regular expression \var{pattern}, return a corresponding
  \code{MatchObject} instance. Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.
\end{funcdesc}

\begin{funcdesc}{search}{pattern\, string\optional{\, flags}}
  Scan through \var{string} looking for a location where the regular
  expression \var{pattern} produces a match, and return a
  corresponding \code{MatchObject} instance.
  Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.
\end{funcdesc}
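
The difference between \code{match()} and \code{search()}, and between
no match and a zero-length match, is sketched below (assuming a current
Python 3 interpreter; the sample strings are illustrative).

```python
import re

at_start = re.match("b", "abc")     # None: "b" is not at index 0
anywhere = re.search("b", "abc")    # found further into the string
empty = re.match("x*", "abc")       # succeeds with a zero-length match
```

Because a zero-length match is still a match object, test the result
against \code{None} rather than testing the matched text for emptiness.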

\begin{funcdesc}{split}{pattern\, string\optional{, maxsplit=0}}
  Split \var{string} by the occurrences of \var{pattern}. If
  capturing parentheses are used in pattern, then occurrences of
  patterns or subpatterns are also returned.
  If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
  occur, and the remainder of the string is returned as the final
  element of the list. (Incompatibility note: in the original Python
  1.5 release, \var{maxsplit} was ignored. This has been fixed in
  later releases.)
%
\bcode\begin{verbatim}
>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', 1)
['Words', 'words, words.']
\end{verbatim}\ecode
%
  This function combines and extends the functionality of
  the old \code{regsub.split()} and \code{regsub.splitx()}.
\end{funcdesc}

\begin{funcdesc}{sub}{pattern\, repl\, string\optional{, count=0}}
Return the string obtained by replacing the leftmost non-overlapping
occurrences of \var{pattern} in \var{string} by the replacement
\var{repl}. If the pattern isn't found, \var{string} is returned
unchanged. \var{repl} can be a string or a function; if a function,
it is called for every non-overlapping occurrence of \var{pattern}.
The function takes a single match object argument, and returns the
replacement string. For example:
%
\bcode\begin{verbatim}
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
\end{verbatim}\ecode
%
The pattern may be a string or a
regex object; if you need to specify
regular expression flags, you must use a regex object, or use
embedded modifiers in a pattern; e.g.
%
\bcode\begin{verbatim}
>>> re.sub("(?i)b+", "x", "bbbb BBBB")
'x x'
\end{verbatim}\ecode
%
The optional argument \var{count} is the maximum number of pattern
occurrences to be replaced; \var{count} must be a non-negative integer, and
the default value of 0 means to replace all occurrences.

Empty matches for the pattern are replaced only when not adjacent to a
previous match, so \code{sub('x*', '-', 'abc')} returns '-a-b-c-'.
\end{funcdesc}

\begin{funcdesc}{subn}{pattern\, repl\, string\optional{, count=0}}
Perform the same operation as \code{sub()}, but return a tuple
\code{(new_string, number_of_subs_made)}.
\end{funcdesc}
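
The tuple returned by \code{subn()} is illustrated below, as a minimal
sketch assuming a current Python 3 interpreter; the sample strings are
illustrative only.

```python
import re

all_replaced = re.subn("-", " ", "a-b-c")           # replace every match
first_only = re.subn("-", " ", "a-b-c", count=1)    # stop after one
none_found = re.subn("x", "y", "abc")               # nothing to replace
```

The second element of the tuple makes it easy to tell whether any
substitution happened without comparing the old and new strings.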

\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (e.g., unmatched parentheses) or
  when some other error occurs during compilation or matching. (It is
  never an error if a string contains no match for a pattern.)
\end{excdesc}
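
One way to handle the exception is sketched below, assuming a current
Python 3 interpreter. The helper \code{is_valid_pattern()} is
hypothetical and not part of the module.

```python
import re

def is_valid_pattern(pattern):
    """Return True if pattern compiles (hypothetical helper)."""
    try:
        re.compile(pattern)
    except re.error:
        # Raised for malformed patterns such as unmatched parentheses.
        return False
    return True

# A pattern that simply finds nothing yields None, not an exception.
no_match = re.search("x", "abc")
```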

\subsection{Regular Expression Objects}
Compiled regular expression objects support the following methods and
attributes:

\renewcommand{\indexsubitem}{(re method)}
\begin{funcdesc}{match}{string\optional{\, pos}\optional{\, endpos}}
  If zero or more characters at the beginning of \var{string} match
  this regular expression, return a corresponding
  \code{MatchObject} instance. Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.

  The optional second parameter \var{pos} gives an index in the string
  where the search is to start; it defaults to \code{0}. The
  \code{'\^'} pattern character will match at the index where the
  search is to start.

  The optional parameter \var{endpos} limits how far the string will
  be searched; it will be as if the string is \var{endpos} characters
  long, so only the characters from \var{pos} to \var{endpos} will be
  searched for a match.
\end{funcdesc}
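
The effect of \var{pos} and \var{endpos} is sketched below, assuming a
current Python 3 interpreter; the pattern and string are illustrative.
The sketch deliberately avoids \code{\^}, whose interaction with
\var{pos} is not checked here.

```python
import re

prog = re.compile("b+")

from_start = prog.match("abbbc")       # index 0 holds "a": no match
from_one = prog.match("abbbc", 1)      # start matching at index 1
limited = prog.match("abbbc", 1, 3)    # treat the string as 3 characters long
```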

\begin{funcdesc}{search}{string\optional{\, pos}\optional{\, endpos}}
  Scan through \var{string} looking for a location where this regular
  expression produces a match. Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.

  The optional \var{pos} and \var{endpos} parameters have the same
  meaning as for the \code{match} method.
\end{funcdesc}

\begin{funcdesc}{split}{string\optional{, maxsplit=0}}
Identical to the \code{split} function, using the compiled pattern.
\end{funcdesc}

\begin{funcdesc}{sub}{repl\, string\optional{, count=0}}
Identical to the \code{sub} function, using the compiled pattern.
\end{funcdesc}

\begin{funcdesc}{subn}{repl\, string\optional{, count=0}}
Identical to the \code{subn} function, using the compiled pattern.
\end{funcdesc}

\renewcommand{\indexsubitem}{(regex attribute)}

\begin{datadesc}{flags}
The flags argument used when the regex object was compiled, or 0 if no
flags were provided.
\end{datadesc}

\begin{datadesc}{groupindex}
A dictionary mapping any symbolic group names (defined by
\code{?P<\var{id}>}) to group numbers. The dictionary is empty if no
symbolic groups were used in the pattern.
\end{datadesc}

\begin{datadesc}{pattern}
The pattern string from which the regex object was compiled.
\end{datadesc}
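
The \code{groupindex} and \code{pattern} attributes can be inspected as
follows. This is a minimal sketch, assuming a current Python 3
interpreter, where \code{groupindex} is a read-only mapping and is
therefore copied into a plain dictionary; the group names are
illustrative.

```python
import re

source = r"(?P<first>\w+) (?P<last>\w+)"
prog = re.compile(source)

# Copy the mapping of group names to group numbers into a dict.
name_to_number = dict(prog.groupindex)

# A pattern without symbolic groups has an empty mapping.
unnamed = re.compile(r"(\w+) (\w+)")
unnamed_names = dict(unnamed.groupindex)
```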

\subsection{MatchObjects}
\code{MatchObject} instances support the following methods and attributes:

\begin{funcdesc}{group}{\optional{g1, g2, ...}}
Returns one or more groups of the match. If there is a single
\var{index} argument, the result is a single string; if there are
multiple arguments, the result is a tuple with one item per argument.
If the \var{index} is zero, the corresponding return value is the
entire matching string; if it is in the inclusive range [1..99], it is
the string matching the corresponding parenthesized group. If no
such group exists, the corresponding result is
\code{None}.

If the regular expression uses the \code{(?P<\var{name}>...)} syntax,
the \var{index} arguments may also be strings identifying groups by
their group name.

A moderately complicated example:
\bcode\begin{verbatim}
m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
\end{verbatim}\ecode
%
After performing this match, \code{m.group(1)} is \code{'3'}, as is \code{m.group('int')}.
\code{m.group(2)} is \code{'14'}.
\end{funcdesc}

\begin{funcdesc}{groups}{}
Return a tuple containing all the subgroups of the match, from 1 up to
however many groups are in the pattern. Groups that did not
participate in the match have values of \code{None}. (Incompatibility
note: in the original Python 1.5 release, if the tuple was one element
long, a string would be returned instead. In later versions, a
singleton tuple is returned in such cases.)
\end{funcdesc}
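
The result of \code{groups()} is sketched below, assuming a current
Python 3 interpreter. The optional third group is added here only to
show a group that does not participate in the match.

```python
import re

m = re.match(r"(\d+)\.(\d*)(x)?", "3.14")
all_groups = m.groups()    # one entry per group in the pattern

# A pattern with a single group still yields a tuple.
single = re.match(r"(\d+)", "42").groups()
```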

\begin{funcdesc}{start}{group}
\end{funcdesc}

\begin{funcdesc}{end}{group}
Return the indices of the start and end of the substring
matched by \var{group}. Return \code{None} if \var{group} exists but
did not contribute to the match. For a match object
\code{m}, and a group \code{g} that did contribute to the match, the
substring matched by group \code{g} (equivalent to \code{m.group(g)}) is
\bcode\begin{verbatim}
    m.string[m.start(g):m.end(g)]
\end{verbatim}\ecode
%
Note that
\code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
\var{group} matched a null string. For example, after \code{m =
re.search('b(c?)', 'cba')}, \code{m.start(0)} is 1, \code{m.end(0)} is
2, \code{m.start(1)} and \code{m.end(1)} are both 2, and
\code{m.start(2)} raises an \code{IndexError} exception.

\end{funcdesc}
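
The example from the description of \code{start()} and \code{end()} can
be run as written. The following sketch assumes a current Python 3
interpreter; the variable names are illustrative only.

```python
import re

m = re.search("b(c?)", "cba")

# Slicing the original string with start() and end() recovers the match.
whole = m.string[m.start(0):m.end(0)]

# Group 1 matched the empty string, so its start and end coincide.
empty_group = (m.start(1), m.end(1))

# Asking for a group the pattern does not define is an error.
try:
    m.start(2)
    raised = False
except IndexError:
    raised = True
```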

\begin{funcdesc}{span}{group}
Return the 2-tuple \code{(start(\var{group}), end(\var{group}))}.
Note that if \var{group} did not contribute to the match, this is
\code{(None, None)}.
\end{funcdesc}

\begin{datadesc}{pos}
The value of \var{pos} which was passed to the
\code{search} or \code{match} function. This is the index into the
string at which the regex engine started looking for a match.
\end{datadesc}

\begin{datadesc}{endpos}
The value of \var{endpos} which was passed to the
\code{search} or \code{match} function. This is the index into the
string beyond which the regex engine will not go.
\end{datadesc}

\begin{datadesc}{re}
The regular expression object whose \code{match()} or \code{search()} method
produced this \code{MatchObject} instance.
\end{datadesc}

\begin{datadesc}{string}
The string passed to \code{match()} or \code{search()}.
\end{datadesc}
Guido van Rossum1acceb01997-08-14 23:12:18 +0000564\begin{seealso}
Fred Drakef9951811997-12-29 16:37:04 +0000565\seetext{Jeffrey Friedl, \emph{Mastering Regular Expressions},
Guido van Rossume4eb2231997-12-17 00:23:39 +0000566O'Reilly. The Python material in this book dates from before the re
567module, but it covers writing good regular expression patterns in
Fred Drakef9951811997-12-29 16:37:04 +0000568great detail.}
Guido van Rossum1acceb01997-08-14 23:12:18 +0000569\end{seealso}
Guido van Rossume4eb2231997-12-17 00:23:39 +0000570
571