\section{Built-in Module \sectcode{re}}
\label{module-re}

\bimodindex{re}

This module provides regular expression matching operations similar to
those found in Perl. It's 8-bit clean: both patterns and strings may
contain null bytes and characters whose high bit is set. It is always
available.

Regular expressions use the backslash character (\code{\e}) to
indicate special forms or to allow special characters to be used
without invoking their special meaning. This collides with Python's
usage of the same character for the same purpose in string literals;
for example, to match a literal backslash, one might have to write
\code{\e\e\e\e} as the pattern string, because the regular expression
must be \code{\e\e}, and each backslash must be expressed as
\code{\e\e} inside a regular Python string literal.

The solution is to use Python's raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So \code{r"\e n"} is a
two-character string containing a backslash and the letter 'n', while
\code{"\e n"} is a one-character string containing a newline. Usually
patterns will be expressed in Python code using this raw string notation.

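As a small illustration of the two spellings discussed above, the
following sketch checks that the doubled-backslash literal and the raw
literal denote the same pattern. It assumes a current Python
interpreter; the variable names are ours and purely illustrative.

\begin{verbatim}
import re

# The same two-character regular expression, written two ways.
plain_literal = "\\\\"     # ordinary literal: every backslash is doubled
raw_literal = r"\\"        # raw literal: backslashes are left alone
same_pattern = (plain_literal == raw_literal)

# r"\n" holds a backslash and an 'n'; "\n" holds a single newline.
raw_length = len(r"\n")
plain_length = len("\n")

# Search a Windows-style path for its first literal backslash.
match = re.search(raw_literal, r"C:\temp")
backslash_index = match.start()
\end{verbatim}
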
\subsection{Regular Expression Syntax}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression. If a string \emph{p}
matches A and another string \emph{q} matches B, the string \emph{pq}
will match AB. Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here. For
details of the theory and implementation of regular expressions,
consult the Friedl book referenced below, or almost any textbook about
compiler construction.

A brief explanation of the format of regular expressions follows.
%For further information and a gentler presentation, consult XXX somewhere.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like '\code{A}', '\code{a}', or '\code{0}',
are the simplest regular expressions; they simply match themselves.
You can concatenate ordinary characters, so '\code{last}' matches the
characters 'last'. (In the rest of this section, we'll write RE's in
\code{this special font}, usually without quotes, and strings to be
matched 'in single quotes'.)

Some characters, like \code{|} or \code{(}, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

The special characters are:
% define these since they're used twice:
\newcommand{\MyLeftMargin}{0.7in}
\newcommand{\MyLabelWidth}{0.65in}
\begin{list}{}{\leftmargin \MyLeftMargin \labelwidth \MyLabelWidth}
\item[\code{.}] (Dot.) In the default mode, this matches any
character except a newline. If the \code{DOTALL} flag has been
specified, this matches any character including a newline.
\item[\code{\^}] (Caret.) Matches the start of the string, and in
\code{MULTILINE} mode also immediately after each newline.
\item[\code{\$}] Matches the end of the string, and in
\code{MULTILINE} mode also matches before a newline.
\code{foo} matches both 'foo' and 'foobar', while the regular
expression \code{foo\$} matches only 'foo'.
%
\item[\code{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE, as many repetitions
as are possible. \code{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.
%
\item[\code{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\code{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.
%
\item[\code{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE. \code{ab?} will
match either 'a' or 'ab'.
\item[\code{*?}, \code{+?}, \code{??}] The \code{*}, \code{+}, and
\code{?} qualifiers are all \dfn{greedy}; they match as much text as
possible. Sometimes this behaviour isn't desired; if the RE
\code{<.*>} is matched against \code{<H1>title</H1>}, it will match the
entire string, and not just \code{<H1>}.
Adding \code{?} after the qualifier makes it perform the match in
\dfn{non-greedy} or \dfn{minimal} fashion; as few characters as
possible will be matched. Using \code{.*?} in the previous
expression will match only \code{<H1>}.
%
\item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
\var{m} to \var{n} repetitions of the preceding RE, attempting to
match as many repetitions as possible. For example, \code{a\{3,5\}}
will match from 3 to 5 'a' characters.
%
\item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
match from \var{m} to \var{n} repetitions of the preceding RE,
attempting to match as \emph{few} repetitions as possible. This is
the non-greedy version of the previous qualifier. For example, on the
6-character string 'aaaaaa', \code{a\{3,5\}} will match 5 'a'
characters, while \code{a\{3,5\}?} will only match 3 characters.
%
\item[\code{\e}] Either escapes special characters (permitting you to match
characters like '*?+\&\$'), or signals a special sequence; special
sequences are discussed below.

If you're not using a raw string to
express the pattern, remember that Python also uses the
backslash as an escape character in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string. However,
if Python would recognize the resulting sequence, the backslash should
be doubled. This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the
simplest expressions.
%
\item[\code{[]}] Used to indicate a set of characters. Characters can
be listed individually, or a range of characters can be indicated by
giving two characters and separating them by a '-'. Special
characters are not active inside sets. For example, \code{[akm\$]}
will match any of the characters 'a', 'k', 'm', or '\$'; \code{[a-z]}
will match any lowercase letter and \code{[a-zA-Z0-9]} matches any
letter or digit. Character classes such as \code{\e w} or
\code{\e S} (defined below) are also acceptable inside a set. If you
want to include a \code{]} or a \code{-} inside a set, precede it with a
backslash.

Characters that are \emph{not} in a set can be matched by complementing
the set: include a \code{\^} as the first character of the set. A
\code{\^} anywhere else in the set simply matches the '\code{\^}'
character.
%
\item[\code{|}]\code{A|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B. This can
be used inside groups (see below) as well. To match a literal '\code{|}',
use \code{\e|}, or enclose it inside a character class, like \code{[|]}.
%
\item[\code{(...)}] Matches whatever regular expression is inside the
parentheses, and indicates the start and end of a group; the contents
of a group can be retrieved after a match has been performed, and can
be matched later in the string with the \code{\e \var{number}} special
sequence, described below. To match the literals '(' or ')',
use \code{\e(} or \code{\e)}, or enclose them inside a character
class: \code{[(] [)]}.
%
\item[\code{(?...)}] This is an extension notation (a '?' following a
'(' is not meaningful otherwise). The first character after the '?'
determines the meaning and further syntax of the construct.
The currently supported extensions follow.
%
\item[\code{(?iLmsx)}] (One or more letters from the set '\code{i}',
'\code{L}', '\code{m}', '\code{s}', '\code{x}'.) The group matches
the empty string; the letters set the corresponding flags
(\code{re.I}, \code{re.L}, \code{re.M}, \code{re.S}, \code{re.X}) for
the entire regular expression. This is useful if you wish to include the
flags as part of the regular expression, instead of passing a
\var{flags} argument to the \code{compile()} function.
%
\item[\code{(?:...)}] A non-grouping version of regular parentheses.
Matches whatever's inside the parentheses, but the text matched by the
group \emph{cannot} be retrieved after performing a match or
referenced later in the pattern.
%
\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
the text matched by the group is accessible via the symbolic group
name \var{name}. Group names must be valid Python identifiers. A
symbolic group is also a numbered group, just as if the group were not
named. So the group named 'id' in the example below can also be
referenced as the numbered group 1.

For example, if the pattern is
\code{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
name in arguments to methods of match objects, such as \code{m.group('id')}
or \code{m.end('id')}, and also by name in pattern text
(e.g. \code{(?P=id)}) and replacement text (e.g. \code{\e g<id>}).
%
\item[\code{(?P=\var{name})}] Matches whatever text was matched by the
earlier group named \var{name}.
%
\item[\code{(?\#...)}] A comment; the contents of the parentheses are
simply ignored.
%
\item[\code{(?=...)}] Matches if \code{...} matches next, but doesn't
consume any of the string. This is called a lookahead assertion. For
example, \code{Isaac (?=Asimov)} will match 'Isaac~' only if it's
followed by 'Asimov'.
%
\item[\code{(?!...)}] Matches if \code{...} doesn't match next. This
is a negative lookahead assertion. For example,
\code{Isaac (?!Asimov)} will match 'Isaac~' only if it's \emph{not}
followed by 'Asimov'.

\end{list}
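
Several of the special characters described above can be tried out
directly. The following is a minimal sketch, assuming a current
Python interpreter; the sample strings are taken from the descriptions
above where possible, and the variable names are illustrative only.

\begin{verbatim}
import re

html = '<H1>title</H1>'

# Greedy: .* runs to the last '>' in the string.
greedy = re.match(r'<.*>', html).group(0)
# Non-greedy: .*? stops at the first '>'.
minimal = re.match(r'<.*?>', html).group(0)

# Bounded repetition, greedy and non-greedy, on six 'a' characters.
most = re.match(r'a{3,5}', 'aaaaaa').group(0)
fewest = re.match(r'a{3,5}?', 'aaaaaa').group(0)

# A named group is also numbered; (?P=id) refers back to its text.
ident = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
by_name = ident.group('id')
by_number = ident.group(1)
repeated = re.match(r'(?P<id>[a-zA-Z_]\w*) (?P=id)', 'spam spam')

# Lookahead assertions test what follows without consuming it.
followed = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov')
not_followed = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov')
\end{verbatim}

Note that \code{followed.group(0)} is \code{'Isaac '}: the text
'Asimov' is checked by the assertion but is not part of the match.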

The special sequences consist of '\code{\e}' and a character from the
list below. If the ordinary character is not on the list, then the
resulting RE will match the second character. For example,
\code{\e\$} matches the character '\$'.

\begin{list}{}{\leftmargin \MyLeftMargin \labelwidth \MyLabelWidth}

%
\item[\code{\e \var{number}}] Matches the contents of the group of the
same number. Groups are numbered starting from 1. For example,
\code{(.+) \e 1} matches 'the the' or '55 55', but not 'the end' (note
the space after the group). This special sequence can only be used to
match one of the first 99 groups. If the first digit of \var{number}
is 0, or \var{number} is 3 octal digits long, it will not be interpreted
as a group match, but as the character with octal value \var{number}.
%
\item[\code{\e A}] Matches only at the start of the string.
%
\item[\code{\e b}] Matches the empty string, but only at the
beginning or end of a word. A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character. Inside a character range,
\code{\e b} represents the backspace character, for compatibility with
Python's string literals.
%
\item[\code{\e B}] Matches the empty string, but only when it is
\emph{not} at the beginning or end of a word.
%
\item[\code{\e d}] Matches any decimal digit; this is
equivalent to the set \code{[0-9]}.
%
\item[\code{\e D}] Matches any non-digit character; this is
equivalent to the set \code{[{\^}0-9]}.
%
\item[\code{\e s}] Matches any whitespace character; this is
equivalent to the set \code{[ \e t\e n\e r\e f\e v]}.
%
\item[\code{\e S}] Matches any non-whitespace character; this is
equivalent to the set \code{[\^\ \e t\e n\e r\e f\e v]}.
%
\item[\code{\e w}] When the \code{LOCALE} flag is not specified,
matches any alphanumeric character; this is equivalent to the set
\code{[a-zA-Z0-9_]}. With \code{LOCALE}, it will match the set
\code{[0-9_]} plus whatever characters are defined as letters for the
current locale.
%
\item[\code{\e W}] When the \code{LOCALE} flag is not specified,
matches any non-alphanumeric character; this is equivalent to the set
\code{[{\^}a-zA-Z0-9_]}. With \code{LOCALE}, it will match any
character not in the set \code{[0-9_]}, and not defined as a letter
for the current locale.
%
\item[\code{\e Z}] Matches only at the end of the string.
%
\item[\code{\e \e}] Matches a literal backslash.

\end{list}
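
A few of the special sequences listed above are exercised in the
following sketch. It assumes a current Python interpreter and uses
raw strings throughout, so that sequences such as \code{\e b} reach
the module untouched; the variable names are illustrative.

\begin{verbatim}
import re

# \1 must repeat exactly what group 1 matched.
doubled = re.match(r'(.+) \1', 'the the')
not_doubled = re.match(r'(.+) \1', 'the end')

# \b anchors a match at a word boundary without consuming text.
whole_word = re.search(r'\bfoo\b', 'a foo b')
inside_word = re.search(r'\bfoo\b', 'foobar')

# \d+ matches a run of decimal digits; \s+ a run of whitespace.
digits = re.search(r'\d+', 'room 101').group(0)
blanks = re.search(r'\s+', 'a \t\nb').group(0)
\end{verbatim}
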

\subsection{Module Contents}
\nodename{Contents of Module re}

The module defines the following functions and constants, and an exception:

\setindexsubitem{(in module re)}

\begin{funcdesc}{compile}{pattern\optional{\, flags}}
  Compile a regular expression pattern into a regular expression
  object, which can be used for matching using its \code{match()} and
  \code{search()} methods, described below.

  The expression's behaviour can be modified by specifying a
  \var{flags} value. Values can be any of the following variables,
  combined using bitwise OR (the \code{|} operator).

\begin{description}

% The use of \quad in the item labels is ugly but adds enough space
% to the label that it doesn't get visually run-in with the text.

\item[\code{I} or \code{IGNORECASE} or \code{(?i)}\quad]

Perform case-insensitive matching; expressions like \code{[A-Z]} will match
lowercase letters, too. This is not affected by the current locale.

\item[\code{L} or \code{LOCALE} or \code{(?L)}\quad]

Make \code{\e w}, \code{\e W}, \code{\e b}, and
\code{\e B} dependent on the current locale.

\item[\code{M} or \code{MULTILINE} or \code{(?m)}\quad]

When specified, the pattern character \code{\^} matches at the
beginning of the string and at the beginning of each line
(immediately following each newline); and the pattern character
\code{\$} matches at the end of the string and at the end of each line
(immediately preceding each newline).
By default, \code{\^} matches only at the beginning of the string, and
\code{\$} only at the end of the string and immediately before the
newline (if any) at the end of the string.

\item[\code{S} or \code{DOTALL} or \code{(?s)}\quad]

Make the \code{.} special character match any character at all,
including a newline; without this flag, \code{.} will match anything
\emph{except} a newline.

\item[\code{X} or \code{VERBOSE} or \code{(?x)}\quad]

Ignore whitespace within the pattern, except when it is in a character
class or preceded by an unescaped backslash. When a line contains a
\code{\#} that is neither in a character class nor preceded by an
unescaped backslash, all characters from the leftmost such \code{\#}
through the end of the line are ignored.

\end{description}

The sequence
%
\begin{verbatim}
prog = re.compile(pat)
result = prog.match(str)
\end{verbatim}
%
is equivalent to

\begin{verbatim}
result = re.match(pat, str)
\end{verbatim}

but the version using \code{compile()} is more efficient when the
expression will be used several times in a single program.
%(The compiled version of the last pattern passed to \code{regex.match()} or
%\code{regex.search()} is cached, so programs that use only a single
%regular expression at a time needn't worry about compiling regular
%expressions.)
\end{funcdesc}
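
The flags described for \code{compile()} can be combined or written
inside the pattern, as the following sketch shows. It assumes a
current Python interpreter; the patterns and sample strings are our own
and only illustrative.

\begin{verbatim}
import re

# IGNORECASE and MULTILINE combined with bitwise OR.
prog = re.compile(r'^spam$', re.I | re.M)
line = prog.search('eggs\nSPAM\nham').group(0)

# The same flags written inside the pattern.
inline = re.compile(r'(?im)^spam$')
inline_line = inline.search('eggs\nSPAM\nham').group(0)

# DOTALL lets '.' match a newline as well.
without_s = re.match(r'a.b', 'a\nb')
with_s = re.match(r'a.b', 'a\nb', re.S)

# VERBOSE permits whitespace and comments in the pattern.
number = re.compile(r"""
    \d+     # integer part
    \.      # decimal point
    \d*     # fractional digits
""", re.X)
matched = number.match('3.14').group(0)
\end{verbatim}
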

\begin{funcdesc}{escape}{string}
  Return \var{string} with all non-alphanumerics backslashed; this is
  useful if you want to match an arbitrary literal string that may have
  regular expression metacharacters in it.
\end{funcdesc}
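
The effect of \code{escape()} is easiest to see by comparing an escaped
and an unescaped pattern. This sketch assumes a current Python
interpreter and deliberately does not depend on exactly which
characters are backslashed, since that detail may differ between
releases; the sample strings are ours.

\begin{verbatim}
import re

literal = 'a.b*c'
pattern = re.escape(literal)

# The escaped pattern matches only the literal text itself.
exact = re.match(pattern, literal)
lookalike = re.match(pattern, 'aXbbc')

# Unescaped, '.' and '*' keep their special meaning.
unescaped = re.match(literal, 'aXbbc')
\end{verbatim}
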

\begin{funcdesc}{match}{pattern\, string\optional{\, flags}}
  If zero or more characters at the beginning of \var{string} match
  the regular expression \var{pattern}, return a corresponding
  \code{MatchObject} instance. Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.
\end{funcdesc}

\begin{funcdesc}{search}{pattern\, string\optional{\, flags}}
  Scan through \var{string} looking for a location where the regular
  expression \var{pattern} produces a match, and return a
  corresponding \code{MatchObject} instance.
  Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.
\end{funcdesc}
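
The difference between \code{match()} and \code{search()}, and between
``no match'' and a zero-length match, can be sketched as follows
(assuming a current Python interpreter; names are illustrative):

\begin{verbatim}
import re

# match() only tries the beginning of the string.
at_start = re.match(r'c', 'abc')

# search() scans forward until some position matches.
anywhere = re.search(r'c', 'abc')

# x* can match zero characters, so this succeeds with an empty match
# rather than returning None.
empty = re.match(r'x*', 'abc')
\end{verbatim}
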

\begin{funcdesc}{split}{pattern\, string\optional{\, maxsplit=0}}
  Split \var{string} by the occurrences of \var{pattern}. If
  capturing parentheses are used in \var{pattern}, then the text
  matched by the pattern's groups is also returned in the resulting list.
  If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
  occur, and the remainder of the string is returned as the final
  element of the list. (Incompatibility note: in the original Python
  1.5 release, \var{maxsplit} was ignored. This has been fixed in
  later releases.)
%
\begin{verbatim}
>>> re.split(r'[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'[\W]+', 'Words, words, words.', 1)
['Words', 'words, words.']
\end{verbatim}
%
  This function combines and extends the functionality of
  the old \code{regsub.split()} and \code{regsub.splitx()}.
\end{funcdesc}

\begin{funcdesc}{sub}{pattern\, repl\, string\optional{, count=0}}
Return the string obtained by replacing the leftmost non-overlapping
occurrences of \var{pattern} in \var{string} by the replacement
\var{repl}. If the pattern isn't found, \var{string} is returned
unchanged. \var{repl} can be a string or a function; if a function,
it is called for every non-overlapping occurrence of \var{pattern}.
The function takes a single match object argument, and returns the
replacement string. For example:
%
\begin{verbatim}
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
\end{verbatim}
%
The pattern may be a string or a
regex object; if you need to specify
regular expression flags, you must use a regex object, or use
embedded modifiers in a pattern; e.g.

\begin{verbatim}
sub("(?i)b+", "x", "bbbb BBBB") returns 'x x'.
\end{verbatim}

The optional argument \var{count} is the maximum number of pattern
occurrences to be replaced; \var{count} must be a non-negative integer, and
the default value of 0 means to replace all occurrences.

Empty matches for the pattern are replaced only when not adjacent to a
previous match, so \code{sub('x*', '-', 'abc')} returns '-a-b-c-'.
\end{funcdesc}

\begin{funcdesc}{subn}{pattern\, repl\, string\optional{, count=0}}
Perform the same operation as \code{sub()}, but return a tuple
\code{(\var{new_string}, \var{number_of_subs_made})}.
\end{funcdesc}
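
A short sketch of the tuple returned by \code{subn()}, with and without
a limit on the number of replacements. It assumes a current Python
interpreter, where \var{count} can be passed by keyword; the sample
strings and names are ours.

\begin{verbatim}
import re

# Collapse every run of whitespace into a single space.
squeezed, n_subs = re.subn(r'\s+', ' ', 'a  b   c')

# Replace at most two occurrences.
limited, n_limited = re.subn(r'a', 'x', 'aaaa', count=2)
\end{verbatim}
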

\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (e.g., unmatched parentheses) or
  when some other error occurs during compilation or matching. (It is
  never an error if a string contains no match for a pattern.)
\end{excdesc}
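
One possible way to use the exception is to test whether a pattern
compiles at all. In the sketch below, \code{is\_valid()} is a
hypothetical helper of our own, not part of the module, and a current
Python interpreter is assumed.

\begin{verbatim}
import re

def is_valid(pattern):
    """Return True if pattern compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

ok = is_valid(r'(abc)')
bad = is_valid(r'(abc')                 # unmatched parenthesis

# Failing to find a match is not an error; the result is simply None.
no_match = re.search(r'(abc)', 'xyz')
\end{verbatim}
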

\subsection{Regular Expression Objects}
Compiled regular expression objects support the following methods and
attributes:

\setindexsubitem{(re method)}
\begin{funcdesc}{match}{string\optional{\, pos}\optional{\, endpos}}
  If zero or more characters at the beginning of \var{string} match
  this regular expression, return a corresponding
  \code{MatchObject} instance. Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.

  The optional second parameter \var{pos} gives an index in the string
  where the search is to start; it defaults to \code{0}. The
  \code{'\^'} pattern character will match at the index where the
  search is to start.

  The optional parameter \var{endpos} limits how far the string will
  be searched; it will be as if the string is \var{endpos} characters
  long, so only the characters from \var{pos} to \var{endpos} will be
  searched for a match.
\end{funcdesc}
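
The \var{pos} and \var{endpos} parameters can be sketched as follows,
assuming a current Python interpreter. The example deliberately avoids
\code{\^}, and the names are illustrative.

\begin{verbatim}
import re

prog = re.compile(r'\d+')
text = 'abc123'

# Without pos, matching starts at index 0, where 'a' is not a digit.
from_start = prog.match(text)

# Start matching at index 3.
from_pos = prog.match(text, 3).group(0)

# Treat the string as if it ended at index 5.
bounded = prog.match(text, 3, 5).group(0)
\end{verbatim}
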

\begin{funcdesc}{search}{string\optional{\, pos}\optional{\, endpos}}
  Scan through \var{string} looking for a location where this regular
  expression produces a match. Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.

  The optional \var{pos} and \var{endpos} parameters have the same
  meaning as for the \code{match()} method.
\end{funcdesc}

\begin{funcdesc}{split}{string\optional{\, maxsplit=0}}
Identical to the \code{split()} function, using the compiled pattern.
\end{funcdesc}

\begin{funcdesc}{sub}{repl\, string\optional{, count=0}}
Identical to the \code{sub()} function, using the compiled pattern.
\end{funcdesc}

\begin{funcdesc}{subn}{repl\, string\optional{, count=0}}
Identical to the \code{subn()} function, using the compiled pattern.
\end{funcdesc}

\setindexsubitem{(regex attribute)}

\begin{datadesc}{flags}
The flags argument used when the regex object was compiled, or 0 if no
flags were provided.
\end{datadesc}

\begin{datadesc}{groupindex}
A dictionary mapping any symbolic group names (defined by
\code{(?P<\var{id}>...)}) to group numbers. The dictionary is empty if no
symbolic groups were used in the pattern.
\end{datadesc}

\begin{datadesc}{pattern}
The pattern string from which the regex object was compiled.
\end{datadesc}
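
The \code{groupindex} and \code{pattern} attributes can be inspected as
in the following sketch, which assumes a current Python interpreter.
The group names and patterns are our own.

\begin{verbatim}
import re

source = r'(?P<year>\d{4})-(?P<month>\d{2})'
prog = re.compile(source)

# Symbolic names map to the ordinary group numbers.
year_number = prog.groupindex['year']
month_number = prog.groupindex['month']

# A pattern without symbolic groups has an empty mapping.
unnamed = re.compile(r'(\d+)')
\end{verbatim}
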

\subsection{Match Objects}

\code{MatchObject} instances support the following methods and attributes:

\begin{funcdesc}{group}{\optional{group1, group2, ...}}
Returns one or more subgroups of the match. If there is a single
argument, the result is a single string; if there are
multiple arguments, the result is a tuple with one item per argument.
Without arguments, \var{group1} defaults to zero (i.e. the whole match
is returned).
If a \var{groupN} argument is zero, the corresponding return value is the
entire matching string; if it is in the inclusive range [1..99], it is
the string matching the corresponding parenthesized group. If no
such group exists, the corresponding result is
\code{None}.

If the regular expression uses the \code{(?P<\var{name}>...)} syntax,
the \var{groupN} arguments may also be strings identifying groups by
their group name.

A moderately complicated example:

\begin{verbatim}
m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
\end{verbatim}

After performing this match, \code{m.group(1)} is \code{'3'}, as is
\code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
\end{funcdesc}
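
The argument forms of \code{group()} can be compared side by side. This is a minimal sketch reusing the pattern from the example above; the variable names are illustrative, and it only asks for groups that exist in the pattern.

```python
import re

m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')

whole = m.group()             # no argument: same as m.group(0)
pair = m.group(1, 2)          # several arguments: a tuple, one item each
by_name = m.group('int')      # a symbolic name selects the same group as 1
mixed = m.group('int', 2, 0)  # names and numbers may be combined
```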

\begin{funcdesc}{groups}{}
Return a tuple containing all the subgroups of the match, from 1 up to
however many groups are in the pattern. Groups that did not
participate in the match have values of \code{None}. (Incompatibility
note: in the original Python 1.5 release, if the tuple was one element
long, a string would be returned instead. In later versions, a
singleton tuple is returned in such cases.)
\end{funcdesc}
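
The three cases described for \code{groups()} can be sketched as follows. The patterns are made up for illustration, and the single-group case assumes a release later than the original Python 1.5.

```python
import re

# Both groups participate in the match.
both = re.match(r"(?P<int>\d+)\.(\d*)", '3.14').groups()

# Only the second alternative matches, so group 1 does not participate.
skipped = re.match(r"(a)|(b)", 'b').groups()

# A pattern with a single group yields a one-element tuple.
single = re.match(r"(\d+)", '42').groups()
```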

\begin{funcdesc}{start}{\optional{group}}
\end{funcdesc}

\begin{funcdesc}{end}{\optional{group}}
Return the indices of the start and end of the substring
matched by \var{group}; \var{group} defaults to zero (meaning the whole
matched substring).
Return \code{None} if \var{group} exists but
did not contribute to the match. For a match object
\var{m}, and a group \var{g} that did contribute to the match, the
substring matched by group \var{g} (equivalent to
\code{\var{m}.group(\var{g})}) is

\begin{verbatim}
m.string[m.start(g):m.end(g)]
\end{verbatim}

Note that
\code{\var{m}.start(\var{group})} will equal \code{\var{m}.end(\var{group})} if
\var{group} matched a null string. For example, after \code{\var{m} =
re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
\code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
\code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
an \code{IndexError} exception.

\end{funcdesc}
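
The slicing identity and the null-string case can be checked with a short sketch. It uses the \code{'b(c?)'} example from the description and only groups that contributed to the match; the variable names are illustrative.

```python
import re

m = re.search('b(c?)', 'cba')

# Slice the original string using the bounds of each contributing group.
pieces = [m.string[m.start(g):m.end(g)] for g in (0, 1)]
same = [m.group(g) for g in (0, 1)]

# Group 1 matched the null string, so its start and end coincide.
empty_bounds = (m.start(1), m.end(1))

# Asking about a group the pattern does not define raises IndexError.
try:
    m.start(2)
    raised = False
except IndexError:
    raised = True
```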

\begin{funcdesc}{span}{\optional{group}}
For \code{MatchObject} \var{m}, return the 2-tuple
\code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
Note that if \var{group} did not contribute to the match, this is
\code{(None, None)}. Again, \var{group} defaults to zero.
\end{funcdesc}
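
As a small sketch of \code{span()}, the example below uses a made-up pattern and input string in which every group contributes to the match.

```python
import re

m = re.search(r"(\d+)-(\d+)", 'pages 10-25')

whole_span = m.span()    # group defaults to zero
first_span = m.span(1)
second_span = m.span(2)

# span() pairs up the values returned by start() and end().
paired = (m.start(1), m.end(1))
```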

\begin{datadesc}{pos}
The value of \var{pos} which was passed to the
\code{search()} or \code{match()} function. This is the index into
the string at which the regex engine started looking for a match.
\end{datadesc}

\begin{datadesc}{endpos}
The value of \var{endpos} which was passed to the
\code{search()} or \code{match()} function. This is the index into
the string beyond which the regex engine will not go.
\end{datadesc}
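
The effect of the two bounds can be seen in a brief sketch. It assumes that the \code{search()} method of a compiled regex object accepts optional \var{pos} and \var{endpos} arguments, as it does in current Python releases; the pattern and text are invented for the example.

```python
import re

p = re.compile(r"\d+")
text = 'abc123def456'

# Start looking at index 6, so the first run of digits is skipped.
m1 = p.search(text, 6)

# Stop at index 5, so the engine only sees 'abc12'.
m2 = p.search(text, 0, 5)
```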

\begin{datadesc}{re}
The regular expression object whose \code{match()} or \code{search()} method
produced this \code{MatchObject} instance.
\end{datadesc}

\begin{datadesc}{string}
The string passed to \code{match()} or \code{search()}.
\end{datadesc}
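
One possible use of these two attributes is to continue searching from a match object alone, without keeping separate references to the pattern and the text. The sketch below uses an invented pattern and input, and relies on \code{search()} accepting a starting position as in current Python releases.

```python
import re

p = re.compile(r"(\w+)@(\w+)")
text = 'mail to user@host today'
m = p.search(text)

same_pattern = m.re is p        # the regex object that produced the match
same_string = m.string == text  # the string that was searched

# Reuse both attributes to look for another match after this one.
following = m.re.search(m.string, m.end())
```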

\begin{seealso}
\seetext{Jeffrey Friedl, \emph{Mastering Regular Expressions},
O'Reilly. The Python material in this book dates from before the
\code{re} module, but it covers writing good regular expression
patterns in great detail.}
\end{seealso}