\section{Built-in Module \sectcode{re}}
\label{module-re}

\bimodindex{re}

% XXX Remove before 1.5final release.
{\large\bf The \code{re} module is still in the process of being
developed, and more features will be added in future 1.5 alphas and
betas. This documentation is also preliminary and incomplete. If you
find a bug or documentation error, or just find something unclear,
please send a message to
\code{string-sig@python.org}, and we'll fix it.}

This module provides regular expression matching operations similar to
those found in Perl. It's 8-bit
clean: both patterns and strings may contain null bytes and characters
whose high bit is set. It is always available.

Regular expressions use the backslash character (\code{\e}) to
indicate special forms or to allow special characters to be used
without invoking their special meaning. This collides with Python's
usage of the same character for the same purpose in string literals;
for example, to match a literal backslash, one might have to write
\code{\e\e\e\e} as the pattern string, because the regular expression
must be \code{\e\e}, and each backslash must be expressed as
\code{\e\e} inside a regular Python string literal.

The solution is to use Python's raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So \code{r"\e n"} is a
two-character string containing a backslash and the letter 'n', while
\code{"\e n"} is a one-character string containing a newline. Usually
patterns will be expressed in Python code using this raw string notation.
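
The difference between raw and ordinary string literals can be checked directly. The following is only a sketch, assuming a present-day Python interpreter rather than the 1.5 alphas described here; the variable names are illustrative.

```python
import re

# A raw string keeps the backslash; an ordinary literal turns \n into a newline.
raw = r"\n"
cooked = "\n"
raw_len = len(raw)        # backslash followed by the letter 'n'
cooked_len = len(cooked)  # a single newline character

# Both spellings denote the same two-character pattern string.
pattern_plain = "\\\\"
pattern_raw = r"\\"
same_pattern = pattern_plain == pattern_raw

# That pattern matches one literal backslash in the subject string.
m = re.match(pattern_raw, "\\path")
matched = m.group(0)
```

The raw spelling needs half as many backslashes, which is why it is the recommended way to write patterns.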

% XXX Can the following section be dropped, or should it be boiled down?

%\strong{Please note:} There is a little-known fact about Python string
%literals which means that you don't usually have to worry about
%doubling backslashes, even though they are used to escape special
%characters in string literals as well as in regular expressions. This
%is because Python doesn't remove backslashes from string literals if
%they are followed by an unrecognized escape character.
%\emph{However}, if you want to include a literal \dfn{backslash} in a
%regular expression represented as a string literal, you have to
%\emph{quadruple} it or enclose it in a singleton character class.
%E.g.\ to extract \LaTeX\ \code{\e section\{{\rm
%\ldots}\}} headers from a document, you can use this pattern:
%\code{'[\e ] section\{\e (.*\e )\}'}. \emph{Another exception:}
%the escape sequence \code{\e b} is significant in string literals
%(where it means the ASCII bell character) as well as in Emacs regular
%expressions (where it stands for a word boundary), so in order to
%search for a word boundary, you should use the pattern \code{'\e \e b'}.
%Similarly, a backslash followed by a digit 0-7 should be doubled to
%avoid interpretation as an octal escape.

\subsection{Regular Expressions}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression. If a string \emph{p}
matches A and another string \emph{q} matches B, the string \emph{pq}
will match AB. Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here. For
details of the theory and implementation of regular expressions,
consult the Friedl book referenced below, or almost any textbook about
compiler construction.

A brief explanation of the format of regular expressions follows. For
further information and a gentler presentation, consult XXX somewhere.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like '\code{A}', '\code{a}', or '\code{0}',
are the simplest regular expressions; they simply match themselves.
You can concatenate ordinary characters, so '\code{last}' matches the
characters 'last'. (In the rest of this section, we'll write REs in
\code{this special font}, usually without quotes, and strings to be
matched 'in single quotes'.)

Some characters, like \code{|} or \code{(}, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

The special characters are:
\begin{itemize}
\item[\code{.}] (Dot.) In the default mode, this matches any
character except a newline. If the \code{DOTALL} flag has been
specified, this matches any character including a newline.
\item[\code{\^}] (Caret.) Matches the start of the string, and in
\code{MULTILINE} mode also immediately after each newline.
\item[\code{\$}] Matches the end of the string.
\code{foo} matches both 'foo' and 'foobar', while the regular
expression '\code{foo\$}' matches only 'foo'.
%
\item[\code{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE, as many repetitions
as are possible. \code{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.
%
\item[\code{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\code{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.
%
\item[\code{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE. \code{ab?} will
match either 'a' or 'ab'.
\item[\code{*?}, \code{+?}, \code{??}] The \code{*}, \code{+}, and
\code{?} qualifiers are all \dfn{greedy}; they match as much text as
possible. Sometimes this behaviour isn't desired; if the RE
\code{<.*>} is matched against \code{<H1>title</H1>}, it will match the
entire string, and not just \code{<H1>}.
Adding \code{?} after the qualifier makes it perform the match in
\dfn{non-greedy} or \dfn{minimal} fashion; as few characters as
possible will be matched. Using \code{.*?} in the previous
expression, it will match only \code{<H1>}.
%
\item[\code{\e}] Either escapes special characters (permitting you to match
characters like '*?+\&\$'), or signals a special sequence; special
sequences are discussed below.

If you're not using a raw string to
express the pattern, remember that Python also uses the
backslash as an escape sequence in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string. However,
if Python would recognize the resulting sequence, the backslash should
be doubled. This is complicated and hard to understand, so
it's highly recommended that you use raw strings.
%
\item[\code{[]}] Used to indicate a set of characters. Characters can
be listed individually, or a range is indicated by giving two
characters and separating them by a '-'. Special characters are not
active inside sets. For example, \code{[akm\$]} will match any of the
characters 'a', 'k', 'm', or '\$'; \code{[a-z]} will match any
lowercase letter and \code{[a-zA-Z0-9]} matches any letter or digit.
Character classes of the form \code{\e \var{X}} defined below are also acceptable.
If you want to include a \code{]} or a \code{-} inside a
set, precede it with a backslash.

Characters \emph{not} within a range can be matched by including a
\code{\^} as the first character of the set; \code{\^} elsewhere will
simply match the '\code{\^}' character.
%
\item[\code{|}]\code{A|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B. This can
be used inside groups (see below) as well. To match a literal '|',
use \code{\e|}, or enclose it inside a character class, like \code{[|]}.
%
\item[\code{( ... )}] Matches whatever regular expression is inside
the parentheses, and indicates the start and end of a group; the
contents of a group can be retrieved after a match has been performed,
and can be matched later in the string with the
\code{\e \var{number}} special sequence, described below. To match the
literals '(' or ')',
use \code{\e(} or \code{\e)}, or enclose them inside a character
class: \code{[(] [)]}.
%
\item[\code{(?:...)}] A non-grouping version of regular parentheses.
Matches whatever's inside the parentheses, but the text matched by the
group \emph{cannot} be retrieved after performing a match or
referenced later in the pattern.
%
\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
the text matched by the group is accessible via the symbolic group
name \var{name}. Group names must be valid Python identifiers. A
symbolic group is also a numbered group, just as if the group were not
named. So the group named 'id' in the example below can also be
referenced as the numbered group 1.

For example, if the pattern string is
\code{r'(?P<id>[a-zA-Z_]\e w*)'}, the group can be referenced by its
name in arguments to methods of match objects, such as \code{m.group('id')}
or \code{m.end('id')}, and also by name in pattern text (e.g. \code{(?P=id)}) and
replacement text (e.g. \code{\e g<id>}).
%
\item[\code{(?\#...)}] A comment; the contents of the parentheses are simply ignored.
%
\item[\code{(?=...)}] Matches if \code{...} matches next. This is not
implemented as of Python 1.5a3.
%
\item[\code{(?!...)}] Matches if \code{...} doesn't match next. This is not
implemented as of Python 1.5a3.
\end{itemize}
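
Several of the special characters above can be tried out in a few lines. This is a sketch only, assuming a present-day Python interpreter; the subject strings other than \code{<H1>title</H1>} are made up for illustration.

```python
import re

text = "<H1>title</H1>"

# Greedy: .* runs on to the last '>' in the string.
greedy = re.match(r"<.*>", text).group(0)

# Non-greedy: .*? stops at the first '>'.
minimal = re.match(r"<.*?>", text).group(0)

# A named group is also a numbered group, so both lookups agree.
m = re.match(r"(?P<id>[a-zA-Z_]\w*)", "spam_1 = 42")
by_name = m.group("id")
by_number = m.group(1)

# (?:...) does not create a group, so (c) is group number 1.
n = re.match(r"(?:ab)+(c)", "ababc")
only_group = n.group(1)
```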

The special sequences consist of '\code{\e}' and a character from the
list below. If the ordinary character is not on the list, then the
resulting RE will match the second character. For example,
\code{\e\$} matches the character '\$'. Ones where the backslash
should be doubled are indicated.

\begin{itemize}

%
\item[\code{\e \var{number}}] Matches the contents of the group of the
same number. For example, \code{(.+) \e 1} matches 'the the' or '55
55', but not 'the end' (note the space after the group). This special
sequence can only be used to match one of the first 99 groups. If the
first digit of \var{number} is 0, or \var{number} is 3 octal digits
long, it will not be interpreted as a group match, but as the character
with octal value \var{number}.
%
\item[\code{\e A}] Matches only at the start of the string.
%
\item[\code{\e b}] Matches the empty string, but only at the
beginning or end of a word. A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character.
%
\item[\code{\e B}] Matches the empty string, but only when it is \emph{not} at the
beginning or end of a word.
%
\item[\code{\e d}] Matches any decimal digit; this is
equivalent to the set \code{[0-9]}.
%
\item[\code{\e D}] Matches any non-digit character; this is
equivalent to the set \code{[{\^}0-9]}.
%
\item[\code{\e s}] Matches any whitespace character; this is
equivalent to the set \code{[ \e t\e n\e r\e f\e v]}.
%
\item[\code{\e S}] Matches any non-whitespace character; this is
equivalent to the set \code{[{\^} \e t\e n\e r\e f\e v]}.
%
\item[\code{\e w}] Matches any alphanumeric character; this is
equivalent to the set \code{[a-zA-Z0-9_]}.
%
\item[\code{\e W}] Matches any non-alphanumeric character; this is
equivalent to the set \code{[{\^}a-zA-Z0-9_]}.

\item[\code{\e Z}] Matches only at the end of the string.
%

\item[\code{\e \e}] Matches a literal backslash.

\end{itemize}
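
The special sequences can be exercised as follows. This is a sketch under the assumption of a present-day Python interpreter and plain ASCII input; the sample strings beyond those quoted above are illustrative.

```python
import re

# \1 must repeat whatever group 1 matched, after a single space.
repeat = re.compile(r"(.+) \1")
hits = [s for s in ("the the", "55 55", "the end") if repeat.match(s)]

# \b anchors at word boundaries, so the 'cat' inside 'concatenate' is skipped.
word = re.search(r"\bcat\b", "concatenate the cat")
word_start = word.start(0)

# \w, \s and \d stand for the character sets listed above.
fields = re.match(r"(\w+)\s+(\d+)", "width   80")
name, value = fields.group(1, 2)
```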

\subsection{Module Contents}

The module defines the following functions and constants, and an exception:

\renewcommand{\indexsubitem}{(in module re)}

\begin{funcdesc}{compile}{pattern\optional{\, flags}}
  Compile a regular expression pattern into a regular expression
  object, which can be used for matching using its \code{match} and
  \code{search} methods, described below.

  The sequence
%
\bcode\begin{verbatim}
prog = re.compile(pat)
result = prog.match(str)
\end{verbatim}\ecode
%
is equivalent to
%
\bcode\begin{verbatim}
result = re.match(pat, str)
\end{verbatim}\ecode
%
but the version using \code{compile()} is more efficient when multiple
regular expressions are used concurrently in a single program.
%(The compiled version of the last pattern passed to \code{regex.match()} or
%\code{regex.search()} is cached, so programs that use only a single
%regular expression at a time needn't worry about compiling regular
%expressions.)
\end{funcdesc}
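
The equivalence of the two forms can be checked on a handful of strings. This is a sketch assuming a present-day Python interpreter; the pattern and input lines are made up for illustration.

```python
import re

pat = r"[a-z]+"
lines = ["alpha 1", "42 beta", "gamma"]

# Compile once, then reuse the object for every line.
prog = re.compile(pat)
compiled_results = [prog.match(s) is not None for s in lines]

# The module-level function gives the same answers from the pattern string.
direct_results = [re.match(pat, s) is not None for s in lines]
```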

\begin{funcdesc}{escape}{string}
Return \var{string} with all non-alphanumerics backslashed; this is
useful if you want to match some variable string which may have
regular expression metacharacters in it.
\end{funcdesc}
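
A sketch of why escaping matters, assuming a present-day Python interpreter (which may escape fewer characters than described above, so the escaped string itself is not shown). The variable \code{literal} stands in for hypothetical user-supplied text.

```python
import re

# Hypothetical user-supplied text containing regex metacharacters.
literal = "a.b*c"

# Unescaped, '.' and '*' keep their special meaning and match too much.
loose = re.search(literal, "axbbbc") is not None

# Escaped, the pattern matches only the literal text.
strict_pattern = re.escape(literal)
strict_miss = re.search(strict_pattern, "axbbbc") is None
strict_hit = re.search(strict_pattern, "see a.b*c here").group(0)
```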

\begin{funcdesc}{match}{pattern\, string\optional{\, flags}}
  If zero or more characters at the beginning of \var{string} match
  the regular expression \var{pattern}, return a corresponding
  \code{Match} object. Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.
\end{funcdesc}

\begin{funcdesc}{search}{pattern\, string\optional{\, flags}}
  Scan through \var{string} looking for a location where the regular
  expression \var{pattern} produces a match. Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.
\end{funcdesc}
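
The difference between \code{match()} and \code{search()}, and between \code{None} and a zero-length match, can be sketched as follows (assuming a present-day Python interpreter; the sample strings are illustrative).

```python
import re

# match() only tries the beginning of the string; search() scans forward.
at_start = re.match(r"\d+", "abc 123")
anywhere = re.search(r"\d+", "abc 123")
found = anywhere.group(0)

# A zero-length match is still a match object, not None.
empty = re.match(r"x*", "abc")
empty_span = empty.span(0)
```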

\begin{funcdesc}{split}{pattern\, string\optional{, maxsplit=0}}
  Split \var{string} by the occurrences of \var{pattern}. If
  capturing parentheses are used in \var{pattern}, then occurrences of
  patterns or subpatterns are also returned.
%
\bcode\begin{verbatim}
>>> re.split('[\W]+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split('([\W]+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
\end{verbatim}\ecode
%
  This function combines and extends the functionality of
  \code{regex.split()} and \code{regex.splitx()}.
\end{funcdesc}

\begin{funcdesc}{sub}{pattern\, repl\, string\optional{, count=0}}
Return the string obtained by replacing the leftmost non-overlapping
occurrences of \var{pattern} in \var{string} by the replacement
\var{repl}, which can be a string or a function that returns a
string. If the pattern isn't found, \var{string} is returned
unchanged. The
pattern may be a string or a regexp object; if you need to specify
regular expression flags, you must use a regexp object, or use
embedded modifiers in a pattern string; e.g.
%
\bcode\begin{verbatim}
>>> re.sub("(?i)b+", "x", "bbbb BBBB")
'x x'
\end{verbatim}\ecode
%
The optional argument \var{count} is the maximum number of pattern
occurrences to be replaced; count must be a non-negative integer, and
the default value of 0 means to replace all occurrences.

Empty matches for the pattern are replaced only when not adjacent to a
previous match, so \code{sub('x*', '-', 'abc')} returns '-a-b-c-'.
\end{funcdesc}
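
A sketch of \code{sub()} with a function as the replacement and with a limited \var{count}, assuming a present-day Python interpreter. The helper \code{double} is hypothetical and exists only for this example.

```python
import re

def double(match):
    # Hypothetical helper: replace each number with twice its value.
    return str(int(match.group(0)) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# count limits how many leftmost occurrences are replaced.
first_only = re.sub(r"a", "A", "banana", count=1)

# With no match, the original string comes back unchanged.
untouched = re.sub(r"z", "-", "banana")
```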

\begin{funcdesc}{subn}{pattern\, repl\, string\optional{, count=0}}
Perform the same operation as \code{sub()}, but return a tuple
\code{(new_string, number_of_subs_made)}.
\end{funcdesc}
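
For instance, collapsing runs of whitespace and counting the replacements might look like this (a sketch assuming a present-day Python interpreter; the input string is illustrative).

```python
import re

# Each run of whitespace counts as one substitution.
new_string, number_of_subs_made = re.subn(r"\s+", " ", "a  b \t c")
```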

\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (e.g., unmatched parentheses) or
  when some other error occurs during compilation or matching. (It is
  never an error if a string contains no match for a pattern.)
\end{excdesc}

\subsection{Regular Expression Objects}
Compiled regular expression objects support the following methods and
attributes:

\renewcommand{\indexsubitem}{(re method)}
\begin{funcdesc}{match}{string\optional{\, pos}}
  If zero or more characters at the beginning of \var{string} match
  this regular expression, return a corresponding
  \code{Match} object. Return \code{None} if the string does not
  match the pattern; note that this is different from a zero-length
  match.

  The optional second parameter \var{pos} gives an index in the string
  where the search is to start; it defaults to \code{0}. This is not
  completely equivalent to slicing the string; the \code{'\^'} pattern
  character matches at the real beginning of the string and at positions
  just after a newline, not necessarily at the index where the search
  is to start.
\end{funcdesc}
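
The difference between passing \var{pos} and slicing the string can be sketched as follows, assuming a present-day Python interpreter; the patterns and the two-character subject string are illustrative.

```python
import re

anchored = re.compile(r"^b")
plain = re.compile(r"b")
text = "ab"

# Starting at index 1, the unanchored pattern matches the 'b'.
plain_hit = plain.match(text, 1) is not None

# '^' still refers to the real start of the string, so this finds nothing.
anchored_with_pos = anchored.match(text, 1)

# Slicing creates a new string whose start is the 'b', so '^' matches there.
anchored_on_slice = anchored.match(text[1:]) is not None
```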

\begin{funcdesc}{search}{string\optional{\, pos}}
  Scan through \var{string} looking for a location where this regular
  expression produces a match. Return \code{None} if no
  position in the string matches the pattern; note that this is
  different from finding a zero-length match at some point in the string.

  The optional second parameter has the same meaning as for the
  \code{match} method.
\end{funcdesc}

\begin{funcdesc}{split}{string\optional{, maxsplit=0}}
Identical to the \code{split} function, using the compiled pattern.
\end{funcdesc}

\begin{funcdesc}{sub}{repl\, string\optional{, count=0}}
Identical to the \code{sub} function, using the compiled pattern.
\end{funcdesc}

\begin{funcdesc}{subn}{repl\, string\optional{, count=0}}
Identical to the \code{subn} function, using the compiled pattern.
\end{funcdesc}

\renewcommand{\indexsubitem}{(regex attribute)}

\begin{datadesc}{flags}
The flags argument used when the regex object was compiled, or 0 if no
flags were provided.
\end{datadesc}

\begin{datadesc}{groupindex}
A dictionary mapping any symbolic group names (defined by
\code{?P<\var{id}>}) to group numbers. The dictionary is empty if no
symbolic groups were used in the pattern.
\end{datadesc}

\begin{datadesc}{pattern}
The pattern string from which the regex object was compiled.
\end{datadesc}
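
A sketch of the \code{groupindex} and \code{pattern} attributes, assuming a present-day Python interpreter (where \code{groupindex} is a read-only mapping, hence the \code{dict()} call). The pattern is made up for illustration.

```python
import re

prog = re.compile(r"(?P<id>[a-zA-Z_]\w*)=(?P<val>\d+)")

# groupindex maps each symbolic name to its group number.
names = dict(prog.groupindex)

# pattern holds the original pattern string.
source = prog.pattern
```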

\subsection{Match Objects}
Match objects support the following methods and attributes:

\begin{funcdesc}{span}{group}
Return the 2-tuple \code{(start(\var{group}), end(\var{group}))}.
Note that if \var{group} did not contribute to the match, this is \code{(None,
None)}.
\end{funcdesc}

\begin{funcdesc}{start}{group}
\end{funcdesc}

\begin{funcdesc}{end}{group}
Return the indices of the start and end of the substring matched by
\var{group}. Return \code{None} if \var{group} exists but did not contribute to
the match. Note that for a match object \code{m}, and a group \code{g}
that did contribute to the match, the substring matched by group \code{g} is
\bcode\begin{verbatim}
    m.string[m.start(g):m.end(g)]
\end{verbatim}\ecode
%
Note too that \code{m.start(\var{group})} will equal
\code{m.end(\var{group})} if \var{group} matched a null string. For example,
after \code{m = re.search('b(c?)', 'cba')}, \code{m.start(0)} is 1,
\code{m.end(0)} is 2, \code{m.start(1)} and \code{m.end(1)} are both
2, and \code{m.start(2)} raises an
\code{IndexError} exception.
\end{funcdesc}
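
The example just given can be written out as a short script. This is a sketch assuming a present-day Python interpreter; only groups that took part in the match are inspected.

```python
import re

m = re.search(r"b(c?)", "cba")

whole = (m.start(0), m.end(0))
group1 = m.span(1)  # group 1 matched the empty string after 'b'

# The substring matched by a group is recovered by slicing m.string.
slice0 = m.string[m.start(0):m.end(0)]

# Asking for a group that does not exist raises IndexError.
try:
    m.start(2)
    raised = False
except IndexError:
    raised = True
```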

\begin{funcdesc}{group}{\optional{g1, g2, ...}}
This method is only valid when the last call to the \code{match}
or \code{search} method found a match. It returns one or more
groups of the match. If there is a single \var{index} argument,
the result is a single string; if there are multiple arguments, the
result is a tuple with one item per argument. If the \var{index} is
zero, the corresponding return value is the entire matching string; if
it is in the inclusive range [1..99], it is the string matching the
corresponding parenthesized group (groups are delimited using
\code{(} and \code{)}). If no
such group exists, the corresponding result is \code{None}.

If the regular expression defines symbolic groups with the
\code{(?P<\var{name}>...)} syntax, the \var{index} arguments may also
be strings identifying groups by their group name.
\end{funcdesc}
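
The different ways of calling \code{group()} can be sketched as follows, assuming a present-day Python interpreter; the pattern and subject string are illustrative.

```python
import re

m = re.match(r"(?P<key>\w+)=(\d+)", "size=42;")

entire = m.group(0)       # index 0: the whole matching string
single = m.group(1)       # one index: a single string
several = m.group(1, 2)   # several indices: a tuple, one item per argument
by_name = m.group("key")  # a symbolic group name also works
```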

\begin{datadesc}{pos}
The index at which the search or match began.
\end{datadesc}

\begin{datadesc}{re}
The regular expression object whose \code{match()} or \code{search()} method
produced this match object.
\end{datadesc}

\begin{datadesc}{string}
The string passed to \code{match()} or \code{search()}.
\end{datadesc}
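
These three attributes can be inspected on any match object. A sketch, assuming a present-day Python interpreter, with an illustrative pattern and starting index:

```python
import re

prog = re.compile(r"\d+")
text = "abc 123"

# Start the search at index 2 and keep the resulting match object.
m = prog.search(text, 2)

start_index = m.pos            # where the search began
same_object = m.re is prog     # the object whose method produced the match
same_string = m.string == text # the string that was searched
```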

\begin{seealso}
\seetext Jeffrey Friedl, \emph{Mastering Regular Expressions}.
\end{seealso}