\section{Built-in Module \sectcode{regex}}

\bimodindex{regex}
This module provides regular expression matching operations similar to
those found in Emacs. It is always available.

By default the patterns are Emacs-style regular expressions,
with one exception: Emacs' \samp{\e s}
pattern is not supported, since the original implementation references
the Emacs syntax tables. There is
a way to change the syntax to match that of several well-known
\UNIX{} utilities.

This module is 8-bit clean: both patterns and strings may contain null
bytes and characters whose high bit is set.

\strong{Please note:} There is a little-known fact about Python string
literals which means that you don't usually have to worry about
doubling backslashes, even though they are used to escape special
characters in string literals as well as in regular expressions. This
is because Python doesn't remove backslashes from string literals if
they are followed by an unrecognized escape character.
\emph{However}, if you want to include a literal \dfn{backslash} in a
regular expression represented as a string literal, you have to
\emph{quadruple} it. E.g.\ to extract \LaTeX\ \samp{\e section\{{\rm
\ldots}\}} headers from a document, you can use this pattern:
\code{'\e \e \e \e section\{\e (.*\e )\}'}. \emph{Another exception:}
the escape sequence \samp{\e b} is significant in string literals
(where it means the ASCII backspace character) as well as in Emacs regular
expressions (where it stands for a word boundary), so in order to
search for a word boundary, you should use the pattern \code{'\e \e b'}.
Similarly, a backslash followed by a digit 0-7 should be doubled to
avoid interpretation as an octal escape.
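
The backslash rules above can be checked directly on string literals,
without any regular expression module. The following is a minimal
sketch; the variable names are illustrative only, and it deliberately
uses only escapes that Python recognizes.

```python
# Each assignment shows how many characters a string literal really holds.
one_backslash = '\\'        # the escape \\ yields a single backslash
regex_backslash = '\\\\'    # two backslashes: an RE that matches one literal backslash
word_boundary = '\\b'       # backslash followed by 'b': the RE word-boundary sequence
backspace = '\b'            # a single control character with code 8
octal = '\1'                # octal escape: the character with code 1
back_reference = '\\1'      # backslash followed by '1': an RE back-reference

print(len(one_backslash), len(regex_backslash))      # 1 2
print(len(word_boundary), ord(backspace))            # 2 8
print(len(octal), ord(octal), len(back_reference))   # 1 1 2
```

Four backslashes in the source text therefore reach the regular
expression compiler as two, which it reads as one literal backslash.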

\subsection{Regular Expressions}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression.

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression. If a string \emph{p}
matches \emph{A} and another string \emph{q} matches \emph{B}, the string \emph{pq}
will match \emph{AB}. Thus, complex expressions can easily be constructed
from simpler ones like the primitives described here. For details of
the theory and implementation of regular expressions, consult almost
any textbook about compiler construction.
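
The concatenation rule can be illustrated with a short sketch. It
assumes an interpreter where the \code{regex} module described here is
unavailable and uses the standard \code{re} module instead; the
patterns chosen use only operators that both syntaxes share.

```python
import re

A = 'ab*'            # 'a' followed by any number of 'b's
B = 'c+'             # one or more 'c's
p, q = 'abb', 'cc'   # p matches A, q matches B

p_matches_A = re.fullmatch(A, p) is not None
q_matches_B = re.fullmatch(B, q) is not None

# Concatenating the patterns gives an RE that matches the concatenated strings.
concatenated = re.fullmatch(A + B, p + q)
print(p_matches_A, q_matches_B, concatenated is not None)   # True True True
```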

% XXX The reference could be made more specific, say to
% "Compilers: Principles, Techniques and Tools", by Alfred V. Aho,
% Ravi Sethi, and Jeffrey D. Ullman, or some FA text.

A brief explanation of the format of regular
expressions follows.

Regular expressions can contain both special and ordinary characters.
Ordinary characters, like '\code{A}', '\code{a}', or '\code{0}', are
the simplest regular expressions; they simply match themselves. You
can concatenate ordinary characters, so '\code{last}' matches the
characters 'last'.

Special characters either stand for classes of ordinary characters, or
affect how the regular expressions around them are interpreted.

The special characters are:
\begin{itemize}
\item[\code{.}]{Matches any character except a newline.}
\item[\code{\^}]{Matches the start of the string.}
\item[\code{\$}]{Matches the end of the string.
\code{foo} matches both 'foo' and 'foobar', while the regular
expression '\code{foo\$}' matches only 'foo'.}
\item[\code{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE. \code{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.
\item[\code{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\code{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.
\item[\code{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE. \code{ab?} will
match either 'a' or 'ab'.

\item[\code{\e}] Either escapes special characters (permitting you to match
characters like '*?+\&\$'), or signals a special sequence; special
sequences are discussed below. Remember that Python also uses the
backslash as an escape character in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string. However,
if Python would recognize the resulting sequence, the backslash should
be doubled.

\item[\code{[]}] Used to indicate a set of characters. Characters can
be listed individually, or a range is indicated by giving two
characters and separating them by a '-'. Special characters are
not active inside sets. For example, \code{[akm\$]}
will match any of the characters 'a', 'k', 'm', or '\$'; \code{[a-z]} will
match any lowercase letter.

If you want to include a \code{]} inside a
set, it must be the first character of the set; to include a \code{-},
place it as the first or last character.

Characters \emph{not} within a range can be matched by including a
\code{\^} as the first character of the set; \code{\^} elsewhere will
simply match the '\code{\^}' character.
\end{itemize}
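
The special characters listed above can be tried out with a small
sketch. It assumes the standard \code{re} module as a stand-in for
\code{regex}, since these particular characters mean the same thing in
both syntaxes; the helper \code{matches} is hypothetical and requires
the whole string to match.

```python
import re

def matches(pattern, string):
    """Return True if the whole string matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, string) is not None

star = [matches('ab*', s) for s in ('a', 'ab', 'abbb')]      # [True, True, True]
plus = [matches('ab+', s) for s in ('a', 'ab', 'abbb')]      # [False, True, True]
optional = [matches('ab?', s) for s in ('a', 'ab', 'abb')]   # [True, True, False]
in_set = [matches('[akm$]', s) for s in ('a', '$', 'z')]     # [True, True, False]
negated = [matches('[^a-z]', s) for s in ('a', 'A', '^')]    # [False, True, True]

# 'foo' is found inside 'foobar', but 'foo$' is not, because of the end anchor.
unanchored = re.search('foo', 'foobar') is not None          # True
anchored = re.search('foo$', 'foobar') is not None           # False
```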

The special sequences consist of '\code{\e}' and a character
from the list below. If the ordinary character is not on the list,
then the resulting RE will match the second character. For example,
\code{\e\$} matches the character '\$'. Ones where the backslash
should be doubled are indicated.

\begin{itemize}
\item[\code{\e|}]\code{A\e|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B.
%
\item[\code{\e( \e)}]{Indicates the start and end of a group; the
contents of a group can be matched later in the string with the
\code{\e \[1-9]} special sequence, described next.}
%
{\fulllineitems\item[\code{\e \e 1, ... \e \e 7, \e 8, \e 9}]
{Matches the contents of the group of the same
number. For example, \code{\e (.+\e ) \e \e 1} matches 'the the' or
'55 55', but not 'the end' (note the space after the group). This
special sequence can only be used to match one of the first 9 groups;
groups with higher numbers can be matched using the \code{\e v}
sequence.}}
%
\item[\code{\e \e b}]{Matches the empty string, but only at the
beginning or end of a word. A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character.}
%
\item[\code{\e B}]{Matches the empty string, but only when it is \emph{not} at the
beginning or end of a word.}
%
\item[\code{\e v}]{Must be followed by a two-digit decimal number, and
matches the contents of the group of the same number. The group number must be between 1 and 99, inclusive.}
%
\item[\code{\e w}]Matches any alphanumeric character; this is
equivalent to the set \code{[a-zA-Z0-9]}.
%
\item[\code{\e W}]{Matches any non-alphanumeric character; this is
equivalent to the set \code{[\^a-zA-Z0-9]}.}
\item[\code{\e <}]{Matches the empty string, but only at the beginning of a
word. A word is defined as a sequence of alphanumeric characters, so
the end of a word is indicated by whitespace or a non-alphanumeric
character.}
\item[\code{\e >}]{Matches the empty string, but only at the end of a
word.}

% In Emacs, the following two are start of buffer/end of buffer. In
% Python they seem to be synonyms for ^$.
\item[\code{\e `}]{Like \code{\^}, this only matches at the start of the
string.}
\item[\code{\e \e '}] Like \code{\$}, this only matches at the end of the
string.
% end of buffer
\end{itemize}
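
Back-references and word boundaries can be sketched as follows. The
example assumes the standard \code{re} module rather than
\code{regex}, so groups are written with plain parentheses instead of
\code{\e(} and \code{\e)}; the doubled backslashes in the string
literals follow the advice given earlier for \code{\e b} and for
backslash-digit sequences.

```python
import re

# Emacs-style \(.+\) \1 corresponds to (.+) \1 in the re module's syntax.
repeated = re.compile('(.+) \\1')
results = [repeated.fullmatch(s) is not None
           for s in ('the the', '55 55', 'the end')]    # [True, True, False]

# \b matches the empty string at a word boundary; \B matches everywhere else.
whole_word = re.search('\\bcat\\b', 'a cat sat') is not None       # True
inside_word = re.search('\\bcat\\b', 'concatenate') is not None    # False
not_boundary = re.search('\\Bcat\\B', 'concatenate') is not None   # True
```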

\subsection{Module Contents}

The module defines these functions, and an exception:

\renewcommand{\indexsubitem}{(in module regex)}

\begin{funcdesc}{match}{pattern\, string}
  Return how many characters at the beginning of \var{string} match
  the regular expression \var{pattern}. Return \code{-1} if the
  string does not match the pattern (this is different from a
  zero-length match!).
\end{funcdesc}

\begin{funcdesc}{search}{pattern\, string}
  Return the first position in \var{string} that matches the regular
  expression \var{pattern}. Return \code{-1} if no position in the string
  matches the pattern (this is different from a zero-length match
  anywhere!).
\end{funcdesc}
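
The difference between a failed match and a zero-length match is
easiest to see with concrete return values. The sketch below emulates
the integer results described above on top of the standard \code{re}
module; the helpers \code{match_length} and \code{search_position} are
hypothetical names, not part of either module.

```python
import re

def match_length(pattern, string):
    """Length of the match at the start of string, or -1 if there is none."""
    m = re.match(pattern, string)
    return -1 if m is None else m.end()

def search_position(pattern, string):
    """Index of the first match in string, or -1 if there is none."""
    m = re.search(pattern, string)
    return -1 if m is None else m.start()

print(match_length('ab*', 'abbc'))     # 3
print(match_length('b*', 'abc'))       # 0, a zero-length match at the start
print(match_length('b+', 'abc'))       # -1, no match at the start
print(search_position('b+', 'abc'))    # 1
print(search_position('x', 'abc'))     # -1
```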

\begin{funcdesc}{compile}{pattern\optional{\, translate}}
  Compile a regular expression pattern into a regular expression
  object, which can be used for matching using its \code{match} and
  \code{search} methods, described below. The optional argument
  \var{translate}, if present, must be a 256-character string
  indicating how characters (both of the pattern and of the strings to
  be matched) are translated before comparing them; the \code{i}-th
  element of the string gives the translation for the character with
  \ASCII{} code \code{i}. This can be used to implement
  case-insensitive matching; see the \code{casefold} data item below.

  The sequence

\bcode\begin{verbatim}
prog = regex.compile(pat)
result = prog.match(str)
\end{verbatim}\ecode

is equivalent to

\bcode\begin{verbatim}
result = regex.match(pat, str)
\end{verbatim}\ecode

but the version using \code{compile()} is more efficient when multiple
regular expressions are used concurrently in a single program. (The
compiled version of the last pattern passed to \code{regex.match()} or
\code{regex.search()} is cached, so programs that use only a single
regular expression at a time needn't worry about compiling regular
expressions.)
\end{funcdesc}

\begin{funcdesc}{set_syntax}{flags}
  Set the syntax to be used by future calls to \code{compile},
  \code{match} and \code{search}. (Already compiled expression objects
  are not affected.) The argument is an integer which is the OR of
  several flag bits. The return value is the previous value of
  the syntax flags. Names for the flags are defined in the standard
  module \code{regex_syntax}; read the file \file{regex_syntax.py} for
  more information.
\end{funcdesc}

\begin{funcdesc}{symcomp}{pattern\optional{\, translate}}
This is like \code{compile}, but supports symbolic group names: if a
parenthesis-enclosed group begins with a group name in angle
brackets, e.g. \code{'\e(<id>[a-z][a-z0-9]*\e)'}, the group can
be referenced by its name in arguments to the \code{group} method of
the resulting compiled regular expression object, like this:
\code{p.group('id')}. Group names may contain alphanumeric characters
and \code{'_'} only.
\end{funcdesc}
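
Symbolic group names can be illustrated with the standard \code{re}
module, used here as an assumed stand-in for \code{regex}. Note two
differences from the description above: \code{re} writes a named group
as \code{(?P<id>...)} rather than \code{\e(<id>...\e)}, and its
\code{group} method belongs to the match object returned by
\code{match}, not to the compiled expression.

```python
import re

# The same identifier pattern as above, in the re module's named-group syntax.
p = re.compile('(?P<id>[a-z][a-z0-9]*)')
m = p.match('x25 = 1')

name = m.group('id')          # look the group up by its symbolic name
index = p.groupindex['id']    # numerical index assigned to the name
print(name, index)            # x25 1
```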

\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (e.g., unmatched parentheses) or
  when some other error occurs during compilation or matching. (It is
  never an error if a string contains no match for a pattern.)
\end{excdesc}

\begin{datadesc}{casefold}
A string suitable to pass as \var{translate} argument to
\code{compile} to map all uppercase characters to their lowercase
equivalents.
\end{datadesc}
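
A translation string of the kind accepted by \code{compile} can be
built by hand. The sketch below constructs a 256-character table that
folds the uppercase \ASCII{} letters to lowercase and leaves every
other character alone; whether the real \code{casefold} item treats
characters above 127 the same way is an assumption, and the helper
\code{apply_table} is hypothetical.

```python
# Entry i of the table is the translation of the character with code i.
table = ''.join(chr(i + 32) if ord('A') <= i <= ord('Z') else chr(i)
                for i in range(256))

def apply_table(table, text):
    """Translate each character through the table (assumes codes below 256)."""
    return ''.join(table[ord(c)] for c in text)

folded = apply_table(table, 'Hello, World')
print(len(table), folded)    # 256 hello, world
```

Because the same table is applied to both the pattern and the string,
comparing the translated forms gives case-insensitive matching.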

\noindent
Compiled regular expression objects support these methods:

\renewcommand{\indexsubitem}{(regex method)}
\begin{funcdesc}{match}{string\optional{\, pos}}
  Return how many characters at the beginning of \var{string} match
  the compiled regular expression. Return \code{-1} if the string
  does not match the pattern (this is different from a zero-length
  match!).

  The optional second parameter \var{pos} gives an index in the string
  where the search is to start; it defaults to \code{0}. This is not
  completely equivalent to slicing the string; the \code{'\^'} pattern
  character matches at the real beginning of the string and at positions
  just after a newline, not necessarily at the index where the search
  is to start.
\end{funcdesc}

\begin{funcdesc}{search}{string\optional{\, pos}}
  Return the first position in \var{string} that matches the compiled
  regular expression. Return \code{-1} if no position in the
  string matches the pattern (this is different from a zero-length
  match anywhere!).

  The optional second parameter has the same meaning as for the
  \code{match} method.
\end{funcdesc}
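
The remark that \var{pos} is not equivalent to slicing can be shown
with a two-character string. The sketch assumes the standard
\code{re} module, whose compiled objects accept the same optional
\var{pos} argument but return a match object or \code{None} instead
of an integer.

```python
import re

prog = re.compile('^b')
text = 'ab'

with_pos = prog.search(text, 1)       # '^' still refers to index 0 of text
with_slice = prog.search(text[1:])    # the slice is a new string starting with 'b'

print(with_pos is None, with_slice is not None)    # True True
```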

\begin{funcdesc}{group}{index\, index\, ...}
This method is only valid when the last call to the \code{match}
or \code{search} method found a match. It returns one or more
groups of the match. If there is a single \var{index} argument,
the result is a single string; if there are multiple arguments, the
result is a tuple with one item per argument. If the \var{index} is
zero, the corresponding return value is the entire matching string; if
it is in the inclusive range [1..99], it is the string matching
the corresponding parenthesized group (using the default syntax,
groups are parenthesized using \code{\e(} and \code{\e)}). If no
such group exists, the corresponding result is \code{None}.

If the regular expression was compiled by \code{symcomp} instead of
\code{compile}, the \var{index} arguments may also be strings
identifying groups by their group name.
\end{funcdesc}
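
The single-argument and multiple-argument forms of \code{group} can
be sketched with the standard \code{re} module, assumed here in place
of \code{regex}. In \code{re} the method lives on the match object,
and \code{None} is returned for a group that took no part in the
match; that this corresponds exactly to the \code{None} case described
above is an assumption.

```python
import re

# The second group is optional and does not take part in this match.
m = re.match('(a+)(b+)?(c+)', 'aaccc!')

whole = m.group(0)            # the entire matching string
single = m.group(1)           # one index gives a single string
several = m.group(1, 2, 3)    # several indices give a tuple

print(whole, single, several)    # aaccc aa ('aa', None, 'ccc')
```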

\noindent
Compiled regular expressions support these data attributes:

\renewcommand{\indexsubitem}{(regex attribute)}

\begin{datadesc}{regs}
When the last call to the \code{match} or \code{search} method found a
match, this is a tuple of pairs of indices corresponding to the
beginning and end of all parenthesized groups in the pattern. Indices
are relative to the string argument passed to \code{match} or
\code{search}. The 0-th tuple gives the beginning and end of the
whole pattern. When the last match or search failed, this is
\code{None}.
\end{datadesc}
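
The shape of such a tuple of index pairs can be reconstructed with the
standard \code{re} module, assumed here as a stand-in for
\code{regex}. The sketch builds the tuple from the documented
\code{span} method of the match object; the variable name
\code{regs} is chosen only to mirror the attribute described above.

```python
import re

m = re.search('(b+)(c+)', 'aabbbcc')

# Pair 0 covers the whole match; pairs 1 and 2 cover the two groups.
regs = tuple(m.span(i) for i in range(m.re.groups + 1))
print(regs)    # ((2, 7), (2, 5), (5, 7))
```

The indices are relative to the full string that was searched, so
slicing that string with any pair recovers the matching text.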

\begin{datadesc}{last}
When the last call to the \code{match} or \code{search} method found a
match, this is the string argument passed to that method. When the
last match or search failed, this is \code{None}.
\end{datadesc}

\begin{datadesc}{translate}
This is the value of the \var{translate} argument to
\code{regex.compile} that created this regular expression object. If
the \var{translate} argument was omitted in the \code{regex.compile}
call, this is \code{None}.
\end{datadesc}

\begin{datadesc}{givenpat}
The regular expression pattern as passed to \code{compile} or
\code{symcomp}.
\end{datadesc}

\begin{datadesc}{realpat}
The regular expression after stripping the group names for regular
expressions compiled with \code{symcomp}. Same as \code{givenpat}
otherwise.
\end{datadesc}

\begin{datadesc}{groupindex}
A dictionary giving the mapping from symbolic group names to numerical
group indices for regular expressions compiled with \code{symcomp}.
\code{None} otherwise.
\end{datadesc}
336\end{datadesc}