\section{Built-in Module \sectcode{regex}}

\bimodindex{regex}
This module provides regular expression matching operations similar to
those found in Emacs. It is always available.

By default the patterns are Emacs-style regular expressions
(with one exception). There is
a way to change the syntax to match that of several well-known
\UNIX{} utilities. The exception is that Emacs' \samp{\e s}
pattern is not supported, since the original implementation references
the Emacs syntax tables.

This module is 8-bit clean: both patterns and strings may contain null
bytes and characters whose high bit is set.

\strong{Please note:} There is a little-known fact about Python string
literals which means that you don't usually have to worry about
doubling backslashes, even though they are used to escape special
characters in string literals as well as in regular expressions. This
is because Python doesn't remove backslashes from string literals if
they are followed by an unrecognized escape character.
\emph{However}, if you want to include a literal \dfn{backslash} in a
regular expression represented as a string literal, you have to
\emph{quadruple} it. E.g.\ to extract \LaTeX\ \samp{\e section\{{\rm
\ldots}\}} headers from a document, you can use this pattern:
\code{'\e \e \e \e section\{\e (.*\e )\}'}. \emph{Another exception:}
the escape sequence \samp{\e b} is significant in string literals
(where it means the ASCII backspace character) as well as in Emacs regular
expressions (where it stands for a word boundary), so in order to
search for a word boundary, you should use the pattern \code{'\e \e b'}.
Similarly, a backslash followed by a digit 0-7 should be doubled to
avoid interpretation as an octal escape.
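
As a rough illustration of the quoting rules above, the following
sketch inspects plain string literals only. It does not call the
\code{regex} module, the variable names are made up for the example,
and it deliberately uses only escape sequences that Python recognizes.

```python
# Sketch: which characters end up in the string that a regular
# expression compiler would receive.  Variable names are illustrative.

# Four backslashes in the literal leave two in the string, which the
# regular expression syntax then reads as one literal backslash.
quadrupled = '\\\\section'
backslashes = quadrupled.count('\\')

# '\b' is a recognized string escape: a single backspace character.
backspace = '\b'

# Doubling the backslash keeps it, so the pattern is backslash + 'b'.
word_boundary = '\\b'

# A backslash followed by a digit 0-7 would be read as an octal
# escape, so it is doubled as well.
backreference = '\\1'

print(backslashes, ord(backspace), list(word_boundary), list(backreference))
```

The point of the sketch is that the string literal is processed first;
the regular expression syntax only ever sees what is left afterwards.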

\subsection{Regular Expressions}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression. If a string \emph{p}
matches A and another string \emph{q} matches B, the string \emph{pq}
will match AB. Thus, complex expressions can easily be constructed
from simpler ones like the primitives described here. For details of
the theory and implementation of regular expressions, consult almost
any textbook about compiler construction.

% XXX The reference could be made more specific, say to
% "Compilers: Principles, Techniques and Tools", by Alfred V. Aho,
% Ravi Sethi, and Jeffrey D. Ullman, or some FA text.

A brief explanation of the format of regular expressions follows.

Regular expressions can contain both special and ordinary characters.
Ordinary characters, like '\code{A}', '\code{a}', or '\code{0}', are
the simplest regular expressions; they simply match themselves. You
can concatenate ordinary characters, so '\code{last}' matches the
characters 'last'. (In the rest of this section, we'll write RE's in
\code{this special font}, usually without quotes, and strings to be
matched 'in single quotes'.)

Special characters either stand for classes of ordinary characters, or
affect how the regular expressions around them are interpreted.

The special characters are:
\begin{itemize}
\item[\code{.}]{(Dot.) Matches any character except a newline.}
\item[\code{\^}]{(Caret.) Matches the start of the string.}
\item[\code{\$}]{Matches the end of the string.
\code{foo} matches both 'foo' and 'foobar', while the regular
expression '\code{foo\$}' matches only 'foo'.}
\item[\code{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE. \code{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.
\item[\code{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\code{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.
\item[\code{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE. \code{ab?} will
match either 'a' or 'ab'.

\item[\code{\e}] Either escapes special characters (permitting you to match
characters like '*?+\&\$'), or signals a special sequence; special
sequences are discussed below. Remember that Python also uses the
backslash as an escape character in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string. However,
if Python would recognize the resulting sequence, the backslash should
be doubled.

\item[\code{[]}] Used to indicate a set of characters. Characters can
be listed individually, or a range can be indicated by giving two
characters and separating them by a '-'. Special characters are
not active inside sets. For example, \code{[akm\$]}
will match any of the characters 'a', 'k', 'm', or '\$'; \code{[a-z]} will
match any lowercase letter.

If you want to include a \code{]} inside a
set, it must be the first character of the set; to include a \code{-},
place it as the first or last character.

Characters \emph{not} within a range can be matched by including a
\code{\^} as the first character of the set; \code{\^} elsewhere will
simply match the '\code{\^}' character.
\end{itemize}

The special sequences consist of '\code{\e}' and a character
from the list below. If the ordinary character is not on the list,
then the resulting RE will match the second character. For example,
\code{\e\$} matches the character '\$'. Sequences whose backslash
must be doubled in a Python string literal are shown with the doubled
backslash.

\begin{itemize}
\item[\code{\e|}]\code{A\e|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B. This can
be used inside groups (see below) as well.
%
\item[\code{\e( \e)}]{Indicates the start and end of a group; the
contents of a group can be matched later in the string with the
\code{\e [1-9]} special sequence, described next.}
%
{\fulllineitems\item[\code{\e \e 1, ... \e \e 7, \e 8, \e 9}]
{Matches the contents of the group of the same
number. For example, \code{\e (.+\e ) \e \e 1} matches 'the the' or
'55 55', but not 'the end' (note the space after the group). This
special sequence can only be used to match one of the first 9 groups;
groups with higher numbers can be matched using the \code{\e v}
sequence. (\code{\e 8} and \code{\e 9} don't need a double backslash
because they are not octal digits.)}}
%
\item[\code{\e \e b}]{Matches the empty string, but only at the
beginning or end of a word. A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character.}
%
\item[\code{\e B}]{Matches the empty string, but only when it is
\emph{not} at the beginning or end of a word.}
%
\item[\code{\e v}]{Must be followed by a two-digit decimal number, and
matches the contents of the group of the same number. The group
number must be between 1 and 99, inclusive.}
%
\item[\code{\e w}]Matches any alphanumeric character; this is
equivalent to the set \code{[a-zA-Z0-9]}.
%
\item[\code{\e W}]{Matches any non-alphanumeric character; this is
equivalent to the set \code{[\^a-zA-Z0-9]}.}
\item[\code{\e <}]{Matches the empty string, but only at the beginning of a
word. A word is defined as a sequence of alphanumeric characters, so
the end of a word is indicated by whitespace or a non-alphanumeric
character.}
\item[\code{\e >}]{Matches the empty string, but only at the end of a
word.}

\item[\code{\e \e \e \e}]{Matches a literal backslash.}

% In Emacs, the following two are start of buffer/end of buffer. In
% Python they seem to be synonyms for ^$.
\item[\code{\e `}]{Like \code{\^}, this only matches at the start of the
string.}
\item[\code{\e \e '}] Like \code{\$}, this only matches at the end of the
string.
% end of buffer
\end{itemize}

\subsection{Module Contents}

The module defines these functions, and an exception:

\renewcommand{\indexsubitem}{(in module regex)}

\begin{funcdesc}{match}{pattern\, string}
  Return how many characters at the beginning of \var{string} match
  the regular expression \var{pattern}. Return \code{-1} if the
  string does not match the pattern (this is different from a
  zero-length match!).
\end{funcdesc}

\begin{funcdesc}{search}{pattern\, string}
  Return the first position in \var{string} that matches the regular
  expression \var{pattern}. Return \code{-1} if no position in the string
  matches the pattern (this is different from a zero-length match
  anywhere!).
\end{funcdesc}
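
Because a failed call returns \code{-1} rather than a false value, the
result is best compared against zero explicitly. The sketch below
shows the pitfall using hand-written integers in place of real return
values; \code{matched} is a hypothetical helper, not part of the
module.

```python
# Sketch: interpreting the integer returned by match or search.
# The values below are hand-written stand-ins for real return values,
# and matched() is a hypothetical helper, not part of the regex module.

def matched(result):
    """Return True if result denotes success (including 0)."""
    return result >= 0

zero_length = 0   # a successful zero-length match, or a hit at position 0
failure = -1      # no match

# Testing the raw integer for truth gets both cases wrong:
# 0 is false although the call succeeded, -1 is true although it failed.
naive = [bool(zero_length), bool(failure)]
checked = [matched(zero_length), matched(failure)]
print(naive, checked)
```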

\begin{funcdesc}{compile}{pattern\optional{\, translate}}
  Compile a regular expression pattern into a regular expression
  object, which can be used for matching using its \code{match} and
  \code{search} methods, described below. The optional argument
  \var{translate}, if present, must be a 256-character string
  indicating how characters (both of the pattern and of the strings to
  be matched) are translated before comparing them; the \code{i}-th
  element of the string gives the translation for the character with
  \ASCII{} code \code{i}. This can be used to implement
  case-insensitive matching; see the \code{casefold} data item below.

  The sequence

\bcode\begin{verbatim}
prog = regex.compile(pat)
result = prog.match(str)
\end{verbatim}\ecode

is equivalent to

\bcode\begin{verbatim}
result = regex.match(pat, str)
\end{verbatim}\ecode

but the version using \code{compile()} is more efficient when multiple
regular expressions are used concurrently in a single program. (The
compiled version of the last pattern passed to \code{regex.match()} or
\code{regex.search()} is cached, so programs that use only a single
regular expression at a time needn't worry about compiling regular
expressions.)
\end{funcdesc}

\begin{funcdesc}{set_syntax}{flags}
  Set the syntax to be used by future calls to \code{compile},
  \code{match} and \code{search}. (Already compiled expression objects
  are not affected.) The argument is an integer which is the OR of
  several flag bits. The return value is the previous value of
  the syntax flags. Names for the flags are defined in the standard
  module \code{regex_syntax}; read the file \file{regex_syntax.py} for
  more information.
\end{funcdesc}

\begin{funcdesc}{symcomp}{pattern\optional{\, translate}}
This is like \code{compile}, but supports symbolic group names: if a
parenthesis-enclosed group begins with a group name in angle
brackets, e.g. \code{'\e(<id>[a-z][a-z0-9]*\e)'}, the group can
be referenced by its name in arguments to the \code{group} method of
the resulting compiled regular expression object, like this:
\code{p.group('id')}. Group names may contain alphanumeric characters
and \code{'_'} only.
\end{funcdesc}

\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (e.g., unmatched parentheses) or
  when some other error occurs during compilation or matching. (It is
  never an error if a string contains no match for a pattern.)
\end{excdesc}

\begin{datadesc}{casefold}
A string suitable to pass as the \var{translate} argument to
\code{compile} to map all upper case characters to their lowercase
equivalents.
\end{datadesc}
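
To make the shape of a \var{translate} string concrete, here is a
minimal sketch that builds a 256-character table by hand, written for
a current Python interpreter. It is a stand-in, not the module's own
\code{casefold} value: the name \code{fold_table} is illustrative, and
the sketch folds only the \ASCII{} letters A-Z, since the treatment of
characters above 127 is not stated here.

```python
import string

# Sketch of a 256-character translation table that folds ASCII upper
# case to lower case.  fold_table is an illustrative name; this is a
# hand-built stand-in, not the regex module's casefold value.
fold_table = ''.join(
    chr(i).lower() if chr(i) in string.ascii_uppercase else chr(i)
    for i in range(256)
)

# The i-th character is the translation of the character with code i.
print(len(fold_table), fold_table[ord('A')], fold_table[ord('a')])
```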

\noindent
Compiled regular expression objects support these methods:

\renewcommand{\indexsubitem}{(regex method)}
\begin{funcdesc}{match}{string\optional{\, pos}}
  Return how many characters at the beginning of \var{string} match
  the compiled regular expression. Return \code{-1} if the string
  does not match the pattern (this is different from a zero-length
  match!).

  The optional second parameter \var{pos} gives an index in the string
  where the search is to start; it defaults to \code{0}. This is not
  completely equivalent to slicing the string; the \code{'\^'} pattern
  character matches at the real beginning of the string and at positions
  just after a newline, not necessarily at the index where the search
  is to start.
\end{funcdesc}

\begin{funcdesc}{search}{string\optional{\, pos}}
  Return the first position in \var{string} that matches the compiled
  regular expression. Return \code{-1} if no position in the
  string matches the pattern (this is different from a zero-length
  match anywhere!).

  The optional second parameter has the same meaning as for the
  \code{match} method.
\end{funcdesc}

\begin{funcdesc}{group}{index\, index\, ...}
This method is only valid when the last call to the \code{match}
or \code{search} method found a match. It returns one or more
groups of the match. If there is a single \var{index} argument,
the result is a single string; if there are multiple arguments, the
result is a tuple with one item per argument. If the \var{index} is
zero, the corresponding return value is the entire matching string; if
it is in the inclusive range [1..99], it is the string matching
the corresponding parenthesized group (using the default syntax,
groups are parenthesized using \code{\e(} and \code{\e)}). If no
such group exists, the corresponding result is \code{None}.

If the regular expression was compiled by \code{symcomp} instead of
\code{compile}, the \var{index} arguments may also be strings
identifying groups by their group name.
\end{funcdesc}

\noindent
Compiled regular expressions support these data attributes:

\renewcommand{\indexsubitem}{(regex attribute)}

\begin{datadesc}{regs}
When the last call to the \code{match} or \code{search} method found a
match, this is a tuple of pairs of indices corresponding to the
beginning and end of all parenthesized groups in the pattern. Indices
are relative to the string argument passed to \code{match} or
\code{search}. The 0-th pair gives the beginning and end of the
whole match. When the last match or search failed, this is
\code{None}.
\end{datadesc}

\begin{datadesc}{last}
When the last call to the \code{match} or \code{search} method found a
match, this is the string argument passed to that method. When the
last match or search failed, this is \code{None}.
\end{datadesc}
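
The relationship between \code{regs} and \code{last} can be sketched
as follows. The two values are written by hand to resemble what a
successful match might leave behind; they are not produced by running
the module, and the sketch assumes that each pair can be used directly
as a (start, end) slice into \code{last}, with the end index exclusive.

```python
# Sketch: recovering group text from regs and last.
# Both values are written by hand to resemble the state after the
# pattern \(.+\) \1 matched 'the the'; they do not come from running
# the regex module.
last = 'the the'
regs = ((0, 7), (0, 3))

# Assumption: each pair is a (start, end) slice into last, end exclusive.
groups = [last[start:end] for (start, end) in regs]
print(groups)
```

Under that assumption, item 0 is the whole match and item 1 is the
text of the first parenthesized group.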

\begin{datadesc}{translate}
This is the value of the \var{translate} argument to
\code{regex.compile} that created this regular expression object. If
the \var{translate} argument was omitted in the \code{regex.compile}
call, this is \code{None}.
\end{datadesc}

\begin{datadesc}{givenpat}
The regular expression pattern as passed to \code{compile} or
\code{symcomp}.
\end{datadesc}

\begin{datadesc}{realpat}
The regular expression after stripping the group names for regular
expressions compiled with \code{symcomp}. Same as \code{givenpat}
otherwise.
\end{datadesc}

\begin{datadesc}{groupindex}
A dictionary giving the mapping from symbolic group names to numerical
group indices for regular expressions compiled with \code{symcomp}.
\code{None} otherwise.
\end{datadesc}