% Guido van Rossum 5fdeeea 1994-01-02 01:22:07 +0000
\section{Built-in Module \sectcode{regex}}

\bimodindex{regex}
This module provides regular expression matching operations similar to
those found in Emacs. It is always available.

By default the patterns are Emacs-style regular expressions; there is
a way to change the syntax to match that of several well-known
\UNIX{} utilities.

This module is 8-bit clean: both patterns and strings may contain null
bytes and characters whose high bit is set.

\strong{Please note:} There is a little-known fact about Python string literals
which means that you don't usually have to worry about doubling
backslashes, even though they are used to escape special characters in
string literals as well as in regular expressions. This is because
Python doesn't remove backslashes from string literals if they are
followed by an unrecognized escape character. \emph{However}, if you
want to include a literal \dfn{backslash} in a regular expression
represented as a string literal, you have to \emph{quadruple} it. E.g.
to extract LaTeX \samp{\e section\{{\rm \ldots}\}} headers from a document, you can
use this pattern: \code{'\e \e \e\e section\{\e (.*\e )\}'}.
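
As a small illustration of the quadrupling rule, the sketch below
builds the pattern from the example and counts its backslashes. It
spells every backslash explicitly (also doubling the ones before the
parentheses), so the result does not depend on how the interpreter
treats unrecognized escapes. The names \code{pattern} and
\code{backslash} are only illustrative.

```python
# Four backslashes in the literal become two characters in the string;
# the regular expression then reads those two as one literal backslash.
pattern = '\\\\section{\\(.*\\)}'

backslash = chr(92)

# The text that is actually handed to the regular expression compiler.
print(pattern)

# Two backslashes before "section", one before each parenthesis.
print(pattern.count(backslash))
```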

The module defines these functions, an exception, and a data item:

\renewcommand{\indexsubitem}{(in module regex)}
\begin{funcdesc}{match}{pattern\, string}
  Return how many characters at the beginning of \var{string} match
  the regular expression \var{pattern}. Return \code{-1} if the
  string does not match the pattern (this is different from a
  zero-length match!).
\end{funcdesc}

\begin{funcdesc}{search}{pattern\, string}
  Return the first position in \var{string} that matches the regular
  expression \var{pattern}. Return \code{-1} if no position in the string
  matches the pattern (this is different from a zero-length match
  anywhere!).
\end{funcdesc}
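
The difference between \code{-1} and a zero-length match can be
sketched with two small helper functions. They are hypothetical
stand-ins written on top of the separate \code{re} module (an
assumption: its pattern syntax differs from the Emacs-style default
described here) and only mimic the return convention; they are not
this module's implementation.

```python
import re

def match_length(pattern, string):
    """Length of the match at the start of string, or -1 if there is none."""
    m = re.match(pattern, string)
    return -1 if m is None else m.end()

def search_position(pattern, string):
    """Index of the first match in string, or -1 if there is none."""
    m = re.search(pattern, string)
    return -1 if m is None else m.start()

print(match_length('a+', 'aab'))     # 2: two characters match
print(match_length('a*', 'bbb'))     # 0: a zero-length match, still a match
print(match_length('a+', 'bbb'))     # -1: no match at all
print(search_position('b+', 'aab'))  # 2: first matching position
print(search_position('x', 'aab'))   # -1: no position matches
```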

\begin{funcdesc}{compile}{pattern\, translate}
  Compile a regular expression pattern into a regular expression
  object, which can be used for matching using its \code{match} and
  \code{search} methods, described below. The optional
  \var{translate}, if present, must be a 256-character string
  indicating how characters (both of the pattern and of the strings to
  be matched) are translated before comparing them; the \code{i}-th
  element of the string gives the translation for the character with
  ASCII code \code{i}.

  The sequence

\bcode\begin{verbatim}
prog = regex.compile(pat)
result = prog.match(str)
\end{verbatim}\ecode

is equivalent to

\bcode\begin{verbatim}
result = regex.match(pat, str)
\end{verbatim}\ecode

but the version using \code{compile()} is more efficient when multiple
regular expressions are used concurrently in a single program. (The
compiled version of the last pattern passed to \code{regex.match()} or
\code{regex.search()} is cached, so programs that use only a single
regular expression at a time needn't worry about compiling regular
expressions.)
\end{funcdesc}

\begin{funcdesc}{set_syntax}{flags}
  Set the syntax to be used by future calls to \code{compile},
  \code{match} and \code{search}. (Already compiled expression objects
  are not affected.) The argument is an integer which is the OR of
  several flag bits. The return value is the previous value of
  the syntax flags. Names for the flags are defined in the standard
  module \code{regex_syntax}; read the file \file{regex_syntax.py} for
  more information.
\end{funcdesc}

\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (e.g., unmatched parentheses) or
  when some other error occurs during compilation or matching. (It is
  never an error if a string contains no match for a pattern.)
\end{excdesc}

\begin{datadesc}{casefold}
A string suitable to pass as the \var{translate} argument to
\code{compile} to map all uppercase characters to their lowercase
equivalents.
\end{datadesc}
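
To make the layout of such a translation table concrete, here is a
sketch that builds a 256-character string of the shape described for
\var{translate}. It assumes that only the ASCII letters A to Z are
folded; the actual contents of \code{casefold} are not spelled out
here, and the names used are hypothetical.

```python
def make_fold_table():
    """Build a 256-character table: entry i translates character code i."""
    chars = []
    for i in range(256):
        if ord('A') <= i <= ord('Z'):
            # Assumption: fold only ASCII uppercase, e.g. 'A' (65) -> 'a' (97).
            chars.append(chr(i + 32))
        else:
            # Every other character translates to itself.
            chars.append(chr(i))
    return ''.join(chars)

fold_table = make_fold_table()

print(len(fold_table))            # 256
print(fold_table[ord('Q')])       # q
print(fold_table[ord('q')])       # q
```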

\noindent
Compiled regular expression objects support these methods:

\renewcommand{\indexsubitem}{(regex method)}
\begin{funcdesc}{match}{string\, pos}
  Return how many characters at the beginning of \var{string} match
  the compiled regular expression. Return \code{-1} if the string
  does not match the pattern (this is different from a zero-length
  match!).

  The optional second parameter \var{pos} gives an index in the string
  where the search is to start; it defaults to \code{0}. This is not
  completely equivalent to slicing the string; the \code{'\^'} pattern
  character matches at the real beginning of the string and at positions
  just after a newline, not necessarily at the index where the search
  is to start.
\end{funcdesc}
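
The slicing caveat can be observed with the separate \code{re} module,
whose compiled objects accept a similar optional start position. This
is only an analogy under that assumption, not a demonstration of this
module's own matcher; the variable names are illustrative.

```python
import re

prog = re.compile('^b')

# Starting at index 1: '^' still refers to the real beginning of 'ab',
# so nothing matches even though the character at index 1 is 'b'.
from_pos = prog.match('ab', 1)

# Matching against the slice: the slice is a new string that begins with 'b'.
from_slice = prog.match('ab'[1:])

print(from_pos is None)        # True
print(from_slice is not None)  # True
```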

\begin{funcdesc}{search}{string\, pos}
  Return the first position in \var{string} that matches the compiled
  regular expression. Return \code{-1} if no position in the
  string matches the pattern (this is different from a zero-length
  match anywhere!).

  The optional second parameter has the same meaning as for the
  \code{match} method.
\end{funcdesc}

\begin{funcdesc}{group}{index\, index\, ...}
This method is only valid when the last call to the \code{match}
or \code{search} method found a match. It returns one or more
groups of the match. If there is a single \var{index} argument,
the result is a single string; if there are multiple arguments, the
result is a tuple with one item per argument. If the \var{index} is
zero, the corresponding return value is the entire matching string; if
it is in the inclusive range [1..9], it is the string matching
the corresponding parenthesized group (using the default syntax,
groups are parenthesized using \code{\e (} and \code{\e )}). If no
such group exists, the corresponding result is \code{None}.
\end{funcdesc}

\noindent
Compiled regular expressions support these data attributes:

\renewcommand{\indexsubitem}{(regex attribute)}
\begin{datadesc}{regs}
When the last call to the \code{match} or \code{search} method found a
match, this is a tuple of pairs of indices corresponding to the
beginning and end of all parenthesized groups in the pattern. Indices
are relative to the string argument passed to \code{match} or
\code{search}. The 0-th pair gives the beginning and end of the
whole match. When the last match or search failed, this is
\code{None}.
\end{datadesc}

\begin{datadesc}{last}
When the last call to the \code{match} or \code{search} method found a
match, this is the string argument passed to that method. When the
last match or search failed, this is \code{None}.
\end{datadesc}

\begin{datadesc}{translate}
This is the value of the \var{translate} argument to
\code{regex.compile} that created this regular expression object. If
the \var{translate} argument was omitted in the \code{regex.compile}
call, this is \code{None}.
\end{datadesc}