\section{Built-in Module \sectcode{regex}}

\bimodindex{regex}
This module provides regular expression matching operations similar to
those found in Emacs.  It is always available.

By default the patterns are Emacs-style regular expressions; the
\code{set_syntax} function described below can change the syntax to
match that of several well-known \UNIX{} utilities.

This module is 8-bit clean: both patterns and strings may contain null
bytes and characters whose high bit is set.

\strong{Please note:} Backslashes escape special characters both in
Python string literals and in regular expressions, yet you don't
usually have to double them.  This is because Python doesn't remove a
backslash from a string literal if it is followed by an unrecognized
escape character.  \emph{However}, if you want to include a literal
\dfn{backslash} in a regular expression represented as a string
literal, you have to \emph{quadruple} it: the string literal reduces
four backslashes to two, which the regular expression then reads as
one literal backslash.  E.g., to extract LaTeX
\samp{\e section\{{\rm \ldots}\}} headers from a document, you can use
this pattern: \code{'\e \e \e\e section\{\e (.*\e )\}'}.
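
The first level of doubling can be checked without the \code{regex}
module at all.  The following is a minimal sketch; the variable names
are chosen for illustration only.

\bcode\begin{verbatim}
# One backslash, built without any escape, so there is no doubt
# about how many characters it contains.
backslash = chr(92)

# Four backslashes typed in the string literal ...
typed = '\\\\section'

# ... reach the regular expression compiler as two, which it then
# reads as a single literal backslash followed by "section".
received = backslash + backslash + 'section'

assert typed == received
assert len(typed) == 9
\end{verbatim}\ecode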

The module defines these functions, an exception, and a data item:

\renewcommand{\indexsubitem}{(in module regex)}

\begin{funcdesc}{match}{pattern\, string}
Return how many characters at the beginning of \var{string} match
the regular expression \var{pattern}.  Return \code{-1} if the
string does not match the pattern (this is different from a
zero-length match!).
\end{funcdesc}

\begin{funcdesc}{search}{pattern\, string}
Return the first position in \var{string} that matches the regular
expression \var{pattern}.  Return \code{-1} if no position in the string
matches the pattern (this is different from a zero-length match
anywhere!).
\end{funcdesc}
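
The distinction between a failed match (\code{-1}) and a zero-length
match (\code{0}) can be illustrated by emulating the two return
conventions.  The sketch below is not the \code{regex} module: it
assumes the standard \code{re} module is available as a stand-in, the
helper names \code{matchlen} and \code{searchpos} are hypothetical,
and the patterns are limited to constructs that mean the same thing in
both syntaxes.

\bcode\begin{verbatim}
import re

def matchlen(pattern, string):
    # Emulate the match convention: length of the match at the
    # start of string, or -1 when the pattern does not match there.
    m = re.match(pattern, string)
    if m is None:
        return -1
    return m.end()

def searchpos(pattern, string):
    # Emulate the search convention: index of the first match,
    # or -1 when no position matches.
    m = re.search(pattern, string)
    if m is None:
        return -1
    return m.start()

zero_length = matchlen('a*', 'bbb')   # 'a*' matches the empty string
no_match = matchlen('a', 'bbb')       # 'a' cannot match at index 0
found_at = searchpos('b', 'aab')      # the first 'b' is at index 2

assert zero_length == 0
assert no_match == -1
assert found_at == 2
\end{verbatim}\ecode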

\begin{funcdesc}{compile}{pattern\, translate}
Compile a regular expression pattern into a regular expression
object, which can be used for matching using its \code{match} and
\code{search} methods, described below.  The optional
\var{translate}, if present, must be a 256-character string
indicating how characters (both of the pattern and of the strings to
be matched) are translated before comparing them; the \code{i}-th
element of the string gives the translation for the character with
character code \code{i}.

The sequence

\bcode\begin{verbatim}
prog = regex.compile(pat)
result = prog.match(str)
\end{verbatim}\ecode

is equivalent to

\bcode\begin{verbatim}
result = regex.match(pat, str)
\end{verbatim}\ecode

but the version using \code{compile()} is more efficient when a
program switches back and forth between several regular expressions.
(Only the compiled version of the last pattern passed to
\code{regex.match()} or \code{regex.search()} is cached, so programs
that use only a single regular expression at a time needn't worry about
compiling regular expressions.)
\end{funcdesc}

\begin{funcdesc}{set_syntax}{flags}
Set the syntax to be used by future calls to \code{compile},
\code{match} and \code{search}.  (Already compiled expression objects
are not affected.)  The argument is an integer which is the bitwise OR
of several flag bits.  The return value is the previous value of
the syntax flags.  Names for the flags are defined in the standard
module \code{regex_syntax}; read the file \file{regex_syntax.py} for
more information.
\end{funcdesc}

\begin{funcdesc}{symcomp}{pattern\, translate}
This is like \code{compile}, but supports symbolic group names: if a
parenthesis-enclosed group begins with a group name in angle
brackets, e.g. \code{'\e(<id>[a-z][a-z0-9]*\e)'}, the group can
be referenced by its name in arguments to the \code{group} method of
the resulting compiled regular expression object, like this:
\code{p.group('id')}.
\end{funcdesc}

\begin{excdesc}{error}
Exception raised when a string passed to one of the functions here
is not a valid regular expression (e.g., unmatched parentheses) or
when some other error occurs during compilation or matching.  (It is
never an error if a string contains no match for a pattern.)
\end{excdesc}

\begin{datadesc}{casefold}
A string suitable to pass as the \var{translate} argument to
\code{compile} to map all uppercase characters to their lowercase
equivalents.
\end{datadesc}
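
A \var{translate} string has one entry per character code, so a table
of the same shape as \code{casefold} can be built by hand.  The sketch
below is an illustration only: it folds just the ASCII letters
\code{A} to \code{Z}, and the actual contents of \code{casefold} may
differ.

\bcode\begin{verbatim}
# Build a 256-character translation string by hand: entry i is the
# character that the character with code i is translated to.
table = ''
for i in range(256):
    if ord('A') <= i <= ord('Z'):
        # Map an uppercase ASCII letter to its lowercase equivalent.
        table = table + chr(i + ord('a') - ord('A'))
    else:
        # Every other character translates to itself.
        table = table + chr(i)

assert len(table) == 256
assert table[ord('Q')] == 'q'
assert table[ord('q')] == 'q'
\end{verbatim}\ecode

Such a string would be passed as the second argument to
\code{compile}, as in \code{regex.compile(pat, table)}.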

\noindent
Compiled regular expression objects support these methods:

\renewcommand{\indexsubitem}{(regex method)}
\begin{funcdesc}{match}{string\, pos}
Return how many characters at the beginning of \var{string} match
the compiled regular expression.  Return \code{-1} if the string
does not match the pattern (this is different from a zero-length
match!).

The optional second parameter \var{pos} gives an index in the string
where the search is to start; it defaults to \code{0}.  This is not
completely equivalent to slicing the string; the \code{'\^'} pattern
character matches at the real beginning of the string and at positions
just after a newline, not necessarily at the index where the search
is to start.
\end{funcdesc}
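
The difference between passing \var{pos} and slicing the string shows
up with an anchored pattern.  The following sketch uses the standard
\code{re} module as a stand-in, on the assumption that its compiled
patterns treat \code{'\^'} and a start index in the way described
here; it does not exercise the \code{regex} module itself.

\bcode\begin{verbatim}
import re

prog = re.compile('^b')
text = 'ab'

# Starting at index 1: '^' still refers to the real beginning of
# the string, so the pattern cannot match at index 1.
with_pos = prog.match(text, 1)

# Slicing first: the slice is a new string that begins with 'b',
# so '^' matches at its start.
with_slice = prog.match(text[1:])

assert with_pos is None
assert with_slice is not None
\end{verbatim}\ecode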

\begin{funcdesc}{search}{string\, pos}
Return the first position in \var{string} that matches the compiled
regular expression.  Return \code{-1} if no position in the
string matches the pattern (this is different from a zero-length
match anywhere!).

The optional second parameter has the same meaning as for the
\code{match} method.
\end{funcdesc}

\begin{funcdesc}{group}{index\, index\, ...}
This method is only valid when the last call to the \code{match}
or \code{search} method found a match.  It returns one or more
groups of the match.  If there is a single \var{index} argument,
the result is a single string; if there are multiple arguments, the
result is a tuple with one item per argument.  If the \var{index} is
zero, the corresponding return value is the entire matching string; if
it is in the inclusive range [1..99], it is the string matching
the corresponding parenthesized group (using the default syntax,
groups are parenthesized using \code{\e(} and \code{\e)}).  If no
such group exists, the corresponding result is \code{None}.

If the regular expression was compiled by \code{symcomp} instead of
\code{compile}, the \var{index} arguments may also be strings
identifying groups by their group name.
\end{funcdesc}

\noindent
Compiled regular expressions support these data attributes:

\renewcommand{\indexsubitem}{(regex attribute)}

\begin{datadesc}{regs}
When the last call to the \code{match} or \code{search} method found a
match, this is a tuple of pairs of indices corresponding to the
beginning and end of all parenthesized groups in the pattern.  Indices
are relative to the string argument passed to \code{match} or
\code{search}.  The 0-th pair gives the beginning and end of the
text matched by the whole pattern.  When the last match or search
failed, this is \code{None}.
\end{datadesc}

\begin{datadesc}{last}
When the last call to the \code{match} or \code{search} method found a
match, this is the string argument passed to that method.  When the
last match or search failed, this is \code{None}.
\end{datadesc}

\begin{datadesc}{translate}
This is the value of the \var{translate} argument to
\code{regex.compile} that created this regular expression object.  If
the \var{translate} argument was omitted in the \code{regex.compile}
call, this is \code{None}.
\end{datadesc}

\begin{datadesc}{givenpat}
The regular expression pattern as passed to \code{compile} or
\code{symcomp}.
\end{datadesc}

\begin{datadesc}{realpat}
The regular expression after stripping the group names for regular
expressions compiled with \code{symcomp}.  Same as \code{givenpat}
otherwise.
\end{datadesc}

\begin{datadesc}{groupindex}
A dictionary giving the mapping from symbolic group names to numerical
group indices for regular expressions compiled with \code{symcomp}.
\code{None} otherwise.
\end{datadesc}
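
Taken together, \code{last}, \code{regs} and \code{groupindex} are
enough to recover the text of any group.  The sketch below uses
hand-written values in place of a real compiled object, as they might
look after the \code{symcomp} pattern shown earlier found an
identifier; the helper \code{grouptext} is hypothetical, and the end
index of each pair is assumed to be exclusive, as in a slice.

\bcode\begin{verbatim}
# Hand-written stand-ins for the attributes of a compiled object
# after a successful search (assumed values, not real output).
last = '42 abc7'
regs = ((3, 7), (3, 7))      # whole match, then group 1
groupindex = {'id': 1}

def grouptext(index):
    # Accept a group name or a group number, like the group method.
    if isinstance(index, str):
        index = groupindex[index]
    start, end = regs[index]
    return last[start:end]

whole = grouptext(0)
named = grouptext('id')

assert whole == 'abc7'
assert named == 'abc7'
\end{verbatim}\ecode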