% Guido van Rossum, 5fdeeea, 1994-01-02 01:22:07 +0000
\section{Built-in Module \sectcode{regex}}

\bimodindex{regex}
This module provides regular expression matching operations similar to
those found in Emacs.  It is always available.

By default the patterns are Emacs-style regular expressions; the
\code{set_syntax} function described below changes the syntax to match
that of several well-known \UNIX{} utilities.

This module is 8-bit clean: both patterns and strings may contain null
bytes and characters whose high bit is set.

\strong{Please note:} You usually don't have to worry about doubling
backslashes, even though they are used to escape special characters in
string literals as well as in regular expressions.  This is because
Python doesn't remove a backslash from a string literal if it is
followed by an unrecognized escape character.  \emph{However}, if you
want to include a literal \dfn{backslash} in a regular expression
represented as a string literal, you have to \emph{quadruple} it.  For
example, to extract LaTeX \samp{\e section\{{\rm \ldots}\}} headers from
a document, you can use this pattern:
\code{'\e \e \e\e section\{\e (.*\e )\}'}.
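
The quadrupling rule can be checked with a short sketch.  It assumes
that the \code{regex} module described here is not importable in the
Python being used, so it relies on the standard \code{re} module as a
stand-in; \code{re} writes groups with plain parentheses, so the pattern
differs slightly from the one above.  The names \code{pattern},
\code{text} and \code{title} are illustrative only.

```python
import re

# Four backslashes in the literal become two characters in the string,
# which the regular expression engine reads as one literal backslash.
pattern = '\\\\section\\{(.*)\\}'
text = 'Intro\n\\section{Built-in Modules}\n'

backslash_pair_length = len('\\\\')   # 2 characters, not 4

found = re.search(pattern, text)
title = found.group(1) if found else None
print(title)
```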

The module defines these functions, an exception, and a data item:

\renewcommand{\indexsubitem}{(in module regex)}
\begin{funcdesc}{match}{pattern\, string}
Return how many characters at the beginning of \var{string} match
the regular expression \var{pattern}.  Return \code{-1} if the
string does not match the pattern (this is different from a
zero-length match!).
\end{funcdesc}

\begin{funcdesc}{search}{pattern\, string}
Return the first position in \var{string} that matches the regular
expression \var{pattern}.  Return \code{-1} if no position in the
string matches the pattern (this is different from a zero-length match
anywhere!).
\end{funcdesc}
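
The return conventions of \code{match} and \code{search} can be
illustrated as follows.  This is only a sketch: it emulates the
described results with the standard \code{re} module, and the helper
names \code{match_length} and \code{search_position} are hypothetical,
not part of \code{regex}.

```python
import re

def match_length(pattern, string):
    """Hypothetical helper: length of the match at the start, or -1."""
    found = re.match(pattern, string)
    return found.end() if found else -1

def search_position(pattern, string):
    """Hypothetical helper: index of the first match, or -1."""
    found = re.search(pattern, string)
    return found.start() if found else -1

print(match_length('a*', 'aab'))    # 2
print(match_length('a*', 'bbb'))    # 0: a zero-length match
print(match_length('a+', 'bbb'))    # -1: no match at all
print(search_position('b', 'aab'))  # 2
print(search_position('c', 'aab'))  # -1
```

A result of \code{0} therefore still counts as a successful match; only
\code{-1} signals failure.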

\begin{funcdesc}{compile}{pattern\, translate}
Compile a regular expression pattern into a regular expression
object, which can be used for matching using its \code{match} and
\code{search} methods, described below.  The optional
\var{translate}, if present, must be a 256-character string
indicating how characters (both of the pattern and of the strings to
be matched) are translated before comparing them; the \code{i}-th
element of the string gives the translation for the character with
ASCII code \code{i}.

The sequence

\bcode\begin{verbatim}
prog = regex.compile(pat)
result = prog.match(str)
\end{verbatim}\ecode

is equivalent to

\bcode\begin{verbatim}
result = regex.match(pat, str)
\end{verbatim}\ecode

but the version using \code{compile()} is more efficient when multiple
regular expressions are used concurrently in a single program.  (The
compiled version of the last pattern passed to \code{regex.match()} or
\code{regex.search()} is cached, so programs that use only a single
regular expression at a time needn't worry about compiling regular
expressions.)
\end{funcdesc}
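
The effect of a one-entry cache can be sketched as below.  This is not
the module's actual implementation; it is a minimal model built on the
standard \code{re} module, and \code{cached_match},
\code{compile_count} and the two cache variables are hypothetical names
introduced for illustration.

```python
import re

_last_pattern = None   # hypothetical one-entry cache: last pattern seen
_last_prog = None      # ... and its compiled form
compile_count = 0      # counts compilations, for illustration only

def cached_match(pattern, string):
    """Match using a one-entry cache keyed on the last pattern seen."""
    global _last_pattern, _last_prog, compile_count
    if pattern != _last_pattern:
        _last_prog = re.compile(pattern)
        _last_pattern = pattern
        compile_count += 1
    found = _last_prog.match(string)
    return found.end() if found else -1

# One pattern at a time: the cache is filled once and then reused.
for line in ['aaa', 'ab', 'b']:
    cached_match('a+', line)
single = compile_count

# Two patterns alternating: each call evicts the other pattern,
# so every call goes through the compile step in this sketch.
compile_count = 0
_last_pattern = None
for line in ['aaa', 'ab', 'b']:
    cached_match('a+', line)
    cached_match('b+', line)
alternating = compile_count

print(single, alternating)
```

In the alternating case, compiling both patterns once up front and
keeping the two objects avoids the repeated work.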

\begin{funcdesc}{set_syntax}{flags}
Set the syntax to be used by future calls to \code{compile},
\code{match} and \code{search}.  (Already compiled expression objects
are not affected.)  The argument is an integer which is the OR of
several flag bits.  The return value is the previous value of
the syntax flags.  Names for the flags are defined in the standard
module \code{regex_syntax}; read the file \file{regex_syntax.py} for
more information.
\end{funcdesc}

\begin{excdesc}{error}
Exception raised when a string passed to one of the functions here
is not a valid regular expression (e.g., unmatched parentheses) or
when some other error occurs during compilation or matching.  (It is
never an error if a string contains no match for a pattern.)
\end{excdesc}

\begin{datadesc}{casefold}
A string suitable to pass as the \var{translate} argument to
\code{compile} to map all uppercase characters to their lowercase
equivalents.
\end{datadesc}
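
A translation table of the kind described for \var{translate} can be
built by hand.  The sketch below is an assumption about what such a
table looks like, not the actual value of \code{casefold}: it folds only
the ASCII letters A to Z, and the names \code{casefold_table} and
\code{translated_equal} are hypothetical.

```python
# Build a 256-character table: position i holds the translation of chr(i).
# Assumption: only the ASCII letters A-Z are folded to lowercase.
casefold_table = ''.join(
    chr(i).lower() if 'A' <= chr(i) <= 'Z' else chr(i)
    for i in range(256)
)

def translated_equal(left, right, table):
    """Compare two strings after translating every character via table."""
    return left.translate(table) == right.translate(table)

print(len(casefold_table))                                  # 256
print(casefold_table[ord('A')])                             # a
print(translated_equal('Hello', 'hELLO', casefold_table))   # True
```

Because the table is applied to both the pattern and the string, the
comparison becomes case-insensitive without changing either argument.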

\noindent
Compiled regular expression objects support these methods:

\renewcommand{\indexsubitem}{(regex method)}
\begin{funcdesc}{match}{string\, pos}
Return how many characters at the beginning of \var{string} match
the compiled regular expression.  Return \code{-1} if the string
does not match the pattern (this is different from a zero-length
match!).

The optional second parameter \var{pos} gives an index in the string
where the search is to start; it defaults to \code{0}.  This is not
completely equivalent to slicing the string; the \code{'\^'} pattern
character matches at the real beginning of the string and at positions
just after a newline, not necessarily at the index where the search
is to start.
\end{funcdesc}
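
The difference between passing \var{pos} and slicing can be seen in a
small sketch.  It uses the standard \code{re} module as a stand-in,
whose compiled objects also accept a start index and treat the
beginning-of-string anchor in the same way; the variable names are
illustrative.

```python
import re

prog = re.compile('^b')
string = 'ab'

with_pos = prog.match(string, 1)     # '^' still refers to index 0 of 'ab'
with_slice = prog.match(string[1:])  # the slice 'b' begins with 'b'

print(with_pos is None)        # True
print(with_slice is not None)  # True
```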

\begin{funcdesc}{search}{string\, pos}
Return the first position in \var{string} that matches the compiled
regular expression.  Return \code{-1} if no position in the
string matches the pattern (this is different from a zero-length
match anywhere!).

The optional second parameter has the same meaning as for the
\code{match} method.
\end{funcdesc}

\begin{funcdesc}{group}{index\, index\, ...}
This method is only valid when the last call to the \code{match}
or \code{search} method found a match.  It returns one or more
groups of the match.  If there is a single \var{index} argument,
the result is a single string; if there are multiple arguments, the
result is a tuple with one item per argument.  If the \var{index} is
zero, the corresponding return value is the entire matching string; if
it is in the inclusive range [1..9], it is the string matching
the corresponding parenthesized group (using the default syntax,
groups are parenthesized using \code{\e (} and \code{\e )}).  If no
such group exists, the corresponding result is \code{None}.
\end{funcdesc}
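
The single-argument and multiple-argument forms of \code{group} can be
sketched with the standard \code{re} module.  Two differences from the
module described here are assumed: in \code{re} the method lives on the
match object returned by \code{match} rather than on the compiled
object, and groups are written with plain parentheses.

```python
import re

found = re.match('([a-z]+)=([0-9]+)', 'width=80')

whole = found.group(0)     # index 0: the entire matching string
key = found.group(1)       # one index: a single string
pair = found.group(1, 2)   # several indices: a tuple, one item each

print(whole)
print(key)
print(pair)
```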

\noindent
Compiled regular expressions support these data attributes:

\renewcommand{\indexsubitem}{(regex attribute)}
\begin{datadesc}{regs}
When the last call to the \code{match} or \code{search} method found a
match, this is a tuple of pairs of indices corresponding to the
beginning and end of all parenthesized groups in the pattern.  Indices
are relative to the string argument passed to \code{match} or
\code{search}.  The 0-th pair gives the beginning and end of the
text matched by the whole pattern.  When the last match or search
failed, this is \code{None}.
\end{datadesc}
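
The shape of such a tuple of index pairs can be sketched as follows.
The example rebuilds an equivalent value from the standard \code{re}
module, on the assumption that \code{regex} itself is unavailable; the
local name \code{regs} only mirrors the attribute described above.

```python
import re

prog = re.compile('([a-z]+)=([0-9]+)')
string = 'set width=80'
found = prog.search(string)

# One (begin, end) pair per group; pair 0 covers the whole match.
regs = tuple(found.span(i) for i in range(prog.groups + 1))

print(regs)
begin, end = regs[0]
print(string[begin:end])
```

Each pair can be used directly as a slice of the original string.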

\begin{datadesc}{last}
When the last call to the \code{match} or \code{search} method found a
match, this is the string argument passed to that method.  When the
last match or search failed, this is \code{None}.
\end{datadesc}

\begin{datadesc}{translate}
This is the value of the \var{translate} argument to
\code{regex.compile} that created this regular expression object.  If
the \var{translate} argument was omitted in the \code{regex.compile}
call, this is \code{None}.
\end{datadesc}