Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 1 | # module 're' -- A collection of regular expression operations |
| 2 | |
| 3 | """Support for regular expressions (RE). |
| 4 | |
| 5 | This module provides regular expression matching operations similar to |
| 6 | those found in Perl. It's 8-bit clean: the strings being processed may |
| 7 | contain both null bytes and characters whose high bit is set. Regular |
| 8 | expression pattern strings may not contain null bytes, but can specify |
| 9 | the null byte using the \\number notation. Characters with the high |
| 10 | bit set may be included. |
| 11 | |
| 12 | Regular expressions can contain both special and ordinary |
| 13 | characters. Most ordinary characters, like "A", "a", or "0", are the |
| 14 | simplest regular expressions; they simply match themselves. You can |
| 15 | concatenate ordinary characters, so last matches the string 'last'. |
| 16 | |
| 17 | The special characters are: |
| 18 | "." Matches any character except a newline. |
| 19 | "^" Matches the start of the string. |
| 20 | "$" Matches the end of the string. |
| 21 | "*" Matches 0 or more (greedy) repetitions of the preceding RE. |
| 22 | Greedy means that it will match as many repetitions as possible. |
| 23 | "+" Matches 1 or more (greedy) repetitions of the preceding RE. |
| 24 | "?" Matches 0 or 1 (greedy) of the preceding RE. |
| 25 | *?,+?,?? Non-greedy versions of the previous three special characters. |
| 26 | {m,n} Matches from m to n repetitions of the preceding RE. |
| 27 | {m,n}? Non-greedy version of the above. |
| 28 | "\\" Either escapes special characters or signals a special sequence. |
| 29 | [] Indicates a set of characters. |
| 30 | A "^" as the first character indicates a complementing set. |
| 31 | "|" A|B, creates an RE that will match either A or B. |
| 32 | (...) Matches the RE inside the parentheses. |
| 33 | The contents can be retrieved or matched later in the string. |
| 34 | (?iLmsx) Set the I, L, M, S, or X flag for the RE. |
| 35 | (?:...) Non-grouping version of regular parentheses. |
| 36 | (?P<name>...) The substring matched by the group is accessible by name. |
| 37 | (?P=name) Matches the text matched earlier by the group named name. |
| 38 | (?#...) A comment; ignored. |
| 39 | (?=...) Matches if ... matches next, but doesn't consume the string. |
| 40 | (?!...) Matches if ... doesn't match next. |
| 41 | |
| 42 | The special sequences consist of "\\" and a character from the list |
| 43 | below. If the ordinary character is not on the list, then the |
| 44 | resulting RE will match the second character. |
| 45 | \\number Matches the contents of the group of the same number. |
| 46 | \\A Matches only at the start of the string. |
| 47 | \\Z Matches only at the end of the string. |
| 48 | \\b Matches the empty string, but only at the start or end of a word. |
| 49 | \\B Matches the empty string, but not at the start or end of a word. |
| 50 | \\d Matches any decimal digit; equivalent to the set [0-9]. |
| 51 | \\D Matches any non-digit character; equivalent to the set [^0-9]. |
| 52 | \\s Matches any whitespace character; equivalent to [ \\t\\n\\r\\f\\v]. |
| 53 | \\S Matches any non-whitespace character; equiv. to [^ \\t\\n\\r\\f\\v]. |
| 54 | \\w Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]. |
| 55 | With LOCALE, it will match the set [0-9_] plus characters defined |
| 56 | as letters for the current locale. |
| 57 | \\W Matches the complement of \\w. |
| 58 | \\\\ Matches a literal backslash. |
| 59 | |
| 60 | This module exports the following functions: |
| 61 | match Match a regular expression pattern to the beginning of a string. |
| 62 | search Search a string for the presence of a pattern. |
| 63 | sub Substitute occurrences of a pattern found in a string. |
| 64 | subn Same as sub, but also return the number of substitutions made. |
| 65 | split Split a string by the occurrences of a pattern. |
| 66 | findall Find all occurrences of a pattern in a string. |
| 67 | compile Compile a pattern into a RegexObject. |
| 68 | escape Backslash all non-alphanumerics in a string. |
| 69 | |
| 70 | This module exports the following classes: |
| 71 | RegexObject Holds a compiled regular expression pattern. |
| 72 | MatchObject Contains information about pattern matches. |
| 73 | |
| 74 | Some of the functions in this module take flags as optional parameters: |
| 75 | I IGNORECASE Perform case-insensitive matching. |
| 76 | L LOCALE Make \\w, \\W, \\b, \\B dependent on the current locale. |
| 77 | M MULTILINE "^" matches the beginning of lines as well as the string. |
| 78 | "$" matches the end of lines as well as the string. |
| 79 | S DOTALL "." matches any character at all, including the newline. |
| 80 | X VERBOSE Ignore whitespace and comments for nicer looking RE's. |
| 81 | |
| 82 | This module also defines an exception 'error'. |
| 83 | |
| 84 | """ |
| 85 | |
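# A few of the special characters from the docstring above, tried out
# against today's standard-library re module. That module is an assumption
# here: it is a later implementation, not the pcre-based code in this file,
# and all variable names below are illustrative.

```python
import re

text = "<a><b>"

# Greedy "*" takes as much as possible; non-greedy "*?" takes as little.
greedy = re.match(r"<.*>", text).group(0)
lazy = re.match(r"<.*?>", text).group(0)

# "+" needs at least one repetition, whereas "*" also accepts none.
star_on_empty = re.match(r"a*", "") is not None
plus_on_empty = re.match(r"a+", "") is not None

# {m,n} bounds the repetition count and is greedy by default.
bounded = re.match(r"a{2,3}", "aaaa").group(0)
```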
| 86 | |
Guido van Rossum | bf9d353 | 1997-10-06 14:45:17 +0000 | [diff] [blame] | 87 | import sys |
Guido van Rossum | 5ca1b71 | 1997-07-10 21:00:31 +0000 | [diff] [blame] | 88 | import string |
Guido van Rossum | bf9d353 | 1997-10-06 14:45:17 +0000 | [diff] [blame] | 89 | from pcre import * |
Guido van Rossum | 5ca1b71 | 1997-07-10 21:00:31 +0000 | [diff] [blame] | 90 | |
Guido van Rossum | bf9d353 | 1997-10-06 14:45:17 +0000 | [diff] [blame] | 91 | # |
| 92 | # First, the public part of the interface: |
| 93 | # |
| 94 | |
| 95 | # pcre.error and re.error should be the same, since exceptions can be |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 96 | # raised from either module. |
Guido van Rossum | 5ca1b71 | 1997-07-10 21:00:31 +0000 | [diff] [blame] | 97 | |
| 98 | # compilation flags |
| 99 | |
Guido van Rossum | bf9d353 | 1997-10-06 14:45:17 +0000 | [diff] [blame] | 100 | I = IGNORECASE |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 101 | L = LOCALE |
Guido van Rossum | bf9d353 | 1997-10-06 14:45:17 +0000 | [diff] [blame] | 102 | M = MULTILINE |
| 103 | S = DOTALL |
| 104 | X = VERBOSE |
Guido van Rossum | 09bcfd6 | 1997-07-15 15:38:20 +0000 | [diff] [blame] | 105 | |
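# The one-letter aliases above still exist in the modern re module, so the
# flag descriptions in the module docstring can be checked with a short
# sketch. It assumes Python 3's re rather than the pcre wrapper in this
# file; the sample text and variable names are made up for illustration.

```python
import re

text = "one\ntwo"

# MULTILINE: "^" also matches just after each newline.
without_m = re.findall(r"^\w+", text)
with_m = re.findall(r"^\w+", text, re.M)

# DOTALL: "." also matches the newline.
dot_plain = re.match(r".+", text).group(0)
dot_all = re.match(r".+", text, re.S).group(0)

# VERBOSE: whitespace and comments inside the pattern are ignored.
number = re.compile(r"""
    \d+   # integer part
    \.    # decimal point
    \d+   # fraction
""", re.X)
verbose_hit = number.match("3.14").group(0)
```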
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 106 | |
Guido van Rossum | 09bcfd6 | 1997-07-15 15:38:20 +0000 | [diff] [blame] | 107 | # |
| 108 | # |
| 109 | # |
| 110 | |
Guido van Rossum | 26d80e6 | 1997-07-15 18:59:04 +0000 | [diff] [blame] | 111 | _cache = {} |
| 112 | _MAXCACHE = 20 |
Guido van Rossum | a0e4c1b | 1997-07-17 14:52:48 +0000 | [diff] [blame] | 113 | |
Guido van Rossum | 9e18ec7 | 1997-07-17 22:39:13 +0000 | [diff] [blame] | 114 | def _cachecompile(pattern, flags=0): |
Guido van Rossum | 26d80e6 | 1997-07-15 18:59:04 +0000 | [diff] [blame] | 115 | key = (pattern, flags) |
| 116 | try: |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 117 | return _cache[key] |
Guido van Rossum | 26d80e6 | 1997-07-15 18:59:04 +0000 | [diff] [blame] | 118 | except KeyError: |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 119 | pass |
Guido van Rossum | 26d80e6 | 1997-07-15 18:59:04 +0000 | [diff] [blame] | 120 | value = compile(pattern, flags) |
| 121 | if len(_cache) >= _MAXCACHE: |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 122 | _cache.clear() |
Guido van Rossum | 26d80e6 | 1997-07-15 18:59:04 +0000 | [diff] [blame] | 123 | _cache[key] = value |
| 124 | return value |
| 125 | |
Guido van Rossum | 5ca1b71 | 1997-07-10 21:00:31 +0000 | [diff] [blame] | 126 | def match(pattern, string, flags=0): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 127 | """match (pattern, string[, flags]) -> MatchObject or None |
| 128 | |
| 129 | If zero or more characters at the beginning of string match the |
| 130 | regular expression pattern, return a corresponding MatchObject |
| 131 | instance. Return None if the string does not match the pattern; |
| 132 | note that this is different from a zero-length match. |
| 133 | |
| 134 | Note: If you want to locate a match anywhere in string, use |
| 135 | search() instead. |
| 136 | |
| 137 | """ |
| 138 | |
Guido van Rossum | 26d80e6 | 1997-07-15 18:59:04 +0000 | [diff] [blame] | 139 | return _cachecompile(pattern, flags).match(string) |
Guido van Rossum | a0e4c1b | 1997-07-17 14:52:48 +0000 | [diff] [blame] | 140 | |
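# The difference between a zero-length match and no match, as described in
# the match() docstring, shown with the modern re module (an assumption;
# the variable names are illustrative).

```python
import re

empty = re.match(r"x*", "abc")        # succeeds with an empty match at index 0
missing = re.match(r"x+", "abc")      # fails: no "x" at the beginning
not_at_start = re.match(r"b", "abc")  # fails: "b" is present, but not at index 0
```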
Guido van Rossum | 5ca1b71 | 1997-07-10 21:00:31 +0000 | [diff] [blame] | 141 | def search(pattern, string, flags=0): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 142 | """search (pattern, string[, flags]) -> MatchObject or None |
| 143 | |
| 144 | Scan through string looking for a location where the regular |
| 145 | expression pattern produces a match, and return a corresponding |
| 146 | MatchObject instance. Return None if no position in the string |
| 147 | matches the pattern; note that this is different from finding a |
| 148 | zero-length match at some point in the string. |
| 149 | |
| 150 | """ |
Guido van Rossum | 26d80e6 | 1997-07-15 18:59:04 +0000 | [diff] [blame] | 151 | return _cachecompile(pattern, flags).search(string) |
Guido van Rossum | a0e4c1b | 1997-07-17 14:52:48 +0000 | [diff] [blame] | 152 | |
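# Unlike match(), search() scans the whole string. A small sketch with the
# modern re module (an assumption; sample strings are made up):

```python
import re

found = re.search(r"b+", "aabbcc")               # located in the middle
empty = re.search(r"x*", "abc")                  # zero-length match at index 0
nothing = re.search(r"x+", "abc")                # no position matches
folded = re.search(r"BB", "aabbcc", re.IGNORECASE)
```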
Guido van Rossum | 5ca1b71 | 1997-07-10 21:00:31 +0000 | [diff] [blame] | 153 | def sub(pattern, repl, string, count=0): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 154 | """sub(pattern, repl, string[, count=0]) -> string |
| 155 | |
| 156 | Return the string obtained by replacing the leftmost |
| 157 | non-overlapping occurrences of pattern in string by the |
| 158 | replacement repl. If the pattern isn't found, string is returned |
| 159 | unchanged. repl can be a string or a function; if a function, it |
| 160 | is called for every non-overlapping occurrence of pattern. The |
| 161 | function takes a single match object argument, and returns the |
| 162 | replacement string. |
| 163 | |
| 164 | The pattern may be a string or a regex object; if you need to |
| 165 | specify regular expression flags, you must use a regex object, or |
| 166 | use embedded modifiers in a pattern; e.g. |
| 167 | sub("(?i)b+", "x", "bbbb BBBB") returns 'x x'. |
| 168 | |
| 169 | The optional argument count is the maximum number of pattern |
| 170 | occurrences to be replaced; count must be a non-negative integer, |
| 171 | and the default value of 0 means to replace all occurrences. |
| 172 | |
| 173 | """ |
Guido van Rossum | 9e18ec7 | 1997-07-17 22:39:13 +0000 | [diff] [blame] | 174 | if type(pattern) == type(''): |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 175 | pattern = _cachecompile(pattern) |
Guido van Rossum | 9e18ec7 | 1997-07-17 22:39:13 +0000 | [diff] [blame] | 176 | return pattern.sub(repl, string, count) |
Guido van Rossum | 5ca1b71 | 1997-07-10 21:00:31 +0000 | [diff] [blame] | 177 | |
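# The three behaviours described in the sub() docstring (embedded flags, a
# function as repl, and a count limit), sketched against the modern re
# module. That module is an assumption; it also accepts a flags argument,
# which the function above does not.

```python
import re

# Embedded modifier, as in the docstring example.
folded = re.sub("(?i)b+", "x", "bbbb BBBB")

# repl as a function: it receives each match object in turn.
doubled = re.sub(r"\d+", lambda m: str(int(m.group(0)) * 2),
                 "3 apples, 10 pears")

# count limits how many leftmost occurrences are replaced.
limited = re.sub(r"a", "-", "banana", count=2)

# No occurrence: the string comes back unchanged.
untouched = re.sub(r"z", "-", "banana")
```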
| 178 | def subn(pattern, repl, string, count=0): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 179 | """subn(pattern, repl, string[, count=0]) -> (string, num substitutions) |
| 180 | |
| 181 | Perform the same operation as sub(), but return a tuple |
| 182 | (new_string, number_of_subs_made). |
| 183 | |
| 184 | """ |
Guido van Rossum | 9e18ec7 | 1997-07-17 22:39:13 +0000 | [diff] [blame] | 185 | if type(pattern) == type(''): |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 186 | pattern = _cachecompile(pattern) |
Guido van Rossum | 9e18ec7 | 1997-07-17 22:39:13 +0000 | [diff] [blame] | 187 | return pattern.subn(repl, string, count) |
Guido van Rossum | a0e4c1b | 1997-07-17 14:52:48 +0000 | [diff] [blame] | 188 | |
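# subn() returns the substitution count alongside the new string. A short
# sketch using the modern re module (an assumption):

```python
import re

all_replaced = re.subn(r"a", "-", "banana")
one_replaced = re.subn(r"a", "-", "banana", count=1)
none_replaced = re.subn(r"z", "-", "banana")
```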
Guido van Rossum | 8a9a4a2 | 1997-07-11 20:48:25 +0000 | [diff] [blame] | 189 | def split(pattern, string, maxsplit=0): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 190 | """split(pattern, string[, maxsplit=0]) -> list of strings |
| 191 | |
| 192 | Split string by the occurrences of pattern. If capturing |
| 193 | parentheses are used in pattern, then the text of all groups in |
| 194 | the pattern is also returned as part of the resulting list. If |
| 195 | maxsplit is nonzero, at most maxsplit splits occur, and the |
| 196 | remainder of the string is returned as the final element of the |
| 197 | list. |
| 198 | |
| 199 | """ |
Guido van Rossum | 9e18ec7 | 1997-07-17 22:39:13 +0000 | [diff] [blame] | 200 | if type(pattern) == type(''): |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 201 | pattern = _cachecompile(pattern) |
Guido van Rossum | 9e18ec7 | 1997-07-17 22:39:13 +0000 | [diff] [blame] | 202 | return pattern.split(string, maxsplit) |
Guido van Rossum | 5ca1b71 | 1997-07-10 21:00:31 +0000 | [diff] [blame] | 203 | |
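# How capturing parentheses and maxsplit change the result of split(),
# sketched with the modern re module (an assumption; the sample string is
# made up):

```python
import re

plain = re.split(r",\s*", "a, b,c")                 # separators are dropped
captured = re.split(r"(,)\s*", "a, b,c")            # group text is kept
limited = re.split(r",\s*", "a, b,c", maxsplit=1)   # remainder stays whole
```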
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 204 | def findall(pattern, string): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 205 | """findall(pattern, string) -> list |
| 206 | |
| 207 | Return a list of all non-overlapping matches of pattern in |
| 208 | string. If one or more groups are present in the pattern, return a |
| 209 | list of groups; this will be a list of tuples if the pattern has |
| 210 | more than one group. Empty matches are included in the result. |
| 211 | |
| 212 | """ |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 213 | if type(pattern) == type(''): |
| 214 | pattern = _cachecompile(pattern) |
| 215 | return pattern.findall(string) |
| 216 | |
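# The shape of findall()'s result depends on how many groups the pattern
# has. A sketch with the modern re module (an assumption):

```python
import re

no_groups = re.findall(r"\d+", "10 apples, 25 pears")
one_group = re.findall(r"(\d+) apples", "10 apples, 25 pears")
two_groups = re.findall(r"(\w+)=(\d+)", "a=1, b=22")

# Empty matches are included: one at each of the three positions of "ab".
empties = re.findall(r"x*", "ab")
```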
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 217 | def escape(pattern): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 218 | """escape(string) -> string |
| 219 | |
| 220 | Return string with all non-alphanumerics backslashed; this is |
| 221 | useful if you want to match an arbitrary literal string that may |
| 222 | have regular expression metacharacters in it. |
| 223 | |
| 224 | """ |
Guido van Rossum | 3a8d1f5 | 1998-07-20 15:46:13 +0000 | [diff] [blame] | 225 | result = list(pattern) |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 226 | alphanum=string.letters+'_'+string.digits |
Guido van Rossum | 3a8d1f5 | 1998-07-20 15:46:13 +0000 | [diff] [blame] | 227 | for i in range(len(pattern)): |
| 228 | char = pattern[i] |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 229 | if char not in alphanum: |
Guido van Rossum | 3a8d1f5 | 1998-07-20 15:46:13 +0000 | [diff] [blame] | 230 | if char=='\000': result[i] = '\\000' |
| 231 | else: result[i] = '\\'+char |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 232 | return string.join(result, '') |
| 233 | |
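# The escaping loop above, restated for Python 3 strings. escape_nonalnum is
# a hypothetical name, and string.ascii_letters is an assumption standing in
# for the old string.letters. The modern re.escape may escape fewer
# characters, so only the round trip is compared.

```python
import re
import string

_ALNUM = string.ascii_letters + "_" + string.digits

def escape_nonalnum(text):
    """Backslash every character that is not a letter, digit or underscore."""
    out = []
    for ch in text:
        if ch in _ALNUM:
            out.append(ch)
        elif ch == "\000":
            out.append("\\000")   # a null byte is spelled as an octal escape
        else:
            out.append("\\" + ch)
    return "".join(out)

literal = "a.b*c"
escaped = escape_nonalnum(literal)
```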
| 234 | def compile(pattern, flags=0): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 235 | """compile(pattern[, flags]) -> RegexObject |
| 236 | |
| 237 | Compile a regular expression pattern into a regular expression |
| 238 | object, which can be used for matching using its match() and |
| 239 | search() methods. |
| 240 | |
| 241 | """ |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 242 | groupindex={} |
| 243 | code=pcre_compile(pattern, flags, groupindex) |
| 244 | return RegexObject(pattern, flags, code, groupindex) |
| 245 | |
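# What compile() hands back, sketched with the modern re module (an
# assumption: its pattern objects expose pattern, flags and groupindex much
# like the RegexObject defined below; the pattern itself is made up).

```python
import re

pair = re.compile(r"(?P<first>\w+) (?P<second>\w+)", re.IGNORECASE)

source_text = pair.pattern              # the original pattern string
name_to_number = dict(pair.groupindex)  # named groups mapped to group numbers
m = pair.match("Hello World")
```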
| 246 | |
Guido van Rossum | 5ca1b71 | 1997-07-10 21:00:31 +0000 | [diff] [blame] | 247 | # |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 248 | # Class definitions |
Guido van Rossum | 5ca1b71 | 1997-07-10 21:00:31 +0000 | [diff] [blame] | 249 | # |
| 250 | |
| 251 | class RegexObject: |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 252 | """Holds a compiled regular expression pattern. |
| 253 | |
| 254 | Methods: |
| 255 | match Match the pattern to the beginning of a string. |
| 256 | search Search a string for the presence of the pattern. |
| 257 | sub Substitute occurrences of the pattern found in a string. |
| 258 | subn Same as sub, but also return the number of substitutions made. |
| 259 | split Split a string by the occurrences of the pattern. |
| 260 | findall Find all occurrences of the pattern in a string. |
| 261 | |
| 262 | """ |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 263 | |
Guido van Rossum | bf9d353 | 1997-10-06 14:45:17 +0000 | [diff] [blame] | 264 | def __init__(self, pattern, flags, code, groupindex): |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 265 | self.code = code |
| 266 | self.flags = flags |
| 267 | self.pattern = pattern |
| 268 | self.groupindex = groupindex |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 269 | |
| 270 | def search(self, string, pos=0, endpos=None): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 271 | """search(string[, pos][, endpos]) -> MatchObject or None |
| 272 | |
| 273 | Scan through string looking for a location where this regular |
| 274 | expression produces a match, and return a corresponding |
| 275 | MatchObject instance. Return None if no position in the string |
| 276 | matches the pattern; note that this is different from finding |
| 277 | a zero-length match at some point in the string. The optional |
| 278 | pos and endpos parameters have the same meaning as for the |
| 279 | match() method. |
| 280 | |
| 281 | """ |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 282 | if endpos is None or endpos>len(string): |
| 283 | endpos=len(string) |
| 284 | if endpos<pos: endpos=pos |
| 285 | regs = self.code.match(string, pos, endpos, 0) |
| 286 | if regs is None: |
| 287 | return None |
| 288 | self._num_regs=len(regs) |
| 289 | |
| 290 | return MatchObject(self, |
| 291 | string, |
| 292 | pos, endpos, |
| 293 | regs) |
Guido van Rossum | a0e4c1b | 1997-07-17 14:52:48 +0000 | [diff] [blame] | 294 | |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 295 | def match(self, string, pos=0, endpos=None): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 296 | """match(string[, pos][, endpos]) -> MatchObject or None |
| 297 | |
| 298 | If zero or more characters at the beginning of string match |
| 299 | this regular expression, return a corresponding MatchObject |
| 300 | instance. Return None if the string does not match the |
| 301 | pattern; note that this is different from a zero-length match. |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 302 | |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 303 | Note: If you want to locate a match anywhere in string, use |
| 304 | search() instead. |
| 305 | |
| 306 | The optional second parameter pos gives an index in the string |
| 307 | where the search is to start; it defaults to 0. This is not |
| 308 | completely equivalent to slicing the string; the '^' pattern |
| 309 | character matches at the real beginning of the string and at |
| 310 | positions just after a newline, but not necessarily at the |
| 311 | index where the search is to start. |
| 312 | |
| 313 | The optional parameter endpos limits how far the string will |
| 314 | be searched; it will be as if the string is endpos characters |
| 315 | long, so only the characters from pos to endpos will be |
| 316 | searched for a match. |
| 317 | |
| 318 | """ |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 319 | if endpos is None or endpos>len(string): |
| 320 | endpos=len(string) |
| 321 | if endpos<pos: endpos=pos |
| 322 | regs = self.code.match(string, pos, endpos, ANCHORED) |
| 323 | if regs is None: |
| 324 | return None |
| 325 | self._num_regs=len(regs) |
| 326 | return MatchObject(self, |
| 327 | string, |
| 328 | pos, endpos, |
| 329 | regs) |
Guido van Rossum | a0e4c1b | 1997-07-17 14:52:48 +0000 | [diff] [blame] | 330 | |
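    # Why pos is not the same as slicing, and how endpos truncates the
    # search, sketched with compiled patterns from the modern re module (an
    # assumption; the docstring above describes the same rule).

```python
import re

anchored = re.compile(r"^b")
sliced = anchored.match("ab"[1:])     # matches: "b" starts the sliced string
with_pos = anchored.match("ab", 1)    # None: index 1 is not the real beginning

plain = re.compile(r"b")
at_pos = plain.match("ab", 1)         # without "^", matching at pos works

# endpos=4 makes the string behave as if it were "ab12".
digits = re.compile(r"\d+").search("ab123", 0, 4)
```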
Guido van Rossum | 8a9a4a2 | 1997-07-11 20:48:25 +0000 | [diff] [blame] | 331 | def sub(self, repl, string, count=0): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 332 | """sub(repl, string[, count=0]) -> string |
| 333 | |
| 334 | Return the string obtained by replacing the leftmost |
| 335 | non-overlapping occurrences of the compiled pattern in string |
| 336 | by the replacement repl. If the pattern isn't found, string is |
| 337 | returned unchanged. |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 338 | |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 339 | Identical to the sub() function, using the compiled pattern. |
| 340 | |
| 341 | """ |
Guido van Rossum | 71fa97c | 1997-07-18 04:26:03 +0000 | [diff] [blame] | 342 | return self.subn(repl, string, count)[0] |
Guido van Rossum | a0e4c1b | 1997-07-17 14:52:48 +0000 | [diff] [blame] | 343 | |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 344 | def subn(self, repl, source, count=0): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 345 | """subn(repl, string[, count=0]) -> tuple |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 346 | |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 347 | Perform the same operation as sub(), but return a tuple |
| 348 | (new_string, number_of_subs_made). |
| 349 | |
| 350 | """ |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 351 | if count < 0: |
| 352 | raise error, "negative substitution count" |
| 353 | if count == 0: |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 354 | count = sys.maxint |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 355 | n = 0 # Number of matches |
| 356 | pos = 0 # Where to start searching |
| 357 | lastmatch = -1 # End of last match |
| 358 | results = [] # Substrings making up the result |
| 359 | end = len(source) |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 360 | |
| 361 | if type(repl) is type(''): |
| 362 | # See if repl contains group references |
| 363 | try: |
| 364 | repl = pcre_expand(_Dummy, repl) |
| 365 | except: |
| 366 | m = MatchObject(self, source, 0, end, []) |
| 367 | repl = lambda m, repl=repl, expand=pcre_expand: expand(m, repl) |
| 368 | else: |
| 369 | m = None |
| 370 | else: |
| 371 | m = MatchObject(self, source, 0, end, []) |
| 372 | |
| 373 | match = self.code.match |
| 374 | append = results.append |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 375 | while n < count and pos <= end: |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 376 | regs = match(source, pos, end, 0) |
| 377 | if not regs: |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 378 | break |
Andrew M. Kuchling | 9a80e00 | 1998-08-21 18:39:38 +0000 | [diff] [blame] | 379 | self._num_regs = len(regs) |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 380 | i, j = regs[0] |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 381 | if i == j == lastmatch: |
| 382 | # Empty match adjacent to previous match |
| 383 | pos = pos + 1 |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 384 | append(source[lastmatch:pos]) |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 385 | continue |
| 386 | if pos < i: |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 387 | append(source[pos:i]) |
| 388 | if m: |
| 389 | m.pos = pos |
| 390 | m.regs = regs |
| 391 | append(repl(m)) |
| 392 | else: |
| 393 | append(repl) |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 394 | pos = lastmatch = j |
| 395 | if i == j: |
| 396 | # Last match was empty; don't try here again |
| 397 | pos = pos + 1 |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 398 | append(source[lastmatch:pos]) |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 399 | n = n + 1 |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 400 | append(source[pos:]) |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 401 | return (string.join(results, ''), n) |
| 402 | |
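    # subn() above treats repl three ways: a plain string, a string with
    # group references, or a callable. A sketch of the three cases using a
    # compiled pattern from the modern re module (an assumption; sample
    # data is made up):

```python
import re

pair = re.compile(r"(\w+)=(\w+)")

# Plain replacement string, no group references.
plain = re.compile(r"\s+").subn("_", "a  b c")

# Group references in repl are expanded for every match.
swapped = pair.subn(r"\2=\1", "a=1 b=2")

# A callable receives the match object and returns the replacement.
upper = pair.subn(lambda m: m.group(2).upper(), "a=x b=y")
```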
Guido van Rossum | 9e18ec7 | 1997-07-17 22:39:13 +0000 | [diff] [blame] | 403 | def split(self, source, maxsplit=0): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 404 | """split(source[, maxsplit=0]) -> list of strings |
| 405 | |
| 406 | Split string by the occurrences of the compiled pattern. If |
| 407 | capturing parentheses are used in the pattern, then the text |
| 408 | of all groups in the pattern is also returned as part of the |
| 409 | resulting list. If maxsplit is nonzero, at most maxsplit |
| 410 | splits occur, and the remainder of the string is returned as |
| 411 | the final element of the list. |
| 412 | |
| 413 | """ |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 414 | if maxsplit < 0: |
| 415 | raise error, "negative split count" |
| 416 | if maxsplit == 0: |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 417 | maxsplit = sys.maxint |
| 418 | n = 0 |
| 419 | pos = 0 |
| 420 | lastmatch = 0 |
| 421 | results = [] |
| 422 | end = len(source) |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 423 | match = self.code.match |
| 424 | append = results.append |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 425 | while n < maxsplit: |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 426 | regs = match(source, pos, end, 0) |
| 427 | if not regs: |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 428 | break |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 429 | i, j = regs[0] |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 430 | if i == j: |
| 431 | # Empty match |
| 432 | if pos >= end: |
| 433 | break |
| 434 | pos = pos+1 |
| 435 | continue |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 436 | append(source[lastmatch:i]) |
| 437 | rest = regs[1:] |
| 438 | if rest: |
| 439 | for a, b in rest: |
| 440 | if a == -1 or b == -1: |
| 441 | group = None |
| 442 | else: |
| 443 | group = source[a:b] |
| 444 | append(group) |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 445 | pos = lastmatch = j |
| 446 | n = n + 1 |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 447 | append(source[lastmatch:]) |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 448 | return results |
Guido van Rossum | 9e18ec7 | 1997-07-17 22:39:13 +0000 | [diff] [blame] | 449 | |
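    # The loop above appends None for a group that did not take part in a
    # match (offsets of -1). The modern re module, assumed here, shows the
    # same convention:

```python
import re

either = re.compile(r"(a)|(b)")
pieces = either.split("xaybz")   # each separator contributes both groups
```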
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 450 | def findall(self, source): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 451 | """findall(source) -> list |
| 452 | |
| 453 | Return a list of all non-overlapping matches of the compiled |
| 454 | pattern in string. If one or more groups are present in the |
| 455 | pattern, return a list of groups; this will be a list of |
| 456 | tuples if the pattern has more than one group. Empty matches |
| 457 | are included in the result. |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 458 | |
| 459 | """ |
| 460 | pos = 0 |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 461 | end = len(source) |
| 462 | results = [] |
| 463 | match = self.code.match |
| 464 | append = results.append |
| 465 | while pos <= end: |
| 466 | regs = match(source, pos, end, 0) |
| 467 | if not regs: |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 468 | break |
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 469 | i, j = regs[0] |
| 470 | rest = regs[1:] |
| 471 | if not rest: |
| 472 | gr = source[i:j] |
| 473 | elif len(rest) == 1: |
| 474 | a, b = rest[0] |
| 475 | gr = source[a:b] |
| 476 | else: |
| 477 | gr = [] |
| 478 | for (a, b) in rest: |
| 479 | gr.append(source[a:b]) |
| 480 | gr = tuple(gr) |
| 481 | append(gr) |
| 482 | pos = max(j, pos+1) |
| 483 | return results |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 484 | |
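    # The scanning loop of findall(), restated on top of the modern
    # Pattern.search (an assumption). findall_sketch is a hypothetical name;
    # the key step is pos = max(end, pos + 1), which guarantees progress
    # after an empty match.

```python
import re

def findall_sketch(compiled, source):
    """Collect non-overlapping matches by repeated searching."""
    pos, end, results = 0, len(source), []
    while pos <= end:
        m = compiled.search(source, pos)
        if m is None:
            break
        if compiled.groups == 0:
            item = m.group(0)
        elif compiled.groups == 1:
            item = m.group(1) or ""     # unmatched group becomes ""
        else:
            item = tuple(g or "" for g in m.groups())
        results.append(item)
        # Continue after the match; step one character past an empty one.
        pos = max(m.end(), pos + 1)
    return results

digits = findall_sketch(re.compile(r"\d+"), "a1b22")
empties = findall_sketch(re.compile(r"x*"), "ab")
pairs = findall_sketch(re.compile(r"(\w)=(\d)"), "a=1 b=2")
```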
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 485 | # The following 3 functions were contributed by Mike Fletcher, and |
| 486 | # allow pickling and unpickling of RegexObject instances. |
| 487 | def __getinitargs__(self): |
| 488 | return (None,None,None,None) # any 4 elements, to work around |
| 489 | # problems with the |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 490 | # pickle/cPickle modules not yet |
| 491 | # ignoring the __init__ function |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 492 | def __getstate__(self): |
| 493 | return self.pattern, self.flags, self.groupindex |
| 494 | def __setstate__(self, statetuple): |
| 495 | self.pattern = statetuple[0] |
| 496 | self.flags = statetuple[1] |
| 497 | self.groupindex = statetuple[2] |
| 498 | self.code = apply(pcre_compile, statetuple) |
| 499 | |
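# The pickling hooks above store only the pattern text, flags and group
# index, then recompile on load. Compiled patterns in the modern re module
# (an assumption) also survive a pickle round trip, which allows a quick
# check of the idea:

```python
import pickle
import re

original = re.compile(r"(?P<num>\d+)", re.MULTILINE)
restored = pickle.loads(pickle.dumps(original))
hit = restored.search("ab 42")
```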
Guido van Rossum | 0e5ab17 | 1998-07-17 20:18:49 +0000 | [diff] [blame] | 500 | class _Dummy: |
| 501 | # Dummy class used by RegexObject.subn(). Has 'group' to avoid core dump. |
| 502 | group = None |
| 503 | |
Guido van Rossum | 5ca1b71 | 1997-07-10 21:00:31 +0000 | [diff] [blame] | 504 | class MatchObject: |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 505 | """Contains information about a pattern match. |
| 506 | |
| 507 | Methods: |
| 508 | start Return the index of the start of a matched substring. |
| 509 | end Return the index of the end of a matched substring. |
| 510 | span Return a tuple of (start, end) of a matched substring. |
| 511 | groups Return a tuple of all the subgroups of the match. |
| 512 | group Return one or more subgroups of the match. |
| 513 | groupdict Return a dictionary of all the named subgroups of the match. |
| 514 | |
| 515 | """ |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 516 | |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 517 | def __init__(self, re, string, pos, endpos, regs): |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 518 | self.re = re |
| 519 | self.string = string |
| 520 | self.pos = pos |
| 521 | self.endpos = endpos |
| 522 | self.regs = regs |
| 523 | |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 524 | def start(self, g = 0): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 525 | """start([group=0]) -> int |
| 526 | |
| 527 | Return the index of the start of the substring matched by |
| 528 | group; group defaults to zero (meaning the whole matched |
| 529 | substring). Return -1 if group exists but did not contribute |
| 530 | to the match. |
| 531 | |
| 532 | """ |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 533 | if type(g) == type(''): |
| 534 | try: |
| 535 | g = self.re.groupindex[g] |
| 536 | except (KeyError, TypeError): |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 537 | raise IndexError, 'group %s is undefined' % `g` |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 538 | return self.regs[g][0] |
Guido van Rossum | a0e4c1b | 1997-07-17 14:52:48 +0000 | [diff] [blame] | 539 | |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 540 | def end(self, g = 0): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 541 | """end([group=0]) -> int |
| 542 | |
| 543 | Return the index of the end of the substring matched by |
| 544 | group; group defaults to zero (meaning the whole matched |
| 545 | substring). Return -1 if group exists but did not contribute |
| 546 | to the match. |
| 547 | |
| 548 | """ |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 549 | if type(g) == type(''): |
| 550 | try: |
| 551 | g = self.re.groupindex[g] |
| 552 | except (KeyError, TypeError): |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 553 | raise IndexError, 'group %s is undefined' % `g` |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 554 | return self.regs[g][1] |
Guido van Rossum | a0e4c1b | 1997-07-17 14:52:48 +0000 | [diff] [blame] | 555 | |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 556 | def span(self, g = 0): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 557 | """span([group=0]) -> tuple |
| 558 | |
| 559 | Return the 2-tuple (m.start(group), m.end(group)). Note that |
| 560 | if group did not contribute to the match, this is (-1, |
| 561 | -1). Group defaults to zero (meaning the whole matched |
| 562 | substring). |
| 563 | |
| 564 | """ |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 565 | if type(g) == type(''): |
| 566 | try: |
| 567 | g = self.re.groupindex[g] |
| 568 | except (KeyError, TypeError): |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 569 | raise IndexError, 'group %s is undefined' % `g` |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 570 | return self.regs[g] |
Guido van Rossum | a0e4c1b | 1997-07-17 14:52:48 +0000 | [diff] [blame] | 571 | |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 572 | def groups(self, default=None): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 573 | """groups([default=None]) -> tuple |
| 574 | |
| 575 | Return a tuple containing all the subgroups of the match, from |
| 576 | 1 up to however many groups are in the pattern. The default |
| 577 | argument is used for groups that did not participate in the |
| 578 | match. |
| 579 | |
| 580 | """ |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 581 | result = [] |
| 582 | for g in range(1, self.re._num_regs): |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 583 | a, b = self.regs[g] |
| 584 | if a == -1 or b == -1: |
| 585 | result.append(default) |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 586 | else: |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 587 | result.append(self.string[a:b]) |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 588 | return tuple(result) |
Guido van Rossum | dfa6790 | 1997-12-08 17:12:06 +0000 | [diff] [blame] | 589 | |
Guido van Rossum | 8a9a4a2 | 1997-07-11 20:48:25 +0000 | [diff] [blame] | 590 | def group(self, *groups): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 591 | """group([group1, group2, ...]) -> string or tuple |
| 592 | |
| 593 | Return one or more subgroups of the match. If there is a |
| 594 | single argument, the result is a single string; if there are |
| 595 | multiple arguments, the result is a tuple with one item per |
| 596 | argument. Without arguments, group1 defaults to zero (i.e. the |
| 597 | whole match is returned). If a groupN argument is zero, the |
| 598 | corresponding return value is the entire matching string; if |
| 599 | it is in the inclusive range [1..99], it is the string |
| 600 | matching the corresponding parenthesized group. If a group |
| 601 | number is negative or larger than the number of groups defined |
| 602 | in the pattern, an IndexError exception is raised. If a group |
| 603 | is contained in a part of the pattern that did not match, the |
| 604 | corresponding result is None. If a group is contained in a |
| 605 | part of the pattern that matched multiple times, the last |
| 606 | match is returned. |
| 607 | |
| 608 | If the regular expression uses the (?P<name>...) syntax, the |
| 609 | groupN arguments may also be strings identifying groups by |
| 610 | their group name. If a string argument is not used as a group |
| 611 | name in the pattern, an IndexError exception is raised. |
| 612 | |
| 613 | """ |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 614 | if len(groups) == 0: |
| 615 | groups = (0,) |
| 616 | result = [] |
| 617 | for g in groups: |
| 618 | if type(g) == type(''): |
| 619 | try: |
| 620 | g = self.re.groupindex[g] |
| 621 | except (KeyError, TypeError): |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 622 | raise IndexError, 'group %s is undefined' % `g` |
| 623 | if g < 0 or g >= len(self.regs): |
| 624 | raise IndexError, 'group %s is undefined' % `g` |
| 625 | a, b = self.regs[g] |
| 626 | if a == -1 or b == -1: |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 627 | result.append(None) |
| 628 | else: |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 629 | result.append(self.string[a:b]) |
Guido van Rossum | 45e2fbc | 1998-03-26 21:13:24 +0000 | [diff] [blame] | 630 | if len(result) > 1: |
| 631 | return tuple(result) |
| 632 | elif len(result) == 1: |
| 633 | return result[0] |
| 634 | else: |
| 635 | return () |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 636 | |
| 637 | def groupdict(self, default=None): |
Guido van Rossum | 6ebb387 | 1999-07-09 21:15:32 +0000 | [diff] [blame] | 638 | """groupdict([default=None]) -> dictionary |
| 639 | |
| 640 | Return a dictionary containing all the named subgroups of the |
| 641 | match, keyed by the subgroup name. The default argument is |
| 642 | used for groups that did not participate in the match. |
| 643 | |
| 644 | """ |
Guido van Rossum | be0b62c | 1998-06-29 20:29:08 +0000 | [diff] [blame] | 645 | dict = {} |
| 646 | for name, index in self.re.groupindex.items(): |
| 647 | a, b = self.regs[index] |
| 648 | if a == -1 or b == -1: |
| 649 | dict[name] = default |
| 650 | else: |
| 651 | dict[name] = self.string[a:b] |
| 652 | return dict |
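
The accessor methods above can be exercised with a short sketch. It assumes a current Python 3 interpreter and the standard `re` module rather than the historical implementation listed here; the pattern and the input string are made up for illustration. In current Python, a group that did not take part in the match reports -1 from start(), end() and span(), while group() returns None for it.

```python
import re

# Hypothetical pattern: a named key, "=", an optional numeric value,
# and an optional trailing ";" captured as unnamed group 3.
pattern = re.compile(r"(?P<key>\w+)=(?P<value>\d+)?(;)?")
m = pattern.match("width=42")

whole = m.group()                    # no arguments: the whole match
pair = m.group("key", "value")       # several arguments: a tuple
all_groups = m.groups("<unset>")     # default fills non-participating groups
named = m.groupdict()                # named groups only, keyed by name

key_span = m.span("key")             # (start, end) of the "key" group
missing_start = m.start(3)           # group 3 exists but did not match
missing_span = m.span(3)
missing_text = m.group(3)

# A group number larger than the number of groups raises IndexError.
try:
    m.group(4)
    undefined_raises = False
except IndexError:
    undefined_raises = True
```

Slicing the original string with a span, as in `m.string[key_span[0]:key_span[1]]`, gives the same text as `m.group("key")`, which is how group() and groups() above build their results from `self.regs`.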