% Last changed by Andrew M. Kuchling, revision e8f44d6, 2005-08-30 01:25:05 +0000
\documentclass{howto}
| 2 | |
| 3 | % TODO: |
| 4 | % Document lookbehind assertions |
| 5 | % Better way of displaying a RE, a string, and what it matches |
| 6 | % Mention optional argument to match.groups() |
| 7 | % Unicode (at least a reference) |
| 8 | |
| 9 | \title{Regular Expression HOWTO} |
| 10 | |
| 11 | \release{0.05} |
| 12 | |
| 13 | \author{A.M. Kuchling} |
| 14 | \authoraddress{\email{amk@amk.ca}} |
| 15 | |
| 16 | \begin{document} |
| 17 | \maketitle |
| 18 | |
| 19 | \begin{abstract} |
| 20 | \noindent |
| 21 | This document is an introductory tutorial to using regular expressions |
| 22 | in Python with the \module{re} module. It provides a gentler |
| 23 | introduction than the corresponding section in the Library Reference. |
| 24 | |
| 25 | This document is available from |
| 26 | \url{http://www.amk.ca/python/howto}. |
| 27 | |
| 28 | \end{abstract} |
| 29 | |
| 30 | \tableofcontents |
| 31 | |
| 32 | \section{Introduction} |
| 33 | |
| 34 | The \module{re} module was added in Python 1.5, and provides |
| 35 | Perl-style regular expression patterns. Earlier versions of Python |
| 36 | came with the \module{regex} module, which provides Emacs-style |
| 37 | patterns. Emacs-style patterns are slightly less readable and |
| 38 | don't provide as many features, so there's not much reason to use |
| 39 | the \module{regex} module when writing new code, though you might |
| 40 | encounter old code that uses it. |
| 41 | |
| 42 | Regular expressions (or REs) are essentially a tiny, highly |
| 43 | specialized programming language embedded inside Python and made |
| 44 | available through the \module{re} module. Using this little language, |
| 45 | you specify the rules for the set of possible strings that you want to |
| 46 | match; this set might contain English sentences, or e-mail addresses, |
| 47 | or TeX commands, or anything you like. You can then ask questions |
| 48 | such as ``Does this string match the pattern?'', or ``Is there a match |
| 49 | for the pattern anywhere in this string?''. You can also use REs to |
| 50 | modify a string or to split it apart in various ways. |
| 51 | |
| 52 | Regular expression patterns are compiled into a series of bytecodes |
| 53 | which are then executed by a matching engine written in C. For |
| 54 | advanced use, it may be necessary to pay careful attention to how the |
| 55 | engine will execute a given RE, and write the RE in a certain way in |
| 56 | order to produce bytecode that runs faster. Optimization isn't |
| 57 | covered in this document, because it requires that you have a good |
| 58 | understanding of the matching engine's internals. |
| 59 | |
| 60 | The regular expression language is relatively small and restricted, so |
| 61 | not all possible string processing tasks can be done using regular |
| 62 | expressions. There are also tasks that \emph{can} be done with |
| 63 | regular expressions, but the expressions turn out to be very |
| 64 | complicated. In these cases, you may be better off writing Python |
| 65 | code to do the processing; while Python code will be slower than an |
| 66 | elaborate regular expression, it will also probably be more understandable. |
| 67 | |
| 68 | \section{Simple Patterns} |
| 69 | |
| 70 | We'll start by learning about the simplest possible regular |
| 71 | expressions. Since regular expressions are used to operate on |
| 72 | strings, we'll begin with the most common task: matching characters. |
| 73 | |
| 74 | For a detailed explanation of the computer science underlying regular |
| 75 | expressions (deterministic and non-deterministic finite automata), you |
| 76 | can refer to almost any textbook on writing compilers. |
| 77 | |
| 78 | \subsection{Matching Characters} |
| 79 | |
| 80 | Most letters and characters will simply match themselves. For |
| 81 | example, the regular expression \regexp{test} will match the string |
| 82 | \samp{test} exactly. (You can enable a case-insensitive mode that |
| 83 | would let this RE match \samp{Test} or \samp{TEST} as well; more |
| 84 | about this later.) |
| 85 | |
There are exceptions to this rule; some characters are special
\dfn{metacharacters}, and don't match themselves. Instead, they signal
that some out-of-the-ordinary thing should be matched, or they affect
other portions of the RE by repeating them. Much of this document is
devoted to discussing various metacharacters and what they do.
| 91 | |
| 92 | Here's a complete list of the metacharacters; their meanings will be |
| 93 | discussed in the rest of this HOWTO. |
| 94 | |
| 95 | \begin{verbatim} |
. ^ $ * + ? { } [ ] \ | ( )
| 97 | \end{verbatim} |
| 98 | % $ |
| 99 | |
| 100 | The first metacharacters we'll look at are \samp{[} and \samp{]}. |
| 101 | They're used for specifying a character class, which is a set of |
| 102 | characters that you wish to match. Characters can be listed |
| 103 | individually, or a range of characters can be indicated by giving two |
| 104 | characters and separating them by a \character{-}. For example, |
| 105 | \regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or |
| 106 | \samp{c}; this is the same as |
| 107 | \regexp{[a-c]}, which uses a range to express the same set of |
| 108 | characters. If you wanted to match only lowercase letters, your |
| 109 | RE would be \regexp{[a-z]}. |
| 110 | |
| 111 | Metacharacters are not active inside classes. For example, |
| 112 | \regexp{[akm\$]} will match any of the characters \character{a}, |
| 113 | \character{k}, \character{m}, or \character{\$}; \character{\$} is |
| 114 | usually a metacharacter, but inside a character class it's stripped of |
| 115 | its special nature. |
| 116 | |
You can match the characters not listed within the class by
\dfn{complementing} the set. This is indicated by including a
\character{\^} as the first character of the class; a \character{\^}
anywhere else in the class will simply match the \character{\^}
character. For example, \verb|[^5]| will match any character except
\character{5}.
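
As a quick illustration of the classes described above, here is a
small sketch. It uses \function{re.findall()}, which is covered later
in this HOWTO; the sample strings and variable names are made up for
the example.

\begin{verbatim}
import re

# [abc] and [a-c] describe the same set of three characters.
listed = re.findall(r'[abc]', 'a quick brown cab')
ranged = re.findall(r'[a-c]', 'a quick brown cab')
# both are ['a', 'c', 'b', 'c', 'a', 'b']

# Inside a class, $ is an ordinary character, not a metacharacter.
dollars = re.findall(r'[akm$]', 'make $5')
# ['m', 'a', 'k', '$']

# A leading ^ complements the class: anything except the digit 5.
not_five = re.findall(r'[^5]', '1525')
# ['1', '2']
\end{verbatim}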
| 122 | |
| 123 | Perhaps the most important metacharacter is the backslash, \samp{\e}. |
| 124 | As in Python string literals, the backslash can be followed by various |
| 125 | characters to signal various special sequences. It's also used to escape |
| 126 | all the metacharacters so you can still match them in patterns; for |
| 127 | example, if you need to match a \samp{[} or |
| 128 | \samp{\e}, you can precede them with a backslash to remove their |
| 129 | special meaning: \regexp{\e[} or \regexp{\e\e}. |
| 130 | |
| 131 | Some of the special sequences beginning with \character{\e} represent |
| 132 | predefined sets of characters that are often useful, such as the set |
| 133 | of digits, the set of letters, or the set of anything that isn't |
| 134 | whitespace. The following predefined special sequences are available: |
| 135 | |
| 136 | \begin{itemize} |
| 137 | \item[\code{\e d}]Matches any decimal digit; this is |
| 138 | equivalent to the class \regexp{[0-9]}. |
| 139 | |
| 140 | \item[\code{\e D}]Matches any non-digit character; this is |
| 141 | equivalent to the class \verb|[^0-9]|. |
| 142 | |
| 143 | \item[\code{\e s}]Matches any whitespace character; this is |
| 144 | equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}. |
| 145 | |
| 146 | \item[\code{\e S}]Matches any non-whitespace character; this is |
| 147 | equivalent to the class \verb|[^ \t\n\r\f\v]|. |
| 148 | |
| 149 | \item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class |
| 150 | \regexp{[a-zA-Z0-9_]}. |
| 151 | |
| 152 | \item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class |
| 153 | \verb|[^a-zA-Z0-9_]|. |
| 154 | \end{itemize} |
| 155 | |
| 156 | These sequences can be included inside a character class. For |
| 157 | example, \regexp{[\e s,.]} is a character class that will match any |
| 158 | whitespace character, or \character{,} or \character{.}. |
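
The predefined sequences can be tried out the same way. This is only
a sketch: the sample text and the variable names are invented, and
\function{re.findall()} is explained later in this document.

\begin{verbatim}
import re

text = 'Room 101, Floor 3.'

# \d matches a single decimal digit.
digits = re.findall(r'\d', text)
# ['1', '0', '1', '3']

# \w+ matches runs of alphanumeric characters.
words = re.findall(r'\w+', text)
# ['Room', '101', 'Floor', '3']

# A class that mixes \s with literal characters.
separators = re.findall(r'[\s,.]', text)
# [' ', ',', ' ', ' ', '.']
\end{verbatim}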
| 159 | |
| 160 | The final metacharacter in this section is \regexp{.}. It matches |
| 161 | anything except a newline character, and there's an alternate mode |
| 162 | (\code{re.DOTALL}) where it will match even a newline. \character{.} |
| 163 | is often used where you want to match ``any character''. |
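
The difference between the two modes is easiest to see on a string
containing a newline. The following is a minimal sketch with an
invented two-line string; compiling with a flag is described in the
section on compilation flags.

\begin{verbatim}
import re

text = 'one\ntwo'

# By default . stops at the newline.
default = re.compile(r'.+').match(text).group()
# 'one'

# With re.DOTALL, . matches the newline as well.
dotall = re.compile(r'.+', re.DOTALL).match(text).group()
# 'one\ntwo'
\end{verbatim}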
| 164 | |
| 165 | \subsection{Repeating Things} |
| 166 | |
| 167 | Being able to match varying sets of characters is the first thing |
| 168 | regular expressions can do that isn't already possible with the |
methods available on strings. However, if that were the only
additional capability of regexes, they wouldn't be much of an advance.
| 171 | Another capability is that you can specify that portions of the RE |
| 172 | must be repeated a certain number of times. |
| 173 | |
| 174 | The first metacharacter for repeating things that we'll look at is |
| 175 | \regexp{*}. \regexp{*} doesn't match the literal character \samp{*}; |
| 176 | instead, it specifies that the previous character can be matched zero |
| 177 | or more times, instead of exactly once. |
| 178 | |
| 179 | For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a} |
| 180 | characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a} |
| 181 | characters), and so forth. The RE engine has various internal |
| 182 | limitations stemming from the size of C's \code{int} type, that will |
| 183 | prevent it from matching over 2 billion \samp{a} characters; you |
| 184 | probably don't have enough memory to construct a string that large, so |
| 185 | you shouldn't run into that limit. |
| 186 | |
| 187 | Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE, |
| 188 | the matching engine will try to repeat it as many times as possible. |
If later portions of the pattern don't match, the matching engine will
then back up and try again with fewer repetitions.
| 191 | |
| 192 | A step-by-step example will make this more obvious. Let's consider |
| 193 | the expression \regexp{a[bcd]*b}. This matches the letter |
| 194 | \character{a}, zero or more letters from the class \code{[bcd]}, and |
| 195 | finally ends with a \character{b}. Now imagine matching this RE |
| 196 | against the string \samp{abcbd}. |
| 197 | |
| 198 | \begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation} |
| 199 | \lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.} |
| 200 | \lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as |
| 201 | it can, which is to the end of the string.} |
| 202 | \lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the |
| 203 | current position is at the end of the string, so it fails.} |
| 204 | \lineiii{4}{\code{abcb}}{Back up, so that \regexp{[bcd]*} matches |
| 205 | one less character.} |
| 206 | \lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the |
| 207 | current position is at the last character, which is a \character{d}.} |
| 208 | \lineiii{6}{\code{abc}}{Back up again, so that \regexp{[bcd]*} is |
| 209 | only matching \samp{bc}.} |
\lineiii{7}{\code{abcb}}{Try \regexp{b} again. This time
the character at the current position is \character{b}, so it succeeds.}
| 212 | \end{tableiii} |
| 213 | |
| 214 | The end of the RE has now been reached, and it has matched |
| 215 | \samp{abcb}. This demonstrates how the matching engine goes as far as |
| 216 | it can at first, and if no match is found it will then progressively |
| 217 | back up and retry the rest of the RE again and again. It will back up |
| 218 | until it has tried zero matches for \regexp{[bcd]*}, and if that |
| 219 | subsequently fails, the engine will conclude that the string doesn't |
| 220 | match the RE at all. |
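
You can confirm the outcome of this walk-through in the interpreter.
The sketch below uses \function{re.match()}, introduced later; the
second string, \samp{acd}, is an extra example chosen here to show the
case where backing up never finds a \character{b}.

\begin{verbatim}
import re

# Greedy matching followed by backing up ends on the last 'b'.
m = re.match(r'a[bcd]*b', 'abcbd')
matched = m.group()
# 'abcb'
span = m.span()
# (0, 4)

# No 'b' follows the 'a', so every attempt fails.
no_match = re.match(r'a[bcd]*b', 'acd')
# None
\end{verbatim}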
| 221 | |
| 222 | Another repeating metacharacter is \regexp{+}, which matches one or |
| 223 | more times. Pay careful attention to the difference between |
| 224 | \regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more |
| 225 | times, so whatever's being repeated may not be present at all, while |
| 226 | \regexp{+} requires at least \emph{one} occurrence. To use a similar |
| 227 | example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}), |
| 228 | \samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}. |
| 229 | |
| 230 | There are two more repeating qualifiers. The question mark character, |
| 231 | \regexp{?}, matches either once or zero times; you can think of it as |
| 232 | marking something as being optional. For example, \regexp{home-?brew} |
| 233 | matches either \samp{homebrew} or \samp{home-brew}. |
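
The three qualifiers seen so far can be compared side by side. This
is a rough sketch: the word lists (including \samp{home--brew}) and
the variable names are made up for illustration.

\begin{verbatim}
import re

words = ['ct', 'cat', 'caaat']

# * allows zero repetitions, so 'ct' is accepted.
star = [w for w in words if re.match(r'ca*t', w)]
# ['ct', 'cat', 'caaat']

# + needs at least one 'a', so 'ct' is rejected.
plus = [w for w in words if re.match(r'ca+t', w)]
# ['cat', 'caaat']

# ? makes the hyphen optional, but allows at most one.
spellings = ['homebrew', 'home-brew', 'home--brew']
optional = [w for w in spellings if re.match(r'home-?brew', w)]
# ['homebrew', 'home-brew']
\end{verbatim}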
| 234 | |
| 235 | The most complicated repeated qualifier is |
| 236 | \regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal |
| 237 | integers. This qualifier means there must be at least \var{m} |
| 238 | repetitions, and at most \var{n}. For example, \regexp{a/\{1,3\}b} |
| 239 | will match \samp{a/b}, \samp{a//b}, and \samp{a///b}. It won't match |
| 240 | \samp{ab}, which has no slashes, or \samp{a////b}, which has four. |
| 241 | |
| 242 | You can omit either \var{m} or \var{n}; in that case, a reasonable |
| 243 | value is assumed for the missing value. Omitting \var{m} is |
| 244 | interpreted as a lower limit of 0, while omitting \var{n} results in an |
| 245 | upper bound of infinity --- actually, the 2 billion limit mentioned |
| 246 | earlier, but that might as well be infinity. |
| 247 | |
| 248 | Readers of a reductionist bent may notice that the three other qualifiers |
| 249 | can all be expressed using this notation. \regexp{\{0,\}} is the same |
| 250 | as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and |
| 251 | \regexp{\{0,1\}} is the same as \regexp{?}. It's better to use |
| 252 | \regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because |
| 253 | they're shorter and easier to read. |
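
Here is a short sketch of the counted form, using the slash example
from above. The candidate list and variable names are illustrative
only.

\begin{verbatim}
import re

candidates = ['ab', 'a/b', 'a//b', 'a///b', 'a////b']

# Between one and three slashes.
bounded = [s for s in candidates if re.match(r'a/{1,3}b', s)]
# ['a/b', 'a//b', 'a///b']

# Upper bound omitted: two or more slashes.
at_least_two = [s for s in candidates if re.match(r'a/{2,}b', s)]
# ['a//b', 'a///b', 'a////b']

# Lower bound omitted: zero to two slashes.
at_most_two = [s for s in candidates if re.match(r'a/{,2}b', s)]
# ['ab', 'a/b', 'a//b']

# {0,} accepts the same strings as *.
same_as_star = ([s for s in candidates if re.match(r'a/{0,}b', s)] ==
                [s for s in candidates if re.match(r'a/*b', s)])
# True
\end{verbatim}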
| 254 | |
| 255 | \section{Using Regular Expressions} |
| 256 | |
| 257 | Now that we've looked at some simple regular expressions, how do we |
| 258 | actually use them in Python? The \module{re} module provides an |
| 259 | interface to the regular expression engine, allowing you to compile |
| 260 | REs into objects and then perform matches with them. |
| 261 | |
| 262 | \subsection{Compiling Regular Expressions} |
| 263 | |
| 264 | Regular expressions are compiled into \class{RegexObject} instances, |
| 265 | which have methods for various operations such as searching for |
| 266 | pattern matches or performing string substitutions. |
| 267 | |
| 268 | \begin{verbatim} |
| 269 | >>> import re |
| 270 | >>> p = re.compile('ab*') |
| 271 | >>> print p |
| 272 | <re.RegexObject instance at 80b4150> |
| 273 | \end{verbatim} |
| 274 | |
| 275 | \function{re.compile()} also accepts an optional \var{flags} |
| 276 | argument, used to enable various special features and syntax |
| 277 | variations. We'll go over the available settings later, but for now a |
| 278 | single example will do: |
| 279 | |
| 280 | \begin{verbatim} |
| 281 | >>> p = re.compile('ab*', re.IGNORECASE) |
| 282 | \end{verbatim} |
| 283 | |
| 284 | The RE is passed to \function{re.compile()} as a string. REs are |
| 285 | handled as strings because regular expressions aren't part of the core |
| 286 | Python language, and no special syntax was created for expressing |
| 287 | them. (There are applications that don't need REs at all, so there's |
| 288 | no need to bloat the language specification by including them.) |
| 289 | Instead, the \module{re} module is simply a C extension module |
| 290 | included with Python, just like the \module{socket} or \module{zlib} |
| 291 | module. |
| 292 | |
| 293 | Putting REs in strings keeps the Python language simpler, but has one |
| 294 | disadvantage which is the topic of the next section. |
| 295 | |
| 296 | \subsection{The Backslash Plague} |
| 297 | |
| 298 | As stated earlier, regular expressions use the backslash |
| 299 | character (\character{\e}) to indicate special forms or to allow |
| 300 | special characters to be used without invoking their special meaning. |
| 301 | This conflicts with Python's usage of the same character for the same |
| 302 | purpose in string literals. |
| 303 | |
Let's say you want to write a RE that matches the string
\samp{{\e}section}, which might be found in a \LaTeX\ file. To figure
out what to write in the program code, start with the desired string
to be matched. Next, you must escape any backslashes and other
metacharacters by preceding them with a backslash, resulting in the
string \samp{\e\e section}. This is the string that must be passed
to \function{re.compile()}: \verb|\\section|. However, to
express this as a Python string literal, both backslashes must be
escaped \emph{again}.
| 313 | |
| 314 | \begin{tableii}{c|l}{code}{Characters}{Stage} |
| 315 | \lineii{\e section}{Text string to be matched} |
\lineii{\e\e section}{Escaped backslash for \function{re.compile()}}
| 317 | \lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal} |
| 318 | \end{tableii} |
| 319 | |
| 320 | In short, to match a literal backslash, one has to write |
| 321 | \code{'\e\e\e\e'} as the RE string, because the regular expression |
| 322 | must be \samp{\e\e}, and each backslash must be expressed as |
| 323 | \samp{\e\e} inside a regular Python string literal. In REs that |
| 324 | feature backslashes repeatedly, this leads to lots of repeated |
| 325 | backslashes and makes the resulting strings difficult to understand. |
| 326 | |
| 327 | The solution is to use Python's raw string notation for regular |
| 328 | expressions; backslashes are not handled in any special way in |
| 329 | a string literal prefixed with \character{r}, so \code{r"\e n"} is a |
| 330 | two-character string containing \character{\e} and \character{n}, |
| 331 | while \code{"\e n"} is a one-character string containing a newline. |
| 332 | Frequently regular expressions will be expressed in Python |
| 333 | code using this raw string notation. |
| 334 | |
| 335 | \begin{tableii}{c|c}{code}{Regular String}{Raw string} |
| 336 | \lineii{"ab*"}{\code{r"ab*"}} |
| 337 | \lineii{"\e\e\e\e section"}{\code{r"\e\e section"}} |
| 338 | \lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}} |
| 339 | \end{tableii} |
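
To check that the two notations really describe the same pattern, you
can compare them directly. This is a small sketch; the variable names
and the sample \LaTeX\ line are assumptions made for the example.

\begin{verbatim}
import re

regular = "\\\\section"   # four backslashes in an ordinary literal
raw = r"\\section"        # two backslashes in a raw literal

# Both literals produce the same pattern string.
same_pattern = (regular == raw)
# True

# The pattern matches one literal backslash followed by 'section'.
m = re.search(raw, r"\section{Introduction}")
found = m.group()

# Raw and ordinary literals differ for escapes such as \n.
raw_newline = r"\n"    # two characters: a backslash and the letter n
real_newline = "\n"    # one character: a newline
\end{verbatim}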
| 340 | |
| 341 | \subsection{Performing Matches} |
| 342 | |
| 343 | Once you have an object representing a compiled regular expression, |
| 344 | what do you do with it? \class{RegexObject} instances have several |
| 345 | methods and attributes. Only the most significant ones will be |
| 346 | covered here; consult \ulink{the Library |
| 347 | Reference}{http://www.python.org/doc/lib/module-re.html} for a |
| 348 | complete listing. |
| 349 | |
| 350 | \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} |
| 351 | \lineii{match()}{Determine if the RE matches at the beginning of |
| 352 | the string.} |
| 353 | \lineii{search()}{Scan through a string, looking for any location |
| 354 | where this RE matches.} |
\lineii{findall()}{Find all substrings where the RE matches,
and return them as a list.}
\lineii{finditer()}{Find all substrings where the RE matches,
and return them as an iterator.}
| 359 | \end{tableii} |
| 360 | |
\method{match()} and \method{search()} return \code{None} if no match
can be found. If they're successful, a \class{MatchObject} instance is
returned, containing information about the match: where it starts and
ends, the substring it matched, and more.
| 365 | |
| 366 | You can learn about this by interactively experimenting with the |
| 367 | \module{re} module. If you have Tkinter available, you may also want |
| 368 | to look at \file{Tools/scripts/redemo.py}, a demonstration program |
| 369 | included with the Python distribution. It allows you to enter REs and |
| 370 | strings, and displays whether the RE matches or fails. |
| 371 | \file{redemo.py} can be quite useful when trying to debug a |
| 372 | complicated RE. Phil Schwartz's |
| 373 | \ulink{Kodos}{http://kodos.sourceforge.net} is also an interactive |
| 374 | tool for developing and testing RE patterns. This HOWTO will use the |
| 375 | standard Python interpreter for its examples. |
| 376 | |
| 377 | First, run the Python interpreter, import the \module{re} module, and |
| 378 | compile a RE: |
| 379 | |
| 380 | \begin{verbatim} |
| 381 | Python 2.2.2 (#1, Feb 10 2003, 12:57:01) |
| 382 | >>> import re |
| 383 | >>> p = re.compile('[a-z]+') |
| 384 | >>> p |
| 385 | <_sre.SRE_Pattern object at 80c3c28> |
| 386 | \end{verbatim} |
| 387 | |
| 388 | Now, you can try matching various strings against the RE |
\regexp{[a-z]+}. An empty string shouldn't match at all, since
\regexp{+} means ``one or more repetitions''. \method{match()} should
| 391 | return \code{None} in this case, which will cause the interpreter to |
| 392 | print no output. You can explicitly print the result of |
| 393 | \method{match()} to make this clear. |
| 394 | |
| 395 | \begin{verbatim} |
| 396 | >>> p.match("") |
| 397 | >>> print p.match("") |
| 398 | None |
| 399 | \end{verbatim} |
| 400 | |
| 401 | Now, let's try it on a string that it should match, such as |
| 402 | \samp{tempo}. In this case, \method{match()} will return a |
| 403 | \class{MatchObject}, so you should store the result in a variable for |
| 404 | later use. |
| 405 | |
| 406 | \begin{verbatim} |
>>> m = p.match('tempo')
| 408 | >>> print m |
| 409 | <_sre.SRE_Match object at 80c4f68> |
| 410 | \end{verbatim} |
| 411 | |
| 412 | Now you can query the \class{MatchObject} for information about the |
| 413 | matching string. \class{MatchObject} instances also have several |
| 414 | methods and attributes; the most important ones are: |
| 415 | |
| 416 | \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} |
| 417 | \lineii{group()}{Return the string matched by the RE} |
| 418 | \lineii{start()}{Return the starting position of the match} |
| 419 | \lineii{end()}{Return the ending position of the match} |
| 420 | \lineii{span()}{Return a tuple containing the (start, end) positions |
| 421 | of the match} |
| 422 | \end{tableii} |
| 423 | |
| 424 | Trying these methods will soon clarify their meaning: |
| 425 | |
| 426 | \begin{verbatim} |
| 427 | >>> m.group() |
| 428 | 'tempo' |
| 429 | >>> m.start(), m.end() |
| 430 | (0, 5) |
| 431 | >>> m.span() |
| 432 | (0, 5) |
| 433 | \end{verbatim} |
| 434 | |
| 435 | \method{group()} returns the substring that was matched by the |
| 436 | RE. \method{start()} and \method{end()} return the starting and |
| 437 | ending index of the match. \method{span()} returns both start and end |
indexes in a single tuple. Since the \method{match()} method only
checks if the RE matches at the start of a string,
\method{start()} will always be zero. However, the \method{search()}
| 441 | method of \class{RegexObject} instances scans through the string, so |
| 442 | the match may not start at zero in that case. |
| 443 | |
| 444 | \begin{verbatim} |
| 445 | >>> print p.match('::: message') |
| 446 | None |
| 447 | >>> m = p.search('::: message') ; print m |
| 448 | <re.MatchObject instance at 80c9650> |
| 449 | >>> m.group() |
| 450 | 'message' |
| 451 | >>> m.span() |
| 452 | (4, 11) |
| 453 | \end{verbatim} |
| 454 | |
| 455 | In actual programs, the most common style is to store the |
| 456 | \class{MatchObject} in a variable, and then check if it was |
| 457 | \code{None}. This usually looks like: |
| 458 | |
| 459 | \begin{verbatim} |
p = re.compile( ... )
m = p.match( 'string goes here' )
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'
| 466 | \end{verbatim} |
| 467 | |
| 468 | Two \class{RegexObject} methods return all of the matches for a pattern. |
| 469 | \method{findall()} returns a list of matching strings: |
| 470 | |
| 471 | \begin{verbatim} |
>>> p = re.compile(r'\d+')
| 473 | >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping') |
| 474 | ['12', '11', '10'] |
| 475 | \end{verbatim} |
| 476 | |
\method{findall()} has to create the entire list before it can be
returned as the result. As of Python 2.2, the \method{finditer()} method
is also available, returning a sequence of \class{MatchObject} instances
as an iterator.
| 481 | |
| 482 | \begin{verbatim} |
| 483 | >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...') |
| 484 | >>> iterator |
| 485 | <callable-iterator object at 0x401833ac> |
>>> for match in iterator:
...     print match.span()
...
| 489 | (0, 2) |
| 490 | (22, 24) |
| 491 | (29, 31) |
| 492 | \end{verbatim} |
| 493 | |
| 494 | |
| 495 | \subsection{Module-Level Functions} |
| 496 | |
| 497 | You don't have to produce a \class{RegexObject} and call its methods; |
| 498 | the \module{re} module also provides top-level functions called |
| 499 | \function{match()}, \function{search()}, \function{sub()}, and so |
| 500 | forth. These functions take the same arguments as the corresponding |
| 501 | \class{RegexObject} method, with the RE string added as the first |
| 502 | argument, and still return either \code{None} or a \class{MatchObject} |
| 503 | instance. |
| 504 | |
| 505 | \begin{verbatim} |
| 506 | >>> print re.match(r'From\s+', 'Fromage amk') |
| 507 | None |
| 508 | >>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998') |
| 509 | <re.MatchObject instance at 80c5978> |
| 510 | \end{verbatim} |
| 511 | |
| 512 | Under the hood, these functions simply produce a \class{RegexObject} |
| 513 | for you and call the appropriate method on it. They also store the |
| 514 | compiled object in a cache, so future calls using the same |
| 515 | RE are faster. |
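
The two styles can be placed next to each other to see that they give
the same result. This is a minimal sketch that reuses the
\samp{From} line from the example above; the variable names are
invented.

\begin{verbatim}
import re

line = 'From amk Thu May 14 19:12:10 1998'

# Module-level function: the RE string is the first argument.
m1 = re.match(r'From\s+', line)

# Explicit form: compile once, then call the method.
p = re.compile(r'From\s+')
m2 = p.match(line)

# Both calls report the same match.
same = (m1.group() == m2.group() and m1.span() == m2.span())
# True
matched_span = m1.span()
# (0, 5)
\end{verbatim}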
| 516 | |
| 517 | Should you use these module-level functions, or should you get the |
| 518 | \class{RegexObject} and call its methods yourself? That choice |
| 519 | depends on how frequently the RE will be used, and on your personal |
| 520 | coding style. If a RE is being used at only one point in the code, |
| 521 | then the module functions are probably more convenient. If a program |
| 522 | contains a lot of regular expressions, or re-uses the same ones in |
| 523 | several locations, then it might be worthwhile to collect all the |
| 524 | definitions in one place, in a section of code that compiles all the |
| 525 | REs ahead of time. To take an example from the standard library, |
| 526 | here's an extract from \file{xmllib.py}: |
| 527 | |
| 528 | \begin{verbatim} |
| 529 | ref = re.compile( ... ) |
| 530 | entityref = re.compile( ... ) |
| 531 | charref = re.compile( ... ) |
| 532 | starttagopen = re.compile( ... ) |
| 533 | \end{verbatim} |
| 534 | |
| 535 | I generally prefer to work with the compiled object, even for |
| 536 | one-time uses, but few people will be as much of a purist about this |
| 537 | as I am. |
| 538 | |
| 539 | \subsection{Compilation Flags} |
| 540 | |
| 541 | Compilation flags let you modify some aspects of how regular |
| 542 | expressions work. Flags are available in the \module{re} module under |
| 543 | two names, a long name such as \constant{IGNORECASE}, and a short, |
| 544 | one-letter form such as \constant{I}. (If you're familiar with Perl's |
| 545 | pattern modifiers, the one-letter forms use the same letters; the |
| 546 | short form of \constant{re.VERBOSE} is \constant{re.X}, for example.) |
| 547 | Multiple flags can be specified by bitwise OR-ing them; \code{re.I | |
| 548 | re.M} sets both the \constant{I} and \constant{M} flags, for example. |
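
As a sketch of how combined flags behave, the following compiles one
pattern with both \constant{I} and \constant{M} set. The sample text
and variable names are made up, and \regexp{\^} is explained in a
later section.

\begin{verbatim}
import re

text = 'first line\nSecond line\nSECOND thoughts'

# Case-insensitive, and ^ may match at the start of every line.
p = re.compile(r'^second', re.I | re.M)
hits = p.findall(text)
# ['Second', 'SECOND']

# With no flags, the same pattern finds nothing in this text.
plain_hits = re.compile(r'^second').findall(text)
# []

# The short and long names refer to the same flags.
same_flags = ((re.I | re.M) == (re.IGNORECASE | re.MULTILINE))
# True
\end{verbatim}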
| 549 | |
| 550 | Here's a table of the available flags, followed by |
| 551 | a more detailed explanation of each one. |
| 552 | |
| 553 | \begin{tableii}{c|l}{}{Flag}{Meaning} |
| 554 | \lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any |
| 555 | character, including newlines} |
| 556 | \lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches} |
| 557 | \lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match} |
| 558 | \lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching, |
| 559 | affecting \regexp{\^} and \regexp{\$}} |
| 560 | \lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs, |
| 561 | which can be organized more cleanly and understandably.} |
| 562 | \end{tableii} |
| 563 | |
| 564 | \begin{datadesc}{I} |
| 565 | \dataline{IGNORECASE} |
Perform case-insensitive matching; character classes and literal strings
will match letters by ignoring case. For example, \regexp{[A-Z]} will match
| 569 | lowercase letters, too, and \regexp{Spam} will match \samp{Spam}, |
| 570 | \samp{spam}, or \samp{spAM}. |
| 571 | This lowercasing doesn't take the current locale into account; it will |
| 572 | if you also set the \constant{LOCALE} flag. |
| 573 | \end{datadesc} |
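
A brief sketch of \constant{IGNORECASE} in action, using the
\samp{Spam} example; the list of test strings (including \samp{ham})
and the variable names are assumptions for illustration.

\begin{verbatim}
import re

p = re.compile(r'Spam', re.IGNORECASE)
hits = [s for s in ['Spam', 'spam', 'spAM', 'ham'] if p.match(s)]
# ['Spam', 'spam', 'spAM']

# An uppercase-only class also matches lowercase letters.
lowered = re.compile(r'[A-Z]+', re.I).match('lower').group()
# 'lower'
\end{verbatim}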
| 574 | |
| 575 | \begin{datadesc}{L} |
| 576 | \dataline{LOCALE} |
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b},
and \regexp{\e B} dependent on the current locale.
| 579 | |
| 580 | Locales are a feature of the C library intended to help in writing |
| 581 | programs that take account of language differences. For example, if |
| 582 | you're processing French text, you'd want to be able to write |
\regexp{\e w+} to match words, but \regexp{\e w} only matches the
character class \regexp{[A-Za-z0-9_]}; it won't match \character{\'e} or
\character{\c c}. If your system is configured properly and a French
| 586 | locale is selected, certain C functions will tell the program that |
| 587 | \character{\'e} should also be considered a letter. Setting the |
| 588 | \constant{LOCALE} flag when compiling a regular expression will cause the |
| 589 | resulting compiled object to use these C functions for \regexp{\e w}; |
| 590 | this is slower, but also enables \regexp{\e w+} to match French words as |
| 591 | you'd expect. |
| 592 | \end{datadesc} |
| 593 | |
| 594 | \begin{datadesc}{M} |
| 595 | \dataline{MULTILINE} |
| 596 | (\regexp{\^} and \regexp{\$} haven't been explained yet; |
| 597 | they'll be introduced in section~\ref{more-metacharacters}.) |
| 598 | |
| 599 | Usually \regexp{\^} matches only at the beginning of the string, and |
| 600 | \regexp{\$} matches only at the end of the string and immediately before the |
| 601 | newline (if any) at the end of the string. When this flag is |
| 602 | specified, \regexp{\^} matches at the beginning of the string and at |
| 603 | the beginning of each line within the string, immediately following |
each newline. Similarly, the \regexp{\$} metacharacter matches both at
the end of the string and at the end of each line (immediately
preceding each newline).
| 607 | |
| 608 | \end{datadesc} |
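
The effect of \constant{MULTILINE} can be seen by running the same
patterns with and without the flag. This is only a sketch; the
three-line sample text and the variable names are invented.

\begin{verbatim}
import re

text = 'alpha one\nbeta two\ngamma three'

# Without the flag, ^ matches only at the start of the string.
first_default = re.compile(r'^\w+').findall(text)
# ['alpha']

# With the flag, ^ also matches after each newline.
first_multi = re.compile(r'^\w+', re.MULTILINE).findall(text)
# ['alpha', 'beta', 'gamma']

# The same applies to $ at the end of the string and of each line.
last_default = re.compile(r'\w+$').findall(text)
# ['three']
last_multi = re.compile(r'\w+$', re.MULTILINE).findall(text)
# ['one', 'two', 'three']
\end{verbatim}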
| 609 | |
| 610 | \begin{datadesc}{S} |
| 611 | \dataline{DOTALL} |
| 612 | Makes the \character{.} special character match any character at all, |
| 613 | including a newline; without this flag, \character{.} will match |
| 614 | anything \emph{except} a newline. |
| 615 | \end{datadesc} |
| 616 | |
| 617 | \begin{datadesc}{X} |
| 618 | \dataline{VERBOSE} This flag allows you to write regular expressions |
| 619 | that are more readable by granting you more flexibility in how you can |
| 620 | format them. When this flag has been specified, whitespace within the |
| 621 | RE string is ignored, except when the whitespace is in a character |
| 622 | class or preceded by an unescaped backslash; this lets you organize |
| 623 | and indent the RE more clearly. It also enables you to put comments |
| 624 | within a RE that will be ignored by the engine; comments are marked by |
| 625 | a \character{\#} that's neither in a character class nor preceded by an |
| 626 | unescaped backslash. |
| 627 | |
| 628 | For example, here's a RE that uses \constant{re.VERBOSE}; see how |
| 629 | much easier it is to read? |
| 630 | |
| 631 | \begin{verbatim} |
| 632 | charref = re.compile(r""" |
| 633 |  &[#]                          # Start of a numeric entity reference |
| 634 |  ( |
| 635 |    [0-9]+[^0-9]                # Decimal form |
| 636 |    | 0[0-7]+[^0-7]             # Octal form |
| 637 |    | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form |
| 638 |  ) |
| 639 | """, re.VERBOSE) |
| 640 | \end{verbatim} |
| 641 | |
| 642 | Without the verbose setting, the RE would look like this: |
| 643 | \begin{verbatim} |
| 644 | charref = re.compile("&#([0-9]+[^0-9]" |
| 645 | "|0[0-7]+[^0-7]" |
| 646 | "|x[0-9a-fA-F]+[^0-9a-fA-F])") |
| 647 | \end{verbatim} |
| 648 | |
| 649 | In the above example, Python's automatic concatenation of string |
| 650 | literals has been used to break up the RE into smaller pieces, but |
| 651 | it's still more difficult to understand than the version using |
| 652 | \constant{re.VERBOSE}. |
| 653 | |
| 654 | \end{datadesc} |
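
The exceptions to the whitespace and comment rules in \constant{VERBOSE} mode can be seen in a small sketch. The pattern and the sample string below are hypothetical, invented only to exercise an escaped space, a space inside a character class, and an escaped \character{\#} (Python 3 syntax):

```python
import re

# Hypothetical pattern: a quantity, one space, a lowercase description, then '#'.
item = re.compile(r"""
    (\d+)        # quantity: one or more digits
    \            # backslash-space: a literal space is required here
    ([ a-z]+)    # description: the space inside the class is significant
    \#           # escaped '#': a literal hash, not the start of a comment
""", re.VERBOSE)

m = item.match('42 green apples#')
print(m.groups())   # ('42', 'green apples')
```

All other whitespace in the pattern, including the indentation and the padding before each comment, is ignored by the engine.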
| 655 | |
| 656 | \section{More Pattern Power} |
| 657 | |
| 658 | So far we've covered only some of the features of regular |
| 659 | expressions. In this section, we'll cover some new metacharacters, |
| 660 | and how to use groups to retrieve portions of the text that was matched. |
| 661 | |
| 662 | \subsection{More Metacharacters\label{more-metacharacters}} |
| 663 | |
| 664 | There are some metacharacters that we haven't covered yet. Most of |
| 665 | them will be covered in this section. |
| 666 | |
| 667 | Some of the remaining metacharacters to be discussed are |
| 668 | \dfn{zero-width assertions}. They don't cause the engine to advance |
| 669 | through the string; instead, they consume no characters at all, |
| 670 | and simply succeed or fail. For example, \regexp{\e b} is an |
| 671 | assertion that the current position is located at a word boundary; the |
| 672 | position isn't changed by the \regexp{\e b} at all. This means that |
| 673 | zero-width assertions should never be repeated, because if they match |
| 674 | once at a given location, they can obviously be matched an infinite |
| 675 | number of times. |
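
A minimal sketch of what ``zero-width'' means in practice, using \regexp{\e b} (Python 3 syntax; the sample string is made up): a match consisting only of the assertion starts and ends at the same index.

```python
import re

# \b on its own matches at the first word boundary and consumes nothing,
# so the match starts and ends at the same index.
boundary = re.search(r'\b', 'hello world')
print(boundary.span())          # (0, 0)
print(repr(boundary.group()))   # ''

# Combined with ordinary characters, only those characters are consumed.
word = re.search(r'\bworld\b', 'hello world')
print(word.span())              # (6, 11)
```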
| 676 | |
| 677 | \begin{list}{}{} |
| 678 | |
| 679 | \item[\regexp{|}] |
| 680 | Alternation, or the ``or'' operator. |
| 681 | If A and B are regular expressions, |
| 682 | \regexp{A|B} will match any string that matches either \samp{A} or \samp{B}. |
| 683 | \regexp{|} has very low precedence in order to make it work reasonably when |
| 684 | you're alternating multi-character strings. |
| 685 | \regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not |
| 686 | \samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}. |
| 687 | |
| 688 | To match a literal \character{|}, |
| 689 | use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}. |
| 690 | |
| 691 | \item[\regexp{\^}] Matches at the beginning of lines. Unless the |
| 692 | \constant{MULTILINE} flag has been set, this will only match at the |
| 693 | beginning of the string. In \constant{MULTILINE} mode, this also |
| 694 | matches immediately after each newline within the string. |
| 695 | |
| 696 | For example, if you wish to match the word \samp{From} only at the |
| 697 | beginning of a line, the RE to use is \verb|^From|. |
| 698 | |
| 699 | \begin{verbatim} |
| 700 | >>> print re.search('^From', 'From Here to Eternity') |
| 701 | <re.MatchObject instance at 80c1520> |
| 702 | >>> print re.search('^From', 'Reciting From Memory') |
| 703 | None |
| 704 | \end{verbatim} |
| 705 | |
| 706 | %To match a literal \character{\^}, use \regexp{\e\^} or enclose it |
| 707 | %inside a character class, as in \regexp{[{\e}\^]}. |
| 708 | |
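| 709 | \item[\regexp{\$}] Matches at the end of a line. Unless the \constant{MULTILINE} |
| 710 | flag has been set, this matches only at the end of the string or just before a |
| 711 | newline that ends the string; in \constant{MULTILINE} mode, it also matches before each newline. |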
| 712 | |
| 713 | \begin{verbatim} |
| 714 | >>> print re.search('}$', '{block}') |
| 715 | <re.MatchObject instance at 80adfa8> |
| 716 | >>> print re.search('}$', '{block} ') |
| 717 | None |
| 718 | >>> print re.search('}$', '{block}\n') |
| 719 | <re.MatchObject instance at 80adfa8> |
| 720 | \end{verbatim} |
| 721 | % $ |
| 722 | |
| 723 | To match a literal \character{\$}, use \regexp{\e\$} or enclose it |
| 724 | inside a character class, as in \regexp{[\$]}. |
| 725 | |
| 726 | \item[\regexp{\e A}] Matches only at the start of the string. When |
| 727 | not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are |
| 728 | effectively the same. In \constant{MULTILINE} mode, however, they're |
| 729 | different; \regexp{\e A} still matches only at the beginning of the |
| 730 | string, but \regexp{\^} may match at any location inside the string |
| 731 | that follows a newline character. |
| 732 | |
| 733 | \item[\regexp{\e Z}] Matches only at the end of the string. |
| 734 | |
| 735 | \item[\regexp{\e b}] Word boundary. |
| 736 | This is a zero-width assertion that matches only at the |
| 737 | beginning or end of a word. A word is defined as a sequence of |
| 738 | alphanumeric characters, so the end of a word is indicated by |
| 739 | whitespace or a non-alphanumeric character. |
| 740 | |
| 741 | The following example matches \samp{class} only when it's a complete |
| 742 | word; it won't match when it's contained inside another word. |
| 743 | |
| 744 | \begin{verbatim} |
| 745 | >>> p = re.compile(r'\bclass\b') |
| 746 | >>> print p.search('no class at all') |
| 747 | <re.MatchObject instance at 80c8f28> |
| 748 | >>> print p.search('the declassified algorithm') |
| 749 | None |
| 750 | >>> print p.search('one subclass is') |
| 751 | None |
| 752 | \end{verbatim} |
| 753 | |
| 754 | There are two subtleties you should remember when using this special |
| 755 | sequence. First, this is the worst collision between Python's string |
| 756 | literals and regular expression sequences. In Python's string |
| 757 | literals, \samp{\e b} is the backspace character, ASCII value 8. If |
| 758 | you're not using raw strings, then Python will convert the \samp{\e b} to |
| 759 | a backspace, and your RE won't match as you expect it to. The |
| 760 | following example looks the same as our previous RE, but omits |
| 761 | the \character{r} in front of the RE string. |
| 762 | |
| 763 | \begin{verbatim} |
| 764 | >>> p = re.compile('\bclass\b') |
| 765 | >>> print p.search('no class at all') |
| 766 | None |
| 767 | >>> print p.search('\b' + 'class' + '\b') |
| 768 | <re.MatchObject instance at 80c3ee0> |
| 769 | \end{verbatim} |
| 770 | |
| 771 | Second, inside a character class, where there's no use for this |
| 772 | assertion, \regexp{\e b} represents the backspace character, for |
| 773 | compatibility with Python's string literals. |
| 774 | |
| 775 | \item[\regexp{\e B}] Another zero-width assertion, this is the |
| 776 | opposite of \regexp{\e b}, only matching when the current |
| 777 | position is not at a word boundary. |
| 778 | |
| 779 | \end{list} |
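
Several of the metacharacters in the list above can be compared side by side. The following is only a sketch in Python 3 syntax; the sample strings and variable names are assumptions chosen to show where \regexp{\^} and \regexp{\e A}, \regexp{\$} and \regexp{\e Z}, and \regexp{\e b} and \regexp{\e B} differ:

```python
import re

# | has low precedence: the alternatives are the whole words.
words = re.findall(r'Crow|Servo', 'Crow and Servo')
print(words)                     # ['Crow', 'Servo']

text = "From here\nFrom there"

# In MULTILINE mode ^ matches at each line start; \A only at the string start.
caret_hits = re.findall(r'^From', text, re.MULTILINE)
start_hits = re.findall(r'\AFrom', text, re.MULTILINE)
print(len(caret_hits), len(start_hits))   # 2 1

# $ also matches just before a trailing newline; \Z only at the very end.
dollar_ok = re.search(r'}$', '{block}\n') is not None
end_ok = re.search(r'}\Z', '{block}\n') is not None
print(dollar_ok, end_ok)         # True False

# \B is the opposite of \b: 'class' must sit inside a longer word.
inside = re.search(r'\Bclass\B', 'the declassified algorithm') is not None
alone = re.search(r'\Bclass\B', 'no class at all') is not None
print(inside, alone)             # True False
```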
| 780 | |
| 781 | \subsection{Grouping} |
| 782 | |
| 783 | Frequently you need to obtain more information than just whether the |
| 784 | RE matched or not. Regular expressions are often used to dissect |
| 785 | strings by writing a RE divided into several subgroups which |
| 786 | match different components of interest. For example, an RFC-822 |
| 787 | header line is divided into a header name and a value, separated by a |
| 788 | \character{:}. This can be handled by writing a regular expression |
| 789 | which matches an entire header line, and has one group which matches the |
| 790 | header name, and another group which matches the header's value. |
| 791 | |
| 792 | Groups are marked by the \character{(}, \character{)} metacharacters. |
| 793 | \character{(} and \character{)} have much the same meaning as they do |
| 794 | in mathematical expressions; they group together the expressions |
| 795 | contained inside them. This means you can repeat the contents of a |
| 796 | group with a repetition quantifier, such as \regexp{*}, \regexp{+}, |
| 797 | \regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example, |
| 798 | \regexp{(ab)*} will match zero or more repetitions of \samp{ab}. |
| 799 | |
| 800 | \begin{verbatim} |
| 801 | >>> p = re.compile('(ab)*') |
| 802 | >>> print p.match('ababababab').span() |
| 803 | (0, 10) |
| 804 | \end{verbatim} |
| 805 | |
| 806 | Groups indicated with \character{(}, \character{)} also capture the |
| 807 | starting and ending index of the text that they match; this can be |
| 808 | retrieved by passing an argument to \method{group()}, |
| 809 | \method{start()}, \method{end()}, and \method{span()}. Groups are |
| 810 | numbered starting with 0. Group 0 is always present; it's the whole |
| 811 | RE, so \class{MatchObject} methods all have group 0 as their default |
| 812 | argument. Later we'll see how to express groups that don't capture |
| 813 | the span of text that they match. |
| 814 | |
| 815 | \begin{verbatim} |
| 816 | >>> p = re.compile('(a)b') |
| 817 | >>> m = p.match('ab') |
| 818 | >>> m.group() |
| 819 | 'ab' |
| 820 | >>> m.group(0) |
| 821 | 'ab' |
| 822 | \end{verbatim} |
| 823 | |
| 824 | Subgroups are numbered from left to right, from 1 upward. Groups can |
| 825 | be nested; to determine the number, just count the opening parenthesis |
| 826 | characters, going from left to right. |
| 827 | |
| 828 | \begin{verbatim} |
| 829 | >>> p = re.compile('(a(b)c)d') |
| 830 | >>> m = p.match('abcd') |
| 831 | >>> m.group(0) |
| 832 | 'abcd' |
| 833 | >>> m.group(1) |
| 834 | 'abc' |
| 835 | >>> m.group(2) |
| 836 | 'b' |
| 837 | \end{verbatim} |
| 838 | |
| 839 | \method{group()} can be passed multiple group numbers at a time, in |
| 840 | which case it will return a tuple containing the corresponding values |
| 841 | for those groups. |
| 842 | |
| 843 | \begin{verbatim} |
| 844 | >>> m.group(2,1,2) |
| 845 | ('b', 'abc', 'b') |
| 846 | \end{verbatim} |
| 847 | |
| 848 | The \method{groups()} method returns a tuple containing the strings |
| 849 | for all the subgroups, from 1 up to however many there are. |
| 850 | |
| 851 | \begin{verbatim} |
| 852 | >>> m.groups() |
| 853 | ('abc', 'b') |
| 854 | \end{verbatim} |
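
Returning to the header-line case mentioned at the start of this section, one possible pattern is sketched below. The RE itself is a simplified assumption for illustration (it does not try to handle every valid RFC-822 header), and the code uses Python 3 syntax. It also shows \method{start()}, \method{end()}, and \method{span()} taking a group number.

```python
import re

# Hypothetical pattern: group 1 is the header name, group 2 is its value.
header = re.compile(r'([^:]+):\s*(.*)')

m = header.match('Subject: Regular expressions')
print(m.group(1))             # Subject
print(m.group(2))             # Regular expressions
print(m.groups())             # ('Subject', 'Regular expressions')

# start(), end() and span() accept a group number as well.
print(m.span(1))              # (0, 7)
print(m.start(2), m.end(2))   # 9 28
```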
| 855 | |
| 856 | Backreferences in a pattern allow you to specify that the contents of |
| 857 | an earlier capturing group must also be found at the current location |
| 858 | in the string. For example, \regexp{\e 1} succeeds if the exact |
| 859 | contents of group 1 can be found at the current position, and fails |
| 860 | otherwise. Remember that Python's string literals also use a |
| 861 | backslash followed by numbers to allow including arbitrary characters |
| 862 | in a string, so be sure to use a raw string when incorporating |
| 863 | backreferences in a RE. |
| 864 | |
| 865 | For example, the following RE detects doubled words in a string. |
| 866 | |
| 867 | \begin{verbatim} |
| 868 | >>> p = re.compile(r'(\b\w+)\s+\1') |
| 869 | >>> p.search('Paris in the the spring').group() |
| 870 | 'the the' |
| 871 | \end{verbatim} |
| 872 | |
| 873 | Backreferences like this aren't often useful for just searching |
| 874 | through a string --- there are few text formats which repeat data in |
| 875 | this way --- but you'll soon find out that they're \emph{very} useful |
| 876 | when performing string substitutions. |
| 877 | |
| 878 | \subsection{Non-capturing and Named Groups} |
| 879 | |
| 880 | Elaborate REs may use many groups, both to capture substrings of |
| 881 | interest, and to group and structure the RE itself. In complex REs, |
| 882 | it becomes difficult to keep track of the group numbers. There are |
| 883 | two features which help with this problem. Both of them use a common |
| 884 | syntax for regular expression extensions, so we'll look at that first. |
| 885 | |
| 886 | Perl 5 added several additional features to standard regular |
| 887 | expressions, and the Python \module{re} module supports most of them. |
| 888 | It would have been difficult to choose new single-keystroke |
| 889 | metacharacters or new special sequences beginning with \samp{\e} to |
| 890 | represent the new features without making Perl's regular expressions |
| 891 | confusingly different from standard REs. If you chose \samp{\&} as a |
| 892 | new metacharacter, for example, old expressions would be assuming that |
| 893 | \samp{\&} was a regular character and wouldn't have escaped it by |
| 894 | writing \regexp{\e \&} or \regexp{[\&]}. |
| 895 | |
| 896 | The solution chosen by the Perl developers was to use \regexp{(?...)} |
| 897 | as the extension syntax. \samp{?} immediately after a parenthesis was |
| 898 | a syntax error because the \samp{?} would have nothing to repeat, so |
| 899 | this didn't introduce any compatibility problems. The characters |
| 900 | immediately after the \samp{?} indicate what extension is being used, |
| 901 | so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and |
| 902 | \regexp{(?:foo)} is something else (a non-capturing group containing |
| 903 | the subexpression \regexp{foo}). |
| 904 | |
| 905 | Python adds an extension syntax to Perl's extension syntax. If the |
| 906 | first character after the question mark is a \samp{P}, you know that |
| 907 | it's an extension that's specific to Python. Currently there are two |
| 908 | such extensions: \regexp{(?P<\var{name}>...)} defines a named group, |
| 909 | and \regexp{(?P=\var{name})} is a backreference to a named group. If |
| 910 | future versions of Perl 5 add similar features using a different |
| 911 | syntax, the \module{re} module will be changed to support the new |
| 912 | syntax, while preserving the Python-specific syntax for |
| 913 | compatibility's sake. |
| 914 | |
| 915 | Now that we've looked at the general extension syntax, we can return |
| 916 | to the features that simplify working with groups in complex REs. |
| 917 | Since groups are numbered from left to right and a complex expression |
| 918 | may use many groups, it can become difficult to keep track of the |
| 919 | correct numbering, and modifying such a complex RE is annoying. |
| 920 | Insert a new group near the beginning, and you change the numbers of |
| 921 | everything that follows it. |
| 922 | |
| 923 | First, sometimes you'll want to use a group to collect a part of a |
| 924 | regular expression, but aren't interested in retrieving the group's |
| 925 | contents. You can make this fact explicit by using a non-capturing |
| 926 | group: \regexp{(?:...)}, where you can put any other regular |
| 927 | expression inside the parentheses. |
| 928 | |
| 929 | \begin{verbatim} |
| 930 | >>> m = re.match("([abc])+", "abc") |
| 931 | >>> m.groups() |
| 932 | ('c',) |
| 933 | >>> m = re.match("(?:[abc])+", "abc") |
| 934 | >>> m.groups() |
| 935 | () |
| 936 | \end{verbatim} |
| 937 | |
| 938 | Except for the fact that you can't retrieve the contents of what the |
| 939 | group matched, a non-capturing group behaves exactly the same as a |
| 940 | capturing group; you can put anything inside it, repeat it with a |
| 941 | repetition metacharacter such as \samp{*}, and nest it within other |
| 942 | groups (capturing or non-capturing). \regexp{(?:...)} is particularly |
| 943 | useful when modifying an existing pattern, since you can add new groups |
| 944 | without changing how all the other groups are numbered. Note |
| 945 | that there's no performance difference in searching between |
| 946 | capturing and non-capturing groups; neither form is any faster than |
| 947 | the other. |
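
The effect on numbering can be seen in a short sketch (Python 3 syntax; the year/month pattern is a made-up example, not taken from any library). The separator needs grouping so that it can be an alternation, but capturing it would shift the number of the group that follows:

```python
import re

# The separator is grouped but not captured, so the month stays in group 2.
date = re.compile(r'(\d{4})(?:-|/)(\d{2})')
noncapturing = date.match('2005/08').groups()
print(noncapturing)   # ('2005', '08')

# With a capturing group instead, the month moves from group 2 to group 3.
date2 = re.compile(r'(\d{4})(-|/)(\d{2})')
capturing = date2.match('2005/08').groups()
print(capturing)      # ('2005', '/', '08')
```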
| 948 | |
| 949 | The second, and more significant, feature is named groups; instead of |
| 950 | referring to them by numbers, groups can be referenced by a name. |
| 951 | |
| 952 | The syntax for a named group is one of the Python-specific extensions: |
| 953 | \regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of |
| 954 | the group. Except for associating a name with a group, named groups |
| 955 | also behave identically to capturing groups. The \class{MatchObject} |
| 956 | methods that deal with capturing groups all accept either integers, to |
| 957 | refer to groups by number, or a string containing the group name. |
| 958 | Named groups are still given numbers, so you can retrieve information |
| 959 | about a group in two ways: |
| 960 | |
| 961 | \begin{verbatim} |
| 962 | >>> p = re.compile(r'(?P<word>\b\w+\b)') |
| 963 | >>> m = p.search( '(((( Lots of punctuation )))' ) |
| 964 | >>> m.group('word') |
| 965 | 'Lots' |
| 966 | >>> m.group(1) |
| 967 | 'Lots' |
| 968 | \end{verbatim} |
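
The same applies to the other \class{MatchObject} methods. As a small sketch in Python 3 syntax, reusing the pattern and string from the example above, \method{start()} and \method{span()} give the same answer for the name and for the number:

```python
import re

p = re.compile(r'(?P<word>\b\w+\b)')
m = p.search('(((( Lots of punctuation )))')

# The group can be addressed by name or by number interchangeably.
print(m.span('word'))    # (5, 9)
print(m.span(1))         # (5, 9)
print(m.start('word') == m.start(1))   # True
```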
| 969 | |
| 970 | Named groups are handy because they let you use easily-remembered |
| 971 | names, instead of having to remember numbers. Here's an example RE |
| 972 | from the \module{imaplib} module: |
| 973 | |
| 974 | \begin{verbatim} |
| 975 | InternalDate = re.compile(r'INTERNALDATE "' |
| 976 | r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-' |
| 977 | r'(?P<year>[0-9][0-9][0-9][0-9])' |
| 978 | r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])' |
| 979 | r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])' |
| 980 | r'"') |
| 981 | \end{verbatim} |
| 982 | |
| 983 | It's obviously much easier to retrieve \code{m.group('zonem')} |
| 984 | than to remember that the zone minutes are in group 9. |
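
To see the named groups in use, the pattern can be applied to a sample string. The input below is a hypothetical response fragment made up for this illustration; only the RE is taken from the text above (Python 3 syntax):

```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
    r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
    r'(?P<year>[0-9][0-9][0-9][0-9])'
    r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
    r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
    r'"')

# Hypothetical input, invented for this example.
sample = 'INTERNALDATE "17-Jul-1996 02:44:25 -0700"'
m = InternalDate.search(sample)

print(m.group('zonem'))                # 00
print(m.group(9))                      # 00 (same group, by number)
print(m.group('day', 'mon', 'year'))   # ('17', 'Jul', '1996')
```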
| 985 | |
| 986 | Since the syntax for backreferences, in an expression like |
| 987 | \regexp{(...)\e 1}, refers to the number of the group, there's |
| 988 | naturally a variant that uses the group name instead of the number. |
| 989 | This is also a Python extension: \regexp{(?P=\var{name})} indicates |
| 990 | that the contents of the group called \var{name} should again be found |
| 991 | at the current point. The regular expression for finding doubled |
| 992 | words, \regexp{(\e b\e w+)\e s+\e 1}, can also be written as |
| 993 | \regexp{(?P<word>\e b\e w+)\e s+(?P=word)}: |
| 994 | |
| 995 | \begin{verbatim} |
| 996 | >>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)') |
| 997 | >>> p.search('Paris in the the spring').group() |
| 998 | 'the the' |
| 999 | \end{verbatim} |
| 1000 | |
| 1001 | \subsection{Lookahead Assertions} |
| 1002 | |
| 1003 | Another zero-width assertion is the lookahead assertion. Lookahead |
| 1004 | assertions are available in both positive and negative form, and |
| 1005 | look like this: |
| 1006 | |
| 1007 | \begin{itemize} |
| 1008 | \item[\regexp{(?=...)}] Positive lookahead assertion. This succeeds |
| 1009 | if the contained regular expression, represented here by \code{...}, |
| 1010 | successfully matches at the current location, and fails otherwise. |
| 1011 | But, once the contained expression has been tried, the matching engine |
| 1012 | doesn't advance at all; the rest of the pattern is tried right where |
| 1013 | the assertion started. |
| 1014 | |
| 1015 | \item[\regexp{(?!...)}] Negative lookahead assertion. This is the |
| 1016 | opposite of the positive assertion; it succeeds if the contained expression |
| 1017 | \emph{doesn't} match at the current position in the string. |
| 1018 | \end{itemize} |
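
Before the longer example, here is a minimal sketch of both forms (Python 3 syntax; the \samp{foo}/\samp{bar} strings are placeholders). Note that the text examined by the assertion is not part of the match:

```python
import re

# Positive lookahead: 'foo' matches only when 'bar' follows, but 'bar'
# itself is not consumed.
pos = re.search(r'foo(?=bar)', 'foobar')
print(pos.group(), pos.span())    # foo (0, 3)
pos_miss = re.search(r'foo(?=bar)', 'foobaz')
print(pos_miss)                   # None

# Negative lookahead: 'foo' matches only when 'bar' does not follow.
neg = re.search(r'foo(?!bar)', 'foobaz')
print(neg.group())                # foo
neg_miss = re.search(r'foo(?!bar)', 'foobar')
print(neg_miss)                   # None
```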
| 1019 | |
| 1020 | An example will help make this concrete by demonstrating a case |
| 1021 | where a lookahead is useful. Consider a simple pattern to match a |
| 1022 | filename and split it apart into a base name and an extension, |
| 1023 | separated by a \samp{.}. For example, in \samp{news.rc}, \samp{news} |
| 1024 | is the base name, and \samp{rc} is the filename's extension. |
| 1025 | |
| 1026 | The pattern to match this is quite simple: |
| 1027 | |
| 1028 | \regexp{.*[.].*\$} |
| 1029 | |
| 1030 | Notice that the \samp{.} needs to be treated specially because it's a |
| 1031 | metacharacter; I've put it inside a character class. Also notice the |
| 1032 | trailing \regexp{\$}; this is added to ensure that all the rest of the |
| 1033 | string must be included in the extension. This regular expression |
| 1034 | matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and |
| 1035 | \samp{printers.conf}. |
| 1036 | |
| 1037 | Now, consider complicating the problem a bit; what if you want to |
| 1038 | match filenames where the extension is not \samp{bat}? |
| 1039 | Some incorrect attempts: |
| 1040 | |
| 1041 | \verb|.*[.][^b].*$| |
| 1042 | % $ |
| 1043 | |
| 1044 | The first attempt above tries to exclude \samp{bat} by requiring that |
| 1045 | the first character of the extension is not a \samp{b}. This is |
| 1046 | wrong, because the pattern also doesn't match \samp{foo.bar}. |
| 1047 | |
| 1048 | % Messes up the HTML without the curly braces around \^ |
| 1049 | \regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$} |
| 1050 | |
| 1051 | The expression gets messier when you try to patch up the first |
| 1052 | solution by requiring one of the following cases to match: the first |
| 1053 | character of the extension isn't \samp{b}; the second character isn't |
| 1054 | \samp{a}; or the third character isn't \samp{t}. This accepts |
| 1055 | \samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a |
| 1056 | three-letter extension and won't accept a filename with a two-letter |
| 1057 | extension such as \samp{sendmail.cf}. We'll complicate the pattern |
| 1058 | again in an effort to fix it. |
| 1059 | |
| 1060 | \regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$} |
| 1061 | |
| 1062 | In the third attempt, the second and third letters are both made |
| 1063 | optional in order to allow matching extensions shorter than three |
| 1064 | characters, such as \samp{sendmail.cf}. |
| 1065 | |
| 1066 | The pattern's getting really complicated now, which makes it hard to |
| 1067 | read and understand. Worse, if the problem changes and you want to |
| 1068 | exclude both \samp{bat} and \samp{exe} as extensions, the pattern |
| 1069 | would get even more complicated and confusing. |
| 1070 | |
| 1071 | A negative lookahead cuts through all this: |
| 1072 | |
| 1073 | \regexp{.*[.](?!bat\$).*\$} |
| 1074 | % $ |
| 1075 | |
| 1076 | The negative lookahead means: if the expression \regexp{bat\$} doesn't match at |
| 1077 | this point, try the rest of the pattern; if \regexp{bat\$} does match, |
| 1078 | the whole pattern will fail. The trailing \regexp{\$} is required to |
| 1079 | ensure that something like \samp{sample.batch}, where the extension |
| 1080 | only starts with \samp{bat}, will be allowed. |
| 1081 | |
| 1082 | Excluding another filename extension is now easy; simply add it as an |
| 1083 | alternative inside the assertion. The following pattern excludes |
| 1084 | filenames that end in either \samp{bat} or \samp{exe}: |
| 1085 | |
| 1086 | \regexp{.*[.](?!bat\$|exe\$).*\$} |
| 1087 | % $ |
| 1088 | |
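The patterns from this section can be checked against a handful of filenames. This is only a sketch in Python 3 syntax; the list of names and the helper \function{accepted()} are assumptions introduced for the illustration, and every name contains a single \samp{.}.

```python
import re

names = ['foo.bar', 'autoexec.bat', 'sendmail.cf', 'sample.batch', 'setup.exe']

first_try = re.compile(r'.*[.][^b].*$')             # first incorrect attempt
no_bat = re.compile(r'.*[.](?!bat$).*$')            # negative lookahead
no_bat_exe = re.compile(r'.*[.](?!bat$|exe$).*$')   # two excluded extensions

def accepted(pattern):
    """Return the names that the compiled pattern matches (hypothetical helper)."""
    return [name for name in names if pattern.match(name)]

print(accepted(first_try))    # ['sendmail.cf', 'setup.exe']
print(accepted(no_bat))       # ['foo.bar', 'sendmail.cf', 'sample.batch', 'setup.exe']
print(accepted(no_bat_exe))   # ['foo.bar', 'sendmail.cf', 'sample.batch']
```

The first attempt wrongly rejects \samp{foo.bar} and \samp{sample.batch}, while the lookahead versions reject only the extensions named inside the assertion.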
| 1089 | |
| 1090 | \section{Modifying Strings} |
| 1091 | |
| 1092 | Up to this point, we've simply performed searches against a static |
| 1093 | string. Regular expressions are also commonly used to modify a string |
| 1094 | in various ways, using the following \class{RegexObject} methods: |
| 1095 | |
| 1096 | \begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} |
| 1097 | \lineii{split()}{Split the string into a list, splitting it wherever the RE matches} |
| 1098 | \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string} |
| 1099 | \lineii{subn()}{Does the same thing as \method{sub()}, |
| 1100 | but returns the new string and the number of replacements} |
| 1101 | \end{tableii} |
| 1102 | |
| 1103 | |
| 1104 | \subsection{Splitting Strings} |
| 1105 | |
| 1106 | The \method{split()} method of a \class{RegexObject} splits a string |
| 1107 | apart wherever the RE matches, returning a list of the pieces. |
| 1108 | It's similar to the \method{split()} method of strings but |
| 1109 | provides much more |
| 1110 | generality in the delimiters that you can split by; |
| 1111 | the string \method{split()} only supports splitting by whitespace or by |
| 1112 | a fixed string. As you'd expect, there's a module-level |
| 1113 | \function{re.split()} function, too. |
| 1114 | |
| 1115 | \begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}} |
| 1116 | Split \var{string} by the matches of the regular expression. If |
| 1117 | capturing parentheses are used in the RE, then their contents will |
| 1118 | also be returned as part of the resulting list. If \var{maxsplit} |
| 1119 | is nonzero, at most \var{maxsplit} splits are performed. |
| 1120 | \end{methoddesc} |
| 1121 | |
| 1122 | You can limit the number of splits made, by passing a value for |
| 1123 | \var{maxsplit}. When \var{maxsplit} is nonzero, at most |
| 1124 | \var{maxsplit} splits will be made, and the remainder of the string is |
| 1125 | returned as the final element of the list. In the following example, |
| 1126 | the delimiter is any sequence of non-alphanumeric characters. |
| 1127 | |
| 1128 | \begin{verbatim} |
| 1129 | >>> p = re.compile(r'\W+') |
| 1130 | >>> p.split('This is a test, short and sweet, of split().') |
| 1131 | ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', ''] |
| 1132 | >>> p.split('This is a test, short and sweet, of split().', 3) |
| 1133 | ['This', 'is', 'a', 'test, short and sweet, of split().'] |
| 1134 | \end{verbatim} |
| 1135 | |
| 1136 | Sometimes you're not only interested in what the text between |
| 1137 | delimiters is, but also need to know what the delimiter was. If |
| 1138 | capturing parentheses are used in the RE, then their values are also |
| 1139 | returned as part of the list. Compare the following calls: |
| 1140 | |
| 1141 | \begin{verbatim} |
| 1142 | >>> p = re.compile(r'\W+') |
| 1143 | >>> p2 = re.compile(r'(\W+)') |
| 1144 | >>> p.split('This... is a test.') |
| 1145 | ['This', 'is', 'a', 'test', ''] |
| 1146 | >>> p2.split('This... is a test.') |
| 1147 | ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', ''] |
| 1148 | \end{verbatim} |
| 1149 | |
| 1150 | The module-level function \function{re.split()} adds the RE to be |
| 1151 | used as the first argument, but is otherwise the same. |
| 1152 | |
| 1153 | \begin{verbatim} |
| 1154 | >>> re.split(r'[\W]+', 'Words, words, words.') |
| 1155 | ['Words', 'words', 'words', ''] |
| 1156 | >>> re.split(r'([\W]+)', 'Words, words, words.') |
| 1157 | ['Words', ', ', 'words', ', ', 'words', '.', ''] |
| 1158 | >>> re.split(r'[\W]+', 'Words, words, words.', 1) |
| 1159 | ['Words', 'words, words.'] |
| 1160 | \end{verbatim} |
| 1161 | |
| 1162 | \subsection{Search and Replace} |
| 1163 | |
| 1164 | Another common task is to find all the matches for a pattern, and |
| 1165 | replace them with a different string. The \method{sub()} method takes |
| 1166 | a replacement value, which can be either a string or a function, and |
| 1167 | the string to be processed. |
| 1168 | |
| 1169 | \begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}} |
| 1170 | Returns the string obtained by replacing the leftmost non-overlapping |
| 1171 | occurrences of the RE in \var{string} by the replacement |
| 1172 | \var{replacement}. If the pattern isn't found, \var{string} is returned |
| 1173 | unchanged. |
| 1174 | |
| 1175 | The optional argument \var{count} is the maximum number of pattern |
| 1176 | occurrences to be replaced; \var{count} must be a non-negative |
| 1177 | integer. The default value of 0 means to replace all occurrences. |
| 1178 | \end{methoddesc} |
| 1179 | |
| 1180 | Here's a simple example of using the \method{sub()} method. It |
| 1181 | replaces colour names with the word \samp{colour}: |
| 1182 | |
| 1183 | \begin{verbatim} |
| 1184 | >>> p = re.compile( '(blue|white|red)') |
| 1185 | >>> p.sub( 'colour', 'blue socks and red shoes') |
| 1186 | 'colour socks and colour shoes' |
| 1187 | >>> p.sub( 'colour', 'blue socks and red shoes', count=1) |
| 1188 | 'colour socks and red shoes' |
| 1189 | \end{verbatim} |
| 1190 | |
| 1191 | The \method{subn()} method does the same work, but returns a 2-tuple |
| 1192 | containing the new string value and the number of replacements |
| 1193 | that were performed: |
| 1194 | |
| 1195 | \begin{verbatim} |
| 1196 | >>> p = re.compile( '(blue|white|red)') |
| 1197 | >>> p.subn( 'colour', 'blue socks and red shoes') |
| 1198 | ('colour socks and colour shoes', 2) |
| 1199 | >>> p.subn( 'colour', 'no colours at all') |
| 1200 | ('no colours at all', 0) |
| 1201 | \end{verbatim} |
| 1202 | |
| 1203 | Empty matches are replaced only when they're not |
| 1204 | adjacent to a previous match. |
| 1205 | |
| 1206 | \begin{verbatim} |
| 1207 | >>> p = re.compile('x*') |
| 1208 | >>> p.sub('-', 'abxd') |
| 1209 | '-a-b-d-' |
| 1210 | \end{verbatim} |
| 1211 | |
| 1212 | If \var{replacement} is a string, any backslash escapes in it are |
| 1213 | processed. That is, \samp{\e n} is converted to a single newline |
| 1214 | character, \samp{\e r} is converted to a carriage return, and so forth. |
| 1215 | Unknown escapes such as \samp{\e j} are left alone. Backreferences, |
| 1216 | such as \samp{\e 6}, are replaced with the substring matched by the |
| 1217 | corresponding group in the RE. This lets you incorporate |
| 1218 | portions of the original text in the resulting |
| 1219 | replacement string. |
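
A brief sketch of both kinds of processing (Python 3 syntax; the patterns and input strings are made up for illustration). The replacement is written as a raw string, so it is \method{sub()} itself, not the Python string literal, that interprets the backslash sequences:

```python
import re

# \n in the replacement is turned into a real newline by sub().
result = re.sub(r'\s*;\s*', r'\n', 'one ; two;three')
print(repr(result))   # 'one\ntwo\nthree'

# \1 and \2 insert the text captured by groups 1 and 2.
swapped = re.sub(r'(\w+)=(\w+)', r'\2=\1', 'key=value')
print(swapped)        # value=key
```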
| 1220 | |
| 1221 | This example matches the word \samp{section} followed by a string |
| 1222 | enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to |
| 1223 | \samp{subsection}: |
| 1224 | |
| 1225 | \begin{verbatim} |
| 1226 | >>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE) |
| 1227 | >>> p.sub(r'subsection{\1}','section{First} section{second}') |
| 1228 | 'subsection{First} subsection{second}' |
| 1229 | \end{verbatim} |
| 1230 | |
| 1231 | There's also a syntax for referring to named groups as defined by the |
| 1232 | \regexp{(?P<name>...)} syntax. \samp{\e g<name>} will use the |
| 1233 | substring matched by the group named \samp{name}, and |
| 1234 | \samp{\e g<\var{number}>} |
| 1235 | uses the corresponding group number. |
| 1236 | \samp{\e g<2>} is therefore equivalent to \samp{\e 2}, |
| 1237 | but isn't ambiguous in a |
| 1238 | replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be |
| 1239 | interpreted as a reference to group 20, not a reference to group 2 |
| 1240 | followed by the literal character \character{0}.) The following |
| 1241 | substitutions are all equivalent, but use all three variations of the |
| 1242 | replacement string. |
| 1243 | |
| 1244 | \begin{verbatim} |
| 1245 | >>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE) |
| 1246 | >>> p.sub(r'subsection{\1}','section{First}') |
| 1247 | 'subsection{First}' |
| 1248 | >>> p.sub(r'subsection{\g<1>}','section{First}') |
| 1249 | 'subsection{First}' |
| 1250 | >>> p.sub(r'subsection{\g<name>}','section{First}') |
| 1251 | 'subsection{First}' |
| 1252 | \end{verbatim} |
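
The ambiguity that \samp{\e g<2>0} avoids can be demonstrated with a two-group pattern. This is a sketch in Python 3 syntax under the assumption of a made-up pattern and input; the exact wording of the error message may differ between versions, so only the fact that an error is raised is shown:

```python
import re

p = re.compile(r'(\d)(\d)')

# \g<2>0 means "the contents of group 2, then a literal 0".
unambiguous = p.sub(r'\g<2>0', '12')
print(unambiguous)   # 20

# \20 is read as a reference to group 20, which this pattern doesn't have.
raised = False
try:
    p.sub(r'\20', '12')
except re.error:
    raised = True
print(raised)        # True
```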
| 1253 | |
| 1254 | \var{replacement} can also be a function, which gives you even more |
| 1255 | control. If \var{replacement} is a function, the function is |
| 1256 | called for every non-overlapping occurrence of \var{pattern}. On each |
| 1257 | call, the function is |
| 1258 | passed a \class{MatchObject} argument for the match |
| 1259 | and can use this information to compute the desired replacement string and return it. |
| 1260 | |
| 1261 | In the following example, the replacement function translates |
| 1262 | decimals into hexadecimal: |
| 1263 | |
| 1264 | \begin{verbatim} |
| 1265 | >>> def hexrepl( match ): |
| 1266 | ...     "Return the hex string for a decimal number" |
| 1267 | ...     value = int( match.group() ) |
| 1268 | ...     return hex(value) |
| 1269 | ... |
| 1270 | >>> p = re.compile(r'\d+') |
| 1271 | >>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.') |
| 1272 | 'Call 0xffd2 for printing, 0xc000 for user code.' |
| 1273 | \end{verbatim} |
| 1274 | |
When using the module-level \function{re.sub()} function, the pattern
is passed as the first argument.  The pattern may be a string or a
\class{RegexObject}; if you need to specify regular expression flags,
you must either use a \class{RegexObject} as the first parameter, or use
embedded modifiers in the pattern, e.g.
\code{sub("(?i)b+", "x", "bbbb BBBB")} returns \code{'x x'}.

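The two options just described can be sketched side by side.  This is
only an illustration; the variable names are made up, and the sample
string is the one used in the paragraph above:

```python
import re

text = 'bbbb BBBB'

# Option 1: compile with the flag and pass the compiled object
# as the pattern argument of the module-level function.
compiled = re.compile('b+', re.IGNORECASE)
with_object = re.sub(compiled, 'x', text)

# Option 2: keep a plain string pattern and embed the (?i) modifier.
with_modifier = re.sub('(?i)b+', 'x', text)

print(with_object)
print(with_modifier)
```

Both calls should return \code{'x x'}.
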
\section{Common Problems}

Regular expressions are a powerful tool for some applications, but
their behaviour isn't always intuitive, and at times they don't
behave the way you might expect.  This section will point out
some of the most common pitfalls.

\subsection{Use String Methods}

Sometimes using the \module{re} module is a mistake.  If you're
matching a fixed string, or a single character class, and you're not
using any \module{re} features such as the \constant{IGNORECASE} flag,
then the full power of regular expressions may not be required.
Strings have several methods for performing operations with fixed
strings and they're usually much faster, because the implementation is
a single small C loop that's been optimized for the purpose, instead
of the large, more generalized regular expression engine.

One example might be replacing a single fixed string with another
one; for example, you might replace \samp{word}
with \samp{deed}.  \code{re.sub()} seems like the function to use for
this, but consider the \method{replace()} method.  Note that
\method{replace()} will also replace \samp{word} inside
words, turning \samp{swordfish} into \samp{sdeedfish}, but the
na{\"\i}ve RE \regexp{word} would have done that, too.  (To avoid performing
the substitution on parts of words, the pattern would have to be
\regexp{\e bword\e b}, in order to require that \samp{word} have a
word boundary on either side.  This takes the job beyond
\method{replace()}'s abilities.)

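A minimal sketch of the comparison above, using an invented sample
sentence: the string method and the na{\"\i}ve RE behave identically,
and only the word-boundary pattern leaves \samp{swordfish} alone.

```python
import re

text = 'a word about swordfish'

# The string method and the naive RE both rewrite 'swordfish' too.
by_method = text.replace('word', 'deed')
by_naive_re = re.sub('word', 'deed', text)

# \b restricts the match to 'word' standing on its own.
by_boundary_re = re.sub(r'\bword\b', 'deed', text)

print(by_method)
print(by_naive_re)
print(by_boundary_re)
```
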
Another common task is deleting every occurrence of a single character
from a string or replacing it with another single character.  You
might do this with something like \code{re.sub('\e n', ' ', S)}, but
\method{translate()} is capable of doing both tasks
and will be faster than any regular expression operation can be.

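Both tasks can be sketched with \method{translate()} as follows.
This assumes the Python 3 form of the method, which takes a mapping
from character ordinals to replacements; older versions used a
different calling convention.  The sample string is invented.

```python
import re

S = 'line one\nline two\n'

# Replace every newline with a space: the table maps the character's
# ordinal to its replacement.
replaced = S.translate({ord('\n'): ' '})

# Delete every newline: mapping an ordinal to None removes the character.
deleted = S.translate({ord('\n'): None})

# The regular expression version of the replacement, for comparison.
with_re = re.sub('\n', ' ', S)

print(repr(replaced))
print(repr(deleted))
```
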
In short, before turning to the \module{re} module, consider whether
your problem can be solved with a faster and simpler string method.

\subsection{match() versus search()}

The \function{match()} function only checks if the RE matches at
the beginning of the string, while \function{search()} will scan
forward through the string for a match.
It's important to keep this distinction in mind.  Remember,
\function{match()} will only report a successful match that
starts at position 0; if the match wouldn't start at position 0,
\function{match()} will \emph{not} report it.

\begin{verbatim}
>>> print re.match('super', 'superstition').span()
(0, 5)
>>> print re.match('super', 'insuperable')
None
\end{verbatim}

On the other hand, \function{search()} will scan forward through the
string, reporting the first match it finds.

\begin{verbatim}
>>> print re.search('super', 'superstition').span()
(0, 5)
>>> print re.search('super', 'insuperable').span()
(2, 7)
\end{verbatim}

Sometimes you'll be tempted to keep using \function{re.match()}, and
just add \regexp{.*} to the front of your RE.  Resist this temptation
and use \function{re.search()} instead.  The regular expression
compiler does some analysis of REs in order to speed up the process of
looking for a match.  One such analysis figures out what the first
character of a match must be; for example, a pattern starting with
\regexp{Crow} must match starting with a \character{C}.  The analysis
lets the engine quickly scan through the string looking for the
starting character, only trying the full match if a \character{C} is found.

Adding \regexp{.*} defeats this optimization, requiring scanning to
the end of the string and then backtracking to find a match for the
rest of the RE.  Use \function{re.search()} instead.

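Speed aside, the two approaches also report different positions.  The
sketch below reuses the \samp{insuperable} string from the earlier
examples; the variable names are made up for illustration.

```python
import re

s = 'insuperable'

# Prefixing .* lets match() succeed, but the reported span now starts
# at 0 and includes the text that .* skipped over.
padded = re.match('.*super', s)
print(padded.span())

# search() finds the same text and reports where it really starts.
found = re.search('super', s)
print(found.span())
```

The padded pattern should report \code{(0, 7)}, while
\function{search()} reports \code{(2, 7)}.
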
\subsection{Greedy versus Non-Greedy}

When repeating a regular expression, as in \regexp{a*}, the resulting
action is to consume as much of the string as possible.  This
fact often bites you when you're trying to match a pair of
balanced delimiters, such as the angle brackets surrounding an HTML
tag.  The na{\"\i}ve pattern for matching a single HTML tag doesn't
work because of the greedy nature of \regexp{.*}.

\begin{verbatim}
>>> s = '<html><head><title>Title</title>'
>>> len(s)
32
>>> print re.match('<.*>', s).span()
(0, 32)
>>> print re.match('<.*>', s).group()
<html><head><title>Title</title>
\end{verbatim}

The RE matches the \character{<} in \samp{<html>}, and the
\regexp{.*} consumes the rest of the string.  There's still more left
in the RE, though, and the \regexp{>} can't match at the end of
the string, so the regular expression engine has to backtrack
character by character until it finds a match for the \regexp{>}.
The final match extends from the \character{<} in \samp{<html>}
to the \character{>} in \samp{</title>}, which isn't what you want.

In this case, the solution is to use the non-greedy qualifiers
\regexp{*?}, \regexp{+?}, \regexp{??}, or
\regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as
possible.  In the above example, the \character{>} is tried
immediately after the first \character{<} matches, and when it fails,
the engine advances a character at a time, retrying the \character{>}
at every step.  This produces just the right result:

\begin{verbatim}
>>> print re.match('<.*?>', s).group()
<html>
\end{verbatim}

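The difference is also visible when collecting every match rather
than just the first.  The following sketch applies
\function{re.findall()} to the same string \code{s} with both forms
of the pattern:

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: .* runs to the end and backtracks to the last '>',
# so the whole string comes back as a single match.
greedy = re.findall('<.*>', s)

# Non-greedy: .*? stops at the first '>' each time,
# so each tag is returned separately.
lazy = re.findall('<.*?>', s)

print(greedy)
print(lazy)
```
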
(Note that parsing HTML or XML with regular expressions is painful.
Quick-and-dirty patterns will handle common cases, but HTML and XML
have special cases that will break the obvious regular expression; by
the time you've written a regular expression that handles all of the
possible cases, the patterns will be \emph{very} complicated.  Use an
HTML or XML parser module for such tasks.)

\subsection{Not Using re.VERBOSE}

By now you've probably noticed that regular expressions are a very
compact notation, but they're not terribly readable.  REs of
moderate complexity can become lengthy collections of backslashes,
parentheses, and metacharacters, making them difficult to read and
understand.

For such REs, specifying the \code{re.VERBOSE} flag when
compiling the regular expression can be helpful, because it allows
you to format the regular expression more clearly.

The \code{re.VERBOSE} flag has several effects.  Whitespace in the
regular expression that \emph{isn't} inside a character class is
ignored.  This means that an expression such as \regexp{dog | cat} is
equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]}
will still match the characters \character{a}, \character{b}, or a
space.  In addition, you can put comments inside an RE; comments
extend from a \samp{\#} character to the next newline.  When used with
triple-quoted strings, this enables REs to be formatted more neatly:

\begin{verbatim}
pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
\end{verbatim}
% $

This is far more readable than:

\begin{verbatim}
pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
\end{verbatim}
% $

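To check that the two spellings really are the same pattern, a quick
sketch can compile both and compare what they extract.  The names
\code{verbose\_pat} and \code{compact\_pat} and the sample header
line are assumptions made for this illustration.

```python
import re

verbose_pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value, matched non-greedily
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

compact_pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

# Hypothetical header line with trailing whitespace.
line = 'From:author@example.com   '

verbose_groups = verbose_pat.match(line).groupdict()
compact_groups = compact_pat.match(line).groupdict()

print(verbose_groups['header'])
print(verbose_groups['value'])
```

Both patterns should yield \code{'From'} as the header and
\code{'author@example.com'} as the value, with the trailing
whitespace dropped.
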
\section{Feedback}

Regular expressions are a complicated topic.  Did this document help
you understand them?  Were there parts that were unclear, or problems
you encountered that weren't covered here?  If so, please send
suggestions for improvements to the author.

The most complete book on regular expressions is almost certainly
Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published
by O'Reilly.  Unfortunately, it exclusively concentrates on Perl and
Java's flavours of regular expressions, and doesn't contain any Python
material at all, so it won't be useful as a reference for programming
in Python.  (The first edition covered Python's now-obsolete
\module{regex} module, which won't help you much.)  Consider checking
it out from your library.

\end{document}
