.. _regex-howto:

****************************
Regular Expression HOWTO
****************************

:Author: A.M. Kuchling <amk@amk.ca>
:Release: 0.05

.. TODO:
   Document lookbehind assertions
   Better way of displaying a RE, a string, and what it matches
   Mention optional argument to match.groups()
   Unicode (at least a reference)


.. topic:: Abstract

   This document is an introductory tutorial to using regular expressions in Python
   with the :mod:`re` module. It provides a gentler introduction than the
   corresponding section in the Library Reference.


Introduction
============

The :mod:`re` module was added in Python 1.5, and provides Perl-style regular
expression patterns. Earlier versions of Python came with the :mod:`regex`
module, which provided Emacs-style patterns. The :mod:`regex` module was
removed completely in Python 2.5.

Regular expressions (called REs, or regexes, or regex patterns) are essentially
a tiny, highly specialized programming language embedded inside Python and made
available through the :mod:`re` module. Using this little language, you specify
the rules for the set of possible strings that you want to match; this set might
contain English sentences, or e-mail addresses, or TeX commands, or anything you
like. You can then ask questions such as "Does this string match the pattern?",
or "Is there a match for the pattern anywhere in this string?". You can also
use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes which are
then executed by a matching engine written in C. For advanced use, it may be
necessary to pay careful attention to how the engine will execute a given RE,
and write the RE in a certain way in order to produce bytecode that runs faster.
Optimization isn't covered in this document, because it requires that you have a
good understanding of the matching engine's internals.

The regular expression language is relatively small and restricted, so not all
possible string processing tasks can be done using regular expressions. There
are also tasks that *can* be done with regular expressions, but the expressions
turn out to be very complicated. In these cases, you may be better off writing
Python code to do the processing; while Python code will be slower than an
elaborate regular expression, it will also probably be more understandable.


Simple Patterns
===============

We'll start by learning about the simplest possible regular expressions. Since
regular expressions are used to operate on strings, we'll begin with the most
common task: matching characters.

For a detailed explanation of the computer science underlying regular
expressions (deterministic and non-deterministic finite automata), you can refer
to almost any textbook on writing compilers.


Matching Characters
-------------------

Most letters and characters will simply match themselves. For example, the
regular expression ``test`` will match the string ``test`` exactly. (You can
enable a case-insensitive mode that would let this RE match ``Test`` or ``TEST``
as well; more about this later.)

There are exceptions to this rule; some characters are special
:dfn:`metacharacters`, and don't match themselves. Instead, they signal that
some out-of-the-ordinary thing should be matched, or they affect other portions
of the RE by repeating them or changing their meaning. Much of this document is
devoted to discussing various metacharacters and what they do.

Here's a complete list of the metacharacters; their meanings will be discussed
in the rest of this HOWTO. ::

   . ^ $ * + ? { } [ ] \ | ( )

The first metacharacters we'll look at are ``[`` and ``]``. They're used for
specifying a character class, which is a set of characters that you wish to
match. Characters can be listed individually, or a range of characters can be
indicated by giving two characters and separating them by a ``'-'``. For
example, ``[abc]`` will match any of the characters ``a``, ``b``, or ``c``; this
is the same as ``[a-c]``, which uses a range to express the same set of
characters. If you wanted to match only lowercase letters, your RE would be
``[a-z]``.

Most metacharacters are not active inside classes. For example, ``[akm$]`` will
match any of the characters ``'a'``, ``'k'``, ``'m'``, or ``'$'``; ``'$'`` is
usually a metacharacter, but inside a character class it's stripped of its
special nature.

You can match the characters not listed within the class by :dfn:`complementing`
the set. This is indicated by including a ``'^'`` as the first character of the
class; a ``'^'`` anywhere else inside a character class simply matches the
``'^'`` character. For example, ``[^5]`` will match any character except ``'5'``.

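As a quick check of the character classes described so far, here is a minimal
sketch. It uses :func:`re.findall`, which is covered later in this HOWTO; the
sample string and variable names are made up for illustration. ::

   import re

   # Hypothetical sample text, chosen only for illustration.
   text = 'back$lash 5 cabs'

   # [abc] and [a-c] describe the same set of characters.
   listed = re.findall('[abc]', text)
   ranged = re.findall('[a-c]', text)

   # Inside a class, '$' is an ordinary character.
   dollars = re.findall('[akm$]', text)

   # A leading '^' complements the class: every character except '5'.
   not_five = ''.join(re.findall('[^5]', text))

   print(listed == ranged)   # True
   print(dollars)            # ['a', 'k', '$', 'a', 'a']
   print(repr(not_five))     # 'back$lash  cabs'
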
Perhaps the most important metacharacter is the backslash, ``\``. As in Python
string literals, the backslash can be followed by various characters to signal
various special sequences. It's also used to escape all the metacharacters so
you can still match them in patterns; for example, if you need to match a ``[``
or ``\``, you can precede them with a backslash to remove their special
meaning: ``\[`` or ``\\``.

Some of the special sequences beginning with ``'\'`` represent predefined sets
of characters that are often useful, such as the set of digits, the set of
letters, or the set of anything that isn't whitespace. The following predefined
special sequences are available:

``\d``
   Matches any decimal digit; this is equivalent to the class ``[0-9]``.

``\D``
   Matches any non-digit character; this is equivalent to the class ``[^0-9]``.

``\s``
   Matches any whitespace character; this is equivalent to the class
   ``[ \t\n\r\f\v]``.

``\S``
   Matches any non-whitespace character; this is equivalent to the class
   ``[^ \t\n\r\f\v]``.

``\w``
   Matches any alphanumeric character; this is equivalent to the class
   ``[a-zA-Z0-9_]``.

``\W``
   Matches any non-alphanumeric character; this is equivalent to the class
   ``[^a-zA-Z0-9_]``.

These sequences can be included inside a character class. For example,
``[\s,.]`` is a character class that will match any whitespace character, or
``','`` or ``'.'``.

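The predefined sequences can be tried out in the same way. This is only a
sketch: the sample sentence is invented, and :func:`re.findall` is described
later in this HOWTO. ::

   import re

   sample = 'Room 42, floor 7.'   # hypothetical input

   digits = re.findall(r'\d', sample)            # single decimal digits
   words = re.findall(r'\w+', sample)            # runs of alphanumerics
   separators = re.findall(r'[\s,.]', sample)    # whitespace, ',' or '.'

   print(digits)       # ['4', '2', '7']
   print(words)        # ['Room', '42', 'floor', '7']
   print(separators)   # [' ', ',', ' ', ' ', '.']
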
The final metacharacter in this section is ``.``. It matches anything except a
newline character, and there's an alternate mode (``re.DOTALL``) where it will
match even a newline. ``'.'`` is often used where you want to match "any
character".

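The effect of ``re.DOTALL`` is easy to see on a string that contains a newline.
A small sketch, using :func:`re.match` (introduced later) and a made-up
two-line string::

   import re

   two_lines = 'ab\ncd'   # hypothetical two-line string

   # By default '.' stops at the newline.
   default_match = re.match('.+', two_lines).group()

   # With re.DOTALL, '.' matches the newline as well.
   dotall_match = re.match('.+', two_lines, re.DOTALL).group()

   print(repr(default_match))   # 'ab'
   print(repr(dotall_match))    # 'ab\ncd'
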

Repeating Things
----------------

Being able to match varying sets of characters is the first thing regular
expressions can do that isn't already possible with the methods available on
strings. However, if that was the only additional capability of regexes, they
wouldn't be much of an advance. Another capability is that you can specify that
portions of the RE must be repeated a certain number of times.

The first metacharacter for repeating things that we'll look at is ``*``. ``*``
doesn't match the literal character ``*``; instead, it specifies that the
previous character can be matched zero or more times, instead of exactly once.

For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``),
``caaat`` (3 ``a`` characters), and so forth. The RE engine has various
internal limitations stemming from the size of C's ``int`` type that will
prevent it from matching over 2 billion ``a`` characters; you probably don't
have enough memory to construct a string that large, so you shouldn't run into
that limit.

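A short sketch of ``ca*t`` in action. It relies on the ``fullmatch()`` method
of compiled patterns, which is not covered in this HOWTO and is assumed to be
available (Python 3.4 and later); the candidate strings are illustrative. ::

   import re

   star = re.compile('ca*t')
   candidates = ['ct', 'cat', 'caaat', 'cart']   # hypothetical test strings

   # fullmatch() succeeds only if the whole string fits the pattern.
   accepted = [s for s in candidates if star.fullmatch(s)]

   print(accepted)   # ['ct', 'cat', 'caaat']
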
Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching
engine will try to repeat it as many times as possible. If later portions of the
pattern don't match, the matching engine will then back up and try again with
fewer repetitions.

A step-by-step example will make this more obvious. Let's consider the
expression ``a[bcd]*b``. This matches the letter ``'a'``, zero or more letters
from the class ``[bcd]``, and finally ends with a ``'b'``. Now imagine matching
this RE against the string ``abcbd``.

+------+-----------+---------------------------------+
| Step | Matched   | Explanation                     |
+======+===========+=================================+
| 1    | ``a``     | The ``a`` in the RE matches.    |
+------+-----------+---------------------------------+
| 2    | ``abcbd`` | The engine matches ``[bcd]*``,  |
|      |           | going as far as it can, which   |
|      |           | is to the end of the string.    |
+------+-----------+---------------------------------+
| 3    | *Failure* | The engine tries to match       |
|      |           | ``b``, but the current position |
|      |           | is at the end of the string, so |
|      |           | it fails.                       |
+------+-----------+---------------------------------+
| 4    | ``abcb``  | Back up, so that ``[bcd]*``     |
|      |           | matches one less character.     |
+------+-----------+---------------------------------+
| 5    | *Failure* | Try ``b`` again, but the        |
|      |           | current position is at the last |
|      |           | character, which is a ``'d'``.  |
+------+-----------+---------------------------------+
| 6    | ``abc``   | Back up again, so that          |
|      |           | ``[bcd]*`` is only matching     |
|      |           | ``bc``.                         |
+------+-----------+---------------------------------+
| 7    | ``abcb``  | Try ``b`` again. This time      |
|      |           | the character at the            |
|      |           | current position is ``'b'``, so |
|      |           | it succeeds.                    |
+------+-----------+---------------------------------+

The end of the RE has now been reached, and it has matched ``abcb``. This
demonstrates how the matching engine goes as far as it can at first, and if no
match is found it will then progressively back up and retry the rest of the RE
again and again. It will back up until it has tried zero matches for
``[bcd]*``, and if that subsequently fails, the engine will conclude that the
string doesn't match the RE at all.

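The outcome of this walk-through can be confirmed from Python. The sketch below
uses :func:`re.match`, described later; the second string is a made-up example
of the case where even zero repetitions of ``[bcd]*`` fail. ::

   import re

   m = re.match('a[bcd]*b', 'abcbd')
   matched = m.group()
   span = m.span()

   # No 'b' can follow the class here, so backing up all the way still fails.
   no_match = re.match('a[bcd]*b', 'acd')

   print(matched, span)   # abcb (0, 4)
   print(no_match)        # None
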
Another repeating metacharacter is ``+``, which matches one or more times. Pay
careful attention to the difference between ``*`` and ``+``; ``*`` matches
*zero* or more times, so whatever's being repeated may not be present at all,
while ``+`` requires at least *one* occurrence. To use a similar example,
``ca+t`` will match ``cat`` (1 ``a``) and ``caaat`` (3 ``a`` characters), but
won't match ``ct``.

There are two more repeating quantifiers. The question mark character, ``?``,
matches either once or zero times; you can think of it as marking something as
being optional. For example, ``home-?brew`` matches either ``homebrew`` or
``home-brew``.

The most complicated quantifier is ``{m,n}``, where *m* and *n* are
decimal integers. This quantifier means there must be at least *m* repetitions,
and at most *n*. For example, ``a/{1,3}b`` will match ``a/b``, ``a//b``, and
``a///b``. It won't match ``ab``, which has no slashes, or ``a////b``, which
has four.

You can omit either *m* or *n*; in that case, a reasonable value is assumed for
the missing value. Omitting *m* is interpreted as a lower limit of 0, while
omitting *n* results in an upper bound of infinity. (Strictly speaking, the
upper bound is the 2-billion limit mentioned earlier, but that might as well be
infinity.)

Readers of a reductionist bent may notice that the three other quantifiers can
all be expressed using this notation. ``{0,}`` is the same as ``*``, ``{1,}``
is equivalent to ``+``, and ``{0,1}`` is the same as ``?``. It's better to use
``*``, ``+``, or ``?`` when you can, simply because they're shorter and easier
to read.

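The examples from this section can be checked with a few lines of Python. This
is a sketch only: ``accepts`` is a hypothetical helper written for this
illustration, and it assumes :func:`re.fullmatch` is available (Python 3.4 and
later). ::

   import re

   def accepts(pattern, candidates):
       """Return the candidates that the pattern matches in full."""
       return [s for s in candidates if re.fullmatch(pattern, s)]

   plus = accepts('ca+t', ['ct', 'cat', 'caaat'])
   optional = accepts('home-?brew', ['homebrew', 'home-brew', 'home--brew'])
   bounded = accepts('a/{1,3}b', ['ab', 'a/b', 'a//b', 'a///b', 'a////b'])

   # {0,} accepts the same strings as * for these candidates.
   same_as_star = accepts('ca{0,}t', ['ct', 'cat']) == accepts('ca*t', ['ct', 'cat'])

   print(plus)           # ['cat', 'caaat']
   print(optional)       # ['homebrew', 'home-brew']
   print(bounded)        # ['a/b', 'a//b', 'a///b']
   print(same_as_star)   # True
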

Using Regular Expressions
=========================

Now that we've looked at some simple regular expressions, how do we actually use
them in Python? The :mod:`re` module provides an interface to the regular
expression engine, allowing you to compile REs into objects and then perform
matches with them.


Compiling Regular Expressions
-----------------------------

Regular expressions are compiled into :class:`RegexObject` instances, which have
methods for various operations such as searching for pattern matches or
performing string substitutions. ::

   >>> import re
   >>> p = re.compile('ab*')
   >>> p
   <_sre.SRE_Pattern object at 80b4150>

:func:`re.compile` also accepts an optional *flags* argument, used to enable
various special features and syntax variations. We'll go over the available
settings later, but for now a single example will do::

   >>> p = re.compile('ab*', re.IGNORECASE)

The RE is passed to :func:`re.compile` as a string. REs are handled as strings
because regular expressions aren't part of the core Python language, and no
special syntax was created for expressing them. (There are applications that
don't need REs at all, so there's no need to bloat the language specification by
including them.) Instead, the :mod:`re` module is simply a C extension module
included with Python, just like the :mod:`socket` or :mod:`zlib` modules.

Putting REs in strings keeps the Python language simpler, but has one
disadvantage which is the topic of the next section.


The Backslash Plague
--------------------

As stated earlier, regular expressions use the backslash character (``'\'``) to
indicate special forms or to allow special characters to be used without
invoking their special meaning. This conflicts with Python's usage of the same
character for the same purpose in string literals.

Let's say you want to write a RE that matches the string ``\section``, which
might be found in a LaTeX file. To figure out what to write in the program
code, start with the desired string to be matched. Next, you must escape any
backslashes and other metacharacters by preceding them with a backslash,
resulting in the string ``\\section``. This is the string that must be passed
to :func:`re.compile`. However, to express it as a Python string literal, both
backslashes must be escaped *again*.

+-------------------+------------------------------------------+
| Characters        | Stage                                    |
+===================+==========================================+
| ``\section``      | Text string to be matched                |
+-------------------+------------------------------------------+
| ``\\section``     | Escaped backslash for :func:`re.compile` |
+-------------------+------------------------------------------+
| ``"\\\\section"`` | Escaped backslashes for a string literal |
+-------------------+------------------------------------------+

In short, to match a literal backslash, one has to write ``'\\\\'`` as the RE
string, because the regular expression must be ``\\``, and each backslash must
be expressed as ``\\`` inside a regular Python string literal. In REs that
feature backslashes repeatedly, this leads to lots of repeated backslashes and
makes the resulting strings difficult to understand.

The solution is to use Python's raw string notation for regular expressions;
backslashes are not handled in any special way in a string literal prefixed with
``'r'``, so ``r"\n"`` is a two-character string containing ``'\'`` and ``'n'``,
while ``"\n"`` is a one-character string containing a newline. Regular
expressions will often be written in Python code using this raw string notation.

+-------------------+------------------+
| Regular String    | Raw string       |
+===================+==================+
| ``"ab*"``         | ``r"ab*"``       |
+-------------------+------------------+
| ``"\\\\section"`` | ``r"\\section"`` |
+-------------------+------------------+
| ``"\\w+\\s+\\1"`` | ``r"\w+\s+\1"``  |
+-------------------+------------------+

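The equivalence between the two notations can be verified directly. In this
sketch the LaTeX line is a made-up input, and :func:`re.match` is described in
the next section. ::

   import re

   latex_line = r'\section{Introduction}'   # hypothetical LaTeX input

   regular = '\\\\section'   # regular string literal
   raw = r'\\section'        # raw string literal
   same_pattern = (regular == raw)

   matched = re.match(raw, latex_line).group()

   # "\n" is one character (a newline); r"\n" is a backslash followed by 'n'.
   newline_len, raw_len = len('\n'), len(r'\n')

   print(same_pattern)           # True
   print(matched)                # \section
   print(newline_len, raw_len)   # 1 2
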

Performing Matches
------------------

Once you have an object representing a compiled regular expression, what do you
do with it? :class:`RegexObject` instances have several methods and attributes.
Only the most significant ones will be covered here; consult the :mod:`re` docs
for a complete listing.

+------------------+-----------------------------------------------+
| Method/Attribute | Purpose                                       |
+==================+===============================================+
| ``match()``      | Determine if the RE matches at the beginning  |
|                  | of the string.                                |
+------------------+-----------------------------------------------+
| ``search()``     | Scan through a string, looking for any        |
|                  | location where this RE matches.               |
+------------------+-----------------------------------------------+
| ``findall()``    | Find all substrings where the RE matches, and |
|                  | return them as a list.                        |
+------------------+-----------------------------------------------+
| ``finditer()``   | Find all substrings where the RE matches, and |
|                  | return them as an :term:`iterator`.           |
+------------------+-----------------------------------------------+

:meth:`match` and :meth:`search` return ``None`` if no match can be found. If
they're successful, a ``MatchObject`` instance is returned, containing
information about the match: where it starts and ends, the substring it matched,
and more.

You can learn about this by interactively experimenting with the :mod:`re`
module. If you have Tkinter available, you may also want to look at
:file:`Tools/scripts/redemo.py`, a demonstration program included with the
Python distribution. It allows you to enter REs and strings, and displays
whether the RE matches or fails. :file:`redemo.py` can be quite useful when
trying to debug a complicated RE. Phil Schwartz's `Kodos
<http://kodos.sourceforge.net/>`_ is also an interactive tool for developing and
testing RE patterns.

This HOWTO uses the standard Python interpreter for its examples. First, run the
Python interpreter, import the :mod:`re` module, and compile a RE::

   Python 2.2.2 (#1, Feb 10 2003, 12:57:01)
   >>> import re
   >>> p = re.compile('[a-z]+')
   >>> p
   <_sre.SRE_Pattern object at 80c3c28>

Now, you can try matching various strings against the RE ``[a-z]+``. An empty
string shouldn't match at all, since ``+`` means 'one or more repetitions'.
:meth:`match` should return ``None`` in this case, which will cause the
interpreter to print no output. You can explicitly print the result of
:meth:`match` to make this clear. ::

   >>> p.match("")
   >>> print(p.match(""))
   None

Now, let's try it on a string that it should match, such as ``tempo``. In this
case, :meth:`match` will return a :class:`MatchObject`, so you should store the
result in a variable for later use. ::

   >>> m = p.match('tempo')
   >>> m
   <_sre.SRE_Match object at 80c4f68>

Now you can query the :class:`MatchObject` for information about the matching
string. :class:`MatchObject` instances also have several methods and
attributes; the most important ones are:

+------------------+--------------------------------------------+
| Method/Attribute | Purpose                                    |
+==================+============================================+
| ``group()``      | Return the string matched by the RE        |
+------------------+--------------------------------------------+
| ``start()``      | Return the starting position of the match  |
+------------------+--------------------------------------------+
| ``end()``        | Return the ending position of the match    |
+------------------+--------------------------------------------+
| ``span()``       | Return a tuple containing the (start, end) |
|                  | positions of the match                     |
+------------------+--------------------------------------------+

Trying these methods will soon clarify their meaning::

   >>> m.group()
   'tempo'
   >>> m.start(), m.end()
   (0, 5)
   >>> m.span()
   (0, 5)

:meth:`group` returns the substring that was matched by the RE. :meth:`start`
and :meth:`end` return the starting and ending index of the match. :meth:`span`
returns both start and end indexes in a single tuple. Since the :meth:`match`
method only checks if the RE matches at the start of a string, :meth:`start`
will always be zero. However, the :meth:`search` method of :class:`RegexObject`
instances scans through the string, so the match may not start at zero in that
case. ::

   >>> print(p.match('::: message'))
   None
   >>> m = p.search('::: message') ; print(m)
   <re.MatchObject instance at 80c9650>
   >>> m.group()
   'message'
   >>> m.span()
   (4, 11)

In actual programs, the most common style is to store the :class:`MatchObject`
in a variable, and then check if it was ``None``. This usually looks like::

   p = re.compile( ... )
   m = p.match( 'string goes here' )
   if m:
       print('Match found: ', m.group())
   else:
       print('No match')

Two :class:`RegexObject` methods return all of the matches for a pattern.
:meth:`findall` returns a list of matching strings::

   >>> p = re.compile(r'\d+')
   >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
   ['12', '11', '10']

:meth:`findall` has to create the entire list before it can be returned as the
result. The :meth:`finditer` method returns a sequence of :class:`MatchObject`
instances as an :term:`iterator`. [#]_ ::

   >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
   >>> iterator
   <callable-iterator object at 0x401833ac>
   >>> for match in iterator:
   ...     print(match.span())
   ...
   (0, 2)
   (22, 24)
   (29, 31)


Module-Level Functions
----------------------

You don't have to create a :class:`RegexObject` and call its methods; the
:mod:`re` module also provides top-level functions called :func:`match`,
:func:`search`, :func:`findall`, :func:`sub`, and so forth. These functions
take the same arguments as the corresponding :class:`RegexObject` method, with
the RE string added as the first argument, and still return either ``None`` or a
:class:`MatchObject` instance. ::

   >>> print(re.match(r'From\s+', 'Fromage amk'))
   None
   >>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
   <re.MatchObject instance at 80c5978>

Under the hood, these functions simply produce a :class:`RegexObject` for you
and call the appropriate method on it. They also store the compiled object in a
cache, so future calls using the same RE are faster.

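To make the correspondence concrete, the sketch below runs the same match both
ways and compares the results. The name ``from_re`` is chosen here for
illustration only. ::

   import re

   line = 'From amk Thu May 14 19:12:10 1998'

   # Module-level function: the RE string is the first argument.
   direct = re.match(r'From\s+', line)

   # Two-step form: compile first, then call the method on the pattern object.
   from_re = re.compile(r'From\s+')
   compiled = from_re.match(line)

   same_result = (direct.group() == compiled.group()
                  and direct.span() == compiled.span())

   print(repr(direct.group()), direct.span())   # 'From ' (0, 5)
   print(same_result)                           # True
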
Should you use these module-level functions, or should you get the
:class:`RegexObject` and call its methods yourself? That choice depends on how
frequently the RE will be used, and on your personal coding style. If the RE is
being used at only one point in the code, then the module functions are probably
more convenient. If a program contains a lot of regular expressions, or re-uses
the same ones in several locations, then it might be worthwhile to collect all
the definitions in one place, in a section of code that compiles all the REs
ahead of time. To take an example from the standard library, here's an extract
from the now deprecated :file:`xmllib.py`::

   ref = re.compile( ... )
   entityref = re.compile( ... )
   charref = re.compile( ... )
   starttagopen = re.compile( ... )

I generally prefer to work with the compiled object, even for one-time uses, but
few people will be as much of a purist about this as I am.

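Collecting compiled patterns in one place might look something like the
following sketch. The pattern names and the ``summarize`` function are
hypothetical and are not taken from :file:`xmllib.py`. ::

   import re

   # Hypothetical patterns, defined together and compiled once at import time.
   INTEGER_RE = re.compile(r'\d+')
   WORD_RE = re.compile(r'[a-z]+')

   def summarize(text):
       """Return how many integers and lowercase words appear in text."""
       return len(INTEGER_RE.findall(text)), len(WORD_RE.findall(text))

   counts = summarize('12 drummers drumming, 11 pipers piping')
   print(counts)   # (2, 4)
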

Compilation Flags
-----------------

Compilation flags let you modify some aspects of how regular expressions work.
Flags are available in the :mod:`re` module under two names, a long name such as
:const:`IGNORECASE` and a short, one-letter form such as :const:`I`. (If you're
familiar with Perl's pattern modifiers, the one-letter forms use the same
letters; the short form of :const:`re.VERBOSE` is :const:`re.X`, for example.)
Multiple flags can be specified by bitwise OR-ing them; ``re.I | re.M`` sets
both the :const:`I` and :const:`M` flags, for example.

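A brief sketch of both points, using the ``flags`` attribute of the compiled
pattern to inspect what was set (the pattern itself is arbitrary)::

   import re

   # The short and long names refer to the same flag.
   same_flag = (re.I == re.IGNORECASE) and (re.M == re.MULTILINE)

   # OR-ing the flags sets both of them on the compiled pattern.
   p = re.compile('ab*', re.I | re.M)
   has_both = bool(p.flags & re.I) and bool(p.flags & re.M)

   print(same_flag)   # True
   print(has_both)    # True
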
Here's a table of the available flags, followed by a more detailed explanation
of each one.

+---------------------------------+--------------------------------------------+
| Flag                            | Meaning                                    |
+=================================+============================================+
| :const:`DOTALL`, :const:`S`     | Make ``.`` match any character, including  |
|                                 | newlines                                   |
+---------------------------------+--------------------------------------------+
| :const:`IGNORECASE`, :const:`I` | Do case-insensitive matches                |
+---------------------------------+--------------------------------------------+
| :const:`LOCALE`, :const:`L`     | Do a locale-aware match                    |
+---------------------------------+--------------------------------------------+
| :const:`MULTILINE`, :const:`M`  | Multi-line matching, affecting ``^`` and   |
|                                 | ``$``                                      |
+---------------------------------+--------------------------------------------+
| :const:`VERBOSE`, :const:`X`    | Enable verbose REs, which can be organized |
|                                 | more cleanly and understandably.           |
+---------------------------------+--------------------------------------------+
| :const:`ASCII`, :const:`A`      | Makes several escapes like ``\w``, ``\b``, |
|                                 | ``\s`` and ``\d`` match only on ASCII      |
|                                 | characters with the respective property.   |
+---------------------------------+--------------------------------------------+


| 549 | .. data:: I |
| 550 | IGNORECASE |
| 551 | :noindex: |
| 552 | |
Perform case-insensitive matching; character classes and literal strings will
match letters by ignoring case. For example, ``[A-Z]`` will match lowercase
letters, too, and ``Spam`` will match ``Spam``, ``spam``, or ``spAM``. This
lowercasing doesn't take the current locale into account; it will if you also
set the :const:`LOCALE` flag.
| 558 | |
| 559 | |
| 560 | .. data:: L |
| 561 | LOCALE |
| 562 | :noindex: |
| 563 | |
Make ``\w``, ``\W``, ``\b``, and ``\B`` dependent on the current locale.
| 565 | |
| 566 | Locales are a feature of the C library intended to help in writing programs that |
| 567 | take account of language differences. For example, if you're processing French |
| 568 | text, you'd want to be able to write ``\w+`` to match words, but ``\w`` only |
| 569 | matches the character class ``[A-Za-z]``; it won't match ``'é'`` or ``'ç'``. If |
| 570 | your system is configured properly and a French locale is selected, certain C |
| 571 | functions will tell the program that ``'é'`` should also be considered a letter. |
| 572 | Setting the :const:`LOCALE` flag when compiling a regular expression will cause |
| 573 | the resulting compiled object to use these C functions for ``\w``; this is |
| 574 | slower, but also enables ``\w+`` to match French words as you'd expect. |
| 575 | |
| 576 | |
| 577 | .. data:: M |
| 578 | MULTILINE |
| 579 | :noindex: |
| 580 | |
| 581 | (``^`` and ``$`` haven't been explained yet; they'll be introduced in section |
| 582 | :ref:`more-metacharacters`.) |
| 583 | |
Usually ``^`` matches only at the beginning of the string, and ``$`` matches
only at the end of the string and immediately before the newline (if any) at the
end of the string. When this flag is specified, ``^`` matches at the beginning
of the string and at the beginning of each line within the string, immediately
following each newline. Similarly, the ``$`` metacharacter matches both at
the end of the string and at the end of each line (immediately preceding each
newline).
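
The difference is easiest to see side by side. This is a minimal sketch with a
made-up three-line string; only the flag changes between the first two calls.

```python
import re

text = "first line\nsecond line\nthird line"

# Default: ^ anchors only at the very start of the string.
default_starts = re.findall(r'^\w+', text)
print(default_starts)    # ['first']

# MULTILINE: ^ also anchors immediately after each newline.
multiline_starts = re.findall(r'^\w+', text, re.MULTILINE)
print(multiline_starts)  # ['first', 'second', 'third']

# MULTILINE: $ also anchors immediately before each newline.
multiline_ends = re.findall(r'\w+$', text, re.MULTILINE)
print(multiline_ends)    # ['line', 'line', 'line']
```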
| 591 | |
| 592 | |
| 593 | .. data:: S |
| 594 | DOTALL |
| 595 | :noindex: |
| 596 | |
| 597 | Makes the ``'.'`` special character match any character at all, including a |
| 598 | newline; without this flag, ``'.'`` will match anything *except* a newline. |
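
A short sketch of the effect, using an invented two-line string: the same
``.*`` pattern stops at the newline by default and runs through it when
:const:`DOTALL` is set.

```python
import re

text = "line one\nline two"

# Without DOTALL, '.' refuses to cross the newline.
without_flag = re.match(r'.*', text).group()
print(repr(without_flag))  # 'line one'

# With DOTALL, '.' matches the newline as well.
with_flag = re.match(r'.*', text, re.DOTALL).group()
print(repr(with_flag))     # 'line one\nline two'
```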
| 599 | |
| 600 | |
.. data:: A
          ASCII
   :noindex:

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching. This is only
   meaningful for Unicode patterns, and is ignored for byte patterns.
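
To illustrate, the sketch below runs the same ``\w+`` pattern over a string
pattern with and without :const:`re.ASCII`. The sample text is an assumption
chosen to contain one non-ASCII letter.

```python
import re

text = 'caf\u00e9 42'   # 'café 42'

# Default for str patterns: \w uses Unicode character properties.
unicode_words = re.findall(r'\w+', text)
print(unicode_words)  # ['café', '42']

# With ASCII, \w is restricted to [a-zA-Z0-9_], so the 'é' ends the word.
ascii_words = re.findall(r'\w+', text, re.ASCII)
print(ascii_words)    # ['caf', '42']
```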
| 608 | |
| 609 | |
.. data:: X
          VERBOSE
   :noindex:
| 613 | |
This flag allows you to write regular expressions that are more readable by
granting you more flexibility in how you can format them. When this flag has
been specified, whitespace within the RE string is ignored, except when the
whitespace is in a character class or preceded by an unescaped backslash; this
lets you organize and indent the RE more clearly. This flag also lets you put
comments within a RE that will be ignored by the engine; comments are marked by
a ``'#'`` that's neither in a character class nor preceded by an unescaped
backslash.
| 622 | |
| 623 | For example, here's a RE that uses :const:`re.VERBOSE`; see how much easier it |
| 624 | is to read? :: |
| 625 | |
   charref = re.compile(r"""
    &[#]                # Start of a numeric entity reference
    (
        0[0-7]+         # Octal form
      | [0-9]+          # Decimal form
      | x[0-9a-fA-F]+   # Hexadecimal form
    )
    ;                   # Trailing semicolon
   """, re.VERBOSE)
| 635 | |
| 636 | Without the verbose setting, the RE would look like this:: |
| 637 | |
| 638 | charref = re.compile("&#(0[0-7]+" |
| 639 | "|[0-9]+" |
| 640 | "|x[0-9a-fA-F]+);") |
| 641 | |
| 642 | In the above example, Python's automatic concatenation of string literals has |
| 643 | been used to break up the RE into smaller pieces, but it's still more difficult |
| 644 | to understand than the version using :const:`re.VERBOSE`. |
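
As a quick check that the two spellings are equivalent, this sketch compiles
both forms and runs them over the same text. The sample string and the names
``compact``, ``verbose_hits`` and ``compact_hits`` are invented for the
illustration.

```python
import re

charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

# The same pattern written without VERBOSE.
compact = re.compile("&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+);")

text = 'A: &#65; &#x41; &#0101; &bogus;'

# findall() returns the contents of the single capturing group.
verbose_hits = charref.findall(text)
compact_hits = compact.findall(text)
print(verbose_hits)  # ['65', 'x41', '0101']
```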
| 645 | |
| 646 | |
| 647 | More Pattern Power |
| 648 | ================== |
| 649 | |
| 650 | So far we've only covered a part of the features of regular expressions. In |
| 651 | this section, we'll cover some new metacharacters, and how to use groups to |
| 652 | retrieve portions of the text that was matched. |
| 653 | |
| 654 | |
| 655 | .. _more-metacharacters: |
| 656 | |
| 657 | More Metacharacters |
| 658 | ------------------- |
| 659 | |
| 660 | There are some metacharacters that we haven't covered yet. Most of them will be |
| 661 | covered in this section. |
| 662 | |
| 663 | Some of the remaining metacharacters to be discussed are :dfn:`zero-width |
| 664 | assertions`. They don't cause the engine to advance through the string; |
| 665 | instead, they consume no characters at all, and simply succeed or fail. For |
| 666 | example, ``\b`` is an assertion that the current position is located at a word |
| 667 | boundary; the position isn't changed by the ``\b`` at all. This means that |
| 668 | zero-width assertions should never be repeated, because if they match once at a |
| 669 | given location, they can obviously be matched an infinite number of times. |
| 670 | |
| 671 | ``|`` |
| 672 | Alternation, or the "or" operator. If A and B are regular expressions, |
| 673 | ``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very |
| 674 | low precedence in order to make it work reasonably when you're alternating |
| 675 | multi-character strings. ``Crow|Servo`` will match either ``Crow`` or ``Servo``, |
| 676 | not ``Cro``, a ``'w'`` or an ``'S'``, and ``ervo``. |
| 677 | |
| 678 | To match a literal ``'|'``, use ``\|``, or enclose it inside a character class, |
| 679 | as in ``[|]``. |
| 680 | |
| 681 | ``^`` |
| 682 | Matches at the beginning of lines. Unless the :const:`MULTILINE` flag has been |
| 683 | set, this will only match at the beginning of the string. In :const:`MULTILINE` |
| 684 | mode, this also matches immediately after each newline within the string. |
| 685 | |
| 686 | For example, if you wish to match the word ``From`` only at the beginning of a |
| 687 | line, the RE to use is ``^From``. :: |
| 688 | |
   >>> print(re.search('^From', 'From Here to Eternity'))
   <re.Match object; span=(0, 4), match='From'>
   >>> print(re.search('^From', 'Reciting From Memory'))
   None
| 693 | |
To match a literal ``'^'``, use ``\^`` or enclose it inside a character class,
as in ``[\^]``.

| 697 | ``$`` |
| 698 | Matches at the end of a line, which is defined as either the end of the string, |
| 699 | or any location followed by a newline character. :: |
| 700 | |
   >>> print(re.search('}$', '{block}'))
   <re.Match object; span=(6, 7), match='}'>
   >>> print(re.search('}$', '{block} '))
   None
   >>> print(re.search('}$', '{block}\n'))
   <re.Match object; span=(6, 7), match='}'>
| 707 | |
| 708 | To match a literal ``'$'``, use ``\$`` or enclose it inside a character class, |
| 709 | as in ``[$]``. |
| 710 | |
``\A``
| 712 | Matches only at the start of the string. When not in :const:`MULTILINE` mode, |
| 713 | ``\A`` and ``^`` are effectively the same. In :const:`MULTILINE` mode, they're |
| 714 | different: ``\A`` still matches only at the beginning of the string, but ``^`` |
| 715 | may match at any location inside the string that follows a newline character. |
| 716 | |
| 717 | ``\Z`` |
| 718 | Matches only at the end of the string. |
| 719 | |
| 720 | ``\b`` |
| 721 | Word boundary. This is a zero-width assertion that matches only at the |
| 722 | beginning or end of a word. A word is defined as a sequence of alphanumeric |
| 723 | characters, so the end of a word is indicated by whitespace or a |
| 724 | non-alphanumeric character. |
| 725 | |
| 726 | The following example matches ``class`` only when it's a complete word; it won't |
| 727 | match when it's contained inside another word. :: |
| 728 | |
   >>> p = re.compile(r'\bclass\b')
   >>> print(p.search('no class at all'))
   <re.Match object; span=(3, 8), match='class'>
   >>> print(p.search('the declassified algorithm'))
   None
   >>> print(p.search('one subclass is'))
   None
| 736 | |
| 737 | There are two subtleties you should remember when using this special sequence. |
| 738 | First, this is the worst collision between Python's string literals and regular |
| 739 | expression sequences. In Python's string literals, ``\b`` is the backspace |
| 740 | character, ASCII value 8. If you're not using raw strings, then Python will |
| 741 | convert the ``\b`` to a backspace, and your RE won't match as you expect it to. |
| 742 | The following example looks the same as our previous RE, but omits the ``'r'`` |
| 743 | in front of the RE string. :: |
| 744 | |
   >>> p = re.compile('\bclass\b')
   >>> print(p.search('no class at all'))
   None
   >>> print(p.search('\b' + 'class' + '\b'))
   <re.Match object; span=(0, 7), match='\x08class\x08'>
| 750 | |
| 751 | Second, inside a character class, where there's no use for this assertion, |
| 752 | ``\b`` represents the backspace character, for compatibility with Python's |
| 753 | string literals. |
| 754 | |
| 755 | ``\B`` |
| 756 | Another zero-width assertion, this is the opposite of ``\b``, only matching when |
| 757 | the current position is not at a word boundary. |
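
The anchors and boundary assertions described above can be compared directly.
The following is a small sketch using invented sample strings; it contrasts
``^`` with ``\A``, ``$`` with ``\Z``, and shows ``\B`` as the inverse of ``\b``.

```python
import re

text = "From: a\nFrom: b\n"

# In MULTILINE mode ^ matches after every newline; \A still matches only once.
caret_hits = re.findall(r'^From', text, re.MULTILINE)
start_hits = re.findall(r'\AFrom', text, re.MULTILINE)
print(caret_hits)  # ['From', 'From']
print(start_hits)  # ['From']

# $ also matches just before a trailing newline; \Z only at the true end.
dollar_match = re.search(r'b$', text)
end_match = re.search(r'b\Z', text)
print(dollar_match is not None)  # True
print(end_match is None)         # True

# \B requires that the position is *not* a word boundary, so it only
# finds "class" when it is embedded inside a longer word.
inner = re.compile(r'\Bclass\B')
inner_hits = [s for s in ('no class at all', 'the declassified algorithm')
              if inner.search(s)]
print(inner_hits)  # ['the declassified algorithm']
```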
| 758 | |
| 759 | |
| 760 | Grouping |
| 761 | -------- |
| 762 | |
| 763 | Frequently you need to obtain more information than just whether the RE matched |
| 764 | or not. Regular expressions are often used to dissect strings by writing a RE |
| 765 | divided into several subgroups which match different components of interest. |
| 766 | For example, an RFC-822 header line is divided into a header name and a value, |
| 767 | separated by a ``':'``, like this:: |
| 768 | |
| 769 | From: author@example.com |
| 770 | User-Agent: Thunderbird 1.5.0.9 (X11/20061227) |
| 771 | MIME-Version: 1.0 |
| 772 | To: editor@example.com |
| 773 | |
| 774 | This can be handled by writing a regular expression which matches an entire |
| 775 | header line, and has one group which matches the header name, and another group |
| 776 | which matches the header's value. |
| 777 | |
Groups are marked by the ``'('`` and ``')'`` metacharacters. ``'('`` and ``')'``
have much the same meaning as they do in mathematical expressions; they group
together the expressions contained inside them, and you can repeat the contents
of a group with a repeating quantifier, such as ``*``, ``+``, ``?``, or
``{m,n}``. For example, ``(ab)*`` will match zero or more repetitions of
``ab``. ::
| 784 | |
   >>> p = re.compile('(ab)*')
   >>> print(p.match('ababababab').span())
   (0, 10)
| 788 | |
| 789 | Groups indicated with ``'('``, ``')'`` also capture the starting and ending |
| 790 | index of the text that they match; this can be retrieved by passing an argument |
| 791 | to :meth:`group`, :meth:`start`, :meth:`end`, and :meth:`span`. Groups are |
| 792 | numbered starting with 0. Group 0 is always present; it's the whole RE, so |
| 793 | :class:`MatchObject` methods all have group 0 as their default argument. Later |
| 794 | we'll see how to express groups that don't capture the span of text that they |
| 795 | match. :: |
| 796 | |
| 797 | >>> p = re.compile('(a)b') |
| 798 | >>> m = p.match('ab') |
| 799 | >>> m.group() |
| 800 | 'ab' |
| 801 | >>> m.group(0) |
| 802 | 'ab' |
| 803 | |
| 804 | Subgroups are numbered from left to right, from 1 upward. Groups can be nested; |
| 805 | to determine the number, just count the opening parenthesis characters, going |
| 806 | from left to right. :: |
| 807 | |
| 808 | >>> p = re.compile('(a(b)c)d') |
| 809 | >>> m = p.match('abcd') |
| 810 | >>> m.group(0) |
| 811 | 'abcd' |
| 812 | >>> m.group(1) |
| 813 | 'abc' |
| 814 | >>> m.group(2) |
| 815 | 'b' |
| 816 | |
| 817 | :meth:`group` can be passed multiple group numbers at a time, in which case it |
| 818 | will return a tuple containing the corresponding values for those groups. :: |
| 819 | |
| 820 | >>> m.group(2,1,2) |
| 821 | ('b', 'abc', 'b') |
| 822 | |
| 823 | The :meth:`groups` method returns a tuple containing the strings for all the |
| 824 | subgroups, from 1 up to however many there are. :: |
| 825 | |
| 826 | >>> m.groups() |
| 827 | ('abc', 'b') |
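
Returning to the RFC-822 header lines from the start of this section, a
two-group pattern might look like the sketch below. The pattern itself is an
assumption made for illustration (it is deliberately simple and not a complete
RFC-822 parser); group 1 holds the header name and group 2 its value.

```python
import re

# Hypothetical pattern: group 1 is the header name, group 2 is its value.
header = re.compile(r'([A-Za-z-]+):\s*(.*)')

m = header.match('User-Agent: Thunderbird 1.5.0.9 (X11/20061227)')

name = m.group(1)
value = m.group(2)
print(name)   # User-Agent
print(value)  # Thunderbird 1.5.0.9 (X11/20061227)

# groups() returns both subgroups at once; span(1) locates the name.
pair = m.groups()
name_span = m.span(1)
print(name_span)  # (0, 10)
```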
| 828 | |
| 829 | Backreferences in a pattern allow you to specify that the contents of an earlier |
| 830 | capturing group must also be found at the current location in the string. For |
| 831 | example, ``\1`` will succeed if the exact contents of group 1 can be found at |
| 832 | the current position, and fails otherwise. Remember that Python's string |
| 833 | literals also use a backslash followed by numbers to allow including arbitrary |
| 834 | characters in a string, so be sure to use a raw string when incorporating |
| 835 | backreferences in a RE. |
| 836 | |
| 837 | For example, the following RE detects doubled words in a string. :: |
| 838 | |
| 839 | >>> p = re.compile(r'(\b\w+)\s+\1') |
| 840 | >>> p.search('Paris in the the spring').group() |
| 841 | 'the the' |
| 842 | |
| 843 | Backreferences like this aren't often useful for just searching through a string |
| 844 | --- there are few text formats which repeat data in this way --- but you'll soon |
| 845 | find out that they're *very* useful when performing string substitutions. |
| 846 | |
| 847 | |
| 848 | Non-capturing and Named Groups |
| 849 | ------------------------------ |
| 850 | |
| 851 | Elaborate REs may use many groups, both to capture substrings of interest, and |
| 852 | to group and structure the RE itself. In complex REs, it becomes difficult to |
| 853 | keep track of the group numbers. There are two features which help with this |
| 854 | problem. Both of them use a common syntax for regular expression extensions, so |
| 855 | we'll look at that first. |
| 856 | |
| 857 | Perl 5 added several additional features to standard regular expressions, and |
| 858 | the Python :mod:`re` module supports most of them. It would have been |
| 859 | difficult to choose new single-keystroke metacharacters or new special sequences |
| 860 | beginning with ``\`` to represent the new features without making Perl's regular |
| 861 | expressions confusingly different from standard REs. If you chose ``&`` as a |
| 862 | new metacharacter, for example, old expressions would be assuming that ``&`` was |
| 863 | a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``. |
| 864 | |
| 865 | The solution chosen by the Perl developers was to use ``(?...)`` as the |
| 866 | extension syntax. ``?`` immediately after a parenthesis was a syntax error |
| 867 | because the ``?`` would have nothing to repeat, so this didn't introduce any |
| 868 | compatibility problems. The characters immediately after the ``?`` indicate |
| 869 | what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead |
| 870 | assertion) and ``(?:foo)`` is something else (a non-capturing group containing |
| 871 | the subexpression ``foo``). |
| 872 | |
| 873 | Python adds an extension syntax to Perl's extension syntax. If the first |
| 874 | character after the question mark is a ``P``, you know that it's an extension |
| 875 | that's specific to Python. Currently there are two such extensions: |
| 876 | ``(?P<name>...)`` defines a named group, and ``(?P=name)`` is a backreference to |
| 877 | a named group. If future versions of Perl 5 add similar features using a |
| 878 | different syntax, the :mod:`re` module will be changed to support the new |
| 879 | syntax, while preserving the Python-specific syntax for compatibility's sake. |
| 880 | |
| 881 | Now that we've looked at the general extension syntax, we can return to the |
| 882 | features that simplify working with groups in complex REs. Since groups are |
| 883 | numbered from left to right and a complex expression may use many groups, it can |
| 884 | become difficult to keep track of the correct numbering. Modifying such a |
| 885 | complex RE is annoying, too: insert a new group near the beginning and you |
| 886 | change the numbers of everything that follows it. |
| 887 | |
| 888 | Sometimes you'll want to use a group to collect a part of a regular expression, |
| 889 | but aren't interested in retrieving the group's contents. You can make this fact |
| 890 | explicit by using a non-capturing group: ``(?:...)``, where you can replace the |
| 891 | ``...`` with any other regular expression. :: |
| 892 | |
| 893 | >>> m = re.match("([abc])+", "abc") |
| 894 | >>> m.groups() |
| 895 | ('c',) |
| 896 | >>> m = re.match("(?:[abc])+", "abc") |
| 897 | >>> m.groups() |
| 898 | () |
| 899 | |
| 900 | Except for the fact that you can't retrieve the contents of what the group |
| 901 | matched, a non-capturing group behaves exactly the same as a capturing group; |
| 902 | you can put anything inside it, repeat it with a repetition metacharacter such |
| 903 | as ``*``, and nest it within other groups (capturing or non-capturing). |
| 904 | ``(?:...)`` is particularly useful when modifying an existing pattern, since you |
| 905 | can add new groups without changing how all the other groups are numbered. It |
| 906 | should be mentioned that there's no performance difference in searching between |
| 907 | capturing and non-capturing groups; neither form is any faster than the other. |
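
The numbering benefit can be seen in a small sketch. The date-like input and
the variable names here are invented; the only difference between the two
patterns is whether the separator group captures.

```python
import re

# The separator is captured, so the second number ends up in group 3.
capturing = re.match(r'(\d+)(-|/)(\d+)', '2009/03')
print(capturing.groups())      # ('2009', '/', '03')

# With (?:...) the separator is still grouped for the alternation,
# but it no longer takes a group number.
non_capturing = re.match(r'(\d+)(?:-|/)(\d+)', '2009/03')
print(non_capturing.groups())  # ('2009', '03')
```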
| 908 | |
| 909 | A more significant feature is named groups: instead of referring to them by |
| 910 | numbers, groups can be referenced by a name. |
| 911 | |
| 912 | The syntax for a named group is one of the Python-specific extensions: |
| 913 | ``(?P<name>...)``. *name* is, obviously, the name of the group. Named groups |
| 914 | also behave exactly like capturing groups, and additionally associate a name |
| 915 | with a group. The :class:`MatchObject` methods that deal with capturing groups |
| 916 | all accept either integers that refer to the group by number or strings that |
| 917 | contain the desired group's name. Named groups are still given numbers, so you |
| 918 | can retrieve information about a group in two ways:: |
| 919 | |
   >>> p = re.compile(r'(?P<word>\b\w+\b)')
   >>> m = p.search('(((( Lots of punctuation )))')
   >>> m.group('word')
   'Lots'
   >>> m.group(1)
   'Lots'
| 926 | |
| 927 | Named groups are handy because they let you use easily-remembered names, instead |
| 928 | of having to remember numbers. Here's an example RE from the :mod:`imaplib` |
| 929 | module:: |
| 930 | |
   InternalDate = re.compile(r'INTERNALDATE "'
           r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
           r'(?P<year>[0-9][0-9][0-9][0-9])'
           r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
           r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
           r'"')
| 937 | |
| 938 | It's obviously much easier to retrieve ``m.group('zonem')``, instead of having |
| 939 | to remember to retrieve group 9. |
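
A cut-down sketch of the same idea follows. The pattern ``stamp`` borrows only
the time portion of the expression above, and the input string is invented.
Match objects also provide a ``groupdict()`` method that returns all named
groups as a dictionary.

```python
import re

# Only the hour/minute/second groups, for brevity.
stamp = re.compile(
    r'(?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])')

m = stamp.search('17-Jul-1996 02:44:25 -0700')

# The same group is reachable by name or by number.
by_name = m.group('min')
by_number = m.group(2)
print(by_name, by_number)  # 44 44

# groupdict() maps every group name to the text it matched.
fields = m.groupdict()
print(fields)  # {'hour': '02', 'min': '44', 'sec': '25'}
```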
| 940 | |
| 941 | The syntax for backreferences in an expression such as ``(...)\1`` refers to the |
| 942 | number of the group. There's naturally a variant that uses the group name |
| 943 | instead of the number. This is another Python extension: ``(?P=name)`` indicates |
| 944 | that the contents of the group called *name* should again be matched at the |
| 945 | current point. The regular expression for finding doubled words, |
| 946 | ``(\b\w+)\s+\1`` can also be written as ``(?P<word>\b\w+)\s+(?P=word)``:: |
| 947 | |
| 948 | >>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)') |
| 949 | >>> p.search('Paris in the the spring').group() |
| 950 | 'the the' |
| 951 | |
| 952 | |
| 953 | Lookahead Assertions |
| 954 | -------------------- |
| 955 | |
| 956 | Another zero-width assertion is the lookahead assertion. Lookahead assertions |
| 957 | are available in both positive and negative form, and look like this: |
| 958 | |
| 959 | ``(?=...)`` |
| 960 | Positive lookahead assertion. This succeeds if the contained regular |
| 961 | expression, represented here by ``...``, successfully matches at the current |
| 962 | location, and fails otherwise. But, once the contained expression has been |
| 963 | tried, the matching engine doesn't advance at all; the rest of the pattern is |
| 964 | tried right where the assertion started. |
| 965 | |
| 966 | ``(?!...)`` |
| 967 | Negative lookahead assertion. This is the opposite of the positive assertion; |
| 968 | it succeeds if the contained expression *doesn't* match at the current position |
| 969 | in the string. |
| 970 | |
| 971 | To make this concrete, let's look at a case where a lookahead is useful. |
| 972 | Consider a simple pattern to match a filename and split it apart into a base |
| 973 | name and an extension, separated by a ``.``. For example, in ``news.rc``, |
| 974 | ``news`` is the base name, and ``rc`` is the filename's extension. |
| 975 | |
| 976 | The pattern to match this is quite simple: |
| 977 | |
| 978 | ``.*[.].*$`` |
| 979 | |
| 980 | Notice that the ``.`` needs to be treated specially because it's a |
| 981 | metacharacter; I've put it inside a character class. Also notice the trailing |
| 982 | ``$``; this is added to ensure that all the rest of the string must be included |
| 983 | in the extension. This regular expression matches ``foo.bar`` and |
| 984 | ``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``. |
| 985 | |
| 986 | Now, consider complicating the problem a bit; what if you want to match |
| 987 | filenames where the extension is not ``bat``? Some incorrect attempts: |
| 988 | |
``.*[.][^b].*$``

The first attempt above tries to exclude ``bat`` by requiring
that the first character of the extension is not a ``b``. This is wrong,
because the pattern also doesn't match ``foo.bar``.

``.*[.]([^b]..|.[^a].|..[^t])$``

The expression gets messier when you try to patch up the first solution by
requiring one of the following cases to match: the first character of the
extension isn't ``b``; the second character isn't ``a``; or the third character
isn't ``t``. This accepts ``foo.bar`` and rejects ``autoexec.bat``, but it
requires a three-letter extension and won't accept a filename with a two-letter
extension such as ``sendmail.cf``. We'll complicate the pattern again in an
effort to fix it.
| 1002 | |
| 1003 | ``.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$`` |
| 1004 | |
In the third attempt, the second and third letters are both made optional in
order to allow matching extensions shorter than three characters, such as
``sendmail.cf``.
| 1008 | |
| 1009 | The pattern's getting really complicated now, which makes it hard to read and |
| 1010 | understand. Worse, if the problem changes and you want to exclude both ``bat`` |
| 1011 | and ``exe`` as extensions, the pattern would get even more complicated and |
| 1012 | confusing. |
| 1013 | |
| 1014 | A negative lookahead cuts through all this confusion: |
| 1015 | |
``.*[.](?!bat$).*$``

The negative lookahead means: if the expression ``bat$`` doesn't match at this
point, try the rest of the pattern; if ``bat$`` does match, the whole pattern
will fail. The trailing ``$`` is required to ensure that something like
``sample.batch``, where the extension only starts with ``bat``, will be allowed.

Excluding another filename extension is now easy; simply add it as an
alternative inside the assertion. The following pattern excludes filenames that
end in either ``bat`` or ``exe``:
| 1025 | |
| 1026 | ``.*[.](?!bat$|exe$).*$`` |
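
To check the two lookahead patterns against the filenames used in this section,
one might run a sketch like the following. The list of names (including
``setup.exe``) and the variable names are assumptions made for the example.

```python
import re

not_bat = re.compile(r'.*[.](?!bat$).*$')
not_bat_or_exe = re.compile(r'.*[.](?!bat$|exe$).*$')

names = ['foo.bar', 'autoexec.bat', 'sendmail.cf',
         'sample.batch', 'setup.exe']

# Keep only the names each pattern accepts.
accepted_1 = [n for n in names if not_bat.match(n)]
accepted_2 = [n for n in names if not_bat_or_exe.match(n)]

print(accepted_1)  # ['foo.bar', 'sendmail.cf', 'sample.batch', 'setup.exe']
print(accepted_2)  # ['foo.bar', 'sendmail.cf', 'sample.batch']
```

Note that ``sample.batch`` is accepted by both patterns, because the ``$``
inside the assertion requires ``bat`` to be the entire extension.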
| 1027 | |
| 1029 | Modifying Strings |
| 1030 | ================= |
| 1031 | |
| 1032 | Up to this point, we've simply performed searches against a static string. |
| 1033 | Regular expressions are also commonly used to modify strings in various ways, |
| 1034 | using the following :class:`RegexObject` methods: |
| 1035 | |
| 1036 | +------------------+-----------------------------------------------+ |
| 1037 | | Method/Attribute | Purpose | |
| 1038 | +==================+===============================================+ |
| 1039 | | ``split()`` | Split the string into a list, splitting it | |
| 1040 | | | wherever the RE matches | |
| 1041 | +------------------+-----------------------------------------------+ |
| 1042 | | ``sub()`` | Find all substrings where the RE matches, and | |
| 1043 | | | replace them with a different string | |
| 1044 | +------------------+-----------------------------------------------+ |
| 1045 | | ``subn()`` | Does the same thing as :meth:`sub`, but | |
| 1046 | | | returns the new string and the number of | |
| 1047 | | | replacements | |
| 1048 | +------------------+-----------------------------------------------+ |
| 1049 | |
| 1050 | |
| 1051 | Splitting Strings |
| 1052 | ----------------- |
| 1053 | |
The :meth:`split` method of a :class:`RegexObject` splits a string apart
wherever the RE matches, returning a list of the pieces. It's similar to the
:meth:`split` method of strings but provides much more generality in the
delimiters that you can split by; the string method only supports splitting by
whitespace or by a fixed string. As you'd expect, there's a module-level
:func:`re.split` function, too.
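
The contrast with the string method can be sketched as follows; the input line
and its mix of delimiters are invented for the illustration.

```python
import re

line = 'alpha, beta;gamma   delta'

# str.split() can only split on one fixed separator (or on whitespace).
fixed = line.split(', ')
print(fixed)     # ['alpha', 'beta;gamma   delta']

# re.split() accepts a pattern, so several delimiters are handled at once.
flexible = re.split(r'[,;\s]+', line)
print(flexible)  # ['alpha', 'beta', 'gamma', 'delta']
```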
| 1060 | |
| 1061 | |
| 1062 | .. method:: .split(string [, maxsplit=0]) |
| 1063 | :noindex: |
| 1064 | |
| 1065 | Split *string* by the matches of the regular expression. If capturing |
| 1066 | parentheses are used in the RE, then their contents will also be returned as |
| 1067 | part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit* splits |
| 1068 | are performed. |
| 1069 | |
| 1070 | You can limit the number of splits made, by passing a value for *maxsplit*. |
| 1071 | When *maxsplit* is nonzero, at most *maxsplit* splits will be made, and the |
| 1072 | remainder of the string is returned as the final element of the list. In the |
| 1073 | following example, the delimiter is any sequence of non-alphanumeric characters. |
| 1074 | :: |
| 1075 | |
| 1076 | >>> p = re.compile(r'\W+') |
| 1077 | >>> p.split('This is a test, short and sweet, of split().') |
| 1078 | ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', ''] |
| 1079 | >>> p.split('This is a test, short and sweet, of split().', 3) |
| 1080 | ['This', 'is', 'a', 'test, short and sweet, of split().'] |
| 1081 | |
| 1082 | Sometimes you're not only interested in what the text between delimiters is, but |
| 1083 | also need to know what the delimiter was. If capturing parentheses are used in |
| 1084 | the RE, then their values are also returned as part of the list. Compare the |
| 1085 | following calls:: |
| 1086 | |
| 1087 | >>> p = re.compile(r'\W+') |
| 1088 | >>> p2 = re.compile(r'(\W+)') |
| 1089 | >>> p.split('This... is a test.') |
| 1090 | ['This', 'is', 'a', 'test', ''] |
| 1091 | >>> p2.split('This... is a test.') |
| 1092 | ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', ''] |
| 1093 | |
| 1094 | The module-level function :func:`re.split` adds the RE to be used as the first |
| 1095 | argument, but is otherwise the same. :: |
| 1096 | |
   >>> re.split(r'[\W]+', 'Words, words, words.')
   ['Words', 'words', 'words', '']
   >>> re.split(r'([\W]+)', 'Words, words, words.')
   ['Words', ', ', 'words', ', ', 'words', '.', '']
   >>> re.split(r'[\W]+', 'Words, words, words.', 1)
   ['Words', 'words, words.']
| 1103 | |
| 1104 | |
| 1105 | Search and Replace |
| 1106 | ------------------ |
| 1107 | |
| 1108 | Another common task is to find all the matches for a pattern, and replace them |
| 1109 | with a different string. The :meth:`sub` method takes a replacement value, |
| 1110 | which can be either a string or a function, and the string to be processed. |
| 1111 | |
| 1112 | |
| 1113 | .. method:: .sub(replacement, string[, count=0]) |
| 1114 | :noindex: |
| 1115 | |
| 1116 | Returns the string obtained by replacing the leftmost non-overlapping |
| 1117 | occurrences of the RE in *string* by the replacement *replacement*. If the |
| 1118 | pattern isn't found, *string* is returned unchanged. |
| 1119 | |
| 1120 | The optional argument *count* is the maximum number of pattern occurrences to be |
| 1121 | replaced; *count* must be a non-negative integer. The default value of 0 means |
| 1122 | to replace all occurrences. |
| 1123 | |
Here's a simple example of using the :meth:`sub` method. It replaces colour
names with the word ``colour``::

   >>> p = re.compile('(blue|white|red)')
   >>> p.sub('colour', 'blue socks and red shoes')
   'colour socks and colour shoes'
   >>> p.sub('colour', 'blue socks and red shoes', count=1)
   'colour socks and red shoes'

The :meth:`subn` method does the same work, but returns a 2-tuple containing the
new string value and the number of replacements that were performed::

   >>> p = re.compile('(blue|white|red)')
   >>> p.subn('colour', 'blue socks and red shoes')
   ('colour socks and colour shoes', 2)
   >>> p.subn('colour', 'no colours at all')
   ('no colours at all', 0)

Empty matches are replaced only when they're not adjacent to a previous empty
match. (This describes Python 3.7 and later; earlier versions also skipped an
empty match adjacent to any previous match, so the example below returned
``'-a-b-d-'``.) ::

   >>> p = re.compile('x*')
   >>> p.sub('-', 'abxd')
   '-a-b--d-'

If *replacement* is a string, any backslash escapes in it are processed. That
is, ``\n`` is converted to a single newline character, ``\r`` is converted to a
carriage return, and so forth. Unknown escapes such as ``\&`` are left alone.
Backreferences, such as ``\6``, are replaced with the substring matched by the
corresponding group in the RE. This lets you incorporate portions of the
original text in the resulting replacement string.

This example matches the word ``section`` followed by a string enclosed in
``{``, ``}``, and changes ``section`` to ``subsection``::

   >>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
   >>> p.sub(r'subsection{\1}','section{First} section{second}')
   'subsection{First} subsection{second}'

There's also a syntax for referring to named groups as defined by the
``(?P<name>...)`` syntax. ``\g<name>`` will use the substring matched by the
group named ``name``, and ``\g<number>`` uses the corresponding group number.
``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous in a
replacement string such as ``\g<2>0``. (``\20`` would be interpreted as a
reference to group 20, not a reference to group 2 followed by the literal
character ``'0'``.) The following substitutions are all equivalent, but use all
three variations of the replacement string. ::

   >>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
   >>> p.sub(r'subsection{\1}','section{First}')
   'subsection{First}'
   >>> p.sub(r'subsection{\g<1>}','section{First}')
   'subsection{First}'
   >>> p.sub(r'subsection{\g<name>}','section{First}')
   'subsection{First}'

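The ambiguity between ``\g<2>0`` and ``\20`` described above can be seen in a
short sketch. The pattern and the input string ``'item-7'`` are made up for
this illustration; the pattern has only two groups, so a reference to group 20
cannot be satisfied.

```python
import re

# A hypothetical pattern with exactly two groups.
p = re.compile(r'(\w+)-(\d)')

# \g<2>0 means "the text of group 2, then a literal 0".
result = p.sub(r'\g<2>0', 'item-7')

# \20 is read as a reference to group 20, which this pattern does not have,
# so the substitution raises re.error instead of producing a string.
try:
    p.sub(r'\20', 'item-7')
    error = None
except re.error as exc:
    error = exc
```
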
*replacement* can also be a function, which gives you even more control. If
*replacement* is a function, the function is called for every non-overlapping
occurrence of *pattern*. On each call, the function is passed a
:class:`MatchObject` argument for the match and can use this information to
compute the desired replacement string and return it.

In the following example, the replacement function translates decimals into
hexadecimal::

   >>> def hexrepl(match):
   ...     "Return the hex string for a decimal number"
   ...     value = int(match.group())
   ...     return hex(value)
   ...
   >>> p = re.compile(r'\d+')
   >>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
   'Call 0xffd2 for printing, 0xc000 for user code.'

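As a further illustration of a replacement function, the sketch below looks
each match up in a dictionary and leaves unknown words untouched. The
``ABBREVIATIONS`` table and the ``expand`` function are hypothetical names
invented for this example.

```python
import re

# Hypothetical lookup table, for illustration only.
ABBREVIATIONS = {'Mon': 'Monday', 'Tue': 'Tuesday'}

def expand(match):
    "Return the long form of an abbreviation, or the word itself if unknown."
    word = match.group()
    return ABBREVIATIONS.get(word, word)

# Match capitalized three-letter words.
p = re.compile(r'\b[A-Z][a-z]{2}\b')
result = p.sub(expand, 'Mon and Tue, not Wed.')
```
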
When using the module-level :func:`re.sub` function, the pattern is passed as
the first argument. The pattern may be a string or a :class:`RegexObject`; if
you need to specify regular expression flags, you must either use a
:class:`RegexObject` as the first parameter, or use embedded modifiers in the
pattern, e.g. ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.

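The two options described above can be compared side by side. This is a small
sketch rather than a recommendation of one form over the other; the variable
names are chosen only for illustration.

```python
import re

text = 'bbbb BBBB'

# Option 1: an embedded modifier at the start of the pattern string.
embedded = re.sub('(?i)b+', 'x', text)

# Option 2: a compiled pattern object that carries the flag,
# passed as the first argument of the module-level function.
compiled = re.compile('b+', re.IGNORECASE)
via_object = re.sub(compiled, 'x', text)
```

Both calls produce the same string, ``'x x'``.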

Common Problems
===============

Regular expressions are a powerful tool for some applications, but in some ways
their behaviour isn't intuitive and at times they don't behave the way you may
expect them to. This section will point out some of the most common pitfalls.


Use String Methods
------------------

Sometimes using the :mod:`re` module is a mistake. If you're matching a fixed
string, or a single character class, and you're not using any :mod:`re` features
such as the :const:`IGNORECASE` flag, then the full power of regular expressions
may not be required. Strings have several methods for performing operations with
fixed strings and they're usually much faster, because the implementation is a
single small C loop that's been optimized for the purpose, instead of the large,
more generalized regular expression engine.

One example might be replacing a single fixed string with another one; for
example, you might replace ``word`` with ``deed``. ``re.sub()`` seems like the
function to use for this, but consider the :meth:`replace` method. Note that
:meth:`replace` will also replace ``word`` inside words, turning ``swordfish``
into ``sdeedfish``, but the naive RE ``word`` would have done that, too. (To
avoid performing the substitution on parts of words, the pattern would have to
be ``\bword\b``, in order to require that ``word`` have a word boundary on
either side. This takes the job beyond :meth:`replace`'s abilities.)

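The three approaches mentioned above can be compared directly. The sample
sentence is made up for this sketch.

```python
import re

text = 'word swordfish word'

# The string method and the naive RE both touch the inside of "swordfish".
plain = text.replace('word', 'deed')
naive = re.sub('word', 'deed', text)

# Requiring a word boundary on either side leaves "swordfish" alone.
bounded = re.sub(r'\bword\b', 'deed', text)
```
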
Another common task is deleting every occurrence of a single character from a
string or replacing it with another single character. You might do this with
something like ``re.sub('\n', ' ', S)``, but :meth:`translate` is capable of
doing both tasks and will be faster than any regular expression operation can
be.

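A minimal sketch of both tasks with :meth:`translate`, assuming Python 3,
where the translation table is built with ``str.maketrans()``. The sample
text is invented for the example.

```python
text = 'line one\nline two\n'

# Replace every newline with a space: the two-argument form of
# str.maketrans() maps each character in the first string to the
# character at the same position in the second.
spaced = text.translate(str.maketrans('\n', ' '))

# Delete every newline: the third argument lists characters to remove.
deleted = text.translate(str.maketrans('', '', '\n'))
```
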
In short, before turning to the :mod:`re` module, consider whether your problem
can be solved with a faster and simpler string method.


match() versus search()
-----------------------

The :func:`match` function only checks if the RE matches at the beginning of the
string while :func:`search` will scan forward through the string for a match.
It's important to keep this distinction in mind. Remember, :func:`match` will
only report a successful match that starts at position 0; if the match wouldn't
start at position 0, :func:`match` will *not* report it. ::

   >>> print(re.match('super', 'superstition').span())
   (0, 5)
   >>> print(re.match('super', 'insuperable'))
   None

On the other hand, :func:`search` will scan forward through the string,
reporting the first match it finds. ::

   >>> print(re.search('super', 'superstition').span())
   (0, 5)
   >>> print(re.search('super', 'insuperable').span())
   (2, 7)

Sometimes you'll be tempted to keep using :func:`re.match`, and just add ``.*``
to the front of your RE. Resist this temptation and use :func:`re.search`
instead. The regular expression compiler does some analysis of REs in order to
speed up the process of looking for a match. One such analysis figures out what
the first character of a match must be; for example, a pattern starting with
``Crow`` must match starting with a ``'C'``. The analysis lets the engine
quickly scan through the string looking for the starting character, only trying
the full match if a ``'C'`` is found.

Adding ``.*`` defeats this optimization, requiring scanning to the end of the
string and then backtracking to find a match for the rest of the RE. Use
:func:`re.search` instead.

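Apart from speed, the two approaches do not report the same thing. The sketch
below, using a made-up string containing ``super`` twice, shows that the
``.*`` version always starts at position 0 and, because ``.*`` is greedy,
ends at the *last* occurrence rather than the first.

```python
import re

text = 'insuperable superman'

# The leading .* swallows everything and backtracks to the last 'super'.
prefixed = re.match('.*super', text)

# search() reports where the first 'super' actually is.
searched = re.search('super', text)

prefixed_span = prefixed.span()
searched_span = searched.span()
```
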

Greedy versus Non-Greedy
------------------------

When repeating a regular expression, as in ``a*``, the resulting action is to
consume as much of the string as possible. This fact often bites you when
you're trying to match a pair of balanced delimiters, such as the angle brackets
surrounding an HTML tag. The naive pattern for matching a single HTML tag
doesn't work because of the greedy nature of ``.*``. ::

   >>> s = '<html><head><title>Title</title>'
   >>> len(s)
   32
   >>> print(re.match('<.*>', s).span())
   (0, 32)
   >>> print(re.match('<.*>', s).group())
   <html><head><title>Title</title>

The RE matches the ``'<'`` in ``<html>``, and the ``.*`` consumes the rest of
the string. There's still more left in the RE, though, and the ``>`` can't
match at the end of the string, so the regular expression engine has to
backtrack character by character until it finds a match for the ``>``. The
final match extends from the ``'<'`` in ``<html>`` to the ``'>'`` in
``</title>``, which isn't what you want.

In this case, the solution is to use the non-greedy quantifiers ``*?``, ``+?``,
``??``, or ``{m,n}?``, which match as *little* text as possible. In the above
example, the ``'>'`` is tried immediately after the first ``'<'`` matches, and
when it fails, the engine advances a character at a time, retrying the ``'>'``
at every step. This produces just the right result::

   >>> print(re.match('<.*?>', s).group())
   <html>

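The difference is also visible when collecting every match instead of only the
first. This short sketch applies ``re.findall()`` to the same sample string
with both forms of the pattern.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: a single match that spans the whole string.
greedy_tags = re.findall('<.*>', s)

# Non-greedy: each tag is matched on its own.
lazy_tags = re.findall('<.*?>', s)
```
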
(Note that parsing HTML or XML with regular expressions is painful.
Quick-and-dirty patterns will handle common cases, but HTML and XML have special
cases that will break the obvious regular expression; by the time you've written
a regular expression that handles all of the possible cases, the patterns will
be *very* complicated. Use an HTML or XML parser module for such tasks.)

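A minimal sketch of the parser-based approach, assuming the standard library's
``html.parser`` module is suitable for the job at hand. The ``TagCollector``
class is a hypothetical name written for this example; it only records the
names of opening tags.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    "Hypothetical helper that records the name of every opening tag."

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser once for each opening tag it encounters.
        self.tags.append(tag)

collector = TagCollector()
collector.feed('<html><head><title>Title</title>')
collector.close()
tags = collector.tags
```
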

Not Using re.VERBOSE
--------------------

By now you've probably noticed that regular expressions are a very compact
notation, but they're not terribly readable. REs of moderate complexity can
become lengthy collections of backslashes, parentheses, and metacharacters,
making them difficult to read and understand.

For such REs, specifying the ``re.VERBOSE`` flag when compiling the regular
expression can be helpful, because it allows you to format the regular
expression more clearly.

The ``re.VERBOSE`` flag has several effects. Whitespace in the regular
expression that *isn't* inside a character class is ignored. This means that an
expression such as ``dog | cat`` is equivalent to the less readable ``dog|cat``,
but ``[a b]`` will still match the characters ``'a'``, ``'b'``, or a space. In
addition, you can also put comments inside a RE; comments extend from a ``#``
character to the next newline. When used with triple-quoted strings, this
enables REs to be formatted more neatly::

   pat = re.compile(r"""
    \s*                 # Skip leading whitespace
    (?P<header>[^:]+)   # Header name
    \s* :               # Whitespace, and a colon
    (?P<value>.*?)      # The header's value -- *? used to
                        # lose the following trailing whitespace
    \s*$                # Trailing whitespace to end-of-line
   """, re.VERBOSE)

This is far more readable than::

   pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

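The two spellings compile to equivalent patterns, which can be checked by
applying both to the same input. In this sketch the header line and the
variable names ``verbose``, ``compact``, and ``line`` are invented for
illustration.

```python
import re

verbose = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value, matched non-greedily
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

compact = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

# A made-up header line with leading and trailing whitespace.
line = '  From:author@example.com   '

verbose_groups = verbose.match(line).groupdict()
compact_groups = compact.match(line).groupdict()
```

Both patterns capture ``'From'`` as the header and ``'author@example.com'`` as
the value.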

Feedback
========

Regular expressions are a complicated topic. Did this document help you
understand them? Were there parts that were unclear, or problems you
encountered that weren't covered here? If so, please send suggestions for
improvements to the author.

The most complete book on regular expressions is almost certainly Jeffrey
Friedl's Mastering Regular Expressions, published by O'Reilly. Unfortunately,
it exclusively concentrates on Perl and Java's flavours of regular expressions,
and doesn't contain any Python material at all, so it won't be useful as a
reference for programming in Python. (The first edition covered Python's
now-removed :mod:`regex` module, which won't help you much.) Consider checking
it out from your library.


.. rubric:: Footnotes

.. [#] Introduced in Python 2.2.2.
