| .. _regex-howto: | 
 |  | 
 | **************************** | 
 |   Regular Expression HOWTO | 
 | **************************** | 
 |  | 
 | :Author: A.M. Kuchling <amk@amk.ca> | 
 |  | 
 | .. TODO: | 
 |    Document lookbehind assertions | 
 |    Better way of displaying a RE, a string, and what it matches | 
 |    Mention optional argument to match.groups() | 
 |    Unicode (at least a reference) | 
 |  | 
 |  | 
 | .. topic:: Abstract | 
 |  | 
 |    This document is an introductory tutorial to using regular expressions in Python | 
 |    with the :mod:`re` module.  It provides a gentler introduction than the | 
 |    corresponding section in the Library Reference. | 
 |  | 
 |  | 
 | Introduction | 
 | ============ | 
 |  | 
 | Regular expressions (called REs, or regexes, or regex patterns) are essentially | 
 | a tiny, highly specialized programming language embedded inside Python and made | 
 | available through the :mod:`re` module. Using this little language, you specify | 
 | the rules for the set of possible strings that you want to match; this set might | 
 | contain English sentences, or e-mail addresses, or TeX commands, or anything you | 
 | like.  You can then ask questions such as "Does this string match the pattern?", | 
 | or "Is there a match for the pattern anywhere in this string?".  You can also | 
 | use REs to modify a string or to split it apart in various ways. | 
 |  | 
 | Regular expression patterns are compiled into a series of bytecodes which are | 
 | then executed by a matching engine written in C.  For advanced use, it may be | 
 | necessary to pay careful attention to how the engine will execute a given RE, | 
 | and write the RE in a certain way in order to produce bytecode that runs faster. | 
 | Optimization isn't covered in this document, because it requires that you have a | 
 | good understanding of the matching engine's internals. | 
 |  | 
 | The regular expression language is relatively small and restricted, so not all | 
 | possible string processing tasks can be done using regular expressions.  There | 
 | are also tasks that *can* be done with regular expressions, but the expressions | 
 | turn out to be very complicated.  In these cases, you may be better off writing | 
 | Python code to do the processing; while Python code will be slower than an | 
 | elaborate regular expression, it will also probably be more understandable. | 
 |  | 
 |  | 
 | Simple Patterns | 
 | =============== | 
 |  | 
 | We'll start by learning about the simplest possible regular expressions.  Since | 
 | regular expressions are used to operate on strings, we'll begin with the most | 
 | common task: matching characters. | 
 |  | 
 | For a detailed explanation of the computer science underlying regular | 
 | expressions (deterministic and non-deterministic finite automata), you can refer | 
 | to almost any textbook on writing compilers. | 
 |  | 
 |  | 
 | Matching Characters | 
 | ------------------- | 
 |  | 
 | Most letters and characters will simply match themselves.  For example, the | 
 | regular expression ``test`` will match the string ``test`` exactly.  (You can | 
 | enable a case-insensitive mode that would let this RE match ``Test`` or ``TEST`` | 
 | as well; more about this later.) | 
 |  | 
 | There are exceptions to this rule; some characters are special | 
 | :dfn:`metacharacters`, and don't match themselves.  Instead, they signal that | 
 | some out-of-the-ordinary thing should be matched, or they affect other portions | 
 | of the RE by repeating them or changing their meaning.  Much of this document is | 
 | devoted to discussing various metacharacters and what they do. | 
 |  | 
 | Here's a complete list of the metacharacters; their meanings will be discussed | 
 | in the rest of this HOWTO. :: | 
 |  | 
 |    . ^ $ * + ? { } [ ] \ | ( ) | 
 |  | 
 | The first metacharacters we'll look at are ``[`` and ``]``. They're used for | 
 | specifying a character class, which is a set of characters that you wish to | 
 | match.  Characters can be listed individually, or a range of characters can be | 
 | indicated by giving two characters and separating them by a ``'-'``.  For | 
 | example, ``[abc]`` will match any of the characters ``a``, ``b``, or ``c``; this | 
 | is the same as ``[a-c]``, which uses a range to express the same set of | 
 | characters.  If you wanted to match only lowercase letters, your RE would be | 
 | ``[a-z]``. | 
 |  | 
 | Most metacharacters are not active inside classes.  For example, ``[akm$]`` will |
 | match any of the characters ``'a'``, ``'k'``, ``'m'``, or ``'$'``; ``'$'`` is | 
 | usually a metacharacter, but inside a character class it's stripped of its | 
 | special nature. | 
 |  | 
 | You can match the characters not listed within the class by :dfn:`complementing` | 
 | the set.  This is indicated by including a ``'^'`` as the first character of the | 
 | class; a ``'^'`` anywhere else in the class will simply match the ``'^'`` |
 | character.  For example, ``[^5]`` will match any character except ``'5'``. | 
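
As a quick illustration of the classes described above, here is a minimal
sketch. It uses :func:`re.findall`, which is introduced later in this HOWTO,
to list every character of a sample string that a class matches. The sample
strings and variable names are invented for this example.

```python
import re

# [abc] and [a-c] describe the same set of characters.
listed = re.findall('[abc]', 'a big cab')
ranged = re.findall('[a-c]', 'a big cab')

# Inside a class, '$' is an ordinary character.
dollars = re.findall('[akm$]', 'make $5')

# A leading '^' complements the class: everything except '5'.
not_five = re.findall('[^5]', '1525')

print(listed, ranged, dollars, not_five)
```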
 |  | 
 | Perhaps the most important metacharacter is the backslash, ``\``.   As in Python | 
 | string literals, the backslash can be followed by various characters to signal | 
 | various special sequences.  It's also used to escape all the metacharacters so | 
 | you can still match them in patterns; for example, if you need to match a ``[`` | 
 | or ``\``, you can precede them with a backslash to remove their special |
 | meaning: ``\[`` or ``\\``. | 
 |  | 
 | Some of the special sequences beginning with ``'\'`` represent predefined sets | 
 | of characters that are often useful, such as the set of digits, the set of | 
 | letters, or the set of anything that isn't whitespace.  The following predefined | 
 | special sequences are a subset of those available. The equivalent classes are | 
 | for bytes patterns. For a complete list of sequences and expanded class | 
 | definitions for Unicode string patterns, see the last part of | 
 | :ref:`Regular Expression Syntax <re-syntax>`. | 
 |  | 
 | ``\d`` | 
 |    Matches any decimal digit; this is equivalent to the class ``[0-9]``. | 
 |  | 
 | ``\D`` | 
 |    Matches any non-digit character; this is equivalent to the class ``[^0-9]``. | 
 |  | 
 | ``\s`` | 
 |    Matches any whitespace character; this is equivalent to the class ``[ | 
 |    \t\n\r\f\v]``. | 
 |  | 
 | ``\S`` | 
 |    Matches any non-whitespace character; this is equivalent to the class ``[^ | 
 |    \t\n\r\f\v]``. | 
 |  | 
 | ``\w`` | 
 |    Matches any alphanumeric character; this is equivalent to the class | 
 |    ``[a-zA-Z0-9_]``. | 
 |  | 
 | ``\W`` | 
 |    Matches any non-alphanumeric character; this is equivalent to the class | 
 |    ``[^a-zA-Z0-9_]``. | 
 |  | 
 | These sequences can be included inside a character class.  For example, | 
 | ``[\s,.]`` is a character class that will match any whitespace character, or | 
 | ``','`` or ``'.'``. | 
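
The predefined sequences can be tried out the same way. The sketch below is
only an illustration: the sample text is made up, and the patterns are written
as raw strings (``r'...'``), a notation explained later in this HOWTO.

```python
import re

sample = 'Call 555-1234, now.'

# \d matches each decimal digit.
digits = re.findall(r'\d', sample)

# \w matches letters, digits, and the underscore.
word_chars = re.findall(r'\w', 'a_1 !')

# A class may mix a predefined sequence with ordinary characters.
separators = re.findall(r'[\s,.]', sample)

print(digits, word_chars, separators)
```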
 |  | 
 | The final metacharacter in this section is ``.``.  It matches anything except a | 
 | newline character, and there's an alternate mode (``re.DOTALL``) where it will | 
 | match even a newline.  ``'.'`` is often used where you want to match "any | 
 | character". | 
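
A small sketch of the difference between the default behaviour of ``.`` and
the ``re.DOTALL`` mode; the three-character test string is an arbitrary
choice for this example.

```python
import re

text = 'a\nb'

# By default '.' does not match the newline, so 'a.b' fails here.
default_match = re.match('a.b', text)

# With re.DOTALL, '.' matches the newline as well.
dotall_match = re.match('a.b', text, re.DOTALL)

print(default_match, dotall_match.group() == text)
```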
 |  | 
 |  | 
 | Repeating Things | 
 | ---------------- | 
 |  | 
 | Being able to match varying sets of characters is the first thing regular | 
 | expressions can do that isn't already possible with the methods available on | 
 | strings.  However, if that was the only additional capability of regexes, they | 
 | wouldn't be much of an advance. Another capability is that you can specify that | 
 | portions of the RE must be repeated a certain number of times. | 
 |  | 
 | The first metacharacter for repeating things that we'll look at is ``*``.  ``*`` | 
 | doesn't match the literal character ``*``; instead, it specifies that the | 
 | previous character can be matched zero or more times, instead of exactly once. | 
 |  | 
 | For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``), | 
 | ``caaat`` (3 ``a`` characters), and so forth.  The RE engine has various | 
 | internal limitations stemming from the size of C's ``int`` type that will | 
 | prevent it from matching over 2 billion ``a`` characters; you probably don't | 
 | have enough memory to construct a string that large, so you shouldn't run into | 
 | that limit. | 
 |  | 
 | Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching | 
 | engine will try to repeat it as many times as possible. If later portions of the | 
 | pattern don't match, the matching engine will then back up and try again with | 
 | fewer repetitions. |
 |  | 
 | A step-by-step example will make this more obvious.  Let's consider the | 
 | expression ``a[bcd]*b``.  This matches the letter ``'a'``, zero or more letters | 
 | from the class ``[bcd]``, and finally ends with a ``'b'``.  Now imagine matching | 
 | this RE against the string ``abcbd``. | 
 |  | 
 | +------+-----------+---------------------------------+ | 
 | | Step | Matched   | Explanation                     | | 
 | +======+===========+=================================+ | 
 | | 1    | ``a``     | The ``a`` in the RE matches.    | | 
 | +------+-----------+---------------------------------+ | 
 | | 2    | ``abcbd`` | The engine matches ``[bcd]*``,  | | 
 | |      |           | going as far as it can, which   | | 
 | |      |           | is to the end of the string.    | | 
 | +------+-----------+---------------------------------+ | 
 | | 3    | *Failure* | The engine tries to match       | | 
 | |      |           | ``b``, but the current position | | 
 | |      |           | is at the end of the string, so | | 
 | |      |           | it fails.                       | | 
 | +------+-----------+---------------------------------+ | 
 | | 4    | ``abcb``  | Back up, so that ``[bcd]*``     | |
 | |      |           | matches one less character.     | | 
 | +------+-----------+---------------------------------+ | 
 | | 5    | *Failure* | Try ``b`` again, but the        | | 
 | |      |           | current position is at the last | | 
 | |      |           | character, which is a ``'d'``.  | | 
 | +------+-----------+---------------------------------+ | 
 | | 6    | ``abc``   | Back up again, so that          | | 
 | |      |           | ``[bcd]*`` is only matching     | | 
 | |      |           | ``bc``.                         | | 
 | +------+-----------+---------------------------------+ | 
 | | 7    | ``abcb``  | Try ``b`` again.  This time     | |
 | |      |           | the character at the            | | 
 | |      |           | current position is ``'b'``, so | | 
 | |      |           | it succeeds.                    | | 
 | +------+-----------+---------------------------------+ | 
 |  | 
 | The end of the RE has now been reached, and it has matched ``abcb``.  This | 
 | demonstrates how the matching engine goes as far as it can at first, and if no | 
 | match is found it will then progressively back up and retry the rest of the RE | 
 | again and again.  It will back up until it has tried zero matches for | 
 | ``[bcd]*``, and if that subsequently fails, the engine will conclude that the | 
 | string doesn't match the RE at all. | 
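
The outcome of the walk-through above can be checked directly. This sketch
only shows the final result, not the engine's intermediate steps; the second
string, ``acd``, is an extra case added here to show a complete failure.

```python
import re

# Greedy matching followed by backtracking settles on 'abcb'.
m = re.match('a[bcd]*b', 'abcbd')
matched = m.group()

# No position for the final 'b' exists here, so every retry fails.
no_match = re.match('a[bcd]*b', 'acd')

print(matched, no_match)
```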
 |  | 
 | Another repeating metacharacter is ``+``, which matches one or more times.  Pay | 
 | careful attention to the difference between ``*`` and ``+``; ``*`` matches | 
 | *zero* or more times, so whatever's being repeated may not be present at all, | 
 | while ``+`` requires at least *one* occurrence.  To use a similar example, | 
 | ``ca+t`` will match ``cat`` (1 ``a``), ``caaat`` (3 ``a``'s), but won't match | 
 | ``ct``. | 
 |  | 
 | There are two more repeating quantifiers.  The question mark character, ``?``, |
 | matches either once or zero times; you can think of it as marking something as | 
 | being optional.  For example, ``home-?brew`` matches either ``homebrew`` or | 
 | ``home-brew``. | 
 |  | 
 | The most complicated quantifier is ``{m,n}``, where *m* and *n* are |
 | decimal integers.  This qualifier means there must be at least *m* repetitions, | 
 | and at most *n*.  For example, ``a/{1,3}b`` will match ``a/b``, ``a//b``, and | 
 | ``a///b``.  It won't match ``ab``, which has no slashes, or ``a////b``, which | 
 | has four. | 
 |  | 
 | You can omit either *m* or *n*; in that case, a reasonable value is assumed for | 
 | the missing value.  Omitting *m* is interpreted as a lower limit of 0, while | 
 | omitting *n* results in an upper bound of infinity --- actually, the upper bound | 
 | is the 2-billion limit mentioned earlier, but that might as well be infinity. | 
 |  | 
 | Readers of a reductionist bent may notice that the three other quantifiers can |
 | all be expressed using this notation.  ``{0,}`` is the same as ``*``, ``{1,}`` | 
 | is equivalent to ``+``, and ``{0,1}`` is the same as ``?``.  It's better to use | 
 | ``*``, ``+``, or ``?`` when you can, simply because they're shorter and easier | 
 | to read. | 
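
The repetition examples from this section can be collected into one sketch.
The helper ``matches()`` is a hypothetical function written for this
illustration, and it relies on :func:`re.fullmatch`, which is assumed to be
available (Python 3.4 and later), so that each candidate string must be
matched in its entirety.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

words = ['ct', 'cat', 'caaat']
star = matches('ca*t', words)        # zero or more 'a' characters
plus = matches('ca+t', words)        # at least one 'a'
optional = matches('home-?brew', ['homebrew', 'home-brew', 'home--brew'])
counted = matches('a/{1,3}b', ['ab', 'a/b', 'a//b', 'a///b', 'a////b'])

# The brace notation can express the other three forms.
star_braces = matches('ca{0,}t', words)
plus_braces = matches('ca{1,}t', words)

print(star, plus, optional, counted)
```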
 |  | 
 |  | 
 | Using Regular Expressions | 
 | ========================= | 
 |  | 
 | Now that we've looked at some simple regular expressions, how do we actually use | 
 | them in Python?  The :mod:`re` module provides an interface to the regular | 
 | expression engine, allowing you to compile REs into objects and then perform | 
 | matches with them. | 
 |  | 
 |  | 
 | Compiling Regular Expressions | 
 | ----------------------------- | 
 |  | 
 | Regular expressions are compiled into pattern objects, which have | 
 | methods for various operations such as searching for pattern matches or | 
 | performing string substitutions. :: | 
 |  | 
 |    >>> import re | 
 |    >>> p = re.compile('ab*') | 
 |    >>> p  #doctest: +ELLIPSIS | 
 |    <_sre.SRE_Pattern object at 0x...> | 
 |  | 
 | :func:`re.compile` also accepts an optional *flags* argument, used to enable | 
 | various special features and syntax variations.  We'll go over the available | 
 | settings later, but for now a single example will do:: | 
 |  | 
 |    >>> p = re.compile('ab*', re.IGNORECASE) | 
 |  | 
 | The RE is passed to :func:`re.compile` as a string.  REs are handled as strings | 
 | because regular expressions aren't part of the core Python language, and no | 
 | special syntax was created for expressing them.  (There are applications that | 
 | don't need REs at all, so there's no need to bloat the language specification by | 
 | including them.) Instead, the :mod:`re` module is simply a C extension module | 
 | included with Python, just like the :mod:`socket` or :mod:`zlib` modules. | 
 |  | 
 | Putting REs in strings keeps the Python language simpler, but has one | 
 | disadvantage which is the topic of the next section. | 
 |  | 
 |  | 
 | The Backslash Plague | 
 | -------------------- | 
 |  | 
 | As stated earlier, regular expressions use the backslash character (``'\'``) to | 
 | indicate special forms or to allow special characters to be used without | 
 | invoking their special meaning. This conflicts with Python's usage of the same | 
 | character for the same purpose in string literals. | 
 |  | 
 | Let's say you want to write a RE that matches the string ``\section``, which | 
 | might be found in a LaTeX file.  To figure out what to write in the program | 
 | code, start with the desired string to be matched.  Next, you must escape any | 
 | backslashes and other metacharacters by preceding them with a backslash, | 
 | resulting in ``\\section``; this is the string that must be passed |
 | to :func:`re.compile`.  However, to express it as a |
 | Python string literal, both backslashes must be escaped *again*. |
 |  | 
 | +-------------------+------------------------------------------+ | 
 | | Characters        | Stage                                    | | 
 | +===================+==========================================+ | 
 | | ``\section``      | Text string to be matched                | | 
 | +-------------------+------------------------------------------+ | 
 | | ``\\section``     | Escaped backslash for :func:`re.compile` | | 
 | +-------------------+------------------------------------------+ | 
 | | ``"\\\\section"`` | Escaped backslashes for a string literal | | 
 | +-------------------+------------------------------------------+ | 
 |  | 
 | In short, to match a literal backslash, one has to write ``'\\\\'`` as the RE | 
 | string, because the regular expression must be ``\\``, and each backslash must | 
 | be expressed as ``\\`` inside a regular Python string literal.  In REs that | 
 | feature backslashes repeatedly, this leads to lots of repeated backslashes and | 
 | makes the resulting strings difficult to understand. | 
 |  | 
 | The solution is to use Python's raw string notation for regular expressions; | 
 | backslashes are not handled in any special way in a string literal prefixed with | 
 | ``'r'``, so ``r"\n"`` is a two-character string containing ``'\'`` and ``'n'``, | 
 | while ``"\n"`` is a one-character string containing a newline. Regular | 
 | expressions will often be written in Python code using this raw string notation. | 
 |  | 
 | +-------------------+------------------+ | 
 | | Regular String    | Raw string       | | 
 | +===================+==================+ | 
 | | ``"ab*"``         | ``r"ab*"``       | | 
 | +-------------------+------------------+ | 
 | | ``"\\\\section"`` | ``r"\\section"`` | | 
 | +-------------------+------------------+ | 
 | | ``"\\w+\\s+\\1"`` | ``r"\w+\s+\1"``  | | 
 | +-------------------+------------------+ | 
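
The equivalences in the table can be confirmed in code. This is a minimal
sketch; the LaTeX-like sample text is invented for the example.

```python
import re

# A raw string keeps the backslash; a regular string interprets it.
raw_len = len(r'\n')       # backslash + 'n'
regular_len = len('\n')    # a single newline character

# Both spellings produce the same pattern text: \\section
regular = '\\\\section'
raw = r'\\section'
same_pattern = (regular == raw)

# The pattern matches a literal backslash followed by "section".
m = re.search(raw, 'See \\section{Intro}')
found = m.group()

print(raw_len, regular_len, same_pattern, found)
```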
 |  | 
 |  | 
 | Performing Matches | 
 | ------------------ | 
 |  | 
 | Once you have an object representing a compiled regular expression, what do you | 
 | do with it?  Pattern objects have several methods and attributes. | 
 | Only the most significant ones will be covered here; consult the :mod:`re` docs | 
 | for a complete listing. | 
 |  | 
 | +------------------+-----------------------------------------------+ | 
 | | Method/Attribute | Purpose                                       | | 
 | +==================+===============================================+ | 
 | | ``match()``      | Determine if the RE matches at the beginning  | | 
 | |                  | of the string.                                | | 
 | +------------------+-----------------------------------------------+ | 
 | | ``search()``     | Scan through a string, looking for any        | | 
 | |                  | location where this RE matches.               | | 
 | +------------------+-----------------------------------------------+ | 
 | | ``findall()``    | Find all substrings where the RE matches, and | |
 | |                  | return them as a list.                        | |
 | +------------------+-----------------------------------------------+ |
 | | ``finditer()``   | Find all substrings where the RE matches, and | |
 | |                  | return them as an :term:`iterator`.           | |
 | +------------------+-----------------------------------------------+ | 
 |  | 
 | :meth:`match` and :meth:`search` return ``None`` if no match can be found.  If | 
 | they're successful, a :ref:`match object <match-objects>` instance is returned, | 
 | containing information about the match: where it starts and ends, the substring | 
 | it matched, and more. | 
 |  | 
 | You can learn about this by interactively experimenting with the :mod:`re` | 
 | module.  If you have :mod:`tkinter` available, you may also want to look at | 
 | :source:`Tools/demo/redemo.py`, a demonstration program included with the | 
 | Python distribution.  It allows you to enter REs and strings, and displays | 
 | whether the RE matches or fails. :file:`redemo.py` can be quite useful when | 
 | trying to debug a complicated RE.  Phil Schwartz's `Kodos | 
 | <http://kodos.sourceforge.net/>`_ is also an interactive tool for developing and | 
 | testing RE patterns. | 
 |  | 
 | This HOWTO uses the standard Python interpreter for its examples. First, run the | 
 | Python interpreter, import the :mod:`re` module, and compile a RE:: | 
 |  | 
 |    >>> import re | 
 |    >>> p = re.compile('[a-z]+') | 
 |    >>> p  #doctest: +ELLIPSIS | 
 |    <_sre.SRE_Pattern object at 0x...> | 
 |  | 
 | Now, you can try matching various strings against the RE ``[a-z]+``.  An empty | 
 | string shouldn't match at all, since ``+`` means 'one or more repetitions'. | 
 | :meth:`match` should return ``None`` in this case, which will cause the | 
 | interpreter to print no output.  You can explicitly print the result of | 
 | :meth:`match` to make this clear. :: | 
 |  | 
 |    >>> p.match("") | 
 |    >>> print(p.match("")) | 
 |    None | 
 |  | 
 | Now, let's try it on a string that it should match, such as ``tempo``.  In this | 
 | case, :meth:`match` will return a :ref:`match object <match-objects>`, so you | 
 | should store the result in a variable for later use. :: | 
 |  | 
 |    >>> m = p.match('tempo') | 
 |    >>> m  #doctest: +ELLIPSIS | 
 |    <_sre.SRE_Match object at 0x...> | 
 |  | 
 | Now you can query the :ref:`match object <match-objects>` for information | 
 | about the matching string.  :ref:`match object <match-objects>` instances | 
 | also have several methods and attributes; the most important ones are: | 
 |  | 
 | +------------------+--------------------------------------------+ | 
 | | Method/Attribute | Purpose                                    | | 
 | +==================+============================================+ | 
 | | ``group()``      | Return the string matched by the RE        | | 
 | +------------------+--------------------------------------------+ | 
 | | ``start()``      | Return the starting position of the match  | | 
 | +------------------+--------------------------------------------+ | 
 | | ``end()``        | Return the ending position of the match    | | 
 | +------------------+--------------------------------------------+ | 
 | | ``span()``       | Return a tuple containing the (start, end) | | 
 | |                  | positions  of the match                    | | 
 | +------------------+--------------------------------------------+ | 
 |  | 
 | Trying these methods will soon clarify their meaning:: | 
 |  | 
 |    >>> m.group() | 
 |    'tempo' | 
 |    >>> m.start(), m.end() | 
 |    (0, 5) | 
 |    >>> m.span() | 
 |    (0, 5) | 
 |  | 
 | :meth:`group` returns the substring that was matched by the RE.  :meth:`start` | 
 | and :meth:`end` return the starting and ending index of the match. :meth:`span` | 
 | returns both start and end indexes in a single tuple.  Since the :meth:`match` | 
 | method only checks if the RE matches at the start of a string, :meth:`start` | 
 | will always be zero.  However, the :meth:`search` method of patterns | 
 | scans through the string, so the match may not start at zero in that |
 | case. :: | 
 |  | 
 |    >>> print(p.match('::: message')) | 
 |    None | 
 |    >>> m = p.search('::: message'); print(m)  #doctest: +ELLIPSIS | 
 |    <_sre.SRE_Match object at 0x...> | 
 |    >>> m.group() | 
 |    'message' | 
 |    >>> m.span() | 
 |    (4, 11) | 
 |  | 
 | In actual programs, the most common style is to store the | 
 | :ref:`match object <match-objects>` in a variable, and then check if it was | 
 | ``None``.  This usually looks like:: | 
 |  | 
 |    p = re.compile( ... ) | 
 |    m = p.match( 'string goes here' ) | 
 |    if m: | 
 |        print('Match found: ', m.group()) | 
 |    else: | 
 |        print('No match') | 
 |  | 
 | Two pattern methods return all of the matches for a pattern. | 
 | :meth:`findall` returns a list of matching strings:: | 
 |  | 
 |    >>> p = re.compile(r'\d+') |
 |    >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping') | 
 |    ['12', '11', '10'] | 
 |  | 
 | :meth:`findall` has to create the entire list before it can be returned as the | 
 | result.  The :meth:`finditer` method returns a sequence of | 
 | :ref:`match object <match-objects>` instances as an :term:`iterator`:: | 
 |  | 
 |    >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...') | 
 |    >>> iterator  #doctest: +ELLIPSIS | 
 |    <callable_iterator object at 0x...> | 
 |    >>> for match in iterator: | 
 |    ...     print(match.span()) | 
 |    ... | 
 |    (0, 2) | 
 |    (22, 24) | 
 |    (29, 31) | 
 |  | 
 |  | 
 | Module-Level Functions | 
 | ---------------------- | 
 |  | 
 | You don't have to create a pattern object and call its methods; the | 
 | :mod:`re` module also provides top-level functions called :func:`match`, | 
 | :func:`search`, :func:`findall`, :func:`sub`, and so forth.  These functions | 
 | take the same arguments as the corresponding pattern method, with |
 | the RE string added as the first argument; :func:`match` and :func:`search` still |
 | return either ``None`` or a :ref:`match object <match-objects>` instance. :: |
 |  | 
 |    >>> print(re.match(r'From\s+', 'Fromage amk')) | 
 |    None | 
 |    >>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')  #doctest: +ELLIPSIS | 
 |    <_sre.SRE_Match object at 0x...> | 
 |  | 
 | Under the hood, these functions simply create a pattern object for you | 
 | and call the appropriate method on it.  They also store the compiled object in a | 
 | cache, so future calls using the same RE are faster. | 
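
To make the correspondence concrete, the following sketch runs the same
match once through the module-level function and once through a compiled
pattern object. It illustrates only that the two spellings give the same
result; it does not show the internal cache.

```python
import re

text = 'From amk Thu May 14 19:12:10 1998'

# Module-level function: the RE string is the first argument.
m1 = re.match(r'From\s+', text)

# The same match through an explicitly compiled pattern object.
p = re.compile(r'From\s+')
m2 = p.match(text)

same_result = (m1.span() == m2.span() and m1.group() == m2.group())
print(same_result, m1.span())
```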
 |  | 
 | Should you use these module-level functions, or should you get the | 
 | pattern and call its methods yourself?  That choice depends on how | 
 | frequently the RE will be used, and on your personal coding style.  If the RE is | 
 | being used at only one point in the code, then the module functions are probably | 
 | more convenient.  If a program contains a lot of regular expressions, or re-uses | 
 | the same ones in several locations, then it might be worthwhile to collect all | 
 | the definitions in one place, in a section of code that compiles all the REs | 
 | ahead of time.  To take an example from the standard library, here's an extract | 
 | from the now-defunct Python 2 standard :mod:`xmllib` module:: | 
 |  | 
 |    ref = re.compile( ... ) | 
 |    entityref = re.compile( ... ) | 
 |    charref = re.compile( ... ) | 
 |    starttagopen = re.compile( ... ) | 
 |  | 
 | I generally prefer to work with the compiled object, even for one-time uses, but | 
 | few people will be as much of a purist about this as I am. | 
 |  | 
 |  | 
 | Compilation Flags | 
 | ----------------- | 
 |  | 
 | Compilation flags let you modify some aspects of how regular expressions work. | 
 | Flags are available in the :mod:`re` module under two names, a long name such as | 
 | :const:`IGNORECASE` and a short, one-letter form such as :const:`I`.  (If you're | 
 | familiar with Perl's pattern modifiers, the one-letter forms use the same | 
 | letters; the short form of :const:`re.VERBOSE` is :const:`re.X`, for example.) | 
 | Multiple flags can be specified by bitwise OR-ing them; ``re.I | re.M`` sets | 
 | both the :const:`I` and :const:`M` flags, for example. | 
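
As a small sketch of combining flags, the pattern below needs both
case-insensitive matching and the ``DOTALL`` behaviour to succeed; with only
one of the two flags the match fails. The pattern and test string are
arbitrary choices for this illustration.

```python
import re

# The short and long names refer to the same flag.
same_flag = (re.I == re.IGNORECASE)

text = 'A\nC'

# Both flags together: 'a' matches 'A', '.' matches the newline.
both = re.match('a.c', text, re.I | re.S)

# Either flag alone is not enough for this string.
only_i = re.match('a.c', text, re.I)
only_s = re.match('a.c', text, re.S)

print(same_flag, both is not None, only_i, only_s)
```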
 |  | 
 | Here's a table of the available flags, followed by a more detailed explanation | 
 | of each one. | 
 |  | 
 | +---------------------------------+--------------------------------------------+ | 
 | | Flag                            | Meaning                                    | | 
 | +=================================+============================================+ | 
 | | :const:`DOTALL`, :const:`S`     | Make ``.`` match any character, including  | | 
 | |                                 | newlines                                   | | 
 | +---------------------------------+--------------------------------------------+ | 
 | | :const:`IGNORECASE`, :const:`I` | Do case-insensitive matches                | | 
 | +---------------------------------+--------------------------------------------+ | 
 | | :const:`LOCALE`, :const:`L`     | Do a locale-aware match                    | | 
 | +---------------------------------+--------------------------------------------+ | 
 | | :const:`MULTILINE`, :const:`M`  | Multi-line matching, affecting ``^`` and   | | 
 | |                                 | ``$``                                      | | 
 | +---------------------------------+--------------------------------------------+ | 
 | | :const:`VERBOSE`, :const:`X`    | Enable verbose REs, which can be organized | | 
 | |                                 | more cleanly and understandably.           | | 
 | +---------------------------------+--------------------------------------------+ | 
 | | :const:`ASCII`, :const:`A`      | Make several escapes like ``\w``, ``\b``,  | |
 | |                                 | ``\s`` and ``\d`` match only on ASCII      | | 
 | |                                 | characters with the respective property.   | | 
 | +---------------------------------+--------------------------------------------+ | 
 |  | 
 |  | 
 | .. data:: I | 
 |           IGNORECASE | 
 |    :noindex: | 
 |  | 
 |    Perform case-insensitive matching; character classes and literal strings will |
 |    match letters by ignoring case.  For example, ``[A-Z]`` will match lowercase | 
 |    letters, too, and ``Spam`` will match ``Spam``, ``spam``, or ``spAM``. This | 
 |    lowercasing doesn't take the current locale into account; it will if you also | 
 |    set the :const:`LOCALE` flag. | 
 |  | 
 |  | 
 | .. data:: L | 
 |           LOCALE | 
 |    :noindex: | 
 |  | 
 |    Make ``\w``, ``\W``, ``\b``, and ``\B`` dependent on the current locale. |
 |  | 
 |    Locales are a feature of the C library intended to help in writing programs that | 
 |    take account of language differences.  For example, if you're processing |
 |    encoded French text, you'd want to be able to write ``\w+`` to match words, but |
 |    in bytes patterns ``\w`` only matches the class ``[a-zA-Z0-9_]``; it won't match |
 |    the bytes corresponding to ``'é'`` or ``'ç'``.  If |
 |    your system is configured properly and a French locale is selected, certain C | 
 |    functions will tell the program that ``'é'`` should also be considered a letter. | 
 |    Setting the :const:`LOCALE` flag when compiling a regular expression will cause | 
 |    the resulting compiled object to use these C functions for ``\w``; this is | 
 |    slower, but also enables ``\w+`` to match French words as you'd expect. | 
 |  | 
 |  | 
 | .. data:: M | 
 |           MULTILINE | 
 |    :noindex: | 
 |  | 
 |    (``^`` and ``$`` haven't been explained yet;  they'll be introduced in section | 
 |    :ref:`more-metacharacters`.) | 
 |  | 
 |    Usually ``^`` matches only at the beginning of the string, and ``$`` matches | 
 |    only at the end of the string and immediately before the newline (if any) at the | 
 |    end of the string. When this flag is specified, ``^`` matches at the beginning | 
 |    of the string and at the beginning of each line within the string, immediately | 
 |    following each newline.  Similarly, the ``$`` metacharacter matches both at |
 |    the end of the string and at the end of each line (immediately preceding each | 
 |    newline). | 
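
The effect of ``MULTILINE`` can be seen in a short sketch. The three-line
sample string is made up for this example, and :func:`re.findall` is used
only to list where the anchored pattern succeeds.

```python
import re

text = 'one\ntwo\nthree'

# Without the flag, '^' matches only at the start of the string
# and '$' only at its end.
starts_default = re.findall('^[a-z]+', text)
ends_default = re.findall('[a-z]+$', text)

# With re.MULTILINE, '^' and '$' also match at each line boundary.
starts_multi = re.findall('^[a-z]+', text, re.MULTILINE)
ends_multi = re.findall('[a-z]+$', text, re.MULTILINE)

print(starts_default, ends_default, starts_multi, ends_multi)
```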
 |  | 
 |  | 
 | .. data:: S | 
 |           DOTALL | 
 |    :noindex: | 
 |  | 
 |    Makes the ``'.'`` special character match any character at all, including a | 
 |    newline; without this flag, ``'.'`` will match anything *except* a newline. | 
 |  | 
 |  | 
 | .. data:: A | 
 |           ASCII | 
 |    :noindex: | 
 |  | 
 |    Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` perform |
 |    ASCII-only matching instead of full Unicode matching. This is only meaningful for |
 |    Unicode patterns, and is ignored for byte patterns. | 
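
A minimal sketch of the difference for a Unicode (``str``) pattern. The
sample word is an arbitrary choice, and the accented letter is written as an
escape so the example does not depend on the source file's encoding.

```python
import re

word = 'caf\u00e9'   # "cafe" with an acute accent on the final letter

# Default for str patterns: \w uses the Unicode definition of a word character.
unicode_words = re.findall(r'\w+', word)

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_words = re.findall(r'\w+', word, re.ASCII)

print(unicode_words, ascii_words)
```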
 |  | 
 |  | 
 | .. data:: X | 
 |           VERBOSE | 
 |    :noindex: | 
 |  | 
 |    This flag allows you to write regular expressions that are more readable by | 
 |    granting you more flexibility in how you can format them.  When this flag has | 
 |    been specified, whitespace within the RE string is ignored, except when the | 
 |    whitespace is in a character class or preceded by an unescaped backslash; this | 
 |    lets you organize and indent the RE more clearly.  This flag also lets you put | 
 |    comments within a RE that will be ignored by the engine; comments are marked by | 
 |    a ``'#'`` that's neither in a character class nor preceded by an unescaped | 
 |    backslash. | 
 |  | 
 |    For example, here's a RE that uses :const:`re.VERBOSE`; see how much easier it | 
 |    is to read? :: | 
 |  | 
 |       charref = re.compile(r""" | 
 |        &[#]                # Start of a numeric entity reference | 
 |        ( | 
 |            0[0-7]+         # Octal form | 
 |          | [0-9]+          # Decimal form | 
 |          | x[0-9a-fA-F]+   # Hexadecimal form | 
 |        ) | 
 |        ;                   # Trailing semicolon | 
 |       """, re.VERBOSE) | 
 |  | 
 |    Without the verbose setting, the RE would look like this:: | 
 |  | 
 |       charref = re.compile("&#(0[0-7]+" | 
 |                            "|[0-9]+" | 
 |                            "|x[0-9a-fA-F]+);") | 
 |  | 
 |    In the above example, Python's automatic concatenation of string literals has | 
 |    been used to break up the RE into smaller pieces, but it's still more difficult | 
 |    to understand than the version using :const:`re.VERBOSE`. | 
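 |  | 
 |    To check that the two spellings really are the same pattern, the sketch below | 
 |    runs both against a few sample strings.  The sample strings and the name | 
 |    ``compact`` are assumptions made for this illustration. | 
 |  | 
```python
import re

charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

# The same RE written without the verbose setting.
compact = re.compile("&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+);")

samples = ["&#65;", "&#x41;", "&#0101;", "&amp;"]


def first_group(pattern, s):
    """Return group 1 of a match at the start of s, or None if no match."""
    m = pattern.match(s)
    return m.group(1) if m else None


verbose_results = [first_group(charref, s) for s in samples]
compact_results = [first_group(compact, s) for s in samples]
print(verbose_results)  # ['65', 'x41', '0101', None]
print(compact_results)  # ['65', 'x41', '0101', None]
```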
 |  | 
 |  | 
 | More Pattern Power | 
 | ================== | 
 |  | 
 | So far we've only covered a part of the features of regular expressions.  In | 
 | this section, we'll cover some new metacharacters, and how to use groups to | 
 | retrieve portions of the text that was matched. | 
 |  | 
 |  | 
 | .. _more-metacharacters: | 
 |  | 
 | More Metacharacters | 
 | ------------------- | 
 |  | 
 | There are some metacharacters that we haven't covered yet.  Most of them will be | 
 | covered in this section. | 
 |  | 
 | Some of the remaining metacharacters to be discussed are :dfn:`zero-width | 
 | assertions`.  They don't cause the engine to advance through the string; | 
 | instead, they consume no characters at all, and simply succeed or fail.  For | 
 | example, ``\b`` is an assertion that the current position is located at a word | 
 | boundary; the position isn't changed by the ``\b`` at all.  This means that | 
 | zero-width assertions should never be repeated, because if they match once at a | 
 | given location, they can obviously be matched an infinite number of times. | 
 |  | 
 | ``|`` | 
 |    Alternation, or the "or" operator.   If A and B are regular expressions, | 
 |    ``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very | 
 |    low precedence in order to make it work reasonably when you're alternating | 
 |    multi-character strings. ``Crow|Servo`` will match either ``Crow`` or ``Servo``, | 
 |    not ``Cro``, a ``'w'`` or an ``'S'``, and ``ervo``. | 
 |  | 
 |    To match a literal ``'|'``, use ``\|``, or enclose it inside a character class, | 
 |    as in ``[|]``. | 
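 |  | 
 |    A short sketch of both points, with sample strings chosen only for | 
 |    illustration: | 
 |  | 
```python
import re

p = re.compile(r'Crow|Servo')

# The alternatives are the whole strings 'Crow' and 'Servo'.
names = p.findall('Crow, Servo, Cro, ervo')
print(names)  # ['Crow', 'Servo']

# A literal '|' must be escaped or placed in a character class.
escaped = re.split(r'\|', 'a|b|c')
in_class = re.split(r'[|]', 'a|b|c')
print(escaped)   # ['a', 'b', 'c']
print(in_class)  # ['a', 'b', 'c']
```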
 |  | 
 | ``^`` | 
 |    Matches at the beginning of lines.  Unless the :const:`MULTILINE` flag has been | 
 |    set, this will only match at the beginning of the string.  In :const:`MULTILINE` | 
 |    mode, this also matches immediately after each newline within the string. | 
 |  | 
 |    For example, if you wish to match the word ``From`` only at the beginning of a | 
 |    line, the RE to use is ``^From``. :: | 
 |  | 
 |       >>> print(re.search('^From', 'From Here to Eternity'))  #doctest: +ELLIPSIS | 
 |       <_sre.SRE_Match object at 0x...> | 
 |       >>> print(re.search('^From', 'Reciting From Memory')) | 
 |       None | 
 |  | 
 |    To match a literal ``'^'``, use ``\^`` or enclose it inside a character class, | 
 |    as in ``[\^]``. | 
 |  | 
 | ``$`` | 
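 |    Matches at the end of a line.  Unless the :const:`MULTILINE` flag has been | 
 |    set, this matches only at the end of the string or just before a newline at | 
 |    the end of the string; in :const:`MULTILINE` mode, it also matches just before | 
 |    each newline within the string. :: | 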
 |  | 
 |       >>> print(re.search('}$', '{block}'))  #doctest: +ELLIPSIS | 
 |       <_sre.SRE_Match object at 0x...> | 
 |       >>> print(re.search('}$', '{block} ')) | 
 |       None | 
 |       >>> print(re.search('}$', '{block}\n'))  #doctest: +ELLIPSIS | 
 |       <_sre.SRE_Match object at 0x...> | 
 |  | 
 |    To match a literal ``'$'``, use ``\$`` or enclose it inside a character class, | 
 |    as in  ``[$]``. | 
 |  | 
 | ``\A`` | 
 |    Matches only at the start of the string.  When not in :const:`MULTILINE` mode, | 
 |    ``\A`` and ``^`` are effectively the same.  In :const:`MULTILINE` mode, they're | 
 |    different: ``\A`` still matches only at the beginning of the string, but ``^`` | 
 |    may match at any location inside the string that follows a newline character. | 
 |  | 
 | ``\Z`` | 
 |    Matches only at the end of the string. | 
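 |  | 
 |    The sketch below contrasts ``\A`` and ``\Z`` with ``^`` and ``$``.  The | 
 |    sample text is invented for this example. | 
 |  | 
```python
import re

text = "From: alice\nFrom: bob\n"

# In MULTILINE mode, ^ matches at the start of every line...
caret_hits = re.findall(r'^From', text, re.MULTILINE)
# ...but \A still matches only at the very start of the string.
anchor_hits = re.findall(r'\AFrom', text, re.MULTILINE)
print(caret_hits)   # ['From', 'From']
print(anchor_hits)  # ['From']

# $ can match just before a trailing newline; \Z cannot.
dollar_found = re.search(r'bob$', text) is not None
z_found = re.search(r'bob\Z', text) is not None
print(dollar_found)  # True
print(z_found)       # False
```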
 |  | 
 | ``\b`` | 
 |    Word boundary.  This is a zero-width assertion that matches only at the | 
 |    beginning or end of a word.  A word is defined as a sequence of alphanumeric | 
 |    characters, so the end of a word is indicated by whitespace or a | 
 |    non-alphanumeric character. | 
 |  | 
 |    The following example matches ``class`` only when it's a complete word; it won't | 
 |    match when it's contained inside another word. :: | 
 |  | 
 |       >>> p = re.compile(r'\bclass\b') | 
 |       >>> print(p.search('no class at all'))  #doctest: +ELLIPSIS | 
 |       <_sre.SRE_Match object at 0x...> | 
 |       >>> print(p.search('the declassified algorithm')) | 
 |       None | 
 |       >>> print(p.search('one subclass is')) | 
 |       None | 
 |  | 
 |    There are two subtleties you should remember when using this special sequence. | 
 |    First, this is the worst collision between Python's string literals and regular | 
 |    expression sequences.  In Python's string literals, ``\b`` is the backspace | 
 |    character, ASCII value 8.  If you're not using raw strings, then Python will | 
 |    convert the ``\b`` to a backspace, and your RE won't match as you expect it to. | 
 |    The following example looks the same as our previous RE, but omits the ``'r'`` | 
 |    in front of the RE string. :: | 
 |  | 
 |       >>> p = re.compile('\bclass\b') | 
 |       >>> print(p.search('no class at all')) | 
 |       None | 
 |       >>> print(p.search('\b' + 'class' + '\b'))  #doctest: +ELLIPSIS | 
 |       <_sre.SRE_Match object at 0x...> | 
 |  | 
 |    Second, inside a character class, where there's no use for this assertion, | 
 |    ``\b`` represents the backspace character, for compatibility with Python's | 
 |    string literals. | 
 |  | 
 | ``\B`` | 
 |    Another zero-width assertion, this is the opposite of ``\b``, only matching when | 
 |    the current position is not at a word boundary. | 
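 |  | 
 |    As a small sketch, reusing the sample sentences from the ``\b`` examples | 
 |    above, ``\B`` on both sides finds ``class`` only when it is buried inside a | 
 |    longer word: | 
 |  | 
```python
import re

p = re.compile(r'\Bclass\B')

# A standalone 'class' has word boundaries on both sides, so \B fails.
standalone = p.search('no class at all')
print(standalone)  # None

# In 'subclass' the end of the word is a boundary, so the trailing \B fails.
at_word_end = p.search('one subclass is')
print(at_word_end)  # None

# Inside 'declassified' neither end is at a word boundary, so \B succeeds.
embedded = p.search('the declassified algorithm')
print(embedded.span())  # (6, 11)
```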
 |  | 
 |  | 
 | Grouping | 
 | -------- | 
 |  | 
 | Frequently you need to obtain more information than just whether the RE matched | 
 | or not.  Regular expressions are often used to dissect strings by writing a RE | 
 | divided into several subgroups which match different components of interest. | 
 | For example, an RFC-822 header line is divided into a header name and a value, | 
 | separated by a ``':'``, like this:: | 
 |  | 
 |    From: author@example.com | 
 |    User-Agent: Thunderbird 1.5.0.9 (X11/20061227) | 
 |    MIME-Version: 1.0 | 
 |    To: editor@example.com | 
 |  | 
 | This can be handled by writing a regular expression which matches an entire | 
 | header line, and has one group which matches the header name, and another group | 
 | which matches the header's value. | 
 |  | 
 | Groups are marked by the ``'('``, ``')'`` metacharacters. ``'('`` and ``')'`` | 
 | have much the same meaning as they do in mathematical expressions; they group | 
 | together the expressions contained inside them, and you can repeat the contents | 
 | of a group with a quantifier, such as ``*``, ``+``, ``?``, or | 
 | ``{m,n}``.  For example, ``(ab)*`` will match zero or more repetitions of | 
 | ``ab``. :: | 
 |  | 
 |    >>> p = re.compile('(ab)*') | 
 |    >>> print(p.match('ababababab').span()) | 
 |    (0, 10) | 
 |  | 
 | Groups indicated with ``'('``, ``')'`` also capture the starting and ending | 
 | index of the text that they match; this can be retrieved by passing an argument | 
 | to :meth:`group`, :meth:`start`, :meth:`end`, and :meth:`span`.  Groups are | 
 | numbered starting with 0.  Group 0 is always present; it's the whole RE, so | 
 | :ref:`match object <match-objects>` methods all have group 0 as their default | 
 | argument.  Later we'll see how to express groups that don't capture the span | 
 | of text that they match. :: | 
 |  | 
 |    >>> p = re.compile('(a)b') | 
 |    >>> m = p.match('ab') | 
 |    >>> m.group() | 
 |    'ab' | 
 |    >>> m.group(0) | 
 |    'ab' | 
 |  | 
 | Subgroups are numbered from left to right, from 1 upward.  Groups can be nested; | 
 | to determine the number, just count the opening parenthesis characters, going | 
 | from left to right. :: | 
 |  | 
 |    >>> p = re.compile('(a(b)c)d') | 
 |    >>> m = p.match('abcd') | 
 |    >>> m.group(0) | 
 |    'abcd' | 
 |    >>> m.group(1) | 
 |    'abc' | 
 |    >>> m.group(2) | 
 |    'b' | 
 |  | 
 | :meth:`group` can be passed multiple group numbers at a time, in which case it | 
 | will return a tuple containing the corresponding values for those groups. :: | 
 |  | 
 |    >>> m.group(2,1,2) | 
 |    ('b', 'abc', 'b') | 
 |  | 
 | The :meth:`groups` method returns a tuple containing the strings for all the | 
 | subgroups, from 1 up to however many there are. :: | 
 |  | 
 |    >>> m.groups() | 
 |    ('abc', 'b') | 
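 |  | 
 | Returning to the header lines shown at the start of this section, the following | 
 | sketch splits each one into a name and a value using two groups.  The pattern | 
 | and the variable names are illustrative assumptions, not a complete RFC-822 | 
 | parser. | 
 |  | 
```python
import re

# Group 1 captures the header name, group 2 the header's value.
header = re.compile(r'([^:\s]+):\s*(.*)')

lines = [
    'From: author@example.com',
    'User-Agent: Thunderbird 1.5.0.9 (X11/20061227)',
    'MIME-Version: 1.0',
    'To: editor@example.com',
]

# groups() returns (name, value) for each matched line.
parsed = [header.match(line).groups() for line in lines]
for name, value in parsed:
    print(name, '->', value)
```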
 |  | 
 | Backreferences in a pattern allow you to specify that the contents of an earlier | 
 | capturing group must also be found at the current location in the string.  For | 
 | example, ``\1`` will succeed if the exact contents of group 1 can be found at | 
 | the current position, and fails otherwise.  Remember that Python's string | 
 | literals also use a backslash followed by numbers to allow including arbitrary | 
 | characters in a string, so be sure to use a raw string when incorporating | 
 | backreferences in a RE. | 
 |  | 
 | For example, the following RE detects doubled words in a string. :: | 
 |  | 
 |    >>> p = re.compile(r'(\b\w+)\s+\1') | 
 |    >>> p.search('Paris in the the spring').group() | 
 |    'the the' | 
 |  | 
 | Backreferences like this aren't often useful for just searching through a string | 
 | --- there are few text formats which repeat data in this way --- but you'll soon | 
 | find out that they're *very* useful when performing string substitutions. | 
 |  | 
 |  | 
 | Non-capturing and Named Groups | 
 | ------------------------------ | 
 |  | 
 | Elaborate REs may use many groups, both to capture substrings of interest, and | 
 | to group and structure the RE itself.  In complex REs, it becomes difficult to | 
 | keep track of the group numbers.  There are two features which help with this | 
 | problem.  Both of them use a common syntax for regular expression extensions, so | 
 | we'll look at that first. | 
 |  | 
 | Perl 5 added several additional features to standard regular expressions, and | 
 | the Python :mod:`re` module supports most of them.   It would have been | 
 | difficult to choose new single-keystroke metacharacters or new special sequences | 
 | beginning with ``\`` to represent the new features without making Perl's regular | 
 | expressions confusingly different from standard REs.  If they had chosen ``&`` as | 
 | a new metacharacter, for example, old expressions would have assumed that ``&`` | 
 | was a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``. | 
 |  | 
 | The solution chosen by the Perl developers was to use ``(?...)`` as the | 
 | extension syntax.  ``?`` immediately after a parenthesis was a syntax error | 
 | because the ``?`` would have nothing to repeat, so this didn't introduce any | 
 | compatibility problems.  The characters immediately after the ``?``  indicate | 
 | what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead | 
 | assertion) and ``(?:foo)`` is something else (a non-capturing group containing | 
 | the subexpression ``foo``). | 
 |  | 
 | Python adds an extension syntax to Perl's extension syntax.  If the first | 
 | character after the question mark is a ``P``, you know that it's an extension | 
 | that's specific to Python.  Currently there are two such extensions: | 
 | ``(?P<name>...)`` defines a named group, and ``(?P=name)`` is a backreference to | 
 | a named group.  If future versions of Perl 5 add similar features using a | 
 | different syntax, the :mod:`re` module will be changed to support the new | 
 | syntax, while preserving the Python-specific syntax for compatibility's sake. | 
 |  | 
 | Now that we've looked at the general extension syntax, we can return to the | 
 | features that simplify working with groups in complex REs. Since groups are | 
 | numbered from left to right and a complex expression may use many groups, it can | 
 | become difficult to keep track of the correct numbering.  Modifying such a | 
 | complex RE is annoying, too: insert a new group near the beginning and you | 
 | change the numbers of everything that follows it. | 
 |  | 
 | Sometimes you'll want to use a group to collect a part of a regular expression, | 
 | but aren't interested in retrieving the group's contents. You can make this fact | 
 | explicit by using a non-capturing group: ``(?:...)``, where you can replace the | 
 | ``...`` with any other regular expression. :: | 
 |  | 
 |    >>> m = re.match("([abc])+", "abc") | 
 |    >>> m.groups() | 
 |    ('c',) | 
 |    >>> m = re.match("(?:[abc])+", "abc") | 
 |    >>> m.groups() | 
 |    () | 
 |  | 
 | Except for the fact that you can't retrieve the contents of what the group | 
 | matched, a non-capturing group behaves exactly the same as a capturing group; | 
 | you can put anything inside it, repeat it with a repetition metacharacter such | 
 | as ``*``, and nest it within other groups (capturing or non-capturing). | 
 | ``(?:...)`` is particularly useful when modifying an existing pattern, since you | 
 | can add new groups without changing how all the other groups are numbered.  It | 
 | should be mentioned that there's no performance difference in searching between | 
 | capturing and non-capturing groups; neither form is any faster than the other. | 
 |  | 
 | A more significant feature is named groups: instead of referring to them by | 
 | numbers, groups can be referenced by a name. | 
 |  | 
 | The syntax for a named group is one of the Python-specific extensions: | 
 | ``(?P<name>...)``.  *name* is, obviously, the name of the group.  Named groups | 
 | also behave exactly like capturing groups, and additionally associate a name | 
 | with a group.  The :ref:`match object <match-objects>` methods that deal with | 
 | capturing groups all accept either integers that refer to the group by number | 
 | or strings that contain the desired group's name.  Named groups are still | 
 | given numbers, so you can retrieve information about a group in two ways:: | 
 |  | 
 |    >>> p = re.compile(r'(?P<word>\b\w+\b)') | 
 |    >>> m = p.search('(((( Lots of punctuation )))') | 
 |    >>> m.group('word') | 
 |    'Lots' | 
 |    >>> m.group(1) | 
 |    'Lots' | 
 |  | 
 | Named groups are handy because they let you use easily-remembered names, instead | 
 | of having to remember numbers.  Here's an example RE from the :mod:`imaplib` | 
 | module:: | 
 |  | 
 |    InternalDate = re.compile(r'INTERNALDATE "' | 
 |            r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-' | 
 |            r'(?P<year>[0-9][0-9][0-9][0-9])' | 
 |            r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])' | 
 |            r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])' | 
 |            r'"') | 
 |  | 
 | It's obviously much easier to retrieve ``m.group('zonem')``, instead of having | 
 | to remember to retrieve group 9. | 
 |  | 
 | The syntax for backreferences in an expression such as ``(...)\1`` refers to the | 
 | number of the group.  There's naturally a variant that uses the group name | 
 | instead of the number. This is another Python extension: ``(?P=name)`` indicates | 
 | that the contents of the group called *name* should again be matched at the | 
 | current point.  The regular expression for finding doubled words, | 
 | ``(\b\w+)\s+\1`` can also be written as ``(?P<word>\b\w+)\s+(?P=word)``:: | 
 |  | 
 |    >>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)') | 
 |    >>> p.search('Paris in the the spring').group() | 
 |    'the the' | 
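 |  | 
 | Named groups can also be collected in one step with the match object's | 
 | ``groupdict()`` method, which returns a dictionary mapping each group name to | 
 | the substring it matched.  The sketch below applies the ``InternalDate`` | 
 | pattern shown earlier to a made-up sample response; the sample string is an | 
 | assumption for illustration only. | 
 |  | 
```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Hypothetical input, shaped to fit the pattern.
m = InternalDate.search('INTERNALDATE "17-Jul-1996 02:44:25 -0700"')

# The same group is reachable by name and by number.
print(m.group('zonem'), m.group(9))  # 00 00

# groupdict() gathers every named group into a dictionary.
fields = m.groupdict()
print(fields['day'], fields['mon'], fields['year'])  # 17 Jul 1996
```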
 |  | 
 |  | 
 | Lookahead Assertions | 
 | -------------------- | 
 |  | 
 | Another zero-width assertion is the lookahead assertion.  Lookahead assertions | 
 | are available in both positive and negative form, and  look like this: | 
 |  | 
 | ``(?=...)`` | 
 |    Positive lookahead assertion.  This succeeds if the contained regular | 
 |    expression, represented here by ``...``, successfully matches at the current | 
 |    location, and fails otherwise. But, once the contained expression has been | 
 |    tried, the matching engine doesn't advance at all; the rest of the pattern is | 
 |    tried right where the assertion started. | 
 |  | 
 | ``(?!...)`` | 
 |    Negative lookahead assertion.  This is the opposite of the positive assertion; | 
 |    it succeeds if the contained expression *doesn't* match at the current position | 
 |    in the string. | 
 |  | 
 | To make this concrete, let's look at a case where a lookahead is useful. | 
 | Consider a simple pattern to match a filename and split it apart into a base | 
 | name and an extension, separated by a ``.``.  For example, in ``news.rc``, | 
 | ``news`` is the base name, and ``rc`` is the filename's extension. | 
 |  | 
 | The pattern to match this is quite simple: | 
 |  | 
 | ``.*[.].*$`` | 
 |  | 
 | Notice that the ``.`` needs to be treated specially because it's a | 
 | metacharacter; I've put it inside a character class.  Also notice the trailing | 
 | ``$``; this is added to ensure that all the rest of the string must be included | 
 | in the extension.  This regular expression matches ``foo.bar`` and | 
 | ``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``. | 
 |  | 
 | Now, consider complicating the problem a bit; what if you want to match | 
 | filenames where the extension is not ``bat``? Some incorrect attempts: | 
 |  | 
 | ``.*[.][^b].*$`` | 
 |  | 
 | The first attempt above tries to exclude ``bat`` by requiring that the first | 
 | character of the extension is not a ``b``.  This is wrong, because the pattern | 
 | also doesn't match ``foo.bar``. | 
 |  | 
 | ``.*[.]([^b]..|.[^a].|..[^t])$`` | 
 |  | 
 | The expression gets messier when you try to patch up the first solution by | 
 | requiring one of the following cases to match: the first character of the | 
 | extension isn't ``b``; the second character isn't ``a``; or the third character | 
 | isn't ``t``.  This accepts ``foo.bar`` and rejects ``autoexec.bat``, but it | 
 | requires a three-letter extension and won't accept a filename with a two-letter | 
 | extension such as ``sendmail.cf``.  We'll complicate the pattern again in an | 
 | effort to fix it. | 
 |  | 
 | ``.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$`` | 
 |  | 
 | In the third attempt, the second and third letters are both made optional in | 
 | order to allow matching extensions shorter than three characters, such as | 
 | ``sendmail.cf``. | 
 |  | 
 | The pattern's getting really complicated now, which makes it hard to read and | 
 | understand.  Worse, if the problem changes and you want to exclude both ``bat`` | 
 | and ``exe`` as extensions, the pattern would get even more complicated and | 
 | confusing. | 
 |  | 
 | A negative lookahead cuts through all this confusion: | 
 |  | 
 | ``.*[.](?!bat$)[^.]*$`` | 
 |  | 
 | The negative lookahead means: if the expression ``bat$`` doesn't match at this | 
 | point, try the rest of the pattern; if ``bat$`` does match, the whole pattern | 
 | will fail.  The trailing ``$`` inside the assertion is required to ensure that | 
 | something like ``sample.batch``, where the extension only starts with ``bat``, | 
 | will be allowed.  Matching the extension with ``[^.]*`` rather than ``.*`` keeps | 
 | the pattern working when the filename contains more than one dot. | 
 |  | 
 | Excluding another filename extension is now easy; simply add it as an | 
 | alternative inside the assertion.  The following pattern excludes filenames that | 
 | end in either ``bat`` or ``exe``: | 
 |  | 
 | ``.*[.](?!bat$|exe$)[^.]*$`` | 
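 |  | 
 | The sketch below tries the negative-lookahead approach on a handful of | 
 | filenames.  It matches the extension with ``[^.]*`` so that the extension | 
 | cannot itself contain a dot; the list of filenames and the variable names are | 
 | assumptions made for this illustration. | 
 |  | 
```python
import re

# Reject names whose final extension is exactly 'bat' or 'exe'.
not_bat_or_exe = re.compile(r'.*[.](?!bat$|exe$)[^.]*$')

names = ['foo.bar', 'autoexec.bat', 'setup.exe', 'sendmail.cf',
         'sample.batch', 'printers.conf', 'foo.bar.bat']

accepted = [n for n in names if not_bat_or_exe.match(n)]
print(accepted)
# ['foo.bar', 'sendmail.cf', 'sample.batch', 'printers.conf']
```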
 |  | 
 |  | 
 | Modifying Strings | 
 | ================= | 
 |  | 
 | Up to this point, we've simply performed searches against a static string. | 
 | Regular expressions are also commonly used to modify strings in various ways, | 
 | using the following pattern methods: | 
 |  | 
 | +------------------+-----------------------------------------------+ | 
 | | Method/Attribute | Purpose                                       | | 
 | +==================+===============================================+ | 
 | | ``split()``      | Split the string into a list, splitting it    | | 
 | |                  | wherever the RE matches                       | | 
 | +------------------+-----------------------------------------------+ | 
 | | ``sub()``        | Find all substrings where the RE matches, and | | 
 | |                  | replace them with a different string          | | 
 | +------------------+-----------------------------------------------+ | 
 | | ``subn()``       | Does the same thing as :meth:`sub`,  but      | | 
 | |                  | returns the new string and the number of      | | 
 | |                  | replacements                                  | | 
 | +------------------+-----------------------------------------------+ | 
 |  | 
 |  | 
 | Splitting Strings | 
 | ----------------- | 
 |  | 
 | The :meth:`split` method of a pattern splits a string apart | 
 | wherever the RE matches, returning a list of the pieces. It's similar to the | 
 | :meth:`split` method of strings but provides much more generality in the | 
 | delimiters that you can split by; the string method only supports splitting by | 
 | whitespace or by a fixed string.  As you'd expect, there's a module-level | 
 | :func:`re.split` function, too. | 
 |  | 
 |  | 
 | .. method:: .split(string [, maxsplit=0]) | 
 |    :noindex: | 
 |  | 
 |    Split *string* by the matches of the regular expression.  If capturing | 
 |    parentheses are used in the RE, then their contents will also be returned as | 
 |    part of the resulting list.  If *maxsplit* is nonzero, at most *maxsplit* splits | 
 |    are performed. | 
 |  | 
 | You can limit the number of splits made by passing a value for *maxsplit*. | 
 | When *maxsplit* is nonzero, at most *maxsplit* splits will be made, and the | 
 | remainder of the string is returned as the final element of the list.  In the | 
 | following example, the delimiter is any sequence of non-alphanumeric characters. | 
 | :: | 
 |  | 
 |    >>> p = re.compile(r'\W+') | 
 |    >>> p.split('This is a test, short and sweet, of split().') | 
 |    ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', ''] | 
 |    >>> p.split('This is a test, short and sweet, of split().', 3) | 
 |    ['This', 'is', 'a', 'test, short and sweet, of split().'] | 
 |  | 
 | Sometimes you're not only interested in what the text between delimiters is, but | 
 | also need to know what the delimiter was.  If capturing parentheses are used in | 
 | the RE, then their values are also returned as part of the list.  Compare the | 
 | following calls:: | 
 |  | 
 |    >>> p = re.compile(r'\W+') | 
 |    >>> p2 = re.compile(r'(\W+)') | 
 |    >>> p.split('This... is a test.') | 
 |    ['This', 'is', 'a', 'test', ''] | 
 |    >>> p2.split('This... is a test.') | 
 |    ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', ''] | 
 |  | 
 | The module-level function :func:`re.split` adds the RE to be used as the first | 
 | argument, but is otherwise the same. :: | 
 |  | 
 |    >>> re.split(r'[\W]+', 'Words, words, words.') | 
 |    ['Words', 'words', 'words', ''] | 
 |    >>> re.split(r'([\W]+)', 'Words, words, words.') | 
 |    ['Words', ', ', 'words', ', ', 'words', '.', ''] | 
 |    >>> re.split(r'[\W]+', 'Words, words, words.', 1) | 
 |    ['Words', 'words, words.'] | 
 |  | 
 |  | 
 | Search and Replace | 
 | ------------------ | 
 |  | 
 | Another common task is to find all the matches for a pattern, and replace them | 
 | with a different string.  The :meth:`sub` method takes a replacement value, | 
 | which can be either a string or a function, and the string to be processed. | 
 |  | 
 |  | 
 | .. method:: .sub(replacement, string[, count=0]) | 
 |    :noindex: | 
 |  | 
 |    Returns the string obtained by replacing the leftmost non-overlapping | 
 |    occurrences of the RE in *string* by the replacement *replacement*.  If the | 
 |    pattern isn't found, *string* is returned unchanged. | 
 |  | 
 |    The optional argument *count* is the maximum number of pattern occurrences to be | 
 |    replaced; *count* must be a non-negative integer.  The default value of 0 means | 
 |    to replace all occurrences. | 
 |  | 
 | Here's a simple example of using the :meth:`sub` method.  It replaces colour | 
 | names with the word ``colour``:: | 
 |  | 
 |    >>> p = re.compile('(blue|white|red)') | 
 |    >>> p.sub('colour', 'blue socks and red shoes') | 
 |    'colour socks and colour shoes' | 
 |    >>> p.sub('colour', 'blue socks and red shoes', count=1) | 
 |    'colour socks and red shoes' | 
 |  | 
 | The :meth:`subn` method does the same work, but returns a 2-tuple containing the | 
 | new string value and the number of replacements  that were performed:: | 
 |  | 
 |    >>> p = re.compile('(blue|white|red)') | 
 |    >>> p.subn('colour', 'blue socks and red shoes') | 
 |    ('colour socks and colour shoes', 2) | 
 |    >>> p.subn('colour', 'no colours at all') | 
 |    ('no colours at all', 0) | 
 |  | 
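 | Empty matches are replaced only when they're not adjacent to a previous empty | 
 | match.  (Since Python 3.7, an empty match that immediately follows a non-empty | 
 | match is replaced as well; earlier versions skipped it and returned | 
 | ``'-a-b-d-'`` for the example below.) | 
 | :: | 
 |  | 
 |    >>> p = re.compile('x*') | 
 |    >>> p.sub('-', 'abxd') | 
 |    '-a-b--d-' | 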
 |  | 
 | If *replacement* is a string, any backslash escapes in it are processed.  That | 
 | is, ``\n`` is converted to a single newline character, ``\r`` is converted to a | 
 | carriage return, and so forth. Unknown escapes such as ``\&`` are left alone. | 
 | Backreferences, such as ``\6``, are replaced with the substring matched by the | 
 | corresponding group in the RE.  This lets you incorporate portions of the | 
 | original text in the resulting replacement string. | 
 |  | 
 | This example matches the word ``section`` followed by a string enclosed in | 
 | ``{``, ``}``, and changes ``section`` to ``subsection``:: | 
 |  | 
 |    >>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE) | 
 |    >>> p.sub(r'subsection{\1}','section{First} section{second}') | 
 |    'subsection{First} subsection{second}' | 
 |  | 
 | There's also a syntax for referring to named groups as defined by the | 
 | ``(?P<name>...)`` syntax.  ``\g<name>`` will use the substring matched by the | 
 | group named ``name``, and  ``\g<number>``  uses the corresponding group number. | 
 | ``\g<2>`` is therefore equivalent to ``\2``,  but isn't ambiguous in a | 
 | replacement string such as ``\g<2>0``.  (``\20`` would be interpreted as a | 
 | reference to group 20, not a reference to group 2 followed by the literal | 
 | character ``'0'``.)  The following substitutions are all equivalent, but use all | 
 | three variations of the replacement string. :: | 
 |  | 
 |    >>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE) | 
 |    >>> p.sub(r'subsection{\1}','section{First}') | 
 |    'subsection{First}' | 
 |    >>> p.sub(r'subsection{\g<1>}','section{First}') | 
 |    'subsection{First}' | 
 |    >>> p.sub(r'subsection{\g<name>}','section{First}') | 
 |    'subsection{First}' | 
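 |  | 
 | The ambiguity that ``\g<2>0`` avoids can be seen in a small sketch.  The | 
 | two-group pattern and the names below are invented for this illustration; the | 
 | exact wording of the error message may differ between Python versions. | 
 |  | 
```python
import re

# Two groups, each capturing one digit.
p = re.compile(r'(\d)(\d)')

# \g<2>0 means "group 2, then a literal 0".
swapped = p.sub(r'\g<2>0', '12')
print(swapped)  # 20

# \20 is read as a reference to group 20, which this pattern doesn't have.
raised = False
try:
    p.sub(r'\20', '12')
except re.error as exc:
    raised = True
    print('error:', exc)
```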
 |  | 
 | *replacement* can also be a function, which gives you even more control.  If | 
 | *replacement* is a function, the function is called for every non-overlapping | 
 | occurrence of *pattern*.  On each call, the function is passed a | 
 | :ref:`match object <match-objects>` argument for the match and can use this | 
 | information to compute the desired replacement string and return it. | 
 |  | 
 | In the following example, the replacement function translates decimals into | 
 | hexadecimal:: | 
 |  | 
 |    >>> def hexrepl(match): | 
 |    ...     "Return the hex string for a decimal number" | 
 |    ...     value = int(match.group()) | 
 |    ...     return hex(value) | 
 |    ... | 
 |    >>> p = re.compile(r'\d+') | 
 |    >>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.') | 
 |    'Call 0xffd2 for printing, 0xc000 for user code.' | 
 |  | 
 | When using the module-level :func:`re.sub` function, the pattern is passed as | 
 | the first argument.  The pattern may be provided as an object or as a string; if | 
 | you need to specify regular expression flags, you must either use a | 
 | pattern object as the first parameter, or use embedded modifiers in the | 
 | pattern string, e.g. ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``. | 
 |  | 
 |  | 
 | Common Problems | 
 | =============== | 
 |  | 
 | Regular expressions are a powerful tool for some applications, but in some ways | 
 | their behaviour isn't intuitive and at times they don't behave the way you may | 
 | expect them to.  This section will point out some of the most common pitfalls. | 
 |  | 
 |  | 
 | Use String Methods | 
 | ------------------ | 
 |  | 
 | Sometimes using the :mod:`re` module is a mistake.  If you're matching a fixed | 
 | string, or a single character class, and you're not using any :mod:`re` features | 
 | such as the :const:`IGNORECASE` flag, then the full power of regular expressions | 
 | may not be required. Strings have several methods for performing operations with | 
 | fixed strings and they're usually much faster, because the implementation is a | 
 | single small C loop that's been optimized for the purpose, instead of the large, | 
 | more generalized regular expression engine. | 
 |  | 
 | One example might be replacing a single fixed string with another one; for | 
 | example, you might replace ``word`` with ``deed``.  ``re.sub()`` seems like the | 
 | function to use for this, but consider the :meth:`replace` method.  Note that | 
 | :meth:`replace` will also replace ``word`` inside words, turning ``swordfish`` | 
 | into ``sdeedfish``, but the naive RE ``word`` would have done that, too.  (To | 
 | avoid performing the substitution on parts of words, the pattern would have to | 
 | be ``\bword\b``, in order to require that ``word`` have a word boundary on | 
 | either side.  This takes the job beyond :meth:`replace`'s abilities.) | 
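 |  | 
 | The three variants can be compared side by side in a short sketch; the sample | 
 | sentence is made up for this example. | 
 |  | 
```python
import re

text = 'The word was swordfish.'

# str.replace() substitutes every occurrence, even inside other words.
by_replace = text.replace('word', 'deed')
# The naive RE behaves the same way.
by_naive_re = re.sub(r'word', 'deed', text)
# Word boundaries restrict the substitution to the standalone word.
by_boundary_re = re.sub(r'\bword\b', 'deed', text)

print(by_replace)      # The deed was sdeedfish.
print(by_naive_re)     # The deed was sdeedfish.
print(by_boundary_re)  # The deed was swordfish.
```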
 |  | 
 | Another common task is deleting every occurrence of a single character from a | 
 | string or replacing it with another single character.  You might do this with | 
 | something like ``re.sub('\n', ' ', S)``, but :meth:`translate` is capable of | 
 | doing both tasks and will be faster than any regular expression operation can | 
 | be. | 
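 |  | 
 | As a sketch of the string-method route, the example below uses | 
 | ``str.maketrans()`` to build the tables that :meth:`translate` expects.  The | 
 | sample string and variable names are assumptions for illustration. | 
 |  | 
```python
import re

s = 'one\ntwo\nthree\n'

# Replace each newline with a space.
replaced = s.translate(str.maketrans('\n', ' '))
# Delete every newline: the third argument lists characters to remove.
deleted = s.translate(str.maketrans('', '', '\n'))

print(repr(replaced))  # 'one two three '
print(repr(deleted))   # 'onetwothree'

# The regular expression versions give the same results.
print(replaced == re.sub('\n', ' ', s))  # True
print(deleted == re.sub('\n', '', s))    # True
```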
 |  | 
 | In short, before turning to the :mod:`re` module, consider whether your problem | 
 | can be solved with a faster and simpler string method. | 
 |  | 
 |  | 
 | match() versus search() | 
 | ----------------------- | 
 |  | 
 | The :func:`match` function only checks if the RE matches at the beginning of the | 
 | string while :func:`search` will scan forward through the string for a match. | 
 | It's important to keep this distinction in mind.  Remember, :func:`match` only | 
 | reports a successful match that starts at position 0; if the match wouldn't | 
 | start at position 0, :func:`match` will *not* report it. :: | 
 |  | 
 |    >>> print(re.match('super', 'superstition').span()) | 
 |    (0, 5) | 
 |    >>> print(re.match('super', 'insuperable')) | 
 |    None | 
 |  | 
 | On the other hand, :func:`search` will scan forward through the string, | 
 | reporting the first match it finds. :: | 
 |  | 
 |    >>> print(re.search('super', 'superstition').span()) | 
 |    (0, 5) | 
 |    >>> print(re.search('super', 'insuperable').span()) | 
 |    (2, 7) | 
 |  | 
 | Sometimes you'll be tempted to keep using :func:`re.match`, and just add ``.*`` | 
 | to the front of your RE.  Resist this temptation and use :func:`re.search` | 
 | instead.  The regular expression compiler does some analysis of REs in order to | 
 | speed up the process of looking for a match.  One such analysis figures out what | 
 | the first character of a match must be; for example, a pattern starting with | 
 | ``Crow`` must match starting with a ``'C'``.  The analysis lets the engine | 
 | quickly scan through the string looking for the starting character, only trying | 
 | the full match if a ``'C'`` is found. | 
 |  | 
 | Adding ``.*`` defeats this optimization, requiring scanning to the end of the | 
 | string and then backtracking to find a match for the rest of the RE.  Use | 
 | :func:`re.search` instead. | 
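 |  | 
 | Apart from speed, the two approaches can also report different matches.  The | 
 | sketch below, which uses a sample string invented for illustration, shows that | 
 | the greedy ``.*`` makes :func:`re.match` end at the *last* occurrence, while | 
 | :func:`re.search` reports the first one:: | 
 |  | 
 |    import re | 
 |  | 
 |    line = 'A Crow and another Crow'   # sample text, invented for this sketch | 
 |  | 
 |    # search() scans forward and reports the first occurrence. | 
 |    m_search = re.search('Crow', line) | 
 |  | 
 |    # match() must start at position 0, and the greedy '.*' takes as much | 
 |    # text as it can, so the match ends at the last 'Crow'. | 
 |    m_match = re.match('.*Crow', line) | 
 |  | 
 |    print(m_search.span())   # (2, 6) | 
 |    print(m_match.span())    # (0, 23) | 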
 |  | 
 |  | 
 | Greedy versus Non-Greedy | 
 | ------------------------ | 
 |  | 
 | When repeating a regular expression, as in ``a*``, the resulting action is to | 
 | consume as much of the string as possible.  This fact often bites you when | 
 | you're trying to match a pair of balanced delimiters, such as the angle brackets | 
 | surrounding an HTML tag.  The naive pattern for matching a single HTML tag | 
 | doesn't work because of the greedy nature of ``.*``. :: | 
 |  | 
 |    >>> s = '<html><head><title>Title</title>' | 
 |    >>> len(s) | 
 |    32 | 
 |    >>> print(re.match('<.*>', s).span()) | 
 |    (0, 32) | 
 |    >>> print(re.match('<.*>', s).group()) | 
 |    <html><head><title>Title</title> | 
 |  | 
 | The RE matches the ``'<'`` in ``<html>``, and the ``.*`` consumes the rest of | 
 | the string.  There's still more left in the RE, though, and the ``>`` can't | 
 | match at the end of the string, so the regular expression engine has to | 
 | backtrack character by character until it finds a match for the ``>``.   The | 
 | final match extends from the ``'<'`` in ``<html>`` to the ``'>'`` in | 
 | ``</title>``, which isn't what you want. | 
 |  | 
 | In this case, the solution is to use the non-greedy quantifiers ``*?``, ``+?``, | 
 | ``??``, or ``{m,n}?``, which match as *little* text as possible.  In the above | 
 | example, the ``'>'`` is tried immediately after the first ``'<'`` matches, and | 
 | when it fails, the engine advances a character at a time, retrying the ``'>'`` | 
 | at every step.  This produces just the right result:: | 
 |  | 
 |    >>> print(re.match('<.*?>', s).group()) | 
 |    <html> | 
 |  | 
 | (Note that parsing HTML or XML with regular expressions is painful. | 
 | Quick-and-dirty patterns will handle common cases, but HTML and XML have special | 
 | cases that will break the obvious regular expression; by the time you've written | 
 | a regular expression that handles all of the possible cases, the patterns will | 
 | be *very* complicated.  Use an HTML or XML parser module for such tasks.) | 
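 |  | 
 | As one possible illustration of the parser approach, the sketch below uses the | 
 | standard library's :mod:`html.parser` module to list the start tags in the | 
 | string from the example above.  The ``TagCollector`` class is a hypothetical | 
 | helper written for this sketch, not part of any library:: | 
 |  | 
 |    from html.parser import HTMLParser | 
 |  | 
 |    class TagCollector(HTMLParser): | 
 |        """Hypothetical helper that records the name of every start tag.""" | 
 |  | 
 |        def __init__(self): | 
 |            super().__init__() | 
 |            self.tags = [] | 
 |  | 
 |        def handle_starttag(self, tag, attrs): | 
 |            # Called by the parser once for each opening tag it encounters. | 
 |            self.tags.append(tag) | 
 |  | 
 |    collector = TagCollector() | 
 |    collector.feed('<html><head><title>Title</title>') | 
 |    collector.close() | 
 |    print(collector.tags)   # ['html', 'head', 'title'] | 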
 |  | 
 |  | 
 | Using re.VERBOSE | 
 | ---------------- | 
 |  | 
 | By now you've probably noticed that regular expressions are a very compact | 
 | notation, but they're not terribly readable.  REs of moderate complexity can | 
 | become lengthy collections of backslashes, parentheses, and metacharacters, | 
 | making them difficult to read and understand. | 
 |  | 
 | For such REs, specifying the ``re.VERBOSE`` flag when compiling the regular | 
 | expression can be helpful, because it allows you to format the regular | 
 | expression more clearly. | 
 |  | 
 | The ``re.VERBOSE`` flag has several effects.  Whitespace in the regular | 
 | expression that *isn't* inside a character class is ignored.  This means that an | 
 | expression such as ``dog | cat`` is equivalent to the less readable ``dog|cat``, | 
 | but ``[a b]`` will still match the characters ``'a'``, ``'b'``, or a space.  In | 
 | addition, you can also put comments inside a RE; comments extend from a ``#`` | 
 | character to the next newline.  When used with triple-quoted strings, this | 
 | enables REs to be formatted more neatly:: | 
 |  | 
 |    pat = re.compile(r""" | 
 |     \s*                 # Skip leading whitespace | 
 |     (?P<header>[^:]+)   # Header name | 
 |     \s* :               # Whitespace, and a colon | 
 |     (?P<value>.*?)      # The header's value -- *? used to | 
 |                         # lose the following trailing whitespace | 
 |     \s*$                # Trailing whitespace to end-of-line | 
 |    """, re.VERBOSE) | 
 |  | 
 | This is far more readable than:: | 
 |  | 
 |    pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$") | 
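 |  | 
 | Both forms should behave identically.  As a quick check, the sketch below | 
 | applies each one to a sample header line invented for illustration (the names | 
 | ``verbose_pat``, ``compact_pat``, and ``line`` are chosen for this sketch):: | 
 |  | 
 |    import re | 
 |  | 
 |    verbose_pat = re.compile(r""" | 
 |     \s*                 # Skip leading whitespace | 
 |     (?P<header>[^:]+)   # Header name | 
 |     \s* :               # Whitespace, and a colon | 
 |     (?P<value>.*?)      # The header's value | 
 |     \s*$                # Trailing whitespace to end-of-line | 
 |    """, re.VERBOSE) | 
 |    compact_pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$") | 
 |  | 
 |    line = '  Subject: Hello world   '   # sample header, invented here | 
 |  | 
 |    # Each pattern yields the same groups.  The value keeps the space after | 
 |    # the colon, because nothing in the pattern skips it; only the trailing | 
 |    # whitespace is dropped. | 
 |    for pat in (verbose_pat, compact_pat): | 
 |        m = pat.match(line) | 
 |        print(m.group('header'), repr(m.group('value'))) | 
 |  | 
 | Each iteration prints ``Subject ' Hello world'``. | 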
 |  | 
 |  | 
 | Feedback | 
 | ======== | 
 |  | 
 | Regular expressions are a complicated topic.  Did this document help you | 
 | understand them?  Were there parts that were unclear, or problems you | 
 | encountered that weren't covered here?  If so, please send suggestions for | 
 | improvements to the author. | 
 |  | 
 | The most complete book on regular expressions is almost certainly Jeffrey | 
 | Friedl's *Mastering Regular Expressions*, published by O'Reilly.  Unfortunately, | 
 | it exclusively concentrates on Perl and Java's flavours of regular expressions, | 
 | and doesn't contain any Python material at all, so it won't be useful as a | 
 | reference for programming in Python.  (The first edition covered Python's | 
 | now-removed :mod:`regex` module, which won't help you much.)  Consider checking | 
 | it out from your library. | 
 |  |