:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

**Source code:** :source:`Lib/re.py`

--------------

This module provides regular expression matching operations similar to
those found in Perl.

Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a bytes pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.

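The type rule above can be checked with a short sketch. The sample text and
variable names are illustrative only:

```python
import re

# A str pattern searches str text; a bytes pattern searches bytes text.
str_match = re.search(r"\d+", "order 66")
bytes_match = re.search(rb"\d+", b"order 66")

# Mixing the two types is rejected with TypeError.
try:
    re.search(rb"\d+", "order 66")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(str_match.group(), bytes_match.group(), mixed_ok)
```
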
Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.

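As a small illustration of the two notations (the Windows-style path is just
sample data):

```python
import re

# r"\n" keeps the backslash: two characters, '\' and 'n'.
raw_len = len(r"\n")
# "\n" is a single newline character.
plain_len = len("\n")

# Both spellings denote the same two-character regex \\,
# which matches one literal backslash.
same_pattern = (r"\\" == "\\\\")
found = re.search(r"\\", "C:\\temp").group()

print(raw_len, plain_len, same_pattern, found)
```
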
Most regular expression operations are available both as module-level
functions and as methods on
:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.

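A minimal sketch of the difference, using the *pos* argument of the compiled
pattern's ``search()`` method as one example of such a fine-tuning parameter.
The sample text is illustrative:

```python
import re

text = "2021-03-04 and 2022-11-30"

# Module-level shortcut: the pattern is passed on each call.
first = re.search(r"\d{4}", text).group()

# Compiled object: search() also accepts a starting position.
year = re.compile(r"\d{4}")
later = year.search(text, 5).group()

print(first, later)
```
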
.. seealso::

   The third-party `regex <https://pypi.org/project/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contain low-precedence
operations, boundary conditions between *A* and *B*, or numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book [Frie09]_,
or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.


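The nesting rule can be sketched as follows. The variable names are
illustrative:

```python
import re

# A qualifier applied directly to another qualifier is rejected.
try:
    re.compile(r"a{6}*")
    nested_ok = True
except re.error:
    nested_ok = False

# Wrapping the inner repetition in a group makes it legal.
multiple_of_six = re.compile(r"(?:a{6})*")
twelve = multiple_of_six.fullmatch("a" * 12) is not None
seven = multiple_of_six.fullmatch("a" * 7) is not None

print(nested_ok, twelve, seven)
```
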
The special characters are:

``.``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``^``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``$``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both ``'foo'`` and ``'foobar'``, while the regular expression ``foo$``
   matches only ``'foo'``. More interestingly, searching for ``foo.$`` in
   ``'foo1\nfoo2\n'`` matches ``'foo2'`` normally, but ``'foo1'`` in
   :const:`MULTILINE` mode; searching for a single ``$`` in ``'foo\n'`` will
   find two (empty) matches: one just before the newline, and one at the end
   of the string.

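The ``$`` cases described above, written out as a short sketch:

```python
import re

text = "foo1\nfoo2\n"

# Default mode: $ matches only at the very end (or before a final newline).
default_hit = re.search(r"foo.$", text).group()

# MULTILINE mode: $ also matches before every newline.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

# A bare $ finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]

print(default_hit, multiline_hit, dollar_positions)
```
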
``*``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match ``'a'``, ``'ab'``, or
   ``'a'`` followed by any number of ``'b'``\ s.

``+``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match ``'a'`` followed by any non-zero number of ``'b'``\ s; it
   will not match just ``'a'``.

``?``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either ``'a'`` or ``'ab'``.

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
   string, and not just ``'<a>'``. Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using the RE ``<.*?>`` will match
   only ``'<a>'``.

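A brief sketch of greedy versus non-greedy matching on the string used above:

```python
import re

text = "<a> b <c>"

greedy = re.search(r"<.*>", text).group()   # as much as possible
lazy = re.search(r"<.*?>", text).group()    # as little as possible
all_tags = re.findall(r"<.*?>", text)

print(greedy, lazy, all_tags)
```
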
``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

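The counted qualifiers above behave as follows in a quick sketch:

```python
import re

# Greedy and non-greedy bounded repetition on six 'a' characters.
greedy = re.match(r"a{3,5}", "aaaaaa").group()
lazy = re.match(r"a{3,5}?", "aaaaaa").group()

# Omitting n leaves the upper bound open.
open_ended = re.fullmatch(r"a{4,}b", "aaaab") is not None
too_few = re.fullmatch(r"a{4,}b", "aaab") is not None

print(greedy, lazy, open_ended, too_few)
```
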
``\``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be doubled. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character
     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`ASCII` or :const:`LOCALE` mode is in force.

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, ``[()[\]{}]`` and
     ``[]()[{}]`` will both match a parenthesis.

   * Support of nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future. This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
     avoid a warning, escape them with a backslash.

   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/

   .. versionchanged:: 3.7
      :exc:`FutureWarning` is raised if a character set contains constructs
      that will change semantically in the future.

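A few of the set rules above, sketched with illustrative sample strings:

```python
import re

# Ranges: any hexadecimal digit.
hex_digits = re.findall(r"[0-9A-Fa-f]", "xyz 7F")

# A '-' placed first in the set is a literal hyphen.
literal_hyphen = re.findall(r"[-a]", "a-b")

# A leading '^' complements the set.
not_five = re.findall(r"[^5]", "1525")

# An escaped ']' is a literal bracket inside the set.
brackets = re.findall(r"[()[\]{}]", "f(x)[0]")

print(hex_digits, literal_hyphen, not_five, brackets)
```
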
``|``
   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
   will match either *A* or *B*. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once *A* matches, *B* will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

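The left-to-right behaviour of ``|`` can be seen in a short sketch:

```python
import re

# The first alternative that matches wins, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()
longer_first = re.match(r"ab|a", "ab").group()

# Inside a character class, '|' is an ordinary character.
literal_bar = re.findall(r"[|]", "a|b")

print(first_wins, longer_first, literal_bar)
```
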
``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?aiLmsux)``
   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
   for the entire regular expression.
   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flags* argument to the
   :func:`re.compile` function. Flags should be used first in the
   expression string.

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?aiLmsux-imsx:...)``
   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
   one or more letters from the set ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
   The letters set or remove the corresponding flags:
   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
   and :const:`re.X` (verbose), for that part of the expression.
   (The flags are described in :ref:`contents-of-module-re`.)

   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
   as inline flags, so they can't be combined or follow ``'-'``. Instead,
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default). In bytes patterns ``(?L:...)`` switches to locale-dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.

   .. versionadded:: 3.6

   .. versionchanged:: 3.7
      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.

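A short sketch contrasting a whole-pattern inline flag with a scoped one. It
assumes Python 3.6 or later, and the sample strings are illustrative:

```python
import re

# (?i:...) applies IGNORECASE to "hello" only.
scoped = re.compile(r"(?i:hello) World")
scoped_upper = scoped.fullmatch("HELLO World") is not None
scoped_lower_w = scoped.fullmatch("HELLO world") is not None

# (?i) at the start applies the flag to the whole expression.
whole = re.compile(r"(?i)hello World")
whole_lower_w = whole.fullmatch("HELLO world") is not None

# (?-i:...) switches the flag off again inside the group.
removed = re.compile(r"(?i)hello (?-i:World)")
removed_lower_w = removed.fullmatch("HELLO world") is not None

print(scoped_upper, scoped_lower_w, whole_lower_w, removed_lower_w)
```
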
``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object *m*      | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the *repl*      | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

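The three reference contexts from the table can be exercised together. The
sample sentence and the replacement text ``X`` are illustrative:

```python
import re

# In the pattern itself: (?P=quote) refers back to the opening quote.
pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")

# On the match object: by name or by number.
m = pattern.search('say "hi" now')
whole = m.group()
by_name = m.group("quote")
by_number = m.group(1)
quote_end = m.end("quote")

# In a replacement string: \g<quote> inserts the group's text.
replaced = pattern.sub(r"\g<quote>X\g<quote>", 'say "hi" now')

print(whole, by_name, by_number, quote_end, replaced)
```
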
``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a :dfn:`lookahead assertion`. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next. This is a :dfn:`negative lookahead assertion`.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

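Both lookahead forms, sketched with the example names above plus one
illustrative counterexample:

```python
import re

# Positive lookahead: 'Asimov' must follow but is not consumed.
positive = re.search(r"Isaac (?=Asimov)", "Isaac Asimov").group()
positive_miss = re.search(r"Isaac (?=Asimov)", "Isaac Newton")

# Negative lookahead: 'Asimov' must not follow.
negative = re.search(r"Isaac (?!Asimov)", "Isaac Newton").group()
negative_miss = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")

print(repr(positive), positive_miss, repr(negative), negative_miss)
```
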
``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

   .. versionchanged:: 3.5
      Added support for group references of fixed length.

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length. Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

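A sketch of the negative lookbehind and of the fixed-length restriction. The
sample text is illustrative:

```python
import re

# Words that are not immediately preceded by a hyphen.
# 'spam' starts the string, so the assertion succeeds there too.
not_after_hyphen = re.findall(r"(?<!-)\b\w+", "spam-egg ham")

# A variable-length pattern inside a lookbehind is rejected.
try:
    re.compile(r"(?<=a*)b")
    variable_ok = True
except re.error:
    variable_ok = False

print(not_after_hyphen, variable_ok)
```
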
``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or
   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
   optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.


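The conditional pattern above can be checked against its four example
strings. ``match()`` is used here so that every attempt is anchored at the
start of the string:

```python
import re

email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = [
    "<user@host.com>",
    "user@host.com",
    "<user@host.com",
    "user@host.com>",
]
# Map each candidate to whether the pattern matches it.
results = {s: email.match(s) is not None for s in candidates}

print(results)
```
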
The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character. For example, ``\$`` matches the
character ``'$'``.

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

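A sketch of numbered backreferences, including one illustrative use for
collapsing a doubled word:

```python
import re

repeated = re.compile(r"(.+) \1")
samples = ["the the", "55 55", "thethe"]
hits = [s for s in samples if repeated.search(s)]

# \1 in the pattern finds the repeat; \1 in the replacement keeps one copy.
collapsed = re.sub(r"\b(\w+) \1\b", r"\1", "it is is fine")

print(hits, collapsed)
```
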
``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word.
   A word is defined as a sequence of word characters. Note that formally,
   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
   (or vice versa), or between ``\w`` and the beginning/end of the string.
   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.

   By default Unicode alphanumerics are the ones used in Unicode patterns, but
   this can be changed by using the :const:`ASCII` flag. Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end
   of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
   patterns are Unicode alphanumerics or the underscore, although this can
   be changed by using the :const:`ASCII` flag. Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.

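The ``\b`` and ``\B`` examples above can be run directly:

```python
import re

word = re.compile(r"\bfoo\b")
word_samples = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
word_hits = [s for s in word_samples if word.search(s)]

inside = re.compile(r"py\B")
inside_samples = ["python", "py3", "py2", "py", "py.", "py!"]
inside_hits = [s for s in inside_samples if inside.search(s)]

# Inside a character class, \b means backspace (U+0008).
backspace = re.fullmatch(r"[\b]", "\x08") is not None

print(word_hits, inside_hits, backspace)
```
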
``\d``
   For Unicode (str) patterns:
      Matches any Unicode decimal digit (that is, any character in
      Unicode character category [Nd]). This includes ``[0-9]``, and
      also many other digit characters. If the :const:`ASCII` flag is
      used only ``[0-9]`` is matched.

   For 8-bit (bytes) patterns:
      Matches any decimal digit; this is equivalent to ``[0-9]``.

``\D``
   Matches any character which is not a decimal digit. This is
   the opposite of ``\d``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^0-9]``.

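A sketch of how ``\d`` differs between Unicode and ASCII matching. The
ARABIC-INDIC DIGIT THREE character is just one sample from category Nd:

```python
import re

arabic_indic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, category Nd

# Unicode matching (the default for str patterns) accepts it.
unicode_digit = re.fullmatch(r"\d", arabic_indic_three) is not None

# With the ASCII flag only [0-9] is accepted.
ascii_digit = re.fullmatch(r"\d", arabic_indic_three, re.ASCII) is not None

# Bytes patterns always behave like [0-9].
bytes_digits = re.findall(rb"\d", b"a1b2")

print(unicode_digit, ascii_digit, bytes_digits)
```
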
``\s``
   For Unicode (str) patterns:
      Matches Unicode whitespace characters (which includes
      ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many
      languages). If the :const:`ASCII` flag is used, only
      ``[ \t\n\r\f\v]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
      this is equivalent to ``[ \t\n\r\f\v]``.

``\S``
   Matches any character which is not a whitespace character. This is
   the opposite of ``\s``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^ \t\n\r\f\v]``.

``\w``
   For Unicode (str) patterns:
      Matches Unicode word characters; this includes most characters
      that can be part of a word in any language, as well as numbers and
      the underscore. If the :const:`ASCII` flag is used, only
      ``[a-zA-Z0-9_]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered alphanumeric in the ASCII character set;
      this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
      used, matches characters considered alphanumeric in the current locale
      and the underscore.

``\W``
   Matches any character which is not a word character. This is
   the opposite of ``\w``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
   used, matches characters which are neither alphanumeric in the current
   locale nor the underscore.

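A sketch of ``\w`` and ``\W`` under Unicode and ASCII matching. The sample
text is illustrative and is written with escapes to keep the source ASCII:

```python
import re

text = "na\u00efve caf\u00e9_1"  # contains two accented letters

# Unicode matching treats accented letters as word characters.
unicode_words = re.findall(r"\w+", text)

# ASCII matching splits the words at the accented letters.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# \W is the complement of \w.
non_word = re.findall(r"\W", "a-b c")

print(unicode_words, ascii_words, non_word)
```
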
``\Z``
   Matches only at the end of the string.

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \N      \r      \t      \u
   \U      \v      \x      \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
patterns. In bytes patterns they are errors.

Octal escapes are included in a limited form. If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

Antoine Pitrou463badf2012-06-23 13:29:19 +0200485.. versionchanged:: 3.3
486 The ``'\u'`` and ``'\U'`` escape sequences have been added.
487
Serhiy Storchaka9bd85b82016-06-11 19:15:00 +0300488.. versionchanged:: 3.6
   Unknown escapes consisting of ``'\'`` and an ASCII letter are now errors.
Serhiy Storchakaa54aae02015-03-24 22:58:14 +0200490
Serhiy Storchakaa445feb2018-02-10 00:08:17 +0200491.. versionchanged:: 3.8
492 The ``'\N{name}'`` escape sequence has been added. As in string literals,
493 it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).
Antoine Pitrou463badf2012-06-23 13:29:19 +0200494
Georg Brandl116aa622007-08-15 14:28:22 +0000495
Georg Brandl116aa622007-08-15 14:28:22 +0000496.. _contents-of-module-re:
497
498Module Contents
499---------------
500
The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions. Most non-trivial applications use the compiled form.
505
Ethan Furmanc88c80b2016-11-21 08:29:31 -0800506.. versionchanged:: 3.6
507 Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
508 :class:`enum.IntFlag`.
Georg Brandl116aa622007-08-15 14:28:22 +0000509
Georg Brandl18244152009-09-02 20:34:52 +0000510.. function:: compile(pattern, flags=0)
Georg Brandl116aa622007-08-15 14:28:22 +0000511
   Compile a regular expression pattern into a :ref:`regular expression object
   <re-objects>`, which can be used for matching using its
   :meth:`~Pattern.match`, :meth:`~Pattern.search` and other methods, described
   below.
Georg Brandl116aa622007-08-15 14:28:22 +0000516
517 The expression's behaviour can be modified by specifying a *flags* value.
518 Values can be any of the following variables, combined using bitwise OR (the
519 ``|`` operator).
520
521 The sequence ::
522
      prog = re.compile(pattern)
      result = prog.match(string)
Georg Brandl116aa622007-08-15 14:28:22 +0000525
526 is equivalent to ::
527
      result = re.match(pattern, string)
Georg Brandl116aa622007-08-15 14:28:22 +0000529
Georg Brandlf346ac02009-07-26 15:03:49 +0000530 but using :func:`re.compile` and saving the resulting regular expression
531 object for reuse is more efficient when the expression will be used several
532 times in a single program.
Georg Brandl116aa622007-08-15 14:28:22 +0000533
Gregory P. Smith4221c742009-03-02 05:04:04 +0000534 .. note::
535
536 The compiled versions of the most recent patterns passed to
Serhiy Storchaka32eddc12013-11-23 23:20:30 +0200537 :func:`re.compile` and the module-level matching functions are cached, so
Gregory P. Smith4221c742009-03-02 05:04:04 +0000538 programs that use only a few regular expressions at a time needn't worry
539 about compiling regular expressions.
Georg Brandl116aa622007-08-15 14:28:22 +0000540
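A minimal sketch of compiling once with combined flags and reusing the
resulting object. The sample text and the names ``line_start_error`` and
``log_lines`` are assumptions made for this example.

```python
import re

# Combine flags with bitwise OR; compile once, then reuse the pattern object.
line_start_error = re.compile(r"^error\b", re.IGNORECASE | re.MULTILINE)

log_lines = "Error: disk full\nwarning: low memory\nERROR: timeout\n"
hits = [m.group(0) for m in line_start_error.finditer(log_lines)]

# The module-level function accepts the same pattern and flags.
same_hits = re.findall(r"^error\b", log_lines, re.IGNORECASE | re.MULTILINE)

print(hits)
print(same_hits)
```

Both calls find the same two occurrences; the compiled form avoids looking the
pattern up again on every call.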
541
Antoine Pitroufd036452008-08-19 17:56:33 +0000542.. data:: A
543 ASCII
544
Georg Brandl4049ce02009-06-08 07:49:54 +0000545 Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
546 perform ASCII-only matching instead of full Unicode matching. This is only
547 meaningful for Unicode patterns, and is ignored for byte patterns.
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300548 Corresponds to the inline flag ``(?a)``.
Antoine Pitroufd036452008-08-19 17:56:33 +0000549
Mark Summerfield6c4f6172008-08-20 07:34:41 +0000550 Note that for backward compatibility, the :const:`re.U` flag still
551 exists (as well as its synonym :const:`re.UNICODE` and its embedded
Georg Brandlebeb44d2010-07-29 11:15:36 +0000552 counterpart ``(?u)``), but these are redundant in Python 3 since
Mark Summerfield6c4f6172008-08-20 07:34:41 +0000553 matches are Unicode by default for strings (and Unicode matching
554 isn't allowed for bytes).
Georg Brandl48310cd2009-01-03 21:18:54 +0000555
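The effect of :const:`ASCII` on ``\w`` can be seen in a short sketch. The
sample text is made up; the non-ASCII letters are written as escapes so that
the example does not depend on the source encoding.

```python
import re

text = "na\u00efve caf\u00e9 42"

# Default for str patterns: \w matches any Unicode word character.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII (or the inline flag (?a)), \w is limited to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)
inline_words = re.findall(r"(?a)\w+", text)

print(unicode_words)
print(ascii_words)
```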
Antoine Pitroufd036452008-08-19 17:56:33 +0000556
Sandro Tosida785fd2012-01-01 12:55:20 +0100557.. data:: DEBUG
558
   Display debug information about the compiled expression.
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300560 No corresponding inline flag.
Sandro Tosida785fd2012-01-01 12:55:20 +0100561
562
Georg Brandl116aa622007-08-15 14:28:22 +0000563.. data:: I
564 IGNORECASE
565
Brian Wardc9d6dbc2017-05-24 00:03:38 -0700566 Perform case-insensitive matching; expressions like ``[A-Z]`` will also
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300567 match lowercase letters. Full Unicode matching (such as ``Ü`` matching
568 ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
569 non-ASCII matches. The current locale does not change the effect of this
570 flag unless the :const:`re.LOCALE` flag is also used.
571 Corresponds to the inline flag ``(?i)``.
Georg Brandl116aa622007-08-15 14:28:22 +0000572
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300573 Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
574 combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
575 letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
576 letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
577 'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
578 If the :const:`ASCII` flag is used, only letters 'a' to 'z'
Serhiy Storchaka3557b052017-10-24 23:31:42 +0300579 and 'A' to 'Z' are matched.
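
The note about additional non-ASCII letters can be checked with a small
sketch. Only two of the four characters listed above are tested here; the
variable names are chosen for this example.

```python
import re

kelvin_sign = "\u212a"   # KELVIN SIGN, renders like a capital K
long_s = "\u017f"        # LATIN SMALL LETTER LONG S

# Unicode case-insensitive matching: both characters fall inside [a-z].
unicode_kelvin = re.fullmatch(r"[a-z]", kelvin_sign, re.IGNORECASE)
unicode_long_s = re.fullmatch(r"[a-z]", long_s, re.IGNORECASE)

# With re.ASCII added, only the 52 ASCII letters are matched.
ascii_kelvin = re.fullmatch(r"[a-z]", kelvin_sign, re.IGNORECASE | re.ASCII)

print(unicode_kelvin is not None, unicode_long_s is not None, ascii_kelvin)
```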
Georg Brandl116aa622007-08-15 14:28:22 +0000580
581.. data:: L
582 LOCALE
583
   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
   dependent on the current locale. This flag can be used only with bytes
   patterns. The use of this flag is discouraged because the locale mechanism
   is very unreliable: it handles only one "culture" at a time and works only
   with 8-bit locales. Unicode matching is already enabled by default
   in Python 3 for Unicode (str) patterns, and it is able to handle different
   locales/languages.
   Corresponds to the inline flag ``(?L)``.
Serhiy Storchaka22a309a2014-12-01 11:50:07 +0200592
Serhiy Storchaka9bd85b82016-06-11 19:15:00 +0300593 .. versionchanged:: 3.6
594 :const:`re.LOCALE` can be used only with bytes patterns and is
595 not compatible with :const:`re.ASCII`.
Georg Brandl116aa622007-08-15 14:28:22 +0000596
Serhiy Storchaka898ff032017-05-05 08:53:40 +0300597 .. versionchanged:: 3.7
598 Compiled regular expression objects with the :const:`re.LOCALE` flag no
599 longer depend on the locale at compile time. Only the locale at
600 matching time affects the result of matching.
601
Georg Brandl116aa622007-08-15 14:28:22 +0000602
603.. data:: M
604 MULTILINE
605
606 When specified, the pattern character ``'^'`` matches at the beginning of the
607 string and at the beginning of each line (immediately following each newline);
608 and the pattern character ``'$'`` matches at the end of the string and at the
609 end of each line (immediately preceding each newline). By default, ``'^'``
610 matches only at the beginning of the string, and ``'$'`` only at the end of the
611 string and immediately before the newline (if any) at the end of the string.
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300612 Corresponds to the inline flag ``(?m)``.
Georg Brandl116aa622007-08-15 14:28:22 +0000613
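A short sketch of how :const:`MULTILINE` changes ``'^'`` and ``'$'``. The
two-line sample text is an assumption made for this example.

```python
import re

text = "first line\nsecond line\n"

# Default: ^ matches only at the very start of the string, and $ only at
# the end or just before the final newline.
default_starts = re.findall(r"^\w+", text)
default_ends = re.findall(r"\w+$", text)

# MULTILINE: ^ and $ also match at the start and end of every line.
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)
multiline_ends = re.findall(r"\w+$", text, re.MULTILINE)

print(default_starts, default_ends)
print(multiline_starts, multiline_ends)
```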
614
615.. data:: S
616 DOTALL
617
618 Make the ``'.'`` special character match any character at all, including a
619 newline; without this flag, ``'.'`` will match anything *except* a newline.
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300620 Corresponds to the inline flag ``(?s)``.
Georg Brandl116aa622007-08-15 14:28:22 +0000621
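The difference made by :const:`DOTALL` shows up as soon as the text to be
matched spans a newline. The markup-like sample string below is made up.

```python
import re

text = "<p>one\ntwo</p>"

# Without DOTALL, '.' stops at the newline, so the pattern cannot match.
without_dotall = re.search(r"<p>(.*)</p>", text)

# With DOTALL, '.' also matches the newline.
with_dotall = re.search(r"<p>(.*)</p>", text, re.DOTALL)

print(without_dotall)
print(repr(with_dotall.group(1)))
```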
622
Georg Brandl116aa622007-08-15 14:28:22 +0000623.. data:: X
624 VERBOSE
625
Zachary Ware71a0b432015-11-11 23:32:14 -0600626 This flag allows you to write regular expressions that look nicer and are
627 more readable by allowing you to visually separate logical sections of the
628 pattern and add comments. Whitespace within the pattern is ignored, except
Serhiy Storchakab0b44b42017-11-14 17:21:26 +0200629 when in a character class, or when preceded by an unescaped backslash,
630 or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
Zachary Ware71a0b432015-11-11 23:32:14 -0600631 When a line contains a ``#`` that is not in a character class and is not
632 preceded by an unescaped backslash, all characters from the leftmost such
633 ``#`` through the end of the line are ignored.
Georg Brandl116aa622007-08-15 14:28:22 +0000634
Zachary Ware71a0b432015-11-11 23:32:14 -0600635 This means that the two following regular expression objects that match a
Christian Heimesb9eccbf2007-12-05 20:18:38 +0000636 decimal number are functionally equal::
Georg Brandl81ac1ce2007-08-31 17:17:17 +0000637
      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")
Georg Brandl116aa622007-08-15 14:28:22 +0000642
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300643 Corresponds to the inline flag ``(?x)``.
Antoine Pitroufd036452008-08-19 17:56:33 +0000644
645
Georg Brandlc62a7042010-07-29 11:49:05 +0000646.. function:: search(pattern, string, flags=0)
Georg Brandl116aa622007-08-15 14:28:22 +0000647
Terry Jan Reedy0edb5c12014-05-30 16:19:59 -0400648 Scan through *string* looking for the first location where the regular expression
Georg Brandlc62a7042010-07-29 11:49:05 +0000649 *pattern* produces a match, and return a corresponding :ref:`match object
650 <match-objects>`. Return ``None`` if no position in the string matches the
651 pattern; note that this is different from finding a zero-length match at some
652 point in the string.
Georg Brandl116aa622007-08-15 14:28:22 +0000653
654
Georg Brandl18244152009-09-02 20:34:52 +0000655.. function:: match(pattern, string, flags=0)
Georg Brandl116aa622007-08-15 14:28:22 +0000656
657 If zero or more characters at the beginning of *string* match the regular
Georg Brandlc62a7042010-07-29 11:49:05 +0000658 expression *pattern*, return a corresponding :ref:`match object
659 <match-objects>`. Return ``None`` if the string does not match the pattern;
660 note that this is different from a zero-length match.
Georg Brandl116aa622007-08-15 14:28:22 +0000661
Ezio Melotti443f0002012-02-29 13:39:05 +0200662 Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
663 at the beginning of the string and not at the beginning of each line.
Georg Brandl116aa622007-08-15 14:28:22 +0000664
Ezio Melotti443f0002012-02-29 13:39:05 +0200665 If you want to locate a match anywhere in *string*, use :func:`search`
666 instead (see also :ref:`search-vs-match`).
Georg Brandl116aa622007-08-15 14:28:22 +0000667
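The points made for :func:`match` can be sketched as follows, using a made-up
two-line string. The last call shows that a zero-length match is still a match
object rather than ``None``.

```python
import re

text = "alpha\nbeta"

at_start = re.match(r"alpha", text)

# Even with MULTILINE, match() only tries the very beginning of the string.
second_line_match = re.match(r"^beta", text, re.MULTILINE)
second_line_search = re.search(r"^beta", text, re.MULTILINE)

# "x*" matches zero characters at position 0: a match object, not None.
empty = re.match(r"x*", text)

print(at_start is not None, second_line_match, second_line_search is not None)
print(empty.span())
```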
668
Serhiy Storchaka32eddc12013-11-23 23:20:30 +0200669.. function:: fullmatch(pattern, string, flags=0)
670
671 If the whole *string* matches the regular expression *pattern*, return a
672 corresponding :ref:`match object <match-objects>`. Return ``None`` if the
673 string does not match the pattern; note that this is different from a
674 zero-length match.
675
676 .. versionadded:: 3.4
677
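A small sketch contrasting :func:`fullmatch` with :func:`match`. The
identifier-like pattern and the sample strings are assumptions made for this
example.

```python
import re

identifier = r"[A-Za-z_]\w*"

whole = re.fullmatch(identifier, "total_1")
trailing_junk = re.fullmatch(identifier, "total_1!")

# match() only anchors at the start, so it accepts a matching prefix.
prefix_only = re.match(identifier, "total_1!")

print(whole is not None, trailing_junk, prefix_only.group(0))
```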
678
Georg Brandl18244152009-09-02 20:34:52 +0000679.. function:: split(pattern, string, maxsplit=0, flags=0)
Georg Brandl116aa622007-08-15 14:28:22 +0000680
   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. ::
Georg Brandl116aa622007-08-15 14:28:22 +0000686
      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', maxsplit=1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']
Georg Brandl116aa622007-08-15 14:28:22 +0000695
Christian Heimesdd15f6c2008-03-16 00:07:10 +0000696 If there are capturing groups in the separator and it matches at the start of
697 the string, the result will start with an empty string. The same holds for
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300698 the end of the string::
Christian Heimesdd15f6c2008-03-16 00:07:10 +0000699
      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']
702
703 That way, separator components are always found at the same relative
Raymond Hettinger5768e0c2011-10-19 14:10:07 -0700704 indices within the result list.
Christian Heimesdd15f6c2008-03-16 00:07:10 +0000705
Serhiy Storchakafbb490f2018-01-04 11:06:13 +0200706 Empty matches for the pattern split the string only when not adjacent
707 to a previous empty match.
Thomas Wouters89d996e2007-09-08 17:39:28 +0000708
      >>> re.split(r'\b', 'Words, words, words.')
      ['', 'Words', ', ', 'words', ', ', 'words', '.']
      >>> re.split(r'\W*', '...words...')
      ['', '', 'w', 'o', 'r', 'd', 's', '', '']
      >>> re.split(r'(\W*)', '...words...')
      ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
Georg Brandl116aa622007-08-15 14:28:22 +0000715
Jeroen Ruigrok van der Wervenb70ccc32009-04-27 08:07:12 +0000716 .. versionchanged:: 3.1
Gregory P. Smithccc5ae72009-03-02 05:21:55 +0000717 Added the optional flags argument.
718
Serhiy Storchaka70d56fb2017-12-04 14:29:05 +0200719 .. versionchanged:: 3.7
      Added support for splitting on a pattern that could match an empty string.
721
Christian Heimesdd15f6c2008-03-16 00:07:10 +0000722
Georg Brandl18244152009-09-02 20:34:52 +0000723.. function:: findall(pattern, string, flags=0)
Georg Brandl116aa622007-08-15 14:28:22 +0000724
Georg Brandl9afde1c2007-11-01 20:32:30 +0000725 Return all non-overlapping matches of *pattern* in *string*, as a list of
Georg Brandl3dbca812008-07-23 16:10:53 +0000726 strings. The *string* is scanned left-to-right, and matches are returned in
727 the order found. If one or more groups are present in the pattern, return a
728 list of groups; this will be a list of tuples if the pattern has more than
Serhiy Storchaka70d56fb2017-12-04 14:29:05 +0200729 one group. Empty matches are included in the result.
730
731 .. versionchanged:: 3.7
732 Non-empty matches can now start just after a previous empty match.
Georg Brandl116aa622007-08-15 14:28:22 +0000733
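How the shape of the :func:`findall` result depends on the number of groups
can be sketched with one made-up input string and three variants of the same
pattern.

```python
import re

text = "width=80, height=24"

no_groups = re.findall(r"\w+=\d+", text)        # list of whole matches
one_group = re.findall(r"\w+=(\d+)", text)      # list of that group's text
two_groups = re.findall(r"(\w+)=(\d+)", text)   # list of tuples of groups

print(no_groups)
print(one_group)
print(two_groups)
```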
Georg Brandl116aa622007-08-15 14:28:22 +0000734
Georg Brandl18244152009-09-02 20:34:52 +0000735.. function:: finditer(pattern, string, flags=0)
Georg Brandl116aa622007-08-15 14:28:22 +0000736
Georg Brandlc62a7042010-07-29 11:49:05 +0000737 Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
738 all non-overlapping matches for the RE *pattern* in *string*. The *string*
739 is scanned left-to-right, and matches are returned in the order found. Empty
Serhiy Storchaka70d56fb2017-12-04 14:29:05 +0200740 matches are included in the result.
741
742 .. versionchanged:: 3.7
743 Non-empty matches can now start just after a previous empty match.
Georg Brandl116aa622007-08-15 14:28:22 +0000744
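Because :func:`finditer` yields match objects rather than strings, positions
are available as well as text. A minimal sketch, with a made-up input string:

```python
import re

text = "width=80, height=24"

# Collect each key together with the span of its numeric value.
value_spans = [
    (m.group(1), m.start(2), m.end(2))
    for m in re.finditer(r"(\w+)=(\d+)", text)
]

print(value_spans)
```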
Georg Brandl116aa622007-08-15 14:28:22 +0000745
Georg Brandl18244152009-09-02 20:34:52 +0000746.. function:: sub(pattern, repl, string, count=0, flags=0)
Georg Brandl116aa622007-08-15 14:28:22 +0000747
748 Return the string obtained by replacing the leftmost non-overlapping occurrences
749 of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
750 *string* is returned unchanged. *repl* can be a string or a function; if it is
751 a string, any backslash escapes in it are processed. That is, ``\n`` is
Sandro Tosi6a633bb2011-08-19 22:54:50 +0200752 converted to a single newline character, ``\r`` is converted to a carriage return, and
Serhiy Storchakaa54aae02015-03-24 22:58:14 +0200753 so forth. Unknown escapes such as ``\&`` are left alone. Backreferences, such
Georg Brandl116aa622007-08-15 14:28:22 +0000754 as ``\6``, are replaced with the substring matched by group 6 in the pattern.
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300755 For example::
Georg Brandl116aa622007-08-15 14:28:22 +0000756
      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'
761
762 If *repl* is a function, it is called for every non-overlapping occurrence of
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300763 *pattern*. The function takes a single :ref:`match object <match-objects>`
764 argument, and returns the replacement string. For example::
Georg Brandl116aa622007-08-15 14:28:22 +0000765
      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'
Georg Brandl116aa622007-08-15 14:28:22 +0000773
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300774 The pattern may be a string or a :ref:`pattern object <re-objects>`.
Georg Brandl116aa622007-08-15 14:28:22 +0000775
776 The optional argument *count* is the maximum number of pattern occurrences to be
777 replaced; *count* must be a non-negative integer. If omitted or zero, all
778 occurrences will be replaced. Empty matches for the pattern are replaced only
Serhiy Storchakafbb490f2018-01-04 11:06:13 +0200779 when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
780 ``'-a-b--d-'``.
Georg Brandl116aa622007-08-15 14:28:22 +0000781
Georg Brandl3c6780c62013-10-06 12:08:14 +0200782 In string-type *repl* arguments, in addition to the character escapes and
783 backreferences described above,
Georg Brandl116aa622007-08-15 14:28:22 +0000784 ``\g<name>`` will use the substring matched by the group named ``name``, as
785 defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
786 group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
787 in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
788 reference to group 20, not a reference to group 2 followed by the literal
789 character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
790 substring matched by the RE.
791
Jeroen Ruigrok van der Wervenb70ccc32009-04-27 08:07:12 +0000792 .. versionchanged:: 3.1
Gregory P. Smithccc5ae72009-03-02 05:21:55 +0000793 Added the optional flags argument.
Georg Brandl116aa622007-08-15 14:28:22 +0000794
Serhiy Storchaka7438e4b2014-10-10 11:06:31 +0300795 .. versionchanged:: 3.5
796 Unmatched groups are replaced with an empty string.
797
Serhiy Storchaka9bd85b82016-06-11 19:15:00 +0300798 .. versionchanged:: 3.6
      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
      are now errors.
801
Serhiy Storchakaff3dbe92016-12-06 19:25:19 +0200802 .. versionchanged:: 3.7
      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
      are now errors.
Serhiy Storchakaa54aae02015-03-24 22:58:14 +0200805
Serhiy Storchakafbb490f2018-01-04 11:06:13 +0200806 Empty matches for the pattern are replaced when adjacent to a previous
807 non-empty match.
808
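The replacement syntax described for :func:`sub` can be sketched as follows.
The date string and group names are assumptions made for this example; the
last call repeats the ``'abxd'`` case from the text above and reflects the
behaviour documented for Python 3.7 and later.

```python
import re

date = "2018-02-10"
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"

# Named and numbered references give the same result here.
by_name = re.sub(pattern, r"\g<day>/\g<month>/\g<year>", date)
by_number = re.sub(pattern, r"\3/\2/\1", date)

# \g<2>0 is group 2 followed by a literal "0"; \20 would mean group 20.
padded = re.sub(r"(\w)(\d)", r"\g<2>0", "a7")

# \g<0> inserts the entire matched substring.
bracketed = re.sub(r"\d+", r"[\g<0>]", "room 101")

# Empty matches are replaced unless adjacent to a previous empty match.
dashes = re.sub("x*", "-", "abxd")

print(by_name, by_number, padded, bracketed, dashes)
```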
Gregory P. Smithccc5ae72009-03-02 05:21:55 +0000809
Georg Brandl18244152009-09-02 20:34:52 +0000810.. function:: subn(pattern, repl, string, count=0, flags=0)
Georg Brandl116aa622007-08-15 14:28:22 +0000811
812 Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
813 number_of_subs_made)``.
814
Jeroen Ruigrok van der Wervenb70ccc32009-04-27 08:07:12 +0000815 .. versionchanged:: 3.1
Gregory P. Smithccc5ae72009-03-02 05:21:55 +0000816 Added the optional flags argument.
817
Serhiy Storchaka7438e4b2014-10-10 11:06:31 +0300818 .. versionchanged:: 3.5
819 Unmatched groups are replaced with an empty string.
820
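A brief sketch of the tuple returned by :func:`subn`, with and without a
*count* limit. The sample string is made up.

```python
import re

text = "too   many    spaces"

# Replace every run of whitespace and report how many runs were replaced.
new_text, number_of_subs = re.subn(r"\s+", " ", text)

# With count=1 only the first occurrence is replaced.
limited_text, limited_subs = re.subn(r"\s+", " ", text, count=1)

print(new_text, number_of_subs)
print(limited_text, limited_subs)
```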
Georg Brandl116aa622007-08-15 14:28:22 +0000821
Serhiy Storchaka8fc7bc22017-04-13 19:17:36 +0300822.. function:: escape(pattern)
Georg Brandl116aa622007-08-15 14:28:22 +0000823
Serhiy Storchaka59083002017-04-13 21:06:43 +0300824 Escape special characters in *pattern*.
Ezio Melotti88fdeb42011-04-10 12:59:16 +0300825 This is useful if you want to match an arbitrary literal string that may
Serhiy Storchaka8fc7bc22017-04-13 19:17:36 +0300826 have regular expression metacharacters in it. For example::
827
      >>> print(re.escape('python.exe'))
      python\.exe

      >>> import string
      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
      >>> print('[%s]+' % re.escape(legal_chars))
      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+

      >>> operators = ['+', '-', '*', '/', '**']
      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
      /|\-|\+|\*\*|\*
Ezio Melotti88fdeb42011-04-10 12:59:16 +0300838
   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`; only backslashes should be escaped. For example::
841
      >>> digits_re = r'\d+'
      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
      /usr/sbin/sendmail - \d+ errors, \d+ warnings
846
Ezio Melotti88fdeb42011-04-10 12:59:16 +0300847 .. versionchanged:: 3.3
848 The ``'_'`` character is no longer escaped.
Georg Brandl116aa622007-08-15 14:28:22 +0000849
Serhiy Storchaka59083002017-04-13 21:06:43 +0300850 .. versionchanged:: 3.7
851 Only characters that can have special meaning in a regular expression
852 are escaped.
853
Georg Brandl116aa622007-08-15 14:28:22 +0000854
R. David Murray522c32a2010-07-10 14:23:36 +0000855.. function:: purge()
856
857 Clear the regular expression cache.
858
859
Serhiy Storchakaad446d52014-11-10 13:49:00 +0200860.. exception:: error(msg, pattern=None, pos=None)
Georg Brandl116aa622007-08-15 14:28:22 +0000861
862 Exception raised when a string passed to one of the functions here is not a
863 valid regular expression (for example, it might contain unmatched parentheses)
864 or when some other error occurs during compilation or matching. It is never an
Serhiy Storchakaad446d52014-11-10 13:49:00 +0200865 error if a string contains no match for a pattern. The error instance has
866 the following additional attributes:
Georg Brandl116aa622007-08-15 14:28:22 +0000867
Serhiy Storchakaad446d52014-11-10 13:49:00 +0200868 .. attribute:: msg
869
870 The unformatted error message.
871
872 .. attribute:: pattern
873
874 The regular expression pattern.
875
876 .. attribute:: pos
877
Serhiy Storchaka12d6b5d2017-05-27 16:12:48 +0300878 The index in *pattern* where compilation failed (may be ``None``).
Serhiy Storchakaad446d52014-11-10 13:49:00 +0200879
880 .. attribute:: lineno
881
Serhiy Storchaka12d6b5d2017-05-27 16:12:48 +0300882 The line corresponding to *pos* (may be ``None``).
Serhiy Storchakaad446d52014-11-10 13:49:00 +0200883
884 .. attribute:: colno
885
Serhiy Storchaka12d6b5d2017-05-27 16:12:48 +0300886 The column corresponding to *pos* (may be ``None``).
Serhiy Storchakaad446d52014-11-10 13:49:00 +0200887
888 .. versionchanged:: 3.5
889 Added additional attributes.
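
The attributes of the exception can be inspected after a failed compilation.
The following sketch uses a deliberately invalid, made-up pattern; the exact
wording of ``msg`` is not guaranteed and is only printed here.

```python
import re

bad_pattern = "a(b"   # the parenthesis is never closed

try:
    re.compile(bad_pattern)
except re.error as exc:
    caught = exc
else:
    caught = None

# pos is the index in the pattern where compilation failed;
# lineno and colno are derived from pos.
print(caught.msg)
print(caught.pattern, caught.pos, caught.lineno, caught.colno)
```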
Georg Brandl116aa622007-08-15 14:28:22 +0000890
891.. _re-objects:
892
893Regular Expression Objects
894--------------------------
895
Georg Brandlc62a7042010-07-29 11:49:05 +0000896Compiled regular expression objects support the following methods and
Raymond Hettinger5768e0c2011-10-19 14:10:07 -0700897attributes:
Brian Curtin027e4782010-03-26 00:39:56 +0000898
Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +0300899.. method:: Pattern.search(string[, pos[, endpos]])
Georg Brandl116aa622007-08-15 14:28:22 +0000900
Berker Peksag84f387d2016-06-08 14:56:56 +0300901 Scan through *string* looking for the first location where this regular
902 expression produces a match, and return a corresponding :ref:`match object
Georg Brandlc62a7042010-07-29 11:49:05 +0000903 <match-objects>`. Return ``None`` if no position in the string matches the
904 pattern; note that this is different from finding a zero-length match at some
905 point in the string.
Georg Brandl116aa622007-08-15 14:28:22 +0000906
Georg Brandlc62a7042010-07-29 11:49:05 +0000907 The optional second parameter *pos* gives an index in the string where the
908 search is to start; it defaults to ``0``. This is not completely equivalent to
909 slicing the string; the ``'^'`` pattern character matches at the real beginning
910 of the string and at positions just after a newline, but not necessarily at the
911 index where the search is to start.
Georg Brandl116aa622007-08-15 14:28:22 +0000912
Georg Brandlc62a7042010-07-29 11:49:05 +0000913 The optional parameter *endpos* limits how far the string will be searched; it
914 will be as if the string is *endpos* characters long, so only the characters
915 from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
Raymond Hettinger5768e0c2011-10-19 14:10:07 -0700916 than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
Georg Brandlc62a7042010-07-29 11:49:05 +0000917 expression object, ``rx.search(string, 0, 50)`` is equivalent to
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300918 ``rx.search(string[:50], 0)``. ::
Georg Brandl116aa622007-08-15 14:28:22 +0000919
      >>> pattern = re.compile("d")
      >>> pattern.search("dog")     # Match at index 0
      <re.Match object; span=(0, 1), match='d'>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
Georg Brandl116aa622007-08-15 14:28:22 +0000924
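The remark that *pos* is not completely equivalent to slicing, while *endpos*
behaves like truncation, can be sketched as follows. The strings and variable
names are made up for this example.

```python
import re

starts_with_b = re.compile(r"^b")
text = "ab"

# With pos=1, '^' still refers to the real beginning of the string.
with_pos = starts_with_b.search(text, 1)

# Slicing creates a new string whose beginning is the "b".
with_slice = starts_with_b.search(text[1:])

# endpos acts as if the string were only endpos characters long.
digits = re.compile(r"\d+")
limited = digits.search("abc12345", 0, 5)
truncated = digits.search("abc12345"[:5], 0)

print(with_pos, with_slice is not None)
print(limited.group(0), truncated.group(0))
```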
925
Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +0300926.. method:: Pattern.match(string[, pos[, endpos]])
Georg Brandl116aa622007-08-15 14:28:22 +0000927
Georg Brandlc62a7042010-07-29 11:49:05 +0000928 If zero or more characters at the *beginning* of *string* match this regular
929 expression, return a corresponding :ref:`match object <match-objects>`.
930 Return ``None`` if the string does not match the pattern; note that this is
931 different from a zero-length match.
Georg Brandl116aa622007-08-15 14:28:22 +0000932
Georg Brandlc62a7042010-07-29 11:49:05 +0000933 The optional *pos* and *endpos* parameters have the same meaning as for the
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300934 :meth:`~Pattern.search` method. ::
Benjamin Petersond7c3ed52010-06-27 22:32:30 +0000935
      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <re.Match object; span=(1, 2), match='o'>
Georg Brandl116aa622007-08-15 14:28:22 +0000940
Ezio Melotti443f0002012-02-29 13:39:05 +0200941 If you want to locate a match anywhere in *string*, use
Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +0300942 :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).
Ezio Melotti443f0002012-02-29 13:39:05 +0200943
Georg Brandl116aa622007-08-15 14:28:22 +0000944
Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +0300945.. method:: Pattern.fullmatch(string[, pos[, endpos]])
Serhiy Storchaka32eddc12013-11-23 23:20:30 +0200946
947 If the whole *string* matches this regular expression, return a corresponding
948 :ref:`match object <match-objects>`. Return ``None`` if the string does not
949 match the pattern; note that this is different from a zero-length match.
950
951 The optional *pos* and *endpos* parameters have the same meaning as for the
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300952 :meth:`~Pattern.search` method. ::
Serhiy Storchaka32eddc12013-11-23 23:20:30 +0200953
      >>> pattern = re.compile("o[gh]")
      >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.fullmatch("ogre")     # No match as the full string does not match.
      >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
      <re.Match object; span=(1, 3), match='og'>
Serhiy Storchaka32eddc12013-11-23 23:20:30 +0200959
960 .. versionadded:: 3.4
961
962
Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +0300963.. method:: Pattern.split(string, maxsplit=0)
Georg Brandl116aa622007-08-15 14:28:22 +0000964
Georg Brandlc62a7042010-07-29 11:49:05 +0000965 Identical to the :func:`split` function, using the compiled pattern.
Georg Brandl116aa622007-08-15 14:28:22 +0000966
967
Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +0300968.. method:: Pattern.findall(string[, pos[, endpos]])
Georg Brandl116aa622007-08-15 14:28:22 +0000969
   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region in the same way as for :meth:`~Pattern.search`.
Georg Brandl116aa622007-08-15 14:28:22 +0000973
974
Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +0300975.. method:: Pattern.finditer(string[, pos[, endpos]])
Georg Brandl116aa622007-08-15 14:28:22 +0000976
   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region in the same way as for :meth:`~Pattern.search`.
Georg Brandl116aa622007-08-15 14:28:22 +0000980
981
Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +0300982.. method:: Pattern.sub(repl, string, count=0)
Georg Brandl116aa622007-08-15 14:28:22 +0000983
Georg Brandlc62a7042010-07-29 11:49:05 +0000984 Identical to the :func:`sub` function, using the compiled pattern.
Georg Brandl116aa622007-08-15 14:28:22 +0000985
986
Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +0300987.. method:: Pattern.subn(repl, string, count=0)
Georg Brandl116aa622007-08-15 14:28:22 +0000988
Georg Brandlc62a7042010-07-29 11:49:05 +0000989 Identical to the :func:`subn` function, using the compiled pattern.
Georg Brandl116aa622007-08-15 14:28:22 +0000990
991
Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +0300992.. attribute:: Pattern.flags
Georg Brandl116aa622007-08-15 14:28:22 +0000993
Georg Brandl3a19e542012-03-17 17:29:27 +0100994 The regex matching flags. This is a combination of the flags given to
995 :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
996 flags such as :data:`UNICODE` if the pattern is a Unicode string.
Georg Brandl116aa622007-08-15 14:28:22 +0000997
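The three sources of flags mentioned for :attr:`Pattern.flags` can be observed
in a short sketch. The pattern text is made up for this example.

```python
import re

pattern = re.compile(r"(?i)spam", re.MULTILINE)

has_ignorecase = bool(pattern.flags & re.IGNORECASE)   # from the inline (?i)
has_multiline = bool(pattern.flags & re.MULTILINE)     # from the flags argument
has_unicode = bool(pattern.flags & re.UNICODE)         # implicit for str patterns

# A bytes pattern does not get the implicit UNICODE flag.
bytes_has_unicode = bool(re.compile(rb"spam").flags & re.UNICODE)

print(has_ignorecase, has_multiline, has_unicode, bytes_has_unicode)
```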

.. attribute:: Pattern.groups

   The number of capturing groups in the pattern.


.. attribute:: Pattern.groupindex

   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   numbers.  The dictionary is empty if no symbolic groups were used in the
   pattern.

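
For instance, with a hypothetical date pattern that mixes named and unnamed
groups, the two attributes might be inspected as follows.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

group_count = date.groups           # counts named and unnamed groups alike
names = dict(date.groupindex)       # only the named groups appear here
print(group_count, names)
```

The third group has no name, so it is counted in ``groups`` but absent from
``groupindex``.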

.. attribute:: Pattern.pattern

   The pattern string from which the pattern object was compiled.


.. versionchanged:: 3.7
   Added support for :func:`copy.copy` and :func:`copy.deepcopy`.  Compiled
   regular expression objects are considered atomic.

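
"Atomic" here means that copying hands back the very same object. A short
check, assuming Python 3.7 or later:

```python
import copy
import re

p = re.compile(r"\d+")

# Both kinds of copy return the original object rather than a duplicate.
shallow_is_same = copy.copy(p) is p
deep_is_same = copy.deepcopy(p) is p
print(shallow_is_same, deep_is_same)
```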

.. _match-objects:

Match Objects
-------------

Match objects always have a boolean value of ``True``.
Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
when there is no match, you can test whether there was a match with a simple
``if`` statement::

   match = re.search(pattern, string)
   if match:
       process(match)

Match objects support the following methods and attributes:


.. method:: Match.expand(template)

   Return the string obtained by doing backslash substitution on the template
   string *template*, as done by the :meth:`~Pattern.sub` method.
   Escapes such as ``\n`` are converted to the appropriate characters,
   and numeric backreferences (``\1``, ``\2``) and named backreferences
   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
   corresponding group.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.
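
A small sketch of :meth:`Match.expand`, with illustrative group names and
templates. The second half shows the 3.5 behaviour for a group that did not
take part in the match.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")
# Named and numeric backreferences can be mixed in one template.
swapped = m.expand(r"\g<last>, \1")

# Group 2 is unmatched here, so it expands to an empty string (3.5+).
m2 = re.match(r"(a)|(b)", "a")
bracketed = m2.expand(r"[\2]")
print(swapped, bracketed)
```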

.. method:: Match.group([group1, ...])

   Return one or more subgroups of the match.  If there is a single argument, the
   result is a single string; if there are multiple arguments, the result is a
   tuple with one item per argument.  Without arguments, *group1* defaults to zero
   (the whole match is returned).  If a *groupN* argument is zero, the corresponding
   return value is the entire matching string; if it is in the inclusive range
   [1..99], it is the string matching the corresponding parenthesized group.  If a
   group number is negative or larger than the number of groups defined in the
   pattern, an :exc:`IndexError` exception is raised.  If a group is contained in a
   part of the pattern that did not match, the corresponding result is ``None``.
   If a group is contained in a part of the pattern that matched multiple times,
   the last match is returned. ::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)       # The entire match
      'Isaac Newton'
      >>> m.group(1)       # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)       # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')

   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   arguments may also be strings identifying groups by their group name.  If a
   string argument is not used as a group name in the pattern, an :exc:`IndexError`
   exception is raised.

   A moderately complicated example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

   Named groups can also be referred to by their index::

      >>> m.group(1)
      'Malcolm'
      >>> m.group(2)
      'Reynolds'

   If a group matches multiple times, only the last match is accessible::

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)                        # Returns only the last match.
      'c3'


.. method:: Match.__getitem__(g)

   This is identical to ``m.group(g)``.  This allows easier access to
   an individual group from a match::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m[0]       # The entire match
      'Isaac Newton'
      >>> m[1]       # The first parenthesized subgroup.
      'Isaac'
      >>> m[2]       # The second parenthesized subgroup.
      'Newton'

   .. versionadded:: 3.6


.. method:: Match.groups(default=None)

   Return a tuple containing all the subgroups of the match, from 1 up to however
   many groups are in the pattern.  The *default* argument is used for groups that
   did not participate in the match; it defaults to ``None``.

   For example::

      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
      >>> m.groups()
      ('24', '1632')

   If we make the decimal place and everything after it optional, not all groups
   might participate in the match.  These groups will default to ``None`` unless
   the *default* argument is given::

      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()      # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')   # Now, the second group defaults to '0'.
      ('24', '0')


.. method:: Match.groupdict(default=None)

   Return a dictionary containing all the *named* subgroups of the match, keyed by
   the subgroup name.  The *default* argument is used for groups that did not
   participate in the match; it defaults to ``None``.  For example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}

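
The effect of the *default* argument can be sketched with an optional named
group. The group names ``whole`` and ``frac`` are invented for this example.

```python
import re

m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

without_default = m.groupdict()      # the unmatched group maps to None
with_default = m.groupdict("0")      # the unmatched group maps to '0'
print(without_default, with_default)
```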

.. method:: Match.start([group])
            Match.end([group])

   Return the indices of the start and end of the substring matched by *group*;
   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   *group* exists but did not contribute to the match.  For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string.  For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

   An example that will remove *remove_this* from email addresses::

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


.. method:: Match.span([group])

   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
   that if *group* did not contribute to the match, this is ``(-1, -1)``.
   *group* defaults to zero, the entire match.

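
The three cases (whole match, empty group, non-participating group) can be
seen together in one sketch. The pattern is a made-up variation on the
``'b(c?)'`` example above.

```python
import re

m = re.search(r"b(c?)(d)?", "abx")

whole = m.span()      # the whole match: just the 'b'
empty = m.span(1)     # group 1 matched the empty string after the 'b'
missing = m.span(2)   # group 2 did not contribute to the match
print(whole, empty, missing)
```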

.. attribute:: Match.pos

   The value of *pos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
   the index into the string at which the RE engine started looking for a match.


.. attribute:: Match.endpos

   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`.  This is
   the index into the string beyond which the RE engine will not go.

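
A brief sketch of how the two attributes echo the search window. The sample
string is arbitrary. When *endpos* is omitted, the attribute reports the
length of the string.

```python
import re

number = re.compile(r"\d+")
s = "ab12cd345"

late = number.search(s, 4)        # start looking at index 4
early = number.search(s, 0, 3)    # look only at s[0:3], i.e. 'ab1'

late_info = (late.group(), late.pos, late.endpos)
early_info = (early.group(), early.pos, early.endpos)
print(late_info, early_info)
```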

.. attribute:: Match.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   string.


.. attribute:: Match.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.

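
The ``lastindex`` values quoted above can be checked directly. The sketch
also shows ``lastgroup`` for an alternation of two named groups, whose names
are invented for this example.

```python
import re

# The expressions from the Match.lastindex description, applied to 'ab'.
indexes = {
    pattern: re.match(pattern, "ab").lastindex
    for pattern in (r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)")
}

# With alternation, lastgroup names the branch that actually matched.
m = re.match(r"(?P<word>[a-z]+)|(?P<num>\d+)", "42")
branch = m.lastgroup

# No capturing groups at all: both attributes are None.
plain = re.match(r"ab", "ab")
print(indexes, branch, plain.lastindex, plain.lastgroup)
```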

.. attribute:: Match.re

   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
   :meth:`~Pattern.search` method produced this match instance.


.. attribute:: Match.string

   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.


.. versionchanged:: 3.7
   Added support for :func:`copy.copy` and :func:`copy.deepcopy`.  Match objects
   are considered atomic.


.. _re-examples:

Regular Expression Examples
---------------------------


Checking for a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following::

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as follows::

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~Match.group` method of the match object in the following manner::

   >>> pair = re.compile(r".*(.).*\1")
   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'


Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`.  Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings.  The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings

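
Put to work in Python, the regular expression above might be used as in the
following sketch. Unlike :c:func:`scanf`, the match yields strings, so the
numeric fields are converted explicitly. The variable names are illustrative.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

filename = m.group(1)
# Groups are returned as str; convert them the way %d would.
n_errors, n_warnings = int(m.group(2)), int(m.group(3))
print(filename, n_errors, n_warnings)
```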

.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <re.Match object; span=(2, 3), match='c'>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <re.Match object; span=(0, 1), match='a'>

Note, however, that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line. ::

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <re.Match object; span=(4, 5), match='X'>


Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern.  The
method is invaluable for converting textual data into data structures that can be
easily read and modified by Python as demonstrated in the following example that
creates a phonebook.

First, here is the input.  Normally it may come from a file; here we are using
triple-quoted string syntax:

.. doctest::

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines.  Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address.  We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

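
One possible next step, not part of the example above, is to turn the split
fields into a dictionary keyed by full name. This is a minimal sketch on a
shortened copy of the input; the ``phonebook`` layout is an assumption made
for illustration.

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[f"{first} {last}"] = (phone, address)

print(phonebook["Ross McFluff"])
```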

Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function.  This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'


Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does.  For example, if a writer wanted to
find all of the adverbs in some text, they might use :func:`findall` in
the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']


Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings.  Continuing with the previous example, if
a writer wanted to find all of the adverbs *and their positions* in
some text, they would use :func:`finditer` in the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly


Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it.  For example, the two following lines of code are
functionally identical::

   >>> re.match(r"\W(.)\1\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular
expression.  With raw string notation, this means ``r"\\"``.  Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical::

   >>> re.match(r"\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>
   >>> re.match("\\\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>

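
The equivalence is a property of Python string literals rather than of the
:mod:`re` module, which can be confirmed by comparing the strings themselves.
A short sketch:

```python
# The raw and the doubled-backslash spellings denote the same string.
same_pattern = r"\W(.)\1\W" == "\\W(.)\\1\\W"

# r"\\" is two characters (two backslashes), the same as "\\\\".
two_backslashes = r"\\"
same_backslashes = two_backslashes == "\\\\"
length = len(two_backslashes)

# Without the r prefix, "\1" is an octal escape for the character U+0001,
# not a backreference.
octal_escape = "\1" == chr(1)
print(same_pattern, same_backslashes, length, octal_escape)
```

The last comparison shows why forgetting the ``r`` prefix can change a
pattern's meaning without raising an error.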

Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^

A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
analyzes a string to categorize groups of characters.  This is a useful first
step in writing a compiler or interpreter.

The text categories are specified with regular expressions.  The technique is
to combine those into a single master regular expression and to loop over
successive matches::

    import collections
    import re

    Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

    def tokenize(code):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',  r'\d+(\.\d*)?'),  # Integer or decimal number
            ('ASSIGN',  r':='),           # Assignment operator
            ('END',     r';'),            # Statement terminator
            ('ID',      r'[A-Za-z]+'),    # Identifiers
            ('OP',      r'[+\-*/]'),      # Arithmetic operators
            ('NEWLINE', r'\n'),           # Line endings
            ('SKIP',    r'[ \t]+'),       # Skip over spaces and tabs
            ('MISMATCH',r'.'),            # Any other character
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        line_num = 1
        line_start = 0
        for mo in re.finditer(tok_regex, code):
            kind = mo.lastgroup
            value = mo.group(kind)
            if kind == 'NEWLINE':
                line_start = mo.end()
                line_num += 1
            elif kind == 'SKIP':
                pass
            elif kind == 'MISMATCH':
                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
            else:
                if kind == 'ID' and value in keywords:
                    kind = value
                column = mo.start() - line_start
                yield Token(kind, value, line_num, column)

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)

The tokenizer produces the following output::

    Token(typ='IF', value='IF', line=2, column=4)
    Token(typ='ID', value='quantity', line=2, column=7)
    Token(typ='THEN', value='THEN', line=2, column=16)
    Token(typ='ID', value='total', line=3, column=8)
    Token(typ='ASSIGN', value=':=', line=3, column=14)
    Token(typ='ID', value='total', line=3, column=17)
    Token(typ='OP', value='+', line=3, column=23)
    Token(typ='ID', value='price', line=3, column=25)
    Token(typ='OP', value='*', line=3, column=31)
    Token(typ='ID', value='quantity', line=3, column=33)
    Token(typ='END', value=';', line=3, column=41)
    Token(typ='ID', value='tax', line=4, column=8)
    Token(typ='ASSIGN', value=':=', line=4, column=12)
    Token(typ='ID', value='price', line=4, column=15)
    Token(typ='OP', value='*', line=4, column=21)
    Token(typ='NUMBER', value='0.05', line=4, column=23)
    Token(typ='END', value=';', line=4, column=27)
    Token(typ='ENDIF', value='ENDIF', line=5, column=4)
    Token(typ='END', value=';', line=5, column=9)


.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
   Media, 2009. The third edition of the book no longer covers Python at all,
   but the first edition covered writing good regular expression patterns in
   great detail.