:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

**Source code:** :source:`Lib/re.py`

--------------

This module provides regular expression matching operations similar to
those found in Perl.

Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal. Also note that any invalid escape sequence in a Python string
literal now generates a :exc:`DeprecationWarning`, and in the future this
will become a :exc:`SyntaxError`. This happens even if the sequence is a
valid escape sequence for a regular expression.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.
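
The difference between the two notations can be checked directly. The
following is only a small sketch; the variable names and the sample text are
made up for the illustration.

```python
import re

# A raw string keeps the backslash; a regular literal interprets the escape.
raw_length = len(r"\n")     # backslash followed by 'n'
plain_length = len("\n")    # a single newline character

# Both literals below spell the same two-character pattern, which matches
# one literal backslash in the searched text.
plain = re.compile('\\\\')
raw = re.compile(r'\\')
text = 'C:\\temp'           # the text contains exactly one backslash

plain_hit = plain.search(text).group()
raw_hit = raw.search(text).group()
```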

It is important to note that most regular expression operations are available
both as module-level functions and as methods on
:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
that don't require you to compile a regex object first, but miss some
fine-tuning parameters.
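
As a rough illustration of the two calling styles, the sketch below searches
the same made-up text both ways. The optional start position accepted by the
compiled pattern's ``search`` method is one example of a fine-tuning parameter
that the module-level function lacks.

```python
import re

text = 'spam and eggs'

# Module-level shortcut: the pattern is passed on every call.
shortcut = re.search(r'eggs', text)

# Compiled pattern: the method also accepts a start position.
pattern = re.compile(r'\w+')
from_start = pattern.search(text)
from_offset = pattern.search(text, 5)
```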

.. seealso::

   The third-party `regex <https://pypi.org/project/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that match it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contain low precedence
operations; boundary conditions between *A* and *B*; or have numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book [Frie09]_,
or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write RE's in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
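
The rule about nesting can be tried out as follows. This is a minimal sketch
with illustrative variable names; it relies on :func:`re.compile` raising
:exc:`re.error` for a directly nested repetition.

```python
import re

multiple_of_six = re.compile(r'(?:a{6})*')

twelve_ok = multiple_of_six.fullmatch('a' * 12) is not None
seven_rejected = multiple_of_six.fullmatch('a' * 7) is None

# Directly nesting two repetition qualifiers is rejected at compile time.
try:
    re.compile(r'a{6}*')
    nested_rejected = False
except re.error:
    nested_rejected = True
```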


The special characters are:

.. index:: single: . (dot); in regular expressions

``.``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

.. index:: single: ^ (caret); in regular expressions

``^``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

.. index:: single: $ (dollar); in regular expressions

``$``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.
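
The ``$`` cases described above can be reproduced with a short sketch (the
variable names are illustrative only):

```python
import re

text = 'foo1\nfoo2\n'

# Without MULTILINE, '$' only matches at the very end (or before a final newline).
default_hit = re.search(r'foo.$', text).group()

# With MULTILINE, '$' also matches before every newline.
multiline_hit = re.search(r'foo.$', text, re.MULTILINE).group()

# A bare '$' finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]
```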

.. index:: single: * (asterisk); in regular expressions

``*``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

.. index:: single: + (plus); in regular expressions

``+``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

.. index:: single: ? (question mark); in regular expressions

``?``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

.. index::
   single: *?; in regular expressions
   single: +?; in regular expressions
   single: ??; in regular expressions

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
   string, and not just ``'<a>'``. Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using the RE ``<.*?>`` will match
   only ``'<a>'``.
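
A quick way to see the difference between greedy and non-greedy matching is
to run both patterns on the string used above (a sketch; the names are
illustrative):

```python
import re

text = '<a> b <c>'

# Greedy: '.*' runs as far as it can while still allowing the final '>'.
greedy = re.search(r'<.*>', text).group()

# Non-greedy: '.*?' stops at the first '>' that lets the match succeed.
minimal = re.search(r'<.*?>', text).group()
```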

.. index::
   single: {} (curly brackets); in regular expressions

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.
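
The counted repetitions above can be exercised like this (a small sketch
with made-up variable names):

```python
import re

six_a = 'aaaaaa'

# Greedy and non-greedy counted repetition on the same input.
greedy_run = re.match(r'a{3,5}', six_a).group()
minimal_run = re.match(r'a{3,5}?', six_a).group()

# Open-ended upper bound: at least four 'a' characters before the 'b'.
four_ok = re.fullmatch(r'a{4,}b', 'aaaab') is not None
thousand_ok = re.fullmatch(r'a{4,}b', 'a' * 1000 + 'b') is not None
three_rejected = re.fullmatch(r'a{4,}b', 'aaab') is None
```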

.. index:: single: \ (backslash); in regular expressions

``\``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be doubled. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

.. index::
   single: [] (square brackets); in regular expressions

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   .. index:: single: - (minus); in regular expressions

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character
     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   .. index:: single: \ (backslash); in regular expressions

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`ASCII` or :const:`LOCALE` mode is in force.

   .. index:: single: ^ (caret); in regular expressions

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match a parenthesis.

   .. .. index:: single: --; in regular expressions
   .. .. index:: single: &&; in regular expressions
   .. .. index:: single: ~~; in regular expressions
   .. .. index:: single: ||; in regular expressions

   * Support of nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future. This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
     avoid a warning, escape them with a backslash.

   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/

   .. versionchanged:: 3.7
      :exc:`FutureWarning` is raised if a character set contains constructs
      that will change semantically in the future.
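
Several of the set rules listed above are sketched below. The sample strings
and variable names are invented for the illustration.

```python
import re

# Individually listed characters.
listed = re.findall(r'[amk]', 'make')

# Two ranges side by side: 00 through 59.
in_range = re.fullmatch(r'[0-5][0-9]', '59') is not None
out_of_range = re.fullmatch(r'[0-5][0-9]', '60') is None

# An escaped '-' is a literal, not a range operator.
literal_dash = re.findall(r'[a\-z]', 'a-m-z')

# Special characters are ordinary inside a set.
specials = re.findall(r'[(+*)]', '(1+2)*3')

# A leading '^' complements the set.
not_five = re.findall(r'[^5]', '1525')

# A ']' placed first in the set is a literal.
brackets = re.findall(r'[]()[{}]', 'a[0](b){c}')
```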

.. index:: single: | (vertical bar); in regular expressions

``|``
   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
   will match either *A* or *B*. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once *A* matches, *B* will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.
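
The left-to-right behaviour of ``'|'`` shows up as soon as one alternative is
a prefix of another, as in this sketch (illustrative names):

```python
import re

# The first alternative that matches wins, even if a later one is longer.
short_first = re.match(r'a|ab', 'ab').group()
long_first = re.match(r'ab|a', 'ab').group()

# Two ways to match a literal vertical bar.
escaped_bar = re.findall(r'\|', 'x|y|z')
class_bar = re.findall(r'[|]', 'x|y|z')
```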

.. index::
   single: () (parentheses); in regular expressions

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

.. index:: single: (?; in regular expressions

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?aiLmsux)``
   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
   for the entire regular expression.
   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flag* argument to the
   :func:`re.compile` function. Flags should be used first in the
   expression string.

.. index:: single: (?:; in regular expressions

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?aiLmsux-imsx:...)``
   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
   one or more letters from the set ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
   The letters set or remove the corresponding flags:
   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
   and :const:`re.X` (verbose), for the part of the expression.
   (The flags are described in :ref:`contents-of-module-re`.)

   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
   as inline flags, so they can't be combined or follow ``'-'``. Instead,
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default). In bytes patterns ``(?L:...)`` switches to locale-dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.

   .. versionadded:: 3.6

   .. versionchanged:: 3.7
      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.
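
The scope of global and group-local inline flags can be compared with a few
calls. This is only a sketch: the sample strings are invented, and the last
two lines assume Python 3.7 or later, where ``'a'`` is accepted inside a group.

```python
import re

# A global inline flag applies to the whole expression.
whole_pattern = re.fullmatch(r'(?i)hello', 'HeLLo') is not None

# A scoped flag applies only inside its group.
scoped_ok = re.fullmatch(r'(?i:a)b', 'Ab') is not None
scoped_rejected = re.fullmatch(r'(?i:a)b', 'AB') is None

# (?a:...) restricts \d to ASCII digits inside the group only.
arabic_three = '\u0663'   # ARABIC-INDIC DIGIT THREE, a Unicode decimal digit
ascii_then_unicode = re.fullmatch(r'(?a:\d)\d', '1' + arabic_three) is not None
unicode_in_ascii_group = re.fullmatch(r'(?a:\d)\d', arabic_three + '1') is None
```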

.. index:: single: (?P<; in regular expressions

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object *m*      | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the *repl*      | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
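
The three contexts in the table can be tried with the ``quote`` pattern
itself. The sample sentence and variable names below are made up for this
sketch.

```python
import re

pattern = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')
text = 'say "hi" now'

# In the pattern itself: (?P=quote) requires the same closing quote.
m = pattern.search(text)
whole = m.group()

# On the match object: access by name or by number.
by_name = m.group('quote')
by_index = m.group(1)
quote_end = m.end('quote')

# In a replacement string: three equivalent references to the group.
sub_name = pattern.sub(r'[\g<quote>]', text)
sub_number = pattern.sub(r'[\g<1>]', text)
sub_backref = pattern.sub(r'[\1]', text)
```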

.. index:: single: (?P=; in regular expressions

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

.. index:: single: (?#; in regular expressions

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

.. index:: single: (?=; in regular expressions

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a :dfn:`lookahead assertion`. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

.. index:: single: (?!; in regular expressions

``(?!...)``
   Matches if ``...`` doesn't match next. This is a :dfn:`negative lookahead assertion`.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.
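
Both lookahead forms can be checked against the same pair of names; note
that the asserted text is not part of the match. A brief sketch (``'Isaac
Newton'`` is just a made-up counterexample):

```python
import re

positive = re.compile(r'Isaac (?=Asimov)')
negative = re.compile(r'Isaac (?!Asimov)')

# The lookahead text is checked but not consumed.
positive_hit = positive.search('Isaac Asimov').group()
positive_miss = positive.search('Isaac Newton')

negative_hit = negative.search('Isaac Newton').group()
negative_miss = negative.search('Isaac Asimov')
```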

.. index:: single: (?<=; in regular expressions

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

   .. versionchanged:: 3.5
      Added support for group references of fixed length.

.. index:: single: (?<!; in regular expressions

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length. Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or
   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
   optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.
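
The conditional email pattern above can be run against the four strings it
mentions (a sketch; the variable names are illustrative):

```python
import re

email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

# Group 1 matched '<', so the yes-pattern '>' is required.
bracketed = email.match('<user@host.com>') is not None

# Group 1 did not match, so the no-pattern '$' is required.
bare = email.match('user@host.com') is not None

# Unbalanced brackets satisfy neither branch.
missing_close = email.match('<user@host.com') is None
stray_close = email.match('user@host.com>') is None
```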


The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character. For example, ``\$`` matches the
character ``'$'``.

.. index:: single: \ (backslash); in regular expressions

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.
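
A short sketch of numbered backreferences and the octal interpretation
described above (the variable names are illustrative):

```python
import re

repeated = re.compile(r'(.+) \1')

words_ok = repeated.fullmatch('the the') is not None
digits_ok = repeated.fullmatch('55 55') is not None
no_space = repeated.fullmatch('thethe') is None

# Three octal digits denote a character, not a group: octal 101 is 'A'.
octal_char = re.fullmatch(r'\101', 'A') is not None
```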

.. index:: single: \A; in regular expressions

``\A``
   Matches only at the start of the string.

.. index:: single: \b; in regular expressions

``\b``
   Matches the empty string, but only at the beginning or end of a word.
   A word is defined as a sequence of word characters. Note that formally,
   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
   (or vice versa), or between ``\w`` and the beginning/end of the string.
   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.

   By default Unicode alphanumerics are the ones used in Unicode patterns, but
   this can be changed by using the :const:`ASCII` flag. Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.
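
The word-boundary examples above, plus the backspace meaning inside a set,
can be verified with a sketch like this one (names are illustrative):

```python
import re

word = re.compile(r'\bfoo\b')

candidates = ('foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3')
matching = [s for s in candidates if word.search(s)]
rejected = [s for s in candidates if not word.search(s)]

# Inside a set, \b stands for the backspace character (0x08).
backspace = re.fullmatch(r'[\b]', '\x08') is not None
```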

.. index:: single: \B; in regular expressions

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end
   of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
   patterns are Unicode alphanumerics or the underscore, although this can
   be changed by using the :const:`ASCII` flag. Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.

.. index:: single: \d; in regular expressions

``\d``
   For Unicode (str) patterns:
      Matches any Unicode decimal digit (that is, any character in
      Unicode character category [Nd]). This includes ``[0-9]``, and
      also many other digit characters. If the :const:`ASCII` flag is
      used only ``[0-9]`` is matched.

   For 8-bit (bytes) patterns:
      Matches any decimal digit; this is equivalent to ``[0-9]``.
Georg Brandl116aa622007-08-15 14:28:22 +0000493
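As a small illustration of the difference between Unicode and ASCII digit
matching, the sketch below uses U+0663 (ARABIC-INDIC DIGIT THREE), chosen here
only as one example of a non-ASCII character in category Nd:

```python
import re

arabic_indic_three = '\u0663'  # a decimal digit outside [0-9]

# str pattern, default Unicode matching: any Nd character is a digit.
unicode_hit = re.fullmatch(r'\d', arabic_indic_three)

# str pattern with ASCII: only [0-9] is a digit.
ascii_hit = re.fullmatch(r'\d', arabic_indic_three, re.ASCII)

# bytes pattern: always equivalent to [0-9].
bytes_hit = re.fullmatch(rb'\d', b'7')
```

Here ``unicode_hit`` and ``bytes_hit`` are match objects, while ``ascii_hit``
is ``None``.
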
.. index:: single: \D; in regular expressions

``\D``
   Matches any character which is not a decimal digit. This is
   the opposite of ``\d``. If the :const:`ASCII` flag is used, this
   becomes the equivalent of ``[^0-9]``.

.. index:: single: \s; in regular expressions

``\s``
   For Unicode (str) patterns:
      Matches Unicode whitespace characters (which includes
      ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many
      languages). If the :const:`ASCII` flag is used, only
      ``[ \t\n\r\f\v]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
      this is equivalent to ``[ \t\n\r\f\v]``.

.. index:: single: \S; in regular expressions

``\S``
   Matches any character which is not a whitespace character. This is
   the opposite of ``\s``. If the :const:`ASCII` flag is used, this
   becomes the equivalent of ``[^ \t\n\r\f\v]``.

.. index:: single: \w; in regular expressions

``\w``
   For Unicode (str) patterns:
      Matches Unicode word characters; this includes most characters
      that can be part of a word in any language, as well as numbers and
      the underscore. If the :const:`ASCII` flag is used, only
      ``[a-zA-Z0-9_]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered alphanumeric in the ASCII character set;
      this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
      used, matches characters considered alphanumeric in the current locale
      and the underscore.

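The effect of the :const:`ASCII` flag on ``\w`` can be sketched as follows. The
sample text is an assumption chosen for illustration; it is written with
escapes so that the accented letters are single precomposed code points:

```python
import re

text = 'na\u00efve caf\u00e9_1'  # 'naïve café_1'

# Default for str patterns: accented letters are word characters.
unicode_words = re.findall(r'\w+', text)

# With ASCII, only [a-zA-Z0-9_] counts, so accented letters split the words.
ascii_words = re.findall(r'\w+', text, re.ASCII)
```

The first call returns two words; the second returns four fragments, because
each accented letter acts as a separator.
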
.. index:: single: \W; in regular expressions

``\W``
   Matches any character which is not a word character. This is
   the opposite of ``\w``. If the :const:`ASCII` flag is used, this
   becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
   used, matches characters which are neither alphanumeric in the current locale
   nor the underscore.

.. index:: single: \Z; in regular expressions

``\Z``
   Matches only at the end of the string.

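A brief sketch of how ``\A`` and ``\Z`` differ from ``'^'`` and ``'$'``; the
sample text and variable names are illustrative:

```python
import re

text = 'first\nsecond\n'

# '$' also matches just before a trailing newline; \Z does not.
dollar = re.search(r'second$', text)
strict_end = re.search(r'second\Z', text)

# In MULTILINE mode '^' matches at every line start, but \A still
# matches only at the start of the whole string.
caret_lines = re.findall(r'^\w+', text, re.MULTILINE)
anchor_lines = re.findall(r'\A\w+', text, re.MULTILINE)
```

``dollar`` is a match object and ``strict_end`` is ``None``; ``caret_lines``
contains both words, while ``anchor_lines`` contains only the first.
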
.. index::
   single: \a; in regular expressions
   single: \b; in regular expressions
   single: \f; in regular expressions
   single: \n; in regular expressions
   single: \N; in regular expressions
   single: \r; in regular expressions
   single: \t; in regular expressions
   single: \u; in regular expressions
   single: \U; in regular expressions
   single: \v; in regular expressions
   single: \x; in regular expressions
   single: \\; in regular expressions

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \N      \r      \t      \u
   \U      \v      \x      \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
patterns. In bytes patterns they are errors. Unknown escapes of ASCII
letters are reserved for future use and treated as errors.

Octal escapes are included in a limited form. If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

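The octal-versus-group-reference rule can be illustrated with a minimal sketch
(variable names are illustrative):

```python
import re

# Three octal digits: an octal escape. 0o141 is the code point of 'a'.
three_digits = re.fullmatch(r'\141', 'a')

# Leading 0: an octal escape. \07 is the BEL character.
leading_zero = re.fullmatch(r'\07', '\x07')

# Anything else is a group reference: \1 repeats what group 1 matched.
group_ref = re.fullmatch(r'(a)\1', 'aa')
```

All three calls return a match object.
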
.. versionchanged:: 3.3
   The ``'\u'`` and ``'\U'`` escape sequences have been added.

.. versionchanged:: 3.6
   Unknown escapes consisting of ``'\'`` and an ASCII letter are now errors.

.. versionchanged:: 3.8
   The ``'\N{name}'`` escape sequence has been added. As in string literals,
   it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).


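A short sketch of the escapes covered by the version notes above, assuming
Python 3.8 or later so that ``\N{name}`` is available. The sample text and
variable names are illustrative:

```python
import re

text = 'a\u2014b'  # 'a', EM DASH, 'b'

# In a str pattern, \N{...} and \u name the same character.
by_name = re.search(r'\N{EM DASH}', text)
by_code = re.search(r'\u2014', text)

# In a bytes pattern, \u is an error.
try:
    re.compile(rb'\u2014')
    bytes_ok = True
except re.error:
    bytes_ok = False

# An unknown escape of an ASCII letter is an error as well.
try:
    re.compile(r'\q')
    unknown_ok = True
except re.error:
    unknown_ok = False
```

Both searches find the dash at index 1; both ``bytes_ok`` and ``unknown_ok``
end up ``False``.
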
.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions. Most non-trivial applications use the compiled form.

.. versionchanged:: 3.6
   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
   :class:`enum.IntFlag`.

.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a :ref:`regular expression object
   <re-objects>`, which can be used for matching using its
   :func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
   below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.compile` and the module-level matching functions are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.


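As a minimal sketch of compiling once with combined flags and reusing the
result (the pattern, text and names are assumptions made for illustration):

```python
import re

# Flags are combined with bitwise OR.
line_start_spam = re.compile(r'^spam\b', re.IGNORECASE | re.MULTILINE)

lines = 'eggs\nSpam and ham\nspam!'

# The compiled object can be reused for any number of calls.
found = line_start_spam.findall(lines)

# The module-level function gives the same result for a one-off use.
same = re.findall(r'^spam\b', lines, re.IGNORECASE | re.MULTILINE)
```

Both ``found`` and ``same`` hold the two line-initial occurrences,
``'Spam'`` and ``'spam'``.
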
.. data:: A
          ASCII

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching. This is only
   meaningful for Unicode patterns, and is ignored for byte patterns.
   Corresponds to the inline flag ``(?a)``.

   Note that for backward compatibility, the :const:`re.U` flag still
   exists (as well as its synonym :const:`re.UNICODE` and its embedded
   counterpart ``(?u)``), but these are redundant in Python 3 since
   matches are Unicode by default for strings (and Unicode matching
   isn't allowed for bytes).


.. data:: DEBUG

   Display debug information about the compiled expression.
   No corresponding inline flag.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
   match lowercase letters. Full Unicode matching (such as ``Ü`` matching
   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
   non-ASCII matches. The current locale does not change the effect of this
   flag unless the :const:`re.LOCALE` flag is also used.
   Corresponds to the inline flag ``(?i)``.

   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
   combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
   If the :const:`ASCII` flag is used, only letters 'a' to 'z'
   and 'A' to 'Z' are matched.

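The interaction between :const:`IGNORECASE` and :const:`ASCII` described above
can be sketched like this, using the Kelvin sign as the sample character:

```python
import re

kelvin = '\u212a'  # KELVIN SIGN, one of the four non-ASCII letters named above

# Unicode case-insensitive matching lets [a-z] match the Kelvin sign.
unicode_case = re.fullmatch(r'[a-z]', kelvin, re.IGNORECASE)

# Adding ASCII restricts the class to 'a'-'z' and 'A'-'Z'.
ascii_case = re.fullmatch(r'[a-z]', kelvin, re.IGNORECASE | re.ASCII)

# Full Unicode case folding: u-umlaut matches its capital form.
umlaut = re.fullmatch('\u00fc', '\u00dc', re.IGNORECASE)
```

``unicode_case`` and ``umlaut`` are match objects; ``ascii_case`` is ``None``.
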
.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
   dependent on the current locale. This flag can be used only with bytes
   patterns. The use of this flag is discouraged because the locale mechanism
   is very unreliable: it handles only one "culture" at a time, and it
   works only with 8-bit locales. Unicode matching is already enabled by default
   in Python 3 for Unicode (str) patterns, and it is able to handle different
   locales/languages.
   Corresponds to the inline flag ``(?L)``.

   .. versionchanged:: 3.6
      :const:`re.LOCALE` can be used only with bytes patterns and is
      not compatible with :const:`re.ASCII`.

   .. versionchanged:: 3.7
      Compiled regular expression objects with the :const:`re.LOCALE` flag no
      longer depend on the locale at compile time. Only the locale at
      matching time affects the result of matching.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline). By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.
   Corresponds to the inline flag ``(?m)``.


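A minimal sketch of the default and :const:`MULTILINE` behaviour of ``'^'`` and
``'$'`` (the sample text is illustrative):

```python
import re

text = 'one\ntwo\nthree\n'

# Default: '^' only at the string start, '$' only at the string end
# (or just before a final newline), so no single line can satisfy both.
default_lines = re.findall(r'^\w+$', text)

# MULTILINE: '^' and '$' match at the start and end of every line.
multi_lines = re.findall(r'^\w+$', text, re.MULTILINE)

# Default '$' still matches just before the trailing newline.
last_word = re.findall(r'\w+$', text)
```

``default_lines`` is empty, ``multi_lines`` holds all three words, and
``last_word`` holds only the final one.
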
.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.
   Corresponds to the inline flag ``(?s)``.


.. data:: X
          VERBOSE

   .. index:: single: # (hash); in regular expressions

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments. Whitespace within the pattern is ignored, except
   when in a character class, or when preceded by an unescaped backslash,
   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")

   Corresponds to the inline flag ``(?x)``.


.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.


.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if the string does not match the pattern;
   note that this is different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).


.. function:: fullmatch(pattern, string, flags=0)

   If the whole *string* matches the regular expression *pattern*, return a
   corresponding :ref:`match object <match-objects>`. Return ``None`` if the
   string does not match the pattern; note that this is different from a
   zero-length match.

   .. versionadded:: 3.4


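The three matching functions differ only in where the match may start and end.
A short sketch, with an illustrative pattern and text:

```python
import re

pattern = r'\d+'

# search() looks for a match anywhere in the string.
anywhere = re.search(pattern, 'abc 123 def')

# match() only succeeds at the beginning of the string.
at_start_fail = re.match(pattern, 'abc 123 def')
at_start_ok = re.match(pattern, '123 def')

# fullmatch() requires the whole string to match.
whole_fail = re.fullmatch(pattern, '123 def')
whole_ok = re.fullmatch(pattern, '123')

# A zero-length match is still a match object, not None.
empty = re.match(r'\d*', 'abc')
```

``at_start_fail`` and ``whole_fail`` are ``None``; the others are match
objects, and ``empty.group()`` is the empty string.
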
.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. ::

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string. The same holds for
   the end of the string::

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list.

   Empty matches for the pattern split the string only when not adjacent
   to a previous empty match.

      >>> re.split(r'\b', 'Words, words, words.')
      ['', 'Words', ', ', 'words', ', ', 'words', '.']
      >>> re.split(r'\W*', '...words...')
      ['', '', 'w', 'o', 'r', 'd', 's', '', '']
      >>> re.split(r'(\W*)', '...words...')
      ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.7
      Added support of splitting on a pattern that could match an empty string.


.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings. The *string* is scanned left-to-right, and matches are returned in
   the order found. If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group. Empty matches are included in the result.

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.


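The shape of the list returned by :func:`findall` depends on how many groups
the pattern has. A minimal sketch, with an illustrative sample text:

```python
import re

text = 'width=80, height=24'

# No groups: a list of the full matches.
no_group = re.findall(r'\d+', text)

# One group: a list of that group's text.
one_group = re.findall(r'(\w+)=\d+', text)

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w+)=(\d+)', text)
```

The results are a list of numbers as strings, a list of names, and a list of
``(name, number)`` tuples respectively.
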
.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
   all non-overlapping matches for the RE *pattern* in *string*. The *string*
   is scanned left-to-right, and matches are returned in the order found. Empty
   matches are included in the result.

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.


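Because :func:`finditer` yields match objects rather than strings, positions
are available along with the matched text. A short sketch (the sample string
is illustrative):

```python
import re

# Collect (start, end, text) for every run of digits.
spans = [(m.start(), m.end(), m.group())
         for m in re.finditer(r'\d+', 'a1bb22ccc333')]
```

Each tuple in ``spans`` gives the slice boundaries of one match within the
searched string.
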
.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged. *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth. Unknown escapes of ASCII letters are reserved for future use and
   treated as errors. Other unknown escapes such as ``\&`` are left alone.
   Backreferences, such as ``\6``, are replaced with the substring matched by
   group 6 in the pattern. For example::

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single :ref:`match object <match-objects>`
   argument, and returns the replacement string. For example::

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'

   The pattern may be a string or a :ref:`pattern object <re-objects>`.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
   ``'-a-b--d-'``.

   .. index:: single: \g; in regular expressions

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

   .. versionchanged:: 3.6
      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
      are now errors.

   .. versionchanged:: 3.7
      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
      are now errors.

   .. versionchanged:: 3.7
      Empty matches for the pattern are replaced when adjacent to a previous
      non-empty match.


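The ``\g<...>`` forms and the *count* argument can be sketched as follows. The
date pattern, group names and sample strings are assumptions made for
illustration:

```python
import re

date = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# \g<name> refers to a named group.
reordered = date.sub(r'\g<day>/\g<month>/\g<year>', '2019-10-07')

# \g<1>0 is group 1 followed by a literal '0'.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

# \10 is read as group 10, which does not exist in this pattern.
try:
    re.sub(r'(\d)', r'\10', 'a1b2')
    ambiguous_ok = True
except re.error:
    ambiguous_ok = False

# count limits the number of replacements.
first_two = re.sub(r'\d', '#', 'a1b2c3', count=2)
```

``reordered`` is ``'07/10/2019'``, ``padded`` is ``'a10b20'``,
``ambiguous_ok`` is ``False``, and ``first_two`` is ``'a#b#c3'``.
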
.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.


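A one-line sketch of :func:`subn`, using an illustrative input string:

```python
import re

# Collapse runs of whitespace and report how many runs were replaced.
new_text, replacements = re.subn(r'\s+', ' ', 'too   many    spaces')
```

The returned tuple unpacks into the rewritten string and the number of
substitutions, here two.
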
.. function:: escape(pattern)

   Escape special characters in *pattern*.
   This is useful if you want to match an arbitrary literal string that may
   have regular expression metacharacters in it. For example::

      >>> print(re.escape('http://www.python.org'))
      http://www\.python\.org

      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
      >>> print('[%s]+' % re.escape(legal_chars))
      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+

      >>> operators = ['+', '-', '*', '/', '**']
      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
      /|\-|\+|\*\*|\*

   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`; there, only backslashes should be escaped. For example::

      >>> digits_re = r'\d+'
      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
      /usr/sbin/sendmail - \d+ errors, \d+ warnings

   .. versionchanged:: 3.3
      The ``'_'`` character is no longer escaped.

   .. versionchanged:: 3.7
      Only characters that can have special meaning in a regular expression
      are escaped. As a result, ``'!'``, ``'"'``, ``'%'``, ``"'"``, ``','``,
      ``'/'``, ``':'``, ``';'``, ``'<'``, ``'='``, ``'>'``, ``'@'``, and
      ``"`"`` are no longer escaped.


.. function:: purge()

   Clear the regular expression cache.


Serhiy Storchakaad446d52014-11-10 13:49:00 +0200968.. exception:: error(msg, pattern=None, pos=None)
Georg Brandl116aa622007-08-15 14:28:22 +0000969
970 Exception raised when a string passed to one of the functions here is not a
971 valid regular expression (for example, it might contain unmatched parentheses)
972 or when some other error occurs during compilation or matching. It is never an
Serhiy Storchakaad446d52014-11-10 13:49:00 +0200973 error if a string contains no match for a pattern. The error instance has
974 the following additional attributes:
Georg Brandl116aa622007-08-15 14:28:22 +0000975
Serhiy Storchakaad446d52014-11-10 13:49:00 +0200976 .. attribute:: msg
977
978 The unformatted error message.
979
980 .. attribute:: pattern
981
982 The regular expression pattern.
983
984 .. attribute:: pos
985
Serhiy Storchaka12d6b5d2017-05-27 16:12:48 +0300986 The index in *pattern* where compilation failed (may be ``None``).
Serhiy Storchakaad446d52014-11-10 13:49:00 +0200987
988 .. attribute:: lineno
989
Serhiy Storchaka12d6b5d2017-05-27 16:12:48 +0300990 The line corresponding to *pos* (may be ``None``).
Serhiy Storchakaad446d52014-11-10 13:49:00 +0200991
992 .. attribute:: colno
993
Serhiy Storchaka12d6b5d2017-05-27 16:12:48 +0300994 The column corresponding to *pos* (may be ``None``).
Serhiy Storchakaad446d52014-11-10 13:49:00 +0200995
996 .. versionchanged:: 3.5
997 Added additional attributes.
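
As a rough illustration of the attributes described for :exc:`error`, the
sketch below compiles a deliberately malformed pattern and records where
compilation failed. The pattern string and the ``details`` name are invented
for this example.

```python
import re

details = None
try:
    # Deliberately invalid: the opening parenthesis is never closed.
    re.compile(r"(unclosed")
except re.error as exc:
    # Copy the attributes out, because ``exc`` is unbound once the
    # ``except`` block ends.
    details = {
        "msg": exc.msg,
        "pattern": exc.pattern,
        "pos": exc.pos,        # index of the offending "("
        "lineno": exc.lineno,  # 1-based line of ``pos``
        "colno": exc.colno,    # 1-based column of ``pos``
    }

print(details)
```

Reporting ``lineno`` and ``colno`` is mostly useful for multi-line patterns,
such as those written for :const:`VERBOSE` mode.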

.. _re-objects:

Regular Expression Objects
--------------------------

Compiled regular expression objects support the following methods and
attributes:

.. method:: Pattern.search(string[, pos[, endpos]])

   Scan through *string* looking for the first location where this regular
   expression produces a match, and return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.

   The optional second parameter *pos* gives an index in the string where the
   search is to start; it defaults to ``0``. This is not completely equivalent to
   slicing the string; the ``'^'`` pattern character matches at the real beginning
   of the string and at positions just after a newline, but not necessarily at the
   index where the search is to start.

   The optional parameter *endpos* limits how far the string will be searched; it
   will be as if the string is *endpos* characters long, so only the characters
   from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
   expression object, ``rx.search(string, 0, 50)`` is equivalent to
   ``rx.search(string[:50], 0)``. ::

      >>> pattern = re.compile("d")
      >>> pattern.search("dog")     # Match at index 0
      <re.Match object; span=(0, 1), match='d'>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"

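The difference between passing *pos* and slicing the string can be seen in a
small sketch. The sample text and variable names are chosen only for
illustration.

```python
import re

text = "ab"
starts_with_b = re.compile(r"^b")

# With pos=1 the scan begins at "b", but "^" still refers to the real
# beginning of the string, so there is no match.
with_pos = starts_with_b.search(text, 1)

# Slicing builds a new string whose real beginning is "b", so "^" matches.
with_slice = starts_with_b.search(text[1:])

# endpos=2 makes the search behave as if the string were only "ab".
limited = re.compile("c").search("abc", 0, 2)
```

Here ``with_pos`` and ``limited`` are ``None``, while ``with_slice`` is a
match object.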

.. method:: Pattern.match(string[, pos[, endpos]])

   If zero or more characters at the *beginning* of *string* match this regular
   expression, return a corresponding :ref:`match object <match-objects>`.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <re.Match object; span=(1, 2), match='o'>

   If you want to locate a match anywhere in *string*, use
   :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).


.. method:: Pattern.fullmatch(string[, pos[, endpos]])

   If the whole *string* matches this regular expression, return a corresponding
   :ref:`match object <match-objects>`. Return ``None`` if the string does not
   match the pattern; note that this is different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o[gh]")
      >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.fullmatch("ogre")     # No match as not the full string matches.
      >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
      <re.Match object; span=(1, 3), match='og'>

   .. versionadded:: 3.4

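To compare the three matching methods side by side, the following sketch
applies one compiled pattern in each way. The pattern and input strings are
arbitrary examples.

```python
import re

word = re.compile(r"[a-z]+")
text = "abc123"

prefix = word.match(text)              # matches "abc" at the start
whole = word.fullmatch(text)           # None: "123" is left over
anywhere = word.search("123abc")       # finds "abc" starting at index 3
bounded = word.fullmatch(text, 0, 3)   # endpos=3 limits the string to "abc"
```

Only ``whole`` is ``None``; the other three calls return match objects.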

.. method:: Pattern.split(string, maxsplit=0)

   Identical to the :func:`split` function, using the compiled pattern.


.. method:: Pattern.findall(string[, pos[, endpos]])

   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region, as for :meth:`~Pattern.search`.


.. method:: Pattern.finditer(string[, pos[, endpos]])

   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region, as for :meth:`~Pattern.search`.

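A short sketch of how *pos* and *endpos* restrict :meth:`~Pattern.findall`
and :meth:`~Pattern.finditer`. The sample text is invented for this example.
Note that the reported spans still refer to positions in the full string.

```python
import re

digits = re.compile(r"\d")
text = "a1b2c3d4"

everything = digits.findall(text)      # scans the whole string
window = digits.findall(text, 2, 6)    # scans only "b2c3"

# Spans are indices into ``text``, not into the scanned window.
spans = [m.span() for m in digits.finditer(text, 2, 6)]
```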

.. method:: Pattern.sub(repl, string, count=0)

   Identical to the :func:`sub` function, using the compiled pattern.


.. method:: Pattern.subn(repl, string, count=0)

   Identical to the :func:`subn` function, using the compiled pattern.


.. attribute:: Pattern.flags

   The regex matching flags. This is a combination of the flags given to
   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
   flags such as :data:`UNICODE` if the pattern is a Unicode string.


.. attribute:: Pattern.groups

   The number of capturing groups in the pattern.


.. attribute:: Pattern.groupindex

   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   numbers. The dictionary is empty if no symbolic groups were used in the
   pattern.


.. attribute:: Pattern.pattern

   The pattern string from which the pattern object was compiled.

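The pattern attributes can be inspected as in the sketch below. The date-like
pattern is only an example, and the variable names are arbitrary.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(\d{2})", re.IGNORECASE)

has_ignorecase = bool(date.flags & re.IGNORECASE)  # passed explicitly
has_unicode = bool(date.flags & re.UNICODE)        # implied by a str pattern
group_count = date.groups                          # named and unnamed groups
names = dict(date.groupindex)                      # only the named group
source = date.pattern                              # the original pattern text
```

Testing individual flags with ``&`` is more robust than comparing
``Pattern.flags`` against a single constant, because implicit flags are
included in the value.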

.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Compiled
   regular expression objects are considered atomic.


.. _match-objects:

Match Objects
-------------

Match objects always have a boolean value of ``True``.
Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
when there is no match, you can test whether there was a match with a simple
``if`` statement::

   match = re.search(pattern, string)
   if match:
       process(match)

Match objects support the following methods and attributes:


.. method:: Match.expand(template)

   Return the string obtained by doing backslash substitution on the template
   string *template*, as done by the :meth:`~Pattern.sub` method.
   Escapes such as ``\n`` are converted to the appropriate characters,
   and numeric backreferences (``\1``, ``\2``) and named backreferences
   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
   corresponding group.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.
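
A brief sketch of :meth:`~Match.expand` with numeric and named backreferences.
The group names ``first`` and ``last`` are chosen for this example only.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

by_number = m.expand(r"\2, \1")               # numeric backreferences
by_name = m.expand(r"\g<last>, \g<first>")    # named backreferences
with_newline = m.expand(r"\1\n\2")            # "\n" becomes a real newline
```

Raw string literals keep the backslashes intact so that they reach
:meth:`~Match.expand` unchanged.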

.. method:: Match.group([group1, ...])

   Returns one or more subgroups of the match. If there is a single argument, the
   result is a single string; if there are multiple arguments, the result is a
   tuple with one item per argument. Without arguments, *group1* defaults to zero
   (the whole match is returned). If a *groupN* argument is zero, the corresponding
   return value is the entire matching string; if it is in the inclusive range
   [1..99], it is the string matching the corresponding parenthesized group. If a
   group number is negative or larger than the number of groups defined in the
   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
   part of the pattern that did not match, the corresponding result is ``None``.
   If a group is contained in a part of the pattern that matched multiple times,
   the last match is returned. ::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)       # The entire match
      'Isaac Newton'
      >>> m.group(1)       # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)       # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')

   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   arguments may also be strings identifying groups by their group name. If a
   string argument is not used as a group name in the pattern, an :exc:`IndexError`
   exception is raised.

   A moderately complicated example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

   Named groups can also be referred to by their index::

      >>> m.group(1)
      'Malcolm'
      >>> m.group(2)
      'Reynolds'

   If a group matches multiple times, only the last match is accessible::

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)                        # Returns only the last match.
      'c3'


.. method:: Match.__getitem__(g)

   This is identical to ``m.group(g)``. This allows easier access to
   an individual group from a match::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m[0]       # The entire match
      'Isaac Newton'
      >>> m[1]       # The first parenthesized subgroup.
      'Isaac'
      >>> m[2]       # The second parenthesized subgroup.
      'Newton'

   .. versionadded:: 3.6


.. method:: Match.groups(default=None)

   Return a tuple containing all the subgroups of the match, from 1 up to however
   many groups are in the pattern. The *default* argument is used for groups that
   did not participate in the match; it defaults to ``None``.

   For example::

      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
      >>> m.groups()
      ('24', '1632')

   If we make the decimal place and everything after it optional, not all groups
   might participate in the match. These groups will default to ``None`` unless
   the *default* argument is given::

      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()      # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')   # Now, the second group defaults to '0'.
      ('24', '0')


.. method:: Match.groupdict(default=None)

   Return a dictionary containing all the *named* subgroups of the match, keyed by
   the subgroup name. The *default* argument is used for groups that did not
   participate in the match; it defaults to ``None``. For example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}

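The effect of the *default* argument of :meth:`~Match.groupdict` shows up when
a named group is optional. In this sketch the ``major`` and ``minor`` group
names are hypothetical.

```python
import re

version = re.compile(r"(?P<major>\d+)(?:\.(?P<minor>\d+))?")
m = version.match("3")

# The optional "minor" group did not participate in the match.
plain = m.groupdict()
filled = m.groupdict("0")
```

``plain`` maps ``'minor'`` to ``None``, whereas ``filled`` maps it to ``'0'``.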

.. method:: Match.start([group])
            Match.end([group])

   Return the indices of the start and end of the substring matched by *group*;
   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   *group* exists but did not contribute to the match. For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string. For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

   An example that will remove *remove_this* from email addresses::

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


.. method:: Match.span([group])

   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
   that if *group* did not contribute to the match, this is ``(-1, -1)``.
   *group* defaults to zero, the entire match.

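The ``-1`` results described for :meth:`~Match.start`, :meth:`~Match.end` and
:meth:`~Match.span` can be observed with an alternation in which only one
branch participates. The patterns are minimal examples chosen for this sketch.

```python
import re

m = re.match(r"(a)|(b)", "a")

matched = m.span(1)      # group 1 matched "a"
unmatched = m.span(2)    # group 2 exists but did not contribute
start_of_2 = m.start(2)

# Slicing with the -1 indices yields an empty string, not None, so the
# slice is only equivalent to m.group(g) for groups that did contribute.
sliced = m.string[m.start(2):m.end(2)]
grouped = m.group(2)

# A group that matched the empty string has equal start and end.
empty_span = re.search(r"b(c?)", "cba").span(1)
```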

.. attribute:: Match.pos

   The value of *pos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string at which the RE engine started looking for a match.


.. attribute:: Match.endpos

   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string beyond which the RE engine will not go.


.. attribute:: Match.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   string.


.. attribute:: Match.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.

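The ``lastindex`` values listed above can be checked directly, and
``lastgroup`` can report which named alternative matched. The ``num`` and
``word`` group names are made up for this sketch.

```python
import re

patterns = [r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)"]
last_indexes = [re.match(p, "ab").lastindex for p in patterns]

# With named alternatives, lastgroup names the branch that matched.
token = re.match(r"(?P<num>\d+)|(?P<word>[a-z]+)", "abc")
branch = token.lastgroup
branch_index = token.lastindex

# An unnamed group has no name; a pattern without groups has no index.
unnamed = re.match(r"(a)", "a").lastgroup
no_groups = re.match(r"a", "a").lastindex
```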

.. attribute:: Match.re

   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
   :meth:`~Pattern.search` method produced this match instance.


.. attribute:: Match.string

   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.

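A small sketch showing how a match object records the context it was created
in. The values are illustrative; when *endpos* is not passed, the attribute
reflects the length of the string.

```python
import re

pattern = re.compile("o")
m = pattern.search("dog", 1)

start_index = m.pos           # the pos argument that was passed
end_index = m.endpos          # defaults to len("dog")
same_pattern = m.re is pattern
subject = m.string
```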

.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Match objects
   are considered atomic.


.. _re-examples:

Regular Expression Examples
---------------------------


Checking for a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following::

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as such::

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~Match.group` method of the match object in the following manner::

   >>> pair = re.compile(r".*(.).*\1")
   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'

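One way to avoid the :exc:`AttributeError` shown above is to check for
``None`` before calling :meth:`~Match.group`. The helper below is only a
sketch; ``PAIR`` and ``pair_rank`` are hypothetical names, not part of
:mod:`re`.

```python
import re

PAIR = re.compile(r".*(.).*\1")


def pair_rank(hand):
    """Return the card that appears at least twice in *hand*, or None."""
    m = PAIR.match(hand)
    # match() returns None when the hand contains no pair.
    return m.group(1) if m is not None else None


ranks = [pair_rank(hand) for hand in ("717ak", "718ak", "354aa")]
```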

Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings

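Unlike :c:func:`scanf`, a regular expression yields strings, so numeric
fields have to be converted explicitly. The sketch below applies the
expression above; ``REPORT`` and ``parse_report`` are names invented for this
example.

```python
import re

REPORT = re.compile(r"(\S+) - (\d+) errors, (\d+) warnings")


def parse_report(line):
    """Return (filename, errors, warnings), or None if *line* doesn't fit."""
    m = REPORT.match(line)
    if m is None:
        return None
    filename, errors, warnings = m.groups()
    # The captured groups are strings; convert the counts to integers.
    return filename, int(errors), int(warnings)


result = parse_report("/usr/sbin/sendmail - 0 errors, 4 warnings")
```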

.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <re.Match object; span=(2, 3), match='c'>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <re.Match object; span=(0, 1), match='a'>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line. ::

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <re.Match object; span=(4, 5), match='X'>

1475
1476Making a Phonebook
1477^^^^^^^^^^^^^^^^^^
1478
Georg Brandl48310cd2009-01-03 21:18:54 +00001479:func:`split` splits a string into a list delimited by the passed pattern. The
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001480method is invaluable for converting textual data into data structures that can be
1481easily read and modified by Python as demonstrated in the following example that
1482creates a phonebook.
1483
Christian Heimes255f53b2007-12-08 15:33:56 +00001484First, here is the input. Normally it may come from a file, here we are using
Stéphane Wirtel859c0682018-10-12 09:51:05 +02001485triple-quoted string syntax
1486
1487.. doctest::
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001488
Georg Brandl557a3ec2012-03-17 17:26:27 +01001489 >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl48310cd2009-01-03 21:18:54 +00001490 ...
Christian Heimesfe337bf2008-03-23 21:54:12 +00001491 ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
1492 ... Frank Burger: 925.541.7625 662 South Dogwood Way
1493 ...
1494 ...
1495 ... Heather Albrecht: 548.326.4584 919 Park Place"""
Christian Heimes255f53b2007-12-08 15:33:56 +00001496
1497The entries are separated by one or more newlines. Now we convert the string
Christian Heimesfe337bf2008-03-23 21:54:12 +00001498into a list with each nonempty line having its own entry:
1499
1500.. doctest::
1501 :options: +NORMALIZE_WHITESPACE
Christian Heimes255f53b2007-12-08 15:33:56 +00001502
Georg Brandl557a3ec2012-03-17 17:26:27 +01001503 >>> entries = re.split("\n+", text)
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001504 >>> entries
Christian Heimesfe337bf2008-03-23 21:54:12 +00001505 ['Ross McFluff: 834.345.1254 155 Elm Street',
1506 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
1507 'Frank Burger: 925.541.7625 662 South Dogwood Way',
1508 'Heather Albrecht: 548.326.4584 919 Park Place']
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001509
Finally, split each entry into a list with first name, last name, telephone
number, and address.  We use the ``maxsplit`` parameter of :func:`split`
because the address itself contains spaces, which are also our splitting
pattern:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

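As a minimal sketch of where the split rows might go next, the two
:func:`split` calls can be combined to build a dictionary keyed by full name.
The helper name ``build_phonebook`` and the ``(phone, address)`` tuple layout
are assumptions made for illustration, not part of the module:

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""


def build_phonebook(raw):
    """Map 'First Last' to a (phone, address) tuple (hypothetical helper)."""
    phonebook = {}
    # One or more newlines separate the entries.
    for entry in re.split(r"\n+", raw):
        # maxsplit=3 keeps the address, which contains spaces, in one piece.
        first, last, phone, address = re.split(":? ", entry, maxsplit=3)
        phonebook[f"{first} {last}"] = (phone, address)
    return phonebook


phonebook = build_phonebook(text)
print(phonebook["Frank Burger"])
# ('925.541.7625', '662 South Dogwood Way')
```

The unpacking assumes every entry has exactly the four fields shown above; a
malformed line would raise :exc:`ValueError` rather than be silently skipped.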

Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function.  This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

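Because the shuffle is random, the output above differs from run to run.  If
repeatable output is wanted (for instance in a test), one option is to shuffle
with a seeded :class:`random.Random` instance.  The sketch below assumes a
hypothetical wrapper named ``munge`` with a ``seed`` parameter; neither is part
of the module:

```python
import random
import re


def munge(text, seed=None):
    """Shuffle the interior letters of each word (hypothetical helper)."""
    # A private generator makes runs repeatable for a given seed and
    # leaves the global random state untouched.
    rng = random.Random(seed)

    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    return re.sub(r"(\w)(\w+)(\w)", repl, text)


text = "Professor Abdolmalek, please report your absences promptly."
munged = munge(text, seed=0)

# The same seed always yields the same munged sentence.
assert munge(text, seed=0) == munged
```

Words of fewer than three characters never match the pattern and therefore
pass through unchanged.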

Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does.  For example, if a writer wanted to
find all of the adverbs in some text, they might use :func:`findall` in
the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']

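The pattern ``\w+ly`` also matches when "ly" occurs in the middle of a word.
As a small sketch with a made-up sentence, appending the word-boundary
assertion ``\b`` restricts matches to words that actually end in "ly":

```python
import re

text = "They were happily multiplying numbers quickly."

# Without a boundary, the "multiply" inside "multiplying" is reported too.
loose = re.findall(r"\w+ly", text)
print(loose)   # ['happily', 'multiply', 'quickly']

# With \b, "ly" must be followed by the end of the word.
strict = re.findall(r"\w+ly\b", text)
print(strict)  # ['happily', 'quickly']
```

Neither pattern knows any grammar: both still report non-adverbs that end in
"ly", such as "family".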

Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings.  Continuing with the previous example, if
a writer wanted to find all of the adverbs *and their positions* in
some text, they would use :func:`finditer` in the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly

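The reported positions are ordinary slice offsets into the searched string.
The following sketch collects them with :meth:`~re.Match.span` and checks that
slicing recovers each match; the variable name ``hits`` is only illustrative:

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Each item is (start, end, matched_text); end is exclusive, as in slicing.
hits = [(*m.span(), m.group(0)) for m in re.finditer(r"\w+ly", text)]
print(hits)
# [(7, 16, 'carefully'), (40, 47, 'quickly')]

for start, end, word in hits:
    # Slicing with the reported offsets yields exactly the matched text.
    assert text[start:end] == word
```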

Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it.  For example, the following two lines of code are
functionally identical::

   >>> re.match(r"\W(.)\1\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular
expression.  With raw string notation, this means ``r"\\"``.  Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical::

   >>> re.match(r"\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>
   >>> re.match("\\\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>

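The equivalence can be checked on the strings themselves, without involving
the regular expression engine.  The short sketch below uses illustrative
variable names and also shows why a forgotten escape can fail silently: in a
regular string literal, ``"\1"`` is an octal escape for a single control
character, not a backreference:

```python
# Both spellings produce the same nine-character pattern string.
raw = r"\W(.)\1\W"
escaped = "\\W(.)\\1\\W"
assert raw == escaped
assert len(raw) == 9

# A pattern for one literal backslash is two characters long either way.
assert r"\\" == "\\\\"
assert len(r"\\") == 2

# Without the r prefix, "\1" is the single character with code point 1,
# so the backreference would be lost before re ever sees the pattern.
assert "\1" == "\x01"
assert len(r"\1") == 2
```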

Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^

A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
analyzes a string to categorize groups of characters.  This is a useful first
step in writing a compiler or interpreter.

The text categories are specified with regular expressions.  The technique is
to combine those into a single master regular expression and to loop over
successive matches::

    import collections
    import re

    Token = collections.namedtuple('Token', ['type', 'value', 'line', 'column'])

    def tokenize(code):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
            ('ASSIGN',   r':='),           # Assignment operator
            ('END',      r';'),            # Statement terminator
            ('ID',       r'[A-Za-z]+'),    # Identifiers
            ('OP',       r'[+\-*/]'),      # Arithmetic operators
            ('NEWLINE',  r'\n'),           # Line endings
            ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
            ('MISMATCH', r'.'),            # Any other character
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        line_num = 1
        line_start = 0
        for mo in re.finditer(tok_regex, code):
            kind = mo.lastgroup
            value = mo.group()
            column = mo.start() - line_start
            if kind == 'NUMBER':
                value = float(value) if '.' in value else int(value)
            elif kind == 'ID' and value in keywords:
                kind = value
            elif kind == 'NEWLINE':
                line_start = mo.end()
                line_num += 1
                continue
            elif kind == 'SKIP':
                continue
            elif kind == 'MISMATCH':
                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
            yield Token(kind, value, line_num, column)

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)

The tokenizer produces the following output::

    Token(type='IF', value='IF', line=2, column=4)
    Token(type='ID', value='quantity', line=2, column=7)
    Token(type='THEN', value='THEN', line=2, column=16)
    Token(type='ID', value='total', line=3, column=8)
    Token(type='ASSIGN', value=':=', line=3, column=14)
    Token(type='ID', value='total', line=3, column=17)
    Token(type='OP', value='+', line=3, column=23)
    Token(type='ID', value='price', line=3, column=25)
    Token(type='OP', value='*', line=3, column=31)
    Token(type='ID', value='quantity', line=3, column=33)
    Token(type='END', value=';', line=3, column=41)
    Token(type='ID', value='tax', line=4, column=8)
    Token(type='ASSIGN', value=':=', line=4, column=12)
    Token(type='ID', value='price', line=4, column=15)
    Token(type='OP', value='*', line=4, column=21)
    Token(type='NUMBER', value=0.05, line=4, column=23)
    Token(type='END', value=';', line=4, column=27)
    Token(type='ENDIF', value='ENDIF', line=5, column=4)
    Token(type='END', value=';', line=5, column=9)

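The same master-pattern technique can be seen in miniature in the sketch
below, which tokenizes simple arithmetic expressions.  The names
``TOKEN_SPEC``, ``MASTER`` and ``tokenize_expr`` are hypothetical, and the token
set is deliberately smaller than the one above:

```python
import re

# Alternatives are tried left to right, so the catch-all MISMATCH comes last.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),         # Integer literal
    ("OP", r"[+\-*/()]"),       # Operators and parentheses
    ("SKIP", r"\s+"),           # Whitespace, discarded
    ("MISMATCH", r"."),         # Any other character
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize_expr(source):
    """Yield (kind, text) pairs for an arithmetic expression."""
    for mo in MASTER.finditer(source):
        kind = mo.lastgroup  # Name of the alternative that matched
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError(f"unexpected {mo.group()!r} at offset {mo.start()}")
        yield kind, mo.group()


tokens = list(tokenize_expr("2 * (3 + 40)"))
print(tokens)
```

Since each alternative is a named group, :attr:`~re.Match.lastgroup` identifies
the category of every match without any further inspection of the text.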

.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
   Media, 2009. The third edition of the book no longer covers Python at all,
   but the first edition covered writing good regular expression patterns in
   great detail.