:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

**Source code:** :source:`Lib/re.py`

--------------

This module provides regular expression matching operations similar to
those found in Perl.

Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal. Also, please note that any invalid escape sequences in Python's
usage of the backslash in string literals now generate a :exc:`DeprecationWarning`
and in the future this will become a :exc:`SyntaxError`. This behaviour
will happen even if it is a valid escape sequence for a regular expression.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.

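As a minimal sketch of the difference between ordinary and raw string
literals, the following matches a single literal backslash both ways. The
sample string ``path`` is only an illustration and is not part of the module.

```python
import re

path = r'C:\temp'  # illustrative sample string containing one backslash

# Ordinary literal: four backslashes are needed to match one.
plain = re.search('\\\\', path)
# Raw literal: the regular expression \\ can be written as-is.
raw = re.search(r'\\', path)

print(plain.span(), raw.span())  # (2, 3) (2, 3)
print(len(r"\n"), len("\n"))     # 2 1
```
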
It is important to note that most regular expression operations are available as
module-level functions and methods on
:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
that don't require you to compile a regex object first, but miss some
fine-tuning parameters.

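The following sketch contrasts the two styles. It assumes that the optional
*pos* argument of a compiled pattern's ``search()`` method is a fair example of
a "fine-tuning parameter"; the sample text is illustrative.

```python
import re

text = 'one two three'

# Module-level shortcut: the pattern is compiled behind the scenes.
first = re.search(r'\w+', text)

# Compiled pattern: its search() method also accepts a start position,
# which the module-level function does not.
word = re.compile(r'\w+')
second = word.search(text, 4)

print(first.group(0))   # one
print(second.group(0))  # two
```
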
.. seealso::

   The third-party `regex <https://pypi.org/project/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and a more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contain low precedence
operations; boundary conditions between *A* and *B*; or have numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book [Frie09]_,
or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.


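A short sketch of the nesting rule, using the ``(?:a{6})*`` expression from
the paragraph above. The variable names are illustrative, and the directly
nested form ``a{6}*`` is assumed to be rejected when the pattern is compiled.

```python
import re

multiple_of_six = re.compile(r'(?:a{6})*')
twelve = multiple_of_six.fullmatch('a' * 12)
seven = multiple_of_six.fullmatch('a' * 7)
print(twelve is not None, seven is not None)  # True False

# Writing the second qualifier directly after the first is rejected.
try:
    re.compile(r'a{6}*')
    nested_ok = True
except re.error:
    nested_ok = False
print(nested_ok)  # False
```
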
The special characters are:

.. index:: single: . (dot); in regular expressions

``.``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

.. index:: single: ^ (caret); in regular expressions

``^``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

.. index:: single: $ (dollar); in regular expressions

``$``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.

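The ``$`` behaviour described above can be checked with a small sketch. The
variable names are illustrative; the strings are the ones used in the text.

```python
import re

text = 'foo1\nfoo2\n'
default = re.search(r'foo.$', text).group(0)
multiline = re.search(r'foo.$', text, re.MULTILINE).group(0)
print(default, multiline)  # foo2 foo1

# A lone $ finds two empty matches in 'foo\n'.
positions = [m.start() for m in re.finditer(r'$', 'foo\n')]
print(positions)  # [3, 4]
```
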
.. index:: single: * (asterisk); in regular expressions

``*``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

.. index:: single: + (plus); in regular expressions

``+``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

.. index:: single: ? (question mark); in regular expressions

``?``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

.. index::
   single: *?; in regular expressions
   single: +?; in regular expressions
   single: ??; in regular expressions

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
   string, and not just ``'<a>'``. Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using the RE ``<.*?>`` will match
   only ``'<a>'``.

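A brief sketch of greedy versus non-greedy matching, using the same sample
string as the text (the variable names are illustrative):

```python
import re

text = '<a> b <c>'
greedy = re.match(r'<.*>', text).group(0)
lazy = re.match(r'<.*?>', text).group(0)
print(greedy)  # <a> b <c>
print(lazy)    # <a>
```
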
.. index::
   single: {} (curly brackets); in regular expressions

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

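The counted qualifiers can be tried out as follows. This is only a sketch
built from the examples in the text; the variable names are illustrative.

```python
import re

greedy = re.match(r'a{3,5}', 'aaaaaa').group(0)
lazy = re.match(r'a{3,5}?', 'aaaaaa').group(0)
print(greedy, lazy)  # aaaaa aaa

# {4,} needs at least four 'a' characters before the 'b'.
four = re.fullmatch(r'a{4,}b', 'aaaab')
three = re.fullmatch(r'a{4,}b', 'aaab')
print(four is not None, three is not None)  # True False
```
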
.. index:: single: \ (backslash); in regular expressions

``\``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be doubled. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

.. index::
   single: [] (square brackets); in regular expressions

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   .. index:: single: - (minus); in regular expressions

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character
     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   .. index:: single: \ (backslash); in regular expressions

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`ASCII` or :const:`LOCALE` mode is in force.

   .. index:: single: ^ (caret); in regular expressions

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match a parenthesis.

   .. .. index:: single: --; in regular expressions
   .. .. index:: single: &&; in regular expressions
   .. .. index:: single: ~~; in regular expressions
   .. .. index:: single: ||; in regular expressions

   * Support of nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future. This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
     avoid a warning, escape them with a backslash.

   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/

   .. versionchanged:: 3.7
      :exc:`FutureWarning` is raised if a character set contains constructs
      that will change semantically in the future.

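The set rules listed above can be illustrated with a few
:func:`re.findall` calls. This is a sketch only; the input strings and
variable names are made up for the example.

```python
import re

# Two ranges side by side: two-digit numbers from 00 to 59.
minutes = re.findall(r'[0-5][0-9]', '07 59 60')
print(minutes)  # ['07', '59']

# An escaped '-' is a literal hyphen, not a range.
hyphen = re.findall(r'[a\-z]', 'a-z b')
print(hyphen)  # ['a', '-', 'z']

# A leading '^' complements the set.
not_five = re.findall(r'[^5]', '1525')
print(not_five)  # ['1', '2']

# A ']' placed first in the set is taken literally.
brackets = re.findall(r'[]()[{}]', 'f(x)[0]')
print(brackets)  # ['(', ')', '[', ']']
```
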
.. index:: single: | (vertical bar); in regular expressions

``|``
   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
   will match either *A* or *B*. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once *A* matches, *B* will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

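A small sketch of the left-to-right rule for ``'|'``. The patterns and
strings are illustrative and were chosen so that the order of the branches
changes the result.

```python
import re

# The left branch wins as soon as it matches, even though the right
# branch would have produced a longer match.
short_first = re.match(r'foo|foobar', 'foobar').group(0)
long_first = re.match(r'foobar|foo', 'foobar').group(0)
print(short_first, long_first)  # foo foobar

# Two ways of matching a literal '|'.
bars = re.findall(r'a\|b|[|]', 'a|b |')
print(bars)  # ['a|b', '|']
```
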
.. index::
   single: () (parentheses); in regular expressions

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

.. index:: single: (?; in regular expressions

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?aiLmsux)``
   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
   for the entire regular expression.
   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flag* argument to the
   :func:`re.compile` function. Flags should be used first in the
   expression string.

.. index:: single: (?:; in regular expressions

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?aiLmsux-imsx:...)``
   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
   one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
   The letters set or remove the corresponding flags:
   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
   and :const:`re.X` (verbose), for the part of the expression.
   (The flags are described in :ref:`contents-of-module-re`.)

   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
   as inline flags, so they can't be combined or follow ``'-'``. Instead,
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default). In bytes patterns ``(?L:...)`` switches to locale-dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.

   .. versionadded:: 3.6

   .. versionchanged:: 3.7
      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.

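The difference between global and scoped inline flags can be sketched as
follows. The sample strings are illustrative, and the ``(?a:...)`` lines
assume Python 3.7 or later, as noted in the version information above.

```python
import re

# (?i) at the start applies to the whole pattern.
whole = re.match(r'(?i)python', 'PyThOn')

# (?i:...) applies only inside the group.
scoped_hit = re.match(r'(?i:py)thon', 'PYthon')
scoped_miss = re.match(r'(?i:py)thon', 'PYTHON')
print(whole is not None, scoped_hit is not None, scoped_miss is not None)
# True True False

# (?a:...) restricts \w to ASCII inside the group only (Python 3.7+).
ascii_first = re.fullmatch(r'(?a:\w)\w', 'a\u00e9')
accent_first = re.fullmatch(r'(?a:\w)\w', '\u00e9a')
print(ascii_first is not None, accent_first is not None)  # True False
```
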
.. index:: single: (?P<; in regular expressions

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object *m*      | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the *repl*      | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

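The three contexts in the table can be exercised with the ``quote`` pattern
from the text. This is a sketch; the input strings and the replacement text
are illustrative.

```python
import re

pattern = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')
m = pattern.search('say "hi" now')

print(m.group(0))        # "hi"
print(m.group('quote'))  # "
print(m.group(1))        # " (a named group is also a numbered group)
print(m.end('quote'))    # 5

# In a replacement string, \g<quote> refers to the named group.
replaced = pattern.sub(r'\g<quote>...\g<quote>', 'say "hi" and \'bye\'')
print(replaced)          # say "..." and '...'
```
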
.. index:: single: (?P=; in regular expressions

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

.. index:: single: (?#; in regular expressions

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a :dfn:`lookahead assertion`. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

.. index:: single: (?!; in regular expressions

``(?!...)``
   Matches if ``...`` doesn't match next. This is a :dfn:`negative lookahead assertion`.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

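The two lookahead examples above can be run as a short sketch. The second
name, ``'Isaac Newton'``, is an illustrative input that is not in the text.

```python
import re

positive = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
print(repr(positive.group(0)))  # 'Isaac ' (the lookahead is not consumed)

no_match = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')
print(no_match)                 # None

negative = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')
print(repr(negative.group(0)))  # 'Isaac '
```
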
.. index:: single: (?<=; in regular expressions

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

   .. versionchanged:: 3.5
      Added support for group references of fixed length.

.. index:: single: (?<!; in regular expressions

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length. Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or
   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
   optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.


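The conditional pattern above can be checked against the four strings
mentioned in the text. This is a sketch; the names ``email``, ``candidates``
and ``results`` are illustrative.

```python
import re

email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = ['<user@host.com>', 'user@host.com',
              '<user@host.com', 'user@host.com>']

# True where the pattern matches from the start of the string.
results = {s: email.match(s) is not None for s in candidates}
for candidate, matched in results.items():
    print(candidate, matched)
```

Only the first two candidates are expected to match, because the closing
``'>'`` is required exactly when the opening ``'<'`` was matched by group 1.
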
The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character. For example, ``\$`` matches the
character ``'$'``.

.. index:: single: \ (backslash); in regular expressions

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

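A minimal sketch of a numbered backreference, using the ``(.+) \1`` example
from the text (the variable names are illustrative):

```python
import re

doubled = re.compile(r'(.+) \1')

words = doubled.fullmatch('the the')
digits = doubled.fullmatch('55 55')
joined = doubled.search('thethe')  # no space between the repeats

print(words is not None, digits is not None, joined)  # True True None
```
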
.. index:: single: \A; in regular expressions

``\A``
   Matches only at the start of the string.

.. index:: single: \b; in regular expressions

``\b``
   Matches the empty string, but only at the beginning or end of a word.
   A word is defined as a sequence of word characters. Note that formally,
   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
   (or vice versa), or between ``\w`` and the beginning/end of the string.
   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.

   By default Unicode alphanumerics are the ones used in Unicode patterns, but
   this can be changed by using the :const:`ASCII` flag. Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.

.. index:: single: \B; in regular expressions

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end
   of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
   patterns are Unicode alphanumerics or the underscore, although this can
   be changed by using the :const:`ASCII` flag. Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.

.. index:: single: \d; in regular expressions

``\d``
   For Unicode (str) patterns:
      Matches any Unicode decimal digit (that is, any character in
      Unicode character category [Nd]).  This includes ``[0-9]``, and
      also many other digit characters.  If the :const:`ASCII` flag is
      used, only ``[0-9]`` is matched.

   For 8-bit (bytes) patterns:
      Matches any decimal digit; this is equivalent to ``[0-9]``.

.. index:: single: \D; in regular expressions

``\D``
   Matches any character which is not a decimal digit.  This is
   the opposite of ``\d``.  If the :const:`ASCII` flag is used, this
   becomes the equivalent of ``[^0-9]``.

.. index:: single: \s; in regular expressions

``\s``
   For Unicode (str) patterns:
      Matches Unicode whitespace characters (which includes
      ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many
      languages).  If the :const:`ASCII` flag is used, only
      ``[ \t\n\r\f\v]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
      this is equivalent to ``[ \t\n\r\f\v]``.

.. index:: single: \S; in regular expressions

``\S``
   Matches any character which is not a whitespace character.  This is
   the opposite of ``\s``.  If the :const:`ASCII` flag is used, this
   becomes the equivalent of ``[^ \t\n\r\f\v]``.

.. index:: single: \w; in regular expressions

``\w``
   For Unicode (str) patterns:
      Matches Unicode word characters; this includes most characters
      that can be part of a word in any language, as well as numbers and
      the underscore.  If the :const:`ASCII` flag is used, only
      ``[a-zA-Z0-9_]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered alphanumeric in the ASCII character set;
      this is equivalent to ``[a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
      used, matches characters considered alphanumeric in the current locale
      and the underscore.

.. index:: single: \W; in regular expressions

``\W``
   Matches any character which is not a word character.  This is
   the opposite of ``\w``.  If the :const:`ASCII` flag is used, this
   becomes the equivalent of ``[^a-zA-Z0-9_]``.  If the :const:`LOCALE` flag is
   used, matches characters which are neither alphanumeric in the current
   locale nor the underscore.

.. index:: single: \Z; in regular expressions

``\Z``
   Matches only at the end of the string.

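The word-boundary and digit rules above can be checked with a short sketch.
The sample strings and variable names are chosen here purely for illustration;
``'\u0663'`` is ARABIC-INDIC DIGIT THREE, used as an example of a non-ASCII
decimal digit.

```python
import re

# \b only matches where a word character meets a non-word character
# (or the start/end of the string).
word = re.compile(r'\bfoo\b')
candidates = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']
boundary_hits = [s for s in candidates if word.search(s)]

# \B is the opposite: it requires that there is no boundary at that position.
no_boundary_hits = [s for s in ['python', 'py3', 'py', 'py.'] if re.search(r'py\B', s)]

# In a str pattern, \d accepts any Unicode decimal digit unless ASCII is set.
arabic_indic_three = '\u0663'
unicode_digit = re.fullmatch(r'\d', arabic_indic_three)
ascii_digit = re.fullmatch(r'\d', arabic_indic_three, re.ASCII)

print(boundary_hits)
print(no_boundary_hits)
print(unicode_digit is not None, ascii_digit is not None)
```
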

.. index::
   single: \a; in regular expressions
   single: \b; in regular expressions
   single: \f; in regular expressions
   single: \n; in regular expressions
   single: \N; in regular expressions
   single: \r; in regular expressions
   single: \t; in regular expressions
   single: \u; in regular expressions
   single: \U; in regular expressions
   single: \v; in regular expressions
   single: \x; in regular expressions
   single: \\; in regular expressions

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \N      \r      \t      \u
   \U      \v      \x      \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
patterns.  In bytes patterns they are errors.

Octal escapes are included in a limited form.  If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape.  Otherwise, it is
a group reference.  As for string literals, octal escapes are always at most
three digits in length.

.. versionchanged:: 3.3
   The ``'\u'`` and ``'\U'`` escape sequences have been added.

.. versionchanged:: 3.6
   Unknown escapes consisting of ``'\'`` and an ASCII letter are now errors.

.. versionchanged:: 3.8
   The ``'\N{name}'`` escape sequence has been added.  As in string literals,
   it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).

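The distinction between octal escapes and group references described above
can be seen in a small sketch.  The patterns are illustrative only, and the
variable names are not part of the module.

```python
import re

# Three octal digits form an octal escape: \141 is code point 0o141, i.e. 'a'.
octal_three_digits = re.fullmatch(r'\141', 'a')

# A leading 0 also marks an octal escape: \07 is the BEL character.
octal_leading_zero = re.fullmatch(r'\07', '\x07')

# A short number that does not start with 0 is a group reference:
# \1 must repeat whatever group 1 matched.
group_reference = re.fullmatch(r'(ab)\1', 'abab')

print(octal_three_digits, octal_leading_zero, group_reference)
```
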

.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception.  Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions.  Most non-trivial applications use the compiled
form.

.. versionchanged:: 3.6
   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
   :class:`enum.IntFlag`.

.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a :ref:`regular expression object
   <re-objects>`, which can be used for matching using its
   :meth:`~Pattern.match`, :meth:`~Pattern.search` and other methods, described
   below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.compile` and the module-level matching functions are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.

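As a minimal sketch of combining flags with ``|`` and reusing one compiled
object across several inputs (the pattern and sample texts are made up for
illustration):

```python
import re

# Compile once with two flags combined by bitwise OR, then reuse the object.
first_words = re.compile(r'^[a-z]+', re.IGNORECASE | re.MULTILINE)

texts = ('One two\nthree four', 'FIVE six')
results = [first_words.findall(text) for text in texts]
print(results)

# The module-level function gives the same answer as the compiled method.
same = re.match(r'[a-z]+', 'abc').group() == re.compile(r'[a-z]+').match('abc').group()
print(same)
```
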

.. data:: A
          ASCII

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching.  This is only
   meaningful for Unicode patterns, and is ignored for byte patterns.
   Corresponds to the inline flag ``(?a)``.

   Note that for backward compatibility, the :const:`re.U` flag still
   exists (as well as its synonym :const:`re.UNICODE` and its embedded
   counterpart ``(?u)``), but these are redundant in Python 3 since
   matches are Unicode by default for strings (and Unicode matching
   isn't allowed for bytes).


.. data:: DEBUG

   Display debug information about the compiled expression.
   No corresponding inline flag.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
   match lowercase letters.  Full Unicode matching (such as ``Ü`` matching
   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
   non-ASCII matches.  The current locale does not change the effect of this
   flag unless the :const:`re.LOCALE` flag is also used.
   Corresponds to the inline flag ``(?i)``.

   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
   combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
   If the :const:`ASCII` flag is used, only letters 'a' to 'z'
   and 'A' to 'Z' are matched.

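The interaction between :const:`IGNORECASE` and :const:`ASCII` described above
can be sketched as follows.  The characters are written as escapes so that the
example does not depend on the source file encoding, and the variable names are
illustrative only.

```python
import re

upper_u_umlaut = '\u00dc'   # LATIN CAPITAL LETTER U WITH DIAERESIS
lower_u_umlaut = '\u00fc'   # LATIN SMALL LETTER U WITH DIAERESIS
kelvin_sign = '\u212a'      # KELVIN SIGN, which lowercases to 'k'

# Full Unicode case-insensitive matching is the default for str patterns.
unicode_fold = re.fullmatch(lower_u_umlaut, upper_u_umlaut, re.IGNORECASE)

# With ASCII, case-insensitive matching is limited to ASCII letters.
ascii_fold = re.fullmatch(lower_u_umlaut, upper_u_umlaut, re.IGNORECASE | re.ASCII)

# [a-z] with IGNORECASE also accepts the Kelvin sign, unless ASCII is set.
kelvin_unicode = re.fullmatch('[a-z]', kelvin_sign, re.IGNORECASE)
kelvin_ascii = re.fullmatch('[a-z]', kelvin_sign, re.IGNORECASE | re.ASCII)

print(unicode_fold, ascii_fold, kelvin_unicode, kelvin_ascii)
```
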

.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
   dependent on the current locale.  This flag can be used only with bytes
   patterns.  The use of this flag is discouraged because the locale mechanism
   is very unreliable: it only handles one "culture" at a time, and it only
   works with 8-bit locales.  Unicode matching is already enabled by default
   in Python 3 for Unicode (str) patterns, and it is able to handle different
   locales/languages.
   Corresponds to the inline flag ``(?L)``.

   .. versionchanged:: 3.6
      :const:`re.LOCALE` can be used only with bytes patterns and is
      not compatible with :const:`re.ASCII`.

   .. versionchanged:: 3.7
      Compiled regular expression objects with the :const:`re.LOCALE` flag no
      longer depend on the locale at compile time.  Only the locale at
      matching time affects the result of matching.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline).  By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.
   Corresponds to the inline flag ``(?m)``.

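A small sketch of the difference, using a made-up two-line string that ends
with a newline:

```python
import re

text = 'first line\nsecond line\n'

# Without MULTILINE, '^' anchors only at the start of the string and '$'
# only at the end (or just before a trailing newline).
default_starts = re.findall(r'^\w+', text)
default_ends = re.findall(r'\w+$', text)

# With MULTILINE, both anchors also apply at every line break.
multiline_starts = re.findall(r'^\w+', text, re.MULTILINE)
multiline_ends = re.findall(r'\w+$', text, re.MULTILINE)

print(default_starts, default_ends)
print(multiline_starts, multiline_ends)
```
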

.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.
   Corresponds to the inline flag ``(?s)``.

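For instance, the following sketch (with an illustrative three-character
string) shows the flag and its inline form side by side:

```python
import re

text = 'a\nb'

without_flag = re.fullmatch(r'a.b', text)            # '.' refuses the newline
with_flag = re.fullmatch(r'a.b', text, re.DOTALL)    # '.' accepts the newline
with_inline_flag = re.fullmatch(r'(?s)a.b', text)    # same effect, set inline

print(without_flag, with_flag, with_inline_flag)
```
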

.. data:: X
          VERBOSE

   .. index:: single: # (hash); in regular expressions

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments.  Whitespace within the pattern is ignored, except
   when in a character class, or when preceded by an unescaped backslash,
   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")

   Corresponds to the inline flag ``(?x)``.


.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :ref:`match object
   <match-objects>`.  Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.


.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :ref:`match object
   <match-objects>`.  Return ``None`` if the string does not match the pattern;
   note that this is different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).


.. function:: fullmatch(pattern, string, flags=0)

   If the whole *string* matches the regular expression *pattern*, return a
   corresponding :ref:`match object <match-objects>`.  Return ``None`` if the
   string does not match the pattern; note that this is different from a
   zero-length match.

   .. versionadded:: 3.4

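The differences between :func:`search`, :func:`match` and :func:`fullmatch`
can be summarised in a short sketch.  The strings and variable names are
illustrative only.

```python
import re

text = 'A\nXY'

# match() is anchored at the start of the string, even with MULTILINE.
match_second_line = re.match('X', text, re.MULTILINE)

# search() scans the whole string; with MULTILINE, '^' matches after the newline.
search_second_line = re.search('^X', text, re.MULTILINE)

# fullmatch() requires the pattern to cover the entire string.
prefix_only = re.match('X', 'XY')
whole_string = re.fullmatch('X', 'XY')

# A zero-length match is still a match object, not None.
empty_match = re.match('x*', 'abc')

print(match_second_line, search_second_line.span())
print(prefix_only.span(), whole_string, empty_match.span())
```
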

.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*.  If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list.  If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. ::

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string.  The same holds for
   the end of the string::

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list.

   Empty matches for the pattern split the string only when not adjacent
   to a previous empty match.

      >>> re.split(r'\b', 'Words, words, words.')
      ['', 'Words', ', ', 'words', ', ', 'words', '.']
      >>> re.split(r'\W*', '...words...')
      ['', '', 'w', 'o', 'r', 'd', 's', '', '']
      >>> re.split(r'(\W*)', '...words...')
      ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.7
      Added support of splitting on a pattern that could match an empty string.


.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings.  The *string* is scanned left-to-right, and matches are returned in
   the order found.  If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group.  Empty matches are included in the result.

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.

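How the number of groups changes the shape of the result can be seen in this
sketch; the ``key=value`` text is a made-up example.

```python
import re

text = 'a=1, b=2'

# No groups: each item is the whole match.
whole_matches = re.findall(r'\w+=\d+', text)

# One group: each item is the text of that group.
one_group = re.findall(r'\w+=(\d+)', text)

# Several groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\w+)=(\d+)', text)

print(whole_matches, one_group, two_groups)
```
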

.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
   all non-overlapping matches for the RE *pattern* in *string*.  The *string*
   is scanned left-to-right, and matches are returned in the order found.  Empty
   matches are included in the result.

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.

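Because :func:`finditer` yields match objects rather than strings, positions
are available as well as the matched text.  A minimal sketch with an
illustrative input:

```python
import re

# Collect the matched text and its (start, end) span for every run of digits.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', 'a12b345')]
print(found)
```
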

.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
   *string* is returned unchanged.  *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed.  That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth.  Unknown escapes such as ``\&`` are left alone.  Backreferences, such
   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
   For example::

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*.  The function takes a single :ref:`match object <match-objects>`
   argument, and returns the replacement string.  For example::

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'

   The pattern may be a string or a :ref:`pattern object <re-objects>`.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer.  If omitted or zero, all
   occurrences will be replaced.  Empty matches for the pattern are replaced only
   when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
   ``'-a-b--d-'``.

   .. index:: single: \g; in regular expressions

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax.  ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``.  ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``.  The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

   .. versionchanged:: 3.6
      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
      are now errors.

   .. versionchanged:: 3.7
      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
      are now errors.

      Empty matches for the pattern are replaced when adjacent to a previous
      non-empty match.

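The ``\g<...>`` forms and the *count* argument of :func:`sub` can be
illustrated with a brief sketch.  The group names ``first`` and ``second`` and
the sample strings are chosen here for illustration only.

```python
import re

# \g<name> refers to a named group defined with (?P<name>...).
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>',
                 'hello world')

# \g<2>0 means "group 2 followed by a literal 0"; \20 would mean group 20.
group_then_zero = re.sub(r'(a)(b)', r'\g<2>0', 'ab')

# count limits how many occurrences are replaced.
limited = re.sub('a', 'x', 'banana', count=2)

print(swapped, group_then_zero, limited)
```
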

.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

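A minimal sketch of the returned tuple, collapsing runs of whitespace in a
made-up string:

```python
import re

# Unpack the new string and the number of substitutions that were made.
new_text, substitutions = re.subn(r'\s+', ' ', 'a  b \t c')
print(repr(new_text), substitutions)
```
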

.. function:: escape(pattern)

   Escape special characters in *pattern*.
   This is useful if you want to match an arbitrary literal string that may
   have regular expression metacharacters in it.  For example::

      >>> print(re.escape('python.exe'))
      python\.exe

      >>> import string
      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
      >>> print('[%s]+' % re.escape(legal_chars))
      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+

      >>> operators = ['+', '-', '*', '/', '**']
      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
      /|\-|\+|\*\*|\*

   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`; only backslashes should be escaped.  For example::

      >>> digits_re = r'\d+'
      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
      /usr/sbin/sendmail - \d+ errors, \d+ warnings

   .. versionchanged:: 3.3
      The ``'_'`` character is no longer escaped.

   .. versionchanged:: 3.7
      Only characters that can have special meaning in a regular expression
      are escaped.


.. function:: purge()

   Clear the regular expression cache.


.. exception:: error(msg, pattern=None, pos=None)

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching.  It is never an
   error if a string contains no match for a pattern.  The error instance has
   the following additional attributes:

   .. attribute:: msg

      The unformatted error message.

   .. attribute:: pattern

      The regular expression pattern.

   .. attribute:: pos

      The index in *pattern* where compilation failed (may be ``None``).

   .. attribute:: lineno

      The line corresponding to *pos* (may be ``None``).

   .. attribute:: colno

      The column corresponding to *pos* (may be ``None``).

   .. versionchanged:: 3.5
      Added additional attributes.

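The attributes can be inspected by catching the exception.  The sketch below
uses a deliberately broken pattern with an unmatched parenthesis; the variable
names are illustrative.

```python
import re

details = None
try:
    re.compile('(abc')          # the opening parenthesis is never closed
except re.error as exc:
    # Collect the attributes described above for inspection.
    details = {
        'msg': exc.msg,
        'pattern': exc.pattern,
        'pos': exc.pos,
        'lineno': exc.lineno,
        'colno': exc.colno,
    }

print(details)
```
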

.. _re-objects:

Regular Expression Objects
--------------------------

Compiled regular expression objects support the following methods and
attributes:

Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +0300999.. method:: Pattern.search(string[, pos[, endpos]])
Georg Brandl116aa622007-08-15 14:28:22 +00001000
Berker Peksag84f387d2016-06-08 14:56:56 +03001001 Scan through *string* looking for the first location where this regular
1002 expression produces a match, and return a corresponding :ref:`match object
Georg Brandlc62a7042010-07-29 11:49:05 +00001003 <match-objects>`. Return ``None`` if no position in the string matches the
1004 pattern; note that this is different from finding a zero-length match at some
1005 point in the string.
Georg Brandl116aa622007-08-15 14:28:22 +00001006
Georg Brandlc62a7042010-07-29 11:49:05 +00001007 The optional second parameter *pos* gives an index in the string where the
1008 search is to start; it defaults to ``0``. This is not completely equivalent to
1009 slicing the string; the ``'^'`` pattern character matches at the real beginning
1010 of the string and at positions just after a newline, but not necessarily at the
1011 index where the search is to start.
Georg Brandl116aa622007-08-15 14:28:22 +00001012
Georg Brandlc62a7042010-07-29 11:49:05 +00001013 The optional parameter *endpos* limits how far the string will be searched; it
1014 will be as if the string is *endpos* characters long, so only the characters
1015 from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
Raymond Hettinger5768e0c2011-10-19 14:10:07 -07001016 than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
Georg Brandlc62a7042010-07-29 11:49:05 +00001017 expression object, ``rx.search(string, 0, 50)`` is equivalent to
Serhiy Storchakacd195e22017-10-14 11:14:26 +03001018 ``rx.search(string[:50], 0)``. ::
Georg Brandl116aa622007-08-15 14:28:22 +00001019
Serhiy Storchakacd195e22017-10-14 11:14:26 +03001020 >>> pattern = re.compile("d")
1021 >>> pattern.search("dog") # Match at index 0
1022 <re.Match object; span=(0, 1), match='d'>
1023 >>> pattern.search("dog", 1) # No match; search doesn't include the "d"
Georg Brandl116aa622007-08-15 14:28:22 +00001024
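The difference between passing *pos* and slicing the string can be sketched
as follows. The variable names are illustrative only.

```python
import re

caret_b = re.compile(r"^b")
text = "ab"

# Slicing creates a new string, so '^' matches at the start of the slice.
in_slice = caret_b.search(text[1:])

# With pos=1 the search starts at index 1, but '^' still refers to the
# real beginning of the string, so nothing matches.
with_pos = caret_b.search(text, 1)

# endpos behaves as if the string were truncated at that index.
o = re.compile("o")
limited = o.search("dog", 0, 1)    # only "d" is searched
truncated = o.search("dog"[:1], 0)
```

Here ``in_slice`` is a match object while ``with_pos`` is ``None``, and both
``limited`` and ``truncated`` are ``None``.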

.. method:: Pattern.match(string[, pos[, endpos]])

   If zero or more characters at the *beginning* of *string* match this regular
   expression, return a corresponding :ref:`match object <match-objects>`.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <re.Match object; span=(1, 2), match='o'>

   If you want to locate a match anywhere in *string*, use
   :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).


.. method:: Pattern.fullmatch(string[, pos[, endpos]])

   If the whole *string* matches this regular expression, return a corresponding
   :ref:`match object <match-objects>`. Return ``None`` if the string does not
   match the pattern; note that this is different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o[gh]")
      >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.fullmatch("ogre")     # No match as the pattern does not match the full string.
      >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
      <re.Match object; span=(1, 3), match='og'>

   .. versionadded:: 3.4


.. method:: Pattern.split(string, maxsplit=0)

   Identical to the :func:`split` function, using the compiled pattern.


.. method:: Pattern.findall(string[, pos[, endpos]])

   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region as for :meth:`~Pattern.search`.


.. method:: Pattern.finditer(string[, pos[, endpos]])

   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region as for :meth:`~Pattern.search`.

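A small sketch of how *pos* and *endpos* restrict
:meth:`~Pattern.findall` and :meth:`~Pattern.finditer`. The sample text and
variable names are chosen only for illustration.

```python
import re

digits = re.compile(r"\d+")
text = "10 20 30 40"

# Without limits, every run of digits is returned.
all_numbers = digits.findall(text)

# Restrict the search region to text[3:8], which is "20 30".
middle = digits.findall(text, 3, 8)

# finditer() accepts the same limits; spans still index the full string.
middle_spans = [m.span() for m in digits.finditer(text, 3, 8)]
```

Note that the spans reported by the match objects are offsets into the
original string, not into the restricted region.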

.. method:: Pattern.sub(repl, string, count=0)

   Identical to the :func:`sub` function, using the compiled pattern.


.. method:: Pattern.subn(repl, string, count=0)

   Identical to the :func:`subn` function, using the compiled pattern.

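The compiled-pattern forms of :meth:`~Pattern.split`, :meth:`~Pattern.sub`
and :meth:`~Pattern.subn` might be used like this. The separator pattern and
input line are invented for the example.

```python
import re

# A comma surrounded by optional whitespace.
sep = re.compile(r"\s*,\s*")
line = "a , b,c ,d"

fields = sep.split(line)                   # split at every separator
first_two = sep.split(line, maxsplit=2)    # at most two splits
joined = sep.sub(";", line)                # replace every separator
joined_n = sep.subn(";", line, count=2)    # replace two, also return the count
```

Compiling once and reusing ``sep`` avoids repeating the pattern string at
each call site.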

.. attribute:: Pattern.flags

   The regex matching flags. This is a combination of the flags given to
   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
   flags such as :data:`UNICODE` if the pattern is a Unicode string.


.. attribute:: Pattern.groups

   The number of capturing groups in the pattern.


.. attribute:: Pattern.groupindex

   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   numbers. The dictionary is empty if no symbolic groups were used in the
   pattern.


.. attribute:: Pattern.pattern

   The pattern string from which the pattern object was compiled.

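The attributes above can be inspected on any compiled pattern. The date-like
pattern below is only an illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})(-\d{2})?", re.IGNORECASE)

n_groups = date.groups            # two named groups plus one unnamed group
names = dict(date.groupindex)     # group name -> group number
source = date.pattern             # the original pattern string

# flags combines explicit flags with implicit ones such as re.UNICODE,
# which is set because the pattern is a str.
has_ignorecase = bool(date.flags & re.IGNORECASE)
has_unicode = bool(date.flags & re.UNICODE)
```

Testing individual flags with ``&`` is more robust than comparing
``flags`` to a fixed integer, since implicit flags are included in the value.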

.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Compiled
   regular expression objects are considered atomic.


.. _match-objects:

Match Objects
-------------

Match objects always have a boolean value of ``True``.
Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
when there is no match, you can test whether there was a match with a simple
``if`` statement::

   match = re.search(pattern, string)
   if match:
       process(match)

Match objects support the following methods and attributes:


.. method:: Match.expand(template)

   Return the string obtained by doing backslash substitution on the template
   string *template*, as done by the :meth:`~Pattern.sub` method.
   Escapes such as ``\n`` are converted to the appropriate characters,
   and numeric backreferences (``\1``, ``\2``) and named backreferences
   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
   corresponding group.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

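A brief sketch of :meth:`~Match.expand` with numeric and named
backreferences. The group names ``first`` and ``last`` are chosen for this
example.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Numeric backreferences.
by_number = m.expand(r"\2, \1")

# Named backreferences using the \g<name> form.
by_name = m.expand(r"\g<last>, \g<first>")

# Escapes in the template, such as \n, are converted to real characters.
two_lines = m.expand(r"\1\n\2")
```

Raw string literals keep the backslashes intact so that they reach the
template unchanged.
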
.. method:: Match.group([group1, ...])

   Returns one or more subgroups of the match. If there is a single argument, the
   result is a single string; if there are multiple arguments, the result is a
   tuple with one item per argument. Without arguments, *group1* defaults to zero
   (the whole match is returned). If a *groupN* argument is zero, the corresponding
   return value is the entire matching string; if it is in the inclusive range
   [1..99], it is the string matching the corresponding parenthesized group. If a
   group number is negative or larger than the number of groups defined in the
   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
   part of the pattern that did not match, the corresponding result is ``None``.
   If a group is contained in a part of the pattern that matched multiple times,
   the last match is returned. ::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)       # The entire match
      'Isaac Newton'
      >>> m.group(1)       # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)       # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')

   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   arguments may also be strings identifying groups by their group name. If a
   string argument is not used as a group name in the pattern, an :exc:`IndexError`
   exception is raised.

   A moderately complicated example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

   Named groups can also be referred to by their index::

      >>> m.group(1)
      'Malcolm'
      >>> m.group(2)
      'Reynolds'

   If a group matches multiple times, only the last match is accessible::

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)                        # Returns only the last match.
      'c3'


.. method:: Match.__getitem__(g)

   This is identical to ``m.group(g)``. This allows easier access to
   an individual group from a match::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m[0]       # The entire match
      'Isaac Newton'
      >>> m[1]       # The first parenthesized subgroup.
      'Isaac'
      >>> m[2]       # The second parenthesized subgroup.
      'Newton'

   .. versionadded:: 3.6


.. method:: Match.groups(default=None)

   Return a tuple containing all the subgroups of the match, from 1 up to however
   many groups are in the pattern. The *default* argument is used for groups that
   did not participate in the match; it defaults to ``None``.

   For example::

      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
      >>> m.groups()
      ('24', '1632')

   If we make the decimal place and everything after it optional, not all groups
   might participate in the match. These groups will default to ``None`` unless
   the *default* argument is given::

      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()      # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')   # Now, the second group defaults to '0'.
      ('24', '0')


.. method:: Match.groupdict(default=None)

   Return a dictionary containing all the *named* subgroups of the match, keyed by
   the subgroup name. The *default* argument is used for groups that did not
   participate in the match; it defaults to ``None``. For example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}

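The effect of the *default* argument of :meth:`~Match.groupdict` can be
sketched with an optional named group. The group names ``whole`` and ``frac``
are assumptions made for this example.

```python
import re

# The fractional part is optional, so "frac" may not participate.
number = re.compile(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?")
m = number.match("24")

plain = m.groupdict()       # non-participating groups map to None
filled = m.groupdict("0")   # ... or to the supplied default
```

Supplying a default can save a separate ``None`` check when the values are
passed on to code that expects strings.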

.. method:: Match.start([group])
            Match.end([group])

   Return the indices of the start and end of the substring matched by *group*;
   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   *group* exists but did not contribute to the match. For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string. For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

   An example that will remove *remove_this* from email addresses::

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


.. method:: Match.span([group])

   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
   that if *group* did not contribute to the match, this is ``(-1, -1)``.
   *group* defaults to zero, the entire match.

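The three cases described for :meth:`~Match.start`, :meth:`~Match.end` and
:meth:`~Match.span` (a normal match, an empty match, and a group that did not
contribute) can be seen in one sketch. The pattern extends the ``'b(c?)'``
example with an extra optional group purely for illustration.

```python
import re

m = re.search(r"b(c?)(x)?", "cba")

whole = m.span()       # the whole match, "b" at index 1
empty = m.span(1)      # group 1 matched the empty string at index 2
unused = m.span(2)     # group 2 exists but did not contribute

# For a group that contributed, slicing m.string reproduces m.group(g).
same = m.string[m.start(1):m.end(1)] == m.group(1)
```

A span of ``(-1, -1)`` should be checked for before slicing, because negative
indices would otherwise silently select characters from the end of the string.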

.. attribute:: Match.pos

   The value of *pos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string at which the RE engine started looking for a match.


.. attribute:: Match.endpos

   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string beyond which the RE engine will not go.

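A minimal sketch of how :attr:`~Match.pos` and :attr:`~Match.endpos` record
the search limits. The sample string is arbitrary.

```python
import re

o = re.compile("o")
text = "doggo"

# Search only text[2:5]; the first "o" at index 1 is outside the region.
limited = o.search(text, 2, 5)
limits = (limited.pos, limited.endpos)
where = limited.span()

# Without explicit limits, the whole string is searched.
default = o.search(text)
default_limits = (default.pos, default.endpos)
```

The limits describe where the engine was allowed to look, whereas
:meth:`~Match.span` describes where the match was actually found.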

.. attribute:: Match.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   string.


.. attribute:: Match.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.

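One common use of :attr:`~Match.lastgroup` is to find out which alternative
of a pattern matched. The following sketch classifies tokens that way; the
token pattern and its group names are invented for this example.

```python
import re

# Each alternative is a named group, so lastgroup names the one that matched.
token = re.compile(r"(?P<number>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[-+*/=])")
kinds = [(m.lastgroup, m.group()) for m in token.finditer("x = 42 + y")]

# lastindex follows the rule described above for nested groups.
flat = re.match(r"(a)(b)", "ab").lastindex
nested = re.match(r"((a)(b))", "ab").lastindex

# An unnamed group yields lastgroup == None.
unnamed = re.match(r"(a)", "a").lastgroup
```
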

.. attribute:: Match.re

   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
   :meth:`~Pattern.search` method produced this match instance.


.. attribute:: Match.string

   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.


.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Match objects
   are considered atomic.


.. _re-examples:

Regular Expression Examples
---------------------------


Checking for a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following::

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as follows::

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~Match.group` method of the match object in the following manner::

   >>> pair = re.compile(r".*(.).*\1")
   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'

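One way to avoid the :exc:`AttributeError` shown above is to test the result
before calling :meth:`~Match.group`. The helper name ``paired_card`` is
hypothetical and not part of the original example.

```python
import re

pair = re.compile(r".*(.).*\1")


def paired_card(hand):
    """Return the card that forms a pair in *hand*, or None if there is none."""
    m = pair.match(hand)
    # match() returns None when there is no pair, so guard before group().
    return m.group(1) if m else None


sevens = paired_card("717ak")
nothing = paired_card("718ak")
aces = paired_card("354aa")
```

This relies on match objects always being true and on ``None`` being false,
as described in :ref:`match-objects`.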

Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings

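Applied from Python, the regular expression above might be used as follows.
Unlike :c:func:`scanf`, the groups are returned as strings, so the sketch
converts the numbers explicitly; the variable names are illustrative.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.search(r"(\S+) - (\d+) errors, (\d+) warnings", line)

filename = m.group(1)
n_errors = int(m.group(2))     # groups are str, so convert explicitly
n_warnings = int(m.group(3))
```
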
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001432
Ezio Melotti443f0002012-02-29 13:39:05 +02001433.. _search-vs-match:
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001434
1435search() vs. match()
1436^^^^^^^^^^^^^^^^^^^^
1437
Ezio Melotti443f0002012-02-29 13:39:05 +02001438.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001439
Ezio Melotti443f0002012-02-29 13:39:05 +02001440Python offers two different primitive operations based on regular expressions:
1441:func:`re.match` checks for a match only at the beginning of the string, while
1442:func:`re.search` checks for a match anywhere in the string (this is what Perl
1443does by default).
1444
1445For example::
1446
Serhiy Storchakadba90392016-05-10 12:01:23 +03001447 >>> re.match("c", "abcdef") # No match
1448 >>> re.search("c", "abcdef") # Match
Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +03001449 <re.Match object; span=(2, 3), match='c'>
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001450
Ezio Melotti443f0002012-02-29 13:39:05 +02001451Regular expressions beginning with ``'^'`` can be used with :func:`search` to
1452restrict the match at the beginning of the string::
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001453
Serhiy Storchakadba90392016-05-10 12:01:23 +03001454 >>> re.match("c", "abcdef") # No match
1455 >>> re.search("^c", "abcdef") # No match
Ezio Melotti443f0002012-02-29 13:39:05 +02001456 >>> re.search("^a", "abcdef") # Match
Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +03001457 <re.Match object; span=(0, 1), match='a'>
Ezio Melotti443f0002012-02-29 13:39:05 +02001458
1459Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
1460beginning of the string, whereas using :func:`search` with a regular expression
Serhiy Storchakacd195e22017-10-14 11:14:26 +03001461beginning with ``'^'`` will match at the beginning of each line. ::
Ezio Melotti443f0002012-02-29 13:39:05 +02001462
1463 >>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
1464 >>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
Serhiy Storchaka0b5e61d2017-10-04 20:09:49 +03001465 <re.Match object; span=(4, 5), match='X'>
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001466
1467
1468Making a Phonebook
1469^^^^^^^^^^^^^^^^^^
1470
Georg Brandl48310cd2009-01-03 21:18:54 +00001471:func:`split` splits a string into a list delimited by the passed pattern. The
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001472method is invaluable for converting textual data into data structures that can be
1473easily read and modified by Python as demonstrated in the following example that
1474creates a phonebook.
1475
Christian Heimes255f53b2007-12-08 15:33:56 +00001476First, here is the input. Normally it may come from a file, here we are using
Stéphane Wirtel859c0682018-10-12 09:51:05 +02001477triple-quoted string syntax
1478
1479.. doctest::
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001480
Georg Brandl557a3ec2012-03-17 17:26:27 +01001481 >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl48310cd2009-01-03 21:18:54 +00001482 ...
Christian Heimesfe337bf2008-03-23 21:54:12 +00001483 ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
1484 ... Frank Burger: 925.541.7625 662 South Dogwood Way
1485 ...
1486 ...
1487 ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

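As a minimal sketch of the same idea, ``maxsplit`` can also be passed by
keyword and the resulting fields paired with names. Two entries from the
phonebook are repeated here so the block is self-contained; the names in
``fields`` are assumptions made for this illustration only.

```python
import re

# Two entries from the phonebook example, repeated so the sketch stands alone.
entries = [
    "Ross McFluff: 834.345.1254 155 Elm Street",
    "Ronald Heathmore: 892.345.3428 436 Finley Avenue",
]

# Hypothetical field names, chosen only for this illustration.
fields = ("first", "last", "phone", "address")

# maxsplit=3 stops after three splits, so the address keeps its spaces.
records = [
    dict(zip(fields, re.split(r":? ", entry, maxsplit=3)))
    for entry in entries
]

for record in records:
    print(record["last"], "->", record["address"])
```

Passing ``maxsplit`` by keyword makes the call easier to read than a bare
positional ``3``.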

Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function. This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

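Because the shuffle is random, the output differs from run to run. The
following sketch shows one possible way to make the munging repeatable by
using a private :class:`random.Random` instance. The ``munge`` wrapper and its
``seed`` parameter are hypothetical names introduced for this illustration.

```python
import random
import re

def munge(text, seed=None):
    """Shuffle the inner letters of each word; `seed` is a hypothetical knob."""
    rng = random.Random(seed)  # private generator, global state is untouched

    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    # The pattern needs at least three word characters, so one- and
    # two-letter words never match and are left unchanged.
    return re.sub(r"(\w)(\w+)(\w)", repl, text)

text = "Professor Abdolmalek, please report your absences promptly."
first = munge(text, seed=0)
second = munge(text, seed=0)
print(first)
print(first == second)  # the same seed gives the same shuffle
```

The exact munged string depends on the seed and on the Python version, so it
is not shown here.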

Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does. For example, if a writer wanted to
find all of the adverbs in some text, they might use :func:`findall` in
the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']

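The pattern ``\w+ly`` is only a spelling heuristic. As a hedged sketch, the
sentence below (an invented example, not taken from the text above) shows that
the pattern can also match the start of a longer word, and that appending the
word boundary ``\b`` restricts matches to words that *end* in "ly".

```python
import re

# Invented sample sentence; "polymer" contains "ly" but does not end with it.
text = "The polymer was carefully tested and quickly shipped."

loose = re.findall(r"\w+ly", text)     # may stop in the middle of a word
strict = re.findall(r"\w+ly\b", text)  # "ly" must be followed by a word boundary

print(loose)   # ['poly', 'carefully', 'quickly']
print(strict)  # ['carefully', 'quickly']
```

Even with ``\b``, words such as "family" or "fly" would still match, since the
pattern looks at spelling rather than grammar.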

Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings. Continuing with the previous example, if
a writer wanted to find all of the adverbs *and their positions* in
some text, they would use :func:`finditer` in the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly

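The same positions can be collected with ``Match.span()``, which returns the
start and end as a tuple. The sketch below is one possible variation; the
``spans`` dictionary is a name introduced here for illustration.

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Map each matched word to its (start, end) pair.
spans = {m.group(0): m.span() for m in re.finditer(r"\w+ly", text)}

for word, (start, end) in spans.items():
    # Slicing the original text with the span gives back the matched word.
    assert text[start:end] == word
    print(f"{start:02d}-{end:02d}: {word}")
```

A dictionary keyed by word keeps only one span per distinct word, so a list of
tuples would be the better choice if a word could repeat.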

Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it. For example, the two following lines of code are
functionally identical::

   >>> re.match(r"\W(.)\1\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular
expression. With raw string notation, this means ``r"\\"``. Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical::

   >>> re.match(r"\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>
   >>> re.match("\\\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>

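The equivalence can be checked directly by comparing the string objects that
Python builds from each literal. This is a small sketch with illustrative
variable names, not part of the :mod:`re` API.

```python
import re

raw = r"\W(.)\1\W"
escaped = "\\W(.)\\1\\W"
print(raw == escaped)            # True: both literals produce the same string

# A raw two-backslash literal and a regular four-backslash literal
# both contain exactly two backslash characters.
print(len(r"\\"), len("\\\\"))   # 2 2

# Without the r prefix, some escapes silently mean something else:
# "\b" is a backspace character, while r"\b" is the regex word boundary.
print("\b" == "\x08")            # True
print(r"\b" == "\\b")            # True

single_backslash = "\\"          # a string holding one backslash
match = re.match(r"\\", single_backslash)
print(match.span())              # (0, 1)
```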

Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^

A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
analyzes a string to categorize groups of characters. This is a useful first
step in writing a compiler or interpreter.

The text categories are specified with regular expressions. The technique is
to combine those into a single master regular expression and to loop over
successive matches::

    import collections
    import re

    Token = collections.namedtuple('Token', ['type', 'value', 'line', 'column'])

    def tokenize(code):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
            ('ASSIGN',   r':='),           # Assignment operator
            ('END',      r';'),            # Statement terminator
            ('ID',       r'[A-Za-z]+'),    # Identifiers
            ('OP',       r'[+\-*/]'),      # Arithmetic operators
            ('NEWLINE',  r'\n'),           # Line endings
            ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
            ('MISMATCH', r'.'),            # Any other character
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        line_num = 1
        line_start = 0
        for mo in re.finditer(tok_regex, code):
            kind = mo.lastgroup
            value = mo.group()
            column = mo.start() - line_start
            if kind == 'NUMBER':
                value = float(value) if '.' in value else int(value)
            elif kind == 'ID' and value in keywords:
                kind = value
            elif kind == 'NEWLINE':
                line_start = mo.end()
                line_num += 1
                continue
            elif kind == 'SKIP':
                continue
            elif kind == 'MISMATCH':
                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
            yield Token(kind, value, line_num, column)

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)

The tokenizer produces the following output::

    Token(type='IF', value='IF', line=2, column=4)
    Token(type='ID', value='quantity', line=2, column=7)
    Token(type='THEN', value='THEN', line=2, column=16)
    Token(type='ID', value='total', line=3, column=8)
    Token(type='ASSIGN', value=':=', line=3, column=14)
    Token(type='ID', value='total', line=3, column=17)
    Token(type='OP', value='+', line=3, column=23)
    Token(type='ID', value='price', line=3, column=25)
    Token(type='OP', value='*', line=3, column=31)
    Token(type='ID', value='quantity', line=3, column=33)
    Token(type='END', value=';', line=3, column=41)
    Token(type='ID', value='tax', line=4, column=8)
    Token(type='ASSIGN', value=':=', line=4, column=12)
    Token(type='ID', value='price', line=4, column=15)
    Token(type='OP', value='*', line=4, column=21)
    Token(type='NUMBER', value=0.05, line=4, column=23)
    Token(type='END', value=';', line=4, column=27)
    Token(type='ENDIF', value='ENDIF', line=5, column=4)
    Token(type='END', value=';', line=5, column=9)

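The core of the technique, stripped of line and column tracking, can be
sketched as follows. The four-entry ``spec`` and the ``classify`` helper are
hypothetical and much smaller than the tokenizer above; they only illustrate
how named groups and ``Match.lastgroup`` work together.

```python
import re

# Hypothetical, deliberately tiny specification. Order matters: alternatives
# are tried left to right, so the catch-all MISMATCH entry must come last.
spec = [
    ("NUMBER", r"\d+"),
    ("WORD", r"[A-Za-z]+"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]

# One named group per category, joined into a single master pattern.
master = "|".join(f"(?P<{name}>{pattern})" for name, pattern in spec)

def classify(text):
    """Return (kind, value) pairs; a sketch, not a full tokenizer."""
    result = []
    for mo in re.finditer(master, text):
        kind = mo.lastgroup  # name of the group that produced this match
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError(f"{mo.group()!r} unexpected at offset {mo.start()}")
        result.append((kind, mo.group()))
    return result

print(classify("buy 3 apples"))
# [('WORD', 'buy'), ('NUMBER', '3'), ('WORD', 'apples')]
```

Because every character is matched by some alternative, nothing in the input
is skipped silently: unknown characters surface as a ``MISMATCH`` error.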

.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
   Media, 2009. The third edition of the book no longer covers Python at all,
   but the first edition covered writing good regular expression patterns in
   great detail.