:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

**Source code:** :source:`Lib/re.py`

--------------

This module provides regular expression matching operations similar to
those found in Perl.

Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.
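
The type rule can be checked with a short sketch. The sample text and the
variable names are illustrative only; the point is that a mismatch raises
:exc:`TypeError` rather than quietly returning no match.

```python
import re

# A str pattern works on str input, and a bytes pattern on bytes input.
str_result = re.search(r"\d+", "order 66").group()
bytes_result = re.search(rb"\d+", b"order 66").group()

# Mixing the two types raises TypeError.
try:
    re.search(rb"\d+", "order 66")
except TypeError as exc:
    mixed_error = exc
else:
    mixed_error = None
```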

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal. Also, please note that any invalid escape sequences in Python's
usage of the backslash in string literals now generate a :exc:`DeprecationWarning`
and in the future this will become a :exc:`SyntaxError`. This behaviour
will happen even if it is a valid escape sequence for a regular expression.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.
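
A small sketch of the difference, using a made-up Windows-style path as the
subject string. Both spellings below produce the same two-character regular
expression; the raw form is simply easier to read.

```python
import re

# The same regex (backslash, backslash) written two ways.
escaped = "\\\\"   # regular literal: every backslash is doubled again
raw = r"\\"        # raw literal: written as the regex engine sees it
same_pattern = escaped == raw

# Either form matches one literal backslash in the subject.
subject = "C:\\temp"          # the 7-character string C:\temp
found = re.search(raw, subject)

# r"\n" holds two characters; "\n" holds a single newline.
lengths = (len(r"\n"), len("\n"))
```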

It is important to note that most regular expression operations are available as
module-level functions and methods on
:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
that don't require you to compile a regex object first, but lack some
fine-tuning parameters.
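
As a sketch of the two styles (the sample text is invented), the same search
can be written as a module-level call or as a method on a compiled pattern.
The *pos* argument used at the end is one example of a parameter that only
the pattern method accepts.

```python
import re

text = "spam, Eggs, SPAM"

# Module-level shortcut: no explicit compile step.
shortcut = re.findall(r"spam", text, flags=re.IGNORECASE)

# Compiled pattern object: the same operation as a method.
pattern = re.compile(r"spam", re.IGNORECASE)
compiled = pattern.findall(text)

# The method form also accepts a starting offset (pos).
from_offset = pattern.findall(text, 5)
```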

.. seealso::

   The third-party `regex <https://pypi.org/project/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contain low precedence
operations; boundary conditions between *A* and *B*; or have numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book [Frie09]_,
or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write RE's in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
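
The following sketch illustrates the nesting rule. The pattern
``(?:a{6})*`` is taken from the text above; ``a{6}*`` is an assumed example of
a directly nested qualifier, which is rejected when the pattern is compiled.

```python
import re

# Wrapping the inner repetition in a group allows a second repetition.
multiple_of_six = re.compile(r"(?:a{6})*")
twelve = multiple_of_six.fullmatch("a" * 12)
seven = multiple_of_six.fullmatch("a" * 7)

# Stacking the qualifiers directly raises re.error at compile time.
try:
    re.compile(r"a{6}*")
except re.error as exc:
    nested_error = exc
else:
    nested_error = None
```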


The special characters are:

.. index:: single: . (dot); in regular expressions

``.``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

.. index:: single: ^ (caret); in regular expressions

``^``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

.. index:: single: $ (dollar); in regular expressions

``$``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.
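
The ``$`` cases described above can be reproduced with a few calls. This is a
sketch that reuses the strings from the text; only the variable names are
new.

```python
import re

text = "foo1\nfoo2\n"

# Default mode: $ matches at the end, or just before a trailing newline.
default_match = re.search(r"foo.$", text).group()

# MULTILINE mode: $ also matches before each newline.
multiline_match = re.search(r"foo.$", text, re.MULTILINE).group()

# A bare $ yields two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]
```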

.. index:: single: * (asterisk); in regular expressions

``*``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

.. index:: single: + (plus); in regular expressions

``+``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

.. index:: single: ? (question mark); in regular expressions

``?``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

.. index::
   single: *?; in regular expressions
   single: +?; in regular expressions
   single: ??; in regular expressions

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
   string, and not just ``'<a>'``. Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using the RE ``<.*?>`` will match
   only ``'<a>'``.
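
A brief sketch of greedy versus non-greedy matching, using the same subject
string as the text above:

```python
import re

text = "<a> b <c>"

# Greedy: .* runs to the last '>' in the string.
greedy = re.match(r"<.*>", text).group()

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.match(r"<.*?>", text).group()

# With the non-greedy form, each tag is found separately.
all_tags = re.findall(r"<.*?>", text)
```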

.. index::
   single: {} (curly brackets); in regular expressions

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.
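
The counted qualifiers can be compared side by side. This sketch reuses the
patterns from the entries above; the variable names are illustrative.

```python
import re

text = "aaaaaa"

# Greedy {m,n} takes as many repetitions as allowed; {m,n}? takes as few.
greedy_count = re.match(r"a{3,5}", text).group()
lazy_count = re.match(r"a{3,5}?", text).group()

# An omitted upper bound means "no limit".
samples = ("aaaab", "a" * 1000 + "b", "aaab")
open_ended = [bool(re.fullmatch(r"a{4,}b", s)) for s in samples]
```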

.. index:: single: \ (backslash); in regular expressions

``\``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be repeated twice. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

.. index::
   single: [] (square brackets); in regular expressions

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   .. index:: single: - (minus); in regular expressions

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character
     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   .. index:: single: \ (backslash); in regular expressions

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`ASCII` or :const:`LOCALE` mode is in force.

   .. index:: single: ^ (caret); in regular expressions

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match a right bracket, as well as a left bracket, braces,
     and parentheses.

   .. .. index:: single: --; in regular expressions
   .. .. index:: single: &&; in regular expressions
   .. .. index:: single: ~~; in regular expressions
   .. .. index:: single: ||; in regular expressions

   * Support of nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future. This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
     avoid a warning, escape them with a backslash.

   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/

   .. versionchanged:: 3.7
      :exc:`FutureWarning` is raised if a character set contains constructs
      that will change semantically in the future.
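
The set rules listed above can be exercised with a short sketch. The
patterns come from the list; the subject strings and variable names are made
up for illustration.

```python
import re

# Ranges: [0-5][0-9] accepts 00 through 59 and nothing else.
minutes = re.compile(r"[0-5][0-9]")
valid = [s for s in ("00", "07", "59", "60", "5") if minutes.fullmatch(s)]

# An escaped hyphen is a literal '-', not a range operator.
hyphen = re.findall(r"[a\-z]", "a-mz")

# Special characters are ordinary inside a set.
specials = re.findall(r"[(+*)]", "(1+2)*3")

# A leading ^ complements the set.
not_five = re.findall(r"[^5]", "1525")

# Two spellings of a set that contains a literal ']'.
subject = "a[0](b){c}"
escaped_bracket = re.findall(r"[()[\]{}]", subject)
leading_bracket = re.findall(r"[]()[{}]", subject)
```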

.. index:: single: | (vertical bar); in regular expressions

``|``
   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
   will match either *A* or *B*. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once *A* matches, *B* will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.
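
A sketch of the left-to-right behaviour of ``|``, with invented sample
patterns. Because the first branch that matches is accepted, the order of the
alternatives changes the result.

```python
import re

# The first alternative that matches wins, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()

# Listing the longer alternative first changes the outcome.
reordered = re.match(r"ab|a", "ab").group()

# An escaped bar is a literal character.
fields = re.split(r"\|", "x|y|z")
```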

.. index::
   single: () (parentheses); in regular expressions

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

.. index:: single: (?; in regular expressions

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?aiLmsux)``
   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
   for the entire regular expression.
   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flag* argument to the
   :func:`re.compile` function. Flags should be used first in the
   expression string.

.. index:: single: (?:; in regular expressions

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?aiLmsux-imsx:...)``
   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
   one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
   The letters set or remove the corresponding flags:
   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
   and :const:`re.X` (verbose), for the part of the expression.
   (The flags are described in :ref:`contents-of-module-re`.)

   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
   as inline flags, so they can't be combined or follow ``'-'``. Instead,
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default). In bytes patterns ``(?L:...)`` switches to locale dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.

   .. versionadded:: 3.6

   .. versionchanged:: 3.7
      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.
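
A sketch of flags scoped to part of a pattern. The patterns and test
strings are invented for illustration; the first turns case-insensitivity on
for two letters only, and the second turns it off for two letters inside an
otherwise case-insensitive pattern.

```python
import re

# (?i:...) ignores case inside the group only.
scoped_on = re.compile(r"(?i:py)thon")
on_results = [bool(scoped_on.fullmatch(s))
              for s in ("python", "PYthon", "PYTHON")]

# (?-i:...) removes a flag that was set for the whole pattern.
scoped_off = re.compile(r"(?-i:Py)thon", re.IGNORECASE)
off_results = [bool(scoped_off.fullmatch(s))
               for s in ("PyTHON", "pyTHON")]
```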

.. index:: single: (?P<; in regular expressions

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object *m*      | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the *repl*      | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
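
The three contexts in the table can be seen together in one sketch. The
pattern is the one quoted above; the sample sentence and variable names are
illustrative.

```python
import re

pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = 'He said "hello" twice'

# Context 1: (?P=quote) inside the pattern requires the same closing quote.
m = pattern.search(text)
whole = m.group(0)

# Context 2: the match object exposes the group by name or by number.
by_name = m.group("quote")
by_number = m.group(1)
quote_end = m.end("quote")

# Context 3: the replacement string refers to the group with \g<quote>.
redacted = pattern.sub(r"\g<quote>...\g<quote>", text)
```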

.. index:: single: (?P=; in regular expressions

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

.. index:: single: (?#; in regular expressions

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

.. index:: single: (?=; in regular expressions

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a :dfn:`lookahead assertion`. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

.. index:: single: (?!; in regular expressions

``(?!...)``
   Matches if ``...`` doesn't match next. This is a :dfn:`negative lookahead assertion`.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.
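
A sketch of both lookahead forms, using the patterns from the entries above.
Note that the text inside the assertion is checked but never becomes part of
the match.

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not consumed.
followed = re.search(r"Isaac (?=Asimov)", "Isaac Asimov")
matched_text = followed.group()
matched_end = followed.end()

# Negative lookahead: the match fails when 'Asimov' follows.
rejected = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")
accepted = re.search(r"Isaac (?!Asimov)", "Isaac Newton").group()
```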

.. index:: single: (?<=; in regular expressions

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

   .. versionchanged:: 3.5
      Added support for group references of fixed length.

.. index:: single: (?<!; in regular expressions

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length. Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.
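
A sketch contrasting the two lookbehind forms on one invented subject
string, followed by the fixed-length restriction. The pattern ``(?<!a*)b`` is
an assumed example of a variable-length lookbehind, which fails to compile.

```python
import re

text = "spam-egg ham"

# Words that directly follow a hyphen.
after_hyphen = re.findall(r"(?<=-)\w+", text)

# Whole words that do not follow a hyphen, including one at position 0.
not_after_hyphen = re.findall(r"(?<!-)\b\w+", text)

# A lookbehind whose length can vary is rejected at compile time.
try:
    re.compile(r"(?<!a*)b")
except re.error as exc:
    width_error = exc
else:
    width_error = None
```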

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or
   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
   optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.
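
The conditional pattern above can be tried against the four strings it
mentions. This sketch uses :func:`re.match`, which anchors the attempt at the
start of the string; the variable names are illustrative.

```python
import re

pattern = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = [
    "<user@host.com>",   # opening and closing bracket
    "user@host.com",     # no brackets at all
    "<user@host.com",    # missing closing bracket
    "user@host.com>",    # stray closing bracket
]

# Keep only the candidates the pattern accepts from position 0.
accepted = [s for s in candidates if pattern.match(s)]
```

With :func:`re.search` the third string would be accepted as well, because
the search may start after the ``'<'`` and then group 1 is simply unset.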


The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character. For example, ``\$`` matches the
character ``'$'``.

.. index:: single: \ (backslash); in regular expressions

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.
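
A sketch of numbered backreferences and the octal exceptions described
above. The first pattern is taken from the text; the octal examples are
illustrative choices.

```python
import re

repeated = re.compile(r"(.+) \1")
samples = ("the the", "55 55", "thethe")
doubled = [bool(repeated.fullmatch(s)) for s in samples]

# A leading 0 makes the escape an octal character code: \060 is '0'.
octal_zero = re.fullmatch(r"\060", "0")

# Inside a character class, \1 is the character with code 1, not group 1.
in_class = re.fullmatch(r"[\1]", "\x01")
```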

.. index:: single: \A; in regular expressions

``\A``
   Matches only at the start of the string.

.. index:: single: \b; in regular expressions

``\b``
   Matches the empty string, but only at the beginning or end of a word.
   A word is defined as a sequence of word characters. Note that formally,
   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
   (or vice versa), or between ``\w`` and the beginning/end of the string.
   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.

   By default Unicode alphanumerics are the ones used in Unicode patterns, but
   this can be changed by using the :const:`ASCII` flag. Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.
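
A sketch of the word-boundary cases listed above, plus the backspace meaning
inside a character class. The strings are the ones from the text.

```python
import re

word = re.compile(r"\bfoo\b")
samples = ("foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3")

# Keep the samples in which 'foo' appears as a whole word.
hits = [s for s in samples if word.search(s)]

# Inside a character class, \b is the backspace character (code 8).
backspace = re.fullmatch(r"[\b]", "\x08")
```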
Georg Brandl116aa622007-08-15 14:28:22 +0000470
Serhiy Storchakaddb961d2018-10-26 09:00:49 +0300471.. index:: single: \B; in regular expressions
472
Georg Brandl116aa622007-08-15 14:28:22 +0000473``\B``
Ezio Melotti5a045b92012-02-29 11:48:44 +0200474 Matches the empty string, but only when it is *not* at the beginning or end
475 of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
476 ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300477 ``\B`` is just the opposite of ``\b``, so word characters in Unicode
478 patterns are Unicode alphanumerics or the underscore, although this can
479 be changed by using the :const:`ASCII` flag. Word boundaries are
480 determined by the current locale if the :const:`LOCALE` flag is used.
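
The boundary rules for ``\b`` and ``\B`` can be checked with a short sketch. The sample strings are the ones quoted in the text; the variable names (``word``, ``inside``, ``backspace``) are made up for illustration.

```python
import re

# \b requires a word boundary on both sides of "foo".
word = re.compile(r'\bfoo\b')
samples = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']
bounded = [s for s in samples if word.search(s)]

# \B requires that "py" is immediately followed by another word character.
inside = re.compile(r'py\B')
candidates = ['python', 'py3', 'py2', 'py', 'py.', 'py!']
unbounded = [s for s in candidates if inside.search(s)]

# Inside a character class, \b stands for the backspace character (U+0008).
backspace = re.compile(r'[\b]')

print(bounded)    # ['foo', 'foo.', '(foo)', 'bar foo baz']
print(unbounded)  # ['python', 'py3', 'py2']
print(backspace.fullmatch('\x08') is not None)  # True
```

Using :func:`search` rather than :func:`match` here lets the boundary be tested anywhere in each sample.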

.. index:: single: \d; in regular expressions

``\d``
   For Unicode (str) patterns:
      Matches any Unicode decimal digit (that is, any character in
      Unicode character category [Nd]). This includes ``[0-9]``, and
      also many other digit characters. If the :const:`ASCII` flag is
      used only ``[0-9]`` is matched.

   For 8-bit (bytes) patterns:
      Matches any decimal digit; this is equivalent to ``[0-9]``.

.. index:: single: \D; in regular expressions

``\D``
   Matches any character which is not a decimal digit. This is
   the opposite of ``\d``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^0-9]``.
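
As a rough illustration of the difference between Unicode and ASCII digit matching, the sketch below searches a string that mixes ASCII digits with ARABIC-INDIC digits (U+0664 and U+0662, both in category Nd). The sample text and variable names are assumptions chosen for the example.

```python
import re

# "42" in ASCII digits, a space, then "42" in ARABIC-INDIC digits.
text = '42 \u0664\u0662'

unicode_digits = re.findall(r'\d+', text)                  # str pattern, Unicode rules
ascii_digits = re.findall(r'\d+', text, flags=re.ASCII)    # str pattern, [0-9] only
bytes_digits = re.findall(rb'\d+', text.encode('utf-8'))   # bytes pattern, [0-9] only
non_digits = re.findall(r'\D+', text)                      # everything \d rejects

print(unicode_digits == ['42', '\u0664\u0662'])  # True
print(ascii_digits)                              # ['42']
print(bytes_digits)                              # [b'42']
print(non_digits)                                # [' ']
```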

.. index:: single: \s; in regular expressions

``\s``
   For Unicode (str) patterns:
      Matches Unicode whitespace characters (which includes
      ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many
      languages). If the :const:`ASCII` flag is used, only
      ``[ \t\n\r\f\v]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
      this is equivalent to ``[ \t\n\r\f\v]``.

.. index:: single: \S; in regular expressions

``\S``
   Matches any character which is not a whitespace character. This is
   the opposite of ``\s``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^ \t\n\r\f\v]``.
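
The non-breaking space mentioned above makes a convenient test case. This is a minimal sketch; the sample string is an assumption, with U+00A0 (NO-BREAK SPACE) between ``a`` and ``b`` and an ordinary space between ``b`` and ``c``.

```python
import re

text = 'a\u00a0b c'

# Unicode matching treats U+00A0 as whitespace; ASCII matching does not.
unicode_parts = re.split(r'\s', text)
ascii_parts = re.split(r'\s', text, flags=re.ASCII)

print(unicode_parts)                    # ['a', 'b', 'c']
print(ascii_parts == ['a\u00a0b', 'c'])  # True
```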

.. index:: single: \w; in regular expressions

``\w``
   For Unicode (str) patterns:
      Matches Unicode word characters; this includes most characters
      that can be part of a word in any language, as well as numbers and
      the underscore. If the :const:`ASCII` flag is used, only
      ``[a-zA-Z0-9_]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered alphanumeric in the ASCII character set;
      this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
      used, matches characters considered alphanumeric in the current locale
      and the underscore.

.. index:: single: \W; in regular expressions

``\W``
   Matches any character which is not a word character. This is
   the opposite of ``\w``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
   used, matches characters which are neither alphanumeric in the current
   locale nor the underscore.
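
A small sketch of how the :const:`ASCII` flag narrows ``\w`` and widens ``\W``. The sample text is an assumption; it contains two precomposed non-ASCII letters (U+00EF and U+00E9).

```python
import re

text = 'na\u00efve_caf\u00e9 123'

unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)
unicode_gaps = re.findall(r'\W+', text)
ascii_gaps = re.findall(r'\W+', text, flags=re.ASCII)

print(unicode_words == ['na\u00efve_caf\u00e9', '123'])  # True
print(ascii_words)                                       # ['na', 've_caf', '123']
print(unicode_gaps)                                      # [' ']
print(ascii_gaps == ['\u00ef', '\u00e9 '])               # True
```

With the flag set, the accented letters stop counting as word characters, so they split the words and show up among the ``\W`` matches instead.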

.. index:: single: \Z; in regular expressions

``\Z``
   Matches only at the end of the string.

.. index::
   single: \a; in regular expressions
   single: \b; in regular expressions
   single: \f; in regular expressions
   single: \n; in regular expressions
   single: \N; in regular expressions
   single: \r; in regular expressions
   single: \t; in regular expressions
   single: \u; in regular expressions
   single: \U; in regular expressions
   single: \v; in regular expressions
   single: \x; in regular expressions
   single: \\; in regular expressions

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \N      \r      \t      \u
   \U      \v      \x      \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
patterns. In bytes patterns they are errors.

Octal escapes are included in a limited form. If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

.. versionchanged:: 3.3
   The ``'\u'`` and ``'\U'`` escape sequences have been added.

.. versionchanged:: 3.6
   Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.

.. versionchanged:: 3.8
   The ``'\N{name}'`` escape sequence has been added. As in string literals,
   it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).
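
The escape rules above can be exercised with a short sketch. It assumes Python 3.8 or later because of ``\N{...}``; the variable names are illustrative only.

```python
import re

# \N{...} names a Unicode character inside a str pattern (Python 3.8+).
em_dash = re.compile(r'\N{EM DASH}')
found_dash = em_dash.search('wait \u2014 what') is not None

# In a bytes pattern, \u is an error rather than an escape.
try:
    re.compile(rb'\u2014')
    bytes_pattern_ok = True
except re.error:
    bytes_pattern_ok = False

# A leading 0, or three octal digits, makes an octal escape;
# otherwise the digits are a group reference.
octal_bell = re.fullmatch(r'\07', '\x07') is not None    # octal 07 is BEL
octal_letter = re.fullmatch(r'\101', 'A') is not None    # octal 101 is 'A'
group_ref = re.fullmatch(r'(a)\1', 'aa') is not None     # \1 repeats group 1

print(found_dash, bytes_pattern_ok)         # True False
print(octal_bell, octal_letter, group_ref)  # True True True
```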


.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions. Most non-trivial applications use the compiled form.

.. versionchanged:: 3.6
   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
   :class:`enum.IntFlag`.

.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a :ref:`regular expression object
   <re-objects>`, which can be used for matching using its
   :meth:`~Pattern.match`, :meth:`~Pattern.search` and other methods, described
   below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.compile` and the module-level matching functions are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.
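
As a minimal sketch of combining flags and reusing a compiled object (the pattern and sample text are assumptions chosen for the example):

```python
import re

# Flags are combined with bitwise OR and passed once, at compile time.
flags = re.IGNORECASE | re.MULTILINE
prog = re.compile(r'^spam$', flags)

# The compiled object can then be reused for any number of calls.
lines = 'eggs\nSPAM\nham\nspam'
hits = prog.findall(lines)

# The module-level function gives the same result for a one-off call.
one_off = re.findall(r'^spam$', lines, flags)

print(hits)                             # ['SPAM', 'spam']
print(hits == one_off)                  # True
print(isinstance(flags, re.RegexFlag))  # True
```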


.. data:: A
          ASCII

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching. This is only
   meaningful for Unicode patterns, and is ignored for byte patterns.
   Corresponds to the inline flag ``(?a)``.

   Note that for backward compatibility, the :const:`re.U` flag still
   exists (as well as its synonym :const:`re.UNICODE` and its embedded
   counterpart ``(?u)``), but these are redundant in Python 3 since
   matches are Unicode by default for strings (and Unicode matching
   isn't allowed for bytes).


.. data:: DEBUG

   Display debug information about the compiled expression.
   No corresponding inline flag.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
   match lowercase letters. Full Unicode matching (such as ``Ü`` matching
   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
   non-ASCII matches. The current locale does not change the effect of this
   flag unless the :const:`re.LOCALE` flag is also used.
   Corresponds to the inline flag ``(?i)``.

   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
   combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
   If the :const:`ASCII` flag is used, only letters 'a' to 'z'
   and 'A' to 'Z' are matched.
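
The interaction between :const:`IGNORECASE` and :const:`ASCII` can be sketched as follows. The characters are written as escapes so the code points are unambiguous; the variable names are illustrative.

```python
import re

# U+00DC (capital U with diaeresis) against U+00FC (its lowercase form).
unicode_fold = re.fullmatch('\u00dc', '\u00fc', re.IGNORECASE) is not None
ascii_fold = re.fullmatch('\u00dc', '\u00fc', re.IGNORECASE | re.ASCII) is not None

# [a-z] with IGNORECASE also matches a few non-ASCII letters, for example
# U+017F (long s) and U+212A (Kelvin sign), unless ASCII is set.
sample = '\u017f\u212a'
extra = re.findall(r'[a-z]', sample, re.IGNORECASE)
extra_ascii = re.findall(r'[a-z]', sample, re.IGNORECASE | re.ASCII)

print(unicode_fold, ascii_fold)  # True False
print(len(extra), extra_ascii)   # 2 []
```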

.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
   dependent on the current locale. This flag can be used only with bytes
   patterns. The use of this flag is discouraged because the locale mechanism
   is very unreliable: it handles only one "culture" at a time and works only
   with 8-bit locales. Unicode matching is already enabled by default
   in Python 3 for Unicode (str) patterns, and it is able to handle different
   locales/languages.
   Corresponds to the inline flag ``(?L)``.

   .. versionchanged:: 3.6
      :const:`re.LOCALE` can be used only with bytes patterns and is
      not compatible with :const:`re.ASCII`.

   .. versionchanged:: 3.7
      Compiled regular expression objects with the :const:`re.LOCALE` flag no
      longer depend on the locale at compile time. Only the locale at
      matching time affects the result of matching.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline). By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.
   Corresponds to the inline flag ``(?m)``.
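
A brief sketch of how :const:`MULTILINE` changes ``'^'`` and ``'$'``; the three-line sample string is an assumption.

```python
import re

text = 'first\nsecond\nthird\n'

# Without the flag, '^' anchors only at the very start of the string and
# '$' only at the end (or just before a final newline).
starts_default = re.findall(r'^\w+', text)
ends_default = re.findall(r'\w+$', text)

# With the flag, both anchors also apply at every line break.
starts_multi = re.findall(r'^\w+', text, re.MULTILINE)
ends_multi = re.findall(r'\w+$', text, re.MULTILINE)

print(starts_default, ends_default)  # ['first'] ['third']
print(starts_multi)                  # ['first', 'second', 'third']
print(ends_multi)                    # ['first', 'second', 'third']
```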


.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.
   Corresponds to the inline flag ``(?s)``.
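
For illustration, a two-line sample string (an assumption) shows the effect of :const:`DOTALL` and of its inline form ``(?s)``:

```python
import re

text = 'a\nb'

without_flag = re.findall(r'.+', text)         # '.' stops at the newline
with_flag = re.findall(r'.+', text, re.DOTALL) # '.' also matches the newline
inline_flag = re.findall(r'(?s).+', text)      # same effect, set inside the pattern

print(without_flag)  # ['a', 'b']
print(with_flag)     # ['a\nb']
print(inline_flag)   # ['a\nb']
```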


.. data:: X
          VERBOSE

   .. index:: single: # (hash); in regular expressions

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments. Whitespace within the pattern is ignored, except
   when in a character class, or when preceded by an unescaped backslash,
   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")

   Corresponds to the inline flag ``(?x)``.


.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.


.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if the string does not match the pattern;
   note that this is different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).


.. function:: fullmatch(pattern, string, flags=0)

   If the whole *string* matches the regular expression *pattern*, return a
   corresponding :ref:`match object <match-objects>`. Return ``None`` if the
   string does not match the pattern; note that this is different from a
   zero-length match.

   .. versionadded:: 3.4
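
The differences between :func:`search`, :func:`match` and :func:`fullmatch` can be sketched side by side. The two-line sample string and the variable names are assumptions.

```python
import re

text = 'one\ntwo'

found_anywhere = re.search(r'two', text)                # scans the whole string
at_start_only = re.match(r'two', text)                  # anchored at index 0
multiline_match = re.match(r'^two', text, re.MULTILINE) # still anchored at index 0
multiline_search = re.search(r'^two', text, re.MULTILINE)

prefix_ok = re.match(r'one', text)        # a matching prefix is enough
whole_needed = re.fullmatch(r'one', text) # the entire string must match

# A zero-length match is still a match object, not None.
empty = re.match(r'x*', text)

print(found_anywhere.span())                    # (4, 7)
print(at_start_only, multiline_match)           # None None
print(multiline_search.span())                  # (4, 7)
print(prefix_ok is not None, whole_needed)      # True None
print(empty.span())                             # (0, 0)
```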


.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. ::

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string. The same holds for
   the end of the string::

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list.

   Empty matches for the pattern split the string only when not adjacent
   to a previous empty match.

      >>> re.split(r'\b', 'Words, words, words.')
      ['', 'Words', ', ', 'words', ', ', 'words', '.']
      >>> re.split(r'\W*', '...words...')
      ['', '', 'w', 'o', 'r', 'd', 's', '', '']
      >>> re.split(r'(\W*)', '...words...')
      ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.7
      Added support of splitting on a pattern that could match an empty string.


.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings. The *string* is scanned left-to-right, and matches are returned in
   the order found. If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group. Empty matches are included in the result.

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.


.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
   all non-overlapping matches for the RE *pattern* in *string*. The *string*
   is scanned left-to-right, and matches are returned in the order found. Empty
   matches are included in the result.

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.
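
How the return value of :func:`findall` depends on the number of groups, and what :func:`finditer` adds, can be seen in a small sketch. The ``key=value`` sample text is an assumption.

```python
import re

text = 'a=1, b=22'

no_groups = re.findall(r'\w+=\d+', text)       # whole matches as strings
one_group = re.findall(r'(\w+)=\d+', text)     # just the single group
two_groups = re.findall(r'(\w+)=(\d+)', text)  # a tuple per match

# finditer yields match objects, so positions are available as well.
positions = [(m.start(), m.group()) for m in re.finditer(r'\d+', text)]

print(no_groups)   # ['a=1', 'b=22']
print(one_group)   # ['a', 'b']
print(two_groups)  # [('a', '1'), ('b', '22')]
print(positions)   # [(2, '1'), (7, '22')]
```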


.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged. *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth. Unknown escapes of ASCII letters are reserved for future use and
   treated as errors. Other unknown escapes such as ``\&`` are left alone.
   Backreferences, such
   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
   For example::

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single :ref:`match object <match-objects>`
   argument, and returns the replacement string. For example::

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'

   The pattern may be a string or a :ref:`pattern object <re-objects>`.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
   ``'-a-b--d-'``.

   .. index:: single: \g; in regular expressions

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

   .. versionchanged:: 3.6
      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
      now are errors.

   .. versionchanged:: 3.7
      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
      now are errors.

      Empty matches for the pattern are replaced when adjacent to a previous
      non-empty match.
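
The ``\g<...>`` forms and the *count* argument described for :func:`sub` can be tried out with a short sketch. The date and digit samples are assumptions; the last line relies on the Python 3.7+ handling of empty matches.

```python
import re

# Named groups are referenced as \g<name> in the replacement.
swapped = re.sub(r'(?P<year>\d{4})-(?P<month>\d{2})',
                 r'\g<month>/\g<year>', '2019-01')

# \g<2>0 is group 2 followed by a literal '0'; \20 would mean group 20.
unambiguous = re.sub(r'(\d)(\d)', r'\g<2>0\g<1>', '12')

# count limits the number of replacements; it is passed by keyword here.
first_only = re.sub(r'\s+', ' ', 'a  b   c', count=1)

# A function receives each match object and returns the replacement text.
shouted = re.sub(r'\w+', lambda m: m.group(0).upper(), 'a bc')

# Empty matches are replaced too, as in the example from the text.
dashes = re.sub('x*', '-', 'abxd')

print(swapped)      # 01/2019
print(unambiguous)  # 201
print(first_only)   # a b   c
print(shouted)      # A BC
print(dashes)       # -a-b--d-
```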


.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.
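
A minimal sketch of the tuple returned by :func:`subn`, including the Python 3.5+ treatment of unmatched groups; the sample strings are assumptions.

```python
import re

# The second element counts how many replacements were made.
masked, replaced = re.subn(r'\d+', '#', 'a1 b22 c')

# Only one alternative matches each time; the unmatched group expands to ''.
wrapped, wrapped_count = re.subn(r'(a)|(b)', r'[\1\2]', 'ab')

print(masked, replaced)        # a# b# c 2
print(wrapped, wrapped_count)  # [a][b] 2
```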


.. function:: escape(pattern)

   Escape special characters in *pattern*.
   This is useful if you want to match an arbitrary literal string that may
   have regular expression metacharacters in it. For example::

      >>> print(re.escape('python.exe'))
      python\.exe

      >>> import string
      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
      >>> print('[%s]+' % re.escape(legal_chars))
      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+

      >>> operators = ['+', '-', '*', '/', '**']
      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
      /|\-|\+|\*\*|\*

   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`; only backslashes should be escaped. For example::

      >>> digits_re = r'\d+'
      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
      /usr/sbin/sendmail - \d+ errors, \d+ warnings

   .. versionchanged:: 3.3
      The ``'_'`` character is no longer escaped.

   .. versionchanged:: 3.7
      Only characters that can have special meaning in a regular expression
      are escaped.
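
One common use of :func:`escape` is searching for text that comes from outside the program. The sketch below assumes a hypothetical ``needle`` string containing metacharacters.

```python
import re

needle = '1+1=2 (approx.)'          # hypothetical user-supplied text
haystack = 'note: 1+1=2 (approx.)'

# Used directly, '+', '(' and '.' are interpreted as regex syntax.
raw_hit = re.search(needle, haystack)

# Escaped, every character is matched literally.
literal = re.compile(re.escape(needle))
escaped_hit = literal.search(haystack)

print(raw_hit)             # None
print(escaped_hit.span())  # (6, 21)
```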


.. function:: purge()

   Clear the regular expression cache.


.. exception:: error(msg, pattern=None, pos=None)

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern. The error instance has
   the following additional attributes:

   .. attribute:: msg

      The unformatted error message.

   .. attribute:: pattern

      The regular expression pattern.

   .. attribute:: pos

      The index in *pattern* where compilation failed (may be ``None``).

   .. attribute:: lineno

      The line corresponding to *pos* (may be ``None``).

   .. attribute:: colno

      The column corresponding to *pos* (may be ``None``).

   .. versionchanged:: 3.5
      Added additional attributes.
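
A sketch of catching the exception and reading its attributes. The invalid pattern is an assumption, and the exact wording of ``msg`` is not part of the documented interface, so it is only printed here.

```python
import re

bad_pattern = r'(unclosed'   # missing closing parenthesis

try:
    re.compile(bad_pattern)
    caught = None
except re.error as exc:
    caught = exc

# The attributes describe what failed and where.
print(caught.msg)
print(caught.pattern == bad_pattern)      # True
print(caught.pos, caught.lineno, caught.colno)

# Failing to find a match is not an error; it just returns None.
no_match = re.search(r'x', 'abc')
print(no_match)                           # None
```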

.. _re-objects:

Regular Expression Objects
--------------------------

Compiled regular expression objects support the following methods and
attributes:

.. method:: Pattern.search(string[, pos[, endpos]])

   Scan through *string* looking for the first location where this regular
   expression produces a match, and return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.

   The optional second parameter *pos* gives an index in the string where the
   search is to start; it defaults to ``0``. This is not completely equivalent to
   slicing the string; the ``'^'`` pattern character matches at the real beginning
   of the string and at positions just after a newline, but not necessarily at the
   index where the search is to start.

   The optional parameter *endpos* limits how far the string will be searched; it
   will be as if the string is *endpos* characters long, so only the characters
   from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
   expression object, ``rx.search(string, 0, 50)`` is equivalent to
   ``rx.search(string[:50], 0)``. ::

      >>> pattern = re.compile("d")
      >>> pattern.search("dog")     # Match at index 0
      <re.Match object; span=(0, 1), match='d'>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
Georg Brandl116aa622007-08-15 14:28:22 +00001026
1027
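The difference between passing *pos* to :meth:`~Pattern.search` and slicing the
string can be seen in a small sketch. The pattern and the sample text are
made up for illustration.

```python
import re

pattern = re.compile(r"^b")
text = "abc"

# Slicing creates a new string "bc", so "^" matches at its start.
sliced = pattern.search(text[1:])

# With pos=1 the search starts at index 1, but "^" still refers to the
# real beginning of "abc", so nothing matches.
offset = pattern.search(text, 1)

print(sliced)  # a match object with span (0, 1)
print(offset)  # None
```
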
.. method:: Pattern.match(string[, pos[, endpos]])

   If zero or more characters at the *beginning* of *string* match this regular
   expression, return a corresponding :ref:`match object <match-objects>`.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <re.Match object; span=(1, 2), match='o'>

   If you want to locate a match anywhere in *string*, use
   :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).


.. method:: Pattern.fullmatch(string[, pos[, endpos]])

   If the whole *string* matches this regular expression, return a corresponding
   :ref:`match object <match-objects>`. Return ``None`` if the string does not
   match the pattern; note that this is different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o[gh]")
      >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.fullmatch("ogre")     # No match as the pattern does not cover the full string.
      >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
      <re.Match object; span=(1, 3), match='og'>

   .. versionadded:: 3.4


.. method:: Pattern.split(string, maxsplit=0)

   Identical to the :func:`split` function, using the compiled pattern.


.. method:: Pattern.findall(string[, pos[, endpos]])

   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region as for :meth:`~Pattern.search`.


.. method:: Pattern.finditer(string[, pos[, endpos]])

   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region as for :meth:`~Pattern.search`.


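A brief sketch of how *pos* and *endpos* restrict :meth:`~Pattern.findall` and
:meth:`~Pattern.finditer`. The sample text is invented for this example. Note
that the reported spans stay relative to the full string, not to the window.

```python
import re

digits = re.compile(r"\d")
text = "a1b2c3d4"

all_digits = digits.findall(text)

# Only indices 2..5 ("b2c3") are searched.
window = digits.findall(text, 2, 6)

# finditer() takes the same limits; spans are indices into the whole string.
spans = [m.span() for m in digits.finditer(text, 2, 6)]

print(all_digits)  # ['1', '2', '3', '4']
print(window)      # ['2', '3']
print(spans)       # [(3, 4), (5, 6)]
```
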
.. method:: Pattern.sub(repl, string, count=0)

   Identical to the :func:`sub` function, using the compiled pattern.


.. method:: Pattern.subn(repl, string, count=0)

   Identical to the :func:`subn` function, using the compiled pattern.


.. attribute:: Pattern.flags

   The regex matching flags. This is a combination of the flags given to
   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
   flags such as :data:`UNICODE` if the pattern is a Unicode string.


.. attribute:: Pattern.groups

   The number of capturing groups in the pattern.


.. attribute:: Pattern.groupindex

   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   numbers. The dictionary is empty if no symbolic groups were used in the
   pattern.


.. attribute:: Pattern.pattern

   The pattern string from which the pattern object was compiled.


.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Compiled
   regular expression objects are considered atomic.


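The pattern attributes can be inspected directly, as in the following sketch.
The date-like pattern is only an example. The last line assumes Python 3.7 or
later, where copying a compiled pattern returns the same object because it is
treated as atomic.

```python
import copy
import re

rx = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})", re.IGNORECASE)

print(rx.pattern)                      # the original pattern string
print(rx.groups)                       # 2
print(dict(rx.groupindex))             # {'year': 1, 'month': 2}
print(bool(rx.flags & re.IGNORECASE))  # True: flag passed to compile()
print(bool(rx.flags & re.UNICODE))     # True: implicit for str patterns

# Atomic objects are not duplicated by the copy module.
same_object = copy.copy(rx) is rx and copy.deepcopy(rx) is rx
print(same_object)                     # True
```
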
.. _match-objects:

Match Objects
-------------

Match objects always have a boolean value of ``True``.
Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
when there is no match, you can test whether there was a match with a simple
``if`` statement::

   match = re.search(pattern, string)
   if match:
       process(match)

Match objects support the following methods and attributes:


.. method:: Match.expand(template)

   Return the string obtained by doing backslash substitution on the template
   string *template*, as done by the :meth:`~Pattern.sub` method.
   Escapes such as ``\n`` are converted to the appropriate characters,
   and numeric backreferences (``\1``, ``\2``) and named backreferences
   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
   corresponding group.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

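A small sketch of :meth:`~Match.expand` with both numeric and named
backreferences. The group names ``first`` and ``last`` are chosen for this
example only.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

by_number = m.expand(r"\2, \1")              # numeric backreferences
by_name = m.expand(r"\g<last>, \g<first>")   # named backreferences

print(by_number)  # Newton, Isaac
print(by_name)    # Newton, Isaac
```
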
.. method:: Match.group([group1, ...])

   Return one or more subgroups of the match. If there is a single argument, the
   result is a single string; if there are multiple arguments, the result is a
   tuple with one item per argument. Without arguments, *group1* defaults to zero
   (the whole match is returned). If a *groupN* argument is zero, the corresponding
   return value is the entire matching string; if it is in the inclusive range
   [1..99], it is the string matching the corresponding parenthesized group. If a
   group number is negative or larger than the number of groups defined in the
   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
   part of the pattern that did not match, the corresponding result is ``None``.
   If a group is contained in a part of the pattern that matched multiple times,
   the last match is returned. ::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)       # The entire match
      'Isaac Newton'
      >>> m.group(1)       # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)       # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')

   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   arguments may also be strings identifying groups by their group name. If a
   string argument is not used as a group name in the pattern, an :exc:`IndexError`
   exception is raised.

   A moderately complicated example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

   Named groups can also be referred to by their index::

      >>> m.group(1)
      'Malcolm'
      >>> m.group(2)
      'Reynolds'

   If a group matches multiple times, only the last match is accessible::

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)                        # Returns only the last match.
      'c3'


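The two failure modes of :meth:`~Match.group` described above (a group that did
not participate versus a group that does not exist) can be told apart as in
this sketch. The pattern and the list name ``rejected`` are illustrative.

```python
import re

m = re.match(r"(a)|(b)", "a")

print(m.group(1))  # 'a': group 1 took part in the match
print(m.group(2))  # None: group 2 exists but did not participate

# Unknown group numbers and unknown group names both raise IndexError.
rejected = []
for bad in (3, "missing"):
    try:
        m.group(bad)
    except IndexError:
        rejected.append(bad)

print(rejected)    # [3, 'missing']
```
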
.. method:: Match.__getitem__(g)

   This is identical to ``m.group(g)``. This allows easier access to
   an individual group from a match::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m[0]       # The entire match
      'Isaac Newton'
      >>> m[1]       # The first parenthesized subgroup.
      'Isaac'
      >>> m[2]       # The second parenthesized subgroup.
      'Newton'

   .. versionadded:: 3.6


.. method:: Match.groups(default=None)

   Return a tuple containing all the subgroups of the match, from 1 up to however
   many groups are in the pattern. The *default* argument is used for groups that
   did not participate in the match; it defaults to ``None``.

   For example::

      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
      >>> m.groups()
      ('24', '1632')

   If we make the decimal place and everything after it optional, not all groups
   might participate in the match. These groups will default to ``None`` unless
   the *default* argument is given::

      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()      # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')   # Now, the second group defaults to '0'.
      ('24', '0')


.. method:: Match.groupdict(default=None)

   Return a dictionary containing all the *named* subgroups of the match, keyed by
   the subgroup name. The *default* argument is used for groups that did not
   participate in the match; it defaults to ``None``. For example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}


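The *default* argument of :meth:`~Match.groupdict` behaves like the one of
:meth:`~Match.groups`. A minimal sketch, with group names ``whole`` and
``frac`` invented for the example:

```python
import re

# The fractional part is optional, so "frac" may not participate.
m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

without_default = m.groupdict()
with_default = m.groupdict("0")

print(without_default)  # {'whole': '24', 'frac': None}
print(with_default)     # {'whole': '24', 'frac': '0'}
```
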
.. method:: Match.start([group])
            Match.end([group])

   Return the indices of the start and end of the substring matched by *group*;
   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   *group* exists but did not contribute to the match. For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string. For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

   An example that will remove *remove_this* from email addresses::

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


.. method:: Match.span([group])

   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
   that if *group* did not contribute to the match, this is ``(-1, -1)``.
   *group* defaults to zero, the entire match.


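The index values quoted for ``re.search('b(c?)', 'cba')`` can be checked
directly. The second match object, ``n``, is an extra case added here to show
the ``-1`` result for a group that exists but did not contribute.

```python
import re

m = re.search("b(c?)", "cba")
print(m.span(0))  # (1, 2): the whole match is "b"
print(m.span(1))  # (2, 2): group 1 matched the empty string

# The slice identity holds for any group that contributed to the match.
print(m.string[m.start(0):m.end(0)] == m.group(0))  # True

n = re.match(r"(a)|(b)", "a")
print(n.start(2), n.end(2))  # -1 -1
print(n.span(2))             # (-1, -1)
```
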
.. attribute:: Match.pos

   The value of *pos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string at which the RE engine started looking for a match.


.. attribute:: Match.endpos

   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string beyond which the RE engine will not go.


.. attribute:: Match.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   string.


.. attribute:: Match.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.


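The ``lastindex`` values listed above, together with the corresponding
behaviour of ``lastgroup``, can be reproduced with a short loop. The group name
``word`` is an assumption made for this sketch.

```python
import re

patterns = [r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)"]
last_indexes = [re.match(p, "ab").lastindex for p in patterns]
print(last_indexes)  # [1, 1, 1, 2]

named = re.match(r"(?P<word>\w+)", "hi")
unnamed = re.match(r"(\w+)", "hi")
no_groups = re.match(r"\w+", "hi")

print(named.lastgroup)      # 'word'
print(unnamed.lastgroup)    # None: the last matched group has no name
print(no_groups.lastindex)  # None: the pattern has no groups at all
```
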
.. attribute:: Match.re

   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
   :meth:`~Pattern.search` method produced this match instance.


.. attribute:: Match.string

   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.


.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Match objects
   are considered atomic.


.. _re-examples:

Regular Expression Examples
---------------------------


Checking for a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following::

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use a backreference like this::

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~Match.group` method of the match object in the following manner::

   >>> pair = re.compile(r".*(.).*\1")
   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'


Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings


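Applied from Python, the regular expression above might be used as in the
following sketch. Unlike :c:func:`scanf`, the groups come back as strings, so
the numeric fields are converted explicitly. The variable names are
illustrative.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.search(r"(\S+) - (\d+) errors, (\d+) warnings", line)

filename = m.group(1)
n_errors = int(m.group(2))     # groups are strings; convert as needed
n_warnings = int(m.group(3))

print(filename, n_errors, n_warnings)  # /usr/sbin/sendmail 0 4
```
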
.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <re.Match object; span=(2, 3), match='c'>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <re.Match object; span=(0, 1), match='a'>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line. ::

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <re.Match object; span=(4, 5), match='X'>


1470Making a Phonebook
1471^^^^^^^^^^^^^^^^^^
1472
Georg Brandl48310cd2009-01-03 21:18:54 +00001473:func:`split` splits a string into a list delimited by the passed pattern. The
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001474method is invaluable for converting textual data into data structures that can be
1475easily read and modified by Python as demonstrated in the following example that
1476creates a phonebook.
1477
Christian Heimes255f53b2007-12-08 15:33:56 +00001478First, here is the input. Normally it may come from a file, here we are using
Stéphane Wirtel859c0682018-10-12 09:51:05 +02001479triple-quoted string syntax
1480
1481.. doctest::
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001482
Georg Brandl557a3ec2012-03-17 17:26:27 +01001483 >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl48310cd2009-01-03 21:18:54 +00001484 ...
Christian Heimesfe337bf2008-03-23 21:54:12 +00001485 ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
1486 ... Frank Burger: 925.541.7625 662 South Dogwood Way
1487 ...
1488 ...
1489 ... Heather Albrecht: 548.326.4584 919 Park Place"""
Christian Heimes255f53b2007-12-08 15:33:56 +00001490
1491The entries are separated by one or more newlines. Now we convert the string
Christian Heimesfe337bf2008-03-23 21:54:12 +00001492into a list with each nonempty line having its own entry:
1493
1494.. doctest::
1495 :options: +NORMALIZE_WHITESPACE
Christian Heimes255f53b2007-12-08 15:33:56 +00001496
Georg Brandl557a3ec2012-03-17 17:26:27 +01001497 >>> entries = re.split("\n+", text)
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001498 >>> entries
Christian Heimesfe337bf2008-03-23 21:54:12 +00001499 ['Ross McFluff: 834.345.1254 155 Elm Street',
1500 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
1501 'Frank Burger: 925.541.7625 662 South Dogwood Way',
1502 'Heather Albrecht: 548.326.4584 919 Park Place']
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001503
1504Finally, split each entry into a list with first name, last name, telephone
Christian Heimesc3f30c42008-02-22 16:37:40 +00001505number, and address. We use the ``maxsplit`` parameter of :func:`split`
Christian Heimesfe337bf2008-03-23 21:54:12 +00001506because the address has spaces, our splitting pattern, in it:
1507
1508.. doctest::
1509 :options: +NORMALIZE_WHITESPACE
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001510
Christian Heimes255f53b2007-12-08 15:33:56 +00001511 >>> [re.split(":? ", entry, 3) for entry in entries]
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001512 [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
1513 ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
1514 ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
1515 ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
1516
The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, maxsplit=4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

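To see why ``maxsplit`` matters here, the following sketch splits a single
entry with and without a limit. The entry is copied from the phonebook text
above; the variable names ``unlimited`` and ``limited`` are illustrative only:

```python
import re

entry = "Ross McFluff: 834.345.1254 155 Elm Street"

# Without a limit, every space is a split point, so the address is
# broken into separate words as well.
unlimited = re.split(":? ", entry)

# With maxsplit=3, only the first three separators are used and the rest
# of the string (the address) is returned as the final element.
limited = re.split(":? ", entry, maxsplit=3)

print(unlimited)
print(limited)
```

Passing ``maxsplit`` by keyword keeps the call readable and avoids confusing it
with the ``flags`` argument that follows it.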

Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function. This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

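Because the shuffle is random, each call gives a different result. If
repeatable output is wanted (for tests, say), one possible variation is to
shuffle with a seeded :class:`random.Random` instance. The ``munge`` helper and
its ``seed`` parameter below are hypothetical names used only for this sketch:

```python
import random
import re

def munge(text, seed=None):
    """Shuffle the inner letters of each word, keeping first and last."""
    rng = random.Random(seed)  # private generator, so the seed stays local

    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    return re.sub(r"(\w)(\w+)(\w)", repl, text)

text = "Professor Abdolmalek, please report your absences promptly."
# The same seed yields the same munged sentence on every call.
print(munge(text, seed=1) == munge(text, seed=1))
```

Words of fewer than three characters are left alone, because the pattern needs
at least one character for each of its three groups.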

Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does. For example, if a writer wanted to
find all of the adverbs in some text, they might use :func:`findall` in
the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']

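The pattern ``\w+ly`` is deliberately simple, and it also matches an ``ly``
that occurs in the middle of a longer word. One possible refinement, sketched
here with a made-up sentence, is to require a word boundary (``\b``) after the
``ly``:

```python
import re

text = "Her slyness was oddly charming."

# The plain pattern finds "sly" inside "slyness".
loose = re.findall(r"\w+ly", text)

# Requiring a word boundary after "ly" keeps only words that end in "ly".
strict = re.findall(r"\w+ly\b", text)

print(loose)
print(strict)
```

Even with the boundary, the pattern only tests spelling: a noun such as
"family" would still be reported.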

Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings. Continuing with the previous example, if
a writer wanted to find all of the adverbs *and their positions* in
some text, they would use :func:`finditer` in the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly

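When the positions are needed later rather than printed straight away, the
match objects can be reduced to plain tuples. This is only a sketch of one way
to do it; ``positions`` is an illustrative name, and ``m.span()`` returns the
same values as ``(m.start(), m.end())``:

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Keep each matched word together with its (start, end) offsets.
positions = [(m.group(0), m.span()) for m in re.finditer(r"\w+ly", text)]

# The offsets can be used to slice the original string.
for word, (start, end) in positions:
    print(word, text[start:end] == word)
```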

Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it. For example, the two following lines of code are
functionally identical::

   >>> re.match(r"\W(.)\1\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular
expression. With raw string notation, this means ``r"\\"``. Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical::

   >>> re.match(r"\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>
   >>> re.match("\\\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>

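The two spellings are interchangeable because they produce the same string
object contents; only the source notation differs. A quick check, offered as
an illustration with arbitrary variable names:

```python
# Raw and escaped literals describe the same characters.
raw_pattern = r"\W(.)\1\W"
escaped_pattern = "\\W(.)\\1\\W"
print(raw_pattern == escaped_pattern)

# A regex for one literal backslash is two characters long either way.
raw_backslash = r"\\"
escaped_backslash = "\\\\"
print(len(raw_backslash), len(escaped_backslash))

# Without the r prefix, Python interprets the escape itself:
# "\n" is a single newline character, r"\n" is a backslash followed by "n".
print(len("\n"), len(r"\n"))
```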

Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^

A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
analyzes a string to categorize groups of characters. This is a useful first
step in writing a compiler or interpreter.

The text categories are specified with regular expressions. The technique is
to combine those into a single master regular expression and to loop over
successive matches::

   import collections
   import re

   Token = collections.namedtuple('Token', ['type', 'value', 'line', 'column'])

   def tokenize(code):
       keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
       token_specification = [
           ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
           ('ASSIGN',   r':='),           # Assignment operator
           ('END',      r';'),            # Statement terminator
           ('ID',       r'[A-Za-z]+'),    # Identifiers
           ('OP',       r'[+\-*/]'),      # Arithmetic operators
           ('NEWLINE',  r'\n'),           # Line endings
           ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
           ('MISMATCH', r'.'),            # Any other character
       ]
       tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
       line_num = 1
       line_start = 0
       for mo in re.finditer(tok_regex, code):
           kind = mo.lastgroup
           value = mo.group()
           column = mo.start() - line_start
           if kind == 'NUMBER':
               value = float(value) if '.' in value else int(value)
           elif kind == 'ID' and value in keywords:
               kind = value
           elif kind == 'NEWLINE':
               line_start = mo.end()
               line_num += 1
               continue
           elif kind == 'SKIP':
               continue
           elif kind == 'MISMATCH':
               raise RuntimeError(f'{value!r} unexpected on line {line_num}')
           yield Token(kind, value, line_num, column)

   statements = '''
       IF quantity THEN
           total := total + price * quantity;
           tax := price * 0.05;
       ENDIF;
   '''

   for token in tokenize(statements):
       print(token)

The tokenizer produces the following output::

   Token(type='IF', value='IF', line=2, column=4)
   Token(type='ID', value='quantity', line=2, column=7)
   Token(type='THEN', value='THEN', line=2, column=16)
   Token(type='ID', value='total', line=3, column=8)
   Token(type='ASSIGN', value=':=', line=3, column=14)
   Token(type='ID', value='total', line=3, column=17)
   Token(type='OP', value='+', line=3, column=23)
   Token(type='ID', value='price', line=3, column=25)
   Token(type='OP', value='*', line=3, column=31)
   Token(type='ID', value='quantity', line=3, column=33)
   Token(type='END', value=';', line=3, column=41)
   Token(type='ID', value='tax', line=4, column=8)
   Token(type='ASSIGN', value=':=', line=4, column=12)
   Token(type='ID', value='price', line=4, column=15)
   Token(type='OP', value='*', line=4, column=21)
   Token(type='NUMBER', value=0.05, line=4, column=23)
   Token(type='END', value=';', line=4, column=27)
   Token(type='ENDIF', value='ENDIF', line=5, column=4)
   Token(type='END', value=';', line=5, column=9)

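Stripped of the line and column bookkeeping, the core of the technique fits in
a few lines. The following is a reduced sketch, not a replacement for the
tokenizer above; the names ``spec``, ``master`` and ``scan`` are made up for
illustration:

```python
import re

# Each category becomes a named group; the groups are joined with "|".
spec = [
    ("NUMBER", r"\d+"),
    ("ID", r"[A-Za-z]+"),
    ("SKIP", r"\s+"),
]
master = "|".join(f"(?P<{name}>{pattern})" for name, pattern in spec)

def scan(code):
    """Yield (category, text) pairs, dropping whitespace."""
    for mo in re.finditer(master, code):
        # lastgroup is the name of the alternative that matched.
        if mo.lastgroup != "SKIP":
            yield mo.lastgroup, mo.group()

print(list(scan("total 42 x7")))
```

Unlike the full example, this sketch has no ``MISMATCH`` category, so
characters that match none of the alternatives are silently skipped by
:func:`finditer` instead of raising an error.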

.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
   Media, 2009. The third edition of the book no longer covers Python at all,
   but the first edition covered writing good regular expression patterns in
   great detail.