:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

**Source code:** :source:`Lib/re.py`

--------------

This module provides regular expression matching operations similar to
those found in Perl.

Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.

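As a minimal sketch of the difference (the variable names are illustrative
only), the following compares the two spellings of a pattern that matches one
literal backslash:

.. code-block:: python

   import re

   # Both literals denote the same two-character pattern: backslash, backslash.
   plain_pattern = '\\\\'
   raw_pattern = r'\\'
   same_pattern = plain_pattern == raw_pattern

   raw_newline_length = len(r'\n')     # backslash followed by 'n'
   plain_newline_length = len('\n')    # a single newline character

   # The subject string holds exactly one backslash between 'a' and 'b'.
   subject = 'a\\b'
   matched_text = re.search(raw_pattern, subject).group(0)
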
It is important to note that most regular expression operations are available as
module-level functions and methods on
:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
that don't require you to compile a regex object first, but miss some
fine-tuning parameters.

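One such fine-tuning parameter is the optional start position accepted by the
``search()`` method of a compiled pattern, which the module-level
:func:`search` function does not take. A small sketch, with illustrative names:

.. code-block:: python

   import re

   text = 'spam spam eggs'

   # Module-level shortcut: no compile step is needed in user code.
   first = re.search(r'spam', text)

   # Compiled pattern: search() also accepts a position to start from.
   pattern = re.compile(r'spam')
   second = pattern.search(text, 1)

   first_start = first.start()
   second_start = second.start()
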
.. seealso::

   The third-party `regex <https://pypi.org/project/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contain low-precedence
operations, boundary conditions between *A* and *B*, or numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book [Frie09]_,
or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write RE's in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.


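The following sketch (names are illustrative) shows the grouped form at work
and checks that the directly nested form is rejected when compiled:

.. code-block:: python

   import re

   # Grouping the inner repetition lets an outer repetition apply to it.
   multiple_of_six = re.compile(r'(?:a{6})*')
   twelve_ok = multiple_of_six.fullmatch('a' * 12) is not None
   seven_ok = multiple_of_six.fullmatch('a' * 7) is not None

   # Nesting the qualifiers directly raises re.error at compile time.
   try:
       re.compile(r'a{6}*')
       nested_rejected = False
   except re.error:
       nested_rejected = True
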
The special characters are:

.. index:: single: . (dot); in regular expressions

``.``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

.. index:: single: ^ (caret); in regular expressions

``^``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

.. index:: single: $ (dollar); in regular expressions

``$``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both ``'foo'`` and ``'foobar'``, while the regular expression ``foo$``
   matches only ``'foo'``. More interestingly, searching for ``foo.$`` in
   ``'foo1\nfoo2\n'`` matches ``'foo2'`` normally, but ``'foo1'`` in
   :const:`MULTILINE` mode; searching for a single ``$`` in ``'foo\n'`` will find
   two (empty) matches: one just before the newline, and one at the end of the
   string.

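   The cases described for ``$`` can be checked with a short sketch (variable
   names are illustrative):

   .. code-block:: python

      import re

      text = 'foo1\nfoo2\n'

      # Default mode: '$' matches at the very end or before a final newline.
      default_match = re.search(r'foo.$', text).group(0)
      # MULTILINE mode: '$' also matches before every newline.
      multiline_match = re.search(r'foo.$', text, re.MULTILINE).group(0)

      # A lone '$' finds two empty matches in 'foo\n'.
      dollar_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]
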
.. index:: single: * (asterisk); in regular expressions

``*``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match ``'a'``, ``'ab'``, or
   ``'a'`` followed by any number of ``'b'``\ s.

.. index:: single: + (plus); in regular expressions

``+``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match ``'a'`` followed by any non-zero number of ``'b'``\ s; it
   will not match just ``'a'``.

.. index:: single: ? (question mark); in regular expressions

``?``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either ``'a'`` or ``'ab'``.

.. index::
   single: *?; in regular expressions
   single: +?; in regular expressions
   single: ??; in regular expressions

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
   string, and not just ``'<a>'``. Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using the RE ``<.*?>`` will match
   only ``'<a>'``.

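   A minimal sketch of the greedy and non-greedy behaviour just described
   (variable names are illustrative):

   .. code-block:: python

      import re

      text = '<a> b <c>'
      greedy = re.match(r'<.*>', text).group(0)   # runs to the last '>'
      lazy = re.match(r'<.*?>', text).group(0)    # stops at the first '>'
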
.. index::
   single: {} (curly brackets); in regular expressions

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

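   The bounded qualifiers can be exercised the same way; this is only a
   sketch, and the variable names are illustrative:

   .. code-block:: python

      import re

      subject = 'aaaaaa'
      greedy_run = re.match(r'a{3,5}', subject).group(0)   # takes five
      lazy_run = re.match(r'a{3,5}?', subject).group(0)    # takes three

      # An omitted upper bound means "no limit".
      at_least_four = re.compile(r'a{4,}b')
      long_ok = at_least_four.fullmatch('a' * 1000 + 'b') is not None
      short_ok = at_least_four.fullmatch('aaab') is not None
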
.. index:: single: \ (backslash); in regular expressions

``\``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be doubled. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

.. index::
   single: [] (square brackets); in regular expressions

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   .. index:: single: - (minus); in regular expressions

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character
     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   .. index:: single: \ (backslash); in regular expressions

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`ASCII` or :const:`LOCALE` mode is in force.

   .. index:: single: ^ (caret); in regular expressions

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, ``[()[\]{}]`` and
     ``[]()[{}]`` will both match a parenthesis.

   .. .. index:: single: --; in regular expressions
   .. .. index:: single: &&; in regular expressions
   .. .. index:: single: ~~; in regular expressions
   .. .. index:: single: ||; in regular expressions

   * Support of nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future. This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
     avoid a warning, escape them with a backslash.

   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/

   .. versionchanged:: 3.7
      :exc:`FutureWarning` is raised if a character set contains constructs
      that will change semantically in the future.

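   Several of the set rules above can be illustrated with a short sketch; the
   sample strings and variable names are assumptions chosen for the example:

   .. code-block:: python

      import re

      literal_hyphen = re.findall(r'[a\-z]', 'a-z b')      # escaped '-' is literal
      letter_range = re.findall(r'[a-z]', 'a-z b')         # '-' forms a range
      literal_specials = re.findall(r'[(+*)]', '(1+2)*3')  # specials are literal
      not_five = re.findall(r'[^5]', '1525')               # complemented set

      minutes_ok = re.fullmatch(r'[0-5][0-9]', '59') is not None
      minutes_bad = re.fullmatch(r'[0-5][0-9]', '60') is not None
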
.. index:: single: | (vertical bar); in regular expressions

``|``
   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
   will match either *A* or *B*. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once *A* matches, *B* will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

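   A small sketch of the left-to-right rule (variable names are illustrative):

   .. code-block:: python

      import re

      # The first alternative that matches wins, even if a later one is longer.
      first_wins = re.match(r'a|ab', 'ab').group(0)
      reordered = re.match(r'ab|a', 'ab').group(0)

      # Two ways to match a literal vertical bar.
      escaped_bar = re.findall(r'\|', 'x|y|z')
      class_bar = re.findall(r'[|]', 'x|y|z')
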
.. index::
   single: () (parentheses); in regular expressions

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

.. index:: single: (?; in regular expressions

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?aiLmsux)``
   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
   for the entire regular expression.
   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flag* argument to the
   :func:`re.compile` function. Flags should be used first in the
   expression string.

.. index:: single: (?:; in regular expressions

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?aiLmsux-imsx:...)``
   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
   one or more letters from the set ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
   The letters set or remove the corresponding flags:
   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
   and :const:`re.X` (verbose), for that part of the expression.
   (The flags are described in :ref:`contents-of-module-re`.)

   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
   as inline flags, so they can't be combined or follow ``'-'``. Instead,
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default). In bytes patterns ``(?L:...)`` switches to locale-dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.

   .. versionadded:: 3.6

   .. versionchanged:: 3.7
      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.

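   The difference between a global inline flag and a scoped one can be
   sketched as follows (patterns and names are chosen only for illustration):

   .. code-block:: python

      import re

      # Global inline flag at the start: the whole RE ignores case.
      global_flag = re.fullmatch(r'(?i)ab', 'AB') is not None

      # Scoped flag: only the 'a' inside the group ignores case.
      scoped = re.compile(r'(?i:a)b')
      mixed_ok = scoped.fullmatch('Ab') is not None
      upper_b_ok = scoped.fullmatch('AB') is not None

      # Removing a flag for part of a pattern compiled with re.IGNORECASE.
      partly_strict = re.compile(r'a(?-i:b)', re.IGNORECASE)
      strict_ok = partly_strict.fullmatch('Ab') is not None
      strict_bad = partly_strict.fullmatch('AB') is not None
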
.. index:: single: (?P<; in regular expressions

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object *m*      | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the *repl*      | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

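   The three contexts in the table can be tried out with a short sketch; the
   sample sentence and variable names are assumptions made for the example:

   .. code-block:: python

      import re

      # Context 1: (?P=quote) refers back to the group inside the pattern.
      quoted = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
      m = quoted.search('say "hi" now')

      # Context 2: the match object exposes the group by name.
      whole = m.group(0)
      quote_char = m.group('quote')
      quote_end = m.end('quote')

      # Context 3: \g<quote> refers to the group in a replacement string.
      replaced = quoted.sub(r'\g<quote>...\g<quote>', 'say "hi" now')
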
.. index:: single: (?P=; in regular expressions

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

.. index:: single: (?#; in regular expressions

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

.. index:: single: (?=; in regular expressions

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a :dfn:`lookahead assertion`. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

.. index:: single: (?!; in regular expressions

``(?!...)``
   Matches if ``...`` doesn't match next. This is a :dfn:`negative lookahead assertion`.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

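   Both lookahead forms can be checked with a brief sketch (variable names are
   illustrative):

   .. code-block:: python

      import re

      # The lookahead must succeed, but its text is not part of the match.
      followed = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov')
      lookahead_text = followed.group(0)

      not_asimov = re.match(r'Isaac (?!Asimov)', 'Isaac Newton') is not None
      is_asimov = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov') is not None
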
.. index:: single: (?<=; in regular expressions

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

   .. versionchanged:: 3.5
      Added support for group references of fixed length.

.. index:: single: (?<!; in regular expressions

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length. Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

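   As a sketch of a negative lookbehind (the sample text and names are
   assumptions for illustration), the following finds words that do not
   directly follow a hyphen:

   .. code-block:: python

      import re

      # Words that are not immediately preceded by a hyphen.
      words = re.findall(r'(?<!-)\b\w+', 'spam-egg ham')

      # Unlike a positive lookbehind, this can succeed at position 0.
      at_start = re.match(r'(?<!-)\w+', 'egg').group(0)
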
``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or
   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
   optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.


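   The four cases listed for the conditional email pattern can be checked with
   a short sketch (variable names are illustrative):

   .. code-block:: python

      import re

      address = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')
      candidates = ['<user@host.com>', 'user@host.com',
                    '<user@host.com', 'user@host.com>']

      # True where the pattern matches from the start of the candidate.
      results = [address.match(c) is not None for c in candidates]
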
The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character. For example, ``\$`` matches the
character ``'$'``.

.. index:: single: \ (backslash); in regular expressions

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

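   A minimal sketch of numbered backreferences and of the octal
   interpretation (variable names are illustrative):

   .. code-block:: python

      import re

      repeated = re.compile(r'(.+) \1')
      words_ok = repeated.fullmatch('the the') is not None
      digits_ok = repeated.fullmatch('55 55') is not None
      no_space = repeated.search('thethe') is not None

      # A leading 0 makes the escape an octal character code: \060 is '0'.
      octal_zero = re.fullmatch(r'\060', '0') is not None
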
.. index:: single: \A; in regular expressions

``\A``
   Matches only at the start of the string.

.. index:: single: \b; in regular expressions

``\b``
   Matches the empty string, but only at the beginning or end of a word.
   A word is defined as a sequence of word characters. Note that formally,
   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
   (or vice versa), or between ``\w`` and the beginning/end of the string.
   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.

   By default Unicode alphanumerics are the ones used in Unicode patterns, but
   this can be changed by using the :const:`ASCII` flag. Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.

.. index:: single: \B; in regular expressions

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end
   of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
   patterns are Unicode alphanumerics or the underscore, although this can
   be changed by using the :const:`ASCII` flag. Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.

Serhiy Storchaka9a75b842018-10-26 11:18:42 +0300479.. index:: single: \d; in regular expressions
480
Georg Brandl116aa622007-08-15 14:28:22 +0000481``\d``
Antoine Pitroufd036452008-08-19 17:56:33 +0000482 For Unicode (str) patterns:
Mark Dickinson1f268282009-07-28 17:22:36 +0000483 Matches any Unicode decimal digit (that is, any character in
484 Unicode character category [Nd]). This includes ``[0-9]``, and
485 also many other digit characters. If the :const:`ASCII` flag is
Serhiy Storchaka3557b052017-10-24 23:31:42 +0300486 used only ``[0-9]`` is matched.
Serhiy Storchakacd195e22017-10-14 11:14:26 +0300487
Antoine Pitroufd036452008-08-19 17:56:33 +0000488 For 8-bit (bytes) patterns:
Mark Summerfield6c4f6172008-08-20 07:34:41 +0000489 Matches any decimal digit; this is equivalent to ``[0-9]``.
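
As a rough illustration of the difference between Unicode and ASCII digit
matching, the sketch below uses two Arabic-Indic digits (category Nd) as a
sample of the "other digit characters" mentioned above. The sample string is
an arbitrary choice, not something the module prescribes.

```python
import re

arabic_indic = '\u0663\u0664'  # ARABIC-INDIC DIGIT THREE and FOUR

# A str pattern matches any character in category Nd by default.
unicode_match = re.fullmatch(r'\d+', arabic_indic)

# With re.ASCII, only [0-9] counts as a digit.
ascii_match = re.fullmatch(r'\d+', arabic_indic, re.ASCII)

# A bytes pattern always behaves like [0-9].
bytes_match = re.fullmatch(rb'\d+', b'34')

print(unicode_match is not None, ascii_match is None, bytes_match is not None)
```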

.. index:: single: \D; in regular expressions

``\D``
   Matches any character which is not a decimal digit. This is
   the opposite of ``\d``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^0-9]``.

.. index:: single: \s; in regular expressions

``\s``
   For Unicode (str) patterns:
      Matches Unicode whitespace characters (which includes
      ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many
      languages). If the :const:`ASCII` flag is used, only
      ``[ \t\n\r\f\v]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
      this is equivalent to ``[ \t\n\r\f\v]``.

.. index:: single: \S; in regular expressions

``\S``
   Matches any character which is not a whitespace character. This is
   the opposite of ``\s``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^ \t\n\r\f\v]``.

.. index:: single: \w; in regular expressions

``\w``
   For Unicode (str) patterns:
      Matches Unicode word characters; this includes most characters
      that can be part of a word in any language, as well as numbers and
      the underscore. If the :const:`ASCII` flag is used, only
      ``[a-zA-Z0-9_]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered alphanumeric in the ASCII character set;
      this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
      used, matches characters considered alphanumeric in the current locale
      and the underscore.

.. index:: single: \W; in regular expressions

``\W``
   Matches any character which is not a word character. This is
   the opposite of ``\w``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
   used, matches characters which are neither alphanumeric in the current
   locale nor the underscore.

.. index:: single: \Z; in regular expressions

``\Z``
   Matches only at the end of the string.

.. index::
   single: \a; in regular expressions
   single: \b; in regular expressions
   single: \f; in regular expressions
   single: \n; in regular expressions
   single: \N; in regular expressions
   single: \r; in regular expressions
   single: \t; in regular expressions
   single: \u; in regular expressions
   single: \U; in regular expressions
   single: \v; in regular expressions
   single: \x; in regular expressions
   single: \\; in regular expressions

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \u      \U
   \v      \x      \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode
patterns. In bytes patterns they are errors.

Octal escapes are included in a limited form. If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.
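
The octal-versus-group-reference rule can be sketched as follows. The
patterns are small illustrative cases chosen for this example.

```python
import re

# A leading 0 makes this an octal escape: \07 is the BEL character (U+0007).
bel = re.fullmatch(r'\07', '\x07')

# Three octal digits also form an octal escape: octal 141 is 97, the letter 'a'.
letter_a = re.fullmatch(r'\141', 'a')

# A single digit without a leading 0 is a reference to group 1.
doubled = re.fullmatch(r'(a|b)\1', 'bb')   # group 1 is 'b', then 'b' again
mixed = re.fullmatch(r'(a|b)\1', 'ab')     # group 1 is 'a', but 'b' follows

print(bel is not None, letter_a is not None, doubled is not None, mixed is None)
```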

.. versionchanged:: 3.3
   The ``'\u'`` and ``'\U'`` escape sequences have been added.

.. versionchanged:: 3.6
   Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.



.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications always use the compiled
form.

.. versionchanged:: 3.6
   Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
   :class:`enum.IntFlag`.

.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a :ref:`regular expression object
   <re-objects>`, which can be used for matching using its
   :func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
   below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.compile` and the module-level matching functions are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.


.. data:: A
          ASCII

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   perform ASCII-only matching instead of full Unicode matching. This is only
   meaningful for Unicode patterns, and is ignored for byte patterns.
   Corresponds to the inline flag ``(?a)``.

   Note that for backward compatibility, the :const:`re.U` flag still
   exists (as well as its synonym :const:`re.UNICODE` and its embedded
   counterpart ``(?u)``), but these are redundant in Python 3 since
   matches are Unicode by default for strings (and Unicode matching
   isn't allowed for bytes).
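
A minimal sketch of the effect of :const:`ASCII` on ``\w``, assuming a sample
string with two precomposed accented letters (U+00EF and U+00E9):

```python
import re

text = 'na\u00efve caf\u00e9'  # 'naïve café'

# Default for str patterns: accented letters are word characters.
unicode_words = re.findall(r'\w+', text)

# With re.ASCII, only [a-zA-Z0-9_] are word characters, so the words break
# at the accented letters.
ascii_words = re.findall(r'\w+', text, re.ASCII)

# The inline flag (?a) has the same effect as passing re.ASCII.
inline_words = re.findall(r'(?a)\w+', text)

print(unicode_words, ascii_words, inline_words)
```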


.. data:: DEBUG

   Display debug information about the compiled expression.
   No corresponding inline flag.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will also
   match lowercase letters. Full Unicode matching (such as ``Ü`` matching
   ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
   non-ASCII matches. The current locale does not change the effect of this
   flag unless the :const:`re.LOCALE` flag is also used.
   Corresponds to the inline flag ``(?i)``.

   Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
   combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
   letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
   letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
   'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
   If the :const:`ASCII` flag is used, only letters 'a' to 'z'
   and 'A' to 'Z' are matched.
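
The interaction between :const:`IGNORECASE` and :const:`ASCII` described
above can be sketched like this. The Kelvin sign is used as one of the four
non-ASCII letters listed; the variable names are illustrative.

```python
import re

# Full Unicode case-insensitive matching applies to str patterns by default.
umlaut = re.fullmatch('\u00fc', '\u00dc', re.IGNORECASE)              # 'ü' vs 'Ü'
umlaut_ascii = re.fullmatch('\u00fc', '\u00dc', re.IGNORECASE | re.ASCII)

# The Kelvin sign (U+212A) lowercases to 'k', so [a-z] matches it unless
# matching is restricted to ASCII.
kelvin = '\u212a'
in_range = re.fullmatch('[a-z]', kelvin, re.IGNORECASE)
in_range_ascii = re.fullmatch('[a-z]', kelvin, re.IGNORECASE | re.ASCII)

print(umlaut is not None, umlaut_ascii is None)
print(in_range is not None, in_range_ascii is None)
```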

.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
   dependent on the current locale. This flag can be used only with bytes
   patterns. The use of this flag is discouraged because the locale mechanism
   is very unreliable: it handles only one "culture" at a time, and it works
   only with 8-bit locales. Unicode matching is already enabled by default
   in Python 3 for Unicode (str) patterns, and it is able to handle different
   locales/languages.
   Corresponds to the inline flag ``(?L)``.

   .. versionchanged:: 3.6
      :const:`re.LOCALE` can be used only with bytes patterns and is
      not compatible with :const:`re.ASCII`.

   .. versionchanged:: 3.7
      Compiled regular expression objects with the :const:`re.LOCALE` flag no
      longer depend on the locale at compile time. Only the locale at
      matching time affects the result of matching.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline). By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.
   Corresponds to the inline flag ``(?m)``.


.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.
   Corresponds to the inline flag ``(?s)``.
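
The following sketch contrasts the default behaviour of ``'^'``, ``'$'`` and
``'.'`` with their behaviour under :const:`MULTILINE` and :const:`DOTALL`.
The two-line sample text is an arbitrary choice.

```python
import re

text = 'one\ntwo\n'

# '^' matches only at the start of the string unless re.M is given.
first_default = re.findall(r'^\w+', text)          # ['one']
first_multiline = re.findall(r'^\w+', text, re.M)  # ['one', 'two']

# By default '$' matches at the end of the string and just before a final
# newline; with re.M it also matches before every other newline.
last_default = re.findall(r'\w+$', text)           # ['two']
last_multiline = re.findall(r'\w+$', text, re.M)   # ['one', 'two']

# '.' skips newlines unless re.S is given.
dot_default = re.match(r'one.two', text)           # None
dot_all = re.match(r'one.two', text, re.S)         # matches 'one\ntwo'

print(first_default, first_multiline, last_default, last_multiline)
print(dot_default is None, dot_all is not None)
```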


.. data:: X
          VERBOSE

   .. index:: single: # (hash); in regular expressions

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments. Whitespace within the pattern is ignored, except
   when in a character class, or when preceded by an unescaped backslash,
   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")

   Corresponds to the inline flag ``(?x)``.


.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.


.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if the string does not match the pattern;
   note that this is different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).


.. function:: fullmatch(pattern, string, flags=0)

   If the whole *string* matches the regular expression *pattern*, return a
   corresponding :ref:`match object <match-objects>`. Return ``None`` if the
   string does not match the pattern; note that this is different from a
   zero-length match.

   .. versionadded:: 3.4
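
As an illustration of how :func:`search`, :func:`match` and :func:`fullmatch`
differ, and of the distinction between ``None`` and a zero-length match, here
is a small sketch on an arbitrary two-line string:

```python
import re

text = 'spam\neggs'

found_anywhere = re.search(r'eggs', text)          # match at span (5, 9)
at_start = re.match(r'eggs', text)                 # None: 'eggs' is not at index 0

# MULTILINE does not let match() start at a later line, but search() can.
match_multiline = re.match(r'^eggs', text, re.M)   # None
search_multiline = re.search(r'^eggs', text, re.M) # match at span (5, 9)

# fullmatch() requires the pattern to cover the entire string.
partial = re.fullmatch(r'spam', text)              # None
whole = re.fullmatch(r'spam\neggs', text)          # match

# A zero-length match is still a match object; None means no match at all.
empty = re.search(r'x*', text)                     # match at span (0, 0)
missing = re.search(r'x+', text)                   # None

print(found_anywhere.span(), search_multiline.span(), empty.span())
```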


.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. ::

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string. The same holds for
   the end of the string::

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list.

   Empty matches for the pattern split the string only when not adjacent
   to a previous empty match.

      >>> re.split(r'\b', 'Words, words, words.')
      ['', 'Words', ', ', 'words', ', ', 'words', '.']
      >>> re.split(r'\W*', '...words...')
      ['', '', 'w', 'o', 'r', 'd', 's', '', '']
      >>> re.split(r'(\W*)', '...words...')
      ['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.7
      Added support of splitting on a pattern that could match an empty string.


.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings. The *string* is scanned left-to-right, and matches are returned in
   the order found. If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group. Empty matches are included in the result.

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.
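
How the shape of the :func:`findall` result depends on the number of groups
can be sketched as follows. The ``key=value`` sample text is made up for
illustration.

```python
import re

text = 'width=80, height=24'

# No groups: a list of the whole matches.
whole = re.findall(r'\w+=\d+', text)

# One group: a list of that group's text.
one_group = re.findall(r'(\w+)=\d+', text)

# Two or more groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w+)=(\d+)', text)

# Empty matches are part of the result as empty strings.
with_empties = re.findall(r'\d*', 'a1')

print(whole, one_group, two_groups, with_empties)
```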


.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
   all non-overlapping matches for the RE *pattern* in *string*. The *string*
   is scanned left-to-right, and matches are returned in the order found. Empty
   matches are included in the result.

   .. versionchanged:: 3.7
      Non-empty matches can now start just after a previous empty match.
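
Because :func:`finditer` yields match objects rather than strings, positions
are available along with the matched text. A minimal sketch, reusing the
sample sentence from the :func:`split` examples:

```python
import re

text = 'Words, words, words.'

# Collect each match's text together with its (start, end) span.
spans = [(m.group(), m.span()) for m in re.finditer(r'\w+', text)]

print(spans)
```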


.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged. *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth. Unknown escapes such as ``\&`` are left alone. Backreferences, such
   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
   For example::

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single :ref:`match object <match-objects>`
   argument, and returns the replacement string. For example::

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'

   The pattern may be a string or a :ref:`pattern object <re-objects>`.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
   ``'-a-b--d-'``.

   .. index:: single: \g; in regular expressions

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

   .. versionchanged:: 3.6
      Unknown escapes in *pattern* consisting of ``'\'`` and an ASCII letter
      now are errors.

   .. versionchanged:: 3.7
      Unknown escapes in *repl* consisting of ``'\'`` and an ASCII letter
      now are errors.

      Empty matches for the pattern are replaced when adjacent to a previous
      non-empty match.
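
The ``\g<name>``, ``\g<number>`` and ``\g<0>`` forms described for
:func:`sub` can be sketched like this. The group names ``first`` and ``last``
and the sample strings are invented for the example.

```python
import re

# Named groups in the replacement: swap two words.
swapped = re.sub(r'(?P<first>\w+) (?P<last>\w+)',
                 r'\g<last>, \g<first>',
                 'Ada Lovelace')

# \g<2>0 is group 2 followed by a literal '0'.
padded = re.sub(r'(\d)(\d)', r'\g<2>0', '12')

# \g<0> stands for the entire matched substring.
wrapped = re.sub(r'\d+', r'[\g<0>]', 'a1b22')

# \20 is read as a reference to group 20, which this pattern does not have.
ambiguous_error = None
try:
    re.sub(r'(\d)(\d)', r'\20', '12')
except re.error as exc:
    ambiguous_error = exc

print(swapped, padded, wrapped, ambiguous_error is not None)
```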


.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 3.1
      Added the optional flags argument.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.


.. function:: escape(pattern)

   Escape special characters in *pattern*.
   This is useful if you want to match an arbitrary literal string that may
   have regular expression metacharacters in it. For example::

      >>> print(re.escape('python.exe'))
      python\.exe

      >>> import string
      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
      >>> print('[%s]+' % re.escape(legal_chars))
      [abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+

      >>> operators = ['+', '-', '*', '/', '**']
      >>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
      /|\-|\+|\*\*|\*

   This function must not be used for the replacement string in :func:`sub`
   and :func:`subn`; there, only backslashes should be escaped. For example::

      >>> digits_re = r'\d+'
      >>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
      >>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
      /usr/sbin/sendmail - \d+ errors, \d+ warnings

   .. versionchanged:: 3.3
      The ``'_'`` character is no longer escaped.

   .. versionchanged:: 3.7
      Only characters that can have special meaning in a regular expression
      are escaped.


.. function:: purge()

   Clear the regular expression cache.


.. exception:: error(msg, pattern=None, pos=None)

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern. The error instance has
   the following additional attributes:

   .. attribute:: msg

      The unformatted error message.

   .. attribute:: pattern

      The regular expression pattern.

   .. attribute:: pos

      The index in *pattern* where compilation failed (may be ``None``).

   .. attribute:: lineno

      The line corresponding to *pos* (may be ``None``).

   .. attribute:: colno

      The column corresponding to *pos* (may be ``None``).

   .. versionchanged:: 3.5
      Added additional attributes.
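
A hedged sketch of handling :exc:`error` and reading its attributes follows.
The helper ``describe_failure`` is a hypothetical name introduced for this
example and is not part of the module.

```python
import re

def describe_failure(pattern):
    """Return the error details if *pattern* fails to compile, else None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        # msg is always set; pos, lineno and colno may be None.
        return {'msg': exc.msg, 'pattern': exc.pattern,
                'pos': exc.pos, 'lineno': exc.lineno, 'colno': exc.colno}
    return None

failure = describe_failure('(abc')   # unmatched parenthesis
valid = describe_failure('(abc)')

# A pattern that simply finds nothing does not raise; it returns None.
no_match = re.search('x', 'abc')

print(failure, valid, no_match)
```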

.. _re-objects:

Regular Expression Objects
--------------------------

Compiled regular expression objects support the following methods and
attributes:

.. method:: Pattern.search(string[, pos[, endpos]])

   Scan through *string* looking for the first location where this regular
   expression produces a match, and return a corresponding :ref:`match object
   <match-objects>`. Return ``None`` if no position in the string matches the
   pattern; note that this is different from finding a zero-length match at some
   point in the string.

   The optional second parameter *pos* gives an index in the string where the
   search is to start; it defaults to ``0``. This is not completely equivalent to
   slicing the string; the ``'^'`` pattern character matches at the real beginning
   of the string and at positions just after a newline, but not necessarily at the
   index where the search is to start.

   The optional parameter *endpos* limits how far the string will be searched; it
   will be as if the string is *endpos* characters long, so only the characters
   from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
   expression object, ``rx.search(string, 0, 50)`` is equivalent to
   ``rx.search(string[:50], 0)``. ::

      >>> pattern = re.compile("d")
      >>> pattern.search("dog")     # Match at index 0
      <re.Match object; span=(0, 1), match='d'>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"
Georg Brandl116aa622007-08-15 14:28:22 +00001020
1021
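The difference between passing *pos* to :meth:`Pattern.search` and slicing the
string can be sketched as follows. The variable names are illustrative only:

```python
import re

text = "abc"
starts_with_b = re.compile("^b")

# With pos=1 the search starts at "b", but "^" still refers to the real
# beginning of the string, so nothing matches.
from_pos = starts_with_b.search(text, 1)

# Slicing builds a new string whose real beginning is "b".
from_slice = starts_with_b.search(text[1:])

# endpos behaves as if the string ended there, so "$" can match at index 2.
ends_with_b = re.compile("b$")
end_limited = ends_with_b.search(text, 0, 2)
```

Here ``from_pos`` is ``None``, while ``from_slice`` and ``end_limited`` are
match objects.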
.. method:: Pattern.match(string[, pos[, endpos]])

   If zero or more characters at the *beginning* of *string* match this regular
   expression, return a corresponding :ref:`match object <match-objects>`.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <re.Match object; span=(1, 2), match='o'>

   If you want to locate a match anywhere in *string*, use
   :meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).


.. method:: Pattern.fullmatch(string[, pos[, endpos]])

   If the whole *string* matches this regular expression, return a corresponding
   :ref:`match object <match-objects>`. Return ``None`` if the string does not
   match the pattern; note that this is different from a zero-length match.

   The optional *pos* and *endpos* parameters have the same meaning as for the
   :meth:`~Pattern.search` method. ::

      >>> pattern = re.compile("o[gh]")
      >>> pattern.fullmatch("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.fullmatch("ogre")     # No match as the pattern does not cover the whole string.
      >>> pattern.fullmatch("doggie", 1, 3)   # Matches within given limits.
      <re.Match object; span=(1, 3), match='og'>

   .. versionadded:: 3.4


.. method:: Pattern.split(string, maxsplit=0)

   Identical to the :func:`split` function, using the compiled pattern.


.. method:: Pattern.findall(string[, pos[, endpos]])

   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region as for :meth:`~Pattern.search`.


.. method:: Pattern.finditer(string[, pos[, endpos]])

   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region as for :meth:`~Pattern.search`.


.. method:: Pattern.sub(repl, string, count=0)

   Identical to the :func:`sub` function, using the compiled pattern.


.. method:: Pattern.subn(repl, string, count=0)

   Identical to the :func:`subn` function, using the compiled pattern.


.. attribute:: Pattern.flags

   The regex matching flags. This is a combination of the flags given to
   :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
   flags such as :data:`UNICODE` if the pattern is a Unicode string.


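The way :attr:`Pattern.flags` combines its three sources can be sketched by
testing individual flag bits. The variable names below are illustrative only:

```python
import re

# One flag is passed to compile(), one is given inline, and UNICODE is
# implied because the pattern is a str.
pattern = re.compile(r"(?i)spam", re.MULTILINE)

has_multiline = bool(pattern.flags & re.MULTILINE)    # from compile()
has_ignorecase = bool(pattern.flags & re.IGNORECASE)  # from the inline (?i)
has_unicode = bool(pattern.flags & re.UNICODE)        # implicit for str

# A bytes pattern does not receive the implicit UNICODE flag.
bytes_pattern = re.compile(rb"spam")
bytes_has_unicode = bool(bytes_pattern.flags & re.UNICODE)
```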
.. attribute:: Pattern.groups

   The number of capturing groups in the pattern.


.. attribute:: Pattern.groupindex

   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   numbers. The dictionary is empty if no symbolic groups were used in the
   pattern.


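A short sketch of :attr:`Pattern.groups` and :attr:`Pattern.groupindex` for a
pattern that mixes named and unnamed groups (the date-like pattern is only an
illustration):

```python
import re

# Two named groups followed by one unnamed capturing group.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

number_of_groups = pattern.groups

# groupindex is a read-only mapping; copy it into a dict for convenience.
name_to_number = dict(pattern.groupindex)
```

``number_of_groups`` counts all three capturing groups, while
``name_to_number`` only lists the named ones.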
.. attribute:: Pattern.pattern

   The pattern string from which the pattern object was compiled.


.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Compiled
   regular expression objects are considered atomic.


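What "atomic" means for copying compiled patterns can be sketched as follows,
assuming Python 3.7 or later: both copy functions are expected to hand back
the very same object rather than a duplicate.

```python
import copy
import re

pattern = re.compile(r"\d+")

# Atomic objects are not duplicated; the "copy" is the original object.
shallow = copy.copy(pattern)
deep = copy.deepcopy(pattern)

same_object = shallow is pattern and deep is pattern
```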
.. _match-objects:

Match Objects
-------------

Match objects always have a boolean value of ``True``.
Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
when there is no match, you can test whether there was a match with a simple
``if`` statement::

   match = re.search(pattern, string)
   if match:
       process(match)

Match objects support the following methods and attributes:


.. method:: Match.expand(template)

   Return the string obtained by doing backslash substitution on the template
   string *template*, as done by the :meth:`~Pattern.sub` method.
   Escapes such as ``\n`` are converted to the appropriate characters,
   and numeric backreferences (``\1``, ``\2``) and named backreferences
   (``\g<1>``, ``\g<name>``) are replaced by the contents of the
   corresponding group.

   .. versionchanged:: 3.5
      Unmatched groups are replaced with an empty string.

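A brief sketch of :meth:`Match.expand`, mixing a named and a numeric
backreference in one template (the group names ``first`` and ``last`` are
chosen for this example only):

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# \g<last> refers to the named group, \1 to the first group by number.
swapped = m.expand(r"\g<last>, \1")
```

``swapped`` is ``'Newton, Isaac'``.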
.. method:: Match.group([group1, ...])

   Return one or more subgroups of the match. If there is a single argument, the
   result is a single string; if there are multiple arguments, the result is a
   tuple with one item per argument. Without arguments, *group1* defaults to zero
   (the whole match is returned). If a *groupN* argument is zero, the corresponding
   return value is the entire matching string; if it is in the inclusive range
   [1..99], it is the string matching the corresponding parenthesized group. If a
   group number is negative or larger than the number of groups defined in the
   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
   part of the pattern that did not match, the corresponding result is ``None``.
   If a group is contained in a part of the pattern that matched multiple times,
   the last match is returned. ::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)       # The entire match
      'Isaac Newton'
      >>> m.group(1)       # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)       # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')

   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   arguments may also be strings identifying groups by their group name. If a
   string argument is not used as a group name in the pattern, an :exc:`IndexError`
   exception is raised.

   A moderately complicated example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

   Named groups can also be referred to by their index::

      >>> m.group(1)
      'Malcolm'
      >>> m.group(2)
      'Reynolds'

   If a group matches multiple times, only the last match is accessible::

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)                        # Returns only the last match.
      'c3'


.. method:: Match.__getitem__(g)

   This is identical to ``m.group(g)``. This allows easier access to
   an individual group from a match::

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m[0]       # The entire match
      'Isaac Newton'
      >>> m[1]       # The first parenthesized subgroup.
      'Isaac'
      >>> m[2]       # The second parenthesized subgroup.
      'Newton'

   .. versionadded:: 3.6


.. method:: Match.groups(default=None)

   Return a tuple containing all the subgroups of the match, from 1 up to however
   many groups are in the pattern. The *default* argument is used for groups that
   did not participate in the match; it defaults to ``None``.

   For example::

      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
      >>> m.groups()
      ('24', '1632')

   If we make the decimal place and everything after it optional, not all groups
   might participate in the match. These groups will default to ``None`` unless
   the *default* argument is given::

      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()      # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')   # Now, the second group defaults to '0'.
      ('24', '0')


.. method:: Match.groupdict(default=None)

   Return a dictionary containing all the *named* subgroups of the match, keyed by
   the subgroup name. The *default* argument is used for groups that did not
   participate in the match; it defaults to ``None``. For example::

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}


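The effect of the *default* argument of :meth:`Match.groupdict` can be sketched
with a pattern whose second named group is optional (the group names ``major``
and ``minor`` are illustrative):

```python
import re

# The "minor" group sits inside an optional non-capturing group.
m = re.match(r"(?P<major>\d+)(?:\.(?P<minor>\d+))?", "24")

without_default = m.groupdict()     # "minor" did not participate
with_default = m.groupdict("0")     # "minor" falls back to "0"
```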
.. method:: Match.start([group])
            Match.end([group])

   Return the indices of the start and end of the substring matched by *group*;
   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   *group* exists but did not contribute to the match. For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string. For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

   An example that will remove *remove_this* from email addresses::

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


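The three cases for :meth:`Match.start` and :meth:`Match.end` (a group that
matched text, a group that matched the empty string, and a group that did not
take part in the match) can be sketched together. The pattern extends the
``'b(c?)'`` example with a hypothetical optional second group:

```python
import re

m = re.search(r"b(c?)(d)?", "cba")

whole = (m.start(), m.end())          # the whole match, "b"
empty_group = (m.start(1), m.end(1))  # group 1 matched the empty string
unmatched = (m.start(2), m.end(2))    # group 2 did not participate

# For a group that participated, slicing m.string reproduces m.group(g).
sliced = m.string[m.start(1):m.end(1)]
```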
.. method:: Match.span([group])

   For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
   that if *group* did not contribute to the match, this is ``(-1, -1)``.
   *group* defaults to zero, the entire match.


.. attribute:: Match.pos

   The value of *pos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string at which the RE engine started looking for a match.


.. attribute:: Match.endpos

   The value of *endpos* which was passed to the :meth:`~Pattern.search` or
   :meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
   the index into the string beyond which the RE engine will not go.


.. attribute:: Match.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   string.


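The :attr:`Match.lastindex` values quoted above can be checked with a short
sketch. The helper name ``last_index`` is hypothetical:

```python
import re


def last_index(pattern, string="ab"):
    """Return Match.lastindex for *pattern* matched against *string*."""
    return re.match(pattern, string).lastindex


# A pattern without groups is included to show the None case.
patterns = ["(a)b", "((a)(b))", "((ab))", "(a)(b)", "ab"]
results = {p: last_index(p) for p in patterns}
```

For nested groups the outermost group closes last, which is why
``((a)(b))`` reports ``1`` rather than ``3``.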
.. attribute:: Match.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.


.. attribute:: Match.re

   The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
   :meth:`~Pattern.search` method produced this match instance.


.. attribute:: Match.string

   The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.


.. versionchanged:: 3.7
   Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Match objects
   are considered atomic.


.. _re-examples:

Regular Expression Examples
---------------------------


Checking for a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully:

.. testcode::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following::

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as such::

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~Match.group` method of the match object in the following manner:

.. doctest::

   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'


Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings


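Applied from Python, the :c:func:`scanf`-style expression for the sendmail line
might be used as sketched below. The variable names are illustrative; note that,
unlike :c:func:`scanf`, the captured numbers arrive as strings and need an
explicit conversion:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"

m = re.search(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Every group is a str; convert the numeric fields by hand.
filename = m.group(1)
n_errors = int(m.group(2))
n_warnings = int(m.group(3))
```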
.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <re.Match object; span=(2, 3), match='c'>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <re.Match object; span=(0, 1), match='a'>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line. ::

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <re.Match object; span=(4, 5), match='X'>


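The :const:`MULTILINE` behaviour described above can be sketched side by side.
The ``\A`` anchor is used here as one way to make :func:`search` behave like
:func:`match` regardless of :const:`MULTILINE`; the variable names are
illustrative:

```python
import re

text = "A\nB\nX"

# "^" matches at the start of every line in MULTILINE mode.
line_starts = re.findall(r"^\w", text, re.MULTILINE)

# match() is tied to the start of the string even in MULTILINE mode.
at_start = re.match(r"X", text, re.MULTILINE)

# "\A" only matches at the start of the string, whatever the flags.
anchored = re.search(r"\AX", text, re.MULTILINE)
```

``line_starts`` holds the first character of each of the three lines, while
``at_start`` and ``anchored`` are both ``None``.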
1467Making a Phonebook
1468^^^^^^^^^^^^^^^^^^
1469
Georg Brandl48310cd2009-01-03 21:18:54 +00001470:func:`split` splits a string into a list delimited by the passed pattern. The
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001471method is invaluable for converting textual data into data structures that can be
1472easily read and modified by Python as demonstrated in the following example that
1473creates a phonebook.
1474
Christian Heimes255f53b2007-12-08 15:33:56 +00001475First, here is the input. Normally it may come from a file, here we are using
Serhiy Storchakacd195e22017-10-14 11:14:26 +03001476triple-quoted string syntax::
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001477
Georg Brandl557a3ec2012-03-17 17:26:27 +01001478 >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl48310cd2009-01-03 21:18:54 +00001479 ...
Christian Heimesfe337bf2008-03-23 21:54:12 +00001480 ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
1481 ... Frank Burger: 925.541.7625 662 South Dogwood Way
1482 ...
1483 ...
1484 ... Heather Albrecht: 548.326.4584 919 Park Place"""
Christian Heimes255f53b2007-12-08 15:33:56 +00001485
1486The entries are separated by one or more newlines. Now we convert the string
Christian Heimesfe337bf2008-03-23 21:54:12 +00001487into a list with each nonempty line having its own entry:
1488
1489.. doctest::
1490 :options: +NORMALIZE_WHITESPACE
Christian Heimes255f53b2007-12-08 15:33:56 +00001491
Georg Brandl557a3ec2012-03-17 17:26:27 +01001492 >>> entries = re.split("\n+", text)
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001493 >>> entries
Christian Heimesfe337bf2008-03-23 21:54:12 +00001494 ['Ross McFluff: 834.345.1254 155 Elm Street',
1495 'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
1496 'Frank Burger: 925.541.7625 662 South Dogwood Way',
1497 'Heather Albrecht: 548.326.4584 919 Park Place']
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001498
1499Finally, split each entry into a list with first name, last name, telephone
Christian Heimesc3f30c42008-02-22 16:37:40 +00001500number, and address. We use the ``maxsplit`` parameter of :func:`split`
Christian Heimesfe337bf2008-03-23 21:54:12 +00001501because the address has spaces, our splitting pattern, in it:
1502
1503.. doctest::
1504 :options: +NORMALIZE_WHITESPACE
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001505
Christian Heimes255f53b2007-12-08 15:33:56 +00001506 >>> [re.split(":? ", entry, 3) for entry in entries]
Christian Heimesb9eccbf2007-12-05 20:18:38 +00001507 [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
1508 ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
1509 ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
1510 ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
1511
The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list.  With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

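As a minimal sketch of how ``maxsplit`` bounds the result, the snippet below
splits a single entry and unpacks the fields.  The variable names ``fields``,
``first``, ``last``, ``phone`` and ``address`` are illustrative only; passing
``maxsplit`` by keyword is equivalent to the positional form used in the
examples here.

```python
import re

entry = "Ross McFluff: 834.345.1254 155 Elm Street"

# At most 3 splits are performed, so the result has at most 4 items and
# the unsplit remainder (the address) is kept whole as the last item.
fields = re.split(r":? ", entry, maxsplit=3)
first, last, phone, address = fields

print(fields)
# ['Ross', 'McFluff', '834.345.1254', '155 Elm Street']
```

Because the remainder is never split further, the address keeps its internal
spaces regardless of how many words it contains.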

Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function.  This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   ...
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

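Because the shuffle is random, the munged output differs from run to run.  The
following sketch is one way to make it repeatable, assuming a private
:class:`random.Random` instance with a fixed seed; the names ``rng`` and
``munge_word`` and the seed value are hypothetical and not part of the example
above.

```python
import random
import re

rng = random.Random(0)  # hypothetical fixed seed, only so runs are repeatable

def munge_word(m):
    # Group 1 and group 3 are the first and last characters of the word;
    # only the inner characters (group 2) are shuffled.
    inner = list(m.group(2))
    rng.shuffle(inner)
    return m.group(1) + "".join(inner) + m.group(3)

text = "Professor Abdolmalek, please report your absences promptly."
munged = re.sub(r"(\w)(\w+)(\w)", munge_word, text)
print(munged)
```

The pattern needs at least three word characters, so one- and two-letter words
never reach the replacement function and are left untouched.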

Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does.  For example, if a writer wanted to
find all of the adverbs in some text, they might use :func:`findall` in
the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']

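The pattern ``\w+ly`` is deliberately simple, and it can also match the start
of a longer word that merely contains ``ly``.  As a hedged refinement, not part
of the example above, word boundaries (``\b``) restrict matches to whole words
that end in ``ly``; the sample sentence and the names ``loose`` and ``strict``
are made up for illustration.

```python
import re

sample = "flying quickly"

# Without boundaries, backtracking finds "fly" inside "flying".
loose = re.findall(r"\w+ly", sample)

# With \b on both sides, only whole words ending in "ly" match.
strict = re.findall(r"\b\w+ly\b", sample)

print(loose)   # ['fly', 'quickly']
print(strict)  # ['quickly']
```

Even with boundaries this is only a heuristic: words such as "family" end in
``ly`` without being adverbs.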

Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings.  Continuing with the previous example, if
a writer wanted to find all of the adverbs *and their positions* in
some text, they would use :func:`finditer` in the following manner::

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly

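When the positions are needed as data rather than printed, the match objects
can be collected directly.  This is a small sketch using ``m.span()``, which
returns the ``(start, end)`` pair in one call; the name ``positions`` is
illustrative.

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Each item pairs the (start, end) span of a match with the matched text.
positions = [(m.span(), m.group(0)) for m in re.finditer(r"\w+ly", text)]

print(positions)
# [((7, 16), 'carefully'), ((40, 47), 'quickly')]

# A span slices the original string back to the matched text.
(start, end), word = positions[0]
print(text[start:end] == word)
# True
```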

Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane.  Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it.  For example, the two following lines of code are
functionally identical::

   >>> re.match(r"\W(.)\1\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular
expression.  With raw string notation, this means ``r"\\"``.  Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical::

   >>> re.match(r"\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>
   >>> re.match("\\\\", r"\\")
   <re.Match object; span=(0, 1), match='\\'>

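The difference lies entirely in how Python reads the string literal, before
the regular expression engine ever sees it.  The following sketch compares the
resulting strings directly; the names ``raw_pattern``, ``escaped_pattern`` and
``broken_pattern`` are illustrative.

```python
import re

# A raw literal keeps the backslash; a regular literal interprets the escape.
print(len(r"\n"), len("\n"))
# 2 1

# Both spellings produce the same pattern string, so they behave identically.
raw_pattern = r"\W(.)\1\W"
escaped_pattern = "\\W(.)\\1\\W"
print(raw_pattern == escaped_pattern)
# True

# Forgetting the raw prefix silently changes the pattern: "\1" is an octal
# escape for the character chr(1), not a backreference to group 1.
broken_pattern = "(.)\1"
print(re.match(r"(.)\1", "ff") is not None)
# True
print(re.match(broken_pattern, "ff") is not None)
# False
```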

Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^

A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
analyzes a string to categorize groups of characters.  This is a useful first
step in writing a compiler or interpreter.

The text categories are specified with regular expressions.  The technique is
to combine those into a single master regular expression and to loop over
successive matches::

    import collections
    import re

    Token = collections.namedtuple('Token', ['type', 'value', 'line', 'column'])

    def tokenize(code):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
            ('ASSIGN',   r':='),           # Assignment operator
            ('END',      r';'),            # Statement terminator
            ('ID',       r'[A-Za-z]+'),    # Identifiers
            ('OP',       r'[+\-*/]'),      # Arithmetic operators
            ('NEWLINE',  r'\n'),           # Line endings
            ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
            ('MISMATCH', r'.'),            # Any other character
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        line_num = 1
        line_start = 0
        for mo in re.finditer(tok_regex, code):
            kind = mo.lastgroup
            value = mo.group()
            column = mo.start() - line_start
            if kind == 'NUMBER':
                value = float(value) if '.' in value else int(value)
            elif kind == 'ID' and value in keywords:
                kind = value
            elif kind == 'NEWLINE':
                line_start = mo.end()
                line_num += 1
                continue
            elif kind == 'SKIP':
                continue
            elif kind == 'MISMATCH':
                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
            yield Token(kind, value, line_num, column)

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)

The tokenizer produces the following output::

    Token(type='IF', value='IF', line=2, column=4)
    Token(type='ID', value='quantity', line=2, column=7)
    Token(type='THEN', value='THEN', line=2, column=16)
    Token(type='ID', value='total', line=3, column=8)
    Token(type='ASSIGN', value=':=', line=3, column=14)
    Token(type='ID', value='total', line=3, column=17)
    Token(type='OP', value='+', line=3, column=23)
    Token(type='ID', value='price', line=3, column=25)
    Token(type='OP', value='*', line=3, column=31)
    Token(type='ID', value='quantity', line=3, column=33)
    Token(type='END', value=';', line=3, column=41)
    Token(type='ID', value='tax', line=4, column=8)
    Token(type='ASSIGN', value=':=', line=4, column=12)
    Token(type='ID', value='price', line=4, column=15)
    Token(type='OP', value='*', line=4, column=21)
    Token(type='NUMBER', value=0.05, line=4, column=23)
    Token(type='END', value=';', line=4, column=27)
    Token(type='ENDIF', value='ENDIF', line=5, column=4)
    Token(type='END', value=';', line=5, column=9)

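The tokenizer relies on named groups joined with ``|``: for each match,
``lastgroup`` reports which alternative succeeded.  The sketch below isolates
that mechanism with a hypothetical two-category specification (``spec``,
``pattern`` and ``kinds`` are illustrative names), much smaller than the one
used by ``tokenize()``.

```python
import re

# Hypothetical two-category specification, for illustration only.
spec = [
    ('NUMBER', r'\d+'),        # Runs of digits
    ('WORD',   r'[A-Za-z]+'),  # Runs of letters
]

# Wrap each category in a named group and join the groups as alternatives.
pattern = '|'.join('(?P<%s>%s)' % pair for pair in spec)

# lastgroup is the name of the alternative that produced each match.
kinds = [(mo.lastgroup, mo.group()) for mo in re.finditer(pattern, 'abc 42 de')]

print(kinds)
# [('WORD', 'abc'), ('NUMBER', '42'), ('WORD', 'de')]
```

Alternatives are tried from left to right, which is why a catch-all category
such as ``MISMATCH`` belongs at the end of the specification.  In this smaller
sketch no category matches the spaces, so :func:`finditer` silently skips them;
a catch-all category is what turns such gaps into reportable errors.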

.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
   Media, 2009. The third edition of the book no longer covers Python at all,
   but the first edition covered writing good regular expression patterns in
   great detail.