:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>


This module provides regular expression matching operations similar to
those found in Perl. Both patterns and strings to be searched can be
Unicode strings as well as 8-bit strings.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.

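The difference between the two notations can be checked directly. The following is a minimal sketch; the variable names and the sample path are made up for illustration:

```python
import re

# Without raw strings, a literal backslash needs four backslashes in the
# source: doubled once for Python's string literal, once for the regex.
plain = '\\\\'
raw = r'\\'
assert plain == raw            # both are the two-character string \\

# r"\n" is two characters (backslash, 'n'); "\n" is a single newline.
assert len(r"\n") == 2
assert len("\n") == 1

# The pattern matches one literal backslash in the subject string.
m = re.search(raw, r'C:\temp')
backslash_pos = m.start()
```

Both spellings produce the same pattern string; the raw form is simply easier to read and harder to get wrong.
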
Most regular expression operations are available both as module-level functions
and as :class:`RegexObject` methods. The functions are shortcuts that don't
require you to compile a regex object first, but they lack some fine-tuning
parameters.

.. seealso::

   The third-party `regex <https://pypi.python.org/pypi/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contains low-precedence
operations, there are boundary conditions between *A* and *B*, or either has
numbered group references. Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here. For details of
the theory and implementation of regular expressions, consult the Friedl book
referenced below, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted. Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using the ``\number`` notation, e.g., ``'\x00'``.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.


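The nesting rule can be tried out as follows. This is only a sketch: the names are illustrative, and the trailing ``$`` is added here so that the whole string has to consist of groups of six:

```python
import re

# Stacking two repetition qualifiers directly is rejected at compile time.
try:
    re.compile(r'a{6}*')
    nested_ok = True
except re.error:
    nested_ok = False

# Wrapping the inner repetition in a non-capturing group makes it legal.
multiple_of_six = re.compile(r'(?:a{6})*$')
twelve = multiple_of_six.match('a' * 12)   # two groups of six
seven = multiple_of_six.match('a' * 7)     # not a multiple of six
```
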
The special characters are:

``'.'``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``'^'``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``'$'``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.

``'*'``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

``'+'``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

``'?'``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``<a> b <c>``, it will match the entire
   string, and not just ``<a>``. Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using the RE ``<.*?>`` will match
   only ``<a>``.

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
   followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

``'\'``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be doubled. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
     it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`LOCALE` or :const:`UNICODE` mode is in force.

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, both ``[()[\]{}]`` and
     ``[]()[{}]`` will match a parenthesis.

``'|'``
   ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
   will match either A or B. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once ``A`` matches, ``B`` will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines the meaning
   and further syntax of the construct. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?iLmsux)``
   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
   ``'u'``, ``'x'``.) The group matches the empty string; the letters
   set the corresponding flags: :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
   and :const:`re.X` (verbose), for the entire regular expression. (The
   flags are described in :ref:`contents-of-module-re`.) This
   is useful if you wish to include the flags as part of the regular
   expression, instead of passing a *flags* argument to the
   :func:`re.compile` function.

   Note that the ``(?x)`` flag changes how the expression is parsed. It should be
   used first in the expression string, or after one or more whitespace characters.
   If there are non-whitespace characters before the flag, the results are
   undefined.

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object ``m``    | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the ``repl``    | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Group
   references are not supported even if they match strings of some fixed length.
   Note that patterns which start with positive lookbehind assertions will not
   match at the beginning of the string being searched; you will most likely want
   to use the :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length and shouldn't contain group references.
   Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or *name*
   exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
   can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
   matching pattern, which will match with ``'<user@host.com>'`` as well as
   ``'user@host.com'``, but not with ``'<user@host.com'``.

   .. versionadded:: 2.4

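Several of the constructs described above can be exercised together. The sketch below reuses the patterns from the text; the variable names and the sample strings not quoted in the text (such as ``'say "hi" now'``) are assumptions made for illustration:

```python
import re

# Greedy vs. non-greedy: same subject, two different extents.
subject = '<a> b <c>'
greedy = re.match(r'<.*>', subject).group(0)
minimal = re.match(r'<.*?>', subject).group(0)

# {m,n} vs. {m,n}? on a six-character run of 'a'.
five = re.match(r'a{3,5}', 'aaaaaa').group(0)
three = re.match(r'a{3,5}?', 'aaaaaa').group(0)

# Named group with a backreference: the closing quote must equal the opener.
quoted = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')
m = quoted.search('say "hi" now')
quote_char = m.group('quote')
whole = m.group(0)

# Lookahead: 'Isaac ' matches only when 'Asimov' follows; 'Asimov' itself
# is not consumed.
ahead = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov').group(0)
not_ahead = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')

# Conditional: '>' is required only if the optional '<' (group 1) matched.
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')
bracketed = email.match('<user@host.com>')
bare = email.match('user@host.com')
unbalanced = email.match('<user@host.com')
```

Note that the conditional example relies on :func:`match`; with :func:`search` the unbalanced input would still produce a match starting after the ``'<'``.
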
The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not on the list, then the resulting RE will match
the second character. For example, ``\$`` matches the character ``'$'``.

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word. A word is
   defined as a sequence of alphanumeric or underscore characters, so the end of a
   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
   Note that formally, ``\b`` is defined as the boundary between a ``\w`` and
   a ``\W`` character (or vice versa), or between ``\w`` and the beginning/end
   of the string, so the precise set of characters deemed to be alphanumeric
   depends on the values of the ``UNICODE`` and ``LOCALE`` flags.
   For example, ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end of a
   word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, ``'py2'``,
   but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so is also subject to the settings
   of ``LOCALE`` and ``UNICODE``.

``\d``
   When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
   is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
   whatever is classified as a decimal digit in the Unicode character properties
   database.

``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
   will match anything other than characters marked as digits in the Unicode
   character properties database.

``\s``
   When the :const:`UNICODE` flag is not specified, it matches any whitespace
   character; this is equivalent to the set ``[ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on matching of the space.
   If :const:`UNICODE` is set, this will match the characters ``[ \t\n\r\f\v]``
   plus whatever is classified as space in the Unicode character properties
   database.

``\S``
   When the :const:`UNICODE` flag is not specified, matches any non-whitespace
   character; this is equivalent to the set ``[^ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on non-whitespace match. If
   :const:`UNICODE` is set, then any character not marked as space in the
   Unicode character properties database is matched.


``\w``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any alphanumeric character and the underscore; this is equivalent to the set
   ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
   whatever characters are defined as alphanumeric for the current locale. If
   :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
   is classified as alphanumeric in the Unicode character properties database.

``\W``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
   With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
   not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
   this will match any character other than ``[0-9_]`` and the characters
   classified as alphanumeric in the Unicode character properties database.

``\Z``
   Matches only at the end of the string.

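The examples given for ``\number``, ``\b`` and ``\B`` can be verified with a short script. This is a sketch using ASCII-only input, so the ``LOCALE`` and ``UNICODE`` flags play no role; the variable names are illustrative:

```python
import re

# \number: the text captured by group 1 must repeat after a single space.
doubled = re.compile(r'(.+) \1')
repeat_hits = [s for s in ['the the', '55 55', 'thethe']
               if doubled.match(s)]

# \b: 'foo' must stand alone as a word.
word = re.compile(r'\bfoo\b')
word_hits = [s for s in ['foo', 'foo.', '(foo)', 'bar foo baz',
                         'foobar', 'foo3']
             if word.search(s)]

# \B: 'py' must be followed directly by another word character.
inner = re.compile(r'py\B')
inner_hits = [s for s in ['python', 'py3', 'py2', 'py', 'py.', 'py!']
              if inner.search(s)]
```
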
If both :const:`LOCALE` and :const:`UNICODE` flags are included for a
particular sequence, then the :const:`LOCALE` flag takes effect first, followed
by :const:`UNICODE`.

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \v      \x
   \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

Octal escapes are included in a limited form: If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

.. seealso::

   Mastering Regular Expressions
      Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
      second edition of the book no longer covers Python at all, but the first
      edition covered writing good regular expression patterns in great detail.



.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions. Most non-trivial applications use the compiled form.


.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a regular expression object, which
   can be used for matching using its :func:`~RegexObject.match` and
   :func:`~RegexObject.search` methods, described below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.


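A typical compile-once, match-many loop might look like the following sketch. The log-line pattern and the sample lines are hypothetical and serve only to show the reuse of one compiled object:

```python
import re

# Compile once, outside the loop (hypothetical "LEVEL: text" format).
level_re = re.compile(r'(?P<level>[A-Z]+): (?P<text>.*)')

lines = ['INFO: started', 'not a log line', 'ERROR: disk full']
parsed = []
for line in lines:
    m = level_re.match(line)
    if m is not None:               # match() returns None on failure
        parsed.append((m.group('level'), m.group('text')))

# The module-level shortcut gives the same result for a single call.
same = re.match(r'(?P<level>[A-Z]+): (?P<text>.*)', lines[0]).groups()
```
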
.. data:: DEBUG

   Display debug information about the compiled expression.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will match
   lowercase letters, too. This is not affected by the current locale. To
   get this effect on non-ASCII Unicode characters such as ``ü`` and ``Ü``,
   add the :const:`UNICODE` flag.


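As a brief sketch of :const:`IGNORECASE` on ASCII input, and of combining flags with ``|`` (the sample strings are illustrative):

```python
import re

# [A-Z] also matches lowercase ASCII letters under IGNORECASE.
strict = re.findall(r'[A-Z]+', 'Regular Expression')
relaxed = re.findall(r'[A-Z]+', 'Regular Expression', re.IGNORECASE)

# Flags are combined with bitwise OR; re.I and re.M are the short aliases.
both = re.compile(r'^expr', re.I | re.M)
found = both.search('Regular\nExpression')
```
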
.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
   current locale.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline). By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.


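The effect of :const:`MULTILINE` on ``'^'`` and ``'$'`` can be seen with the two-line string used earlier in this document; the variable names below are illustrative:

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: '$' matches only at the end (or before a final newline).
default_hit = re.search(r'foo.$', text).group(0)
# MULTILINE: '$' also matches before each newline, so the first line wins.
multiline_hit = re.search(r'foo.$', text, re.M).group(0)

# '^' anchors at every line start only in MULTILINE mode.
starts_default = re.findall(r'^foo', text)
starts_multiline = re.findall(r'^foo', text, re.M)
```
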
.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.


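A minimal sketch of the difference, using a made-up two-line string:

```python
import re

text = 'first\nsecond'

# Without DOTALL, '.' stops at the newline.
one_line = re.match(r'.*', text).group(0)
# With DOTALL (alias re.S), '.' crosses it.
everything = re.match(r'.*', text, re.DOTALL).group(0)
```
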
.. data:: U
          UNICODE

   Make the ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
   sequences dependent on the Unicode character properties database. Also
   enables non-ASCII matching for :const:`IGNORECASE`.

   .. versionadded:: 2.0


.. data:: X
          VERBOSE

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments. Whitespace within the pattern is ignored, except
   when in a character class, or when preceded by an unescaped backslash,
   or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")


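One way to convince yourself that the two objects behave alike is to run both over the same inputs. The sample strings below are assumptions chosen for illustration:

```python
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

samples = ['3.14', '10.', '.5', 'abc']
results_a = [bool(a.match(s)) for s in samples]
results_b = [bool(b.match(s)) for s in samples]
```

Agreement on a handful of samples is of course not a proof of equivalence, only a quick sanity check.
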
.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :class:`MatchObject`
   instance. Return ``None`` if no position in the string matches the pattern; note
   that this is different from finding a zero-length match at some point in the
   string.


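The distinction between "no match" and "zero-length match" is easy to miss, so here is a small sketch with illustrative inputs:

```python
import re

# No position matches: search() returns None.
missing = re.search(r'\d+', 'no digits here')

# A zero-length match is still a match object, just an empty one.
empty = re.search(r'\d*', 'no digits here')
empty_span = empty.span()

# search() reports the first (leftmost) location.
first = re.search(r'\d+', 'order 66, aisle 7')
```
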
.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :class:`MatchObject` instance.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).


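The following sketch contrasts :func:`match` with :func:`search`, including the :const:`MULTILINE` case; the sample text is made up:

```python
import re

text = 'first line\nsecond line'

at_start = re.match(r'first', text)        # matches at position 0
not_at_start = re.match(r'second', text)   # None: not at the beginning
found = re.search(r'second', text)         # search() scans the whole string

# Even with MULTILINE, match() is anchored to the start of the string,
still_none = re.match(r'^second', text, re.M)
# whereas search() with '^' can match at the start of any line.
line_start = re.search(r'^second', text, re.M)
```
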
.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored. This has been fixed in later releases.)

   >>> re.split(r'\W+', 'Words, words, words.')
   ['Words', 'words', 'words', '']
   >>> re.split(r'(\W+)', 'Words, words, words.')
   ['Words', ', ', 'words', ', ', 'words', '.', '']
   >>> re.split(r'\W+', 'Words, words, words.', 1)
   ['Words', 'words, words.']
   >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
   ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string. The same holds for
   the end of the string:

   >>> re.split(r'(\W+)', '...words, words...')
   ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list (e.g., if there's one capturing group
   in the separator, the 0th, the 2nd and so forth).

   Note that *split* will never split a string on an empty pattern match.
   For example:

   >>> re.split('x*', 'foo')
   ['foo']
   >>> re.split("(?m)^$", "foo\n\nbar\n")
   ['foo\n\nbar\n']

Ezio Melotti1e5d3182010-11-26 09:30:44 +0000610 .. versionchanged:: 2.7
Gregory P. Smithae91d092009-03-02 05:13:57 +0000611 Added the optional flags argument.
612
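Because a single capturing group places the separators at fixed relative
positions in the result of :func:`split`, the fields and the separators can
be pulled apart by slicing. The following is a small sketch of that idea;
the variable names are illustrative.

```python
import re

parts = re.split(r"(\W+)", "...words, words...")

# With exactly one capturing group, text fields and separators alternate.
fields = parts[0::2]       # text between separators (may be empty strings)
separators = parts[1::2]   # what the capturing group matched

# maxsplit limits the number of splits; the rest stays in the last element.
limited = re.split(r"\W+", "Words, words, words.", maxsplit=1)
```
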


.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings. The *string* is scanned left-to-right, and matches are returned in
   the order found. If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group. Empty matches are included in the result.

   .. note::

      Due to a limitation of the current implementation, the character
      following an empty match is not included in the next match, so
      ``findall(r'^|\w+', 'two words')`` returns ``['', 'wo', 'words']``
      (note the missing "t"). This is changed in Python 3.7.

   .. versionadded:: 1.5.2

   .. versionchanged:: 2.4
      Added the optional flags argument.

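How the shape of the :func:`findall` result depends on the number of groups
can be sketched as follows. The sample text is an assumption made for the
example.

```python
import re

text = "width=80, height=24"

# No groups: a list of the full matches.
no_groups = re.findall(r"\d+", text)

# One group: a list of that group's text.
one_group = re.findall(r"\w+=(\d+)", text)

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+)=(\d+)", text)
```
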

.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :class:`MatchObject` instances over all
   non-overlapping matches for the RE *pattern* in *string*. The *string* is
   scanned left-to-right, and matches are returned in the order found. Empty
   matches are included in the result. See also the note about :func:`findall`.

   .. versionadded:: 2.2

   .. versionchanged:: 2.4
      Added the optional flags argument.

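Unlike :func:`findall`, :func:`finditer` gives access to each match object,
so positions are available as well as text. A brief sketch, with an
illustrative input string:

```python
import re

text = "a1 b22 c333"

# Each item yielded by finditer() is a match object.
hits = [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\d+", text)]
```
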

.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged. *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
   For example:

   >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
   ...        r'static PyObject*\npy_\1(void)\n{',
   ...        'def myfunc():')
   'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single match object argument, and returns the
   replacement string. For example:

   >>> def dashrepl(matchobj):
   ...     if matchobj.group(0) == '-': return ' '
   ...     else: return '-'
   >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
   'pro--gram files'
   >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
   'Baked Beans & Spam'

   The pattern may be a string or an RE object.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
   ``'-a-b-c-'``.

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 2.7
      Added the optional flags argument.

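The ``\g<name>`` and ``\g<number>`` forms and the *count* argument of
:func:`sub` can be sketched as below. The group names ``y``, ``m`` and ``d``,
the helper ``double`` and the sample strings are assumptions made for this
illustration.

```python
import re

# Named groups are referenced in the replacement through \g<name>.
swapped = re.sub(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
                 r"\g<d>/\g<m>/\g<y>",
                 "2015-11-23")

# \g<1>0 means group 1 followed by a literal '0'; \10 would mean group 10.
times_ten = re.sub(r"(\d+)", r"\g<1>0", "7 apples")

# A function receives each match object and returns the replacement string.
def double(matchobj):
    return str(int(matchobj.group(0)) * 2)

doubled = re.sub(r"\d+", double, "3 cats, 10 dogs")

# count limits how many occurrences are replaced.
first_only = re.sub(r"a", "A", "banana", count=1)
```
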

.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 2.7
      Added the optional flags argument.

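A short sketch of the tuple returned by :func:`subn`; the input string is
illustrative:

```python
import re

# Collapse runs of whitespace and learn how many runs were replaced.
new_text, replacements = re.subn(r"\s+", " ", "too   many    spaces")
```
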

.. function:: escape(pattern)

   Escape all the characters in *pattern* except ASCII letters and numbers.
   This is useful if you want to match an arbitrary literal string that may
   have regular expression metacharacters in it. For example::

      >>> print re.escape('python.exe')
      python\.exe

      >>> import string
      >>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
      >>> print '[%s]+' % re.escape(legal_chars)
      [abcdefghijklmnopqrstuvwxyz0123456789\!\#\$\%\&\'\*\+\-\.\^\_\`\|\~\:]+

      >>> operators = ['+', '-', '*', '/', '**']
      >>> print '|'.join(map(re.escape, sorted(operators, reverse=True)))
      \/|\-|\+|\*\*|\*

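The practical effect of :func:`escape` is easiest to see by searching for a
literal string that happens to contain metacharacters. In this sketch the
strings ``needle`` and ``haystack`` are made up for the example; the exact
escaped form is not shown because it can differ between Python versions.

```python
import re

needle = "1+1=2 (really?)"
haystack = "proof that 1+1=2 (really?) holds"

# Escaped: every metacharacter in needle is matched literally.
literal = re.compile(re.escape(needle)).search(haystack)

# Unescaped: '+', '(', ')' and '?' keep their special meaning,
# so the pattern no longer describes the literal text.
unescaped = re.search(needle, haystack)
```
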

.. function:: purge()

   Clear the regular expression cache.


.. exception:: error

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern.

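A minimal sketch of the difference between an invalid pattern, which raises
:exc:`error`, and a valid pattern that simply does not match. The pattern
strings are illustrative.

```python
import re

# An unmatched parenthesis makes the pattern invalid.
try:
    re.compile(r"(unclosed")
    compile_failed = False
except re.error:
    compile_failed = True

# A valid pattern with no match is not an error; the result is just None.
result = re.search(r"x", "abc")
```
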

.. _re-objects:

Regular Expression Objects
--------------------------

.. class:: RegexObject

   The :class:`RegexObject` class supports the following methods and attributes:

   .. method:: RegexObject.search(string[, pos[, endpos]])

      Scan through *string* looking for a location where this regular expression
      produces a match, and return a corresponding :class:`MatchObject` instance.
      Return ``None`` if no position in the string matches the pattern; note that this
      is different from finding a zero-length match at some point in the string.

      The optional second parameter *pos* gives an index in the string where the
      search is to start; it defaults to ``0``. This is not completely equivalent to
      slicing the string; the ``'^'`` pattern character matches at the real beginning
      of the string and at positions just after a newline, but not necessarily at the
      index where the search is to start.

      The optional parameter *endpos* limits how far the string will be searched; it
      will be as if the string is *endpos* characters long, so only the characters
      from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
      than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
      expression object, ``rx.search(string, 0, 50)`` is equivalent to
      ``rx.search(string[:50], 0)``.

      >>> pattern = re.compile("d")
      >>> pattern.search("dog")     # Match at index 0
      <_sre.SRE_Match object at ...>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"


   .. method:: RegexObject.match(string[, pos[, endpos]])

      If zero or more characters at the *beginning* of *string* match this regular
      expression, return a corresponding :class:`MatchObject` instance. Return
      ``None`` if the string does not match the pattern; note that this is different
      from a zero-length match.

      The optional *pos* and *endpos* parameters have the same meaning as for the
      :meth:`~RegexObject.search` method.

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <_sre.SRE_Match object at ...>

      If you want to locate a match anywhere in *string*, use
      :meth:`~RegexObject.search` instead (see also :ref:`search-vs-match`).


   .. method:: RegexObject.split(string, maxsplit=0)

      Identical to the :func:`split` function, using the compiled pattern.


   .. method:: RegexObject.findall(string[, pos[, endpos]])

      Similar to the :func:`findall` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region as for :meth:`~RegexObject.match`.


   .. method:: RegexObject.finditer(string[, pos[, endpos]])

      Similar to the :func:`finditer` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region as for :meth:`~RegexObject.match`.


   .. method:: RegexObject.sub(repl, string, count=0)

      Identical to the :func:`sub` function, using the compiled pattern.


   .. method:: RegexObject.subn(repl, string, count=0)

      Identical to the :func:`subn` function, using the compiled pattern.


   .. attribute:: RegexObject.flags

      The regex matching flags. This is a combination of the flags given to
      :func:`.compile` and any ``(?...)`` inline flags in the pattern.


   .. attribute:: RegexObject.groups

      The number of capturing groups in the pattern.


   .. attribute:: RegexObject.groupindex

      A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
      numbers. The dictionary is empty if no symbolic groups were used in the
      pattern.


   .. attribute:: RegexObject.pattern

      The pattern string from which the RE object was compiled.

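The *pos* and *endpos* parameters and the ``groups`` and ``groupindex``
attributes described above can be sketched as follows. All pattern and
variable names here are illustrative assumptions.

```python
import re

text = "abc 123 def 456"
digits = re.compile(r"\d+")

whole = digits.search(text).group()
# endpos=6 treats the string as if it were only 6 characters long.
clipped = digits.search(text, 0, 6).group()
sliced = digits.search(text[:6]).group()
# pos=7 starts the scan after the first number.
later = digits.findall(text, 7)

# pos is not the same as slicing: '^' still refers to the real string start.
caret = re.compile(r"^b")
with_pos = caret.search("abc", 1)
with_slice = caret.search("abc"[1:])

names = re.compile(r"(?P<first>\w+) (?P<last>\w+)")
group_count = names.groups
last_index = names.groupindex["last"]
```
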

.. _match-objects:

Match Objects
-------------

.. class:: MatchObject

   Match objects always have a boolean value of ``True``.
   Since :meth:`~RegexObject.match` and :meth:`~RegexObject.search` return ``None``
   when there is no match, you can test whether there was a match with a simple
   ``if`` statement::

      match = re.search(pattern, string)
      if match:
          process(match)

   Match objects support the following methods and attributes:


   .. method:: MatchObject.expand(template)

      Return the string obtained by doing backslash substitution on the template
      string *template*, as done by the :meth:`~RegexObject.sub` method. Escapes
      such as ``\n`` are converted to the appropriate characters, and numeric
      backreferences (``\1``, ``\2``) and named backreferences (``\g<1>``,
      ``\g<name>``) are replaced by the contents of the corresponding group.


   .. method:: MatchObject.group([group1, ...])

      Return one or more subgroups of the match. If there is a single argument, the
      result is a single string; if there are multiple arguments, the result is a
      tuple with one item per argument. Without arguments, *group1* defaults to zero
      (the whole match is returned). If a *groupN* argument is zero, the corresponding
      return value is the entire matching string; if it is in the inclusive range
      [1..99], it is the string matching the corresponding parenthesized group. If a
      group number is negative or larger than the number of groups defined in the
      pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
      part of the pattern that did not match, the corresponding result is ``None``.
      If a group is contained in a part of the pattern that matched multiple times,
      the last match is returned.

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)       # The entire match
      'Isaac Newton'
      >>> m.group(1)       # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)       # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')

      If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
      arguments may also be strings identifying groups by their group name. If a
      string argument is not used as a group name in the pattern, an :exc:`IndexError`
      exception is raised.

      A moderately complicated example:

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

      Named groups can also be referred to by their index:

      >>> m.group(1)
      'Malcolm'
      >>> m.group(2)
      'Reynolds'

      If a group matches multiple times, only the last match is accessible:

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)                        # Returns only the last match.
      'c3'


   .. method:: MatchObject.groups([default])

      Return a tuple containing all the subgroups of the match, from 1 up to however
      many groups are in the pattern. The *default* argument is used for groups that
      did not participate in the match; it defaults to ``None``. (Incompatibility
      note: in the original Python 1.5 release, if the tuple was one element long, a
      string would be returned instead. In later versions (from 1.5.1 on), a
      singleton tuple is returned in such cases.)

      For example:

      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
      >>> m.groups()
      ('24', '1632')

      If we make the decimal place and everything after it optional, not all groups
      might participate in the match. These groups will default to ``None`` unless
      the *default* argument is given:

      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()      # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')   # Now, the second group defaults to '0'.
      ('24', '0')


   .. method:: MatchObject.groupdict([default])

      Return a dictionary containing all the *named* subgroups of the match, keyed by
      the subgroup name. The *default* argument is used for groups that did not
      participate in the match; it defaults to ``None``. For example:

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}


   .. method:: MatchObject.start([group])
               MatchObject.end([group])

      Return the indices of the start and end of the substring matched by *group*;
      *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
      *group* exists but did not contribute to the match. For a match object *m*, and
      a group *g* that did contribute to the match, the substring matched by group *g*
      (equivalent to ``m.group(g)``) is ::

         m.string[m.start(g):m.end(g)]

      Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
      null string. For example, after ``m = re.search('b(c?)', 'cba')``,
      ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
      2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

      An example that will remove *remove_this* from email addresses:

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


   .. method:: MatchObject.span([group])

      For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
      m.end(group))``. Note that if *group* did not contribute to the match, this is
      ``(-1, -1)``. *group* defaults to zero, the entire match.


   .. attribute:: MatchObject.pos

      The value of *pos* which was passed to the :meth:`~RegexObject.search` or
      :meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
      index into the string at which the RE engine started looking for a match.


   .. attribute:: MatchObject.endpos

      The value of *endpos* which was passed to the :meth:`~RegexObject.search` or
      :meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
      index into the string beyond which the RE engine will not go.


   .. attribute:: MatchObject.lastindex

      The integer index of the last matched capturing group, or ``None`` if no group
      was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
      ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
      the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
      string.


   .. attribute:: MatchObject.lastgroup

      The name of the last matched capturing group, or ``None`` if the group didn't
      have a name, or if no group was matched at all.


   .. attribute:: MatchObject.re

      The regular expression object whose :meth:`~RegexObject.match` or
      :meth:`~RegexObject.search` method produced this :class:`MatchObject`
      instance.


   .. attribute:: MatchObject.string

      The string passed to :meth:`~RegexObject.match` or
      :meth:`~RegexObject.search`.

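How a group that did not participate in the match shows up across the match
object methods above can be sketched with one optional group. The group names
``intpart`` and ``frac`` and the template string are assumptions made for the
example.

```python
import re

# The fractional part is optional, so 'frac' does not participate for "24".
m = re.match(r"(?P<intpart>\d+)(?:\.(?P<frac>\d+))?", "24")

groups_plain = m.groups()
groups_filled = m.groups("0")
as_dict = m.groupdict("0")

# A group that did not participate reports -1 positions.
frac_start = m.start("frac")
frac_span = m.span("frac")
whole_span = m.span()

# Only group 1 matched, so it is the last matched capturing group.
last_index = m.lastindex
last_name = m.lastgroup

# expand() fills a template the same way sub() would.
expanded = m.expand(r"\g<intpart> units")
```
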

Examples
--------


Checking For a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully:

.. testcode::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following:

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as such:

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~MatchObject.group` method of :class:`MatchObject` in the following
manner:

.. doctest::

   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'

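One way to avoid the :exc:`AttributeError` shown above is to test the result
of ``match()`` before calling ``group()``. The helper ``pair_card`` below is
a hypothetical name introduced for this sketch; it is not part of the module.

```python
import re

pair = re.compile(r".*(.).*\1")

def pair_card(hand):
    """Return the card that forms a pair in hand, or None if there is none."""
    m = pair.match(hand)
    if m is None:
        return None
    return m.group(1)

sevens = pair_card("717ak")
nothing = pair_card("718ak")
aces = pair_card("354aa")
```
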
1095
1096Simulating scanf()
1097^^^^^^^^^^^^^^^^^^
Georg Brandl8ec7f652007-08-15 14:28:01 +00001098
1099.. index:: single: scanf()
1100
Sandro Tosi98ed08f2012-01-14 16:42:02 +01001101Python does not currently have an equivalent to :c:func:`scanf`. Regular
Georg Brandl8ec7f652007-08-15 14:28:01 +00001102expressions are generally more powerful, though also more verbose, than
Sandro Tosi98ed08f2012-01-14 16:42:02 +01001103:c:func:`scanf` format strings. The table below offers some more-or-less
1104equivalent mappings between :c:func:`scanf` format tokens and regular
Georg Brandl8ec7f652007-08-15 14:28:01 +00001105expressions.
1106
+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings

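As a minimal sketch, that expression might be applied from Python as follows.
The names ``line``, ``status_re``, ``n_errors`` and ``n_warnings`` are
illustrative only, and the explicit :func:`int` calls stand in for the
conversion that ``%d`` performs in C::

   import re

   line = "/usr/sbin/sendmail - 0 errors, 4 warnings"

   # The same pattern as above, written as a raw string.
   status_re = re.compile(r"(\S+) - (\d+) errors, (\d+) warnings")

   m = status_re.match(line)
   if m is not None:
       filename = m.group(1)            # matched by (\S+)
       n_errors = int(m.group(2))       # first (\d+), converted by hand
       n_warnings = int(m.group(3))     # second (\d+), converted by hand

Unlike :c:func:`scanf`, the groups are always returned as strings, so any
numeric conversion has to be done explicitly.
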

.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")   # No match
   >>> re.search("c", "abcdef")  # Match
   <_sre.SRE_Match object at ...>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <_sre.SRE_Match object at ...>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line::

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)    # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <_sre.SRE_Match object at ...>

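The difference can be made visible by looking at where the match starts. The
following is only a sketch using the same three-line string; the variable
names are illustrative::

   import re

   text = "A\nB\nX"

   # match() is anchored at index 0, with or without MULTILINE.
   at_start = re.match("X", text, re.MULTILINE)

   # Without MULTILINE, '^' matches only at the start of the string.
   plain = re.search("^X", text)

   # With MULTILINE, '^' also matches immediately after each newline.
   per_line = re.search("^X", text, re.MULTILINE)

   # Index of the 'X' that begins the third line.
   offset = per_line.start()

Here ``at_start`` and ``plain`` are both ``None``, while ``per_line`` is a
match object whose start offset points at the last line.
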

Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern. The
function is invaluable for converting textual data into data structures that
can be easily read and modified by Python, as demonstrated in the following
example that creates a phonebook.

First, here is the input. Normally it would come from a file; here we are
using triple-quoted string syntax:

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

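Once each entry has been split, the fields can be stored in whatever structure
suits the application. As one possible sketch, the following builds a
dictionary keyed by ``(last name, first name)``; the ``phonebook`` name and the
choice of key are assumptions made for illustration, and only two of the
entries are repeated here to keep the example short::

   import re

   # Two of the entries produced by the first re.split() call.
   entries = ['Ross McFluff: 834.345.1254 155 Elm Street',
              'Frank Burger: 925.541.7625 662 South Dogwood Way']

   phonebook = {}
   for entry in entries:
       # maxsplit=3 keeps the address, which contains spaces, in one piece.
       first, last, phone, address = re.split(":? ", entry, maxsplit=3)
       phonebook[(last, first)] = (phone, address)

A lookup such as ``phonebook[('McFluff', 'Ross')]`` then returns the telephone
number and address as a tuple.
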

Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function. This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

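Because the shuffle is random, the output differs on every call. If repeatable
output is wanted, for instance in a test, one option is to shuffle with a
private, seeded :class:`random.Random` instance. The sketch below assumes a
seed of ``0`` and the name ``rng``; neither is part of the example above::

   import random
   import re

   # A dedicated generator with a fixed seed (an assumption, chosen only
   # so that repeated runs on one interpreter give the same result).
   rng = random.Random(0)

   def repl(m):
       inner_word = list(m.group(2))
       rng.shuffle(inner_word)          # shuffle only the inner letters
       return m.group(1) + "".join(inner_word) + m.group(3)

   text = "Professor Abdolmalek, please report your absences promptly."
   munged = re.sub(r"(\w)(\w+)(\w)", repl, text)

Whatever order the inner letters end up in, the first and last character of
each word, the word lengths, and the punctuation are left unchanged.
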

Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does. For example, a writer who wanted to find all of
the adverbs in some text might use :func:`findall` in the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']


Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides instances of
:class:`MatchObject` instead of strings. Continuing with the previous example,
a writer who wanted to find all of the adverbs *and their positions* in some
text would use :func:`finditer` in the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
   07-16: carefully
   40-47: quickly

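When the positions are needed for further processing rather than for display,
the same loop can collect them instead of printing. This is a small sketch;
the ``adverbs`` name and the tuple layout are illustrative choices::

   import re

   text = "He was carefully disguised but captured quickly by police."

   # One (start, end, matched text) tuple per match, in order of appearance.
   adverbs = [(m.start(), m.end(), m.group(0))
              for m in re.finditer(r"\w+ly", text)]

Each ``(start, end)`` pair can be used directly as a slice, so
``text[start:end]`` gives back the matched word.
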

Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it. For example, the two following lines of code are
functionally identical:

   >>> re.match(r"\W(.)\1\W", " ff ")
   <_sre.SRE_Match object at ...>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <_sre.SRE_Match object at ...>

When one wants to match a literal backslash, it must be escaped in the regular
expression. With raw string notation, this means ``r"\\"``. Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical:

   >>> re.match(r"\\", r"\\")
   <_sre.SRE_Match object at ...>
   >>> re.match("\\\\", r"\\")
   <_sre.SRE_Match object at ...>
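
The two spellings are identical because they produce the same string object
contents before :mod:`re` ever sees them. A short sketch that checks this
directly (the variable names are illustrative)::

   import re

   raw = r"\\"         # two characters: backslash, backslash
   cooked = "\\\\"     # the same two characters, written with escapes

   same = (raw == cooked)
   length = len(raw)

   # Used as a pattern, the two backslashes match one literal backslash.
   m = re.match(raw, "\\")
   matched = m.group(0)

Here ``same`` is true, ``length`` is ``2``, and ``matched`` is a string holding
a single backslash.
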