:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>


This module provides regular expression matching operations similar to
those found in Perl. Both patterns and strings to be searched can be
Unicode strings as well as 8-bit strings.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.
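
As a small illustration of the point above, the following sketch compares raw and regular literals and matches a single literal backslash both ways. The variable names and the sample string are illustrative only.

```python
import re

# A raw literal keeps the backslash; a regular literal turns it into a newline.
raw_len = len(r"\n")       # backslash followed by 'n'
plain_len = len("\n")      # one newline character

# The regex for one literal backslash is two backslashes. Written as a raw
# literal that is r"\\"; written as a regular literal it takes four.
text = "C:\\temp"          # the 7-character string C:\temp
m_raw = re.search(r"\\", text)
m_plain = re.search("\\\\", text)

print(raw_len)             # 2
print(plain_len)           # 1
print(m_raw.start())       # 2
print(m_plain.start())     # 2
```

Both searches find the same backslash; the raw form is simply easier to read.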

Most regular expression operations are available both as module-level
functions and as :class:`RegexObject` methods. The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contain low-precedence
operations, boundary conditions between *A* and *B*, or numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book referenced
below, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted. Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using the ``\number`` notation, e.g., ``'\x00'``.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
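
A minimal sketch of the nesting rule: the directly nested form is rejected at compile time, while the parenthesized form works. The trailing ``$`` is added here only so that the whole string has to be consumed; it is not part of the example in the text.

```python
import re

# Directly nesting a repetition on a repetition is rejected by the parser.
try:
    re.compile(r"a{6}*")
    nested_ok = True
except re.error:
    nested_ok = False

# Wrapping the inner repetition in a non-capturing group makes it legal.
six_as = re.compile(r"(?:a{6})*$")
twelve = six_as.match("a" * 12) is not None
seven = six_as.match("a" * 7) is not None

print(nested_ok)   # False
print(twelve)      # True
print(seven)       # False
```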


The special characters are:

``'.'``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``'^'``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``'$'``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.
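
The ``$`` cases described above can be checked with a short sketch (variable names are illustrative):

```python
import re

text = "foo1\nfoo2\n"

# Without MULTILINE, $ only matches at the very end (or before a final newline).
default_hit = re.search(r"foo.$", text).group()
# With MULTILINE, $ also matches before every newline, so the first line wins.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

# A lone $ finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]

print(default_hit)        # foo2
print(multiline_hit)      # foo1
print(dollar_positions)   # [3, 4]
```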

``'*'``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

``'+'``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

``'?'``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``<a> b <c>``, it will match the entire
   string, and not just ``<a>``. Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using the RE ``<.*?>`` will match
   only ``<a>``.
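
The greedy and non-greedy forms from the paragraph above, side by side (a sketch; the variable names are illustrative):

```python
import re

text = "<a> b <c>"

greedy = re.match(r"<.*>", text).group()    # runs to the last '>'
lazy = re.match(r"<.*?>", text).group()     # stops at the first '>'

print(greedy)   # <a> b <c>
print(lazy)     # <a>
```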

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
   followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.
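
A brief sketch of the counted qualifiers described above; the sample strings follow the text and the variable names are illustrative:

```python
import re

six = "aaaaaa"
greedy = re.match(r"a{3,5}", six).group()    # takes as many as allowed
lazy = re.match(r"a{3,5}?", six).group()     # takes as few as allowed

# a{4,}b needs at least four 'a' characters before the 'b'.
candidates = ["aaaab", "a" * 1000 + "b", "aaab"]
accepted = [len(s) for s in candidates if re.match(r"a{4,}b", s)]

print(greedy)     # aaaaa
print(lazy)       # aaa
print(accepted)   # [5, 1001]
```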

``'\'``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be repeated twice. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
     it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`LOCALE` or :const:`UNICODE` mode is in force.

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, ``[()[\]{}]`` and
     ``[]()[{}]`` will both match a parenthesis.
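
The set rules listed above can be exercised as follows. This is only a sketch: the sample strings are made up, and the trailing ``$`` in the first pattern is added so that exactly two characters must match.

```python
import re

# Ranges: two-digit values from 00 to 59.
minutes = re.compile(r"[0-5][0-9]$")
valid = [s for s in ["00", "59", "60"] if minutes.match(s)]

# Complemented set: everything except '5'.
not_five = re.findall(r"[^5]", "1535")

# Special characters are literal inside a set.
specials = re.findall(r"[(+*)]", "(a+b)*")

# A ']' placed first in the set is taken literally.
brackets = re.findall(r"[]()[{}]", "f(x)[0]{}")

print(valid)      # ['00', '59']
print(not_five)   # ['1', '3']
print(specials)   # ['(', '+', ')', '*']
print(brackets)   # ['(', ')', '[', ']', '{', '}']
```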

``'|'``
   ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
   will match either A or B. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once ``A`` matches, ``B`` will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.
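
A short sketch of the left-to-right behaviour of ``'|'`` described above (the patterns are chosen for illustration):

```python
import re

# The first alternative that matches wins, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()
reordered = re.match(r"ab|a", "ab").group()

# Two ways to match a literal vertical bar.
escaped = re.findall(r"\|", "a|b|c")
in_class = re.findall(r"[|]", "a|b|c")

print(first_wins)   # a
print(reordered)    # ab
print(escaped)      # ['|', '|']
print(in_class)     # ['|', '|']
```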

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?iLmsux)``
   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
   ``'u'``, ``'x'``.) The group matches the empty string; the letters
   set the corresponding flags: :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
   and :const:`re.X` (verbose), for the entire regular expression. (The
   flags are described in :ref:`contents-of-module-re`.) This
   is useful if you wish to include the flags as part of the regular
   expression, instead of passing a *flags* argument to the
   :func:`re.compile` function.

   Note that the ``(?x)`` flag changes how the expression is parsed. It should be
   used first in the expression string, or after one or more whitespace characters.
   If there are non-whitespace characters before the flag, the results are
   undefined.
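
As a small sketch, an inline flag at the start of the pattern should behave like the corresponding *flags* argument. The pattern and sample text are illustrative.

```python
import re

text = "Say HELLO"

inline = re.search(r"(?i)hello", text).group()
with_flag = re.search(r"hello", text, re.IGNORECASE).group()

print(inline)      # HELLO
print(with_flag)   # HELLO
```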

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.
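
The difference between capturing and non-capturing parentheses shows up in what the match object reports. A minimal sketch with made-up patterns:

```python
import re

capturing = re.match(r"(ab)+(c)", "ababc").groups()
non_capturing = re.match(r"(?:ab)+(c)", "ababc").groups()

print(capturing)       # ('ab', 'c')
print(non_capturing)   # ('c',)
```

Both patterns match the same text; only the numbering and availability of groups changes.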

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object ``m``    | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the ``repl``    | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.
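
The three reference contexts for a named group can be sketched with the ``quote`` pattern used above. The sample text and variable names are illustrative.

```python
import re

pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = 'say "hi" now'

# Context 1: inside the pattern, (?P=quote) and \1 refer to the same group.
numbered = re.compile(r"""(?P<quote>['"]).*?\1""")
same_match = pattern.search(text).group(0) == numbered.search(text).group(0)

# Context 2: on the match object, by name or by number.
m = pattern.search(text)
by_name = m.group("quote")
by_number = m.group(1)
quote_end = m.end("quote")

# Context 3: in the replacement string passed to sub().
masked = pattern.sub(r"\g<quote>...\g<quote>", text)
masked_numbered = pattern.sub(r"\1...\1", text)

print(m.group(0))        # "hi"
print(same_match)        # True
print(by_name)           # "
print(quote_end)         # 5
print(masked)            # say "..." now
print(masked_numbered)   # say "..." now
```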

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.
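
The two lookahead assertions above, tried against two sample names (a sketch; the second name is an arbitrary example):

```python
import re

positive = re.compile(r"Isaac (?=Asimov)")
negative = re.compile(r"Isaac (?!Asimov)")

pos_asimov = positive.search("Isaac Asimov")
pos_newton = positive.search("Isaac Newton")
neg_asimov = negative.search("Isaac Asimov")
neg_newton = negative.search("Isaac Newton")

# The assertion itself consumes nothing: only 'Isaac ' is part of the match.
print(repr(pos_asimov.group()))   # 'Isaac '
print(pos_newton)                 # None
print(neg_asimov)                 # None
print(repr(neg_newton.group()))   # 'Isaac '
```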

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Group
   references are not supported even if they match strings of some fixed length.
   Note that patterns which start with positive lookbehind assertions will not
   match at the beginning of the string being searched; you will most likely want
   to use the :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length and shouldn't contain group references.
   Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.
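
A sketch of a negative lookbehind, using a made-up task: find numbers that are not written directly after a dollar sign. The ``\b`` is added so that the match cannot start in the middle of a number.

```python
import re

plain_numbers = re.findall(r"(?<!\$)\b\d+", "cost $30 for 15 items")

# Unlike a positive lookbehind, this can succeed at the very start of the
# string, because there is nothing before position 0 to violate the assertion.
at_start = re.match(r"(?<!-)\w+", "spam-egg").group()

print(plain_numbers)   # ['15']
print(at_start)        # spam
```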

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or *name*
   exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
   can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
   matching pattern, which will match with ``'<user@host.com>'`` as well as
   ``'user@host.com'``, but not with ``'<user@host.com'``.

   .. versionadded:: 2.4
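
The conditional pattern above can be tried on the three sample addresses. This sketch uses :func:`match`, which anchors the attempt at the start of the string; :func:`search` would be free to skip the leading ``'<'`` and could therefore give a different answer for the last sample.

```python
import re

address = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)")

samples = ["<user@host.com>", "user@host.com", "<user@host.com"]
results = [address.match(s) is not None for s in samples]

print(results)   # [True, True, False]
```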

The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not on the list, then the resulting RE will match
the second character. For example, ``\$`` matches the character ``'$'``.

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.
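
The backreference example from the text, as a runnable sketch:

```python
import re

repeated = re.compile(r"(.+) \1")

samples = ["the the", "55 55", "thethe"]
hits = [s for s in samples if repeated.match(s)]

print(hits)   # ['the the', '55 55']
```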

``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word. A word is
   defined as a sequence of alphanumeric or underscore characters, so the end of a
   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
   Note that formally, ``\b`` is defined as the boundary between a ``\w`` and
   a ``\W`` character (or vice versa), or between ``\w`` and the beginning/end
   of the string, so the precise set of characters deemed to be alphanumeric
   depends on the values of the ``UNICODE`` and ``LOCALE`` flags.
   For example, ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end of a
   word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, ``'py2'``,
   but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so is also subject to the settings
   of ``LOCALE`` and ``UNICODE``.
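
The ``\b`` and ``\B`` examples above can be checked with a short sketch. It uses plain ASCII samples, so the ``LOCALE`` and ``UNICODE`` flags do not come into play.

```python
import re

word_foo = re.compile(r"\bfoo\b")
foo_samples = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
foo_hits = [s for s in foo_samples if word_foo.search(s)]

inner_py = re.compile(r"py\B")
py_samples = ["python", "py3", "py2", "py", "py.", "py!"]
py_hits = [s for s in py_samples if inner_py.search(s)]

# Inside a character class, \b means the backspace character (0x08).
backspace_at = re.search(r"[\b]", "a\x08c").start()

print(foo_hits)       # ['foo', 'foo.', '(foo)', 'bar foo baz']
print(py_hits)        # ['python', 'py3', 'py2']
print(backspace_at)   # 1
```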

``\d``
   When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
   is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
   whatever is classified as a decimal digit in the Unicode character properties
   database.

``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
   will match any character not marked as a digit in the Unicode character
   properties database.

``\s``
   When the :const:`UNICODE` flag is not specified, matches any whitespace
   character; this is equivalent to the set ``[ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on whitespace matching.
   If :const:`UNICODE` is set, this will match the characters ``[ \t\n\r\f\v]``
   plus whatever is classified as space in the Unicode character properties
   database.

``\S``
   When the :const:`UNICODE` flag is not specified, matches any non-whitespace
   character; this is equivalent to the set ``[^ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on non-whitespace matching. If
   :const:`UNICODE` is set, then any character not marked as space in the
   Unicode character properties database is matched.


``\w``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any alphanumeric character and the underscore; this is equivalent to the set
   ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
   whatever characters are defined as alphanumeric for the current locale. If
   :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
   is classified as alphanumeric in the Unicode character properties database.

``\W``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
   With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
   not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
   this will match any character that is not in ``[0-9_]`` and is not classified
   as alphanumeric in the Unicode character properties database.

``\Z``
   Matches only at the end of the string.

If both the :const:`LOCALE` and :const:`UNICODE` flags are set for a
particular sequence, the :const:`LOCALE` flag takes effect first, followed by
the :const:`UNICODE` flag.

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \v      \x
   \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

Octal escapes are included in a limited form: If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

.. seealso::

   Mastering Regular Expressions
      Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
      second edition of the book no longer covers Python at all, but the first
      edition covered writing good regular expression patterns in great detail.



.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions. Most non-trivial applications use the compiled form.


.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a regular expression object, which
   can be used for matching using its :func:`~RegexObject.match` and
   :func:`~RegexObject.search` methods, described below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.
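
A minimal sketch of compiling once with combined flags and reusing the object; the pattern and sample text are illustrative.

```python
import re

lines = "Perl\nPYTHON\nruby"

# Flags are combined with bitwise OR and stored in the compiled object.
word = re.compile(r"^python$", re.IGNORECASE | re.MULTILINE)
compiled_hit = word.search(lines).group()

# The module-level shortcut takes the same flags on every call.
direct_hit = re.search(r"^python$", lines, re.I | re.M).group()

print(compiled_hit)   # PYTHON
print(direct_hit)     # PYTHON
```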


.. data:: DEBUG

   Display debug information about the compiled expression.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will match
   lowercase letters, too. This is not affected by the current locale.


.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
   current locale.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline). By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.


.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.


.. data:: U
          UNICODE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
   on the Unicode character properties database.

   .. versionadded:: 2.0


.. data:: X
          VERBOSE

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments. Whitespace within the pattern is ignored, except
   when in a character class or when preceded by an unescaped backslash.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")
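
The exceptions in the whitespace and ``#`` rules matter when the pattern itself needs a literal space or hash sign. The following sketch uses a made-up pattern to show one way to write both in verbose mode:

```python
import re

tag = re.compile(r"""
    \#       # a literal hash sign: escaped, so it does not start a comment
    [ ]      # a literal space: kept because it is inside a character class
    (\w+)    # the word that follows
""", re.VERBOSE)

name = tag.search("see # python").group(1)

print(name)   # python
```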


.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :class:`MatchObject`
   instance. Return ``None`` if no position in the string matches the pattern; note
   that this is different from finding a zero-length match at some point in the
   string.


.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :class:`MatchObject` instance.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).
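
A compact sketch contrasting :func:`search` and :func:`match`, including the difference between ``None`` and a zero-length match. The patterns and strings are illustrative.

```python
import re

# No match at all versus a successful zero-length match.
no_match = re.search(r"x", "abc")
empty_match = re.search(r"x*", "abc")

# match() only tries position 0; search() scans the whole string.
at_start = re.match(r"c", "abcdef")
anywhere = re.search(r"c", "abcdef")

# MULTILINE does not let match() start at a later line.
ml_match = re.match(r"^b", "a\nb", re.MULTILINE)
ml_search = re.search(r"^b", "a\nb", re.MULTILINE)

print(no_match)                   # None
print(repr(empty_match.group()))  # ''
print(at_start)                   # None
print(anywhere.start())           # 2
print(ml_match)                   # None
print(ml_search.start())          # 2
```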


.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored. This has been fixed in later releases.)

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
      >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
      ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string. The same holds for
   the end of the string:

      >>> re.split(r'(\W+)', '...words, words...')
      ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list (e.g., if there's one capturing group
   in the separator, the 0th, the 2nd and so forth).

   Note that *split* will never split a string on an empty pattern match.
   For example:

      >>> re.split('x*', 'foo')
      ['foo']
      >>> re.split("(?m)^$", "foo\n\nbar\n")
      ['foo\n\nbar\n']

   .. versionchanged:: 2.7
      Added the optional flags argument.
602
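Because captured separators land at predictable indices, the fields and the
separators can be pulled apart with slicing. This is a minimal sketch that
assumes exactly one capturing group in the pattern; the sample string and the
variable names are made up for illustration.

```python
import re

parts = re.split(r"(\W+)", "alpha, beta; gamma")

# With one capturing group, the text between separators sits at the even
# indices and the captured separators sit at the odd indices.
fields = parts[0::2]
separators = parts[1::2]

print(fields)
print(separators)
```

With more than one capturing group the stride changes accordingly, since every
group contributes one element per separator.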

.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings. The *string* is scanned left-to-right, and matches are returned in
   the order found. If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group. Empty matches are included in the result unless they touch the
   beginning of another match.

   .. versionadded:: 1.5.2

   .. versionchanged:: 2.4
      Added the optional flags argument.

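The shape of the list returned by :func:`findall` depends on how many groups
the pattern contains. The following sketch runs three variants of one
illustrative pattern over the same made-up input.

```python
import re

text = "a=1, b=22"

# No groups: each element is the whole match.
no_groups = re.findall(r"\w+=\d+", text)

# One group: each element is the text of that group.
one_group = re.findall(r"(\w+)=\d+", text)

# Several groups: each element is a tuple with one item per group.
two_groups = re.findall(r"(\w+)=(\d+)", text)

print(no_groups)
print(one_group)
print(two_groups)
```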

.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :class:`MatchObject` instances over all
   non-overlapping matches for the RE *pattern* in *string*. The *string* is
   scanned left-to-right, and matches are returned in the order found. Empty
   matches are included in the result unless they touch the beginning of another
   match.

   .. versionadded:: 2.2

   .. versionchanged:: 2.4
      Added the optional flags argument.

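Since :func:`finditer` yields match objects rather than strings, it is a
natural choice when the positions of the matches are needed as well. A short
sketch, using an illustrative input string:

```python
import re

text = "a12b345"

# Collect the matched text together with its start and end offsets.
found = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]
print(found)
```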

.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged. *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
   For example:

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single match object argument, and returns the
   replacement string. For example:

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
      >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
      'Baked Beans & Spam'

   The pattern may be a string or an RE object.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
   ``'-a-b-c-'``.

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 2.7
      Added the optional flags argument.

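The ``\g<...>`` forms can be tried out with a short sketch. The patterns, the
sample strings and the flag variable ``rejected`` are illustrative
assumptions, not part of the module.

```python
import re

# Named groups are referenced with \g<name>.
swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)",
                 r"\g<last>, \g<first>",
                 "Isaac Newton")
print(swapped)

# \g<2>0 means "group 2 followed by a literal 0".
unambiguous = re.sub(r"(a)(b)", r"\g<2>0", "ab")
print(unambiguous)

# \20 is read as a reference to group 20, which this pattern does not
# define, so the replacement template is rejected with re.error.
rejected = False
try:
    re.sub(r"(a)(b)", r"\20", "ab")
except re.error:
    rejected = True
print(rejected)
```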

.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 2.7
      Added the optional flags argument.

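A brief sketch of the returned tuple, with and without a *count* limit; the
input string is made up for illustration.

```python
import re

# Replace every digit and report how many replacements were made.
new_string, number_of_subs = re.subn(r"\d", "#", "a1b22")
print(new_string)
print(number_of_subs)

# With count=2 only the first two occurrences are replaced.
limited = re.subn(r"\d", "#", "a1b22", count=2)
print(limited)
```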

.. function:: escape(string)

   Return *string* with all non-alphanumerics backslashed; this is useful if you
   want to match an arbitrary literal string that may have regular expression
   metacharacters in it.

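The effect of :func:`escape` is easiest to see by comparing an escaped and an
unescaped pattern. The sketch below only checks matching behaviour, because the
exact escaped text can differ between Python versions; the literal and the
haystacks are invented for the example.

```python
import re

literal = "a.b*c"

# Used directly as a pattern, "." and "*" keep their special meaning,
# so the pattern also matches text that merely looks similar.
loose = re.search(literal, "aXbbbc")

# After escaping, only the literal text itself matches.
strict_miss = re.search(re.escape(literal), "aXbbbc")
strict_hit = re.search(re.escape(literal), "see a.b*c here")

print(loose is not None)
print(strict_miss)
print(strict_hit.group())
```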

.. function:: purge()

   Clear the regular expression cache.


.. exception:: error

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern.

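One way to handle untrusted patterns is to catch :exc:`error` at compile time.
The helper ``try_compile`` below is a hypothetical name used only for this
sketch.

```python
import re

def try_compile(pattern):
    """Return a compiled pattern, or None if *pattern* is not a valid regex."""
    try:
        return re.compile(pattern)
    except re.error:
        return None

# Unmatched parenthesis: compilation fails with re.error.
bad = try_compile(r"(unclosed")
print(bad)

# A valid pattern that finds nothing simply returns None; no exception.
good = try_compile(r"(closed)")
print(good.search("xyz"))
```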

.. _re-objects:

Regular Expression Objects
--------------------------

.. class:: RegexObject

   The :class:`RegexObject` class supports the following methods and attributes:

   .. method:: RegexObject.search(string[, pos[, endpos]])

      Scan through *string* looking for a location where this regular expression
      produces a match, and return a corresponding :class:`MatchObject` instance.
      Return ``None`` if no position in the string matches the pattern; note that this
      is different from finding a zero-length match at some point in the string.

      The optional second parameter *pos* gives an index in the string where the
      search is to start; it defaults to ``0``. This is not completely equivalent to
      slicing the string; the ``'^'`` pattern character matches at the real beginning
      of the string and at positions just after a newline, but not necessarily at the
      index where the search is to start.

      The optional parameter *endpos* limits how far the string will be searched; it
      will be as if the string is *endpos* characters long, so only the characters
      from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
      than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
      expression object, ``rx.search(string, 0, 50)`` is equivalent to
      ``rx.search(string[:50], 0)``.

      >>> pattern = re.compile("d")
      >>> pattern.search("dog")     # Match at index 0
      <_sre.SRE_Match object at ...>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"


   .. method:: RegexObject.match(string[, pos[, endpos]])

      If zero or more characters at the *beginning* of *string* match this regular
      expression, return a corresponding :class:`MatchObject` instance. Return
      ``None`` if the string does not match the pattern; note that this is different
      from a zero-length match.

      The optional *pos* and *endpos* parameters have the same meaning as for the
      :meth:`~RegexObject.search` method.

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <_sre.SRE_Match object at ...>

      If you want to locate a match anywhere in *string*, use
      :meth:`~RegexObject.search` instead (see also :ref:`search-vs-match`).


   .. method:: RegexObject.split(string, maxsplit=0)

      Identical to the :func:`split` function, using the compiled pattern.


   .. method:: RegexObject.findall(string[, pos[, endpos]])

      Similar to the :func:`findall` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region like for :meth:`match`.


   .. method:: RegexObject.finditer(string[, pos[, endpos]])

      Similar to the :func:`finditer` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region like for :meth:`match`.


   .. method:: RegexObject.sub(repl, string, count=0)

      Identical to the :func:`sub` function, using the compiled pattern.


   .. method:: RegexObject.subn(repl, string, count=0)

      Identical to the :func:`subn` function, using the compiled pattern.


   .. attribute:: RegexObject.flags

      The regex matching flags. This is a combination of the flags given to
      :func:`.compile` and any ``(?...)`` inline flags in the pattern.


   .. attribute:: RegexObject.groups

      The number of capturing groups in the pattern.


   .. attribute:: RegexObject.groupindex

      A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
      numbers. The dictionary is empty if no symbolic groups were used in the
      pattern.


   .. attribute:: RegexObject.pattern

      The pattern string from which the RE object was compiled.

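The remark that *pos* is not equivalent to slicing, while *endpos* behaves like
truncation, can be checked with a compiled pattern. This is a minimal sketch;
the patterns and strings are illustrative only.

```python
import re

rx = re.compile(r"^b")

# With pos=1 the scan starts at "b", but "^" still refers to the real
# beginning of the string, so nothing matches.
with_pos = rx.search("ab", 1)

# Slicing creates a new string that really begins with "b", so "^" matches.
with_slice = rx.search("ab"[1:])

# endpos makes the search behave as if the string ended at that index.
digits = re.compile(r"\d+")
truncated = digits.search("abc123", 0, 5)

print(with_pos)
print(with_slice is not None)
print(truncated.group())
```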

.. _match-objects:

Match Objects
-------------

.. class:: MatchObject

   Match objects always have a boolean value of ``True``.
   Since :meth:`~RegexObject.match` and :meth:`~RegexObject.search` return ``None``
   when there is no match, you can test whether there was a match with a simple
   ``if`` statement::

      match = re.search(pattern, string)
      if match:
          process(match)

   Match objects support the following methods and attributes:


   .. method:: MatchObject.expand(template)

      Return the string obtained by doing backslash substitution on the template
      string *template*, as done by the :meth:`~RegexObject.sub` method. Escapes
      such as ``\n`` are converted to the appropriate characters, and numeric
      backreferences (``\1``, ``\2``) and named backreferences (``\g<1>``,
      ``\g<name>``) are replaced by the contents of the corresponding group.


   .. method:: MatchObject.group([group1, ...])

      Return one or more subgroups of the match. If there is a single argument, the
      result is a single string; if there are multiple arguments, the result is a
      tuple with one item per argument. Without arguments, *group1* defaults to zero
      (the whole match is returned). If a *groupN* argument is zero, the corresponding
      return value is the entire matching string; if it is in the inclusive range
      [1..99], it is the string matching the corresponding parenthesized group. If a
      group number is negative or larger than the number of groups defined in the
      pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
      part of the pattern that did not match, the corresponding result is ``None``.
      If a group is contained in a part of the pattern that matched multiple times,
      the last match is returned.

         >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
         >>> m.group(0)       # The entire match
         'Isaac Newton'
         >>> m.group(1)       # The first parenthesized subgroup.
         'Isaac'
         >>> m.group(2)       # The second parenthesized subgroup.
         'Newton'
         >>> m.group(1, 2)    # Multiple arguments give us a tuple.
         ('Isaac', 'Newton')

      If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
      arguments may also be strings identifying groups by their group name. If a
      string argument is not used as a group name in the pattern, an :exc:`IndexError`
      exception is raised.

      A moderately complicated example:

         >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
         >>> m.group('first_name')
         'Malcolm'
         >>> m.group('last_name')
         'Reynolds'

      Named groups can also be referred to by their index:

         >>> m.group(1)
         'Malcolm'
         >>> m.group(2)
         'Reynolds'

      If a group matches multiple times, only the last match is accessible:

         >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
         >>> m.group(1)                        # Returns only the last match.
         'c3'


   .. method:: MatchObject.groups([default])

      Return a tuple containing all the subgroups of the match, from 1 up to however
      many groups are in the pattern. The *default* argument is used for groups that
      did not participate in the match; it defaults to ``None``. (Incompatibility
      note: in the original Python 1.5 release, if the tuple was one element long, a
      string would be returned instead. In later versions (from 1.5.1 on), a
      singleton tuple is returned in such cases.)

      For example:

         >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
         >>> m.groups()
         ('24', '1632')

      If we make the decimal place and everything after it optional, not all groups
      might participate in the match. These groups will default to ``None`` unless
      the *default* argument is given:

         >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
         >>> m.groups()      # Second group defaults to None.
         ('24', None)
         >>> m.groups('0')   # Now, the second group defaults to '0'.
         ('24', '0')


   .. method:: MatchObject.groupdict([default])

      Return a dictionary containing all the *named* subgroups of the match, keyed by
      the subgroup name. The *default* argument is used for groups that did not
      participate in the match; it defaults to ``None``. For example:

         >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
         >>> m.groupdict()
         {'first_name': 'Malcolm', 'last_name': 'Reynolds'}


   .. method:: MatchObject.start([group])
               MatchObject.end([group])

      Return the indices of the start and end of the substring matched by *group*;
      *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
      *group* exists but did not contribute to the match. For a match object *m*, and
      a group *g* that did contribute to the match, the substring matched by group *g*
      (equivalent to ``m.group(g)``) is ::

         m.string[m.start(g):m.end(g)]

      Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
      null string. For example, after ``m = re.search('b(c?)', 'cba')``,
      ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
      2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

      An example that will remove *remove_this* from email addresses:

         >>> email = "tony@tiremove_thisger.net"
         >>> m = re.search("remove_this", email)
         >>> email[:m.start()] + email[m.end():]
         'tony@tiger.net'


   .. method:: MatchObject.span([group])

      For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
      m.end(group))``. Note that if *group* did not contribute to the match, this is
      ``(-1, -1)``. *group* defaults to zero, the entire match.


   .. attribute:: MatchObject.pos

      The value of *pos* which was passed to the :meth:`~RegexObject.search` or
      :meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
      index into the string at which the RE engine started looking for a match.


   .. attribute:: MatchObject.endpos

      The value of *endpos* which was passed to the :meth:`~RegexObject.search` or
      :meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
      index into the string beyond which the RE engine will not go.


   .. attribute:: MatchObject.lastindex

      The integer index of the last matched capturing group, or ``None`` if no group
      was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
      ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
      the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
      string.


   .. attribute:: MatchObject.lastgroup

      The name of the last matched capturing group, or ``None`` if the group didn't
      have a name, or if no group was matched at all.


   .. attribute:: MatchObject.re

      The regular expression object whose :meth:`~RegexObject.match` or
      :meth:`~RegexObject.search` method produced this :class:`MatchObject`
      instance.


   .. attribute:: MatchObject.string

      The string passed to :meth:`~RegexObject.match` or
      :meth:`~RegexObject.search`.

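Several of the match object methods and attributes described in this section
can be exercised together. The sketch below reuses patterns from the examples
above; the variable names ``m`` and ``named`` are arbitrary.

```python
import re

# The second group is optional and does not participate in this match.
m = re.match(r"(\d+)\.?(\d+)?", "24")

print(m.groups())       # unmatched group reported as None
print(m.groups("0"))    # unmatched group replaced by the default
print(m.span(1))        # offsets of the first group
print(m.span(2))        # (-1, -1) for a group that did not contribute
print(m.lastindex)      # index of the last group that matched

# expand() fills a template using numeric or named backreferences.
named = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
print(named.expand(r"\g<last_name>, \1"))
print(named.lastgroup)
```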

Examples
--------


Checking For a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully:

.. testcode::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following:

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as such:

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~MatchObject.group` method of :class:`MatchObject` in the following
manner:

.. doctest::

   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'

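The :exc:`AttributeError` in the example above can be avoided by checking for
``None`` before calling :meth:`~MatchObject.group`. A minimal sketch, in which
the helper name ``paired_card`` is hypothetical:

```python
import re

pair = re.compile(r".*(.).*\1")

def paired_card(hand):
    """Return the card that appears twice in *hand*, or None if there is no pair."""
    match = pair.match(hand)
    if match is None:
        return None
    return match.group(1)

print(paired_card("717ak"))
print(paired_card("718ak"))
print(paired_card("354aa"))
```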

Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings

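Applied to the sample line, the regular expression above might be used as
follows. Unlike :c:func:`scanf`, the groups come back as strings, so the
numeric fields are converted explicitly; the variable names are illustrative.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

if m:
    filename = m.group(1)
    # The captured digits are text; convert them to integers by hand.
    n_errors = int(m.group(2))
    n_warnings = int(m.group(3))
    print(filename)
    print(n_errors)
    print(n_warnings)
```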

.. _search-vs-match:

search() vs. match()
^^^^^^^^^^^^^^^^^^^^

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

Python offers two different primitive operations based on regular expressions:
:func:`re.match` checks for a match only at the beginning of the string, while
:func:`re.search` checks for a match anywhere in the string (this is what Perl
does by default).

For example::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("c", "abcdef")   # Match
   <_sre.SRE_Match object at ...>

Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")    # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <_sre.SRE_Match object at ...>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line.

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <_sre.SRE_Match object at ...>

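The three cases discussed above can be compared side by side. This is only an illustrative sketch; the variable names are arbitrary, and the match position is read with the ``start()`` method of the match object.

```python
import re

text = "A\nB\nX"

# match() is anchored at the start of the string, even in MULTILINE mode.
at_string_start = re.match("X", text, re.MULTILINE)

# search() with '^' can succeed at the start of any line in MULTILINE mode.
at_line_start = re.search("^X", text, re.MULTILINE)

# Without MULTILINE, '^' only matches at the very start of the string.
without_flag = re.search("^X", text)

summary = (at_string_start is None, at_line_start.start(), without_flag is None)
```

Only the second call finds a match, at the offset of the third line.
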

Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern. The
method is invaluable for converting textual data into data structures that can be
easily read and modified by Python as demonstrated in the following example that
creates a phonebook.

First, here is the input. Normally it would come from a file; here we use
triple-quoted string syntax:

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

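The two splitting steps can be combined to build a lookup structure. The sketch below is one possible arrangement, not a prescribed one: the function name ``build_phonebook`` and the choice of a dictionary keyed by ``(last name, first name)`` are assumptions made for illustration.

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""


def build_phonebook(raw):
    """Map (last, first) to phone and address; a hypothetical helper."""
    phonebook = {}
    # One entry per nonempty line.
    for entry in re.split(r"\n+", raw):
        # maxsplit=3 keeps the address, which contains spaces, in one piece.
        first, last, phone, address = re.split(":? ", entry, maxsplit=3)
        phonebook[(last, first)] = {"phone": phone, "address": address}
    return phonebook


phonebook = build_phonebook(text)
```

Passing ``maxsplit`` as a keyword argument is equivalent to the positional form used above and makes the intent a little more explicit.
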

Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function. This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

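Because the output above changes on every call, it can be convenient to pass in a seeded random number generator when a repeatable result is needed. The following is a sketch under that assumption; the ``munge`` wrapper and its ``rng`` parameter are hypothetical and not part of the original example.

```python
import random
import re


def munge(text, rng):
    """Shuffle the inner letters of each word using the given generator.

    munge and its rng parameter are hypothetical names for this sketch.
    """
    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        # The first and last characters of the word are left in place.
        return m.group(1) + "".join(inner_word) + m.group(3)

    return re.sub(r"(\w)(\w+)(\w)", repl, text)


text = "Professor Abdolmalek, please report your absences promptly."
munged = munge(text, random.Random(0))
```

Whatever the seed, each word keeps its first and last characters and the same set of letters; only the order of the inner letters changes.
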

Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does. For example, a writer who wanted to find all of
the adverbs in some text might use :func:`findall` in the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']

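The pattern above also matches "ly" in the middle of a word. As a small sketch of one way to tighten it, the ``\b`` word-boundary assertion can be added on both sides; the sample sentence here is invented for illustration.

```python
import re

# Invented sample text: "flying" contains "ly" but does not end with it.
sample = "He was flying slowly."

# Without boundaries, the leading "fly" of "flying" is reported as a match.
loose = re.findall(r"\w+ly", sample)

# With \b on both sides, only whole words ending in "ly" are reported.
strict = re.findall(r"\b\w+ly\b", sample)
```

Even the stricter pattern only looks at spelling, so it still reports any word ending in "ly", whether or not it is an adverb.
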

Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides instances of
:class:`MatchObject` instead of strings. Continuing with the previous example,
a writer who wanted to find all of the adverbs *and their positions* in some
text would use :func:`finditer` in the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
   07-16: carefully
   40-47: quickly

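When the positions are needed later rather than printed immediately, the same loop can collect them into a list. This is a minimal sketch; the tuple layout ``(start, end, word)`` is an arbitrary choice for illustration.

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Each match object knows where it was found as well as what it matched.
positions = [(m.start(), m.end(), m.group(0))
             for m in re.finditer(r"\w+ly", text)]

# The recorded offsets can be used to slice the original string again.
words = [text[start:end] for start, end, _ in positions]
```

The offsets follow the usual slicing convention, so ``text[start:end]`` gives back the matched word.
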

Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it. For example, the two following lines of code are
functionally identical:

   >>> re.match(r"\W(.)\1\W", " ff ")
   <_sre.SRE_Match object at ...>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <_sre.SRE_Match object at ...>

When one wants to match a literal backslash, it must be escaped in the regular
expression. With raw string notation, this means ``r"\\"``. Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical:

   >>> re.match(r"\\", r"\\")
   <_sre.SRE_Match object at ...>
   >>> re.match("\\\\", r"\\")
   <_sre.SRE_Match object at ...>
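
The equivalence claimed above can be checked directly, since raw string notation only changes how a literal is written, not the string that results. The following sketch compares the two spellings; the variable names are arbitrary.

```python
import re

# Both literals produce exactly the same pattern string.
raw_pattern = r"\W(.)\1\W"
escaped_pattern = "\\W(.)\\1\\W"
same_pattern = raw_pattern == escaped_pattern

# r"\n" is a backslash followed by "n"; "\n" is a single newline character.
lengths = (len(r"\n"), len("\n"))

# The pattern r"\\" matches one literal backslash in the searched string.
m = re.match(r"\\", "\\")
matched_text = m.group(0)
```

Here ``"\\"`` in the searched string is an ordinary string literal, so it contains a single backslash character.
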