:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>


This module provides regular expression matching operations similar to
those found in Perl. Both patterns and strings to be searched can be
Unicode strings as well as 8-bit strings.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.

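As a minimal sketch of the difference between raw and regular string literals
(the sample text ``C:\temp`` is made up for illustration), the two spellings of
the backslash pattern can be compared like this:

```python
import re

# A raw literal keeps the backslash: r"\n" is a backslash followed by 'n'.
raw_len = len(r"\n")      # 2
# A regular literal turns \n into one newline character.
cooked_len = len("\n")    # 1

# Both literals spell the same two-character pattern, \\, which
# matches a single literal backslash in the searched text.
text = "C:\\temp"         # hypothetical sample text: C:\temp
m_raw = re.search(r"\\", text)
m_cooked = re.search("\\\\", text)
same_position = m_raw.start() == m_cooked.start()
```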
Most regular expression operations are available both as module-level
functions and as :class:`RegexObject` methods. The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.

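The following sketch contrasts the two styles. It assumes the optional *pos*
argument of the compiled object's ``search()`` method as one example of a
fine-tuning parameter; the sample text is made up.

```python
import re

text = "spam eggs spam"

# Module-level shortcut: no explicit compile step.
first = re.search(r"spam", text)

# Compiled object: the same search, plus the optional ``pos`` argument
# of the search() method, which the module-level function does not take.
pattern = re.compile(r"spam")
second = pattern.search(text, 1)   # start scanning at index 1
```

Here ``first`` is found at index 0, while ``second`` skips the first
occurrence and is found at index 10.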

.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contain low-precedence
operations, boundary conditions between *A* and *B*, or numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book referenced
below, or almost any textbook about compiler construction.

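The low-precedence caveat can be sketched as follows. ``matches_fully`` is a
hypothetical helper written for this illustration, not part of the module:

```python
import re

def matches_fully(pattern, string):
    """Hypothetical helper: True if *pattern* matches all of *string*."""
    return re.match(r"(?:%s)\Z" % pattern, string) is not None

A, B = r"a|b", r"c"

# 'a' matches A and 'c' matches B ...
parts_ok = matches_fully(A, "a") and matches_fully(B, "c")

# ... but plain concatenation yields a|bc, i.e. 'a' or 'bc', because
# '|' binds more loosely than concatenation.
naive = matches_fully(A + B, "ac")

# Wrapping each part in a non-capturing group restores the expected result.
grouped = matches_fully("(?:%s)(?:%s)" % (A, B), "ac")
```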
A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted. Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using the ``\number`` notation, e.g., ``'\x00'``.


The special characters are:

``'.'``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``'^'``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``'$'``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.

``'*'``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

``'+'``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

``'?'``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` quantifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
   string, and not just ``'<H1>'``. Adding ``'?'`` after the quantifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using ``.*?`` in the previous
   expression will match only ``'<H1>'``.

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
   followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
   quantifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous quantifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

``'\'``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be doubled. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
     it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`LOCALE` or :const:`UNICODE` mode is in force.

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, ``[()[\]{}]`` and
     ``[]()[{}]`` will both match a parenthesis.

``'|'``
   ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
   will match either A or B. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once ``A`` matches, ``B`` will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?iLmsux)``
   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
   ``'u'``, ``'x'``.) The group matches the empty string; the letters
   set the corresponding flags: :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
   and :const:`re.X` (verbose), for the entire regular expression. (The
   flags are described in :ref:`contents-of-module-re`.) This
   is useful if you wish to include the flags as part of the regular
   expression, instead of passing a *flags* argument to the
   :func:`re.compile` function.

   Note that the ``(?x)`` flag changes how the expression is parsed. It should be
   used first in the expression string, or after one or more whitespace characters.
   If there are non-whitespace characters before the flag, the results are
   undefined.

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object ``m``    | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the ``repl``    | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Group
   references are not supported even if they match strings of some fixed length.
   Note that patterns which start with positive lookbehind assertions will not
   match at the beginning of the string being searched; you will most likely want
   to use the :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length and shouldn't contain group references.
   Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or *name*
   exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
   can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
   matching pattern, which will match with ``'<user@host.com>'`` as well as
   ``'user@host.com'``, but not with ``'<user@host.com'``.

   .. versionadded:: 2.4

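Several of the special characters described above can be tried out together.
The following is only an illustrative sketch that reuses the patterns and
sample strings from the descriptions; the variable names are arbitrary:

```python
import re

html = "<H1>title</H1>"
greedy = re.match(r"<.*>", html).group()         # the whole string
lazy = re.match(r"<.*?>", html).group()          # '<H1>'

# {m,n} is greedy; {m,n}? takes as few repetitions as possible.
most = re.match(r"a{3,5}", "aaaaaa").group()     # 'aaaaa'
fewest = re.match(r"a{3,5}?", "aaaaaa").group()  # 'aaa'

# Alternatives are tried left to right; the first one that matches wins,
# even if a later one would match more text.
alt = re.match(r"a|ab", "ab").group()            # 'a'

# Lookahead assertions check what follows without consuming it.
ahead = re.match(r"Isaac (?=Asimov)", "Isaac Asimov").group()
not_ahead = re.match(r"Isaac (?!Asimov)", "Isaac Asimov")

# Conditional: require '>' only if group 1 (the '<') took part in the match.
email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)")
bracketed = email.match("<user@host.com>")
plain = email.match("user@host.com")
unbalanced = email.match("<user@host.com")
```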
The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not on the list, then the resulting RE will match
the second character. For example, ``\$`` matches the character ``'$'``.

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word. A word is
   defined as a sequence of alphanumeric or underscore characters, so the end of a
   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
   Note that formally, ``\b`` is defined as the boundary between a ``\w`` and
   a ``\W`` character (or vice versa), or between ``\w`` and the beginning/end
   of the string, so the precise set of characters deemed to be alphanumeric
   depends on the values of the ``UNICODE`` and ``LOCALE`` flags.
   For example, ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end of a
   word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, ``'py2'``,
   but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so is also subject to the settings
   of ``LOCALE`` and ``UNICODE``.

``\d``
   When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
   is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
   whatever is classified as a decimal digit in the Unicode character properties
   database.

``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
   will match anything other than characters marked as digits in the Unicode
   character properties database.

``\s``
   When the :const:`UNICODE` flag is not specified, matches any whitespace
   character; this is equivalent to the set ``[ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on matching of the space.
   If :const:`UNICODE` is set, this will match the characters ``[ \t\n\r\f\v]``
   plus whatever is classified as space in the Unicode character properties
   database.

``\S``
   When the :const:`UNICODE` flag is not specified, matches any non-whitespace
   character; this is equivalent to the set ``[^ \t\n\r\f\v]``. The
   :const:`LOCALE` flag has no extra effect on non-whitespace match. If
   :const:`UNICODE` is set, then any character not marked as space in the
   Unicode character properties database is matched.

``\w``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any alphanumeric character and the underscore; this is equivalent to the set
   ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
   whatever characters are defined as alphanumeric for the current locale. If
   :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
   is classified as alphanumeric in the Unicode character properties database.

``\W``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
   With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
   not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
   this will match anything other than ``[0-9_]`` and the characters classified as
   alphanumeric in the Unicode character properties database.

``\Z``
   Matches only at the end of the string.

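The backreference and word-boundary sequences above can be checked with a
short sketch. The sample strings are taken from the descriptions; the variable
names are arbitrary:

```python
import re

# \1 must repeat exactly what group 1 matched, after the space.
repeated = re.match(r"(.+) \1", "the the")
no_space = re.match(r"(.+) \1", "thethe")

# \b requires a word boundary on both sides of 'foo'.
candidates = ("foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3")
boundary_hits = [s for s in candidates if re.search(r"\bfoo\b", s)]

# \B requires that 'py' is *not* followed by a word boundary.
words = ("python", "py3", "py2", "py", "py.", "py!")
inner_hits = [s for s in words if re.search(r"py\B", s)]
```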
If both the :const:`LOCALE` and :const:`UNICODE` flags are included for a
particular sequence, then the :const:`LOCALE` flag takes effect first, followed
by :const:`UNICODE`.

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \v      \x
   \\

(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)

Octal escapes are included in a limited form: If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

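A brief sketch of how these escapes are told apart; the sample strings are
made up for illustration:

```python
import re

# Inside a character class, \b is the backspace character.
backspace = re.match(r"[\b]", "\b")

# Three octal digits denote a character: octal 141 is 'a'.
octal = re.match(r"\141", "a")

# A leading zero also marks an octal escape: \0 is the null character.
null = re.match(r"\0", "\0")

# Otherwise the digits are a group reference: \1 repeats group 1.
reference = re.match(r"(a)\1", "aa")
```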
.. seealso::

   Mastering Regular Expressions
      Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
      second edition of the book no longer covers Python at all, but the first
      edition covered writing good regular expression patterns in great detail.



.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications use the compiled form.


.. function:: compile(pattern, flags=0)

   Compile a regular expression pattern into a regular expression object, which
   can be used for matching using its :func:`~RegexObject.match` and
   :func:`~RegexObject.search` methods, described below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pattern)
      result = prog.match(string)

   is equivalent to ::

      result = re.match(pattern, string)

   but using :func:`re.compile` and saving the resulting regular expression
   object for reuse is more efficient when the expression will be used several
   times in a single program.

   .. note::

      The compiled versions of the most recent patterns passed to
      :func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
      programs that use only a few regular expressions at a time needn't worry
      about compiling regular expressions.

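As a small sketch of combining flags and reusing the compiled object (the
pattern and sample text are made up for illustration):

```python
import re

# Combine flags with the bitwise OR operator.
word_at_line_start = re.compile(r"^b\w*", re.IGNORECASE | re.MULTILINE)

text = "Beta\nalpha\nbravo"
found = word_at_line_start.findall(text)

# The same compiled object can be reused for further strings.
has_b = [bool(word_at_line_start.search(s)) for s in ("bar", "foo")]
```

With these inputs, ``found`` is ``['Beta', 'bravo']`` and ``has_b`` is
``[True, False]``.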

.. data:: DEBUG

   Display debug information about the compiled expression.


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will match
   lowercase letters, too. This is not affected by the current locale.

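A minimal illustration of the effect on a character range, using a made-up
sample string:

```python
import re

text = "abc DEF 123"

# Without the flag, [A-Z] matches uppercase letters only.
upper_only = re.findall(r"[A-Z]+", text)

# With IGNORECASE, the same range also matches lowercase letters.
any_case = re.findall(r"[A-Z]+", text, re.IGNORECASE)
```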

.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
   current locale.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline). By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.

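The difference between the default and :const:`MULTILINE` behaviour can be
sketched with a two-line sample string (made up for illustration):

```python
import re

text = "one\ntwo\n"

# Default mode: '^' matches only at the very start of the string, and
# '$' only at the end or just before the final newline.
first_word = re.findall(r"^\w+", text)
last_word = re.findall(r"\w+$", text)

# MULTILINE mode: '^' and '$' also match at every line boundary.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)
```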

.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.

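A short sketch of the flag's effect, with a made-up two-line string:

```python
import re

text = "x\ny"

# By default '.' stops at the newline.
one_line = re.match(r".*", text).group()

# With DOTALL, '.' also matches the newline, so '.*' spans both lines.
both_lines = re.match(r".*", text, re.DOTALL).group()
```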

.. data:: U
          UNICODE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
   on the Unicode character properties database.

   .. versionadded:: 2.0


.. data:: X
          VERBOSE

   This flag allows you to write regular expressions that look nicer and are
   more readable by allowing you to visually separate logical sections of the
   pattern and add comments. Whitespace within the pattern is ignored, except
   when in a character class or when preceded by an unescaped backslash.
   When a line contains a ``#`` that is not in a character class and is not
   preceded by an unescaped backslash, all characters from the leftmost such
   ``#`` through the end of the line are ignored.

   This means that the two following regular expression objects that match a
   decimal number are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")

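The exceptions for character classes and backslash escapes can be sketched as
follows. The pattern, which matches strings such as ``'12 #34'``, is a made-up
example:

```python
import re

tagged = re.compile(r"""
    \d+     # first number
    [ ]     # one literal space (kept: it is inside a character class)
    \#      # a literal hash sign (kept: it is backslash-escaped)
    \d+     # second number
""", re.VERBOSE)

# The same pattern written without VERBOSE.
compact = re.compile(r"\d+[ ]\#\d+")

with_space = tagged.match("12 #34")
without_space = tagged.match("12#34")
```

Here ``with_space`` is a match object, while ``without_space`` is ``None``
because the space inside the character class is still required.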

.. function:: search(pattern, string, flags=0)

   Scan through *string* looking for the first location where the regular expression
   *pattern* produces a match, and return a corresponding :class:`MatchObject`
   instance. Return ``None`` if no position in the string matches the pattern; note
   that this is different from finding a zero-length match at some point in the
   string.

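The distinction between "no match" and a zero-length match can be sketched
like this (patterns and text chosen for illustration):

```python
import re

text = "abc"

# 'x*' can match zero characters, so a (zero-length) match is found at
# the first position.
empty = re.search(r"x*", text)

# 'x+' needs at least one 'x', so there is no match at all.
missing = re.search(r"x+", text)
```

Testing the result with ``is not None`` is the way to tell the two cases
apart; the matched text of ``empty`` is the empty string.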

.. function:: match(pattern, string, flags=0)

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :class:`MatchObject` instance.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
   at the beginning of the string and not at the beginning of each line.

   If you want to locate a match anywhere in *string*, use :func:`search`
   instead (see also :ref:`search-vs-match`).

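A brief sketch of the difference, using a made-up two-line string:

```python
import re

text = "first\nsecond"

# match() only tries the very beginning of the string ...
at_start = re.match(r"second", text)

# ... even in MULTILINE mode, where '^' could match after the newline.
at_line_start = re.match(r"^second", text, re.MULTILINE)

# search() scans the whole string and finds the second line.
anywhere = re.search(r"^second", text, re.MULTILINE)
```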

.. function:: split(pattern, string, maxsplit=0, flags=0)

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored. This has been fixed in later releases.)

   >>> re.split(r'\W+', 'Words, words, words.')
   ['Words', 'words', 'words', '']
   >>> re.split(r'(\W+)', 'Words, words, words.')
   ['Words', ', ', 'words', ', ', 'words', '.', '']
   >>> re.split(r'\W+', 'Words, words, words.', 1)
   ['Words', 'words, words.']
   >>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
   ['0', '3', '9']

   If there are capturing groups in the separator and it matches at the start of
   the string, the result will start with an empty string. The same holds for
   the end of the string:

   >>> re.split(r'(\W+)', '...words, words...')
   ['', '...', 'words', ', ', 'words', '...', '']

   That way, separator components are always found at the same relative
   indices within the result list (e.g., if there's one capturing group
   in the separator, at indices 1, 3 and so forth).

   Note that *split* will never split a string on an empty pattern match.
   For example:

   >>> re.split('x*', 'foo')
   ['foo']
   >>> re.split("(?m)^$", "foo\n\nbar\n")
   ['foo\n\nbar\n']

   .. versionchanged:: 2.7
      Added the optional flags argument.

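As a rough illustration of :func:`split` with a capturing group and with
*maxsplit*, consider the sketch below. The sample strings and the names
``parts``, ``fields``, ``separators``, ``key`` and ``value`` are made up for
this example and are not part of the module.

```python
import re

# One capturing group: fields land at even indices, separators at odd ones.
parts = re.split(r'([,;])', 'a,b;c')
fields = parts[::2]
separators = parts[1::2]

# maxsplit=1 performs a single split; the remainder is the final element.
key, value = re.split(r'\s*=\s*', 'path = /usr/bin = x', maxsplit=1)
```

Slicing with a step of two works here only because the pattern has exactly one
capturing group; with more groups, each separator contributes one list item per
group.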

.. function:: findall(pattern, string, flags=0)

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings. The *string* is scanned left-to-right, and matches are returned in
   the order found. If one or more groups are present in the pattern, return a
   list of groups; this will be a list of tuples if the pattern has more than
   one group. Empty matches are included in the result unless they touch the
   beginning of another match.

   .. versionadded:: 1.5.2

   .. versionchanged:: 2.4
      Added the optional flags argument.

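The shape of the list returned by :func:`findall` depends on how many groups
the pattern contains. A minimal sketch, using an invented sample string and
illustrative variable names:

```python
import re

text = 'x=1, y=22'

# No groups: each item is the whole match.
whole = re.findall(r'\w+=\d+', text)

# One group: each item is the text of that group.
names = re.findall(r'(\w+)=\d+', text)

# Several groups: each item is a tuple with one entry per group.
pairs = re.findall(r'(\w+)=(\d+)', text)
```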

.. function:: finditer(pattern, string, flags=0)

   Return an :term:`iterator` yielding :class:`MatchObject` instances over all
   non-overlapping matches for the RE *pattern* in *string*. The *string* is
   scanned left-to-right, and matches are returned in the order found. Empty
   matches are included in the result unless they touch the beginning of another
   match.

   .. versionadded:: 2.2

   .. versionchanged:: 2.4
      Added the optional flags argument.

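Because :func:`finditer` yields match objects rather than strings, positions
are available along with the matched text. A short sketch with an invented
sample string:

```python
import re

text = 'a1 b22 c333'

# Collect the matched text together with its (start, end) span.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]
```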

.. function:: sub(pattern, repl, string, count=0, flags=0)

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged. *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage return, and
   so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
   as ``\6``, are replaced with the substring matched by group 6 in the pattern.
   For example:

   >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
   ...        r'static PyObject*\npy_\1(void)\n{',
   ...        'def myfunc():')
   'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single match object argument, and returns the
   replacement string. For example:

   >>> def dashrepl(matchobj):
   ...     if matchobj.group(0) == '-': return ' '
   ...     else: return '-'
   >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
   'pro--gram files'
   >>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
   'Baked Beans & Spam'

   The pattern may be a string or an RE object.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
   ``'-a-b-c-'``.

   In string-type *repl* arguments, in addition to the character escapes and
   backreferences described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

   .. versionchanged:: 2.7
      Added the optional flags argument.

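The replacement forms described for :func:`sub` can be sketched as follows.
The input strings and variable names are invented for illustration only.

```python
import re

# Named groups can be reordered with \g<name>.
swapped = re.sub(r'(?P<year>\d{4})-(?P<month>\d{2})', r'\g<month>/\g<year>',
                 'released 2012-02')

# \g<2>0 means "group 2, then a literal 0"; \20 would refer to group 20.
padded = re.sub(r'(a)(b)', r'\g<2>0', 'ab')

# A function receives each match object and returns the replacement text.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1 b22')

# count limits the number of replacements, counted from the left.
first_only = re.sub(r'\d', '#', '1-2-3', count=1)
```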

.. function:: subn(pattern, repl, string, count=0, flags=0)

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

   .. versionchanged:: 2.7
      Added the optional flags argument.

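A small sketch of :func:`subn`, with an invented input string, showing how the
tuple it returns can be unpacked:

```python
import re

# Collapse runs of whitespace and report how many runs were replaced.
cleaned, replaced = re.subn(r'\s+', ' ', 'a  b   c')
```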

.. function:: escape(string)

   Return *string* with all non-alphanumerics backslashed; this is useful if you
   want to match an arbitrary literal string that may have regular expression
   metacharacters in it.

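The difference :func:`escape` makes can be seen in a sketch like the one below.
The string ``needle`` and the texts searched are made up for this example.

```python
import re

needle = 'a.b*c'

# Unescaped, '.' and '*' act as metacharacters, so unrelated text matches.
loose = re.search(needle, 'axbbbc')

# Escaped, the pattern matches only the literal text.
strict_miss = re.search(re.escape(needle), 'axbbbc')
strict_hit = re.search(re.escape(needle), 'see a.b*c here')
```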

.. function:: purge()

   Clear the regular expression cache.


.. exception:: error

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern.

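One possible way to handle :exc:`error` is sketched below. The helper
``try_compile`` is hypothetical and written only for this example; it is not
provided by the module.

```python
import re

def try_compile(pattern):
    """Return a compiled pattern, or None if *pattern* is not a valid RE."""
    try:
        return re.compile(pattern)
    except re.error:
        return None

bad = try_compile('(unclosed')      # invalid: unmatched parenthesis
good = try_compile('(closed)')
no_match = good.search('nothing')   # a failed search gives None, not an error
```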

.. _re-objects:

Regular Expression Objects
--------------------------

.. class:: RegexObject

   The :class:`RegexObject` class supports the following methods and attributes:

   .. method:: RegexObject.search(string[, pos[, endpos]])

      Scan through *string* looking for a location where this regular expression
      produces a match, and return a corresponding :class:`MatchObject` instance.
      Return ``None`` if no position in the string matches the pattern; note that this
      is different from finding a zero-length match at some point in the string.

      The optional second parameter *pos* gives an index in the string where the
      search is to start; it defaults to ``0``. This is not completely equivalent to
      slicing the string; the ``'^'`` pattern character matches at the real beginning
      of the string and at positions just after a newline, but not necessarily at the
      index where the search is to start.

      The optional parameter *endpos* limits how far the string will be searched; it
      will be as if the string is *endpos* characters long, so only the characters
      from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
      than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
      expression object, ``rx.search(string, 0, 50)`` is equivalent to
      ``rx.search(string[:50], 0)``.

      >>> pattern = re.compile("d")
      >>> pattern.search("dog")     # Match at index 0
      <_sre.SRE_Match object at ...>
      >>> pattern.search("dog", 1)  # No match; search doesn't include the "d"


   .. method:: RegexObject.match(string[, pos[, endpos]])

      If zero or more characters at the *beginning* of *string* match this regular
      expression, return a corresponding :class:`MatchObject` instance. Return
      ``None`` if the string does not match the pattern; note that this is different
      from a zero-length match.

      The optional *pos* and *endpos* parameters have the same meaning as for the
      :meth:`~RegexObject.search` method.

      >>> pattern = re.compile("o")
      >>> pattern.match("dog")      # No match as "o" is not at the start of "dog".
      >>> pattern.match("dog", 1)   # Match as "o" is the 2nd character of "dog".
      <_sre.SRE_Match object at ...>

      If you want to locate a match anywhere in *string*, use
      :meth:`~RegexObject.search` instead (see also :ref:`search-vs-match`).


   .. method:: RegexObject.split(string, maxsplit=0)

      Identical to the :func:`split` function, using the compiled pattern.


   .. method:: RegexObject.findall(string[, pos[, endpos]])

      Similar to the :func:`findall` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region in the same way as for :meth:`~RegexObject.match`.


   .. method:: RegexObject.finditer(string[, pos[, endpos]])

      Similar to the :func:`finditer` function, using the compiled pattern, but
      also accepts optional *pos* and *endpos* parameters that limit the search
      region in the same way as for :meth:`~RegexObject.match`.


   .. method:: RegexObject.sub(repl, string, count=0)

      Identical to the :func:`sub` function, using the compiled pattern.


   .. method:: RegexObject.subn(repl, string, count=0)

      Identical to the :func:`subn` function, using the compiled pattern.


   .. attribute:: RegexObject.flags

      The regex matching flags. This is a combination of the flags given to
      :func:`.compile` and any ``(?...)`` inline flags in the pattern.


   .. attribute:: RegexObject.groups

      The number of capturing groups in the pattern.


   .. attribute:: RegexObject.groupindex

      A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
      numbers. The dictionary is empty if no symbolic groups were used in the
      pattern.


   .. attribute:: RegexObject.pattern

      The pattern string from which the RE object was compiled.

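The behaviour of *pos* and *endpos*, and a few of the attributes of a compiled
pattern, can be sketched as follows. The patterns, strings and variable names
are invented for this illustration.

```python
import re

rx = re.compile(r'^b')

# Passing pos=1 is not the same as slicing: '^' still refers to the real
# start of the string, so nothing matches at index 1.
with_pos = rx.search('ab', 1)
with_slice = rx.search('ab'[1:])

# endpos behaves as if the string were cut off at that index.
digits = re.compile(r'\d+')
limited = digits.search('ab12345', 0, 4)

# Attributes of a compiled pattern.
named = re.compile(r'(?P<key>\w+)=(?P<value>\w+)', re.IGNORECASE)
group_count = named.groups
group_names = dict(named.groupindex)
ignores_case = bool(named.flags & re.IGNORECASE)
```

The ``flags`` attribute is tested with a bitwise AND rather than compared for
equality, since it may also include flags that were not passed explicitly.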

.. _match-objects:

Match Objects
-------------

.. class:: MatchObject

   Match objects always have a boolean value of ``True``.
   Since :meth:`~RegexObject.match` and :meth:`~RegexObject.search` return ``None``
   when there is no match, you can test whether there was a match with a simple
   ``if`` statement::

      match = re.search(pattern, string)
      if match:
          process(match)

   Match objects support the following methods and attributes:


   .. method:: MatchObject.expand(template)

      Return the string obtained by doing backslash substitution on the template
      string *template*, as done by the :meth:`~RegexObject.sub` method. Escapes
      such as ``\n`` are converted to the appropriate characters, and numeric
      backreferences (``\1``, ``\2``) and named backreferences (``\g<1>``,
      ``\g<name>``) are replaced by the contents of the corresponding group.


   .. method:: MatchObject.group([group1, ...])

      Return one or more subgroups of the match. If there is a single argument, the
      result is a single string; if there are multiple arguments, the result is a
      tuple with one item per argument. Without arguments, *group1* defaults to zero
      (the whole match is returned). If a *groupN* argument is zero, the corresponding
      return value is the entire matching string; if it is in the inclusive range
      [1..99], it is the string matching the corresponding parenthesized group. If a
      group number is negative or larger than the number of groups defined in the
      pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
      part of the pattern that did not match, the corresponding result is ``None``.
      If a group is contained in a part of the pattern that matched multiple times,
      the last match is returned.

      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)       # The entire match
      'Isaac Newton'
      >>> m.group(1)       # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)       # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')

      If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
      arguments may also be strings identifying groups by their group name. If a
      string argument is not used as a group name in the pattern, an :exc:`IndexError`
      exception is raised.

      A moderately complicated example:

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.group('first_name')
      'Malcolm'
      >>> m.group('last_name')
      'Reynolds'

      Named groups can also be referred to by their index:

      >>> m.group(1)
      'Malcolm'
      >>> m.group(2)
      'Reynolds'

      If a group matches multiple times, only the last match is accessible:

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)                        # Returns only the last match.
      'c3'


   .. method:: MatchObject.groups([default])

      Return a tuple containing all the subgroups of the match, from 1 up to however
      many groups are in the pattern. The *default* argument is used for groups that
      did not participate in the match; it defaults to ``None``. (Incompatibility
      note: in the original Python 1.5 release, if the tuple was one element long, a
      string would be returned instead. In later versions (from 1.5.1 on), a
      singleton tuple is returned in such cases.)

      For example:

      >>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
      >>> m.groups()
      ('24', '1632')

      If we make the decimal place and everything after it optional, not all groups
      might participate in the match. These groups will default to ``None`` unless
      the *default* argument is given:

      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()      # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')   # Now, the second group defaults to '0'.
      ('24', '0')


   .. method:: MatchObject.groupdict([default])

      Return a dictionary containing all the *named* subgroups of the match, keyed by
      the subgroup name. The *default* argument is used for groups that did not
      participate in the match; it defaults to ``None``. For example:

      >>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
      >>> m.groupdict()
      {'first_name': 'Malcolm', 'last_name': 'Reynolds'}


   .. method:: MatchObject.start([group])
               MatchObject.end([group])

      Return the indices of the start and end of the substring matched by *group*;
      *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
      *group* exists but did not contribute to the match. For a match object *m*, and
      a group *g* that did contribute to the match, the substring matched by group *g*
      (equivalent to ``m.group(g)``) is ::

         m.string[m.start(g):m.end(g)]

      Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched an
      empty string. For example, after ``m = re.search('b(c?)', 'cba')``,
      ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
      2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

      An example that will remove *remove_this* from email addresses:

      >>> email = "tony@tiremove_thisger.net"
      >>> m = re.search("remove_this", email)
      >>> email[:m.start()] + email[m.end():]
      'tony@tiger.net'


   .. method:: MatchObject.span([group])

      For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
      m.end(group))``. Note that if *group* did not contribute to the match, this is
      ``(-1, -1)``. *group* defaults to zero, the entire match.


   .. attribute:: MatchObject.pos

      The value of *pos* which was passed to the :meth:`~RegexObject.search` or
      :meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
      index into the string at which the RE engine started looking for a match.


   .. attribute:: MatchObject.endpos

      The value of *endpos* which was passed to the :meth:`~RegexObject.search` or
      :meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
      index into the string beyond which the RE engine will not go.


   .. attribute:: MatchObject.lastindex

      The integer index of the last matched capturing group, or ``None`` if no group
      was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
      ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
      the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
      string.


   .. attribute:: MatchObject.lastgroup

      The name of the last matched capturing group, or ``None`` if the group didn't
      have a name, or if no group was matched at all.


   .. attribute:: MatchObject.re

      The regular expression object whose :meth:`~RegexObject.match` or
      :meth:`~RegexObject.search` method produced this :class:`MatchObject`
      instance.


   .. attribute:: MatchObject.string

      The string passed to :meth:`~RegexObject.match` or
      :meth:`~RegexObject.search`.

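Several of the match object methods and attributes described above can be seen
together in the sketch below. The pattern, the input string and the variable
names are invented for illustration; the optional group is deliberately left
unmatched.

```python
import re

m = re.match(r'(?P<whole>\d+)(?:\.(?P<frac>\d+))?', '24 apples')

# The optional group did not participate in the match.
frac_text = m.group('frac')
frac_span = m.span('frac')
with_default = m.groupdict('0')

# Only 'whole' matched, so it is the last matched group.
last_index = m.lastindex
last_name = m.lastgroup

# expand() fills in a template the same way sub() does.
label = m.expand(r'count=\g<whole>')

# The matched text can be recovered from the original string.
recovered = m.string[m.start('whole'):m.end('whole')]
```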

Examples
--------


Checking For a Pair
^^^^^^^^^^^^^^^^^^^

In this example, we'll use the following helper function to display match
objects a little more gracefully:

.. testcode::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
representing the card with that value.

To see if a given string is a valid hand, one could do the following:

   >>> valid = re.compile(r"^[a2-9tjqk]{5}$")
   >>> displaymatch(valid.match("akt5q"))  # Valid.
   "<Match: 'akt5q', groups=()>"
   >>> displaymatch(valid.match("akt5e"))  # Invalid.
   >>> displaymatch(valid.match("akt"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two cards of the same value.
To match this with a regular expression, one could use backreferences as follows:

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the
:meth:`~MatchObject.group` method of :class:`MatchObject` in the following
manner:

.. doctest::

   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'

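To avoid the :exc:`AttributeError` when a hand contains no pair, the result of
the match can be tested before calling :meth:`~MatchObject.group`. The helper
``pair_card`` below is a hypothetical wrapper written for this example, not
something the module provides.

```python
import re

pair = re.compile(r".*(.).*\1")

def pair_card(hand):
    """Return the card that appears at least twice in *hand*, or None."""
    match = pair.match(hand)
    if match is None:
        return None
    return match.group(1)
```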

Simulating scanf()
^^^^^^^^^^^^^^^^^^

.. index:: single: scanf()

Python does not currently have an equivalent to :c:func:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:c:func:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :c:func:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :c:func:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings

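Applied from Python, the regular expression above might be used as in the
following sketch. The variable names are chosen for illustration only.

```python
import re

line = '/usr/sbin/sendmail - 0 errors, 4 warnings'
m = re.match(r'(\S+) - (\d+) errors, (\d+) warnings', line)

# Unlike scanf(), the regular expression yields strings, so the numeric
# fields have to be converted explicitly.
filename = m.group(1)
error_count = int(m.group(2))
warning_count = int(m.group(3))
```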
Georg Brandlb8df1562007-12-05 18:30:48 +00001108
Ezio Melottid9de93e2012-02-29 13:37:07 +02001109.. _search-vs-match:
Georg Brandlb8df1562007-12-05 18:30:48 +00001110
1111search() vs. match()
1112^^^^^^^^^^^^^^^^^^^^
1113
Ezio Melottid9de93e2012-02-29 13:37:07 +02001114.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
Georg Brandlb8df1562007-12-05 18:30:48 +00001115
Ezio Melottid9de93e2012-02-29 13:37:07 +02001116Python offers two different primitive operations based on regular expressions:
1117:func:`re.match` checks for a match only at the beginning of the string, while
1118:func:`re.search` checks for a match anywhere in the string (this is what Perl
1119does by default).
1120
1121For example::
1122
1123 >>> re.match("c", "abcdef") # No match
1124 >>> re.search("c", "abcdef") # Match
Georg Brandl6199e322008-03-22 12:04:26 +00001125 <_sre.SRE_Match object at ...>
Georg Brandlb8df1562007-12-05 18:30:48 +00001126
Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::

   >>> re.match("c", "abcdef")  # No match
   >>> re.search("^c", "abcdef")  # No match
   >>> re.search("^a", "abcdef")  # Match
   <_sre.SRE_Match object at ...>

Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
beginning of the string, whereas using :func:`search` with a regular expression
beginning with ``'^'`` will match at the beginning of each line.

   >>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
   >>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
   <_sre.SRE_Match object at ...>

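To see the effect of :const:`MULTILINE` on ``'^'`` more directly, the following
sketch collects the positions at which a leading ``'^'`` can match. The sample
string and the variable names are made up for this illustration::

   import re

   text = "A\nB\nX"  # hypothetical three-line input

   # Without MULTILINE, '^' matches only at the very start of the string.
   plain = [m.start() for m in re.finditer(r"^\w", text)]

   # With MULTILINE, '^' also matches immediately after each newline.
   multi = [m.start() for m in re.finditer(r"^\w", text, re.MULTILINE)]

   # match() stays anchored at position 0, even in MULTILINE mode.
   anchored = re.match("X", text, re.MULTILINE)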

Making a Phonebook
^^^^^^^^^^^^^^^^^^

:func:`split` splits a string into a list delimited by the passed pattern. It
is invaluable for converting textual data into data structures that can be
easily read and modified by Python, as demonstrated in the following example
that creates a phonebook.

First, here is the input. Normally it would come from a file; here we use
triple-quoted string syntax:

   >>> text = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string
into a list with each nonempty line having its own entry:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> entries = re.split("\n+", text)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address contains spaces, which are also our splitting pattern:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 3) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
house number from the street name:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> [re.split(":? ", entry, 4) for entry in entries]
   [['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
   ['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
   ['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
   ['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]


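One possible next step is to turn the split entries into a mapping. This is
only a sketch: the ``phonebook`` name and the dictionary layout are assumptions
made for this illustration and are not part of the module::

   import re

   text = ("Ross McFluff: 834.345.1254 155 Elm Street\n"
           "\n"
           "Ronald Heathmore: 892.345.3428 436 Finley Avenue\n"
           "Frank Burger: 925.541.7625 662 South Dogwood Way\n"
           "\n\n"
           "Heather Albrecht: 548.326.4584 919 Park Place")

   phonebook = {}
   for entry in re.split(r"\n+", text):
       # maxsplit=3 keeps the address, which contains spaces, in one piece.
       first, last, phone, address = re.split(":? ", entry, maxsplit=3)
       phonebook[first + " " + last] = {"phone": phone, "address": address}

Keying on the full name is convenient for lookups, but it would silently
overwrite entries if two people shared a name.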

Text Munging
^^^^^^^^^^^^

:func:`sub` replaces every occurrence of a pattern with a string or the
result of a function. This example demonstrates using :func:`sub` with
a function to "munge" text, or randomize the order of all the characters
in each word of a sentence except for the first and last characters::

   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

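Because the munged output differs on every run, a repeatable variant can be
useful when testing. The sketch below assumes a private :class:`random.Random`
instance with an arbitrary seed. The ``rng`` and ``munge`` names are introduced
only for this illustration::

   import random
   import re

   rng = random.Random(42)  # arbitrary seed, chosen only for repeatability

   def repl(m):
       # Shuffle the inner letters; keep the first and last characters fixed.
       inner_word = list(m.group(2))
       rng.shuffle(inner_word)
       return m.group(1) + "".join(inner_word) + m.group(3)

   def munge(text):
       # Words shorter than three characters cannot match and stay unchanged.
       return re.sub(r"(\w)(\w+)(\w)", repl, text)

   text = "Professor Abdolmalek, please report your absences promptly."
   munged = munge(text)

The exact shuffled result depends on the Python version, so it is not shown
here. Only the first and last letter of each word, and the set of letters in
between, are guaranteed to be preserved.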

Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does. For example, if one were a writer and wanted to
find all of the adverbs in some text, he or she might use :func:`findall` in
the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> re.findall(r"\w+ly", text)
   ['carefully', 'quickly']

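The pattern ``\w+ly`` can also match inside a longer word. As a hedged
refinement, the sketch below adds ``\b`` word boundaries so that only whole
words ending in "ly" are reported. The sample sentence is made up for this
illustration, and even the stricter pattern is only a rough heuristic for
adverbs::

   import re

   text = "The slyness of the fox was quickly noticed."  # made-up sentence

   # Without boundaries, the "sly" at the start of "slyness" is reported too.
   loose = re.findall(r"\w+ly", text)

   # With \b on both sides, only whole words ending in "ly" are returned.
   strict = re.findall(r"\b\w+ly\b", text)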

Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides instances of
:class:`MatchObject` instead of strings. Continuing with the previous example,
if one were a writer who wanted to find all of the adverbs *and their positions*
in some text, he or she would use :func:`finditer` in the following manner:

   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
   07-16: carefully
   40-47: quickly

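When the positions are needed as data rather than printed output, the same
loop can be written as a list comprehension. This is a small sketch, and the
``positions`` name is chosen only for this illustration. It uses the match
object's ``span()`` method, which returns the ``(start, end)`` pair::

   import re

   text = "He was carefully disguised but captured quickly by police."

   # Pair each matched word with its (start, end) offsets in the text.
   positions = [(m.group(0), m.span()) for m in re.finditer(r"\w+ly", text)]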

Raw String Notation
^^^^^^^^^^^^^^^^^^^

Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
every backslash (``'\'``) in a regular expression would have to be prefixed with
another one to escape it. For example, the two following lines of code are
functionally identical:

   >>> re.match(r"\W(.)\1\W", " ff ")
   <_sre.SRE_Match object at ...>
   >>> re.match("\\W(.)\\1\\W", " ff ")
   <_sre.SRE_Match object at ...>

When one wants to match a literal backslash, it must be escaped in the regular
expression. With raw string notation, this means ``r"\\"``. Without raw string
notation, one must use ``"\\\\"``, making the following lines of code
functionally identical:

   >>> re.match(r"\\", r"\\")
   <_sre.SRE_Match object at ...>
   >>> re.match("\\\\", r"\\")
   <_sre.SRE_Match object at ...>
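
The equivalence can also be checked on the pattern strings themselves, without
involving the regular expression engine. This is a short sketch, with variable
names and a sample path chosen only for this illustration::

   import re

   # Both spellings produce exactly the same pattern string.
   same_pattern = (r"\W(.)\1\W" == "\\W(.)\\1\\W")

   # r"\n" is two characters (a backslash and 'n'); "\n" is one newline.
   raw_len = len(r"\n")
   cooked_len = len("\n")

   # A literal backslash in the subject is matched by the pattern r"\\".
   found = re.search(r"\\", "C:\\temp")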