:mod:`re` --- Regular expression operations
===========================================
4
5.. module:: re
6 :synopsis: Regular expression operations.
7.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
8.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
9
10
11
12
13This module provides regular expression matching operations similar to
14those found in Perl. Both patterns and strings to be searched can be
15Unicode strings as well as 8-bit strings. The :mod:`re` module is
16always available.
17
18Regular expressions use the backslash character (``'\'``) to indicate
19special forms or to allow special characters to be used without invoking
20their special meaning. This collides with Python's usage of the same
21character for the same purpose in string literals; for example, to match
22a literal backslash, one might have to write ``'\\\\'`` as the pattern
23string, because the regular expression must be ``\\``, and each
24backslash must be expressed as ``\\`` inside a regular Python string
25literal.
26
27The solution is to use Python's raw string notation for regular expression
28patterns; backslashes are not handled in any special way in a string literal
29prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
30``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
31newline. Usually patterns will be expressed in Python code using this raw string
32notation.
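As a small illustration of the difference (the variable names and the sample
string below are made up for this sketch), the following compares a raw and an
ordinary string literal, and shows two equivalent spellings of a pattern that
matches one literal backslash:

```python
import re

# A raw string keeps the backslash: two characters, '\' and 'n'.
raw = r"\n"
# An ordinary string literal turns \n into a single newline character.
cooked = "\n"
print(len(raw), len(cooked))

# The RE that matches one backslash is \\ (two characters).
subject = "C:\\temp"                      # the 7-character string C:\temp
with_raw = re.search(r"\\", subject)      # raw string: write the RE as-is
without_raw = re.search("\\\\", subject)  # ordinary string: double every backslash
print(with_raw.start(), without_raw.start())
```

Both searches find the backslash at the same position; the raw-string form is
simply easier to read.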
33
34.. seealso::
35
36 Mastering Regular Expressions
37 Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
38 second edition of the book no longer covers Python at all, but the first
39 edition covered writing good regular expression patterns in great detail.
40
41
42.. _re-syntax:
43
44Regular Expression Syntax
45-------------------------
46
47A regular expression (or RE) specifies a set of strings that matches it; the
48functions in this module let you check if a particular string matches a given
49regular expression (or if a given regular expression matches a particular
50string, which comes down to the same thing).
51
Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*.  This holds unless *A* or *B* contains low-precedence
operations, boundary conditions between *A* and *B*, or numbered group
references.  Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here.  For details of the theory
and implementation of regular expressions, consult the Friedl book referenced
above, or almost any textbook about compiler construction.
61
62A brief explanation of the format of regular expressions follows. For further
63information and a gentler presentation, consult the Regular Expression HOWTO,
64accessible from http://www.python.org/doc/howto/.
65
66Regular expressions can contain both special and ordinary characters. Most
67ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
68expressions; they simply match themselves. You can concatenate ordinary
69characters, so ``last`` matches the string ``'last'``. (In the rest of this
70section, we'll write RE's in ``this special style``, usually without quotes, and
71strings to be matched ``'in single quotes'``.)
72
73Some characters, like ``'|'`` or ``'('``, are special. Special
74characters either stand for classes of ordinary characters, or affect
75how the regular expressions around them are interpreted. Regular
76expression pattern strings may not contain null bytes, but can specify
77the null byte using the ``\number`` notation, e.g., ``'\x00'``.
78
79
80The special characters are:
81
82.. %
83
84``'.'``
85 (Dot.) In the default mode, this matches any character except a newline. If
86 the :const:`DOTALL` flag has been specified, this matches any character
87 including a newline.
88
89``'^'``
90 (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
91 matches immediately after each newline.
92
93``'$'``
94 Matches the end of the string or just before the newline at the end of the
95 string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
96 matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
97 only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
98 matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode.
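The ``foo.$`` case described above can be checked with a short sketch (the
variable names are illustrative only):

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: '$' matches only at the very end, or just before a final newline.
default_hit = re.search(r'foo.$', text)

# MULTILINE mode: '$' also matches before every newline.
multiline_hit = re.search(r'foo.$', text, re.MULTILINE)

print(default_hit.group(0), multiline_hit.group(0))
```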
99
100``'*'``
101 Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
102 many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
103 by any number of 'b's.
104
105``'+'``
106 Causes the resulting RE to match 1 or more repetitions of the preceding RE.
107 ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
108 match just 'a'.
109
110``'?'``
111 Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
112 ``ab?`` will match either 'a' or 'ab'.
113
114``*?``, ``+?``, ``??``
115 The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
116 as much text as possible. Sometimes this behaviour isn't desired; if the RE
117 ``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
118 string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
119 perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
120 characters as possible will be matched. Using ``.*?`` in the previous
121 expression will match only ``'<H1>'``.
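A minimal sketch of the greedy and non-greedy behaviour just described, using
the same ``'<H1>title</H1>'`` string (the variable names are illustrative):

```python
import re

html = '<H1>title</H1>'

# Greedy: '.*' consumes as much as it can, so the match runs to the last '>'.
greedy = re.match(r'<.*>', html).group(0)

# Non-greedy: '.*?' stops at the first '>' that lets the pattern succeed.
minimal = re.match(r'<.*?>', html).group(0)

print(greedy)
print(minimal)
```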
122
123``{m}``
124 Specifies that exactly *m* copies of the previous RE should be matched; fewer
125 matches cause the entire RE not to match. For example, ``a{6}`` will match
126 exactly six ``'a'`` characters, but not five.
127
128``{m,n}``
129 Causes the resulting RE to match from *m* to *n* repetitions of the preceding
130 RE, attempting to match as many repetitions as possible. For example,
131 ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
132 lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
133 example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
134 followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
135 modifier would be confused with the previously described form.
136
137``{m,n}?``
138 Causes the resulting RE to match from *m* to *n* repetitions of the preceding
139 RE, attempting to match as *few* repetitions as possible. This is the
140 non-greedy version of the previous qualifier. For example, on the
141 6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
142 while ``a{3,5}?`` will only match 3 characters.
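The counted repetition forms can be compared side by side; this is only a
sketch, and the variable names are illustrative:

```python
import re

six_a = 'aaaaaa'

# Greedy bounded repeat takes as many as allowed (5 of the 6 characters).
greedy_len = len(re.match(r'a{3,5}', six_a).group(0))

# The non-greedy form stops as soon as the minimum (3) is satisfied.
minimal_len = len(re.match(r'a{3,5}?', six_a).group(0))

# An open upper bound: at least four 'a' characters followed by 'b'.
four_ok = re.match(r'a{4,}b', 'aaaab') is not None
three_fails = re.match(r'a{4,}b', 'aaab') is None

print(greedy_len, minimal_len, four_ok, three_fails)
```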
143
144``'\'``
145 Either escapes special characters (permitting you to match characters like
146 ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
147 sequences are discussed below.
148
149 If you're not using a raw string to express the pattern, remember that Python
150 also uses the backslash as an escape sequence in string literals; if the escape
151 sequence isn't recognized by Python's parser, the backslash and subsequent
152 character are included in the resulting string. However, if Python would
153 recognize the resulting sequence, the backslash should be repeated twice. This
154 is complicated and hard to understand, so it's highly recommended that you use
155 raw strings for all but the simplest expressions.
156
157``[]``
   Used to indicate a set of characters.  Characters can be listed individually, or
   a range of characters can be indicated by giving two characters and separating
   them by a ``'-'``.  Special characters are not active inside sets.  For example,
   ``[akm$]`` will match any of the characters ``'a'``, ``'k'``,
   ``'m'``, or ``'$'``; ``[a-z]`` will match any lowercase letter, and
   ``[a-zA-Z0-9]`` matches any letter or digit.  Character classes such
   as ``\w`` or ``\S`` (defined below) are also acceptable inside a
   set, although the characters they match depend on whether :const:`LOCALE`
   or :const:`UNICODE` mode is in force.  If you want to include a
   ``']'`` or a ``'-'`` inside a set, precede it with a backslash, or
   place it as the first character.  The pattern ``[]]`` will match
   ``']'``, for example.
170
171 You can match the characters not within a range by :dfn:`complementing` the set.
172 This is indicated by including a ``'^'`` as the first character of the set;
173 ``'^'`` elsewhere will simply match the ``'^'`` character. For example,
174 ``[^5]`` will match any character except ``'5'``, and ``[^^]`` will match any
175 character except ``'^'``.
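The set notation above can be exercised with a few short calls.  This is a
sketch; the sample strings and variable names are invented for illustration:

```python
import re

# '$' has no special meaning inside a set.
listed = re.findall(r'[akm$]', 'a$b k')

# A leading '^' complements the set.
not_five = re.findall(r'[^5]', '1525')

# '^' anywhere else in the set is literal, so [^^] means "anything but '^'".
not_caret = re.findall(r'[^^]', 'a^b')

# ']' placed first in the set is taken literally.
bracket = re.match(r'[]]', ']')

print(listed, not_five, not_caret, bracket is not None)
```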
176
177``'|'``
178 ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
179 will match either A or B. An arbitrary number of REs can be separated by the
180 ``'|'`` in this way. This can be used inside groups (see below) as well. As
181 the target string is scanned, REs separated by ``'|'`` are tried from left to
182 right. When one pattern completely matches, that branch is accepted. This means
183 that once ``A`` matches, ``B`` will not be tested further, even if it would
184 produce a longer overall match. In other words, the ``'|'`` operator is never
185 greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
186 character class, as in ``[|]``.
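The left-to-right behaviour of ``'|'`` is easiest to see when one branch is a
prefix of the other.  A small sketch with illustrative names:

```python
import re

# The first branch that matches wins, even though the second would match more.
short_first = re.match(r'a|ab', 'ab').group(0)

# Reordering the branches changes the result.
long_first = re.match(r'ab|a', 'ab').group(0)

# Two ways to match a literal '|'.
escaped = re.findall(r'\|', 'x|y')
in_class = re.findall(r'[|]', 'x|y')

print(short_first, long_first, escaped, in_class)
```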
187
188``(...)``
189 Matches whatever regular expression is inside the parentheses, and indicates the
190 start and end of a group; the contents of a group can be retrieved after a match
191 has been performed, and can be matched later in the string with the ``\number``
192 special sequence, described below. To match the literals ``'('`` or ``')'``,
193 use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
194
195``(?...)``
196 This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
197 otherwise). The first character after the ``'?'`` determines what the meaning
198 and further syntax of the construct is. Extensions usually do not create a new
199 group; ``(?P<name>...)`` is the only exception to this rule. Following are the
200 currently supported extensions.
201
202``(?iLmsux)``
203 (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
204 ``'u'``, ``'x'``.) The group matches the empty string; the letters
205 set the corresponding flags: :const:`re.I` (ignore case),
206 :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
207 :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
208 and :const:`re.X` (verbose), for the entire regular expression. (The
209 flags are described in :ref:`contents-of-module-re`.) This
210 is useful if you wish to include the flags as part of the regular
211 expression, instead of passing a *flag* argument to the
212 :func:`compile` function.
213
214 Note that the ``(?x)`` flag changes how the expression is parsed. It should be
215 used first in the expression string, or after one or more whitespace characters.
216 If there are non-whitespace characters before the flag, the results are
217 undefined.
218
219``(?:...)``
220 A non-grouping version of regular parentheses. Matches whatever regular
221 expression is inside the parentheses, but the substring matched by the group
222 *cannot* be retrieved after performing a match or referenced later in the
223 pattern.
224
225``(?P<name>...)``
226 Similar to regular parentheses, but the substring matched by the group is
227 accessible via the symbolic group name *name*. Group names must be valid Python
228 identifiers, and each group name must be defined only once within a regular
229 expression. A symbolic group is also a numbered group, just as if the group
230 were not named. So the group named 'id' in the example below can also be
231 referenced as the numbered group 1.
232
233 For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
234 referenced by its name in arguments to methods of match objects, such as
235 ``m.group('id')`` or ``m.end('id')``, and also by name in pattern text (for
236 example, ``(?P=id)``) and replacement text (such as ``\g<id>``).
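The following sketch shows the ``id`` group from the example being referenced
in each of the ways mentioned: by name, by number, in pattern text and in
replacement text.  The sample strings and variable names are illustrative:

```python
import re

ident = re.compile(r'(?P<id>[a-zA-Z_]\w*)')
m = ident.match('spam_1 = 2')

by_name = m.group('id')      # access through the symbolic name
by_number = m.group(1)       # the same group is also numbered group 1
end_of_id = m.end('id')      # position just past the matched identifier

# Reference by name in replacement text.
wrapped = ident.sub(r'<\g<id>>', 'a b')

# Reference by name in pattern text: the same identifier twice in a row.
repeated = re.search(r'(?P<id>[a-zA-Z_]\w*) (?P=id)', 'say hello hello')

print(by_name, by_number, end_of_id, wrapped, repeated.group(0))
```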
237
238``(?P=name)``
239 Matches whatever text was matched by the earlier group named *name*.
240
241``(?#...)``
242 A comment; the contents of the parentheses are simply ignored.
243
244``(?=...)``
245 Matches if ``...`` matches next, but doesn't consume any of the string. This is
246 called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
247 ``'Isaac '`` only if it's followed by ``'Asimov'``.
248
249``(?!...)``
250 Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
251 For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
252 followed by ``'Asimov'``.
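Both lookahead forms can be tried on the ``Isaac`` example; the sketch below
uses illustrative variable names:

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not part of the match.
pos_hit = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
pos_miss = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')

# Negative lookahead: 'Asimov' must not follow.
neg_hit = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')
neg_miss = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')

print(repr(pos_hit.group(0)), pos_miss, repr(neg_hit.group(0)), neg_miss)
```

In both successful cases the match is just ``'Isaac '``; the text examined by
the assertion is left unconsumed.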
253
254``(?<=...)``
255 Matches if the current position in the string is preceded by a match for ``...``
256 that ends at the current position. This is called a :dfn:`positive lookbehind
257 assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
258 lookbehind will back up 3 characters and check if the contained pattern matches.
259 The contained pattern must only match strings of some fixed length, meaning that
260 ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
261 patterns which start with positive lookbehind assertions will never match at the
262 beginning of the string being searched; you will most likely want to use the
263 :func:`search` function rather than the :func:`match` function::
264
      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen::

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'
275
276``(?<!...)``
277 Matches if the current position in the string is not preceded by a match for
278 ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
279 positive lookbehind assertions, the contained pattern must only match strings of
280 some fixed length. Patterns which start with negative lookbehind assertions may
281 match at the beginning of the string being searched.
282
283``(?(id/name)yes-pattern|no-pattern)``
284 Will try to match with ``yes-pattern`` if the group with given *id* or *name*
285 exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
286 can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
287 matching pattern, which will match with ``'<user@host.com>'`` as well as
288 ``'user@host.com'``, but not with ``'<user@host.com'``.
289
290 .. versionadded:: 2.4
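The conditional pattern above can be checked as follows.  This sketch uses
``match``, which anchors the attempt at the start of the string; with
``search`` the unbalanced input could still match at a later position.  The
variable names are illustrative:

```python
import re

email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

# Group 1 matched '<', so the closing '>' is required and present.
bracketed = email.match('<user@host.com>')

# Group 1 did not participate, so nothing further is required.
bare = email.match('user@host.com')

# Group 1 matched '<' but there is no closing '>'.
unbalanced = email.match('<user@host.com')

print(bracketed.group(0), bare.group(0), unbalanced)
```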
291
292The special sequences consist of ``'\'`` and a character from the list below.
293If the ordinary character is not on the list, then the resulting RE will match
294the second character. For example, ``\$`` matches the character ``'$'``.
295
296.. %
297
298``\number``
299 Matches the contents of the group of the same number. Groups are numbered
300 starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
301 but not ``'the end'`` (note the space after the group). This special sequence
302 can only be used to match one of the first 99 groups. If the first digit of
303 *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
304 a group match, but as the character with octal value *number*. Inside the
305 ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
306 characters.
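A short sketch of the ``(.+) \1`` backreference from the description above
(variable names are illustrative):

```python
import re

doubled = re.compile(r'(.+) \1')

# The text captured by group 1 must appear again after the space.
words = doubled.match('the the')
digits = doubled.match('55 55')
different = doubled.match('the end')

print(words.group(1), digits.group(1), different)
```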
307
308``\A``
309 Matches only at the start of the string.
310
311``\b``
   Matches the empty string, but only at the beginning or end of a word.  A word is
   defined as a sequence of alphanumeric or underscore characters, so the end of a
   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
   Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
   precise set of characters deemed to be alphanumeric depends on the values of the
   ``UNICODE`` and ``LOCALE`` flags.  Inside a character range, ``\b`` represents
   the backspace character, for compatibility with Python's string literals.
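To illustrate the word-boundary rule, here is a small sketch; the sample text
and variable names are invented:

```python
import re

# 'cat' counts only where it is not glued to another word character;
# the underscore in 'cat_' is a word character, so no boundary follows it.
whole_words = re.findall(r'\bcat\b', 'cat concat cat_ cat.')

# Inside a character class, \b means the backspace character (0x08).
backspace = re.search(r'[\b]', 'a\x08c')

print(len(whole_words), backspace.start())
```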
319
320``\B``
321 Matches the empty string, but only when it is *not* at the beginning or end of a
322 word. This is just the opposite of ``\b``, so is also subject to the settings
323 of ``LOCALE`` and ``UNICODE``.
324
325``\d``
326 When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
327 is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
328 whatever is classified as a digit in the Unicode character properties database.
329
330``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``.  With :const:`UNICODE`, it
   will match anything other than characters marked as digits in the Unicode
   character properties database.
335
336``\s``
337 When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
338 any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
339 :const:`LOCALE`, it will match this set plus whatever characters are defined as
340 space for the current locale. If :const:`UNICODE` is set, this will match the
341 characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
342 character properties database.
343
344``\S``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``.
   With :const:`LOCALE`, it will match any character not in this set, and not
   defined as space in the current locale.  If :const:`UNICODE` is set, this will
   match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
   the Unicode character properties database.
351
352``\w``
353 When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
354 any alphanumeric character and the underscore; this is equivalent to the set
355 ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
356 whatever characters are defined as alphanumeric for the current locale. If
357 :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
358 is classified as alphanumeric in the Unicode character properties database.
359
360``\W``
361 When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
362 any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
363 With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
364 not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
365 this will match anything other than ``[0-9_]`` and characters marked as
366 alphanumeric in the Unicode character properties database.
367
368``\Z``
369 Matches only at the end of the string.
370
371Most of the standard escapes supported by Python string literals are also
372accepted by the regular expression parser::
373
   \a      \b      \f      \n
   \r      \t      \v      \x
   \\
377
378Octal escapes are included in a limited form: If the first digit is a 0, or if
379there are three octal digits, it is considered an octal escape. Otherwise, it is
380a group reference. As for string literals, octal escapes are always at most
381three digits in length.
382
383.. % Note the lack of a period in the section title; it causes problems
384.. % with readers of the GNU info version. See http://www.python.org/sf/581414.
385
386
387.. _matching-searching:
388
389Matching vs Searching
390---------------------
391
392.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
393
394
Python offers two different primitive operations based on regular expressions:
**match** checks for a match only at the beginning of the string, while
**search** checks for a match anywhere in the string (this is what Perl does
by default).

Note that match may differ from search even when using a regular expression
beginning with ``'^'``: ``'^'`` matches only at the start of the string, or in
:const:`MULTILINE` mode also immediately following a newline.  The "match"
operation succeeds only if the pattern matches at the start of the string
regardless of mode, or at the starting position given by the optional *pos*
argument regardless of whether a newline precedes it.
406
407.. % Examples from Tim Peters:
408
::

   re.compile("a").match("ba", 1)           # succeeds
   re.compile("^a").search("ba", 1)         # fails; 'a' not at start
   re.compile("^a").search("\na", 1)        # fails; 'a' not at start
   re.compile("^a", re.M).search("\na", 1)  # succeeds
   re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
416
417
418.. _contents-of-module-re:
419
420Module Contents
421---------------
422
The module defines several functions, constants, and an exception.  Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions.  Most non-trivial applications use the compiled form.
427
428
429.. function:: compile(pattern[, flags])
430
431 Compile a regular expression pattern into a regular expression object, which can
432 be used for matching using its :func:`match` and :func:`search` methods,
433 described below.
434
435 The expression's behaviour can be modified by specifying a *flags* value.
436 Values can be any of the following variables, combined using bitwise OR (the
437 ``|`` operator).
438
   The sequence ::

      prog = re.compile(pat)
      result = prog.match(str)

   is equivalent to ::

      result = re.match(pat, str)

   but the version using :func:`compile` is more efficient when the expression will
   be used several times in a single program.
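A runnable version of the equivalence, plus an example of combining flags with
``|``, might look like this.  The pattern, the sample text and the variable
names are placeholders chosen for the sketch:

```python
import re

pat = r'\d+'            # illustrative pattern
text = '42 apples'      # illustrative subject string

# Compile once, then reuse the compiled object.
prog = re.compile(pat)
compiled_result = prog.match(text)

# The module-level function gives the same result for a one-off match.
direct_result = re.match(pat, text)
print(compiled_result.group(0), direct_result.group(0))

# Flags are combined with bitwise OR.
loose = re.compile(r'a.c', re.IGNORECASE | re.DOTALL)
print(loose.match('A\nC') is not None)
```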
450
451 .. % (The compiled version of the last pattern passed to
452 .. % \function{re.match()} or \function{re.search()} is cached, so
453 .. % programs that use only a single regular expression at a time needn't
454 .. % worry about compiling regular expressions.)
455
456
457.. data:: I
458 IGNORECASE
459
460 Perform case-insensitive matching; expressions like ``[A-Z]`` will match
461 lowercase letters, too. This is not affected by the current locale.
462
463
464.. data:: L
465 LOCALE
466
467 Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the current
468 locale.
469
470
471.. data:: M
472 MULTILINE
473
474 When specified, the pattern character ``'^'`` matches at the beginning of the
475 string and at the beginning of each line (immediately following each newline);
476 and the pattern character ``'$'`` matches at the end of the string and at the
477 end of each line (immediately preceding each newline). By default, ``'^'``
478 matches only at the beginning of the string, and ``'$'`` only at the end of the
479 string and immediately before the newline (if any) at the end of the string.
480
481
482.. data:: S
483 DOTALL
484
485 Make the ``'.'`` special character match any character at all, including a
486 newline; without this flag, ``'.'`` will match anything *except* a newline.
487
488
489.. data:: U
490 UNICODE
491
492 Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
493 on the Unicode character properties database.
494
495 .. versionadded:: 2.0
496
497
498.. data:: X
499 VERBOSE
500
   This flag allows you to write regular expressions that look nicer.  Whitespace
   within the pattern is ignored, except when in a character class or preceded by
   an unescaped backslash, and, when a line contains a ``'#'`` neither in a
   character class nor preceded by an unescaped backslash, all characters from the
   leftmost such ``'#'`` through the end of the line are ignored.
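As a sketch of what this permits, the two patterns below are intended to be
equivalent; the decimal-number pattern and the variable names are invented for
illustration:

```python
import re

# Verbose form: whitespace and '#' comments in the pattern are ignored.
verbose_number = re.compile(r"""
    \d+     # integer part
    \.      # decimal point
    \d*     # optional fractional digits
""", re.VERBOSE)

# The same pattern written compactly.
compact_number = re.compile(r"\d+\.\d*")

sample = '3.14 rest'
print(verbose_number.match(sample).group(0))
print(compact_number.match(sample).group(0))
```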
506
507 .. % XXX should add an example here
508
509
510.. function:: search(pattern, string[, flags])
511
512 Scan through *string* looking for a location where the regular expression
513 *pattern* produces a match, and return a corresponding :class:`MatchObject`
514 instance. Return ``None`` if no position in the string matches the pattern; note
515 that this is different from finding a zero-length match at some point in the
516 string.
517
518
519.. function:: match(pattern, string[, flags])
520
521 If zero or more characters at the beginning of *string* match the regular
522 expression *pattern*, return a corresponding :class:`MatchObject` instance.
523 Return ``None`` if the string does not match the pattern; note that this is
524 different from a zero-length match.
525
526 .. note::
527
528 If you want to locate a match anywhere in *string*, use :meth:`search` instead.
529
530
531.. function:: split(pattern, string[, maxsplit=0])
532
533 Split *string* by the occurrences of *pattern*. If capturing parentheses are
534 used in *pattern*, then the text of all groups in the pattern are also returned
535 as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
536 splits occur, and the remainder of the string is returned as the final element
537 of the list. (Incompatibility note: in the original Python 1.5 release,
538 *maxsplit* was ignored. This has been fixed in later releases.) ::
539
      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
546
547
548.. function:: findall(pattern, string[, flags])
549
550 Return a list of all non-overlapping matches of *pattern* in *string*. If one
551 or more groups are present in the pattern, return a list of groups; this will be
552 a list of tuples if the pattern has more than one group. Empty matches are
553 included in the result unless they touch the beginning of another match.
554
555 .. versionadded:: 1.5.2
556
557 .. versionchanged:: 2.4
558 Added the optional flags argument.
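How the shape of the result depends on the number of groups can be seen in a
short sketch (sample strings and names are illustrative):

```python
import re

# No groups: a list of the matched substrings.
no_groups = re.findall(r'\d+', 'a1 b22 c333')

# One group: a list of that group's text.
one_group = re.findall(r'\w(\d+)', 'a1 b22')

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w)(\d+)', 'a1 b22')

print(no_groups, one_group, two_groups)
```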
559
560
561.. function:: finditer(pattern, string[, flags])
562
563 Return an iterator over all non-overlapping matches for the RE *pattern* in
564 *string*. For each match, the iterator returns a match object. Empty matches
565 are included in the result unless they touch the beginning of another match.
566
567 .. versionadded:: 2.2
568
569 .. versionchanged:: 2.4
570 Added the optional flags argument.
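Because the iterator yields match objects rather than strings, positions are
available as well as text.  A minimal sketch with illustrative names:

```python
import re

# Collect (start position, matched text) for every run of digits.
hits = [(m.start(), m.group(0)) for m in re.finditer(r'\d+', 'a1 b22')]
print(hits)
```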
571
572
573.. function:: sub(pattern, repl, string[, count])
574
   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*.  If the pattern isn't found,
   *string* is returned unchanged.  *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed.  That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage
   return, and so forth.  Unknown escapes such as ``\j`` are left alone.
   Backreferences, such as ``\6``, are replaced with the substring matched by
   group 6 in the pattern.  For example::
583
      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'
588
589 If *repl* is a function, it is called for every non-overlapping occurrence of
590 *pattern*. The function takes a single match object argument, and returns the
591 replacement string. For example::
592
      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
598
   The pattern may be a string or an RE object; if you need to specify regular
   expression flags, you must use an RE object, or use embedded modifiers in a
   pattern; for example, ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.
602
603 The optional argument *count* is the maximum number of pattern occurrences to be
604 replaced; *count* must be a non-negative integer. If omitted or zero, all
605 occurrences will be replaced. Empty matches for the pattern are replaced only
606 when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
607 ``'-a-b-c-'``.
608
609 In addition to character escapes and backreferences as described above,
610 ``\g<name>`` will use the substring matched by the group named ``name``, as
611 defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
612 group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
613 in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
614 reference to group 20, not a reference to group 2 followed by the literal
615 character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
616 substring matched by the RE.
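The ``\g<...>`` forms can be sketched as follows; the group names ``first`` and
``second`` and the sample strings are invented for this example:

```python
import re

# \g<name> refers to a named group.
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>',
                 'hello world')

# \g<2>0 is group 2 followed by a literal '0'; it cannot be read as group 20.
followed_by_zero = re.sub(r'(a)(b)', r'\g<2>0', 'ab')

# \g<0> stands for the whole match.
bracketed = re.sub(r'(a)(b)', r'[\g<0>]', 'ab')

print(swapped, followed_by_zero, bracketed)
```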
617
618
619.. function:: subn(pattern, repl, string[, count])
620
621 Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
622 number_of_subs_made)``.
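For instance, in this small sketch with illustrative names:

```python
import re

# Replace every digit and report how many replacements were made.
all_digits = re.subn(r'\d', '#', 'a1b2c3')

# With a count, at most that many replacements are performed.
first_two = re.subn(r'\d', '#', 'a1b2c3', count=2)

print(all_digits, first_two)
```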
623
624
625.. function:: escape(string)
626
627 Return *string* with all non-alphanumerics backslashed; this is useful if you
628 want to match an arbitrary literal string that may have regular expression
629 metacharacters in it.
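The sketch below shows why escaping matters when the text to be found contains
metacharacters.  The literal ``'a.b*c'`` is an invented example, and the exact
escaped form is not shown because which characters get a backslash can vary
between Python versions:

```python
import re

literal = 'a.b*c'

# Unescaped, '.' and '*' act as metacharacters, so the text does not match itself.
as_regex = re.fullmatch(literal, literal)
other_text = re.fullmatch(literal, 'axbbc')

# Escaped, the pattern matches exactly the original text.
escaped = re.escape(literal)
as_literal = re.fullmatch(escaped, literal)

print(as_regex, other_text is not None, as_literal is not None)
```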
630
631
632.. exception:: error
633
634 Exception raised when a string passed to one of the functions here is not a
635 valid regular expression (for example, it might contain unmatched parentheses)
636 or when some other error occurs during compilation or matching. It is never an
637 error if a string contains no match for a pattern.
638
639
640.. _re-objects:
641
642Regular Expression Objects
643--------------------------
644
645Compiled regular expression objects support the following methods and
646attributes:
647
648
649.. method:: RegexObject.match(string[, pos[, endpos]])
650
651 If zero or more characters at the beginning of *string* match this regular
652 expression, return a corresponding :class:`MatchObject` instance. Return
653 ``None`` if the string does not match the pattern; note that this is different
654 from a zero-length match.
655
656 .. note::
657
658 If you want to locate a match anywhere in *string*, use :meth:`search` instead.
659
660 The optional second parameter *pos* gives an index in the string where the
661 search is to start; it defaults to ``0``. This is not completely equivalent to
662 slicing the string; the ``'^'`` pattern character matches at the real beginning
663 of the string and at positions just after a newline, but not necessarily at the
664 index where the search is to start.
665
   The optional parameter *endpos* limits how far the string will be searched; it
   will be as if the string is *endpos* characters long, so only the characters
   from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
   than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
   expression object, ``rx.match(string, 0, 50)`` is equivalent to
   ``rx.match(string[:50], 0)``.
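A brief sketch of how *pos* and *endpos* differ from slicing; the patterns and
variable names are illustrative:

```python
import re

caret_a = re.compile(r'^a')

# Starting at pos=1 does not make index 1 "the beginning" for '^'.
with_pos = caret_a.search('ba', 1)

# Slicing produces a new string whose real beginning is the 'a'.
with_slice = caret_a.search('ba'[1:])

digits = re.compile(r'\d+')

# endpos behaves like truncating the string.
limited = digits.match('12345', 0, 3)
truncated = digits.match('12345'[:3], 0)

# endpos smaller than pos: nothing can match.
empty_window = digits.match('12345', 3, 2)

print(with_pos, with_slice.group(0), limited.group(0), truncated.group(0), empty_window)
```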
672
673
674.. method:: RegexObject.search(string[, pos[, endpos]])
675
676 Scan through *string* looking for a location where this regular expression
677 produces a match, and return a corresponding :class:`MatchObject` instance.
678 Return ``None`` if no position in the string matches the pattern; note that this
679 is different from finding a zero-length match at some point in the string.
680
681 The optional *pos* and *endpos* parameters have the same meaning as for the
682 :meth:`match` method.
683
684
685.. method:: RegexObject.split(string[, maxsplit=0])
686
687 Identical to the :func:`split` function, using the compiled pattern.
688
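
.. method:: RegexObject.findall(string[, pos[, endpos]])

   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region like for :meth:`match`.


.. method:: RegexObject.finditer(string[, pos[, endpos]])

   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts optional *pos* and *endpos* parameters that limit the search
   region like for :meth:`match`.
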
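The *pos* and *endpos* arguments are only available on the compiled-pattern
forms of ``findall()`` and ``finditer()``. A minimal sketch with invented
sample data:

```python
import re

number = re.compile(r"\d+")
text = "10 20 30 40"   # hypothetical input

all_numbers = number.findall(text)
from_index_3 = number.findall(text, 3)                      # skip the first number
window = [m.group() for m in number.finditer(text, 3, 8)]   # only text[3:8]

print(all_numbers)    # ['10', '20', '30', '40']
print(from_index_3)   # ['20', '30', '40']
print(window)         # ['20', '30']
```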

.. method:: RegexObject.sub(repl, string[, count=0])

   Identical to the :func:`sub` function, using the compiled pattern.


.. method:: RegexObject.subn(repl, string[, count=0])

   Identical to the :func:`subn` function, using the compiled pattern.

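A brief sketch of ``sub()`` and ``subn()`` on a compiled pattern. The
whitespace-collapsing pattern and the input are assumptions made for the
example.

```python
import re

spaces = re.compile(r"\s+")
text = "a  b   c"   # hypothetical input with runs of spaces

collapsed = spaces.sub(" ", text)          # replace every run
once = spaces.sub(" ", text, count=1)      # replace only the first run
result, n = spaces.subn(" ", text)         # also report how many were replaced

print(collapsed)   # a b c
print(once)        # a b   c
print(result, n)   # a b c 2
```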

.. attribute:: RegexObject.flags

   The flags argument used when the RE object was compiled, or ``0`` if no flags
   were provided.


.. attribute:: RegexObject.groupindex

   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   numbers. The dictionary is empty if no symbolic groups were used in the
   pattern.


.. attribute:: RegexObject.pattern

   The pattern string from which the RE object was compiled.

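The three attributes can be inspected as in the sketch below. The group name
``frac`` is invented for the example. The exact integer stored in ``flags``
can differ between Python versions, so the sketch only tests for one specific
flag instead of comparing the whole value.

```python
import re

# Hypothetical pattern with two named groups.
rx = re.compile(r"(?P<int>\d+)\.(?P<frac>\d*)", re.IGNORECASE)

names = dict(rx.groupindex)                       # name -> group number
source = rx.pattern                               # the original pattern string
has_ignorecase = bool(rx.flags & re.IGNORECASE)   # test a single flag bit

# A pattern without named groups has an empty groupindex.
no_names = dict(re.compile(r"(\d+)").groupindex)

print(names, source, has_ignorecase, no_names)
```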

.. _match-objects:

Match Objects
-------------

:class:`MatchObject` instances support the following methods and attributes:


.. method:: MatchObject.expand(template)

   Return the string obtained by doing backslash substitution on the template
   string *template*, as done by the :meth:`sub` method. Escapes such as ``\n`` are
   converted to the appropriate characters, and numeric backreferences (``\1``,
   ``\2``) and named backreferences (``\g<1>``, ``\g<name>``) are replaced by the
   contents of the corresponding group.

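A minimal sketch of ``expand()``. The templates are made up for illustration
and are written as raw strings so that the backslashes reach ``expand()``
unchanged.

```python
import re

m = re.match(r"(?P<int>\d+)\.(\d*)", "3.14")

swapped = m.expand(r"\2.\1")               # numeric backreferences
named = m.expand(r"\g<int> and \g<2>")     # \g<name> and \g<number> forms
with_newline = m.expand(r"\1\n")           # the \n escape becomes a newline

print(swapped)   # 14.3
print(named)     # 3 and 14
```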

.. method:: MatchObject.group([group1, ...])

   Return one or more subgroups of the match. If there is a single argument, the
   result is a single string; if there are multiple arguments, the result is a
   tuple with one item per argument. Without arguments, *group1* defaults to zero
   (the whole match is returned). If a *groupN* argument is zero, the corresponding
   return value is the entire matching string; if it is in the inclusive range
   [1..99], it is the string matching the corresponding parenthesized group. If a
   group number is negative or larger than the number of groups defined in the
   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
   part of the pattern that did not match, the corresponding result is ``None``.
   If a group is contained in a part of the pattern that matched multiple times,
   the last match is returned.

   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   arguments may also be strings identifying groups by their group name. If a
   string argument is not used as a group name in the pattern, an :exc:`IndexError`
   exception is raised.

   A moderately complicated example::

      m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')

   After performing this match, ``m.group(1)`` is ``'3'``, as is
   ``m.group('int')``, and ``m.group(2)`` is ``'14'``.

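The remaining cases described for ``group()`` (several arguments, a group that
did not match, a group that matched repeatedly, and an undefined group) might
look like this. All patterns and inputs are invented for the sketch.

```python
import re

m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
whole = m.group(0)       # the entire match
pair = m.group(1, 2)     # several arguments give a tuple

# A group inside a part of the pattern that did not match yields None.
optional = re.match(r"(a)?b", "b").group(1)

# A group that matched several times reports only its last match.
repeated = re.match(r"(..)+", "a1b2c3").group(1)

# Asking for a group the pattern does not define raises IndexError.
try:
    m.group(3)
    undefined_group_raises = False
except IndexError:
    undefined_group_raises = True

print(whole, pair, optional, repeated, undefined_group_raises)
```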

.. method:: MatchObject.groups([default])

   Return a tuple containing all the subgroups of the match, from 1 up to however
   many groups are in the pattern. The *default* argument is used for groups that
   did not participate in the match; it defaults to ``None``. (Incompatibility
   note: in the original Python 1.5 release, if the tuple was one element long, a
   string would be returned instead. In later versions (from 1.5.1 on), a
   singleton tuple is returned in such cases.)

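A short sketch of the *default* argument of ``groups()``, using a hypothetical
pattern whose second group is optional:

```python
import re

# The fractional part is optional, so group 2 may not participate.
m = re.match(r"(\d+)\.?(\d+)?", "24")

plain = m.groups()          # non-participating groups are None
filled = m.groups("0")      # or the supplied default

print(plain)    # ('24', None)
print(filled)   # ('24', '0')
```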

.. method:: MatchObject.groupdict([default])

   Return a dictionary containing all the *named* subgroups of the match, keyed by
   the subgroup name. The *default* argument is used for groups that did not
   participate in the match; it defaults to ``None``.

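``groupdict()`` can be sketched the same way. The group names ``first``,
``last`` and ``suffix`` are assumptions made for the example.

```python
import re

m = re.match(
    r"(?P<first>\w+) (?P<last>\w+)(?: (?P<suffix>\w+))?",
    "Ada Lovelace",
)

by_name = m.groupdict()        # the unmatched 'suffix' group maps to None
with_default = m.groupdict("") # or to the supplied default

print(by_name)
print(with_default)
```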

.. method:: MatchObject.start([group])
            MatchObject.end([group])

   Return the indices of the start and end of the substring matched by *group*;
   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   *group* exists but did not contribute to the match. For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string. For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.

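One practical use of ``start()`` and ``end()`` is cutting the matched text out
of the original string. A minimal sketch with an invented address:

```python
import re

# Hypothetical input containing an unwanted marker.
email = "tony@tiremove_thisger.net"
m = re.search("remove_this", email)

# Keep everything before the match and everything after it.
cleaned = email[:m.start()] + email[m.end():]

# The slice identity given above holds for the whole match (group 0).
same_text = m.string[m.start(0):m.end(0)] == m.group(0)

print(cleaned)   # tony@tiger.net
```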

.. method:: MatchObject.span([group])

   For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
   m.end(group))``. Note that if *group* did not contribute to the match, this is
   ``(-1, -1)``. Again, *group* defaults to zero.

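The three outcomes of ``span()`` (a normal span, an empty span, and a group
that did not contribute) can be sketched as follows. The second pattern is
made up for the example.

```python
import re

m = re.search(r"b(c?)", "cba")
whole_span = m.span()       # span of the whole match
empty_group = m.span(1)     # group 1 matched the empty string

# Group 1 exists in the pattern but took no part in the match.
unused = re.match(r"(a)?b", "b").span(1)

print(whole_span, empty_group, unused)   # (1, 2) (2, 2) (-1, -1)
```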

.. attribute:: MatchObject.pos

   The value of *pos* which was passed to the :meth:`search` or :meth:`match`
   method of the :class:`RegexObject`. This is the index into the string at which
   the RE engine started looking for a match.


.. attribute:: MatchObject.endpos

   The value of *endpos* which was passed to the :meth:`search` or :meth:`match`
   method of the :class:`RegexObject`. This is the index into the string beyond
   which the RE engine will not go.


.. attribute:: MatchObject.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   string.


.. attribute:: MatchObject.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.

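The ``lastindex`` cases listed above, together with ``lastgroup``, can be
checked with a short sketch. The group names ``first`` and ``second`` are
invented for the example.

```python
import re

# The four expressions from the lastindex description, applied to 'ab'.
last_indexes = [
    re.match(pattern, "ab").lastindex
    for pattern in (r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)")
]

named = re.match(r"(?P<first>a)(?P<second>b)", "ab").lastgroup
unnamed = re.match(r"(?P<first>a)(b)", "ab").lastgroup   # last group has no name
no_groups = re.match(r"ab", "ab").lastindex              # pattern has no groups

print(last_indexes, named, unnamed, no_groups)
```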

.. attribute:: MatchObject.re

   The regular expression object whose :meth:`match` or :meth:`search` method
   produced this :class:`MatchObject` instance.


.. attribute:: MatchObject.string

   The string passed to :func:`match` or :func:`search`.

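The ``pos``, ``endpos``, ``re`` and ``string`` attributes simply record how
the match was produced. A minimal sketch with made-up input:

```python
import re

rx = re.compile(r"\d+")
text = "abc 123 def"   # hypothetical input

m = rx.search(text, 2, 9)   # only text[2:9] is examined

print(m.pos, m.endpos)      # 2 9
print(m.re is rx)           # True
print(m.string is text)     # True
print(m.span())             # (4, 7)
```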

Examples
--------

**Simulating scanf()**

.. index:: single: scanf()

Python does not currently have an equivalent to :cfunc:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:cfunc:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :cfunc:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :cfunc:`scanf` Token           | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``0[0-7]*``                                 |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``0[xX][\dA-Fa-f]+``                        |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :cfunc:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings
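
Applied from Python, that regular expression might be used as in the sketch
below. The variable names are illustrative, and the conversion to ``int`` is
an extra step that :cfunc:`scanf` would perform implicitly for ``%d``.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

filename = m.group(1)
# Groups are always returned as strings, so convert the counts explicitly.
errors, warnings = int(m.group(2)), int(m.group(3))

print(filename, errors, warnings)   # /usr/sbin/sendmail 0 4
```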

**Avoiding recursion**

If you create regular expressions that require the engine to perform a lot of
recursion, you may encounter a :exc:`RuntimeError` exception with the message
``maximum recursion limit exceeded``. For example, ::

   >>> import re
   >>> s = 'Begin ' + 1000*'a very long string ' + 'end'
   >>> re.match(r'Begin (\w| )*? end', s).end()
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
     File "/usr/local/lib/python2.5/re.py", line 132, in match
       return _compile(pattern, flags).match(string)
   RuntimeError: maximum recursion limit exceeded

You can often restructure your regular expression to avoid recursion.

Starting with Python 2.3, simple uses of the ``*?`` pattern are special-cased to
avoid recursion. Thus, the above regular expression can avoid recursion by
being recast as ``Begin [a-zA-Z0-9_ ]*?end``. As a further benefit, such
regular expressions will run faster than their recursive equivalents.
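
As a quick check, the recast pattern can be applied to the same test string.
This is only a sketch of the rewritten expression. It does not try to
reproduce the :exc:`RuntimeError`, which depends on the interpreter version.

```python
import re

s = 'Begin ' + 1000 * 'a very long string ' + 'end'

# The character class replaces the (\w| ) alternation, so the lazy repeat
# applies to a single simple item.
recast = re.match(r'Begin [a-zA-Z0-9_ ]*?end', s)

print(recast.end() == len(s))   # True: the match runs to the final 'end'
```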
921