:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>


13This module provides regular expression matching operations similar to
14those found in Perl. Both patterns and strings to be searched can be
15Unicode strings as well as 8-bit strings. The :mod:`re` module is
16always available.
17
18Regular expressions use the backslash character (``'\'``) to indicate
19special forms or to allow special characters to be used without invoking
20their special meaning. This collides with Python's usage of the same
21character for the same purpose in string literals; for example, to match
22a literal backslash, one might have to write ``'\\\\'`` as the pattern
23string, because the regular expression must be ``\\``, and each
24backslash must be expressed as ``\\`` inside a regular Python string
25literal.
26
27The solution is to use Python's raw string notation for regular expression
28patterns; backslashes are not handled in any special way in a string literal
29prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
30``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
31newline. Usually patterns will be expressed in Python code using this raw string
32notation.
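
As a small sketch of the difference (the variable names are illustrative
only), both patterns here match a single literal backslash, but the raw-string
form avoids doubling every backslash::

   import re

   plain = '\\\\'   # ordinary literal: four backslashes typed, two in the string
   raw = r'\\'      # raw literal: two backslashes typed, two in the string
   same_pattern = (plain == raw)

   # The subject string 'a\\b' contains exactly one backslash.
   found = re.search(raw, 'a\\b').group(0)

   # r"\n" is backslash plus 'n'; "\n" is a single newline character.
   newline_lengths = (len(r"\n"), len("\n"))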
33
34.. seealso::
35
36 Mastering Regular Expressions
37 Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
38 second edition of the book no longer covers Python at all, but the first
39 edition covered writing good regular expression patterns in great detail.
40
41
42.. _re-syntax:
43
44Regular Expression Syntax
45-------------------------
46
47A regular expression (or RE) specifies a set of strings that matches it; the
48functions in this module let you check if a particular string matches a given
49regular expression (or if a given regular expression matches a particular
50string, which comes down to the same thing).
51
Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contains low-precedence
operations, boundary conditions apply between *A* and *B*, or either has numbered
group references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book referenced
above, or almost any textbook about compiler construction.
61
62A brief explanation of the format of regular expressions follows. For further
63information and a gentler presentation, consult the Regular Expression HOWTO,
64accessible from http://www.python.org/doc/howto/.
65
66Regular expressions can contain both special and ordinary characters. Most
67ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
68expressions; they simply match themselves. You can concatenate ordinary
69characters, so ``last`` matches the string ``'last'``. (In the rest of this
70section, we'll write RE's in ``this special style``, usually without quotes, and
71strings to be matched ``'in single quotes'``.)
72
73Some characters, like ``'|'`` or ``'('``, are special. Special
74characters either stand for classes of ordinary characters, or affect
75how the regular expressions around them are interpreted. Regular
76expression pattern strings may not contain null bytes, but can specify
77the null byte using the ``\number`` notation, e.g., ``'\x00'``.
78
79
80The special characters are:
81
82.. %
83
84``'.'``
85 (Dot.) In the default mode, this matches any character except a newline. If
86 the :const:`DOTALL` flag has been specified, this matches any character
87 including a newline.
88
89``'^'``
90 (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
91 matches immediately after each newline.
92
93``'$'``
94 Matches the end of the string or just before the newline at the end of the
95 string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
96 matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
97 only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
98 matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode.
99
100``'*'``
101 Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
102 many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
103 by any number of 'b's.
104
105``'+'``
106 Causes the resulting RE to match 1 or more repetitions of the preceding RE.
107 ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
108 match just 'a'.
109
110``'?'``
111 Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
112 ``ab?`` will match either 'a' or 'ab'.
113
114``*?``, ``+?``, ``??``
115 The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
116 as much text as possible. Sometimes this behaviour isn't desired; if the RE
117 ``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
118 string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
119 perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
120 characters as possible will be matched. Using ``.*?`` in the previous
121 expression will match only ``'<H1>'``.
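
   A minimal sketch of the difference, reusing the string from the paragraph
   above (the variable names are illustrative)::

      import re

      text = '<H1>title</H1>'
      # Greedy: '.*' runs to the last '>' in the string.
      greedy = re.search(r'<.*>', text).group(0)
      # Non-greedy: '.*?' stops at the first '>' it can.
      minimal = re.search(r'<.*?>', text).group(0)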
122
123``{m}``
124 Specifies that exactly *m* copies of the previous RE should be matched; fewer
125 matches cause the entire RE not to match. For example, ``a{6}`` will match
126 exactly six ``'a'`` characters, but not five.
127
128``{m,n}``
129 Causes the resulting RE to match from *m* to *n* repetitions of the preceding
130 RE, attempting to match as many repetitions as possible. For example,
131 ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
132 lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
133 example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
134 followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
135 modifier would be confused with the previously described form.
136
137``{m,n}?``
138 Causes the resulting RE to match from *m* to *n* repetitions of the preceding
139 RE, attempting to match as *few* repetitions as possible. This is the
140 non-greedy version of the previous qualifier. For example, on the
141 6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
142 while ``a{3,5}?`` will only match 3 characters.
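
   The counted forms can be compared side by side; this is only a sketch and
   the variable names are made up for the example::

      import re

      six = 'aaaaaa'
      greedy = re.match(r'a{3,5}', six).group(0)   # takes as many as allowed
      lazy = re.match(r'a{3,5}?', six).group(0)    # takes as few as allowed
      exact = re.match(r'a{6}', 'aaaaa')           # only five 'a's: no match
      at_least_four = re.match(r'a{4,}b', 'aaab')  # only three 'a's: no match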
143
144``'\'``
145 Either escapes special characters (permitting you to match characters like
146 ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
147 sequences are discussed below.
148
149 If you're not using a raw string to express the pattern, remember that Python
150 also uses the backslash as an escape sequence in string literals; if the escape
151 sequence isn't recognized by Python's parser, the backslash and subsequent
152 character are included in the resulting string. However, if Python would
153 recognize the resulting sequence, the backslash should be repeated twice. This
154 is complicated and hard to understand, so it's highly recommended that you use
155 raw strings for all but the simplest expressions.
156
``[]``
   Used to indicate a set of characters. Characters can be listed individually, or
   a range of characters can be indicated by giving two characters and separating
   them by a ``'-'``. Special characters are not active inside sets. For example,
   ``[akm$]`` will match any of the characters ``'a'``, ``'k'``,
   ``'m'``, or ``'$'``; ``[a-z]`` will match any lowercase letter, and
   ``[a-zA-Z0-9]`` matches any letter or digit. Character classes such
   as ``\w`` or ``\S`` (defined below) are also acceptable inside a
   set, although the characters they match depend on whether :const:`LOCALE`
   or :const:`UNICODE` mode is in force. If you want to include a
   ``']'`` or a ``'-'`` inside a set, precede it with a backslash, or
   place it as the first character. The pattern ``[]]`` will match
   ``']'``, for example.

   You can match the characters not within a set by :dfn:`complementing` the set.
   This is indicated by including a ``'^'`` as the first character of the set;
   ``'^'`` elsewhere will simply match the ``'^'`` character. For example,
   ``[^5]`` will match any character except ``'5'``, and ``[^^]`` will match any
   character except ``'^'``.
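
   The set rules can be tried out with a short sketch like the following (the
   sample strings are invented for illustration)::

      import re

      # '$' has no special meaning inside a set.
      in_set = re.findall(r'[akm$]', 'make $5')
      # A leading '^' complements the set.
      not_five = re.findall(r'[^5]', '1525')
      # ']' placed first is taken literally.
      bracket = re.findall(r'[]]', 'a]b')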
176
177``'|'``
178 ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
179 will match either A or B. An arbitrary number of REs can be separated by the
180 ``'|'`` in this way. This can be used inside groups (see below) as well. As
181 the target string is scanned, REs separated by ``'|'`` are tried from left to
182 right. When one pattern completely matches, that branch is accepted. This means
183 that once ``A`` matches, ``B`` will not be tested further, even if it would
184 produce a longer overall match. In other words, the ``'|'`` operator is never
185 greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
186 character class, as in ``[|]``.
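
   A short sketch of the left-to-right rule (names are illustrative): the
   order of the alternatives decides which one wins, not their length::

      import re

      first_wins = re.match(r'a|ab', 'ab').group(0)   # 'a' is tried first
      longer_first = re.match(r'ab|a', 'ab').group(0) # 'ab' is tried first
      literal_bar = re.findall(r'[|]', 'x|y')         # '|' inside a set is literal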
187
188``(...)``
189 Matches whatever regular expression is inside the parentheses, and indicates the
190 start and end of a group; the contents of a group can be retrieved after a match
191 has been performed, and can be matched later in the string with the ``\number``
192 special sequence, described below. To match the literals ``'('`` or ``')'``,
193 use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
194
195``(?...)``
196 This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
197 otherwise). The first character after the ``'?'`` determines what the meaning
198 and further syntax of the construct is. Extensions usually do not create a new
199 group; ``(?P<name>...)`` is the only exception to this rule. Following are the
200 currently supported extensions.
201
202``(?iLmsux)``
203 (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
204 ``'u'``, ``'x'``.) The group matches the empty string; the letters
205 set the corresponding flags: :const:`re.I` (ignore case),
206 :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
207 :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
208 and :const:`re.X` (verbose), for the entire regular expression. (The
209 flags are described in :ref:`contents-of-module-re`.) This
210 is useful if you wish to include the flags as part of the regular
211 expression, instead of passing a *flag* argument to the
212 :func:`compile` function.
213
214 Note that the ``(?x)`` flag changes how the expression is parsed. It should be
215 used first in the expression string, or after one or more whitespace characters.
216 If there are non-whitespace characters before the flag, the results are
217 undefined.
218
219``(?:...)``
220 A non-grouping version of regular parentheses. Matches whatever regular
221 expression is inside the parentheses, but the substring matched by the group
222 *cannot* be retrieved after performing a match or referenced later in the
223 pattern.
224
225``(?P<name>...)``
226 Similar to regular parentheses, but the substring matched by the group is
227 accessible via the symbolic group name *name*. Group names must be valid Python
228 identifiers, and each group name must be defined only once within a regular
229 expression. A symbolic group is also a numbered group, just as if the group
230 were not named. So the group named 'id' in the example below can also be
231 referenced as the numbered group 1.
232
233 For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
234 referenced by its name in arguments to methods of match objects, such as
235 ``m.group('id')`` or ``m.end('id')``, and also by name in pattern text (for
236 example, ``(?P=id)``) and replacement text (such as ``\g<id>``).
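
   The three ways of referring to a named group might look like this in
   practice; the sample strings and variable names are assumptions made for
   the sketch::

      import re

      m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
      by_name = m.group('id')     # access by symbolic name
      by_number = m.group(1)      # the same group, by number
      end_of_id = m.end('id')     # index just past the group

      # (?P=id) in the pattern repeats whatever the group matched.
      doubled = re.search(r'(?P<id>[a-zA-Z_]\w*) (?P=id)', 'say hello hello')

      # \g<id> in replacement text inserts the group's contents.
      wrapped = re.sub(r'(?P<id>[a-zA-Z_]\w*)', r'<\g<id>>', 'spam')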
237
238``(?P=name)``
239 Matches whatever text was matched by the earlier group named *name*.
240
241``(?#...)``
242 A comment; the contents of the parentheses are simply ignored.
243
244``(?=...)``
245 Matches if ``...`` matches next, but doesn't consume any of the string. This is
246 called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
247 ``'Isaac '`` only if it's followed by ``'Asimov'``.
248
249``(?!...)``
250 Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
251 For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
252 followed by ``'Asimov'``.
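
   Both lookahead forms can be checked with a sketch along these lines (the
   second subject string is invented for the example). Note that the text
   inside the assertion is never part of the match::

      import re

      positive = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
      matched_text = positive.group(0)   # the lookahead text is not consumed

      negative_blocked = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')
      negative_allowed = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')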
253
254``(?<=...)``
255 Matches if the current position in the string is preceded by a match for ``...``
256 that ends at the current position. This is called a :dfn:`positive lookbehind
257 assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
258 lookbehind will back up 3 characters and check if the contained pattern matches.
259 The contained pattern must only match strings of some fixed length, meaning that
260 ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
261 patterns which start with positive lookbehind assertions will never match at the
262 beginning of the string being searched; you will most likely want to use the
263 :func:`search` function rather than the :func:`match` function::
264
      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen::

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'
275
276``(?<!...)``
277 Matches if the current position in the string is not preceded by a match for
278 ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
279 positive lookbehind assertions, the contained pattern must only match strings of
280 some fixed length. Patterns which start with negative lookbehind assertions may
281 match at the beginning of the string being searched.
282
283``(?(id/name)yes-pattern|no-pattern)``
284 Will try to match with ``yes-pattern`` if the group with given *id* or *name*
285 exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
286 can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
287 matching pattern, which will match with ``'<user@host.com>'`` as well as
288 ``'user@host.com'``, but not with ``'<user@host.com'``.
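
   A sketch of the email example, using :func:`match` so that the pattern is
   anchored at the start of each string (the variable names are illustrative)::

      import re

      pat = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

      bracketed = pat.match('<user@host.com>')   # group 1 set, so '>' is required
      bare = pat.match('user@host.com')          # group 1 unset, nothing required
      unbalanced = pat.match('<user@host.com')   # '<' seen but no closing '>'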
289
291The special sequences consist of ``'\'`` and a character from the list below.
292If the ordinary character is not on the list, then the resulting RE will match
293the second character. For example, ``\$`` matches the character ``'$'``.
294
295.. %
296
297``\number``
298 Matches the contents of the group of the same number. Groups are numbered
299 starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
300 but not ``'the end'`` (note the space after the group). This special sequence
301 can only be used to match one of the first 99 groups. If the first digit of
302 *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
303 a group match, but as the character with octal value *number*. Inside the
304 ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
305 characters.
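
   The backreference example can be sketched as follows. :func:`match` is used
   here so that the group has to start at the beginning of the string; the
   variable names are illustrative::

      import re

      repeat = re.compile(r'(.+) \1')

      words = repeat.match('the the')
      digits = repeat.match('55 55')
      different = repeat.match('the end')   # second word differs from group 1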
306
307``\A``
308 Matches only at the start of the string.
309
``\b``
   Matches the empty string, but only at the beginning or end of a word. A word is
   defined as a sequence of alphanumeric or underscore characters, so the end of a
   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
   Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
   precise set of characters deemed to be alphanumeric depends on the values of the
   ``UNICODE`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
   the backspace character, for compatibility with Python's string literals.
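
   A brief sketch of both meanings of ``\b`` (the sample text is invented for
   illustration)::

      import re

      # Word boundary: only stand-alone occurrences of 'foo' match.
      whole_words = re.findall(r'\bfoo\b', 'foo foobar (foo) barfoo')

      # Inside a set, \b is the backspace character (0x08).
      backspace = re.search(r'[\b]', 'a\x08c')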
318
319``\B``
320 Matches the empty string, but only when it is *not* at the beginning or end of a
321 word. This is just the opposite of ``\b``, so is also subject to the settings
322 of ``LOCALE`` and ``UNICODE``.
323
324``\d``
325 When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
326 is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
327 whatever is classified as a digit in the Unicode character properties database.
328
``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
   will match anything other than characters marked as digits in the Unicode
   character properties database.
334
335``\s``
336 When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
337 any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
338 :const:`LOCALE`, it will match this set plus whatever characters are defined as
339 space for the current locale. If :const:`UNICODE` is set, this will match the
340 characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
341 character properties database.
342
``\S``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``.
   With :const:`LOCALE`, it will match any character not in this set, and not
   defined as space in the current locale. If :const:`UNICODE` is set, this will
   match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
   the Unicode character properties database.
350
351``\w``
352 When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
353 any alphanumeric character and the underscore; this is equivalent to the set
354 ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
355 whatever characters are defined as alphanumeric for the current locale. If
356 :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
357 is classified as alphanumeric in the Unicode character properties database.
358
359``\W``
360 When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
361 any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
362 With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
363 not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
364 this will match anything other than ``[0-9_]`` and characters marked as
365 alphanumeric in the Unicode character properties database.
366
367``\Z``
368 Matches only at the end of the string.
369
370Most of the standard escapes supported by Python string literals are also
371accepted by the regular expression parser::
372
   \a      \b      \f      \n
   \r      \t      \v      \x
   \\
376
377Octal escapes are included in a limited form: If the first digit is a 0, or if
378there are three octal digits, it is considered an octal escape. Otherwise, it is
379a group reference. As for string literals, octal escapes are always at most
380three digits in length.
381
382.. % Note the lack of a period in the section title; it causes problems
383.. % with readers of the GNU info version. See http://www.python.org/sf/581414.
384
385
386.. _matching-searching:
387
388Matching vs Searching
389---------------------
390
391.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
392
393
Python offers two different primitive operations based on regular expressions:
**match** checks for a match only at the beginning of the string, while
**search** checks for a match anywhere in the string (this is what Perl does
by default).

Note that match may differ from search even when using a regular expression
beginning with ``'^'``: ``'^'`` matches only at the start of the string, or in
:const:`MULTILINE` mode also immediately following a newline. The "match"
operation succeeds only if the pattern matches at the start of the string
regardless of mode, or at the starting position given by the optional *pos*
argument regardless of whether a newline precedes it.
405
406.. % Examples from Tim Peters:
407
408::
409
   re.compile("a").match("ba", 1)           # succeeds
   re.compile("^a").search("ba", 1)         # fails; 'a' not at start
   re.compile("^a").search("\na", 1)        # fails; 'a' not at start
   re.compile("^a", re.M).search("\na", 1)  # succeeds
   re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
415
416
417.. _contents-of-module-re:
418
419Module Contents
420---------------
421
The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions. Most non-trivial applications use the compiled form.
426
427
428.. function:: compile(pattern[, flags])
429
430 Compile a regular expression pattern into a regular expression object, which can
431 be used for matching using its :func:`match` and :func:`search` methods,
432 described below.
433
434 The expression's behaviour can be modified by specifying a *flags* value.
435 Values can be any of the following variables, combined using bitwise OR (the
436 ``|`` operator).
437
   The sequence ::

      prog = re.compile(pat)
      result = prog.match(str)

   is equivalent to ::

      result = re.match(pat, str)
446
447 but the version using :func:`compile` is more efficient when the expression will
448 be used several times in a single program.
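
   As a sketch of combining flags with bitwise OR (the pattern and text are
   invented for the example), a compiled object can then be reused for several
   operations::

      import re

      # Ignore case, and let '^' and '$' match at line boundaries.
      prog = re.compile(r'^spam$', re.I | re.M)

      text = 'Spam\neggs\nSPAM'
      hits = prog.findall(text)
      first = prog.search(text).group(0)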
449
450 .. % (The compiled version of the last pattern passed to
451 .. % \function{re.match()} or \function{re.search()} is cached, so
452 .. % programs that use only a single regular expression at a time needn't
453 .. % worry about compiling regular expressions.)
454
455
456.. data:: I
457 IGNORECASE
458
459 Perform case-insensitive matching; expressions like ``[A-Z]`` will match
460 lowercase letters, too. This is not affected by the current locale.
461
462
463.. data:: L
464 LOCALE
465
466 Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the current
467 locale.
468
469
470.. data:: M
471 MULTILINE
472
473 When specified, the pattern character ``'^'`` matches at the beginning of the
474 string and at the beginning of each line (immediately following each newline);
475 and the pattern character ``'$'`` matches at the end of the string and at the
476 end of each line (immediately preceding each newline). By default, ``'^'``
477 matches only at the beginning of the string, and ``'$'`` only at the end of the
478 string and immediately before the newline (if any) at the end of the string.
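
   The effect on ``'$'`` and ``'^'`` can be seen in a small sketch; the subject
   string is the one used for ``'$'`` in the syntax section, and the variable
   names are illustrative::

      import re

      text = 'foo1\nfoo2\n'

      # Default: '$' matches only before the final newline.
      default_hit = re.search(r'foo.$', text).group(0)
      # MULTILINE: '$' also matches before the first newline.
      multiline_hit = re.search(r'foo.$', text, re.M).group(0)
      # MULTILINE: '^' matches at the start of every line.
      line_starts = re.findall(r'^\w+', text, re.M)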
479
480
481.. data:: S
482 DOTALL
483
484 Make the ``'.'`` special character match any character at all, including a
485 newline; without this flag, ``'.'`` will match anything *except* a newline.
486
487
488.. data:: U
489 UNICODE
490
491 Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
492 on the Unicode character properties database.
493
495.. data:: X
496 VERBOSE
497
   This flag allows you to write regular expressions that look nicer. Whitespace
   within the pattern is ignored, except when in a character class or preceded by
   an unescaped backslash, and, when a line contains a ``'#'`` neither in a
   character class nor preceded by an unescaped backslash, all characters from the
   leftmost such ``'#'`` through the end of the line are ignored.

   This means that the two following regular expression objects match the same
   strings::

      re.compile(r"""[a-z]+   # some letters
                     \.\.     # two dots
                     [a-z]*   # perhaps more letters""", re.X)
      re.compile(r"[a-z]+\.\.[a-z]*")
511
512.. function:: search(pattern, string[, flags])
513
514 Scan through *string* looking for a location where the regular expression
515 *pattern* produces a match, and return a corresponding :class:`MatchObject`
516 instance. Return ``None`` if no position in the string matches the pattern; note
517 that this is different from finding a zero-length match at some point in the
518 string.
519
520
521.. function:: match(pattern, string[, flags])
522
523 If zero or more characters at the beginning of *string* match the regular
524 expression *pattern*, return a corresponding :class:`MatchObject` instance.
525 Return ``None`` if the string does not match the pattern; note that this is
526 different from a zero-length match.
527
528 .. note::
529
530 If you want to locate a match anywhere in *string*, use :meth:`search` instead.
531
532
533.. function:: split(pattern, string[, maxsplit=0])
534
   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored. This has been fixed in later releases.) ::

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']
548
549
550.. function:: findall(pattern, string[, flags])
551
552 Return a list of all non-overlapping matches of *pattern* in *string*. If one
553 or more groups are present in the pattern, return a list of groups; this will be
554 a list of tuples if the pattern has more than one group. Empty matches are
555 included in the result unless they touch the beginning of another match.
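
   How the shape of the result depends on the number of groups can be sketched
   like this (the sample text is invented for illustration)::

      import re

      text = 'a=1 b=2'
      no_groups = re.findall(r'\w+=\d+', text)      # whole matches
      one_group = re.findall(r'(\w+)=\d+', text)    # contents of the single group
      two_groups = re.findall(r'(\w+)=(\d+)', text) # one tuple per match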
556
558.. function:: finditer(pattern, string[, flags])
559
560 Return an iterator over all non-overlapping matches for the RE *pattern* in
561 *string*. For each match, the iterator returns a match object. Empty matches
562 are included in the result unless they touch the beginning of another match.
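
   Because each item is a match object, positions are available as well as
   text. A minimal sketch, with an invented subject string::

      import re

      spans = [(m.group(0), m.start(), m.end())
               for m in re.finditer(r'\d+', 'a1 b22 c333')]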
563
565.. function:: sub(pattern, repl, string[, count])
566
   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged. *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage
   return, and so forth. Unknown escapes such as ``\j`` are left alone.
   Backreferences, such as ``\6``, are replaced with the substring matched by
   group 6 in the pattern. For example::

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single match object argument, and returns the
   replacement string. For example::

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
590
591 The pattern may be a string or an RE object; if you need to specify regular
592 expression flags, you must use a RE object, or use embedded modifiers in a
593 pattern; for example, ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.
594
595 The optional argument *count* is the maximum number of pattern occurrences to be
596 replaced; *count* must be a non-negative integer. If omitted or zero, all
597 occurrences will be replaced. Empty matches for the pattern are replaced only
598 when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
599 ``'-a-b-c-'``.
600
601 In addition to character escapes and backreferences as described above,
602 ``\g<name>`` will use the substring matched by the group named ``name``, as
603 defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
604 group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
605 in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
606 reference to group 20, not a reference to group 2 followed by the literal
607 character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
608 substring matched by the RE.
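
   The replacement forms described above can be compared in a short sketch
   (patterns and strings are invented for the example)::

      import re

      # Numbered backreferences swap the two words.
      swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')
      # \g<2>0 is group 2 followed by a literal '0'.
      group_then_zero = re.sub(r'(a)(b)', r'\g<2>0', 'ab')
      # \g<name> refers to a named group.
      named = re.sub(r'(?P<word>\w+)', r'[\g<word>]', 'hi there')
      # \g<0> is the whole match.
      whole = re.sub(r'\d+', r'<\g<0>>', 'x42')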
609
610
611.. function:: subn(pattern, repl, string[, count])
612
613 Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
614 number_of_subs_made)``.
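
   For instance, a sketch that collapses runs of whitespace and also reports
   how many runs were replaced (the input string is illustrative)::

      import re

      new_string, count = re.subn(r'\s+', ' ', 'a  b \t c')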
615
616
617.. function:: escape(string)
618
619 Return *string* with all non-alphanumerics backslashed; this is useful if you
620 want to match an arbitrary literal string that may have regular expression
621 metacharacters in it.
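
   A sketch of why escaping matters; the literal string is invented, and the
   exact escaped form is not shown because it may vary between versions::

      import re

      literal = 'a.b*c'
      escaped = re.escape(literal)

      exact = re.match(escaped, 'a.b*c')      # matches the literal text
      rejected = re.match(escaped, 'aXbbc')   # metacharacters are now inert
      # Unescaped, '.' matches any character and 'b*' any run of 'b's.
      accidental = re.match(literal, 'aXbbc')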
622
623
624.. exception:: error
625
626 Exception raised when a string passed to one of the functions here is not a
627 valid regular expression (for example, it might contain unmatched parentheses)
628 or when some other error occurs during compilation or matching. It is never an
629 error if a string contains no match for a pattern.
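
   A minimal sketch of the distinction between an invalid pattern and a
   pattern that simply finds nothing (the pattern strings are illustrative)::

      import re

      try:
          re.compile(r'(unclosed')   # unmatched parenthesis
          compiled = True
      except re.error as exc:
          compiled = False
          message = str(exc)

      # A valid pattern with no match returns None; no exception is raised.
      no_match = re.search(r'x', 'abc')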
630
631
632.. _re-objects:
633
634Regular Expression Objects
635--------------------------
636
637Compiled regular expression objects support the following methods and
638attributes:
639
640
641.. method:: RegexObject.match(string[, pos[, endpos]])
642
643 If zero or more characters at the beginning of *string* match this regular
644 expression, return a corresponding :class:`MatchObject` instance. Return
645 ``None`` if the string does not match the pattern; note that this is different
646 from a zero-length match.
647
648 .. note::
649
650 If you want to locate a match anywhere in *string*, use :meth:`search` instead.
651
652 The optional second parameter *pos* gives an index in the string where the
653 search is to start; it defaults to ``0``. This is not completely equivalent to
654 slicing the string; the ``'^'`` pattern character matches at the real beginning
655 of the string and at positions just after a newline, but not necessarily at the
656 index where the search is to start.
657
   The optional parameter *endpos* limits how far the string will be searched; it
   will be as if the string is *endpos* characters long, so only the characters
   from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
   than *pos*, no match will be found. Otherwise, if *rx* is a compiled regular
   expression object, ``rx.match(string, 0, 50)`` is equivalent to
   ``rx.match(string[:50], 0)``.
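
   The behaviour of *pos* and *endpos* can be sketched as follows; the patterns
   and strings are assumptions made for the example::

      import re

      caret = re.compile(r'^a')
      # pos=1 does not move the real beginning of the string for '^'.
      with_pos = caret.search('ba', 1)
      # Slicing does: the slice is a new string that starts with 'a'.
      with_slice = caret.search('ba'[1:])

      digits = re.compile(r'\d+')
      from_three = digits.match('abc123', 3).group(0)
      capped = digits.match('abc123', 3, 5).group(0)     # endpos=5
      sliced = digits.match('abc123'[:5], 3).group(0)    # same as endpos=5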
664
665
666.. method:: RegexObject.search(string[, pos[, endpos]])
667
668 Scan through *string* looking for a location where this regular expression
669 produces a match, and return a corresponding :class:`MatchObject` instance.
670 Return ``None`` if no position in the string matches the pattern; note that this
671 is different from finding a zero-length match at some point in the string.
672
673 The optional *pos* and *endpos* parameters have the same meaning as for the
674 :meth:`match` method.
675

.. method:: RegexObject.split(string[, maxsplit=0])

   Identical to the :func:`split` function, using the compiled pattern.


.. method:: RegexObject.findall(string[, pos[, endpos]])

   Identical to the :func:`findall` function, using the compiled pattern.


.. method:: RegexObject.finditer(string[, pos[, endpos]])

   Identical to the :func:`finditer` function, using the compiled pattern.


.. method:: RegexObject.sub(repl, string[, count=0])

   Identical to the :func:`sub` function, using the compiled pattern.


.. method:: RegexObject.subn(repl, string[, count=0])

   Identical to the :func:`subn` function, using the compiled pattern.

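A short sketch of these methods on compiled patterns follows; ``word_rx``,
``comma_rx`` and the input strings are hypothetical names and data used only
for illustration:

```python
import re

word_rx = re.compile(r"\w+")
comma_rx = re.compile(r",\s*")

parts = comma_rx.split("a, b,c")                        # split on the separators
words = word_rx.findall("one two three")                # all matches as strings
spans = [m.span() for m in word_rx.finditer("hi yo")]   # one match object each
replaced = word_rx.sub("X", "one two three", 2)         # at most 2 replacements
result, count = word_rx.subn("X", "one two")            # also report the count

print(parts)          # ['a', 'b', 'c']
print(words)          # ['one', 'two', 'three']
print(spans)          # [(0, 2), (3, 5)]
print(replaced)       # X X three
print(result, count)  # X X 2
```

Compiling once and calling the methods is mostly useful when the same pattern
is applied many times.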

.. attribute:: RegexObject.flags

   The flags argument used when the RE object was compiled, or ``0`` if no flags
   were provided.


.. attribute:: RegexObject.groupindex

   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   numbers. The dictionary is empty if no symbolic groups were used in the
   pattern.


.. attribute:: RegexObject.pattern

   The pattern string from which the RE object was compiled.

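The attributes can be inspected as in the sketch below. The date pattern is an
illustrative assumption. Because some interpreter versions add implicit flag
bits to ``flags``, the sketch tests a single bit rather than comparing the
whole integer:

```python
import re

source = r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})"
rx = re.compile(source, re.IGNORECASE)

# Only named groups appear in groupindex; the third group is unnamed.
names = dict(rx.groupindex)

# Test one flag bit instead of comparing the whole flags value.
ignores_case = bool(rx.flags & re.IGNORECASE)

print(rx.pattern == source)  # True
print(sorted(names.items())) # [('month', 2), ('year', 1)]
print(ignores_case)          # True
```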

.. _match-objects:

Match Objects
-------------

:class:`MatchObject` instances support the following methods and attributes:


.. method:: MatchObject.expand(template)

   Return the string obtained by doing backslash substitution on the template
   string *template*, as done by the :meth:`sub` method. Escapes such as ``\n`` are
   converted to the appropriate characters, and numeric backreferences (``\1``,
   ``\2``) and named backreferences (``\g<1>``, ``\g<name>``) are replaced by the
   contents of the corresponding group.

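For instance, a template can refer to groups either by number or by name. This
is a minimal sketch; the group names ``first`` and ``last`` and the sample
string are assumptions made for the example:

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Numeric and named backreferences produce the same result here.
by_number = m.expand(r"\2, \1")
by_name = m.expand(r"\g<last>, \g<first>")

print(by_number)  # Newton, Isaac
print(by_name)    # Newton, Isaac
```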

.. method:: MatchObject.group([group1, ...])

   Return one or more subgroups of the match. If there is a single argument, the
   result is a single string; if there are multiple arguments, the result is a
   tuple with one item per argument. Without arguments, *group1* defaults to zero
   (the whole match is returned). If a *groupN* argument is zero, the corresponding
   return value is the entire matching string; if it is in the inclusive range
   [1..99], it is the string matching the corresponding parenthesized group. If a
   group number is negative or larger than the number of groups defined in the
   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
   part of the pattern that did not match, the corresponding result is ``None``.
   If a group is contained in a part of the pattern that matched multiple times,
   the last match is returned.

   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   arguments may also be strings identifying groups by their group name. If a
   string argument is not used as a group name in the pattern, an :exc:`IndexError`
   exception is raised.

   A moderately complicated example::

      m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')

   After performing this match, ``m.group(1)`` is ``'3'``, as is
   ``m.group('int')``, and ``m.group(2)`` is ``'14'``.

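The remaining cases described for :meth:`group` (several arguments, a group
that did not match, and a group that matched more than once) can be sketched
like this, with patterns chosen purely for illustration:

```python
import re

m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
whole = m.group()        # no argument: the entire match
pair = m.group(1, 2)     # several arguments: a tuple

# Only the second alternative takes part in the match.
alt = re.match(r"(a)|(b)", "b")
unmatched = alt.group(1)
matched = alt.group(2)

# The group matches three times; only the last match is kept.
rep = re.match(r"(..)+", "a1b2c3")
last = rep.group(1)

print(whole)               # Isaac Newton
print(pair)                # ('Isaac', 'Newton')
print(unmatched, matched)  # None b
print(last)                # c3
```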

.. method:: MatchObject.groups([default])

   Return a tuple containing all the subgroups of the match, from 1 up to however
   many groups are in the pattern. The *default* argument is used for groups that
   did not participate in the match; it defaults to ``None``. (Incompatibility
   note: in the original Python 1.5 release, if the tuple was one element long, a
   string would be returned instead. In later versions (from 1.5.1 on), a
   singleton tuple is returned in such cases.)


.. method:: MatchObject.groupdict([default])

   Return a dictionary containing all the *named* subgroups of the match, keyed by
   the subgroup name. The *default* argument is used for groups that did not
   participate in the match; it defaults to ``None``.

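The effect of the *default* argument is easiest to see with an optional group.
In this sketch the patterns and the group names ``int`` and ``frac`` are
assumptions made for the example:

```python
import re

m = re.match(r"(\d+)\.?(\d+)?", "24")
plain = m.groups()        # the second group did not participate
filled = m.groups("0")    # the same, with a default substituted

named = re.match(r"(?P<int>\d+)(?:\.(?P<frac>\d+))?", "24")
as_dict = named.groupdict()
as_dict_filled = named.groupdict("0")

print(plain)    # ('24', None)
print(filled)   # ('24', '0')
print(as_dict["frac"], as_dict_filled["frac"])  # None 0
```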

.. method:: MatchObject.start([group])
            MatchObject.end([group])

   Return the indices of the start and end of the substring matched by *group*;
   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   *group* exists but did not contribute to the match. For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string. For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.


.. method:: MatchObject.span([group])

   For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
   m.end(group))``. Note that if *group* did not contribute to the match, this is
   ``(-1, -1)``. Again, *group* defaults to zero.

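The three cases (a normal match, an empty match and a group that did not
contribute) can be sketched by extending the pattern used above with a
hypothetical optional group ``(d)?``:

```python
import re

m = re.search(r"b(c?)(d)?", "cba")

whole = m.span()       # the 'b' at index 1
empty = m.span(1)      # group 1 matched the empty string at index 2
missing = m.span(2)    # group 2 exists but did not contribute

# Slicing the original string with start() and end() recovers the match.
recovered = m.string[m.start():m.end()]

print(whole)      # (1, 2)
print(empty)      # (2, 2)
print(missing)    # (-1, -1)
print(recovered)  # b
```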

.. attribute:: MatchObject.pos

   The value of *pos* which was passed to the :meth:`search` or :meth:`match`
   method of the :class:`RegexObject`. This is the index into the string at which
   the RE engine started looking for a match.


.. attribute:: MatchObject.endpos

   The value of *endpos* which was passed to the :meth:`search` or :meth:`match`
   method of the :class:`RegexObject`. This is the index into the string beyond
   which the RE engine will not go.


.. attribute:: MatchObject.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2`` if applied to the same
   string.


.. attribute:: MatchObject.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.


.. attribute:: MatchObject.re

   The regular expression object whose :meth:`match` or :meth:`search` method
   produced this :class:`MatchObject` instance.


.. attribute:: MatchObject.string

   The string passed to :func:`match` or :func:`search`.

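The match object attributes can be read back as in this sketch; the pattern,
the group name ``word`` and the sample string are illustrative assumptions:

```python
import re

rx = re.compile(r"(?P<word>[a-z]+)(\d)?")
text = "..abc.."
m = rx.search(text, 2, 5)

# pos and endpos echo the arguments given to search().
bounds = (m.pos, m.endpos)

# Group 1 ('word') matched; the optional group 2 did not.
last = (m.lastindex, m.lastgroup)

# Nested groups: the outer group closes last, so it is the "last matched".
nested = re.match(r"((a)(b))", "ab").lastindex
flat = re.match(r"(a)(b)", "ab").lastindex

print(bounds)                         # (2, 5)
print(last)                           # (1, 'word')
print(nested, flat)                   # 1 2
print(m.re is rx, m.string is text)   # True True
```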

Examples
--------

**Simulating scanf()**

.. index:: single: scanf()

Python does not currently have an equivalent to :cfunc:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:cfunc:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :cfunc:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :cfunc:`scanf` Token           | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :cfunc:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings

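Applied in Python, the expression above could be used roughly as follows. This
is only a sketch; the variable names are assumptions, and the numeric fields
are converted with :func:`int` because groups are always returned as strings:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Unlike scanf, the regular expression returns strings, so convert explicitly.
filename = m.group(1)
n_errors = int(m.group(2))
n_warnings = int(m.group(3))

print(filename)    # /usr/sbin/sendmail
print(n_errors)    # 0
print(n_warnings)  # 4
```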

**Avoiding recursion**

If you create regular expressions that require the engine to perform a lot of
recursion, you may encounter a :exc:`RuntimeError` exception with the message
``maximum recursion limit exceeded``. For example, ::

   >>> import re
   >>> s = 'Begin ' + 1000*'a very long string ' + 'end'
   >>> re.match('Begin (\w| )*? end', s).end()
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
     File "/usr/local/lib/python2.5/re.py", line 132, in match
       return _compile(pattern, flags).match(string)
   RuntimeError: maximum recursion limit exceeded

You can often restructure your regular expression to avoid recursion.

Starting with Python 2.3, simple uses of the ``*?`` pattern are special-cased to
avoid recursion. Thus, the above regular expression can avoid recursion by
being recast as ``Begin [a-zA-Z0-9_ ]*?end``. As a further benefit, such
regular expressions will run faster than their recursive equivalents.

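As a rough check of the recast pattern, the sketch below applies it to the same
test string. Whether the original pattern actually raises depends on the
interpreter version, so only the recast form is exercised here:

```python
import re

s = 'Begin ' + 1000 * 'a very long string ' + 'end'

# A character class with a non-greedy repeat replaces the (\w| )*? group.
m = re.match(r'Begin [a-zA-Z0-9_ ]*?end', s)

# The match runs to the only occurrence of 'end', at the end of the string.
matched_all = (m.end() == len(s))

print(matched_all)  # True
```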