:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>


This module provides regular expression matching operations similar to
those found in Perl. Both patterns and strings to be searched can be
Unicode strings as well as 8-bit strings. The :mod:`re` module is
always available.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw string
notation.

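As a rough illustration of the raw string notation described above, the sketch
below compares the two spellings of a pattern that matches one literal
backslash. The sample text and the variable names are assumptions chosen for
the example, not part of the module.

```python
import re

# A raw string keeps the backslash; a regular literal turns "\n" into a newline.
raw_length = len(r"\n")      # 2 characters: backslash and 'n'
plain_length = len("\n")     # 1 character: the newline

# The RE that matches one literal backslash is \\ ; these are two spellings
# of the same two-character pattern string.
raw_pattern = r"\\"
plain_pattern = "\\\\"
same_pattern = raw_pattern == plain_pattern

text = "C:\\temp"            # the 7-character string C:\temp
match = re.search(raw_pattern, text)
backslash_index = match.start()

print(raw_length, plain_length, same_pattern, backslash_index)
```

Both spellings compile to the same expression; the raw form is simply easier
to read and to get right.
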
.. seealso::

   Mastering Regular Expressions
      Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
      second edition of the book no longer covers Python at all, but the first
      edition covered writing good regular expression patterns in great detail.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contains low-precedence
operations, there are boundary conditions between *A* and *B*, or either one
has numbered group references. Thus, complex expressions can easily be
constructed from simpler primitive expressions like the ones described here.
For details of the theory and implementation of regular expressions, consult
the Friedl book referenced above, or almost any textbook about compiler
construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the Regular Expression HOWTO,
accessible from http://www.python.org/doc/howto/.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted. Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using the ``\number`` notation, e.g., ``'\x00'``.


The special characters are:

.. %

``'.'``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``'^'``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``'$'``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode.

``'*'``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

``'+'``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

``'?'``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
   string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using ``.*?`` in the previous
   expression will match only ``'<H1>'``.

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
   followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

``'\'``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be repeated twice. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

``[]``
   Used to indicate a set of characters. Characters can be listed individually, or
   a range of characters can be indicated by giving two characters and separating
   them by a ``'-'``. Special characters are not active inside sets. For example,
   ``[akm$]`` will match any of the characters ``'a'``, ``'k'``,
   ``'m'``, or ``'$'``; ``[a-z]`` will match any lowercase letter, and
   ``[a-zA-Z0-9]`` matches any letter or digit. Character classes such
   as ``\w`` or ``\S`` (defined below) are also acceptable inside a
   range, although the characters they match depends on whether :const:`LOCALE`
   or :const:`UNICODE` mode is in force. If you want to include a
   ``']'`` or a ``'-'`` inside a set, precede it with a backslash, or
   place it as the first character. The pattern ``[]]`` will match
   ``']'``, for example.

   You can match the characters not within a range by :dfn:`complementing` the set.
   This is indicated by including a ``'^'`` as the first character of the set;
   ``'^'`` elsewhere will simply match the ``'^'`` character. For example,
   ``[^5]`` will match any character except ``'5'``, and ``[^^]`` will match any
   character except ``'^'``.

``'|'``
   ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
   will match either A or B. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once ``A`` matches, ``B`` will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?iLmsux)``
   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
   ``'u'``, ``'x'``.) The group matches the empty string; the letters
   set the corresponding flags: :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
   and :const:`re.X` (verbose), for the entire regular expression. (The
   flags are described in :ref:`contents-of-module-re`.) This
   is useful if you wish to include the flags as part of the regular
   expression, instead of passing a *flag* argument to the
   :func:`compile` function.

   Note that the ``(?x)`` flag changes how the expression is parsed. It should be
   used first in the expression string, or after one or more whitespace characters.
   If there are non-whitespace characters before the flag, the results are
   undefined.

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid Python
   identifiers, and each group name must be defined only once within a regular
   expression. A symbolic group is also a numbered group, just as if the group
   were not named. So the group named 'id' in the example below can also be
   referenced as the numbered group 1.

   For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
   referenced by its name in arguments to methods of match objects, such as
   ``m.group('id')`` or ``m.end('id')``, and also by name in pattern text (for
   example, ``(?P=id)``) and replacement text (such as ``\g<id>``).

``(?P=name)``
   Matches whatever text was matched by the earlier group named *name*.

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
   patterns which start with positive lookbehind assertions will never match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function::

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen::

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length. Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or *name*
   exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
   can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
   matching pattern, which will match with ``'<user@host.com>'`` as well as
   ``'user@host.com'``, but not with ``'<user@host.com'``.

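The special characters listed above can be tried out with a short script. The
following is only a sketch: the sample strings come from the descriptions
above where possible, and every variable name is an assumption made for the
example.

```python
import re

# Greedy versus non-greedy repetition.
html = '<H1>title</H1>'
greedy = re.match(r'<.*>', html).group(0)          # the entire string
lazy = re.match(r'<.*?>', html).group(0)           # only '<H1>'

# {m,n} versus {m,n}? on six 'a' characters.
most = re.match(r'a{3,5}', 'aaaaaa').group(0)      # 'aaaaa'
fewest = re.match(r'a{3,5}?', 'aaaaaa').group(0)   # 'aaa'

# '|' accepts the first branch that matches, even if a later one is longer.
first_branch = re.match(r'a|ab', 'ab').group(0)    # 'a'

# A named group is also a numbered group.
m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
by_name, by_number = m.group('id'), m.group(1)     # both 'spam_1'

# Lookahead checks what follows without consuming it.
positive = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')   # matches 'Isaac '
negative = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')   # None

# Conditional pattern: require '>' only if the optional '<' was matched.
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')
bracketed = email.match('<user@host.com>')
bare = email.match('user@host.com')
unbalanced = email.match('<user@host.com')                  # None

print(greedy, lazy, most, fewest, first_branch, by_name)
```

Note that ``match`` is used for the conditional pattern so that the attempt is
anchored at the start of the string.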

The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not on the list, then the resulting RE will match
the second character. For example, ``\$`` matches the character ``'$'``.

.. %

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'the end'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word. A word is
   defined as a sequence of alphanumeric or underscore characters, so the end of a
   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
   Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
   precise set of characters deemed to be alphanumeric depends on the values of the
   ``UNICODE`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
   the backspace character, for compatibility with Python's string literals.

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end of a
   word. This is just the opposite of ``\b``, so is also subject to the settings
   of ``LOCALE`` and ``UNICODE``.

``\d``
   When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
   is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
   whatever is classified as a digit in the Unicode character properties database.

``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
   will match anything other than characters marked as digits in the Unicode
   character properties database.

``\s``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
   :const:`LOCALE`, it will match this set plus whatever characters are defined as
   space for the current locale. If :const:`UNICODE` is set, this will match the
   characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
   character properties database.

``\S``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``.
   With :const:`LOCALE`, it will match any character not in this set, and not
   defined as space in the current locale. If :const:`UNICODE` is set, this will
   match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
   the Unicode character properties database.

``\w``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any alphanumeric character and the underscore; this is equivalent to the set
   ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
   whatever characters are defined as alphanumeric for the current locale. If
   :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
   is classified as alphanumeric in the Unicode character properties database.

``\W``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
   With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
   not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
   this will match anything other than ``[0-9_]`` and characters marked as
   alphanumeric in the Unicode character properties database.

``\Z``
   Matches only at the end of the string.

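A few of the special sequences above, sketched on ASCII-only input so that the
results do not depend on the ``LOCALE`` or ``UNICODE`` settings. The sample
strings other than ``'the the'``, ``'55 55'`` and ``'the end'`` are assumptions
made for the example.

```python
import re

# \1 must repeat exactly what group 1 matched.
doubled = re.compile(r'(.+) \1')
repeated_word = doubled.match('the the')
repeated_number = doubled.match('55 55')
no_repeat = doubled.match('the end')          # None

# \b matches only at the edge of a word.
whole_words = re.findall(r'\bfoo\b', 'foo foobar (foo) barfoo')

# \d+ collects runs of decimal digits.
numbers = re.findall(r'\d+', 'call 555 1234')

# \A is the start of the string even in MULTILINE mode, unlike '^'.
line_starts = re.findall(r'^\w+', 'one\ntwo', re.MULTILINE)
string_start = re.findall(r'\A\w+', 'one\ntwo', re.MULTILINE)

print(whole_words, numbers, line_starts, string_start)
```
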
Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \v      \x
   \\

Octal escapes are included in a limited form: If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

.. % Note the lack of a period in the section title; it causes problems
.. % with readers of the GNU info version. See http://www.python.org/sf/581414.


.. _matching-searching:

Matching vs Searching
---------------------

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>


Python offers two different primitive operations based on regular expressions:
**match** checks for a match only at the beginning of the string, while
**search** checks for a match anywhere in the string (this is what Perl does
by default).

Note that match may differ from search even when using a regular expression
beginning with ``'^'``: ``'^'`` matches only at the start of the string, or in
:const:`MULTILINE` mode also immediately following a newline. The "match"
operation succeeds only if the pattern matches at the start of the string
regardless of mode, or at the starting position given by the optional *pos*
argument regardless of whether a newline precedes it.

.. % Examples from Tim Peters:

::

   import re

   re.compile("a").match("ba", 1)           # succeeds
   re.compile("^a").search("ba", 1)         # fails; 'a' not at start
   re.compile("^a").search("\na", 1)        # fails; 'a' not at start
   re.compile("^a", re.M).search("\na", 1)  # succeeds
   re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n


.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications always use the compiled
form.


.. function:: compile(pattern[, flags])

   Compile a regular expression pattern into a regular expression object, which can
   be used for matching using its :meth:`match` and :meth:`search` methods,
   described below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pat)
      result = prog.match(str)

   is equivalent to ::

      result = re.match(pat, str)

   but the version using :func:`compile` is more efficient when the expression will
   be used several times in a single program.

   .. % (The compiled version of the last pattern passed to
   .. % \function{re.match()} or \function{re.search()} is cached, so
   .. % programs that use only a single regular expression at a time needn't
   .. % worry about compiling regular expressions.)

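A minimal sketch of compiling once with several flags combined by ``|`` and
then reusing the resulting object. The pattern, the sample text and the names
``prog``, ``lines`` and ``numbers`` are assumptions chosen for illustration.

```python
import re

# Combine flags with bitwise OR; the compiled object can be reused freely.
prog = re.compile(r'^item (\d+)$', re.IGNORECASE | re.MULTILINE)

lines = "Item 1\nitem 22\nother\nITEM 333"
numbers = prog.findall(lines)       # one group, so a list of strings

# The module-level function gives the same result as the compiled method.
via_module = re.match(r'item', 'item 1').group(0)
via_object = re.compile(r'item').match('item 1').group(0)

print(numbers, via_module == via_object)
```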

.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will match
   lowercase letters, too. This is not affected by the current locale.


.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the current
   locale.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline). By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.

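The effect of ``MULTILINE`` can be sketched with the ``foo.$`` example from the
syntax section; the variable names here are illustrative only.

```python
import re

text = 'foo1\nfoo2\n'

# By default '$' matches only at the end of the string, or just before
# the newline that ends it.
default = re.search(r'foo.$', text).group(0)

# With MULTILINE, '$' also matches before every newline.
multiline = re.search(r'foo.$', text, re.MULTILINE).group(0)

# Likewise for '^' at the start of each line.
starts_default = re.findall(r'^foo', text)
starts_multiline = re.findall(r'^foo', text, re.MULTILINE)

print(default, multiline, starts_default, starts_multiline)
```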

.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.


.. data:: U
          UNICODE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
   on the Unicode character properties database.


.. data:: X
          VERBOSE

   This flag allows you to write regular expressions that look nicer. Whitespace
   within the pattern is ignored, except when in a character class or preceded by
   an unescaped backslash, and, when a line contains a ``'#'`` neither in a
   character class nor preceded by an unescaped backslash, all characters from the
   leftmost such ``'#'`` through the end of the line are ignored.

   This means that the two following regular expression objects match the same
   strings::

      re.compile(r""" [a-z]+   # some letters
                      \.\.     # two dots
                      [a-z]*   # perhaps more letters""", re.X)
      re.compile(r"[a-z]+\.\.[a-z]*")


.. function:: search(pattern, string[, flags])

   Scan through *string* looking for a location where the regular expression
   *pattern* produces a match, and return a corresponding :class:`MatchObject`
   instance. Return ``None`` if no position in the string matches the pattern; note
   that this is different from finding a zero-length match at some point in the
   string.


.. function:: match(pattern, string[, flags])

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :class:`MatchObject` instance.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   .. note::

      If you want to locate a match anywhere in *string*, use :meth:`search` instead.

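The difference between ``None`` and a zero-length match, and between
:func:`match` and :func:`search`, can be seen in a short sketch (the sample
string ``'abc'`` and the variable names are assumptions):

```python
import re

anchored = re.match(r'b', 'abc')     # None: 'b' is not at the beginning
anywhere = re.search(r'b', 'abc')    # found at index 1

# 'x*' matches zero characters at the start, which is still a match.
empty = re.match(r'x*', 'abc')

print(anchored, anywhere.start(), repr(empty.group(0)))
```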

.. function:: split(pattern, string[, maxsplit=0])

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern are also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored. This has been fixed in later releases.) ::

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']

   Note that *split* will never split a string on an empty pattern match.
   For example ::

      >>> re.split('x*', 'foo')
      ['foo']
      >>> re.split("(?m)^$", "foo\n\nbar\n")
      ['foo\n\nbar\n']


.. function:: findall(pattern, string[, flags])

   Return a list of all non-overlapping matches of *pattern* in *string*. If one
   or more groups are present in the pattern, return a list of groups; this will be
   a list of tuples if the pattern has more than one group. Empty matches are
   included in the result unless they touch the beginning of another match.

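How the shape of the :func:`findall` result depends on the number of groups
can be sketched as follows; the sample text and variable names are assumptions
made for the example.

```python
import re

text = 'width=80, height=24'

# No groups: a list of the whole matches.
whole = re.findall(r'\w+=\d+', text)

# One group: a list of that group's text.
one_group = re.findall(r'\w+=(\d+)', text)

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w+)=(\d+)', text)

print(whole, one_group, two_groups)
```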

.. function:: finditer(pattern, string[, flags])

   Return an iterator over all non-overlapping matches for the RE *pattern* in
   *string*. For each match, the iterator returns a match object. Empty matches
   are included in the result unless they touch the beginning of another match.

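Because :func:`finditer` yields match objects rather than strings, positions
are available as well as text. A small sketch, with an assumed sample string:

```python
import re

text = 'spam, eggs, ham'

# Collect each word together with its start and end offsets.
words = [(m.group(0), m.start(), m.end())
         for m in re.finditer(r'\w+', text)]

print(words)
```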

.. function:: sub(pattern, repl, string[, count])

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged. *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage
   return, and so forth. Unknown escapes such as ``\j`` are left alone.
   Backreferences, such as ``\6``, are replaced with the substring matched by
   group 6 in the pattern. For example::

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single match object argument, and returns the
   replacement string. For example::

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'

   The pattern may be a string or an RE object; if you need to specify regular
   expression flags, you must use a RE object, or use embedded modifiers in a
   pattern; for example, ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
   ``'-a-b-c-'``.

   In addition to character escapes and backreferences as described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

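The ``\g<...>`` forms and the *count* argument can be sketched as below. The
patterns, the sample strings and the variable names are assumptions made for
the example; *count* is passed by keyword for clarity.

```python
import re

# \g<name> refers to a named group in the replacement text.
swapped = re.sub(r'(?P<first>\w+) (?P<last>\w+)',
                 r'\g<last>, \g<first>',
                 'Isaac Asimov')

# \g<2>0 means "group 2 followed by a literal 0"; writing \20 instead
# would be read as a reference to group 20.
padded = re.sub(r'(\w)(\d)', r'\g<2>0', 'a1 b2')

# count limits how many occurrences are replaced.
limited = re.sub(r'-', '+', 'a-b-c-d', count=2)

print(swapped, padded, limited)
```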

.. function:: subn(pattern, repl, string[, count])

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.

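A brief sketch of unpacking the tuple returned by :func:`subn`; the sample
strings and names are illustrative assumptions.

```python
import re

# Collapse runs of whitespace and learn how many runs were replaced.
new_string, number_of_subs_made = re.subn(r'\s+', ' ', 'too   many    spaces')

# When nothing matches, the string comes back unchanged with a count of 0.
unchanged = re.subn(r'x', 'y', 'abc')

print(new_string, number_of_subs_made, unchanged)
```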

.. function:: escape(string)

   Return *string* with all non-alphanumerics backslashed; this is useful if you
   want to match an arbitrary literal string that may have regular expression
   metacharacters in it.

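A sketch of why escaping matters when the text to find contains
metacharacters. The literal ``'a.b*c'`` and the sample strings are assumptions;
the exact escaped form is not shown because it is an implementation detail.

```python
import re

literal = 'a.b*c'
pattern = re.escape(literal)

# The escaped pattern matches only the literal text.
found = re.search(pattern, 'xxa.b*cxx')
not_found = re.search(pattern, 'aXbbbc')        # None

# Unescaped, '.' and '*' keep their special meaning and match more.
too_loose = re.search(literal, 'aXbbbc')

print(found.group(0), not_found, too_loose.group(0))
```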

.. exception:: error

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern.

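One possible way to guard against invalid patterns, sketched with a
hypothetical helper named ``try_compile`` (not part of the module):

```python
import re

def try_compile(pattern):
    """Return a compiled RE object, or None if the pattern is invalid."""
    try:
        return re.compile(pattern)
    except re.error:
        return None

valid = try_compile(r'(a|b)+')
broken = try_compile(r'(a|b')        # unmatched parenthesis

# A pattern that simply finds nothing raises no exception.
no_match = valid.search('xyz')

print(valid is not None, broken, no_match)
```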

.. _re-objects:

Regular Expression Objects
--------------------------

Compiled regular expression objects support the following methods and
attributes:


.. method:: RegexObject.match(string[, pos[, endpos]])

   If zero or more characters at the beginning of *string* match this regular
   expression, return a corresponding :class:`MatchObject` instance. Return
   ``None`` if the string does not match the pattern; note that this is different
   from a zero-length match.

   .. note::

      If you want to locate a match anywhere in *string*, use :meth:`search` instead.

   The optional second parameter *pos* gives an index in the string where the
   search is to start; it defaults to ``0``. This is not completely equivalent to
   slicing the string; the ``'^'`` pattern character matches at the real beginning
   of the string and at positions just after a newline, but not necessarily at the
   index where the search is to start.

   The optional parameter *endpos* limits how far the string will be searched; it
   will be as if the string is *endpos* characters long, so only the characters
   from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
   than *pos*, no match will be found. Otherwise, if *rx* is a compiled regular
   expression object, ``rx.match(string, 0, 50)`` is equivalent to
   ``rx.match(string[:50], 0)``.

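The behaviour of *pos* and *endpos* can be sketched as follows; the patterns
and sample strings are assumptions chosen for the example.

```python
import re

# pos is not the same as slicing: '^' still refers to the real start.
rx = re.compile(r'^a')
with_pos = rx.search('ba', 1)        # None
with_slice = rx.search('ba'[1:])     # matches, 'a' is now the start

# match() with pos begins matching at that index.
digits = re.compile(r'\d+')
from_three = digits.match('abc123', 3).group(0)

# endpos makes the string behave as if it were that many characters long.
limited = digits.match('abc123456', 3, 6).group(0)
sliced = digits.match('abc123456'[:6], 3).group(0)

print(with_pos, with_slice.group(0), from_three, limited, sliced)
```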
672
673.. method:: RegexObject.search(string[, pos[, endpos]])
674
675 Scan through *string* looking for a location where this regular expression
676 produces a match, and return a corresponding :class:`MatchObject` instance.
677 Return ``None`` if no position in the string matches the pattern; note that this
678 is different from finding a zero-length match at some point in the string.
679
680 The optional *pos* and *endpos* parameters have the same meaning as for the
681 :meth:`match` method.
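
A small sketch contrasting :meth:`search` with :meth:`match`, and showing that
a zero-length match is still a match object; the patterns and strings are
invented for the example:

```python
import re

# Hypothetical patterns for illustration.
one_or_more = re.compile(r"\d+")
zero_or_more = re.compile(r"\d*")   # can match the empty string

# search() scans forward until some position matches.
found = one_or_more.search("abc123")       # match covering indices 3..6

# match() only tries the start of the string.
not_found = one_or_more.match("abc123")    # None

# A zero-length match is a real match object, not None.
empty = zero_or_more.search("abc")
```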


.. method:: RegexObject.split(string[, maxsplit=0])

   Identical to the :func:`split` function, using the compiled pattern.


.. method:: RegexObject.findall(string[, pos[, endpos]])

   Similar to the :func:`findall` function, using the compiled pattern, but
   also accepts the optional *pos* and *endpos* parameters, which limit the
   search region as they do for :meth:`match`.


.. method:: RegexObject.finditer(string[, pos[, endpos]])

   Similar to the :func:`finditer` function, using the compiled pattern, but
   also accepts the optional *pos* and *endpos* parameters, which limit the
   search region as they do for :meth:`match`.


.. method:: RegexObject.sub(repl, string[, count=0])

   Identical to the :func:`sub` function, using the compiled pattern.


.. method:: RegexObject.subn(repl, string[, count=0])

   Identical to the :func:`subn` function, using the compiled pattern.
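
The following sketch exercises these methods on one compiled pattern each; the
sample text and variable names are assumptions made for the example:

```python
import re

# Hypothetical sample data for illustration.
sep = re.compile(r"\s*,\s*")
num = re.compile(r"\d+")
text = "10, 20 ,30"

parts = sep.split(text)                        # ['10', '20', '30']
first_split = sep.split(text, maxsplit=1)      # ['10', '20 ,30']

numbers = num.findall(text)                    # ['10', '20', '30']
later_numbers = num.findall(text, 3)           # search starts at index 3
spans = [m.span() for m in num.finditer(text)]

replaced = num.sub("N", text)                  # 'N, N ,N'
replaced_n = num.subn("N", text, count=2)      # ('N, N ,30', 2)
```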


.. attribute:: RegexObject.flags

   The flags argument used when the RE object was compiled, or ``0`` if no flags
   were provided.


.. attribute:: RegexObject.groupindex

   A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
   numbers. The dictionary is empty if no symbolic groups were used in the
   pattern.


.. attribute:: RegexObject.pattern

   The pattern string from which the RE object was compiled.
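
A minimal sketch of reading these attributes back from a compiled pattern. The
pattern is hypothetical, and the flag is tested with a bit mask because some
Python versions add implied flag bits to the stored value:

```python
import re

# Hypothetical pattern for illustration.
rx = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})", re.IGNORECASE)

pattern_text = rx.pattern              # the original pattern string
group_numbers = dict(rx.groupindex)    # {'year': 1, 'month': 2}

# Test for a specific flag rather than comparing the whole value.
ignorecase_set = bool(rx.flags & re.IGNORECASE)

# No named groups means an empty mapping.
no_names = dict(re.compile(r"(\d+)").groupindex)
```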


.. _match-objects:

Match Objects
-------------

:class:`MatchObject` instances support the following methods and attributes:


.. method:: MatchObject.expand(template)

   Return the string obtained by doing backslash substitution on the template
   string *template*, as done by the :meth:`sub` method. Escapes such as ``\n`` are
   converted to the appropriate characters, and numeric backreferences (``\1``,
   ``\2``) and named backreferences (``\g<1>``, ``\g<name>``) are replaced by the
   contents of the corresponding group.
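
For instance, with a made-up pattern and input string, template expansion
might look like this:

```python
import re

# Hypothetical pattern and input for illustration.
m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Numeric and named references are filled in from the match.
swapped = m.expand(r"\2, \1")                # 'Newton, Isaac'
named = m.expand(r"\g<last>, \g<first>")     # 'Newton, Isaac'

# Escapes such as \n become the corresponding character.
two_lines = m.expand(r"\1\n\2")              # 'Isaac', newline, 'Newton'
```

Raw string literals keep the backslashes intact until the template is
processed.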


.. method:: MatchObject.group([group1, ...])

   Return one or more subgroups of the match. If there is a single argument, the
   result is a single string; if there are multiple arguments, the result is a
   tuple with one item per argument. Without arguments, *group1* defaults to zero
   (the whole match is returned). If a *groupN* argument is zero, the corresponding
   return value is the entire matching string; if it is in the inclusive range
   [1..99], it is the string matching the corresponding parenthesized group. If a
   group number is negative or larger than the number of groups defined in the
   pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
   part of the pattern that did not match, the corresponding result is ``None``.
   If a group is contained in a part of the pattern that matched multiple times,
   the last match is returned.

   If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
   arguments may also be strings identifying groups by their group name. If a
   string argument is not used as a group name in the pattern, an :exc:`IndexError`
   exception is raised.

   A moderately complicated example::

      m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')

   After performing this match, ``m.group(1)`` is ``'3'``, as is
   ``m.group('int')``, and ``m.group(2)`` is ``'14'``.


.. method:: MatchObject.groups([default])

   Return a tuple containing all the subgroups of the match, from 1 up to however
   many groups are in the pattern. The *default* argument is used for groups that
   did not participate in the match; it defaults to ``None``. (Incompatibility
   note: in the original Python 1.5 release, if the tuple was one element long, a
   string would be returned instead. In later versions (from 1.5.1 on), a
   singleton tuple is returned in such cases.)


.. method:: MatchObject.groupdict([default])

   Return a dictionary containing all the *named* subgroups of the match, keyed by
   the subgroup name. The *default* argument is used for groups that did not
   participate in the match; it defaults to ``None``.
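
A short sketch of how *default* affects both methods, using a hypothetical
pattern whose second group is optional:

```python
import re

# Hypothetical pattern: the fractional part is optional.
m = re.match(r"(?P<int>\d+)\.?(?P<frac>\d+)?", "24")

all_groups = m.groups()             # ('24', None)
with_default = m.groups("0")        # ('24', '0')

named = m.groupdict()               # {'int': '24', 'frac': None}
named_default = m.groupdict("0")    # {'int': '24', 'frac': '0'}
```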


.. method:: MatchObject.start([group])
            MatchObject.end([group])

   Return the indices of the start and end of the substring matched by *group*;
   *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
   *group* exists but did not contribute to the match. For a match object *m*, and
   a group *g* that did contribute to the match, the substring matched by group *g*
   (equivalent to ``m.group(g)``) is ::

      m.string[m.start(g):m.end(g)]

   Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
   null string. For example, after ``m = re.search('b(c?)', 'cba')``,
   ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
   2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
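
One possible use of these indices is cutting the matched text out of the
original string. The address below is a made-up sample:

```python
import re

# Hypothetical input for illustration.
email = "tony@tiremove_thisger.net"
m = re.search("remove_this", email)

# Keep everything before and after the matched substring.
cleaned = email[:m.start()] + email[m.end():]       # 'tony@tiger.net'

# Slicing m.string with start() and end() reproduces group().
same_text = m.string[m.start(0):m.end(0)] == m.group(0)

# A group that exists but took no part in the match reports -1.
unused_start = re.match(r"(a)|(b)", "b").start(1)   # -1
```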


.. method:: MatchObject.span([group])

   For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
   m.end(group))``. Note that if *group* did not contribute to the match, this is
   ``(-1, -1)``. Again, *group* defaults to zero.
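
A brief illustration with an invented pattern whose second group is optional:

```python
import re

# Hypothetical pattern and input for illustration.
m = re.search(r"(\d+)(x)?", "abc 123")

whole_span = m.span()       # (4, 7)
digits_span = m.span(1)     # (4, 7)
unused_span = m.span(2)     # (-1, -1): group 2 did not contribute
```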


.. attribute:: MatchObject.pos

   The value of *pos* which was passed to the :meth:`search` or :meth:`match`
   method of the :class:`RegexObject`. This is the index into the string at which
   the RE engine started looking for a match.


.. attribute:: MatchObject.endpos

   The value of *endpos* which was passed to the :meth:`search` or :meth:`match`
   method of the :class:`RegexObject`. This is the index into the string beyond
   which the RE engine will not go.
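
These two attributes describe the search window, not the location of the
match. A sketch with a made-up string:

```python
import re

# Hypothetical pattern and input for illustration.
rx = re.compile(r"\d+")
text = "id 12345 end"

m = rx.search(text, 1, 6)
window = (m.pos, m.endpos)     # (1, 6): the limits that were passed in
where = m.span()               # (3, 6): where the match was actually found
found = m.group()              # '123', cut short by endpos
```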


.. attribute:: MatchObject.lastindex

   The integer index of the last matched capturing group, or ``None`` if no group
   was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
   ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
   the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
   string.


.. attribute:: MatchObject.lastgroup

   The name of the last matched capturing group, or ``None`` if the group didn't
   have a name, or if no group was matched at all.
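
The cases listed above can be checked directly. The last pattern, with its
group names, is an assumption added for the example:

```python
import re

last_single = re.match(r"(a)b", "ab").lastindex        # 1
last_nested = re.match(r"((a)(b))", "ab").lastindex    # 1
last_pair = re.match(r"(a)(b)", "ab").lastindex        # 2
no_groups = re.match(r"ab", "ab").lastindex            # None

# Hypothetical named groups for illustration.
m = re.match(r"(?P<word>[a-z]+)|(?P<number>\d+)", "42")
last_name = m.lastgroup                                # 'number'
unnamed = re.match(r"(a)(b)", "ab").lastgroup          # None
```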


.. attribute:: MatchObject.re

   The regular expression object whose :meth:`match` or :meth:`search` method
   produced this :class:`MatchObject` instance.


.. attribute:: MatchObject.string

   The string passed to :func:`match` or :func:`search`.


Examples
--------

**Simulating scanf()**

.. index:: single: scanf()

Python does not currently have an equivalent to :cfunc:`scanf`. Regular
expressions are generally more powerful, though also more verbose, than
:cfunc:`scanf` format strings. The table below offers some more-or-less
equivalent mappings between :cfunc:`scanf` format tokens and regular
expressions.

+--------------------------------+---------------------------------------------+
| :cfunc:`scanf` Token           | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+

To extract the filename and numbers from a string like ::

   /usr/sbin/sendmail - 0 errors, 4 warnings

you would use a :cfunc:`scanf` format like ::

   %s - %d errors, %d warnings

The equivalent regular expression would be ::

   (\S+) - (\d+) errors, (\d+) warnings
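
Applied to the sample line above, that expression could be used roughly as
follows. The variable names are illustrative, and the conversion to ``int`` is
an extra step that :cfunc:`scanf` would perform itself:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Groups are returned as strings; convert the counts explicitly.
filename = m.group(1)
errors = int(m.group(2))
warnings = int(m.group(3))
```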

**Avoiding recursion**

If you create regular expressions that require the engine to perform a lot of
recursion, you may encounter a :exc:`RuntimeError` exception with the message
``maximum recursion limit exceeded``. For example, ::

   >>> import re
   >>> s = 'Begin ' + 1000*'a very long string ' + 'end'
   >>> re.match('Begin (\w| )*? end', s).end()
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
     File "/usr/local/lib/python2.5/re.py", line 132, in match
       return _compile(pattern, flags).match(string)
   RuntimeError: maximum recursion limit exceeded

You can often restructure your regular expression to avoid recursion.

Starting with Python 2.3, simple uses of the ``*?`` pattern are special-cased to
avoid recursion. Thus, the above regular expression can avoid recursion by
being recast as ``Begin [a-zA-Z0-9_ ]*?end``. As a further benefit, such
regular expressions will run faster than their recursive equivalents.
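
A sketch of the recast expression run against the same test string; it only
checks that the character-class form consumes the whole string:

```python
import re

s = 'Begin ' + 1000 * 'a very long string ' + 'end'

# Recast form from the text above: a character class replaces (\w| ).
m = re.match(r'Begin [a-zA-Z0-9_ ]*?end', s)

# The non-greedy class stops at the first 'end', which is the final word.
matched_to_end = m.end() == len(s)
```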