:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>


This module provides regular expression matching operations similar to
those found in Perl. Both patterns and strings to be searched can be
Unicode strings as well as 8-bit strings. The :mod:`re` module is
always available.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.

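The difference between the two notations can be checked directly. The
following is a minimal sketch; the variable names and the sample path
``C:\temp`` are illustrative only and do not come from the module.

```python
import re

# A regular string literal needs four backslashes to spell the
# two-character regular expression \\ (which matches one backslash).
plain_pattern = '\\\\'
# A raw string literal spells the same pattern with two backslashes.
raw_pattern = r'\\'

same_pattern = (plain_pattern == raw_pattern)

# r"\n" holds two characters (backslash, 'n'); "\n" holds one newline.
raw_length = len(r"\n")
plain_length = len("\n")

# The pattern matches the single backslash in the subject string C:\temp.
found = re.search(raw_pattern, 'C:\\temp').group(0)
```

Both spellings produce the same pattern string; the raw form is simply easier
to read and harder to get wrong.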
.. seealso::

   Mastering Regular Expressions
      Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
      second edition of the book no longer covers Python at all, but the first
      edition covered writing good regular expression patterns in great detail.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match *AB*. This holds unless *A* or *B* contains low-precedence
operations, there are boundary conditions between *A* and *B*, or either has
numbered group references. Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here. For details of
the theory and implementation of regular expressions, consult the Friedl book
referenced above, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the Regular Expression HOWTO,
accessible from http://www.python.org/doc/howto/.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted. Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using the ``\number`` notation, e.g., ``'\x00'``.


The special characters are:

.. %

``'.'``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``'^'``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``'$'``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both ``'foo'`` and ``'foobar'``, while the regular expression ``foo$``
   matches only ``'foo'``. More interestingly, searching for ``foo.$`` in
   ``'foo1\nfoo2\n'`` matches ``'foo2'`` normally, but ``'foo1'`` in
   :const:`MULTILINE` mode.

``'*'``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match ``'a'``, ``'ab'``, or
   ``'a'`` followed by any number of ``'b'``\ s.

``'+'``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match ``'a'`` followed by any non-zero number of ``'b'``\ s; it
   will not match just ``'a'``.

``'?'``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either ``'a'`` or ``'ab'``.

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
   string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as *few*
   characters as possible will be matched. Using ``.*?`` in the previous
   expression will match only ``'<H1>'``.

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
   followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.

``'\'``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be doubled. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

``[]``
   Used to indicate a set of characters. Characters can be listed individually, or
   a range of characters can be indicated by giving two characters and separating
   them by a ``'-'``. Special characters are not active inside sets. For example,
   ``[akm$]`` will match any of the characters ``'a'``, ``'k'``,
   ``'m'``, or ``'$'``; ``[a-z]`` will match any lowercase letter, and
   ``[a-zA-Z0-9]`` matches any letter or digit. Character classes such
   as ``\w`` or ``\S`` (defined below) are also acceptable inside a
   set, although the characters they match depend on whether :const:`LOCALE`
   or :const:`UNICODE` mode is in force. If you want to include a
   ``']'`` or a ``'-'`` inside a set, precede it with a backslash, or
   place it as the first character. The pattern ``[]]`` will match
   ``']'``, for example.

   You can match the characters not within a range by :dfn:`complementing` the set.
   This is indicated by including a ``'^'`` as the first character of the set;
   ``'^'`` elsewhere will simply match the ``'^'`` character. For example,
   ``[^5]`` will match any character except ``'5'``, and ``[^^]`` will match any
   character except ``'^'``.

``'|'``
   ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
   will match either A or B. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once ``A`` matches, ``B`` will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?iLmsux)``
   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
   ``'u'``, ``'x'``.) The group matches the empty string; the letters
   set the corresponding flags: :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
   and :const:`re.X` (verbose), for the entire regular expression. (The
   flags are described in :ref:`contents-of-module-re`.) This
   is useful if you wish to include the flags as part of the regular
   expression, instead of passing a *flags* argument to the
   :func:`compile` function.

   Note that the ``(?x)`` flag changes how the expression is parsed. It should be
   used first in the expression string, or after one or more whitespace characters.
   If there are non-whitespace characters before the flag, the results are
   undefined.

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid Python
   identifiers, and each group name must be defined only once within a regular
   expression. A symbolic group is also a numbered group, just as if the group
   were not named. So the group named 'id' in the example below can also be
   referenced as the numbered group 1.

   For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
   referenced by its name in arguments to methods of match objects, such as
   ``m.group('id')`` or ``m.end('id')``, and also by name in pattern text (for
   example, ``(?P=id)``) and replacement text (such as ``\g<id>``).

``(?P=name)``
   Matches whatever text was matched by the earlier group named *name*.

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
   patterns which start with positive lookbehind assertions will never match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function::

      >>> import re
      >>> m = re.search(r'(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen::

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length. Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or *name*
   exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
   can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
   matching pattern, which will match with ``'<user@host.com>'`` as well as
   ``'user@host.com'``, but not with ``'<user@host.com'``.

   .. versionadded:: 2.4

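Several of the constructs listed above are easier to follow when run. The
sketch below reuses the patterns and sample strings from the entries themselves;
only the variable names and the string ``'say the the end'`` are made up for
illustration.

```python
import re

html = '<H1>title</H1>'

# Greedy: .* takes as much as it can, so the match spans the whole string.
greedy = re.match(r'<.*>', html).group(0)
# Non-greedy: .*? stops at the first '>' that lets the match succeed.
minimal = re.match(r'<.*?>', html).group(0)

# {m,n} is greedy too; {m,n}? takes the fewest repetitions allowed.
most = re.match(r'a{3,5}', 'aaaaaa').group(0)
fewest = re.match(r'a{3,5}?', 'aaaaaa').group(0)

# A named group is also numbered, and (?P=name) refers back to its text.
doubled = re.search(r'(?P<word>\w+) (?P=word)', 'say the the end')
by_name = doubled.group('word')
by_number = doubled.group(1)

# Lookahead and lookbehind test the surrounding text without consuming it.
ahead = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov').group(0)
not_ahead = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')
behind = re.search(r'(?<=-)\w+', 'spam-egg').group(0)

# Conditional: require the closing '>' only if group 1 matched the '<'.
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')
bracketed = email.match('<user@host.com>') is not None
bare = email.match('user@host.com') is not None
unbalanced = email.match('<user@host.com') is not None
```

Note that the conditional pattern is tested with ``match``; with ``search`` the
unbalanced string would still yield a match starting after the ``'<'``.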
The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not on the list, then the resulting RE will match
the second character. For example, ``\$`` matches the character ``'$'``.

.. %

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'the end'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.

``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word. A word is
   defined as a sequence of alphanumeric or underscore characters, so the end of a
   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
   Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
   precise set of characters deemed to be alphanumeric depends on the values of the
   ``UNICODE`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
   the backspace character, for compatibility with Python's string literals.

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end of a
   word. This is just the opposite of ``\b``, so is also subject to the settings
   of ``LOCALE`` and ``UNICODE``.

``\d``
   When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
   is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
   whatever is classified as a digit in the Unicode character properties database.

``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
   will match anything other than characters marked as digits in the Unicode
   character properties database.

``\s``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
   :const:`LOCALE`, it will match this set plus whatever characters are defined as
   space for the current locale. If :const:`UNICODE` is set, this will match the
   characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
   character properties database.

``\S``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``.
   With :const:`LOCALE`, it will match any character not in this set, and not
   defined as space in the current locale. If :const:`UNICODE` is set, this will
   match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
   the Unicode character properties database.

``\w``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any alphanumeric character and the underscore; this is equivalent to the set
   ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
   whatever characters are defined as alphanumeric for the current locale. If
   :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
   is classified as alphanumeric in the Unicode character properties database.

``\W``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
   With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
   not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
   this will match anything other than ``[0-9_]`` and characters marked as
   alphanumeric in the Unicode character properties database.

``\Z``
   Matches only at the end of the string.

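The special sequences can be tried out in the same way. This is a small
sketch on plain ASCII input, so the :const:`LOCALE` and :const:`UNICODE`
distinctions described above do not come into play; the sample strings and
variable names are illustrative.

```python
import re

# \number: the text matched by group 1 must repeat after the space.
repeated = re.match(r'(.+) \1', 'the the') is not None
different = re.match(r'(.+) \1', 'the end') is not None

# \b matches only at a word boundary, so 'cat' inside 'concatenate' is skipped.
whole_words = re.findall(r'\bcat\b', 'cat concatenate cat.')

# \d and \D are complementary classes.
digits = re.findall(r'\d+', 'order 66, aisle 7')
non_digits = re.findall(r'\D+', '12ab34')

# \w and \s pick out word characters and whitespace respectively.
words = re.findall(r'\w+', 'spam_1, eggs!')
gaps = re.findall(r'\s+', 'a \t b')

# \A anchors at the very start of the string, even in MULTILINE mode,
# where '^' also matches after each newline.
caret_hits = re.findall(r'(?m)^\w+', 'one\ntwo')
start_hits = re.findall(r'(?m)\A\w+', 'one\ntwo')
```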
Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \v      \x
   \\

Octal escapes are included in a limited form: If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.

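The rule for telling octal escapes from group references can be illustrated
with three short patterns. This is only a sketch; the variable names are
arbitrary.

```python
import re

# One digit that is not 0: a reference to that group.
group_ref = re.match(r'(a)\1', 'aa').group(0)

# Three octal digits: the character with that octal value (141 octal is 'a').
three_digits = re.match(r'\141', 'a').group(0)

# A leading 0 always means an octal escape; \07 is the BEL character.
leading_zero = re.match(r'\07', '\x07').group(0)
```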
.. % Note the lack of a period in the section title; it causes problems
.. % with readers of the GNU info version. See http://www.python.org/sf/581414.


.. _matching-searching:

Matching vs Searching
---------------------

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>


Python offers two different primitive operations based on regular expressions:
**match** checks for a match only at the beginning of the string, while
**search** checks for a match anywhere in the string (this is what Perl does
by default).

Note that match may differ from search even when using a regular expression
beginning with ``'^'``: ``'^'`` matches only at the start of the string, or in
:const:`MULTILINE` mode also immediately following a newline. The "match"
operation succeeds only if the pattern matches at the start of the string
regardless of mode, or at the starting position given by the optional *pos*
argument regardless of whether a newline precedes it.

.. % Examples from Tim Peters:

::

   re.compile("a").match("ba", 1)           # succeeds
   re.compile("^a").search("ba", 1)         # fails; 'a' not at start
   re.compile("^a").search("\na", 1)        # fails; 'a' not at start
   re.compile("^a", re.M).search("\na", 1)  # succeeds
   re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n


.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications always use the compiled
form.


.. function:: compile(pattern[, flags])

   Compile a regular expression pattern into a regular expression object, which
   can be used for matching using its :func:`match` and :func:`search` methods,
   described below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pat)
      result = prog.match(str)

   is equivalent to ::

      result = re.match(pat, str)

   but the version using :func:`compile` is more efficient when the expression
   will be used several times in a single program.

   .. % (The compiled version of the last pattern passed to
   .. % \function{re.match()} or \function{re.search()} is cached, so
   .. % programs that use only a single regular expression at a time needn't
   .. % worry about compiling regular expressions.)

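As a brief illustration of combining flags, the sketch below compiles one
pattern with two of the flags described in the following entries. The pattern,
the sample report text and the variable names are invented for the example.

```python
import re

# Combine flags with the bitwise OR operator.
prog = re.compile(r'^total: (\d+)$', re.IGNORECASE | re.MULTILINE)

report = 'Items: 3\nTOTAL: 42\n'

# The compiled object can be reused for any number of searches.
compiled_result = prog.search(report).group(1)

# The module-level function accepts the same pattern and flags directly.
direct_result = re.search(r'^total: (\d+)$', report, re.I | re.M).group(1)
```

Both calls find the same text; the compiled form merely avoids handing the
pattern and flags to the module again on each use.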

.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will match
   lowercase letters, too. This is not affected by the current locale.


.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
   current locale.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline). By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.


.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.


.. data:: U
          UNICODE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
   on the Unicode character properties database.

   .. versionadded:: 2.0


.. data:: X
          VERBOSE

   This flag allows you to write regular expressions that look nicer. Whitespace
   within the pattern is ignored, except when in a character class or preceded by
   an unescaped backslash, and, when a line contains a ``'#'`` neither in a
   character class nor preceded by an unescaped backslash, all characters from the
   leftmost such ``'#'`` through the end of the line are ignored.

   .. % XXX should add an example here

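To make the effect of :const:`VERBOSE` concrete, here is a minimal sketch
that writes one pattern in both styles and then shows the two exceptions
(character classes and escaped whitespace). The patterns and sample strings are
illustrative and not taken from the module.

```python
import re

# The same pattern written twice: once compactly, once with re.VERBOSE.
compact = re.compile(r'\d+\.\d*')

verbose = re.compile(r"""
    \d+   # the integral part
    \.    # the decimal point
    \d*   # some fractional digits
    """, re.VERBOSE)

compact_match = compact.match('3.14 rad').group(0)
verbose_match = verbose.match('3.14 rad').group(0)

# Inside a character class, or after a backslash, '#' and whitespace are kept.
tagged = re.compile(r"""
    [#]   # a literal '#', because it is inside a character class
    \     # a literal space, because it is escaped
    \w+   # the tag name
    """, re.VERBOSE)
tag_match = tagged.search('see # todo later').group(0)
```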

.. function:: search(pattern, string[, flags])

   Scan through *string* looking for a location where the regular expression
   *pattern* produces a match, and return a corresponding :class:`MatchObject`
   instance. Return ``None`` if no position in the string matches the pattern; note
   that this is different from finding a zero-length match at some point in the
   string.


.. function:: match(pattern, string[, flags])

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :class:`MatchObject` instance.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   .. note::

      If you want to locate a match anywhere in *string*, use :meth:`search` instead.

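The distinction between "no match" and "a zero-length match" can be seen in a
short sketch; the strings and variable names below are arbitrary.

```python
import re

# match() only tries the beginning of the string; search() scans forward.
at_start = re.match(r'c', 'abcdef')     # None: 'c' is not at index 0
anywhere = re.search(r'c', 'abcdef')    # a match object for index 2

# A zero-length match is still a match object, not None.
empty = re.match(r'x*', 'abc')
empty_text = empty.group(0)
```

Code that tests the result should therefore compare against ``None`` rather
than check whether the matched text is empty.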

.. function:: split(pattern, string[, maxsplit=0])

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored. This has been fixed in later releases.) ::

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']

   Note that *split* will never split a string on an empty pattern match.
   For example ::

      >>> re.split('x*', 'foo')
      ['foo']
      >>> re.split("(?m)^$", "foo\n\nbar\n")
      ['foo\n\nbar\n']

.. function:: findall(pattern, string[, flags])

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings. If one or more groups are present in the pattern, return a list of
   groups; this will be a list of tuples if the pattern has more than one group.
   Empty matches are included in the result unless they touch the beginning of
   another match.

   .. versionadded:: 1.5.2

   .. versionchanged:: 2.4
      Added the optional flags argument.

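How the shape of the result depends on the number of groups is shown in the
following sketch; the sample text and variable names are illustrative.

```python
import re

text = 'width=640, height=480'

# No groups: a list of the matched strings.
numbers = re.findall(r'\d+', text)
# One group: a list of that group's text.
keys = re.findall(r'(\w+)=\d+', text)
# Several groups: a list of tuples, one tuple per match.
pairs = re.findall(r'(\w+)=(\d+)', text)
```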

.. function:: finditer(pattern, string[, flags])

   Return an :term:`iterator` yielding :class:`MatchObject` instances over all
   non-overlapping matches for the RE *pattern* in *string*. Empty matches are
   included in the result unless they touch the beginning of another match.

   .. versionadded:: 2.2

   .. versionchanged:: 2.4
      Added the optional flags argument.

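Because each item is a match object, positions are available along with the
matched text. A small sketch, with an arbitrary sample string:

```python
import re

# Collect (text, start, end) for every run of digits in the string.
spans = [(m.group(0), m.start(), m.end())
         for m in re.finditer(r'\d+', 'a1bb22ccc333')]
```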

.. function:: sub(pattern, repl, string[, count])

   Return the string obtained by replacing the leftmost non-overlapping occurrences
   of *pattern* in *string* by the replacement *repl*. If the pattern isn't found,
   *string* is returned unchanged. *repl* can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage
   return, and so forth. Unknown escapes such as ``\j`` are left alone.
   Backreferences, such as ``\6``, are replaced with the substring matched by
   group 6 in the pattern. For example::

      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'

   If *repl* is a function, it is called for every non-overlapping occurrence of
   *pattern*. The function takes a single match object argument, and returns the
   replacement string. For example::

      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'

   The pattern may be a string or an RE object; if you need to specify regular
   expression flags, you must use an RE object, or use embedded modifiers in a
   pattern; for example, ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.

   The optional argument *count* is the maximum number of pattern occurrences to be
   replaced; *count* must be a non-negative integer. If omitted or zero, all
   occurrences will be replaced. Empty matches for the pattern are replaced only
   when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
   ``'-a-b-c-'``.

   In addition to character escapes and backreferences as described above,
   ``\g<name>`` will use the substring matched by the group named ``name``, as
   defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
   group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
   in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
   reference to group 20, not a reference to group 2 followed by the literal
   character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
   substring matched by the RE.

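The ``\g<...>`` forms and the *count* argument can be sketched as follows. The
date string, group names and variable names are invented for illustration.

```python
import re

# \g<name> inserts a named group; here it swaps the two fields.
swapped = re.sub(r'(?P<month>\d+)/(?P<day>\d+)', r'\g<day>/\g<month>', '12/25')

# \g<1>0 is group 1 followed by a literal '0'; \10 would mean group 10.
times_ten = re.sub(r'(\d+)', r'\g<1>0', 'x7 y42')

# \g<0> stands for the whole match.
quoted = re.sub(r'\w+', r'<\g<0>>', 'spam eggs')

# count limits the number of replacements.
first_only = re.sub(r'-', '+', 'a-b-c', count=1)
```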

.. function:: subn(pattern, repl, string[, count])

   Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
   number_of_subs_made)``.


.. function:: escape(string)

   Return *string* with all non-alphanumerics backslashed; this is useful if you
   want to match an arbitrary literal string that may have regular expression
   metacharacters in it.

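A short sketch of why escaping matters when a pattern is built from literal
text. The literal ``'a.b*c'`` and the variable names are made up; the example
deliberately avoids printing the escaped string itself and only shows what it
matches.

```python
import re

literal = 'a.b*c'

# Unescaped, '.' and '*' act as metacharacters and match too much.
loose = re.match(literal + r'\Z', 'aXbbbc') is not None

# Escaped, the pattern matches only the literal text itself.
strict = re.compile(re.escape(literal) + r'\Z')
matches_literal = strict.match('a.b*c') is not None
matches_other = strict.match('aXbbbc') is not None
```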

.. exception:: error

   Exception raised when a string passed to one of the functions here is not a
   valid regular expression (for example, it might contain unmatched parentheses)
   or when some other error occurs during compilation or matching. It is never an
   error if a string contains no match for a pattern.

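One possible way to use the exception is to validate a pattern before
relying on it. In this sketch ``is_valid_pattern`` is a hypothetical helper
written for the example, not part of the module.

```python
import re

def is_valid_pattern(pattern):
    """Return True if *pattern* compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

balanced = is_valid_pattern(r'(a|b)+')
unbalanced = is_valid_pattern(r'(a|b')

# A pattern that simply finds nothing is not an error; search() returns None.
no_match = re.search(r'z', 'abc')
```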

.. _re-objects:

Regular Expression Objects
--------------------------

Compiled regular expression objects support the following methods and
attributes:


657.. method:: RegexObject.match(string[, pos[, endpos]])
658
659 If zero or more characters at the beginning of *string* match this regular
660 expression, return a corresponding :class:`MatchObject` instance. Return
661 ``None`` if the string does not match the pattern; note that this is different
662 from a zero-length match.
663
664 .. note::
665
666 If you want to locate a match anywhere in *string*, use :meth:`search` instead.
667
668 The optional second parameter *pos* gives an index in the string where the
669 search is to start; it defaults to ``0``. This is not completely equivalent to
670 slicing the string; the ``'^'`` pattern character matches at the real beginning
671 of the string and at positions just after a newline, but not necessarily at the
672 index where the search is to start.
673
   The optional parameter *endpos* limits how far the string will be searched; it
   will be as if the string is *endpos* characters long, so only the characters
   from *pos* to ``endpos - 1`` will be searched for a match.  If *endpos* is less
   than *pos*, no match will be found.  Otherwise, if *rx* is a compiled regular
   expression object, ``rx.match(string, 0, 50)`` is equivalent to
   ``rx.match(string[:50], 0)``.
680
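The difference between *pos* and slicing can be sketched as follows, assuming patterns compiled without the ``MULTILINE`` flag. The names ``rx`` and ``digits`` and the sample strings are illustrative only.

```python
import re

rx = re.compile(r"^b")
s = "ab"

# Slicing creates a new string whose real beginning is 'b'.
print(rx.match(s[1:]) is not None)    # True
# With pos=1, '^' still refers to the real beginning of s (index 0).
print(rx.match(s, 1))                 # None

# endpos makes the string behave as if it were endpos characters long.
digits = re.compile(r"\d+")
m = digits.match("12345", 1, 3)
print(m.group(0))                     # 23
print(digits.match("12345", 3, 1))    # None, because endpos < pos
```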
681
682.. method:: RegexObject.search(string[, pos[, endpos]])
683
684 Scan through *string* looking for a location where this regular expression
685 produces a match, and return a corresponding :class:`MatchObject` instance.
686 Return ``None`` if no position in the string matches the pattern; note that this
687 is different from finding a zero-length match at some point in the string.
688
689 The optional *pos* and *endpos* parameters have the same meaning as for the
690 :meth:`match` method.
691
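To make the distinction between ``None`` and a zero-length match concrete, here is a short sketch; the patterns and the string ``'hotdog'`` are arbitrary examples.

```python
import re

# '\d*' can match the empty string, so search() succeeds immediately
# with a zero-length match instead of returning None.
empty = re.compile(r"\d*").search("abc")
print(empty is not None)               # True
print(empty.span())                    # (0, 0)

word = re.compile(r"dog")
print(word.match("hotdog"))            # None: 'dog' is not at the beginning
print(word.search("hotdog").span())    # (3, 6)
print(word.search("hotdog", 4))        # None: scanning starts at index 4
```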
692
693.. method:: RegexObject.split(string[, maxsplit=0])
694
695 Identical to the :func:`split` function, using the compiled pattern.
696
697
698.. method:: RegexObject.findall(string[, pos[, endpos]])
699
700 Identical to the :func:`findall` function, using the compiled pattern.
701
702
703.. method:: RegexObject.finditer(string[, pos[, endpos]])
704
705 Identical to the :func:`finditer` function, using the compiled pattern.
706
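Unlike the module-level functions, these methods accept the optional *pos* and *endpos* arguments shown in their signatures. A minimal sketch of their effect, using an arbitrary sample string:

```python
import re

digit = re.compile(r"\d")
s = "a1b2c3"

print(digit.findall(s))          # ['1', '2', '3']
print(digit.findall(s, 2))       # ['2', '3']: scanning starts at index 2
print(digit.findall(s, 0, 3))    # ['1']: only s[0:3] is examined

# finditer() yields match objects, so positions are available too.
print([m.start() for m in digit.finditer(s, 2)])    # [3, 5]
```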
707
708.. method:: RegexObject.sub(repl, string[, count=0])
709
710 Identical to the :func:`sub` function, using the compiled pattern.
711
712
713.. method:: RegexObject.subn(repl, string[, count=0])
714
715 Identical to the :func:`subn` function, using the compiled pattern.
716
717
718.. attribute:: RegexObject.flags
719
720 The flags argument used when the RE object was compiled, or ``0`` if no flags
721 were provided.
722
723
724.. attribute:: RegexObject.groupindex
725
726 A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
727 numbers. The dictionary is empty if no symbolic groups were used in the
728 pattern.
729
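For instance, a pattern with two named groups and one unnamed group might be inspected like this. The date pattern is only an illustration.

```python
import re

rx = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

# Only the named groups appear; the third, unnamed group is omitted.
print(sorted(rx.groupindex.items()))    # [('month', 2), ('year', 1)]

# A pattern without symbolic groups has an empty mapping.
print(len(re.compile(r"\d+").groupindex))    # 0
```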
730
731.. attribute:: RegexObject.pattern
732
733 The pattern string from which the RE object was compiled.
734
735
736.. _match-objects:
737
738Match Objects
739-------------
740
Match objects always have a boolean value of :const:`True`, so you can test
whether a call such as :func:`match` produced a match with a simple ``if``
statement.  They support the following methods and attributes:
745
746.. method:: MatchObject.expand(template)
747
748 Return the string obtained by doing backslash substitution on the template
749 string *template*, as done by the :meth:`sub` method. Escapes such as ``\n`` are
750 converted to the appropriate characters, and numeric backreferences (``\1``,
751 ``\2``) and named backreferences (``\g<1>``, ``\g<name>``) are replaced by the
752 contents of the corresponding group.
753
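A short sketch of :meth:`expand` with numeric and named backreferences; the names and the template strings are invented for the example.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# \2 and \g<first> are replaced by the contents of those groups.
print(m.expand(r"\2, \g<first>"))                # Newton, Isaac

# The escape \n in the template becomes a real newline character.
print(m.expand(r"\1\n\2") == "Isaac\nNewton")    # True
```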
754
755.. method:: MatchObject.group([group1, ...])
756
   Return one or more subgroups of the match. If there is a single argument, the
758 result is a single string; if there are multiple arguments, the result is a
759 tuple with one item per argument. Without arguments, *group1* defaults to zero
760 (the whole match is returned). If a *groupN* argument is zero, the corresponding
761 return value is the entire matching string; if it is in the inclusive range
762 [1..99], it is the string matching the corresponding parenthesized group. If a
763 group number is negative or larger than the number of groups defined in the
764 pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
765 part of the pattern that did not match, the corresponding result is ``None``.
766 If a group is contained in a part of the pattern that matched multiple times,
767 the last match is returned.
768
769 If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
770 arguments may also be strings identifying groups by their group name. If a
771 string argument is not used as a group name in the pattern, an :exc:`IndexError`
772 exception is raised.
773
774 A moderately complicated example::
775
776 m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
777
778 After performing this match, ``m.group(1)`` is ``'3'``, as is
779 ``m.group('int')``, and ``m.group(2)`` is ``'14'``.
780
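The remaining cases described above (several arguments, repeated groups, groups that did not match, and invalid group numbers) can be sketched as follows, using made-up sample strings.

```python
import re

m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
print(m.group(0))        # Isaac Newton
print(m.group(1, 2))     # ('Isaac', 'Newton')

# A group that matched several times reports only its last match.
repeated = re.match(r"(..)+", "a1b2c3")
print(repeated.group(1))     # c3

# A group in a part of the pattern that did not match yields None.
either = re.match(r"(a)|(b)", "a")
print(either.group(2))       # None

# The pattern defines only two groups, so group 3 is an error.
try:
    either.group(3)
except IndexError:
    print("no such group")
```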
781
782.. method:: MatchObject.groups([default])
783
784 Return a tuple containing all the subgroups of the match, from 1 up to however
785 many groups are in the pattern. The *default* argument is used for groups that
786 did not participate in the match; it defaults to ``None``. (Incompatibility
787 note: in the original Python 1.5 release, if the tuple was one element long, a
788 string would be returned instead. In later versions (from 1.5.1 on), a
789 singleton tuple is returned in such cases.)
790
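As an illustration of the *default* argument, consider a pattern whose second group is optional. The pattern below is just one possible example.

```python
import re

# The fractional part is optional, so group 2 may not participate.
m = re.match(r"(\d+)\.?(\d+)?", "24")

print(m.groups())        # ('24', None)
print(m.groups("0"))     # ('24', '0')
```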
791
792.. method:: MatchObject.groupdict([default])
793
794 Return a dictionary containing all the *named* subgroups of the match, keyed by
795 the subgroup name. The *default* argument is used for groups that did not
796 participate in the match; it defaults to ``None``.
797
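A minimal sketch, assuming a hypothetical ``key=value`` syntax in which the value part is optional:

```python
import re

m = re.match(r"(?P<key>\w+)(?:=(?P<value>\w+))?", "debug")

# 'value' did not participate in the match, so it takes the default.
print(m.groupdict() == {"key": "debug", "value": None})    # True
print(m.groupdict("") == {"key": "debug", "value": ""})    # True
```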
798
799.. method:: MatchObject.start([group])
800 MatchObject.end([group])
801
802 Return the indices of the start and end of the substring matched by *group*;
803 *group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
804 *group* exists but did not contribute to the match. For a match object *m*, and
805 a group *g* that did contribute to the match, the substring matched by group *g*
806 (equivalent to ``m.group(g)``) is ::
807
808 m.string[m.start(g):m.end(g)]
809
810 Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
811 null string. For example, after ``m = re.search('b(c?)', 'cba')``,
812 ``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
813 2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
814
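One common use of these indices is cutting the matched text out of a string. The sketch below uses an invented address and also shows the ``-1`` result for a group that did not contribute to the match.

```python
import re

# Cut the matched text out of a string using start() and end().
email = "tony@tiremove_thisger.net"
m = re.search("remove_this", email)
print(email[:m.start()] + email[m.end():])    # tony@tiger.net

# A group that exists but did not contribute to the match reports -1.
m2 = re.match(r"(a)|(b)", "a")
print(m2.start(2))    # -1
print(m2.end(2))      # -1
```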
815
816.. method:: MatchObject.span([group])
817
818 For :class:`MatchObject` *m*, return the 2-tuple ``(m.start(group),
819 m.end(group))``. Note that if *group* did not contribute to the match, this is
820 ``(-1, -1)``. Again, *group* defaults to zero.
821
822
823.. attribute:: MatchObject.pos
824
   The value of *pos* which was passed to the :meth:`search` or :meth:`match`
   method of the :class:`RegexObject`.  This is the index into the string at which
   the RE engine started looking for a match.
828
829
830.. attribute:: MatchObject.endpos
831
   The value of *endpos* which was passed to the :meth:`search` or :meth:`match`
   method of the :class:`RegexObject`.  This is the index into the string beyond
   which the RE engine will not go.
835
836
837.. attribute:: MatchObject.lastindex
838
839 The integer index of the last matched capturing group, or ``None`` if no group
840 was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
841 ``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
842 the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
843 string.
844
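The four expressions mentioned above can be checked directly. This sketch simply collects ``lastindex`` for each of them; the variable names are arbitrary.

```python
import re

patterns = (r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)")
results = [re.match(p, "ab").lastindex for p in patterns]
print(results)    # [1, 1, 1, 2]

# A pattern without capturing groups leaves lastindex as None.
print(re.match(r"ab", "ab").lastindex)    # None
```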
845
846.. attribute:: MatchObject.lastgroup
847
848 The name of the last matched capturing group, or ``None`` if the group didn't
849 have a name, or if no group was matched at all.
850
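One possible use of ``lastgroup`` is labelling tokens when each alternative of a pattern is a named group. The token grammar below is a hypothetical example, not something defined by this module.

```python
import re

# Hypothetical token grammar; each alternative is a named group.
token_re = re.compile(r"(?P<number>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[-+*/=])")

tokens = [(m.lastgroup, m.group(0)) for m in token_re.finditer("x = 42 + y1")]
print(tokens)
# [('name', 'x'), ('op', '='), ('number', '42'), ('op', '+'), ('name', 'y1')]

# The last matched group here is unnamed, so lastgroup is None.
print(re.match(r"(?P<word>\w+) (\d+)", "item 7").lastgroup)    # None
```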
851
852.. attribute:: MatchObject.re
853
854 The regular expression object whose :meth:`match` or :meth:`search` method
855 produced this :class:`MatchObject` instance.
856
857
858.. attribute:: MatchObject.string
859
860 The string passed to :func:`match` or :func:`search`.
861
862
863Examples
864--------
865
866**Simulating scanf()**
867
868.. index:: single: scanf()
869
870Python does not currently have an equivalent to :cfunc:`scanf`. Regular
871expressions are generally more powerful, though also more verbose, than
872:cfunc:`scanf` format strings. The table below offers some more-or-less
873equivalent mappings between :cfunc:`scanf` format tokens and regular
874expressions.
875
+--------------------------------+---------------------------------------------+
| :cfunc:`scanf` Token           | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+
897
898To extract the filename and numbers from a string like ::
899
900 /usr/sbin/sendmail - 0 errors, 4 warnings
901
902you would use a :cfunc:`scanf` format like ::
903
904 %s - %d errors, %d warnings
905
906The equivalent regular expression would be ::
907
908 (\S+) - (\d+) errors, (\d+) warnings
909
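Applied to the sample line above, the expression might be used as in the following sketch. The variable names are illustrative.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Unlike scanf(), the groups are strings; convert the numbers explicitly.
filename = m.group(1)
n_errors, n_warnings = int(m.group(2)), int(m.group(3))

print(filename)      # /usr/sbin/sendmail
print(n_errors)      # 0
print(n_warnings)    # 4
```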
910**Avoiding recursion**
911
If you create regular expressions that require the engine to perform a lot of
recursion, you may encounter a :exc:`RuntimeError` exception with the message
``maximum recursion limit exceeded``.  For example, ::
915
   >>> import re
   >>> s = 'Begin ' + 1000*'a very long string ' + 'end'
   >>> re.match(r'Begin (\w| )*? end', s).end()
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
     File "/usr/local/lib/python2.5/re.py", line 132, in match
       return _compile(pattern, flags).match(string)
   RuntimeError: maximum recursion limit exceeded
924
925You can often restructure your regular expression to avoid recursion.
926
927Starting with Python 2.3, simple uses of the ``*?`` pattern are special-cased to
928avoid recursion. Thus, the above regular expression can avoid recursion by
929being recast as ``Begin [a-zA-Z0-9_ ]*?end``. As a further benefit, such
930regular expressions will run faster than their recursive equivalents.
931
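A quick check of the recast expression against the same long string, as a sketch rather than a benchmark:

```python
import re

s = 'Begin ' + 1000*'a very long string ' + 'end'

# The character class replaces the (\w| ) alternation, so the
# non-greedy repeat needs no recursion per character.
m = re.match(r'Begin [a-zA-Z0-9_ ]*?end', s)
print(m.end() == len(s))    # True
```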