Blame - Doc/library/re.rst - platform/external/python/cpython2

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1
				2	:mod:`re` --- Regular expression operations
				3	===========================================
				4
				5	.. module:: re
				6	:synopsis: Regular expression operations.
				7	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				8	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				9
				10
				11
				12
				13	This module provides regular expression matching operations similar to
				14	those found in Perl. Both patterns and strings to be searched can be
				15	Unicode strings as well as 8-bit strings. The :mod:`re` module is
				16	always available.
				17
				18	Regular expressions use the backslash character (``'\'``) to indicate
				19	special forms or to allow special characters to be used without invoking
				20	their special meaning. This collides with Python's usage of the same
				21	character for the same purpose in string literals; for example, to match
				22	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				23	string, because the regular expression must be ``\\``, and each
				24	backslash must be expressed as ``\\`` inside a regular Python string
				25	literal.
				26
				27	The solution is to use Python's raw string notation for regular expression
				28	patterns; backslashes are not handled in any special way in a string literal
				29	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				30	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	31	newline. Usually patterns will be expressed in Python code using this raw
				32	string notation.
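
The difference between the two notations can be checked directly. The
following is a minimal sketch; the variable names and the sample path are
illustrative only::

   import re

   # A raw string keeps the backslash; a normal string interprets the escape.
   raw_newline = r"\n"       # two characters: a backslash and 'n'
   real_newline = "\n"       # one character: a newline

   # Both spellings give the same two-character pattern, which matches
   # one literal backslash.
   escaped_pattern = "\\\\"
   raw_pattern = r"\\"

   text = "C:\\temp"         # the string C:\temp, containing one backslash
   match = re.search(raw_pattern, text)

Here ``match.group(0)`` is the single backslash found at index 2 of ``text``.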
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	33
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	34	It is important to note that most regular expression operations are available as
				35	module-level functions and :class:`RegexObject` methods. The functions are
				36	shortcuts that don't require you to compile a regex object first, but miss some
				37	fine-tuning parameters.
				38
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	39	.. seealso::
				40
				41	Mastering Regular Expressions
				42	Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	43	second edition of the book no longer covers Python at all, but the first
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	44	edition covered writing good regular expression patterns in great detail.
				45
				46
				47	.. _re-syntax:
				48
				49	Regular Expression Syntax
				50	-------------------------
				51
				52	A regular expression (or RE) specifies a set of strings that matches it; the
				53	functions in this module let you check if a particular string matches a given
				54	regular expression (or if a given regular expression matches a particular
				55	string, which comes down to the same thing).
				56
				57	Regular expressions can be concatenated to form new regular expressions; if A
				58	and B are both regular expressions, then AB is also a regular expression.
				59	In general, if a string p matches A and another string q matches B, the
				60	string pq will match AB. This holds unless A or B contains low-precedence
				61	operations or numbered group references, or there are boundary conditions
				62	between A and B. Thus, complex expressions can easily be constructed from simpler
				63	primitive expressions like the ones described here. For details of the theory
				64	and implementation of regular expressions, consult the Friedl book referenced
				65	above, or almost any textbook about compiler construction.
				66
				67	A brief explanation of the format of regular expressions follows. For further
Christian Heimes	2202f87	2008-02-06 14:31:34 +0000	[diff] [blame^]	68	information and a gentler presentation, consult the :ref:`regex-howto`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	69
				70	Regular expressions can contain both special and ordinary characters. Most
				71	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
				72	expressions; they simply match themselves. You can concatenate ordinary
				73	characters, so ``last`` matches the string ``'last'``. (In the rest of this
				74	section, we'll write REs in ``this special style``, usually without quotes, and
				75	strings to be matched ``'in single quotes'``.)
				76
				77	Some characters, like ``'|'`` or ``'('``, are special. Special
				78	characters either stand for classes of ordinary characters, or affect
				79	how the regular expressions around them are interpreted. Regular
				80	expression pattern strings may not contain null bytes, but can specify
				81	the null byte using the ``\number`` notation, e.g., ``'\x00'``.
				82
				83
				84	The special characters are:
				85
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	86	``'.'``
				87	(Dot.) In the default mode, this matches any character except a newline. If
				88	the :const:`DOTALL` flag has been specified, this matches any character
				89	including a newline.
				90
				91	``'^'``
				92	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				93	matches immediately after each newline.
				94
				95	``'$'``
				96	Matches the end of the string or just before the newline at the end of the
				97	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				98	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				99	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
Christian Heimes	25bb783	2008-01-11 16:17:00 +0000	[diff] [blame]	100	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
				101	a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
				102	the newline, and one at the end of the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	103
				104	``'*'``
				105	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				106	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				107	by any number of 'b's.
				108
				109	``'+'``
				110	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				111	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				112	match just 'a'.
				113
				114	``'?'``
				115	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				116	``ab?`` will match either 'a' or 'ab'.
				117
				118	``*?``, ``+?``, ``??``
				119	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				120	as much text as possible. Sometimes this behaviour isn't desired; if the RE
				121	``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
				122	string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
				123	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
				124	characters as possible will be matched. Using ``.*?`` in the previous
				125	expression will match only ``'<H1>'``.
				126
				127	``{m}``
				128	Specifies that exactly m copies of the previous RE should be matched; fewer
				129	matches cause the entire RE not to match. For example, ``a{6}`` will match
				130	exactly six ``'a'`` characters, but not five.
				131
				132	``{m,n}``
				133	Causes the resulting RE to match from m to n repetitions of the preceding
				134	RE, attempting to match as many repetitions as possible. For example,
				135	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				136	lower bound of zero, and omitting n specifies an infinite upper bound. As an
				137	example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
				138	followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
				139	modifier would be confused with the previously described form.
				140
				141	``{m,n}?``
				142	Causes the resulting RE to match from m to n repetitions of the preceding
				143	RE, attempting to match as few repetitions as possible. This is the
				144	non-greedy version of the previous qualifier. For example, on the
				145	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				146	while ``a{3,5}?`` will only match 3 characters.
				147
				148	``'\'``
				149	Either escapes special characters (permitting you to match characters like
				150	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				151	sequences are discussed below.
				152
				153	If you're not using a raw string to express the pattern, remember that Python
				154	also uses the backslash as an escape sequence in string literals; if the escape
				155	sequence isn't recognized by Python's parser, the backslash and subsequent
				156	character are included in the resulting string. However, if Python would
				157	recognize the resulting sequence, the backslash should be repeated twice. This
				158	is complicated and hard to understand, so it's highly recommended that you use
				159	raw strings for all but the simplest expressions.
				160
				161	``[]``
				162	Used to indicate a set of characters. Characters can be listed individually, or
				163	a range of characters can be indicated by giving two characters and separating
				164	them by a ``'-'``. Special characters are not active inside sets. For example,
				165	``[akm$]`` will match any of the characters ``'a'``, ``'k'``,
				166	``'m'``, or ``'$'``; ``[a-z]`` will match any lowercase letter, and
				167	``[a-zA-Z0-9]`` matches any letter or digit. Character classes such
				168	as ``\w`` or ``\S`` (defined below) are also acceptable inside a
				169	range, although the characters they match depend on whether :const:`LOCALE`
				170	or :const:`UNICODE` mode is in force. If you want to include a
				171	``']'`` or a ``'-'`` inside a set, precede it with a backslash, or
				172	place it as the first character. The pattern ``[]]`` will match
				173	``']'``, for example.
				174
				175	You can match the characters not within a range by :dfn:`complementing` the set.
				176	This is indicated by including a ``'^'`` as the first character of the set;
				177	``'^'`` elsewhere will simply match the ``'^'`` character. For example,
				178	``[^5]`` will match any character except ``'5'``, and ``[^^]`` will match any
				179	character except ``'^'``.
				180
				181	``'|'``
				182	``A|B``, where A and B can be arbitrary REs, creates a regular expression that
				183	will match either A or B. An arbitrary number of REs can be separated by the
				184	``'|'`` in this way. This can be used inside groups (see below) as well. As
				185	the target string is scanned, REs separated by ``'|'`` are tried from left to
				186	right. When one pattern completely matches, that branch is accepted. This means
				187	that once ``A`` matches, ``B`` will not be tested further, even if it would
				188	produce a longer overall match. In other words, the ``'|'`` operator is never
				189	greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
				190	character class, as in ``[|]``.
				191
				192	``(...)``
				193	Matches whatever regular expression is inside the parentheses, and indicates the
				194	start and end of a group; the contents of a group can be retrieved after a match
				195	has been performed, and can be matched later in the string with the ``\number``
				196	special sequence, described below. To match the literals ``'('`` or ``')'``,
				197	use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
				198
				199	``(?...)``
				200	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				201	otherwise). The first character after the ``'?'`` determines what the meaning
				202	and further syntax of the construct is. Extensions usually do not create a new
				203	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				204	currently supported extensions.
				205
				206	``(?iLmsux)``
				207	(One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
				208	``'u'``, ``'x'``.) The group matches the empty string; the letters
				209	set the corresponding flags: :const:`re.I` (ignore case),
				210	:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
				211	:const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
				212	and :const:`re.X` (verbose), for the entire regular expression. (The
				213	flags are described in :ref:`contents-of-module-re`.) This
				214	is useful if you wish to include the flags as part of the regular
				215	expression, instead of passing a flag argument to the
				216	:func:`compile` function.
				217
				218	Note that the ``(?x)`` flag changes how the expression is parsed. It should be
				219	used first in the expression string, or after one or more whitespace characters.
				220	If there are non-whitespace characters before the flag, the results are
				221	undefined.
				222
				223	``(?:...)``
				224	A non-grouping version of regular parentheses. Matches whatever regular
				225	expression is inside the parentheses, but the substring matched by the group
				226	cannot be retrieved after performing a match or referenced later in the
				227	pattern.
				228
				229	``(?P<name>...)``
				230	Similar to regular parentheses, but the substring matched by the group is
				231	accessible via the symbolic group name *name*. Group names must be valid Python
				232	identifiers, and each group name must be defined only once within a regular
				233	expression. A symbolic group is also a numbered group, just as if the group
				234	were not named. So the group named 'id' in the example below can also be
				235	referenced as the numbered group 1.
				236
				237	For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
				238	referenced by its name in arguments to methods of match objects, such as
				239	``m.group('id')`` or ``m.end('id')``, and also by name in pattern text (for
				240	example, ``(?P=id)``) and replacement text (such as ``\g<id>``).
				241
				242	``(?P=name)``
				243	Matches whatever text was matched by the earlier group named *name*.
				244
				245	``(?#...)``
				246	A comment; the contents of the parentheses are simply ignored.
				247
				248	``(?=...)``
				249	Matches if ``...`` matches next, but doesn't consume any of the string. This is
				250	called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
				251	``'Isaac '`` only if it's followed by ``'Asimov'``.
				252
				253	``(?!...)``
				254	Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
				255	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				256	followed by ``'Asimov'``.
				257
				258	``(?<=...)``
				259	Matches if the current position in the string is preceded by a match for ``...``
				260	that ends at the current position. This is called a :dfn:`positive lookbehind
				261	assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
				262	lookbehind will back up 3 characters and check if the contained pattern matches.
				263	The contained pattern must only match strings of some fixed length, meaning that
				264	``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
				265	patterns which start with positive lookbehind assertions will never match at the
				266	beginning of the string being searched; you will most likely want to use the
				267	:func:`search` function rather than the :func:`match` function::
				268
				269	>>> import re
				270	>>> m = re.search('(?<=abc)def', 'abcdef')
				271	>>> m.group(0)
				272	'def'
				273
				274	This example looks for a word following a hyphen::
				275
				276	>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
				277	>>> m.group(0)
				278	'egg'
				279
				280	``(?<!...)``
				281	Matches if the current position in the string is not preceded by a match for
				282	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				283	positive lookbehind assertions, the contained pattern must only match strings of
				284	some fixed length. Patterns which start with negative lookbehind assertions may
				285	match at the beginning of the string being searched.
				286
				287	``(?(id/name)yes-pattern|no-pattern)``
				288	Will try to match with ``yes-pattern`` if the group with given id or name
				289	exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
				290	can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
				291	matching pattern, which will match with ``'<user@host.com>'`` as well as
				292	``'user@host.com'``, but not with ``'<user@host.com'``.
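
Several of the constructs described above can be tried side by side. The
following is a minimal sketch that reuses the patterns from the text; the
variable names and the input ``'spam_1 = 2'`` are illustrative only::

   import re

   html = "<H1>title</H1>"

   # Greedy: .* consumes as much as possible, so the whole string is matched.
   greedy = re.match(r"<.*>", html).group(0)
   # Non-greedy: .*? stops at the first '>' that lets the match succeed.
   lazy = re.match(r"<.*?>", html).group(0)

   # {m,n} versus {m,n}? on a six-character string.
   most = re.match(r"a{3,5}", "aaaaaa").group(0)
   fewest = re.match(r"a{3,5}?", "aaaaaa").group(0)

   # A named group is also reachable by its number.
   m = re.match(r"(?P<id>[a-zA-Z_]\w*)", "spam_1 = 2")
   by_name, by_number = m.group("id"), m.group(1)

   # Lookahead assertions check what follows without consuming it.
   followed = re.match(r"Isaac (?=Asimov)", "Isaac Asimov")
   not_followed = re.match(r"Isaac (?!Asimov)", "Isaac Asimov")

   # Conditional pattern: '>' is required only if group 1 matched '<'.
   email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)")
   bracketed = email.match("<user@host.com>")
   bare = email.match("user@host.com")
   unbalanced = email.match("<user@host.com")

Note that ``followed.group(0)`` is ``'Isaac '``: the lookahead text is checked
but is not part of the match. ``unbalanced`` is ``None`` because
:func:`match` only tries the start of the string.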
				293
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	294
				295	The special sequences consist of ``'\'`` and a character from the list below.
				296	If the ordinary character is not on the list, then the resulting RE will match
				297	the second character. For example, ``\$`` matches the character ``'$'``.
				298
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	299	``\number``
				300	Matches the contents of the group of the same number. Groups are numbered
				301	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
				302	but not ``'the end'`` (note the space after the group). This special sequence
				303	can only be used to match one of the first 99 groups. If the first digit of
				304	number is 0, or number is 3 octal digits long, it will not be interpreted as
				305	a group match, but as the character with octal value number. Inside the
				306	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				307	characters.
				308
				309	``\A``
				310	Matches only at the start of the string.
				311
				312	``\b``
				313	Matches the empty string, but only at the beginning or end of a word. A word is
				314	defined as a sequence of alphanumeric or underscore characters, so the end of a
				315	word is indicated by whitespace or a non-alphanumeric, non-underscore character.
				316	Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
				317	precise set of characters deemed to be alphanumeric depends on the values of the
				318	``UNICODE`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
				319	the backspace character, for compatibility with Python's string literals.
				320
				321	``\B``
				322	Matches the empty string, but only when it is not at the beginning or end of a
				323	word. This is just the opposite of ``\b``, so is also subject to the settings
				324	of ``LOCALE`` and ``UNICODE``.
				325
				326	``\d``
				327	When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
				328	is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
				329	whatever is classified as a digit in the Unicode character properties database.
				330
				331	``\D``
				332	When the :const:`UNICODE` flag is not specified, matches any non-digit
				333	character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
				334	will match anything other than characters marked as digits in the Unicode
				335	character properties database.
				336
				337	``\s``
				338	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				339	any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
				340	:const:`LOCALE`, it will match this set plus whatever characters are defined as
				341	space for the current locale. If :const:`UNICODE` is set, this will match the
				342	characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
				343	character properties database.
				344
				345	``\S``
				346	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				347	any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``.
				348	With :const:`LOCALE`, it will match any character not in this set, and not
				349	defined as space in the current locale. If :const:`UNICODE` is set, this will
				350	match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
				351	the Unicode character properties database.
				352
				353	``\w``
				354	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				355	any alphanumeric character and the underscore; this is equivalent to the set
				356	``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
				357	whatever characters are defined as alphanumeric for the current locale. If
				358	:const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
				359	is classified as alphanumeric in the Unicode character properties database.
				360
				361	``\W``
				362	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				363	any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
				364	With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
				365	not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
				366	this will match anything other than ``[0-9_]`` and characters marked as
				367	alphanumeric in the Unicode character properties database.
				368
				369	``\Z``
				370	Matches only at the end of the string.
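
The special sequences can be exercised as follows. This is a minimal sketch on
plain ASCII input, so the results do not depend on the :const:`LOCALE` or
:const:`UNICODE` settings; all sample strings and variable names are
illustrative only::

   import re

   # \1 must repeat exactly what group 1 matched.
   repeated = re.match(r"(.+) \1", "the the")
   different = re.match(r"(.+) \1", "the end")

   # \b matches the empty string at a word boundary, so the 'cat' inside
   # 'concatenate' is skipped.
   words = re.findall(r"\bcat\b", "cat concatenate cat.")

   # \d, \s and \w applied to plain ASCII input.
   digits = re.findall(r"\d+", "room 101, floor 3")
   fields = re.split(r"\s+", "a  b\tc\nd")
   identifiers = re.findall(r"\w+", "spam_1 + eggs2")

   # \A anchors to the very start of the string, even in MULTILINE mode.
   only_first = re.findall(r"\Afoo", "foo\nfoo", re.MULTILINE)

Here ``different`` is ``None``, while ``only_first`` contains a single
``'foo'`` even though the string has two lines starting with it.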
				371
				372	Most of the standard escapes supported by Python string literals are also
				373	accepted by the regular expression parser::
				374
				375	\a \b \f \n
				376	\r \t \v \x
				377	\\
				378
				379	Octal escapes are included in a limited form: If the first digit is a 0, or if
				380	there are three octal digits, it is considered an octal escape. Otherwise, it is
				381	a group reference. As for string literals, octal escapes are always at most
				382	three digits in length.
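
The rule for telling octal escapes from group references can be illustrated
as follows. This is a minimal sketch with made-up inputs::

   import re

   # Three octal digits: \101 is the character with octal value 101, 'A'.
   octal_three = re.match(r"\101", "A")
   # A leading zero also forces an octal escape: \0 is the null character.
   octal_zero = re.match(r"\0", "\x00")
   # One or two digits not starting with 0: a reference to that group.
   group_ref = re.match(r"(ab)\1", "abab")
   # Escapes such as \t and \x41 are handled by the RE parser itself,
   # even though the raw string passes the backslashes through untouched.
   tab_then_a = re.match(r"\t\x41", "\tA")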
				383
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	384
				385	.. _matching-searching:
				386
				387	Matching vs Searching
				388	---------------------
				389
				390	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
				391
				392
				393	Python offers two different primitive operations based on regular expressions:
Guido van Rossum	04110fb	2007-08-24 16:32:05 +0000	[diff] [blame]	394	match checks for a match only at the beginning of the string, while
				395	search checks for a match anywhere in the string (this is what Perl does
				396	by default).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	397
Guido van Rossum	04110fb	2007-08-24 16:32:05 +0000	[diff] [blame]	398	Note that match may differ from search even when using a regular expression
				399	beginning with ``'^'``: ``'^'`` matches only at the start of the string, or in
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	400	:const:`MULTILINE` mode also immediately following a newline. The "match"
				401	operation succeeds only if the pattern matches at the start of the string
				402	regardless of mode, or at the starting position given by the optional pos
Christian Heimes	5b5e81c	2007-12-31 16:14:33 +0000	[diff] [blame]	403	argument regardless of whether a newline precedes it. ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	404
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	405	>>> re.match("c", "abcdef") # No match
				406	>>> re.search("c", "abcdef")
Christian Heimes	5b5e81c	2007-12-31 16:14:33 +0000	[diff] [blame]	407	<_sre.SRE_Match object at 0x827e9c0> # Match
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	408
				409
				410	.. _contents-of-module-re:
				411
				412	Module Contents
				413	---------------
				414
				415	The module defines several functions, constants, and an exception. Some of the
				416	functions are simplified versions of the full featured methods for compiled
				417	regular expressions. Most non-trivial applications use the compiled
				418	form.
				419
				420
				421	.. function:: compile(pattern[, flags])
				422
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	423	Compile a regular expression pattern into a regular expression object, which
				424	can be used for matching using its :func:`match` and :func:`search` methods,
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	425	described below.
				426
				427	The expression's behaviour can be modified by specifying a flags value.
				428	Values can be any of the following variables, combined using bitwise OR (the
				429	``|`` operator).
				430
				431	The sequence ::
				432
				433	prog = re.compile(pat)
				434	result = prog.match(str)
				435
				436	is equivalent to ::
				437
				438	result = re.match(pat, str)
				439
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	440	but the version using :func:`compile` is more efficient when the expression
				441	will be used several times in a single program.
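
As an illustration, a pattern can be compiled once with several flags and then
reused. This is a minimal sketch; the log text and the variable names are
invented for the example::

   import re

   # Compile once, combining two flags with bitwise OR.
   word_at_line_start = re.compile(r"^error\b", re.IGNORECASE | re.MULTILINE)

   log = "Error: disk full\nwarning: low memory\nERROR: retrying"

   # Reuse the compiled object for several operations.
   first = word_at_line_start.search(log)
   all_hits = word_at_line_start.findall(log)

   # The module-level shortcut gives the same result for a one-off search.
   shortcut = re.search(r"^error\b", log, re.IGNORECASE | re.MULTILINE)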
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	442
Christian Heimes	5b5e81c	2007-12-31 16:14:33 +0000	[diff] [blame]	443	.. (The compiled version of the last pattern passed to :func:`re.match` or
				444	:func:`re.search` is cached, so programs that use only a single regular
				445	expression at a time needn't worry about compiling regular expressions.)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	446
				447
				448	.. data:: I
				449	IGNORECASE
				450
				451	Perform case-insensitive matching; expressions like ``[A-Z]`` will match
				452	lowercase letters, too. This is not affected by the current locale.
				453
				454
				455	.. data:: L
				456	LOCALE
				457
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	458	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
				459	current locale.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	460
				461
				462	.. data:: M
				463	MULTILINE
				464
				465	When specified, the pattern character ``'^'`` matches at the beginning of the
				466	string and at the beginning of each line (immediately following each newline);
				467	and the pattern character ``'$'`` matches at the end of the string and at the
				468	end of each line (immediately preceding each newline). By default, ``'^'``
				469	matches only at the beginning of the string, and ``'$'`` only at the end of the
				470	string and immediately before the newline (if any) at the end of the string.
				471
				472
				473	.. data:: S
				474	DOTALL
				475
				476	Make the ``'.'`` special character match any character at all, including a
				477	newline; without this flag, ``'.'`` will match anything except a newline.
				478
				479
				480	.. data:: U
				481	UNICODE
				482
				483	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
				484	on the Unicode character properties database.
				485
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	486
				487	.. data:: X
				488	VERBOSE
				489
				490	This flag allows you to write regular expressions that look nicer. Whitespace
				491	within the pattern is ignored, except when in a character class or preceded by
				492	an unescaped backslash, and, when a line contains a ``'#'`` neither in a
				493	character class nor preceded by an unescaped backslash, all characters from the
				494	leftmost such ``'#'`` through the end of the line are ignored.
				495
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	496	That means that the two following regular expression objects that match a
				497	decimal number are functionally equal::
Georg Brandl	81ac1ce	2007-08-31 17:17:17 +0000	[diff] [blame]	498
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	499	a = re.compile(r"""\d + # the integral part
				500	\. # the decimal point
				501	\d * # some fractional digits""", re.X)
				502	b = re.compile(r"\d+\.\d*")
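
One way to convince yourself that the two objects behave alike is to run both
over the same inputs. The following is a minimal sketch; the sample strings
are arbitrary, and the last two lines additionally contrast the default
behaviour of ``'.'`` with :const:`DOTALL`::

   import re

   # The verbose and compact spellings of the decimal-number pattern.
   a = re.compile(r"""\d +  # the integral part
                      \.    # the decimal point
                      \d *  # some fractional digits""", re.X)
   b = re.compile(r"\d+\.\d*")

   samples = ["3.14", "10.", "abc", ".5"]
   verbose_results = [bool(a.match(s)) for s in samples]
   compact_results = [bool(b.match(s)) for s in samples]

   # DOTALL lets '.' cross a newline; without it the match stops there.
   default_dot = re.match(r".*", "one\ntwo").group(0)
   dotall_dot = re.match(r".*", "one\ntwo", re.DOTALL).group(0)

Both result lists are identical: the first two samples match and the last two
do not.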
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	503
				504
				505	.. function:: search(pattern, string[, flags])
				506
				507	Scan through string looking for a location where the regular expression
				508	pattern produces a match, and return a corresponding :class:`MatchObject`
				509	instance. Return ``None`` if no position in the string matches the pattern; note
				510	that this is different from finding a zero-length match at some point in the
				511	string.
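
The distinction between ``None`` and a zero-length match can be seen in a
minimal sketch such as this (the patterns are chosen only for illustration)::

   import re

   # 'x*' can match zero characters, so search() succeeds with an empty match.
   empty = re.search(r"x*", "abc")
   # 'x+' needs at least one 'x', so there is no match at all.
   missing = re.search(r"x+", "abc")

   # Test for failure with "is None"; an empty match is still a match object.
   empty_is_match = empty is not None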
				512
				513
				514	.. function:: match(pattern, string[, flags])
				515
				516	If zero or more characters at the beginning of string match the regular
				517	expression pattern, return a corresponding :class:`MatchObject` instance.
				518	Return ``None`` if the string does not match the pattern; note that this is
				519	different from a zero-length match.
				520
				521	.. note::
				522
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	523	If you want to locate a match anywhere in string, use :meth:`search`
				524	instead.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	525
				526
				527	.. function:: split(pattern, string[, maxsplit=0])
				528
				529	Split string by the occurrences of pattern. If capturing parentheses are
				530	used in pattern, then the text of all groups in the pattern are also returned
				531	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				532	splits occur, and the remainder of the string is returned as the final element
				533	of the list. (Incompatibility note: in the original Python 1.5 release,
				534	maxsplit was ignored. This has been fixed in later releases.) ::
				535
				536	>>> re.split(r'\W+', 'Words, words, words.')
				537	['Words', 'words', 'words', '']
				538	>>> re.split(r'(\W+)', 'Words, words, words.')
				539	['Words', ', ', 'words', ', ', 'words', '.', '']
				540	>>> re.split(r'\W+', 'Words, words, words.', 1)
				541	['Words', 'words, words.']
				542
Thomas Wouters	89d996e	2007-09-08 17:39:28 +0000	[diff] [blame]	543	Note that split will never split a string on an empty pattern match.
				544	For example ::
				545
				546	>>> re.split('x*', 'foo')
				547	['foo']
				548	>>> re.split("(?m)^$", "foo\n\nbar\n")
				549	['foo\n\nbar\n']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	550
				551	.. function:: findall(pattern, string[, flags])
				552
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	553	Return all non-overlapping matches of pattern in string, as a list of
				554	strings. If one or more groups are present in the pattern, return a list of
				555	groups; this will be a list of tuples if the pattern has more than one group.
				556	Empty matches are included in the result unless they touch the beginning of
				557	another match.
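
How the shape of the result depends on the number of groups can be sketched
as follows; the input string is made up for the example::

   import re

   text = "width=40, height=25"

   # No groups: a list of the full matches.
   no_groups = re.findall(r"\w+=\d+", text)
   # One group: a list of that group's text.
   one_group = re.findall(r"\w+=(\d+)", text)
   # Two groups: a list of tuples, one tuple per match.
   two_groups = re.findall(r"(\w+)=(\d+)", text)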
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	558
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	559
				560	.. function:: finditer(pattern, string[, flags])
				561
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	562	Return an :term:`iterator` yielding :class:`MatchObject` instances over all
				563	non-overlapping matches for the RE pattern in string. Empty matches are
				564	included in the result unless they touch the beginning of another match.
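
Because each item is a match object, positions are available as well as the
matched text. This is a minimal sketch with an arbitrary input string::

   import re

   text = "one two  three"

   # Collect each word together with its start and end offsets.
   spans = [(m.group(0), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]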
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	565
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	566
				567	.. function:: sub(pattern, repl, string[, count])
				568
				569	Return the string obtained by replacing the leftmost non-overlapping occurrences
				570	of pattern in string by the replacement repl. If the pattern isn't found,
				571	string is returned unchanged. repl can be a string or a function; if it is
   a string, any backslash escapes in it are processed. That is, ``\n`` is
   converted to a single newline character, ``\r`` is converted to a carriage
   return, and so forth. Unknown escapes such as ``\j`` are left alone.
   Backreferences, such as ``\6``, are replaced with the substring matched by
   group 6 in the pattern.
				576	For example::
				577
      >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
      ...        r'static PyObject*\npy_\1(void)\n{',
      ...        'def myfunc():')
      'static PyObject*\npy_myfunc(void)\n{'
				582
				583	If repl is a function, it is called for every non-overlapping occurrence of
				584	pattern. The function takes a single match object argument, and returns the
				585	replacement string. For example::
				586
      >>> def dashrepl(matchobj):
      ...     if matchobj.group(0) == '-': return ' '
      ...     else: return '-'
      >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
      'pro--gram files'
				592
				593	The pattern may be a string or an RE object; if you need to specify regular
				594	expression flags, you must use a RE object, or use embedded modifiers in a
				595	pattern; for example, ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.
				596
				597	The optional argument count is the maximum number of pattern occurrences to be
				598	replaced; count must be a non-negative integer. If omitted or zero, all
				599	occurrences will be replaced. Empty matches for the pattern are replaced only
				600	when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
				601	``'-a-b-c-'``.
				602
				603	In addition to character escapes and backreferences as described above,
				604	``\g<name>`` will use the substring matched by the group named ``name``, as
				605	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				606	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				607	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				608	reference to group 20, not a reference to group 2 followed by the literal
				609	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				610	substring matched by the RE.
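A short sketch of the ``\g<...>`` forms in a replacement string. The pattern, the group name ``key`` and the input are assumptions chosen for the example:

```python
import re

# Group 1 is named "key"; group 2 is an unnamed run of digits.
pattern = r"(?P<key>[a-z]+)=(\d+)"

# \g<key> refers to the named group; \g<2>0 is group 2 followed by a
# literal "0", which a bare \20 could not express.
scaled = re.sub(pattern, r"\g<key>=\g<2>0", "x=1 y=25")

# \g<0> stands for the entire matched substring.
bracketed = re.sub(pattern, r"[\g<0>]", "x=1 y=25")

print(scaled)      # x=10 y=250
print(bracketed)   # [x=1] [y=25]
```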
				611
				612
				613	.. function:: subn(pattern, repl, string[, count])
				614
				615	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				616	number_of_subs_made)``.
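As an illustration, collapsing runs of whitespace and counting how many runs were replaced might look like this (the input string is invented):

```python
import re

# Every run of whitespace counts as one substitution, including runs
# that are already a single space.
new_string, number_of_subs_made = re.subn(r"\s+", " ", "too   many    spaces here")

print(new_string)            # too many spaces here
print(number_of_subs_made)   # 3
```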
				617
				618
				619	.. function:: escape(string)
				620
				621	Return string with all non-alphanumerics backslashed; this is useful if you
				622	want to match an arbitrary literal string that may have regular expression
				623	metacharacters in it.
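A minimal sketch of why escaping matters. The literal below is an arbitrary example, and the exact set of characters that gets backslashed may differ between Python versions, so the escaped form itself is not shown:

```python
import re

literal = "price (USD): $4.99?"

# Used directly as a pattern, the parentheses, "$", "." and "?" are
# interpreted as metacharacters, so the text does not match itself.
unescaped = re.search(literal, literal)

# After escaping, the pattern matches the literal text exactly.
escaped = re.search(re.escape(literal), literal)

print(unescaped)         # None
print(escaped.group())   # price (USD): $4.99?
```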
				624
				625
				626	.. exception:: error
				627
				628	Exception raised when a string passed to one of the functions here is not a
				629	valid regular expression (for example, it might contain unmatched parentheses)
				630	or when some other error occurs during compilation or matching. It is never an
				631	error if a string contains no match for a pattern.
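One way to use this exception is to validate a pattern supplied by a user before applying it. ``is_valid_pattern`` is a hypothetical helper written for this sketch, not part of the module:

```python
import re

def is_valid_pattern(source):
    """Return True if source compiles as a regular expression."""
    try:
        re.compile(source)
    except re.error:
        return False
    return True

print(is_valid_pattern("(closed)"))    # True
print(is_valid_pattern("(unclosed"))   # False

# A pattern that simply fails to match raises nothing; the result is None.
print(re.search("z", "abc"))           # None
```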
				632
				633
				634	.. _re-objects:
				635
				636	Regular Expression Objects
				637	--------------------------
				638
				639	Compiled regular expression objects support the following methods and
				640	attributes:
				641
				642
				643	.. method:: RegexObject.match(string[, pos[, endpos]])
				644
				645	If zero or more characters at the beginning of string match this regular
				646	expression, return a corresponding :class:`MatchObject` instance. Return
				647	``None`` if the string does not match the pattern; note that this is different
				648	from a zero-length match.
				649
				650	.. note::
				651
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	652	If you want to locate a match anywhere in string, use :meth:`search`
				653	instead.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	654
				655	The optional second parameter pos gives an index in the string where the
				656	search is to start; it defaults to ``0``. This is not completely equivalent to
				657	slicing the string; the ``'^'`` pattern character matches at the real beginning
				658	of the string and at positions just after a newline, but not necessarily at the
				659	index where the search is to start.
				660
				661	The optional parameter endpos limits how far the string will be searched; it
				662	will be as if the string is endpos characters long, so only the characters
   from pos to ``endpos - 1`` will be searched for a match. If endpos is less
   than pos, no match will be found; otherwise, if rx is a compiled regular
   expression object, ``rx.match(string, 0, 50)`` is equivalent to
   ``rx.match(string[:50], 0)``. ::
				667
				668	>>> pattern = re.compile("o")
				669	>>> pattern.match("dog") # No match as "o" is not at the start of "dog."
				670	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
				671	<_sre.SRE_Match object at 0x827eb10>
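The difference between passing pos and slicing the string can be sketched with an anchored pattern. The strings here are toy values chosen only to make the difference visible:

```python
import re

rx = re.compile(r"^b")
text = "ab"

# Starting at index 1 does not move the beginning of the string,
# so '^' cannot match there.
from_pos = rx.search(text, 1)

# Slicing produces a new string that really does begin with 'b'.
from_slice = rx.search(text[1:])

# endpos behaves as if the string were cut off at that index, so '$'
# can match at index 3 even though more text follows.
word_at_end = re.compile(r"\w+$")
limited = word_at_end.search("abc def", 0, 3)

print(from_pos)             # None
print(from_slice.group())   # b
print(limited.group())      # abc
```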
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	672
				673
				674	.. method:: RegexObject.search(string[, pos[, endpos]])
				675
				676	Scan through string looking for a location where this regular expression
				677	produces a match, and return a corresponding :class:`MatchObject` instance.
				678	Return ``None`` if no position in the string matches the pattern; note that this
				679	is different from finding a zero-length match at some point in the string.
				680
				681	The optional pos and endpos parameters have the same meaning as for the
				682	:meth:`match` method.
				683
				684
				685	.. method:: RegexObject.split(string[, maxsplit=0])
				686
				687	Identical to the :func:`split` function, using the compiled pattern.
				688
				689
				690	.. method:: RegexObject.findall(string[, pos[, endpos]])
				691
				692	Identical to the :func:`findall` function, using the compiled pattern.
				693
				694
				695	.. method:: RegexObject.finditer(string[, pos[, endpos]])
				696
				697	Identical to the :func:`finditer` function, using the compiled pattern.
				698
				699
				700	.. method:: RegexObject.sub(repl, string[, count=0])
				701
				702	Identical to the :func:`sub` function, using the compiled pattern.
				703
				704
				705	.. method:: RegexObject.subn(repl, string[, count=0])
				706
				707	Identical to the :func:`subn` function, using the compiled pattern.
				708
				709
				710	.. attribute:: RegexObject.flags
				711
				712	The flags argument used when the RE object was compiled, or ``0`` if no flags
				713	were provided.
				714
				715
				716	.. attribute:: RegexObject.groupindex
				717
				718	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				719	numbers. The dictionary is empty if no symbolic groups were used in the
				720	pattern.
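For instance, with an invented date pattern that mixes named and unnamed groups, only the named groups appear in the mapping. The result is wrapped in ``dict()`` because some Python versions return a read-only mapping rather than a plain dictionary:

```python
import re

rx = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

# Only named groups are listed; the unnamed third group is absent.
index_by_name = dict(rx.groupindex)

# A pattern without symbolic groups yields an empty mapping.
no_names = dict(re.compile(r"(\d+)").groupindex)

print(index_by_name == {'year': 1, 'month': 2})   # True
print(no_names)                                   # {}
```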
				721
				722
				723	.. attribute:: RegexObject.pattern
				724
				725	The pattern string from which the RE object was compiled.
				726
				727
				728	.. _match-objects:
				729
				730	Match Objects
				731	-------------
				732
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	733	Match objects always have a boolean value of :const:`True`, so that you can test
				734	whether e.g. :func:`match` resulted in a match with a simple if statement. They
				735	support the following methods and attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	736
				737
				738	.. method:: MatchObject.expand(template)
				739
				740	Return the string obtained by doing backslash substitution on the template
				741	string template, as done by the :meth:`sub` method. Escapes such as ``\n`` are
				742	converted to the appropriate characters, and numeric backreferences (``\1``,
				743	``\2``) and named backreferences (``\g<1>``, ``\g<name>``) are replaced by the
				744	contents of the corresponding group.
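A brief sketch of template expansion. The group names ``first`` and ``last`` are assumptions made for the example:

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Named and numeric backreferences can be mixed in one template.
reordered = m.expand(r"\g<last>, \1")

# The two-character escape \n in the template becomes a real newline.
two_lines = m.expand(r"\1\n\2")

print(reordered)         # Newton, Isaac
print(repr(two_lines))   # 'Isaac\nNewton'
```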
				745
				746
				747	.. method:: MatchObject.group([group1, ...])
				748
				749	Returns one or more subgroups of the match. If there is a single argument, the
				750	result is a single string; if there are multiple arguments, the result is a
				751	tuple with one item per argument. Without arguments, group1 defaults to zero
				752	(the whole match is returned). If a groupN argument is zero, the corresponding
				753	return value is the entire matching string; if it is in the inclusive range
				754	[1..99], it is the string matching the corresponding parenthesized group. If a
				755	group number is negative or larger than the number of groups defined in the
				756	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				757	part of the pattern that did not match, the corresponding result is ``None``.
				758	If a group is contained in a part of the pattern that matched multiple times,
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	759	the last match is returned. ::
				760
      >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
      >>> m.group(0)       # The entire match
      'Isaac Newton'
      >>> m.group(1)       # The first parenthesized subgroup.
      'Isaac'
      >>> m.group(2)       # The second parenthesized subgroup.
      'Newton'
      >>> m.group(1, 2)    # Multiple arguments give us a tuple.
      ('Isaac', 'Newton')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	770
				771	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				772	arguments may also be strings identifying groups by their group name. If a
				773	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				774	exception is raised.
				775
				776	A moderately complicated example::
				777
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	778	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
				779	>>> m.group('first_name')
				780	'Malcom'
				781	>>> m.group('last_name')
				782	'Reynolds'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	783
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	784	Named groups can also be referred to by their index::
				785
				786	>>> m.group(1)
				787	'Malcom'
				788	>>> m.group(2)
				789	'Reynolds'
				790
   If a group matches multiple times, only the last match is accessible::

      >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
      >>> m.group(1)                        # Returns only the last match.
      'c3'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	795
				796
				797	.. method:: MatchObject.groups([default])
				798
				799	Return a tuple containing all the subgroups of the match, from 1 up to however
				800	many groups are in the pattern. The default argument is used for groups that
				801	did not participate in the match; it defaults to ``None``. (Incompatibility
				802	note: in the original Python 1.5 release, if the tuple was one element long, a
				803	string would be returned instead. In later versions (from 1.5.1 on), a
				804	singleton tuple is returned in such cases.)
				805
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	806	For example::
				807
				808	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				809	>>> m.groups()
				810	('24', '1632')
				811
				812	If we make the decimal place and everything after it optional, not all groups
				813	might participate in the match. These groups will default to ``None`` unless
				814	the default argument is given::
				815
      >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
      >>> m.groups()      # Second group defaults to None.
      ('24', None)
      >>> m.groups('0')   # Now, the second group defaults to '0'.
      ('24', '0')
				821
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	822
				823	.. method:: MatchObject.groupdict([default])
				824
				825	Return a dictionary containing all the named subgroups of the match, keyed by
				826	the subgroup name. The default argument is used for groups that did not
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	827	participate in the match; it defaults to ``None``. For example::
				828
				829	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
				830	>>> m.groupdict()
				831	{'first_name': 'Malcom', 'last_name': 'Reynolds'}
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	832
				833
				834	.. method:: MatchObject.start([group])
				835	MatchObject.end([group])
				836
				837	Return the indices of the start and end of the substring matched by group;
				838	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				839	group exists but did not contribute to the match. For a match object m, and
				840	a group g that did contribute to the match, the substring matched by group g
				841	(equivalent to ``m.group(g)``) is ::
				842
				843	m.string[m.start(g):m.end(g)]
				844
				845	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				846	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				847	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				848	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
				849
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	850	An example that will remove remove_this from email addresses::
				851
				852	>>> email = "tony@tiremove_thisger.net"
				853	>>> m = re.search("remove_this", email)
				854	>>> email[:m.start()] + email[m.end():]
				855	'tony@tiger.net'
				856
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	857
				858	.. method:: MatchObject.span([group])
				859
				860	For :class:`MatchObject` m, return the 2-tuple ``(m.start(group),
				861	m.end(group))``. Note that if group did not contribute to the match, this is
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	862	``(-1, -1)``. group defaults to zero, the entire match.
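As a sketch, the following uses an optional second group that takes no part in the match. The pattern and input are invented for illustration:

```python
import re

text = "version 24 final"
m = re.search(r"(\d+)(?:\.(\d+))?", text)

whole = m.span()     # span of the entire match
major = m.span(1)    # group 1 matched the same text here
minor = m.span(2)    # group 2 exists but did not contribute

print(whole, major, minor)          # (8, 10) (8, 10) (-1, -1)
print(text[whole[0]:whole[1]])      # 24
```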
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	863
				864
				865	.. attribute:: MatchObject.pos
				866
				867	The value of pos which was passed to the :func:`search` or :func:`match`
				868	method of the :class:`RegexObject`. This is the index into the string at which
				869	the RE engine started looking for a match.
				870
				871
				872	.. attribute:: MatchObject.endpos
				873
				874	The value of endpos which was passed to the :func:`search` or :func:`match`
				875	method of the :class:`RegexObject`. This is the index into the string beyond
				876	which the RE engine will not go.
				877
				878
				879	.. attribute:: MatchObject.lastindex
				880
				881	The integer index of the last matched capturing group, or ``None`` if no group
				882	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				883	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				884	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				885	string.
				886
				887
				888	.. attribute:: MatchObject.lastgroup
				889
				890	The name of the last matched capturing group, or ``None`` if the group didn't
				891	have a name, or if no group was matched at all.
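The ``lastindex`` cases listed above, together with ``lastgroup``, can be checked with a few lines. The group name ``word`` is an assumption for the example:

```python
import re

patterns = [r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)"]
last_indexes = [re.match(p, "ab").lastindex for p in patterns]

# No capturing group at all: both attributes are None.
no_groups = re.match(r"ab", "ab")

# The last matched group is named.
named = re.match(r"(?P<word>[a-z]+)", "ab")

# The last matched group (group 2) has no name.
unnamed_last = re.match(r"(?P<word>[a-z]+)(\d)", "ab1")

print(last_indexes)                                   # [1, 1, 1, 2]
print(no_groups.lastindex, no_groups.lastgroup)       # None None
print(named.lastgroup)                                # word
print(unnamed_last.lastindex, unnamed_last.lastgroup) # 2 None
```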
				892
				893
				894	.. attribute:: MatchObject.re
				895
				896	The regular expression object whose :meth:`match` or :meth:`search` method
				897	produced this :class:`MatchObject` instance.
				898
				899
				900	.. attribute:: MatchObject.string
				901
				902	The string passed to :func:`match` or :func:`search`.
				903
				904
				905	Examples
				906	--------
				907
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	908
				909	Checking For a Pair
				910	^^^^^^^^^^^^^^^^^^^
				911
				912	In this example, we'll use the following helper function to display match
				913	objects a little more gracefully::
				914
   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				919
				920	Suppose you are writing a poker program where a player's hand is represented as
				921	a 5-character string with each character representing a card, "a" for ace, "k"
				922	for king, "q" for queen, j for jack, "0" for 10, and "1" through "9"
				923	representing the card with that value.
				924
				925	To see if a given string is a valid hand, one could do the following::
				926
   >>> valid = re.compile(r"[0-9akqj]{5}$")
   >>> displaymatch(valid.match("ak05q"))  # Valid.
   "<Match: 'ak05q', groups=()>"
   >>> displaymatch(valid.match("ak05e"))  # Invalid.
   >>> displaymatch(valid.match("ak0"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"

That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
To match this with a regular expression, one could use backreferences as such::

   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"
				944
				945	To find out what card the pair consists of, one could use the :func:`group`
				946	method of :class:`MatchObject` in the following manner::
				947
				948	>>> pair.match("717ak").group(1)
				949	'7'
				950
   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'
				957
				958	>>> pair.match("354aa").group(1)
				959	'a'
				960
				961
				962	Simulating scanf()
				963	^^^^^^^^^^^^^^^^^^
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	964
				965	.. index:: single: scanf()
				966
				967	Python does not currently have an equivalent to :cfunc:`scanf`. Regular
				968	expressions are generally more powerful, though also more verbose, than
				969	:cfunc:`scanf` format strings. The table below offers some more-or-less
				970	equivalent mappings between :cfunc:`scanf` format tokens and regular
				971	expressions.
				972
+--------------------------------+---------------------------------------------+
| :cfunc:`scanf` Token           | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``0[0-7]*``                                 |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``0[xX][\dA-Fa-f]+``                        |
+--------------------------------+---------------------------------------------+
				994
				995	To extract the filename and numbers from a string like ::
				996
				997	/usr/sbin/sendmail - 0 errors, 4 warnings
				998
				999	you would use a :cfunc:`scanf` format like ::
				1000
				1001	%s - %d errors, %d warnings
				1002
				1003	The equivalent regular expression would be ::
				1004
				1005	(\S+) - (\d+) errors, (\d+) warnings
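Applied from Python, the expression above might be used as in the following sketch. The variable names are illustrative, and the conversion to ``int`` stands in for what :cfunc:`scanf` would do with ``%d``:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"

m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Groups always come back as strings; convert the numeric fields by hand.
filename = m.group(1)
n_errors = int(m.group(2))
n_warnings = int(m.group(3))

print(filename, n_errors, n_warnings)   # /usr/sbin/sendmail 0 4
```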
				1006
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1007
				1008	Avoiding recursion
				1009	^^^^^^^^^^^^^^^^^^
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1010
				1011	If you create regular expressions that require the engine to perform a lot of
				1012	recursion, you may encounter a :exc:`RuntimeError` exception with the message
				1013	``maximum recursion limit`` exceeded. For example, ::
				1014
   >>> import re
   >>> s = 'Begin ' + 1000*'a very long string ' + 'end'
   >>> re.match('Begin (\w| )*? end', s).end()
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
     File "/usr/local/lib/python2.5/re.py", line 132, in match
       return _compile(pattern, flags).match(string)
   RuntimeError: maximum recursion limit exceeded
				1023
				1024	You can often restructure your regular expression to avoid recursion.
				1025
				1026	Starting with Python 2.3, simple uses of the ``*?`` pattern are special-cased to
				1027	avoid recursion. Thus, the above regular expression can avoid recursion by
				1028	being recast as ``Begin [a-zA-Z0-9_ ]*?end``. As a further benefit, such
				1029	regular expressions will run faster than their recursive equivalents.
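The recast expression can be tried on the same input. This is only a sketch; whether the original alternation-based pattern actually hits the recursion limit depends on the Python version in use:

```python
import re

s = 'Begin ' + 1000 * 'a very long string ' + 'end'

# A character class replaces the (\w| ) alternation, so the lazy
# repetition no longer needs a group.
m = re.match(r'Begin [a-zA-Z0-9_ ]*?end', s)

# The match runs to the end of the string.
print(m.end() == len(s))   # True
```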
				1030
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1031
				1032	search() vs. match()
				1033	^^^^^^^^^^^^^^^^^^^^
				1034
In a nutshell, :func:`match` only attempts to match a pattern at the beginning
of a string, whereas :func:`search` will match a pattern anywhere in a string.
For example::
				1038
				1039	>>> re.match("o", "dog") # No match as "o" is not the first letter of "dog".
				1040	>>> re.search("o", "dog") # Match as search() looks everywhere in the string.
				1041	<_sre.SRE_Match object at 0x827e9f8>
				1042
				1043	.. note::
				1044
				1045	The following applies only to regular expression objects like those created
				1046	with ``re.compile("pattern")``, not the primitives
				1047	``re.match(pattern, string)`` or ``re.search(pattern, string)``.
				1048
				1049	:func:`match` has an optional second parameter that gives an index in the string
				1050	where the search is to start::
				1051
				1052	>>> pattern = re.compile("o")
				1053	>>> pattern.match("dog") # No match as "o" is not at the start of "dog."
				1054	# Equivalent to the above expression as 0 is the default starting index:
				1055	>>> pattern.match("dog", 0)
				1056	# Match as "o" is the 2nd character of "dog" (index 0 is the first):
				1057	>>> pattern.match("dog", 1)
				1058	<_sre.SRE_Match object at 0x827eb10>
				1059	>>> pattern.match("dog", 2) # No match as "o" is not the 3rd character of "dog."
				1060
				1061
				1062	Making a Phonebook
				1063	^^^^^^^^^^^^^^^^^^
				1064
				1065	:func:`split` splits a string into a list delimited by the passed pattern. The
				1066	method is invaluable for converting textual data into data structures that can be
				1067	easily read and modified by Python as demonstrated in the following example that
				1068	creates a phonebook.
				1069
First, here is the input. Normally it may come from a file; here we are using
triple-quoted string syntax::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1072
   >>> input = """Ross McFluff: 834.345.1254 155 Elm Street
   ...
   ... Ronald Heathmore: 892.345.3428 436 Finley Avenue
   ... Frank Burger: 925.541.7625 662 South Dogwood Way
   ...
   ...
   ... Heather Albrecht: 548.326.4584 919 Park Place"""
				1080
				1081	The entries are separated by one or more newlines. Now we convert the string
				1082	into a list with each nonempty line having its own entry::
				1083
   >>> entries = re.split("\n+", input)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']
				1090
Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it::
				1094
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1095	>>> [re.split(":? ", entry, 3) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1096	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1097	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1098	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1099	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1100
The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
house number from the street name::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1104
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1105	>>> [re.split(":? ", entry, 4) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1106	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1107	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1108	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1109	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
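Putting the two splitting steps together, the entries could be loaded into a dictionary. This is one possible sketch: keying the dictionary by ``(last, first)`` and the name ``phonebook`` are choices made for the example, not something the text above prescribes:

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # At most three splits, so the address keeps its internal spaces.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[(last, first)] = (phone, address)

print(phonebook[("Burger", "Frank")])
# ('925.541.7625', '662 South Dogwood Way')
```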
				1110
				1111
				1112	Text Munging
				1113	^^^^^^^^^^^^
				1114
				1115	:func:`sub` replaces every occurrence of a pattern with a string or the
				1116	result of a function. This example demonstrates using :func:`sub` with
				1117	a function to "munge" text, or randomize the order of all the characters
				1118	in each word of a sentence except for the first and last characters::
				1119
   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
				1129
				1130
				1131	Finding all Adverbs
				1132	^^^^^^^^^^^^^^^^^^^
				1133
:func:`findall` matches all occurrences of a pattern, not just the first
one as :func:`search` does. For example, if one were a writer and wanted to
find all of the adverbs in some text, he or she might use :func:`findall` in
the following manner::
				1138
				1139	>>> text = "He was carefully disguised but captured quickly by police."
				1140	>>> re.findall(r"\w+ly", text)
				1141	['carefully', 'quickly']
				1142
				1143
				1144	Finding all Adverbs and their Positions
				1145	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1146
				1147	If one wants more information about all matches of a pattern than the matched
				1148	text, :func:`finditer` is useful as it provides instances of
				1149	:class:`MatchObject` instead of strings. Continuing with the previous example,
if one were a writer who wanted to find all of the adverbs and their positions
in some text, he or she would use :func:`finditer` in the following manner::
				1152
				1153	>>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly
				1158
				1159
				1160	Raw String Notation
				1161	^^^^^^^^^^^^^^^^^^^
				1162
				1163	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1164	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1165	another one to escape it. For example, the two following lines of code are
				1166	functionally identical::
				1167
				1168	>>> re.match(r"\W(.)\1\W", " ff ")
				1169	<_sre.SRE_Match object at 0x8262760>
				1170	>>> re.match("\\W(.)\\1\\W", " ff ")
				1171	<_sre.SRE_Match object at 0x82627a0>
				1172
				1173	When one wants to match a literal backslash, it must be escaped in the regular
				1174	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1175	notation, one must use ``"\\\\"``, making the following lines of code
				1176	functionally identical::
				1177
				1178	>>> re.match(r"\\", r"\\")
				1179	<_sre.SRE_Match object at 0x827eb48>
				1180	>>> re.match("\\\\", r"\\")
				1181	<_sre.SRE_Match object at 0x827ec60>