Blame - Doc/library/re.rst - platform/external/python/cpython2

blob: c708a29bbe58310c50d64982e2d18292af482da7

Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1
				2	:mod:`re` --- Regular expression operations
				3	===========================================
				4
				5	.. module:: re
				6	:synopsis: Regular expression operations.
				7	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				8	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				9
				10
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	11	This module provides regular expression matching operations similar to
				12	those found in Perl. Both patterns and strings to be searched can be
Georg Brandl	382edff	2009-03-31 15:43:20 +0000	[diff] [blame]	13	Unicode strings as well as 8-bit strings.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	14
				15	Regular expressions use the backslash character (``'\'``) to indicate
				16	special forms or to allow special characters to be used without invoking
				17	their special meaning. This collides with Python's usage of the same
				18	character for the same purpose in string literals; for example, to match
				19	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				20	string, because the regular expression must be ``\\``, and each
				21	backslash must be expressed as ``\\`` inside a regular Python string
				22	literal.
				23
				24	The solution is to use Python's raw string notation for regular expression
				25	patterns; backslashes are not handled in any special way in a string literal
				26	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				27	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	28	newline. Usually patterns will be expressed in Python code using this raw
				29	string notation.
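
The difference between the two notations can be checked directly. The following is a minimal sketch; the variable names and the sample string ``C:\temp`` are illustrative only and do not come from the module itself.

```python
import re

# A raw literal keeps the backslash; a regular literal turns \n into a newline.
raw_len = len(r"\n")       # backslash followed by 'n'
cooked_len = len("\n")     # a single newline character

# Both literals spell the same two-character pattern: an escaped backslash.
same_pattern = (r"\\" == "\\\\")

# That pattern matches exactly one literal backslash in the searched string.
m = re.search(r"\\", "C:\\temp")
matched = m.group(0)
```

Writing the pattern as ``r"\\"`` rather than ``"\\\\"`` keeps the text in the source file identical to what the regular expression parser receives.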
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	30
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	31	It is important to note that most regular expression operations are available as
				32	module-level functions and :class:`RegexObject` methods. The functions are
				33	shortcuts that don't require you to compile a regex object first, but miss some
				34	fine-tuning parameters.
				35
Mariatta	c8e2021	2017-02-26 08:56:21 -0800	[diff] [blame]	36	.. seealso::
				37
Stéphane Wirtel	ad65d09	2018-05-16 16:57:36 +0200	[diff] [blame]	38	The third-party `regex <https://pypi.org/project/regex/>`_ module,
Mariatta	c8e2021	2017-02-26 08:56:21 -0800	[diff] [blame]	39	which has an API compatible with the standard library :mod:`re` module,
				40	but offers additional functionality and more thorough Unicode support.
				41
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	42
				43	.. _re-syntax:
				44
				45	Regular Expression Syntax
				46	-------------------------
				47
				48	A regular expression (or RE) specifies a set of strings that matches it; the
				49	functions in this module let you check if a particular string matches a given
				50	regular expression (or if a given regular expression matches a particular
				51	string, which comes down to the same thing).
				52
				53	Regular expressions can be concatenated to form new regular expressions; if A
				54	and B are both regular expressions, then AB is also a regular expression.
				55	In general, if a string p matches A and another string q matches B, the
				56	string pq will match AB. This holds unless A or B contain low-precedence
				57	operations, boundary conditions between A and B, or numbered group
				58	references. Thus, complex expressions can easily be constructed from simpler
				59	primitive expressions like the ones described here. For details of the theory
				60	and implementation of regular expressions, consult the Friedl book referenced
				61	below, or almost any textbook about compiler construction.
				62
				63	A brief explanation of the format of regular expressions follows. For further
Georg Brandl	1cf0522	2008-02-05 12:01:24 +0000	[diff] [blame]	64	information and a gentler presentation, consult the :ref:`regex-howto`.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	65
				66	Regular expressions can contain both special and ordinary characters. Most
				67	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
				68	expressions; they simply match themselves. You can concatenate ordinary
				69	characters, so ``last`` matches the string ``'last'``. (In the rest of this
				70	section, we'll write RE's in ``this special style``, usually without quotes, and
				71	strings to be matched ``'in single quotes'``.)
				72
				73	Some characters, like ``'|'`` or ``'('``, are special. Special
				74	characters either stand for classes of ordinary characters, or affect
				75	how the regular expressions around them are interpreted. Regular
				76	expression pattern strings may not contain null bytes, but can specify
				77	the null byte using the ``\number`` notation, e.g., ``'\x00'``.
				78
Martin Panter	197332a	2016-10-15 01:18:16 +0000	[diff] [blame]	79	Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc) cannot be
				80	directly nested. This avoids ambiguity with the non-greedy modifier suffix
				81	``?``, and with other modifiers in other implementations. To apply a second
				82	repetition to an inner repetition, parentheses may be used. For example,
				83	the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
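
A short sketch of this rule follows. The names ``nested_ok`` and ``sixes`` are illustrative, and the trailing ``$`` is added only so that the whole string has to be consumed.

```python
import re

# Stacking one repetition directly on another is rejected at compile time.
try:
    re.compile(r"a{6}*")
    nested_ok = True
except re.error:
    nested_ok = False

# Wrapping the inner repetition in a non-capturing group makes it legal.
sixes = re.compile(r"(?:a{6})*$")

twelve = sixes.match("a" * 12)   # two groups of six: matches
seven = sixes.match("a" * 7)     # not a multiple of six: no match
```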
				84
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	85
				86	The special characters are:
				87
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	88	``'.'``
				89	(Dot.) In the default mode, this matches any character except a newline. If
				90	the :const:`DOTALL` flag has been specified, this matches any character
				91	including a newline.
				92
				93	``'^'``
				94	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				95	matches immediately after each newline.
				96
				97	``'$'``
				98	Matches the end of the string or just before the newline at the end of the
				99	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				100	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				101	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
Amaury Forgeot d'Arc	d08a8eb	2008-01-10 21:59:42 +0000	[diff] [blame]	102	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
				103	a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
				104	the newline, and one at the end of the string.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	105
				106	``'*'``
				107	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				108	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				109	by any number of 'b's.
				110
				111	``'+'``
				112	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				113	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				114	match just 'a'.
				115
				116	``'?'``
				117	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				118	``ab?`` will match either 'a' or 'ab'.
				119
				120	``*?``, ``+?``, ``??``
				121	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				122	as much text as possible. Sometimes this behaviour isn't desired; if the RE
Georg Brandl	5892ab1	2016-04-12 07:51:41 +0200	[diff] [blame]	123	``<.*>`` is matched against ``<a> b <c>``, it will match the entire
				124	string, and not just ``<a>``. Adding ``?`` after the qualifier makes it
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	125	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
Georg Brandl	5892ab1	2016-04-12 07:51:41 +0200	[diff] [blame]	126	characters as possible will be matched. Using the RE ``<.*?>`` will match
				127	only ``<a>``.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	128
				129	``{m}``
				130	Specifies that exactly m copies of the previous RE should be matched; fewer
				131	matches cause the entire RE not to match. For example, ``a{6}`` will match
				132	exactly six ``'a'`` characters, but not five.
				133
				134	``{m,n}``
				135	Causes the resulting RE to match from m to n repetitions of the preceding
				136	RE, attempting to match as many repetitions as possible. For example,
				137	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				138	lower bound of zero, and omitting n specifies an infinite upper bound. As an
				139	example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
				140	followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
				141	modifier would be confused with the previously described form.
				142
				143	``{m,n}?``
				144	Causes the resulting RE to match from m to n repetitions of the preceding
				145	RE, attempting to match as few repetitions as possible. This is the
				146	non-greedy version of the previous qualifier. For example, on the
				147	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				148	while ``a{3,5}?`` will only match 3 characters.
				149
				150	``'\'``
				151	Either escapes special characters (permitting you to match characters like
				152	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				153	sequences are discussed below.
				154
				155	If you're not using a raw string to express the pattern, remember that Python
				156	also uses the backslash as an escape sequence in string literals; if the escape
				157	sequence isn't recognized by Python's parser, the backslash and subsequent
				158	character are included in the resulting string. However, if Python would
				159	recognize the resulting sequence, the backslash should be repeated twice. This
				160	is complicated and hard to understand, so it's highly recommended that you use
				161	raw strings for all but the simplest expressions.
				162
				163	``[]``
Ezio Melotti	a195873	2011-10-20 19:31:08 +0300	[diff] [blame]	164	Used to indicate a set of characters. In a set:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	165
Ezio Melotti	a195873	2011-10-20 19:31:08 +0300	[diff] [blame]	166	* Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
				167	``'m'``, or ``'k'``.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	168
Ezio Melotti	a195873	2011-10-20 19:31:08 +0300	[diff] [blame]	169	* Ranges of characters can be indicated by giving two characters and separating
				170	them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
				171	``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
				172	``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
				173	``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
				174	it will match a literal ``'-'``.
				175
				176	* Special characters lose their special meaning inside sets. For example,
				177	``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
				178	``'*'``, or ``')'``.
				179
				180	* Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
				181	inside a set, although the characters they match depend on whether
				182	:const:`LOCALE` or :const:`UNICODE` mode is in force.
				183
				184	* Characters that are not within a range can be matched by :dfn:`complementing`
				185	the set. If the first character of the set is ``'^'``, all the characters
				186	that are not in the set will be matched. For example, ``[^5]`` will match
				187	any character except ``'5'``, and ``[^^]`` will match any character except
				188	``'^'``. ``^`` has no special meaning if it's not the first character in
				189	the set.
				190
				191	* To match a literal ``']'`` inside a set, precede it with a backslash, or
				192	place it at the beginning of the set. For example, ``[()[\]{}]`` and
				193	``[]()[{}]`` will both match a parenthesis.
Mark Summerfield	700a635	2008-05-31 13:05:34 +0000	[diff] [blame]	194
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	195	``'|'``
				196	``A|B``, where A and B can be arbitrary REs, creates a regular expression that
				197	will match either A or B. An arbitrary number of REs can be separated by the
				198	``'|'`` in this way. This can be used inside groups (see below) as well. As
				199	the target string is scanned, REs separated by ``'|'`` are tried from left to
				200	right. When one pattern completely matches, that branch is accepted. This means
				201	that once ``A`` matches, ``B`` will not be tested further, even if it would
				202	produce a longer overall match. In other words, the ``'|'`` operator is never
				203	greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
				204	character class, as in ``[|]``.
				205
				206	``(...)``
				207	Matches whatever regular expression is inside the parentheses, and indicates the
				208	start and end of a group; the contents of a group can be retrieved after a match
				209	has been performed, and can be matched later in the string with the ``\number``
				210	special sequence, described below. To match the literals ``'('`` or ``')'``,
				211	use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
				212
				213	``(?...)``
				214	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				215	otherwise). The first character after the ``'?'`` determines what the meaning
				216	and further syntax of the construct is. Extensions usually do not create a new
				217	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				218	currently supported extensions.
				219
				220	``(?iLmsux)``
				221	(One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
				222	``'u'``, ``'x'``.) The group matches the empty string; the letters
				223	set the corresponding flags: :const:`re.I` (ignore case),
				224	:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
				225	:const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
				226	and :const:`re.X` (verbose), for the entire regular expression. (The
				227	flags are described in :ref:`contents-of-module-re`.) This
				228	is useful if you wish to include the flags as part of the regular
				229	expression, instead of passing a flag argument to the
Georg Brandl	74f8fc0	2009-07-26 13:36:39 +0000	[diff] [blame]	230	:func:`re.compile` function.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	231
				232	Note that the ``(?x)`` flag changes how the expression is parsed. It should be
				233	used first in the expression string, or after one or more whitespace characters.
				234	If there are non-whitespace characters before the flag, the results are
				235	undefined.
				236
				237	``(?:...)``
Georg Brandl	3b85b9b	2010-11-26 08:20:18 +0000	[diff] [blame]	238	A non-capturing version of regular parentheses. Matches whatever regular
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	239	expression is inside the parentheses, but the substring matched by the group
				240	cannot be retrieved after performing a match or referenced later in the
				241	pattern.
				242
				243	``(?P<name>...)``
				244	Similar to regular parentheses, but the substring matched by the group is
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	245	accessible via the symbolic group name *name*. Group names must be valid
				246	Python identifiers, and each group name must be defined only once within a
				247	regular expression. A symbolic group is also a numbered group, just as if
				248	the group were not named.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	249
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	250	Named groups can be referenced in three contexts. If the pattern is
				251	``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
				252	single or double quotes):
				253
				254	+---------------------------------------+----------------------------------+
				255	| Context of reference to group "quote" | Ways to reference it             |
				256	+=======================================+==================================+
				257	| in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
				258	|                                       | * ``\1``                         |
				259	+---------------------------------------+----------------------------------+
				260	| when processing match object ``m``    | * ``m.group('quote')``           |
				261	|                                       | * ``m.end('quote')`` (etc.)      |
				262	+---------------------------------------+----------------------------------+
				263	| in a string passed to the ``repl``    | * ``\g<quote>``                  |
				264	| argument of ``re.sub()``              | * ``\g<1>``                      |
				265	|                                       | * ``\1``                         |
				266	+---------------------------------------+----------------------------------+
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	267
				268	``(?P=name)``
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	269	A backreference to a named group; it matches whatever text was matched by the
				270	earlier group named *name*.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	271
				272	``(?#...)``
				273	A comment; the contents of the parentheses are simply ignored.
				274
				275	``(?=...)``
				276	Matches if ``...`` matches next, but doesn't consume any of the string. This is
				277	called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
				278	``'Isaac '`` only if it's followed by ``'Asimov'``.
				279
				280	``(?!...)``
				281	Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
				282	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				283	followed by ``'Asimov'``.
				284
				285	``(?<=...)``
				286	Matches if the current position in the string is preceded by a match for ``...``
				287	that ends at the current position. This is called a :dfn:`positive lookbehind
				288	assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
				289	lookbehind will back up 3 characters and check if the contained pattern matches.
				290	The contained pattern must only match strings of some fixed length, meaning that
Serhiy Storchaka	4809d1f	2015-02-21 12:08:36 +0200	[diff] [blame]	291	``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Group
				292	references are not supported even if they match strings of some fixed length.
				293	Note that
Ezio Melotti	1142773	2012-04-29 07:34:22 +0300	[diff] [blame]	294	patterns which start with positive lookbehind assertions will not match at the
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	295	beginning of the string being searched; you will most likely want to use the
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	296	:func:`search` function rather than the :func:`match` function:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	297
				298	>>> import re
				299	>>> m = re.search('(?<=abc)def', 'abcdef')
				300	>>> m.group(0)
				301	'def'
				302
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	303	This example looks for a word following a hyphen:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	304
				305	>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
				306	>>> m.group(0)
				307	'egg'
				308
				309	``(?<!...)``
				310	Matches if the current position in the string is not preceded by a match for
				311	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				312	positive lookbehind assertions, the contained pattern must only match strings of
Serhiy Storchaka	4809d1f	2015-02-21 12:08:36 +0200	[diff] [blame]	313	some fixed length and shouldn't contain group references.
				314	Patterns which start with negative lookbehind assertions may
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	315	match at the beginning of the string being searched.
				316
				317	``(?(id/name)yes-pattern|no-pattern)``
				318	Will try to match with ``yes-pattern`` if the group with given id or name
				319	exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
				320	can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
				321	matching pattern, which will match with ``'<user@host.com>'`` as well as
				322	``'user@host.com'``, but not with ``'<user@host.com'``.
				323
				324	.. versionadded:: 2.4
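
Several of the special characters described in this list can be exercised together. The sketch below reuses the patterns and sample strings quoted in the entries above; only the variable names and the sentence ``say "hi" now`` are made up for illustration.

```python
import re

# Greedy vs. non-greedy: '*' takes as much as it can, '*?' as little.
greedy = re.search(r"<.*>", "<a> b <c>").group(0)
lazy = re.search(r"<.*?>", "<a> b <c>").group(0)

# '$' matches before a trailing newline; MULTILINE adds every line end.
last_line = re.search(r"foo.$", "foo1\nfoo2\n").group(0)
first_line = re.search(r"foo.$", "foo1\nfoo2\n", re.MULTILINE).group(0)

# Alternation is tried left to right and stops at the first branch that works.
alternation = re.match(r"a|ab", "ab").group(0)

# A named group is reachable both by name and by number.
quoted = re.search(r"""(?P<quote>['"]).*?(?P=quote)""", 'say "hi" now')
by_name = quoted.group("quote")
by_number = quoted.group(1)

# Conditional: the closing '>' is required only if group 1 matched '<'.
email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)")
bracketed = email.match("<user@host.com>")
bare = email.match("user@host.com")
unclosed = email.match("<user@host.com")
```

Here ``greedy`` is the whole string while ``lazy`` is only ``'<a>'``, and ``alternation`` is ``'a'`` even though the second branch would have matched more text.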
				325
				326	The special sequences consist of ``'\'`` and a character from the list below.
				327	If the ordinary character is not on the list, then the resulting RE will match
				328	the second character. For example, ``\$`` matches the character ``'$'``.
				329
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	330	``\number``
				331	Matches the contents of the group of the same number. Groups are numbered
				332	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
Georg Brandl	980db0a	2013-10-06 12:58:20 +0200	[diff] [blame]	333	but not ``'thethe'`` (note the space after the group). This special sequence
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	334	can only be used to match one of the first 99 groups. If the first digit of
				335	*number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
				336	a group match, but as the character with octal value *number*. Inside the
				337	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				338	characters.
				339
				340	``\A``
				341	Matches only at the start of the string.
				342
				343	``\b``
				344	Matches the empty string, but only at the beginning or end of a word. A word is
				345	defined as a sequence of alphanumeric or underscore characters, so the end of a
				346	word is indicated by whitespace or a non-alphanumeric, non-underscore character.
Ezio Melotti	38ae5b2	2012-02-29 11:40:00 +0200	[diff] [blame]	347	Note that formally, ``\b`` is defined as the boundary between a ``\w`` and
				348	a ``\W`` character (or vice versa), or between ``\w`` and the beginning/end
				349	of the string, so the precise set of characters deemed to be alphanumeric
				350	depends on the values of the ``UNICODE`` and ``LOCALE`` flags.
				351	For example, ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
				352	``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	353	Inside a character range, ``\b`` represents the backspace character, for
				354	compatibility with Python's string literals.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	355
				356	``\B``
				357	Matches the empty string, but only when it is not at the beginning or end of a
Ezio Melotti	38ae5b2	2012-02-29 11:40:00 +0200	[diff] [blame]	358	word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, ``'py2'``,
				359	but not ``'py'``, ``'py.'``, or ``'py!'``.
				360	``\B`` is just the opposite of ``\b``, so is also subject to the settings
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	361	of ``LOCALE`` and ``UNICODE``.
				362
				363	``\d``
				364	When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
				365	is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
Mark Dickinson	fe67bd9	2009-07-28 20:35:03 +0000	[diff] [blame]	366	whatever is classified as a decimal digit in the Unicode character properties
				367	database.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	368
				369	``\D``
				370	When the :const:`UNICODE` flag is not specified, matches any non-digit
				371	character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
				372	will match anything other than characters marked as digits in the Unicode
				373	character properties database.
				374
				375	``\s``
Senthil Kumaran	dc0b324	2012-04-11 03:22:58 +0800	[diff] [blame]	376	When the :const:`UNICODE` flag is not specified, it matches any whitespace
				377	character; this is equivalent to the set ``[ \t\n\r\f\v]``. The
				378	:const:`LOCALE` flag has no extra effect on matching of the space.
				379	If :const:`UNICODE` is set, this will match the characters ``[ \t\n\r\f\v]``
				380	plus whatever is classified as space in the Unicode character properties
				381	database.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	382
				383	``\S``
Benjamin Peterson	72275ef	2014-11-25 14:54:45 -0600	[diff] [blame]	384	When the :const:`UNICODE` flag is not specified, matches any non-whitespace
Senthil Kumaran	dc0b324	2012-04-11 03:22:58 +0800	[diff] [blame]	385	character; this is equivalent to the set ``[^ \t\n\r\f\v]``. The
				386	:const:`LOCALE` flag has no extra effect on non-whitespace match. If
				387	:const:`UNICODE` is set, then any character not marked as space in the
				388	Unicode character properties database is matched.
				389
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	390
				391	``\w``
				392	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				393	any alphanumeric character and the underscore; this is equivalent to the set
				394	``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
				395	whatever characters are defined as alphanumeric for the current locale. If
				396	:const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
				397	is classified as alphanumeric in the Unicode character properties database.
				398
				399	``\W``
				400	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				401	any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
				402	With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
				403	not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
Zachary Ware	7ca2a90	2014-10-19 01:06:58 -0500	[diff] [blame]	404	this will match anything other than ``[0-9_]`` plus characters classified as
Senthil Kumaran	15b6f3f	2012-03-11 20:37:39 -0700	[diff] [blame]	405	not alphanumeric in the Unicode character properties database.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	406
				407	``\Z``
				408	Matches only at the end of the string.
				409
Senthil Kumaran	15b6f3f	2012-03-11 20:37:39 -0700	[diff] [blame]	410	If both the :const:`LOCALE` and :const:`UNICODE` flags are included for a
				411	particular sequence, the :const:`LOCALE` flag takes effect first, followed by
				412	the :const:`UNICODE` flag.
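
The backreference and word-boundary sequences can be tried on the sample strings used in their descriptions. This is a sketch with ASCII input only, so that neither :const:`LOCALE` nor :const:`UNICODE` affects the outcome; the variable names are illustrative.

```python
import re

# \number repeats whatever the numbered group matched.
doubled = re.compile(r"(.+) \1")

# \b anchors at the edge of a word, \B anywhere that is not an edge.
whole_word = re.compile(r"\bfoo\b")
inside_word = re.compile(r"py\B")

word_hits = [s for s in ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
             if whole_word.search(s)]
inner_hits = [s for s in ["python", "py3", "py2", "py", "py.", "py!"]
              if inside_word.search(s)]
```

``word_hits`` keeps the first four strings and ``inner_hits`` the first three, in line with the lists given under ``\b`` and ``\B``.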
				413
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	414	Most of the standard escapes supported by Python string literals are also
				415	accepted by the regular expression parser::
				416
				417	\a \b \f \n
				418	\r \t \v \x
				419	\\
				420
Ezio Melotti	48d886b	2012-04-29 04:46:34 +0300	[diff] [blame]	421	(Note that ``\b`` is used to represent word boundaries, and means "backspace"
				422	only inside character classes.)
				423
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	424	Octal escapes are included in a limited form: If the first digit is a 0, or if
				425	there are three octal digits, it is considered an octal escape. Otherwise, it is
				426	a group reference. As for string literals, octal escapes are always at most
				427	three digits in length.
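
The rule that separates octal escapes from group references can be sketched as follows; the three patterns are chosen for illustration and are not taken from the text above.

```python
import re

# Leading zero: \07 is the character with octal value 7 (BEL), not a group.
bell = re.match(r"\07", "\x07")

# Three octal digits: \101 is the character with octal value 101, i.e. 'A'.
letter_a = re.match(r"\101", "A")

# Anything else is a group reference: \1 repeats what group 1 matched.
repeat = re.match(r"(ab)\1", "abab")
```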
				428
Georg Brandl	ae4ca79	2014-10-28 21:41:51 +0100	[diff] [blame]	429	.. seealso::
				430
				431	Mastering Regular Expressions
				432	Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
				433	second edition of the book no longer covers Python at all, but the first
				434	edition covered writing good regular expression patterns in great detail.
				435
				436
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	437
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	438	.. _contents-of-module-re:
				439
				440	Module Contents
				441	---------------
				442
				443	The module defines several functions, constants, and an exception. Some of the
				444	functions are simplified versions of the full featured methods for compiled
				445	regular expressions. Most non-trivial applications use the compiled
				446	form.
				447
				448
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	449	.. function:: compile(pattern, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	450
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	451	Compile a regular expression pattern into a regular expression object, which
Ezio Melotti	33b810d	2014-06-20 00:47:11 +0300	[diff] [blame]	452	can be used for matching using its :func:`~RegexObject.match` and
				453	:func:`~RegexObject.search` methods, described below.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	454
				455	The expression's behaviour can be modified by specifying a flags value.
				456	Values can be any of the following variables, combined using bitwise OR (the
				457	``|`` operator).
				458
				459	The sequence ::
				460
Gregory P. Smith	0261e5d	2009-03-02 04:53:24 +0000	[diff] [blame]	461	prog = re.compile(pattern)
				462	result = prog.match(string)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	463
				464	is equivalent to ::
				465
Gregory P. Smith	0261e5d	2009-03-02 04:53:24 +0000	[diff] [blame]	466	result = re.match(pattern, string)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	467
Georg Brandl	74f8fc0	2009-07-26 13:36:39 +0000	[diff] [blame]	468	but using :func:`re.compile` and saving the resulting regular expression
				469	object for reuse is more efficient when the expression will be used several
				470	times in a single program.
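
As a concrete sketch of the equivalence, with a made-up pattern and input string, both forms return the same groups, and flags are combined with ``|`` when compiling:

```python
import re

pattern = r"(\d+)-(\d+)"
string = "10-20 and more"

# Compile once, then reuse the object for as many calls as needed.
prog = re.compile(pattern)
compiled_result = prog.match(string)

# One-shot form: the same match through the module-level function.
direct_result = re.match(pattern, string)

# Flags are combined with bitwise OR.
flagged = re.compile(r"^spam.eggs$", re.IGNORECASE | re.DOTALL)
flagged_result = flagged.match("SPAM\nEGGS")
```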
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	471
Gregory P. Smith	0261e5d	2009-03-02 04:53:24 +0000	[diff] [blame]	472	.. note::
				473
				474	The compiled versions of the most recent patterns passed to
				475	:func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
				476	programs that use only a few regular expressions at a time needn't worry
				477	about compiling regular expressions.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	478
				479
Sandro Tosi	e827c13	2012-01-01 12:52:24 +0100	[diff] [blame]	480	.. data:: DEBUG
				481
				482	Display debug information about the compiled expression.
				483
				484
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	485	.. data:: I
				486	IGNORECASE
				487
				488	Perform case-insensitive matching; expressions like ``[A-Z]`` will match
Brian Ward	9395ca4	2017-05-24 00:08:41 -0700	[diff] [blame]	489	lowercase letters, too. This is not affected by the current locale. To
				490	get this effect on non-ASCII Unicode characters such as ``ü`` and ``Ü``,
				491	add the :const:`UNICODE` flag.
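
A minimal sketch of the flag's effect, using ASCII-only sample text so that the result does not depend on :const:`UNICODE`:

```python
import re

text = "Spam and EGGS"

# Without the flag, [A-Z] only sees the upper-case letters.
strict = re.findall(r"[A-Z]+", text)

# With IGNORECASE, the same class also accepts lower-case letters.
relaxed = re.findall(r"[A-Z]+", text, re.IGNORECASE)
```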
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	492
				493
				494	.. data:: L
				495	LOCALE
				496
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	497	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
				498	current locale.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	499
				500
				501	.. data:: M
				502	MULTILINE
				503
				504	When specified, the pattern character ``'^'`` matches at the beginning of the
				505	string and at the beginning of each line (immediately following each newline);
				506	and the pattern character ``'$'`` matches at the end of the string and at the
				507	end of each line (immediately preceding each newline). By default, ``'^'``
				508	matches only at the beginning of the string, and ``'$'`` only at the end of the
				509	string and immediately before the newline (if any) at the end of the string.
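
The difference can be sketched with a two-line string (the text is a made-up example):

```python
import re

text = "first line\nsecond line\n"

# Default: '^' matches only at the very beginning of the string.
starts_default = re.findall(r"^\w+", text)
# MULTILINE: '^' also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# Default: '$' matches at the end, or just before a trailing newline.
ends_default = re.findall(r"\w+$", text)
# MULTILINE: '$' also matches just before every newline.
ends_multiline = re.findall(r"\w+$", text, re.MULTILINE)
```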
				510
				511
				512	.. data:: S
				513	DOTALL
				514
				515	Make the ``'.'`` special character match any character at all, including a
				516	newline; without this flag, ``'.'`` will match anything except a newline.
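
For instance, in this minimal sketch the dot has to consume a newline for the pattern to match:

```python
import re

text = "a\nb"

# Without DOTALL, '.' refuses to match the newline.
plain = re.match(r"a.b", text)

# With DOTALL, '.' matches the newline as well.
dotall = re.match(r"a.b", text, re.DOTALL)
```

``plain`` is ``None``, while ``dotall`` is a match object covering the whole string.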
				517
				518
				519	.. data:: U
				520	UNICODE
				521
Brian Ward	9395ca4	2017-05-24 00:08:41 -0700	[diff] [blame]	522	Make the ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
				523	sequences dependent on the Unicode character properties database. Also
				524	enables non-ASCII matching for :const:`IGNORECASE`.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	525
				526	.. versionadded:: 2.0
				527
				528
				529	.. data:: X
				530	VERBOSE
				531
Zachary Ware	77d61d4	2015-11-11 23:32:14 -0600	[diff] [blame]	532	This flag allows you to write regular expressions that look nicer and are
				533	more readable by allowing you to visually separate logical sections of the
				534	pattern and add comments. Whitespace within the pattern is ignored, except
Miss Islington (bot)	a2f1be0	2017-11-14 07:39:04 -0800	[diff] [blame]	535	when in a character class, or when preceded by an unescaped backslash,
				536	or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
Zachary Ware	77d61d4	2015-11-11 23:32:14 -0600	[diff] [blame]	537	When a line contains a ``#`` that is not in a character class and is not
				538	preceded by an unescaped backslash, all characters from the leftmost such
				539	``#`` through the end of the line are ignored.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	540
Zachary Ware	77d61d4	2015-11-11 23:32:14 -0600	[diff] [blame]	541	This means that the two following regular expression objects that match a
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	542	decimal number are functionally equal::
				543
				544	a = re.compile(r"""\d + # the integral part
				545	\. # the decimal point
				546	\d * # some fractional digits""", re.X)
				547	b = re.compile(r"\d+\.\d*")
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	548
				549
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	550	.. function:: search(pattern, string, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	551
Terry Jan Reedy	9f7f62f	2014-05-30 16:19:50 -0400	[diff] [blame]	552	Scan through string looking for the first location where the regular expression
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	553	pattern produces a match, and return a corresponding :class:`MatchObject`
				554	instance. Return ``None`` if no position in the string matches the pattern; note
				555	that this is different from finding a zero-length match at some point in the
				556	string.
				557
				558
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	559	.. function:: match(pattern, string, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	560
				561	If zero or more characters at the beginning of string match the regular
				562	expression pattern, return a corresponding :class:`MatchObject` instance.
				563	Return ``None`` if the string does not match the pattern; note that this is
				564	different from a zero-length match.
				565
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	566	Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
				567	at the beginning of the string and not at the beginning of each line.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	568
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	569	If you want to locate a match anywhere in string, use :func:`search`
				570	instead (see also :ref:`search-vs-match`).
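
The following sketch (with a made-up two-line string) contrasts the two functions, including the :const:`MULTILINE` case mentioned above:

```python
import re

text = "one\ntwo"

# match() only tries the pattern at index 0.
m1 = re.match("two", text)

# Even with MULTILINE, match() does not try the start of the second line.
m2 = re.match("^two", text, re.MULTILINE)

# search() scans forward, so it finds the second line.
m3 = re.search("^two", text, re.MULTILINE)
```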
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	571
				572
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	573	.. function:: split(pattern, string, maxsplit=0, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	574
				575	Split string by the occurrences of pattern. If capturing parentheses are
				576	used in pattern, then the text of all groups in the pattern are also returned
				577	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				578	splits occur, and the remainder of the string is returned as the final element
				579	of the list. (Incompatibility note: in the original Python 1.5 release,
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	580	maxsplit was ignored. This has been fixed in later releases.)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	581
				582	>>> re.split('\W+', 'Words, words, words.')
				583	['Words', 'words', 'words', '']
				584	>>> re.split('(\W+)', 'Words, words, words.')
				585	['Words', ', ', 'words', ', ', 'words', '.', '']
				586	>>> re.split('\W+', 'Words, words, words.', 1)
				587	['Words', 'words, words.']
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	588	>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
				589	['0', '3', '9']
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	590
Georg Brandl	70992c3	2008-03-06 07:19:15 +0000	[diff] [blame]	591	If there are capturing groups in the separator and it matches at the start of
				592	the string, the result will start with an empty string. The same holds for
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	593	the end of the string:
Georg Brandl	70992c3	2008-03-06 07:19:15 +0000	[diff] [blame]	594
				595	>>> re.split('(\W+)', '...words, words...')
				596	['', '...', 'words', ', ', 'words', '...', '']
				597
				598	That way, separator components are always found at the same relative
				599	indices within the result list (e.g., if there's one capturing group
				600	in the separator, the 0th, the 2nd and so forth).
				601
Skip Montanaro	222907d	2007-09-01 17:40:03 +0000	[diff] [blame]	602	Note that split will never split a string on an empty pattern match.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	603	For example:
Skip Montanaro	222907d	2007-09-01 17:40:03 +0000	[diff] [blame]	604
				605	>>> re.split('x*', 'foo')
				606	['foo']
				607	>>> re.split("(?m)^$", "foo\n\nbar\n")
				608	['foo\n\nbar\n']
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	609
Ezio Melotti	1e5d318	2010-11-26 09:30:44 +0000	[diff] [blame]	610	.. versionchanged:: 2.7
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	611	Added the optional flags argument.
				612
Georg Brandl	70992c3	2008-03-06 07:19:15 +0000	[diff] [blame]	613
Serhiy Storchaka	ca54740	2018-01-04 14:08:27 +0200	[diff] [blame]	614
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	615	.. function:: findall(pattern, string, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	616
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	617	Return all non-overlapping matches of pattern in string, as a list of
Georg Brandl	b46d6ff	2008-07-19 13:48:44 +0000	[diff] [blame]	618	strings. The string is scanned left-to-right, and matches are returned in
				619	the order found. If one or more groups are present in the pattern, return a
				620	list of groups; this will be a list of tuples if the pattern has more than
Serhiy Storchaka	ca54740	2018-01-04 14:08:27 +0200	[diff] [blame]	621	one group. Empty matches are included in the result.
				622
				623	.. note::
				624
				625	Due to a limitation of the current implementation, the character
				626	following an empty match is not included in the next match, so
				627	``findall(r'^|\w+', 'two words')`` returns ``['', 'wo', 'words']``
				628	(note the missing "t"). This changed in Python 3.7.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	629
				630	.. versionadded:: 1.5.2
				631
				632	.. versionchanged:: 2.4
				633	Added the optional flags argument.
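
How the shape of the result depends on the number of groups can be sketched as follows (the input text is hypothetical):

```python
import re

text = "width=10, height=20"

# No groups: a list of whole matches.
whole = re.findall(r"\w+=\d+", text)

# One group: a list of that group's strings.
values = re.findall(r"\w+=(\d+)", text)

# Several groups: a list of tuples, one tuple per match.
pairs = re.findall(r"(\w+)=(\d+)", text)
```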
				634
				635
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	636	.. function:: finditer(pattern, string, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	637
Georg Brandl	e7a0990	2007-10-21 12:10:28 +0000	[diff] [blame]	638	Return an :term:`iterator` yielding :class:`MatchObject` instances over all
Georg Brandl	b46d6ff	2008-07-19 13:48:44 +0000	[diff] [blame]	639	non-overlapping matches for the RE pattern in string. The string is
				640	scanned left-to-right, and matches are returned in the order found. Empty
Serhiy Storchaka	ca54740	2018-01-04 14:08:27 +0200	[diff] [blame]	641	matches are included in the result. See also the note about :func:`findall`.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	642
				643	.. versionadded:: 2.2
				644
				645	.. versionchanged:: 2.4
				646	Added the optional flags argument.
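
Because each item is a match object rather than a string, positions are available as well as text. A minimal sketch with a made-up input:

```python
import re

found = []
for m in re.finditer(r"\d+", "a1bb22ccc333"):
    # Record the matched text together with its (start, end) span.
    found.append((m.group(), m.span()))
```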
				647
				648
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	649	.. function:: sub(pattern, repl, string, count=0, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	650
				651	Return the string obtained by replacing the leftmost non-overlapping occurrences
				652	of pattern in string by the replacement repl. If the pattern isn't found,
				653	string is returned unchanged. repl can be a string or a function; if it is
				654	a string, any backslash escapes in it are processed. That is, ``\n`` is
Sandro Tosi	a7eb3c8	2011-08-19 22:54:33 +0200	[diff] [blame]	655	converted to a single newline character, ``\r`` is converted to a carriage return, and
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	656	so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
				657	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	658	For example:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	659
				660	>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
				661	... r'static PyObject*\npy_\1(void)\n{',
				662	... 'def myfunc():')
				663	'static PyObject*\npy_myfunc(void)\n{'
				664
				665	If repl is a function, it is called for every non-overlapping occurrence of
				666	pattern. The function takes a single match object argument, and returns the
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	667	replacement string. For example:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	668
				669	>>> def dashrepl(matchobj):
				670	...     if matchobj.group(0) == '-': return ' '
				671	...     else: return '-'
				672	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
				673	'pro--gram files'
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	674	>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
				675	'Baked Beans & Spam'
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	676
Georg Brandl	04fd324	2009-08-13 07:48:05 +0000	[diff] [blame]	677	The pattern may be a string or an RE object.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	678
				679	The optional argument count is the maximum number of pattern occurrences to be
				680	replaced; count must be a non-negative integer. If omitted or zero, all
				681	occurrences will be replaced. Empty matches for the pattern are replaced only
				682	when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
				683	``'-a-b-c-'``.
				684
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	685	In string-type repl arguments, in addition to the character escapes and
				686	backreferences described above,
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	687	``\g<name>`` will use the substring matched by the group named ``name``, as
				688	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				689	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				690	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				691	reference to group 20, not a reference to group 2 followed by the literal
				692	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				693	substring matched by the RE.
				694
Ezio Melotti	1e5d318	2010-11-26 09:30:44 +0000	[diff] [blame]	695	.. versionchanged:: 2.7
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	696	Added the optional flags argument.
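
The ``\g<...>`` forms described above can be sketched as follows. The date string and the group names are assumptions made for this example only:

```python
import re

# Hypothetical input; the group names are chosen for this sketch.
iso_date = "2009-03-02"
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Named references: reorder the captured fields.
european = date_re.sub(r"\g<day>/\g<month>/\g<year>", iso_date)

# \g<2>0 means "group 2, then a literal 0".
swapped = re.sub(r"(\w)(\w)", r"\g<2>0\g<1>", "ab")

# \g<0> stands for the entire matched substring.
bracketed = re.sub(r"\w+", r"<\g<0>>", "a bc")

# \20 is read as a reference to group 20, which this pattern lacks.
try:
    re.sub(r"(\w)(\w)", r"\20", "ab")
    group_20_accepted = True
except re.error:
    group_20_accepted = False
```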
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	697
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	698
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	699	.. function:: subn(pattern, repl, string, count=0, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	700
				701	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				702	number_of_subs_made)``.
				703
Ezio Melotti	1e5d318	2010-11-26 09:30:44 +0000	[diff] [blame]	704	.. versionchanged:: 2.7
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	705	Added the optional flags argument.
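
A short sketch of the returned tuple, using a made-up string with two runs of whitespace:

```python
import re

# Collapse each run of whitespace into a single space and count the runs.
new_string, number_of_subs_made = re.subn(r"\s+", " ", "a  b \t c")
```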
				706
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	707
Serhiy Storchaka	53ad684	2017-04-13 19:47:18 +0300	[diff] [blame]	708	.. function:: escape(pattern)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	709
Serhiy Storchaka	53ad684	2017-04-13 19:47:18 +0300	[diff] [blame]	710	Escape all the characters in pattern except ASCII letters and numbers.
				711	This is useful if you want to match an arbitrary literal string that may
				712	have regular expression metacharacters in it. For example::
				713
				714	>>> print re.escape('python.exe')
				715	python\.exe
				716
				717	>>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
				718	>>> print '[%s]+' % re.escape(legal_chars)
				719	[abcdefghijklmnopqrstuvwxyz0123456789\!\#\$\%\&\'\*\+\-\.\^\_\`\|\~\:]+
				720
				721	>>> operators = ['+', '-', '*', '/', '**']
				722	>>> print '|'.join(map(re.escape, sorted(operators, reverse=True)))
				723	\/|\-|\+|\*\*|\*
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	724
				725
R. David Murray	a63f9b6	2010-07-10 14:25:18 +0000	[diff] [blame]	726	.. function:: purge()
				727
				728	Clear the regular expression cache.
				729
				730
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	731	.. exception:: error
				732
				733	Exception raised when a string passed to one of the functions here is not a
				734	valid regular expression (for example, it might contain unmatched parentheses)
				735	or when some other error occurs during compilation or matching. It is never an
				736	error if a string contains no match for a pattern.
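
One way to use this exception is to validate patterns that come from outside the program. The helper name ``is_valid_pattern`` is hypothetical and not part of the module:

```python
import re

def is_valid_pattern(source):
    """Return True if source compiles as a regular expression."""
    try:
        re.compile(source)
    except re.error:
        return False
    return True

bad = is_valid_pattern("(unclosed")   # unmatched parenthesis
good = is_valid_pattern("(closed)")

# A pattern that simply finds nothing is not an error; the result is None.
no_match = re.search("x", "abc")
```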
				737
				738
				739	.. _re-objects:
				740
				741	Regular Expression Objects
				742	--------------------------
				743
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	744	.. class:: RegexObject
				745
				746	The :class:`RegexObject` class supports the following methods and attributes:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	747
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	748	.. method:: RegexObject.search(string[, pos[, endpos]])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	749
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	750	Scan through string looking for a location where this regular expression
				751	produces a match, and return a corresponding :class:`MatchObject` instance.
				752	Return ``None`` if no position in the string matches the pattern; note that this
				753	is different from finding a zero-length match at some point in the string.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	754
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	755	The optional second parameter pos gives an index in the string where the
				756	search is to start; it defaults to ``0``. This is not completely equivalent to
				757	slicing the string; the ``'^'`` pattern character matches at the real beginning
				758	of the string and at positions just after a newline, but not necessarily at the
				759	index where the search is to start.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	760
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	761	The optional parameter endpos limits how far the string will be searched; it
				762	will be as if the string is endpos characters long, so only the characters
				763	from pos to ``endpos - 1`` will be searched for a match. If endpos is less
				764	than pos, no match will be found; otherwise, if rx is a compiled regular
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	765	expression object, ``rx.search(string, 0, 50)`` is equivalent to
				766	``rx.search(string[:50], 0)``.
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	767
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	768	>>> pattern = re.compile("d")
				769	>>> pattern.search("dog") # Match at index 0
				770	<_sre.SRE_Match object at ...>
				771	>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	772
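
The difference between passing pos and slicing, and the effect of endpos on ``'$'``, can be sketched like this (patterns and strings are made up for the example):

```python
import re

anchored = re.compile("^b")

# With pos=1 the search starts at "b", but '^' still refers to the real
# beginning of the string, so nothing matches.
with_pos = anchored.search("ab", 1)

# Slicing creates a new string whose real beginning is "b".
with_slice = anchored.search("ab"[1:])

# endpos makes the string behave as if it were endpos characters long,
# so '$' can match at that position.
tail = re.compile("c$")
with_endpos = tail.search("abcd", 0, 3)
```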
				773
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	774	.. method:: RegexObject.match(string[, pos[, endpos]])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	775
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	776	If zero or more characters at the beginning of string match this regular
				777	expression, return a corresponding :class:`MatchObject` instance. Return
				778	``None`` if the string does not match the pattern; note that this is different
				779	from a zero-length match.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	780
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	781	The optional pos and endpos parameters have the same meaning as for the
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	782	:meth:`~RegexObject.search` method.
				783
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	784	>>> pattern = re.compile("o")
				785	>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
				786	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
				787	<_sre.SRE_Match object at ...>
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	788
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	789	If you want to locate a match anywhere in string, use
				790	:meth:`~RegexObject.search` instead (see also :ref:`search-vs-match`).
				791
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	792
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	793	.. method:: RegexObject.split(string, maxsplit=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	794
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	795	Identical to the :func:`split` function, using the compiled pattern.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	796
				797
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	798	.. method:: RegexObject.findall(string[, pos[, endpos]])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	799
Georg Brandl	f93ce0c	2010-05-22 08:17:23 +0000	[diff] [blame]	800	Similar to the :func:`findall` function, using the compiled pattern, but
				801	also accepts optional pos and endpos parameters that limit the search
				802	region in the same way as for :meth:`match`.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	803
				804
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	805	.. method:: RegexObject.finditer(string[, pos[, endpos]])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	806
Georg Brandl	f93ce0c	2010-05-22 08:17:23 +0000	[diff] [blame]	807	Similar to the :func:`finditer` function, using the compiled pattern, but
				808	also accepts optional pos and endpos parameters that limit the search
				809	region in the same way as for :meth:`match`.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	810
				811
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	812	.. method:: RegexObject.sub(repl, string, count=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	813
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	814	Identical to the :func:`sub` function, using the compiled pattern.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	815
				816
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	817	.. method:: RegexObject.subn(repl, string, count=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	818
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	819	Identical to the :func:`subn` function, using the compiled pattern.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	820
				821
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	822	.. attribute:: RegexObject.flags
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	823
Georg Brandl	94a1057	2012-03-17 17:31:32 +0100	[diff] [blame]	824	The regex matching flags. This is a combination of the flags given to
				825	:func:`.compile` and any ``(?...)`` inline flags in the pattern.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	826
				827
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	828	.. attribute:: RegexObject.groups
Georg Brandl	b46f0d7	2008-12-05 07:49:49 +0000	[diff] [blame]	829
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	830	The number of capturing groups in the pattern.
Georg Brandl	b46f0d7	2008-12-05 07:49:49 +0000	[diff] [blame]	831
				832
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	833	.. attribute:: RegexObject.groupindex
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	834
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	835	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				836	numbers. The dictionary is empty if no symbolic groups were used in the
				837	pattern.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	838
				839
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	840	.. attribute:: RegexObject.pattern
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	841
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	842	The pattern string from which the RE object was compiled.
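
The attributes above can be inspected on any compiled pattern. In this sketch the pattern is hypothetical, and the flag is tested with a bit mask rather than compared exactly, since other flag bits may also be set:

```python
import re

source = r"(?P<key>\w+)=(\d+)"
rx = re.compile(source, re.IGNORECASE)

group_count = rx.groups              # number of capturing groups
name_to_number = dict(rx.groupindex) # copy the mapping into a plain dict
original = rx.pattern                # the string the object was built from
ignores_case = bool(rx.flags & re.IGNORECASE)

# Inline flags in the pattern show up in .flags as well.
inline = re.compile("(?i)abc")
inline_ignores_case = bool(inline.flags & re.IGNORECASE)
```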
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	843
				844
				845	.. _match-objects:
				846
				847	Match Objects
				848	-------------
				849
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	850	.. class:: MatchObject
				851
Ezio Melotti	51c374d	2012-11-04 06:46:28 +0200	[diff] [blame]	852	Match objects always have a boolean value of ``True``.
				853	Since :meth:`~regex.match` and :meth:`~regex.search` return ``None``
				854	when there is no match, you can test whether there was a match with a simple
				855	``if`` statement::
				856
				857	match = re.search(pattern, string)
				858	if match:
				859	    process(match)
				860
				861	Match objects support the following methods and attributes:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	862
				863
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	864	.. method:: MatchObject.expand(template)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	865
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	866	Return the string obtained by doing backslash substitution on the template
				867	string template, as done by the :meth:`~RegexObject.sub` method. Escapes
				868	such as ``\n`` are converted to the appropriate characters, and numeric
				869	backreferences (``\1``, ``\2``) and named backreferences (``\g<1>``,
				870	``\g<name>``) are replaced by the contents of the corresponding group.
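
A minimal sketch of a template that mixes a named and a numeric backreference (the group names are chosen for this example):

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# \g<last> refers to the named group, \1 to the first group.
formatted = m.expand(r"\g<last>, \1")
```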
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	871
				872
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	873	.. method:: MatchObject.group([group1, ...])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	874
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	875	Returns one or more subgroups of the match. If there is a single argument, the
				876	result is a single string; if there are multiple arguments, the result is a
				877	tuple with one item per argument. Without arguments, group1 defaults to zero
				878	(the whole match is returned). If a groupN argument is zero, the corresponding
				879	return value is the entire matching string; if it is in the inclusive range
				880	[1..99], it is the string matching the corresponding parenthesized group. If a
				881	group number is negative or larger than the number of groups defined in the
				882	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				883	part of the pattern that did not match, the corresponding result is ``None``.
				884	If a group is contained in a part of the pattern that matched multiple times,
				885	the last match is returned.
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	886
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	887	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				888	>>> m.group(0) # The entire match
				889	'Isaac Newton'
				890	>>> m.group(1) # The first parenthesized subgroup.
				891	'Isaac'
				892	>>> m.group(2) # The second parenthesized subgroup.
				893	'Newton'
				894	>>> m.group(1, 2) # Multiple arguments give us a tuple.
				895	('Isaac', 'Newton')
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	896
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	897	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				898	arguments may also be strings identifying groups by their group name. If a
				899	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				900	exception is raised.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	901
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	902	A moderately complicated example:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	903
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	904	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				905	>>> m.group('first_name')
				906	'Malcolm'
				907	>>> m.group('last_name')
				908	'Reynolds'
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	909
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	910	Named groups can also be referred to by their index:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	911
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	912	>>> m.group(1)
				913	'Malcolm'
				914	>>> m.group(2)
				915	'Reynolds'
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	916
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	917	If a group matches multiple times, only the last match is accessible:
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	918
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	919	>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
				920	>>> m.group(1) # Returns only the last match.
				921	'c3'
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	922
				923
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	924	.. method:: MatchObject.groups([default])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	925
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	926	Return a tuple containing all the subgroups of the match, from 1 up to however
				927	many groups are in the pattern. The default argument is used for groups that
				928	did not participate in the match; it defaults to ``None``. (Incompatibility
				929	note: in the original Python 1.5 release, if the tuple was one element long, a
				930	string would be returned instead. In later versions (from 1.5.1 on), a
				931	singleton tuple is returned in such cases.)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	932
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	933	For example:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	934
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	935	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				936	>>> m.groups()
				937	('24', '1632')
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	938
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	939	If we make the decimal place and everything after it optional, not all groups
				940	might participate in the match. These groups will default to ``None`` unless
				941	the default argument is given:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	942
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	943	>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
				944	>>> m.groups() # Second group defaults to None.
				945	('24', None)
				946	>>> m.groups('0') # Now, the second group defaults to '0'.
				947	('24', '0')
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	948
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	949
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	950	.. method:: MatchObject.groupdict([default])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	951
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	952	Return a dictionary containing all the named subgroups of the match, keyed by
				953	the subgroup name. The default argument is used for groups that did not
				954	participate in the match; it defaults to ``None``. For example:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	955
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	956	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				957	>>> m.groupdict()
				958	{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
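
The effect of the default argument can be sketched with an optional named group (the group names here are hypothetical):

```python
import re

m = re.match(r"(?P<int>\d+)(?:\.(?P<frac>\d+))?", "24")

# The optional group did not participate, so it maps to None...
without_default = m.groupdict()
# ...or to the supplied default.
with_default = m.groupdict("0")
```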
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	959
				960
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	961	.. method:: MatchObject.start([group])
				962	MatchObject.end([group])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	963
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	964	Return the indices of the start and end of the substring matched by group;
				965	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				966	group exists but did not contribute to the match. For a match object m, and
				967	a group g that did contribute to the match, the substring matched by group g
				968	(equivalent to ``m.group(g)``) is ::
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	969
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	970	m.string[m.start(g):m.end(g)]
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	971
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	972	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				973	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				974	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				975	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	976
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	977	An example that will remove ``remove_this`` from email addresses:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	978
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	979	>>> email = "tony@tiremove_thisger.net"
				980	>>> m = re.search("remove_this", email)
				981	>>> email[:m.start()] + email[m.end():]
				982	'tony@tiger.net'
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	983
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	984
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	985	.. method:: MatchObject.span([group])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	986
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	987	For :class:`MatchObject` m, return the 2-tuple ``(m.start(group),
				988	m.end(group))``. Note that if group did not contribute to the match, this is
				989	``(-1, -1)``. group defaults to zero, the entire match.
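
As a small illustrative sketch (the pattern and variable names are chosen only for this example), the following shows how ``start()``, ``end()``, and ``span()`` behave for a group that took part in the match and for one that exists but did not:

```python
import re

# Two alternative groups: only one of them can participate in a match.
m = re.match(r"(a)|(b)", "a")

whole = m.span()      # span of group 0, the entire match
first = m.span(1)     # group 1 matched the "a"
second = m.span(2)    # group 2 exists but did not contribute

# For a group that did contribute, slicing m.string recovers its text.
recovered = m.string[m.start(1):m.end(1)]
```

Here ``second`` is ``(-1, -1)``, so checking for ``-1`` before slicing avoids silently picking up the wrong part of the string.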
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	990
				991
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	992	.. attribute:: MatchObject.pos
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	993
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	994	The value of pos which was passed to the :meth:`~RegexObject.search` or
				995	:meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
				996	index into the string at which the RE engine started looking for a match.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	997
				998
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	999	.. attribute:: MatchObject.endpos
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1000
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	1001	The value of endpos which was passed to the :meth:`~RegexObject.search` or
				1002	:meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
				1003	index into the string beyond which the RE engine will not go.
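
The two attributes above can be observed with a compiled pattern, whose ``search()`` method accepts the optional bounds. This is only a sketch; the sample string and the chosen bounds are assumptions made for illustration:

```python
import re

digits = re.compile(r"\d+")
text = "abc123def456"

# Search only the part of the string between index 6 and index 10.
m = digits.search(text, 6, 10)

found = m.group()             # the digits visible inside that window
window = (m.pos, m.endpos)    # the bounds that were passed to search()
```

Only the ``"4"`` at index 9 lies inside the window, so the match stops there even though more digits follow in the full string.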
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1004
				1005
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	1006	.. attribute:: MatchObject.lastindex
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1007
The integer index of the last matched capturing group, or ``None`` if no group
was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
the expression ``(a)(b)`` will have ``lastindex == 2`` if applied to the same
string.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1013
				1014
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	1015	.. attribute:: MatchObject.lastgroup
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1016
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	1017	The name of the last matched capturing group, or ``None`` if the group didn't
				1018	have a name, or if no group was matched at all.
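
A brief sketch of how ``lastindex`` and ``lastgroup`` relate to each other; the group names ``first`` and ``second`` are invented for this example:

```python
import re

# Both named groups match, so the last one matched is "second" (index 2).
m = re.match(r"(?P<first>a)(?P<second>b)", "ab")
named = (m.lastindex, m.lastgroup)

# The last matched group is unnamed, so lastgroup is None.
m = re.match(r"(a)(b)", "ab")
unnamed = (m.lastindex, m.lastgroup)

# No capturing group at all: both attributes are None.
m = re.match(r"ab", "ab")
no_groups = (m.lastindex, m.lastgroup)
```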
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1019
				1020
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	1021	.. attribute:: MatchObject.re
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1022
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	1023	The regular expression object whose :meth:`~RegexObject.match` or
				1024	:meth:`~RegexObject.search` method produced this :class:`MatchObject`
				1025	instance.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1026
				1027
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	1028	.. attribute:: MatchObject.string
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1029
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	1030	The string passed to :meth:`~RegexObject.match` or
				1031	:meth:`~RegexObject.search`.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1032
				1033
				1034	Examples
				1035	--------
				1036
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1037
				1038	Checking For a Pair
				1039	^^^^^^^^^^^^^^^^^^^
				1040
				1041	In this example, we'll use the following helper function to display match
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1042	objects a little more gracefully:
				1043
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1044	.. testcode::
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1045
def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				1050
				1051	Suppose you are writing a poker program where a player's hand is represented as
				1052	a 5-character string with each character representing a card, "a" for ace, "k"
Ezio Melotti	13c82d0	2011-12-17 01:17:17 +0200	[diff] [blame]	1053	for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1054	representing the card with that value.
				1055
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1056	To see if a given string is a valid hand, one could do the following:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1057
Ezio Melotti	13c82d0	2011-12-17 01:17:17 +0200	[diff] [blame]	1058	>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
				1059	>>> displaymatch(valid.match("akt5q")) # Valid.
				1060	"<Match: 'akt5q', groups=()>"
				1061	>>> displaymatch(valid.match("akt5e")) # Invalid.
				1062	>>> displaymatch(valid.match("akt")) # Invalid.
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1063	>>> displaymatch(valid.match("727ak")) # Valid.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1064	"<Match: '727ak', groups=()>"
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1065
				1066	That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1067	To match this with a regular expression, one could use backreferences as such:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1068
>>> pair = re.compile(r".*(.).*\1")
				1070	>>> displaymatch(pair.match("717ak")) # Pair of 7s.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1071	"<Match: '717', groups=('7',)>"
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1072	>>> displaymatch(pair.match("718ak")) # No pairs.
				1073	>>> displaymatch(pair.match("354aa")) # Pair of aces.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1074	"<Match: '354aa', groups=('a',)>"
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1075
Georg Brandl	74f8fc0	2009-07-26 13:36:39 +0000	[diff] [blame]	1076	To find out what card the pair consists of, one could use the
				1077	:meth:`~MatchObject.group` method of :class:`MatchObject` in the following
				1078	manner:
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1079
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1080	.. doctest::
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1081
				1082	>>> pair.match("717ak").group(1)
				1083	'7'
Georg Brandl	c62ef8b	2009-01-03 20:55:06 +0000	[diff] [blame]	1084
# Error because pair.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    pair.match("718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'
Georg Brandl	c62ef8b	2009-01-03 20:55:06 +0000	[diff] [blame]	1091
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1092	>>> pair.match("354aa").group(1)
				1093	'a'
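
One way to avoid the :exc:`AttributeError` is to test the result for ``None`` before calling ``group()``. The sketch below assumes a pattern that allows any cards before and between the two equal ones; ``pair_card`` is a hypothetical helper, not part of the module:

```python
import re

# Any cards, a captured card, any cards, then the captured card again.
pair = re.compile(r".*(.).*\1")

def pair_card(hand):
    """Return the card that forms a pair in hand, or None if there is none."""
    m = pair.match(hand)
    if m is None:
        return None
    return m.group(1)

results = [pair_card(hand) for hand in ("717ak", "718ak", "354aa")]
```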
				1094
				1095
				1096	Simulating scanf()
				1097	^^^^^^^^^^^^^^^^^^
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1098
				1099	.. index:: single: scanf()
				1100
Sandro Tosi	98ed08f	2012-01-14 16:42:02 +0100	[diff] [blame]	1101	Python does not currently have an equivalent to :c:func:`scanf`. Regular
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1102	expressions are generally more powerful, though also more verbose, than
Sandro Tosi	98ed08f	2012-01-14 16:42:02 +0100	[diff] [blame]	1103	:c:func:`scanf` format strings. The table below offers some more-or-less
				1104	equivalent mappings between :c:func:`scanf` format tokens and regular
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1105	expressions.
				1106
+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+
				1128
				1129	To extract the filename and numbers from a string like ::
				1130
				1131	/usr/sbin/sendmail - 0 errors, 4 warnings
				1132
Sandro Tosi	98ed08f	2012-01-14 16:42:02 +0100	[diff] [blame]	1133	you would use a :c:func:`scanf` format like ::
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1134
				1135	%s - %d errors, %d warnings
				1136
				1137	The equivalent regular expression would be ::
				1138
				1139	(\S+) - (\d+) errors, (\d+) warnings
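
Applied from Python, that expression might be used as in the following sketch. The variable names are assumptions; note that, unlike :c:func:`scanf` with ``%d``, the captured groups are strings and have to be converted explicitly:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Groups are always strings; convert the numeric fields by hand.
filename = m.group(1)
n_errors = int(m.group(2))
n_warnings = int(m.group(3))
```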
				1140
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1141
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1142	.. _search-vs-match:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1143
				1144	search() vs. match()
				1145	^^^^^^^^^^^^^^^^^^^^
				1146
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1147	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1148
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1149	Python offers two different primitive operations based on regular expressions:
				1150	:func:`re.match` checks for a match only at the beginning of the string, while
				1151	:func:`re.search` checks for a match anywhere in the string (this is what Perl
				1152	does by default).
				1153
				1154	For example::
				1155
Serhiy Storchaka	12d547a	2016-05-10 13:45:32 +0300	[diff] [blame]	1156	>>> re.match("c", "abcdef") # No match
				1157	>>> re.search("c", "abcdef") # Match
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1158	<_sre.SRE_Match object at ...>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1159
Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1162
Serhiy Storchaka	12d547a	2016-05-10 13:45:32 +0300	[diff] [blame]	1163	>>> re.match("c", "abcdef") # No match
				1164	>>> re.search("^c", "abcdef") # No match
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1165	>>> re.search("^a", "abcdef") # Match
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1166	<_sre.SRE_Match object at ...>
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1167
				1168	Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
				1169	beginning of the string, whereas using :func:`search` with a regular expression
				1170	beginning with ``'^'`` will match at the beginning of each line.
				1171
				1172	>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
				1173	>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
				1174	<_sre.SRE_Match object at ...>
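
If a search must stay anchored to the start of the string even in :const:`MULTILINE` mode, ``\A`` can be used in place of ``'^'``, since it matches only at the start of the string. A minimal sketch using the same sample text:

```python
import re

text = "A\nB\nX"

# '^' matches at the start of every line in MULTILINE mode ...
caret = re.search(r"^X", text, re.MULTILINE)

# ... whereas '\A' matches only at the very start of the string,
# so this search finds nothing, just as re.match() would.
anchored = re.search(r"\AX", text, re.MULTILINE)
```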
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1175
				1176
				1177	Making a Phonebook
				1178	^^^^^^^^^^^^^^^^^^
				1179
:func:`split` splits a string into a list delimited by the passed pattern. The
function is invaluable for converting textual data into data structures that can
be easily read and modified by Python, as demonstrated in the following example
that creates a phonebook.

First, here is the input. Normally it would come from a file; here we use
triple-quoted string syntax:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1187
Georg Brandl	5a607b0	2012-03-17 17:26:27 +0100	[diff] [blame]	1188	>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl	c62ef8b	2009-01-03 20:55:06 +0000	[diff] [blame]	1189	...
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1190	... Ronald Heathmore: 892.345.3428 436 Finley Avenue
				1191	... Frank Burger: 925.541.7625 662 South Dogwood Way
				1192	...
				1193	...
				1194	... Heather Albrecht: 548.326.4584 919 Park Place"""
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1195
				1196	The entries are separated by one or more newlines. Now we convert the string
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1197	into a list with each nonempty line having its own entry:
				1198
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1199	.. doctest::
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1200	:options: +NORMALIZE_WHITESPACE
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1201
Georg Brandl	5a607b0	2012-03-17 17:26:27 +0100	[diff] [blame]	1202	>>> entries = re.split("\n+", text)
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1203	>>> entries
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1204	['Ross McFluff: 834.345.1254 155 Elm Street',
				1205	'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
				1206	'Frank Burger: 925.541.7625 662 South Dogwood Way',
				1207	'Heather Albrecht: 548.326.4584 919 Park Place']
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1208
Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address itself contains spaces, which are our splitting pattern:
				1212
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1213	.. doctest::
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1214	:options: +NORMALIZE_WHITESPACE
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1215
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1216	>>> [re.split(":? ", entry, 3) for entry in entries]
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1217	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1218	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1219	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1220	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1221
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1222	The ``:?`` pattern matches the colon after the last name, so that it does not
Georg Brandl	907a720	2008-02-22 12:31:45 +0000	[diff] [blame]	1223	occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1224	house number from the street name:
				1225
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1226	.. doctest::
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1227	:options: +NORMALIZE_WHITESPACE
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1228
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1229	>>> [re.split(":? ", entry, 4) for entry in entries]
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1230	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1231	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1232	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1233	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
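
To go one step further and turn the split entries into a lookup structure, one possible sketch is shown below. Keying the dictionary by full name and the field names ``phone`` and ``address`` are choices made for this illustration only:

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[first + " " + last] = {"phone": phone, "address": address}
```

Looking up ``phonebook["Frank Burger"]`` then yields that person's phone number and address.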
				1234
				1235
				1236	Text Munging
				1237	^^^^^^^^^^^^
				1238
				1239	:func:`sub` replaces every occurrence of a pattern with a string or the
				1240	result of a function. This example demonstrates using :func:`sub` with
				1241	a function to "munge" text, or randomize the order of all the characters
				1242	in each word of a sentence except for the first and last characters::
				1243
>>> import random
>>> def repl(m):
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1248	>>> text = "Professor Abdolmalek, please report your absences promptly."
Georg Brandl	e0289a3	2010-08-01 21:44:38 +0000	[diff] [blame]	1249	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1250	'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
Georg Brandl	e0289a3	2010-08-01 21:44:38 +0000	[diff] [blame]	1251	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1252	'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
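
Because the shuffle is random, the exact output differs from run to run. What stays constant is that each word keeps its first and last character and its set of letters. The following sketch checks those properties; the names ``same_ends`` and ``same_letters`` are introduced only for this example:

```python
import random
import re

def repl(m):
    inner_word = list(m.group(2))
    random.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)

text = "Professor Abdolmalek, please report your absences promptly."
munged = re.sub(r"(\w)(\w+)(\w)", repl, text)

# Compare the words before and after munging.
before = re.findall(r"\w+", text)
after = re.findall(r"\w+", munged)
same_ends = all(a[0] == b[0] and a[-1] == b[-1] for a, b in zip(before, after))
same_letters = all(sorted(a) == sorted(b) for a, b in zip(before, after))
```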
				1253
				1254
				1255	Finding all Adverbs
				1256	^^^^^^^^^^^^^^^^^^^
				1257
Georg Brandl	907a720	2008-02-22 12:31:45 +0000	[diff] [blame]	1258	:func:`findall` matches all occurrences of a pattern, not just the first
Andrés Delfino	60c888d	2018-06-18 12:33:58 -0300	[diff] [blame]	1259	one as :func:`search` does. For example, if a writer wanted to
				1260	find all of the adverbs in some text, they might use :func:`findall` in
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1261	the following manner:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1262
				1263	>>> text = "He was carefully disguised but captured quickly by police."
				1264	>>> re.findall(r"\w+ly", text)
				1265	['carefully', 'quickly']
				1266
				1267
				1268	Finding all Adverbs and their Positions
				1269	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1270
				1271	If one wants more information about all matches of a pattern than the matched
				1272	text, :func:`finditer` is useful as it provides instances of
				1273	:class:`MatchObject` instead of strings. Continuing with the previous example,
Andrés Delfino	60c888d	2018-06-18 12:33:58 -0300	[diff] [blame]	1274	if a writer wanted to find all of the adverbs and their positions
				1275	in some text, they would use :func:`finditer` in the following manner:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1276
				1277	>>> text = "He was carefully disguised but captured quickly by police."
				1278	>>> for m in re.finditer(r"\w+ly", text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1280	07-16: carefully
				1281	40-47: quickly
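
When the positions are needed later rather than printed immediately, the match objects can be collected into a list instead. A short sketch, with ``positions`` as an illustrative name:

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Each match object supplies its start, its end, and the matched text.
positions = [(m.start(), m.end(), m.group(0))
             for m in re.finditer(r"\w+ly", text)]
```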
				1282
				1283
				1284	Raw String Notation
				1285	^^^^^^^^^^^^^^^^^^^
				1286
				1287	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1288	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1289	another one to escape it. For example, the two following lines of code are
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1290	functionally identical:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1291
				1292	>>> re.match(r"\W(.)\1\W", " ff ")
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1293	<_sre.SRE_Match object at ...>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1294	>>> re.match("\\W(.)\\1\\W", " ff ")
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1295	<_sre.SRE_Match object at ...>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1296
				1297	When one wants to match a literal backslash, it must be escaped in the regular
				1298	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1299	notation, one must use ``"\\\\"``, making the following lines of code
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1300	functionally identical:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1301
				1302	>>> re.match(r"\\", r"\\")
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1303	<_sre.SRE_Match object at ...>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1304	>>> re.match("\\\\", r"\\")
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1305	<_sre.SRE_Match object at ...>
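
The equivalence can also be checked directly on the string literals, without involving the regular expression engine until the last step. A minimal sketch; the variable names are illustrative:

```python
import re

# A raw literal keeps the backslash; a regular literal interprets it.
raw_len = len(r"\n")      # backslash followed by 'n'
plain_len = len("\n")     # a single newline character

# Both spellings produce the same two-character pattern string.
same_pattern = (r"\\" == "\\\\")

# That pattern matches exactly one literal backslash.
m = re.match(r"\\", "\\ab")
matched = m.group()
```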